arxiv: 2511.03182 · v2 · submitted 2025-11-05 · 💻 cs.SE · cs.LG

Understanding Robustness of Model Editing in Code LLMs

Pith reviewed 2026-05-18 01:43 UTC · model grok-4.3

classification 💻 cs.SE cs.LG
keywords: model editing · code LLMs · API migration · robustness · generalization · successive edits · execution evaluation

The pith

Model editing in code LLMs produces poor generalization to new API uses and degrades performance on unmodified tasks, with successive edits driving most models to near-zero success.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether lightweight model editing can update code LLMs to adopt new APIs as libraries evolve. It builds a benchmark of 2040 problems across 140 synthetic API changes drawn from HumanEval, MBPP, and APPS, then runs edited models inside an execution sandbox that enforces the new API rules and checks whether solutions truly use the updated calls or merely bypass them. Under single edits the models rarely apply the change to unseen code patterns, many passing solutions turn out to be workarounds, and accuracy on tasks that still use the original API drops. When the same models receive edits one after another, performance on both updated and original tasks collapses for nearly all method-model pairs.

Core claim

Under single edits, edited models generalize poorly to unseen uses of the modified API, and many apparent successes are workaround-based rather than true API migrations. Performance on tasks involving unmodified APIs also degrades, although memory-based methods and fine-tuning preserve specificity better than locate-then-edit methods. Under successive edits, most method-model combinations collapse to near-zero Pass@k on both generalization and specificity. A two-factor Shapley decomposition further shows that single-edit failures on generalization include a substantial compilation component, whereas specificity failures are more often post-compilation. Under successive edits, failures become predominantly compilation-driven.

What carries the argument

An execution sandbox that enforces edited APIs under standard Python semantics, together with execution-based metrics that separate genuine adoption of the new API from workaround solutions that complete the task without using the edit.
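
A minimal sketch of how such a check could separate the two outcomes, assuming a hypothetical edit that removes `old_total` and introduces `new_total`. The function names, the `solve` entry point, and the four outcome labels are illustrative assumptions, not the paper's implementation, and plain `exec` is not a security boundary.

```python
def make_sandbox_namespace():
    """Build a namespace that enforces the (hypothetical) edit and records its use."""
    calls = {"new_total": 0}

    def new_total(values):
        # Hypothetical post-edit API: count every invocation.
        calls["new_total"] += 1
        return sum(values)

    def old_total(values):
        # Hypothetical pre-edit API: the edit removed it, so any call fails.
        raise NameError("old_total was removed by the API edit")

    return {"new_total": new_total, "old_total": old_total}, calls


def classify(solution_src, test_input, expected):
    """Label one solution as compile_error, fail, workaround, or adoption."""
    try:
        code = compile(solution_src, "<solution>", "exec")
    except SyntaxError:
        return "compile_error"
    namespace, calls = make_sandbox_namespace()
    try:
        exec(code, namespace)  # NOTE: illustration only, not real isolation
        result = namespace["solve"](test_input)
    except Exception:
        return "fail"
    if result != expected:
        return "fail"
    # Passing tests without touching the edited API counts as a workaround.
    return "adoption" if calls["new_total"] > 0 else "workaround"


adopting = "def solve(xs):\n    return new_total(xs)\n"
bypassing = (
    "def solve(xs):\n"
    "    total = 0\n"
    "    for x in xs:\n"
    "        total += x\n"
    "    return total\n"
)
stale = "def solve(xs):\n    return old_total(xs)\n"
broken = "def solve(xs) return 1\n"

for name, src in [("adopting", adopting), ("bypassing", bypassing),
                  ("stale", stale), ("broken", broken)]:
    print(name, classify(src, [1, 2, 3], 6))
```

Here the `bypassing` solution passes its test but never calls the edited function, which is the kind of apparent success that the execution-based metrics are described as filtering out.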

If this is right

  • Single edits cannot be assumed to produce reliable API migration because many passing solutions avoid the new API entirely.
  • Performance on tasks that continue to use the original API declines after an edit, limiting safe use of edited models in mixed codebases.
  • Successive edits trigger broad interference that destroys capability on both edited and unedited APIs for most current methods.
  • Memory-based and fine-tuning approaches maintain higher specificity than locate-then-edit methods after a single change.
  • Generalization failures contain a large compilation component while specificity failures tend to occur after successful compilation.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Real deployment of edited code models would require additional runtime checks or test suites to detect hidden workarounds and unintended side effects on legacy code.
  • Editing pipelines may need explicit mechanisms to track interactions between multiple changes if they are to remain viable as libraries evolve over time.
  • The observed compilation-driven versus post-compilation failure split points to different intervention points: syntax-level regularization for generalization and semantic consistency checks for specificity.

Load-bearing premise

The synthetic API modifications and the execution-based metrics in the sandbox correctly distinguish genuine API adoption from workaround solutions that would not be possible or detectable in real-world usage of the edited models.

What would settle it

Measuring whether edited models emit code that actually invokes the new API function on fresh test cases that require the updated signature in ways never shown during editing, rather than completing the task through alternative code that avoids the edited symbol.
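
One way to approximate this measurement is a static check on the generated code. The sketch below uses Python's `ast` module to test whether a solution contains a call to a given symbol; `new_total`, `lib`, and the sample solutions are hypothetical names chosen for illustration.

```python
import ast


def called_names(source):
    """Collect the names of functions called in the source (plain and dotted calls)."""
    names = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Call):
            func = node.func
            if isinstance(func, ast.Name):
                names.add(func.id)      # e.g. new_total(xs)
            elif isinstance(func, ast.Attribute):
                names.add(func.attr)    # e.g. lib.new_total(xs)
    return names


def invokes_api(source, api_name):
    """True if the source contains at least one call to api_name."""
    return api_name in called_names(source)


uses_edit = "import lib\n\ndef solve(xs):\n    return lib.new_total(xs)\n"
avoids_edit = "def solve(xs):\n    return sum(xs)\n"

print(invokes_api(uses_edit, "new_total"))
print(invokes_api(avoids_edit, "new_total"))
```

A static check of this kind misses aliased calls (for example `f = lib.new_total; f(xs)`) and also counts calls in unreachable code, so it would need to be paired with runtime call tracking to support a firm conclusion.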

Figures

Figures reproduced from arXiv: 2511.03182 by A.B Siddique, Moghis Fereidouni, Umar Farooq, Vinaik Chhetri.

Figure 1: Performance degradation across sequential edits. Each subplot shows how the …
Figure 2: Taxonomy of outcomes when editing code LMs for API evolution.

Original abstract

Large language models (LLMs) for code are increasingly used in software development, but they remain static after pretraining while APIs and software libraries continue to evolve. Model editing offers a lightweight alternative to retraining for incorporating API updates, yet it remains unclear whether existing editing methods can induce correct API migration, generalize that behavior to unseen tasks, and preserve performance on tasks involving unmodified APIs. We present a controlled benchmark for evaluating model editing under API updates in code LLMs, built from HumanEval, MBPP, and APPS, with 2,040 problems spanning 140 unique synthetic API modifications, together with an execution sandbox that enforces edited APIs under standard Python semantics. We evaluate several state-of-the-art editing methods on three code LLMs under both single-edit and successive-edit regimes using execution-based metrics that distinguish successful API adoption from workaround-based task completion. Under single edits, edited models generalize poorly to unseen uses of the modified API, and many apparent successes are workaround-based rather than true API migrations. Performance on tasks involving unmodified APIs also degrades, although memory-based methods and fine-tuning preserve specificity better than locate-then-edit methods. Under successive edits, most method-model combinations collapse to near-zero Pass@k on both generalization and specificity, revealing substantial interference beyond the target edits. A two-factor Shapley decomposition further shows that single-edit failures on generalization include a substantial compilation component, whereas specificity failures are more often post-compilation. Under successive edits, failures become predominantly compilation-driven.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript introduces a controlled benchmark for evaluating model editing in code LLMs under API updates, constructed from 2,040 problems spanning HumanEval, MBPP, and APPS with 140 synthetic API modifications and an execution sandbox enforcing edited APIs under Python semantics. It evaluates state-of-the-art editing methods on three code LLMs in single-edit and successive-edit regimes using execution-based Pass@k metrics that distinguish true API adoption from workarounds. Key claims include poor generalization to unseen uses of modified APIs, prevalence of workaround-based successes, degradation on unmodified APIs (with memory-based methods faring better), and near-total collapse under successive edits; a Shapley decomposition attributes single-edit generalization failures partly to compilation issues and specificity failures to post-compilation errors, with successive-edit failures becoming predominantly compilation-driven.
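
For readers unfamiliar with the metric, the sketch below shows the widely used unbiased pass@k estimator, 1 - C(n-c, k) / C(n, k), for n sampled solutions of which c pass. This page does not state how the paper computes Pass@k, so treating it as this estimator is an assumption, and the sample counts are made up.

```python
from math import comb


def pass_at_k(n, c, k):
    """Estimate the probability that at least one of k samples passes,
    given n generated samples of which c passed."""
    if not 0 <= c <= n:
        raise ValueError("c must lie between 0 and n")
    if not 1 <= k <= n:
        raise ValueError("k must lie between 1 and n")
    # comb(n - c, k) is 0 when fewer than k samples failed, giving 1.0.
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results, k):
    """Average pass@k over problems; results is a list of (n, c) pairs."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


# Hypothetical per-problem counts: (samples generated, samples passing).
results = [(10, 3), (10, 0), (10, 10)]
print(round(mean_pass_at_k(results, 1), 4))
print(round(mean_pass_at_k(results, 5), 4))
```

Under the adoption-versus-workaround distinction described in the summary, `c` would count only passing samples that also use the edited API, which is why a method can look much weaker on this metric than on plain test passing.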

Significance. If the results hold, the work is significant for providing empirical evidence that current model editing techniques are inadequate for robust API migration in code LLMs, revealing issues of poor generalization, workaround reliance, specificity loss, and edit interference. The benchmark design with execution metrics and post-hoc Shapley decomposition offers a reproducible framework that could steer development of more reliable editing approaches for maintaining LLMs amid evolving libraries.

major comments (3)
  1. [§3] §3 (Benchmark and Sandbox): The central claims rest on the assumption that the 140 synthetic API modifications and sandbox execution correctly separate genuine migrations from workarounds. The manuscript must provide explicit justification or ablation showing how these modifications replicate real API changes (e.g., signature shifts, behavioral semantics, import side effects) rather than allowing artificial workarounds detectable only in the sandbox; without this, the Pass@k distinctions for generalization may not proxy real-world API updates.
  2. [§5] §5 (Single-Edit Experiments): The claim that edited models degrade on tasks involving unmodified APIs is load-bearing for the specificity argument, yet the manuscript should report per-method degradation magnitudes with statistical tests and confirm that the observed differences between memory-based and locate-then-edit methods are not confounded by edit magnitude or hyperparameter choices.
  3. [§6] §6 (Successive-Edit Regime): The reported collapse to near-zero Pass@k on both generalization and specificity under successive edits is a strong negative result, but the paper needs to detail the edit ordering, cumulative interference measurement, and whether failures stem from overwriting prior edits versus other mechanisms, as this directly supports the interference conclusion.
minor comments (2)
  1. [Abstract] The abstract mentions evaluation on 'three code LLMs' without naming them; list the specific models in the abstract and early introduction for immediate clarity.
  2. [Results] Ensure figures or tables presenting Pass@k results include variance estimates or multiple-run statistics to support the reported trends.
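
Major comment 2 asks for statistical tests on per-method degradation. A paired Wilcoxon signed-rank test is one option; the sketch below implements an exact two-sided version in plain Python for small samples. The scores are hypothetical, and the enumeration over sign patterns is only practical for roughly 20 pairs or fewer.

```python
from itertools import product


def average_ranks(values):
    """Rank values from 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # positions i..j map to ranks i+1..j+1
        for pos in range(i, j + 1):
            ranks[order[pos]] = avg
        i = j + 1
    return ranks


def wilcoxon_signed_rank(before, after):
    """Return (W+, exact two-sided p-value) for paired samples; zero differences are dropped."""
    diffs = [b - a for b, a in zip(before, after) if b != a]
    ranks = average_ranks([abs(d) for d in diffs])
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    at_most = at_least = 0
    # Under the null hypothesis every sign pattern is equally likely.
    for signs in product((0, 1), repeat=len(ranks)):
        w = sum(r for r, s in zip(ranks, signs) if s)
        at_most += w <= w_plus
        at_least += w >= w_plus
    p_value = min(1.0, 2 * min(at_most, at_least) / 2 ** len(ranks))
    return w_plus, p_value


# Hypothetical pass rates (in percent) on unmodified-API tasks, per problem group.
before_edit = [62, 55, 71, 48, 66, 59]
after_edit = [50, 41, 60, 47, 52, 40]
print(wilcoxon_signed_rank(before_edit, after_edit))
```

With only six pairs, the smallest achievable two-sided p-value is 2/64, which illustrates why the number of paired observations matters when reporting significance for small method-model grids.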

Simulated Authors' Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback on our manuscript. The comments highlight important areas for strengthening the justification of our benchmark, the statistical rigor of our specificity analysis, and the mechanistic details of interference under successive edits. We address each major comment below and commit to revisions that enhance the paper without altering its core findings.

Point-by-point responses
  1. Referee: [§3] §3 (Benchmark and Sandbox): The central claims rest on the assumption that the 140 synthetic API modifications and sandbox execution correctly separate genuine migrations from workarounds. The manuscript must provide explicit justification or ablation showing how these modifications replicate real API changes (e.g., signature shifts, behavioral semantics, import side effects) rather than allowing artificial workarounds detectable only in the sandbox; without this, the Pass@k distinctions for generalization may not proxy real-world API updates.

    Authors: We agree that a clearer justification of the synthetic modifications is required to support the benchmark's ecological validity. In the revised manuscript we will add a new subsection in §3 that (i) categorizes the 140 modifications according to real-world API evolution patterns (signature changes, semantic shifts, import side-effects), (ii) provides explicit mappings to historical changes in libraries such as NumPy, pandas and requests, and (iii) reports an ablation that removes each modification category in turn and measures the resulting change in generalization Pass@k and workaround rates. These additions will demonstrate that the observed distinctions between true migration and workarounds are not artifacts of the sandbox alone. revision: yes

  2. Referee: [§5] §5 (Single-Edit Experiments): The claim that edited models degrade on tasks involving unmodified APIs is load-bearing for the specificity argument, yet the manuscript should report per-method degradation magnitudes with statistical tests and confirm that the observed differences between memory-based and locate-then-edit methods are not confounded by edit magnitude or hyperparameter choices.

    Authors: We accept the need for quantitative reporting and controls. The revision will include a new table in §5 that lists, for each method, the mean degradation on unmodified-API tasks together with standard deviations and p-values from paired Wilcoxon signed-rank tests. We will also add a paragraph and appendix sensitivity analysis showing that (a) edit magnitudes (measured by L2 norm of parameter updates) were matched across methods via a common hyperparameter search on a validation split, and (b) the relative advantage of memory-based methods persists across a grid of learning rates and edit strengths. These changes will be incorporated without modifying the original conclusions. revision: yes

  3. Referee: [§6] §6 (Successive-Edit Regime): The reported collapse to near-zero Pass@k on both generalization and specificity under successive edits is a strong negative result, but the paper needs to detail the edit ordering, cumulative interference measurement, and whether failures stem from overwriting prior edits versus other mechanisms, as this directly supports the interference conclusion.

    Authors: We welcome the request for greater transparency on the successive-edit protocol. In the revised §6 we will specify that edit order was randomized per experimental run but fixed by seed for reproducibility; introduce a cumulative interference metric (average performance drop on previously edited APIs after each new edit); and provide a failure-mode breakdown derived from execution logs indicating that overwriting of prior edits accounts for the majority of the observed collapse, with the remainder attributable to rising compilation errors. A supplementary figure will illustrate the progressive degradation trajectory. These details will be added while preserving the reported near-zero Pass@k outcome. revision: yes
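
The protocol described in response 3 can be sketched as follows, under stated assumptions: edit identifiers, the score layout, and the choice of each edit's own post-application score as its reference point are all hypothetical, since the simulated rebuttal only describes the metric as the average performance drop on previously edited APIs after each new edit.

```python
import random


def edit_order(edit_ids, seed):
    """Return a shuffled copy of edit_ids that is reproducible for a given seed."""
    rng = random.Random(seed)
    order = list(edit_ids)
    rng.shuffle(order)
    return order


def cumulative_interference(order, scores):
    """Average drop on previously applied edits after each new edit.

    order[t] is the edit applied at step t. scores[t][e] is the score on
    edit e's tasks measured after step t. The reference score for an edit
    is the one measured right after that edit was applied.
    """
    reference = {}
    drops = []
    for step, edit in enumerate(order):
        previous = order[:step]
        if previous:
            total = sum(reference[e] - scores[step][e] for e in previous)
            drops.append(total / len(previous))
        reference[edit] = scores[step][edit]
    return drops


# Same seed, same order: the property needed for reproducible runs.
print(edit_order(["e1", "e2", "e3", "e4"], seed=0) ==
      edit_order(["e1", "e2", "e3", "e4"], seed=0))

# Hypothetical scores for a fixed order of three edits.
order = ["e1", "e2", "e3"]
scores = [
    {"e1": 0.8},
    {"e1": 0.6, "e2": 0.7},
    {"e1": 0.2, "e2": 0.3, "e3": 0.5},
]
print([round(d, 3) for d in cumulative_interference(order, scores)])
```

A rising sequence of drops would indicate that each new edit erodes earlier ones, although this metric alone cannot separate overwriting from other failure mechanisms.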

Circularity Check

0 steps flagged

No circularity: empirical benchmark results with post-hoc attribution

The paper constructs a new benchmark from existing datasets (HumanEval, MBPP, APPS) with synthetic API modifications and measures editing performance via execution-based Pass@k metrics in a sandbox. These are direct empirical observations, not derivations. The two-factor Shapley decomposition is applied after the fact to decompose already-computed pass rates into compilation vs. post-compilation components and does not define or presuppose the success metric. No self-definitional equations, fitted parameters renamed as predictions, or load-bearing self-citations appear in the central claims. The evaluation chain is self-contained against external benchmarks and does not reduce to its own inputs by construction.
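
To make the post-hoc attribution concrete, here is a sketch of a two-factor Shapley split. This page does not give the paper's value function, so the sketch assumes one: the pass rate factorizes as (compile rate) × (pass rate given compilation), and each factor is switched from its base-model value to its edited-model value. All numbers are hypothetical.

```python
def two_factor_shapley(v_none, v_a, v_b, v_both):
    """Shapley values for a two-player game.

    v_none, v_a, v_b, v_both are the values of the coalitions
    {}, {A}, {B}, and {A, B}. Each player's value is its marginal
    contribution averaged over the two possible join orders.
    """
    phi_a = 0.5 * ((v_a - v_none) + (v_both - v_b))
    phi_b = 0.5 * ((v_b - v_none) + (v_both - v_a))
    return phi_a, phi_b


def decompose_pass_rate_change(p_base, q_base, p_edit, q_edit):
    """Split the change in pass rate into compilation and post-compilation parts.

    p_* is the fraction of samples that compile; q_* is the fraction of
    compiling samples that pass the tests (an assumed factorization).
    """
    return two_factor_shapley(
        v_none=p_base * q_base,  # neither factor changed
        v_a=p_edit * q_base,     # only compilation changed
        v_b=p_base * q_edit,     # only post-compilation changed
        v_both=p_edit * q_edit,  # both changed
    )


compile_part, post_compile_part = decompose_pass_rate_change(0.9, 0.8, 0.5, 0.6)
print(round(compile_part, 4), round(post_compile_part, 4))
print(round(compile_part + post_compile_part, 4))  # equals the total change
```

The two parts always sum to the total change in pass rate, which is what makes the split an attribution of already-measured results rather than an input to the metric itself.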

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The evaluation rests on standard assumptions from LLM benchmarking literature that Pass@k and execution-based correctness are valid proxies for real developer utility, plus the assumption that the chosen synthetic modifications capture the difficulty of real API changes.

axioms (1)
  • domain assumption: Execution-based metrics in a controlled sandbox accurately reflect whether an edit has produced correct API usage versus a workaround.
    Invoked when the paper distinguishes successful API adoption from workaround-based task completion.

