Understanding Robustness of Model Editing in Code LLMs

A.B Siddique; Moghis Fereidouni; Umar Farooq; Vinaik Chhetri

arxiv: 2511.03182 · v2 · submitted 2025-11-05 · 💻 cs.SE · cs.LG

Pith reviewed 2026-05-18 01:43 UTC · model grok-4.3

keywords: model editing · code LLMs · API migration · robustness · generalization · successive edits · execution evaluation

The pith

Model editing in code LLMs produces poor generalization to new API uses and degrades performance on unmodified tasks, with successive edits driving most models to near-zero success.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether lightweight model editing can update code LLMs to adopt new APIs as libraries evolve. It builds a benchmark of 2,040 problems across 140 synthetic API changes drawn from HumanEval, MBPP, and APPS, then runs edited models inside an execution sandbox that enforces the new API rules and checks whether solutions truly use the updated calls or merely bypass them. Under single edits, the models rarely apply the change to unseen code patterns, many passing solutions turn out to be workarounds, and accuracy on tasks involving unmodified APIs drops. When the same models receive edits one after another, performance on both edited-API and unmodified-API tasks collapses for nearly all method-model pairs.

Core claim

Under single edits, edited models generalize poorly to unseen uses of the modified API, and many apparent successes are workaround-based rather than true API migrations. Performance on tasks involving unmodified APIs also degrades, although memory-based methods and fine-tuning preserve specificity better than locate-then-edit methods. Under successive edits, most method-model combinations collapse to near-zero Pass@k on both generalization and specificity. A two-factor Shapley decomposition further shows that single-edit failures on generalization include a substantial compilation component, whereas specificity failures are more often post-compilation. Under successive edits, failures become predominantly compilation-driven.
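The Pass@k figures in the core claim are easier to read with a concrete estimator in mind. The sketch below assumes the standard unbiased estimator introduced with HumanEval (Chen et al., 2021); this summary does not confirm which estimator the authors use, and the function name `pass_at_k` is illustrative.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n sampled solutions, c of which passed."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) / comb(n, k) is the probability that a random
    # size-k subset of the n samples contains no passing solution.
    return 1.0 - comb(n - c, k) / comb(n, k)
```

With 10 samples and 3 passing, `pass_at_k(10, 3, 1)` is about 0.3. A near-zero Pass@k therefore means that almost none of the sampled solutions pass at all.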

What carries the argument

Execution sandbox that enforces edited APIs under standard Python semantics together with execution-based metrics that separate genuine adoption of the new API from workaround solutions that complete the task without using the edit.
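A minimal sketch of how such a sandbox could separate adoption from workarounds, assuming a hypothetical edit that removes `old_sum` and introduces `total`. None of these names come from the paper, and `exec` in a shared namespace is only a stand-in for a properly isolated sandbox.

```python
def evaluate(candidate_src: str, test_src: str) -> str:
    """Classify a candidate as 'adopted', 'workaround', or 'failed'."""
    calls = {"new_api": 0}

    def total(values):  # hypothetical edited API: records each use
        calls["new_api"] += 1
        return sum(values)

    def old_sum(values):  # hypothetical removed API: any use fails
        raise RuntimeError("old_sum was removed; use total")

    namespace = {"total": total, "old_sum": old_sum}
    try:
        exec(candidate_src, namespace)  # define the candidate solution
        exec(test_src, namespace)       # run the task's assertions
    except Exception:
        return "failed"
    # Tests passed: distinguish real adoption from bypassing the edit.
    return "adopted" if calls["new_api"] > 0 else "workaround"


# Illustrative candidates for one task (all hypothetical).
TESTS = "assert mean([1, 2, 3]) == 2\n"
ADOPT = "def mean(xs):\n    return total(xs) / len(xs)\n"
WORKAROUND = (
    "def mean(xs):\n"
    "    acc = 0\n"
    "    for x in xs:\n"
    "        acc += x\n"
    "    return acc / len(xs)\n"
)
STALE = "def mean(xs):\n    return old_sum(xs) / len(xs)\n"
```

Here `evaluate(ADOPT, TESTS)` returns `"adopted"`, `evaluate(WORKAROUND, TESTS)` returns `"workaround"`, and `evaluate(STALE, TESTS)` returns `"failed"`. A plain pass/fail metric would score the first two identically, which is the distinction the execution-based metrics are meant to expose.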

If this is right

Single edits cannot be assumed to produce reliable API migration because many passing solutions avoid the new API entirely.
Performance on tasks involving unmodified APIs declines after an edit, limiting safe use of edited models in mixed codebases.
Successive edits trigger broad interference that destroys capability on both edited and unedited APIs for most current methods.
Memory-based and fine-tuning approaches maintain higher specificity than locate-then-edit methods after a single change.
Generalization failures contain a large compilation component while specificity failures tend to occur after successful compilation.
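The compilation versus post-compilation split in the last point comes from a two-factor Shapley decomposition. The sketch below shows the generic two-player Shapley formula only; how the paper defines its value function is not stated in this summary, so the arguments and the numbers in the usage note are purely illustrative.

```python
def shapley_two_factor(v_none: float, v_a: float, v_b: float, v_both: float):
    """Shapley values for two factors A and B.

    v_none, v_a, v_b, v_both are the values of the (hypothetical) value
    function on the coalitions {}, {A}, {B}, and {A, B}.
    """
    # Each factor's value is its marginal contribution averaged over the
    # two possible orderings (joining first, or joining second).
    phi_a = 0.5 * ((v_a - v_none) + (v_both - v_b))
    phi_b = 0.5 * ((v_b - v_none) + (v_both - v_a))
    return phi_a, phi_b
```

With made-up inputs `shapley_two_factor(0.0, 0.30, 0.20, 0.60)`, factor A (say, compilation) receives roughly 0.35 and factor B (post-compilation) roughly 0.25. The two shares always sum to `v_both - v_none`, so the decomposition accounts for the whole gap.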

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Real deployment of edited code models would require additional runtime checks or test suites to detect hidden workarounds and unintended side effects on legacy code.
Editing pipelines may need explicit mechanisms to track interactions between multiple changes if they are to remain viable as libraries evolve over time.
The observed compilation-driven versus post-compilation failure split points to different intervention points: syntax-level regularization for generalization and semantic consistency checks for specificity.

Load-bearing premise

The synthetic API modifications and the execution-based metrics in the sandbox correctly distinguish genuine API adoption from workaround solutions that would not be possible or detectable in real-world usage of the edited models.

What would settle it

Measuring whether edited models emit code that actually invokes the new API function on fresh test cases that require the updated signature in ways never shown during editing, rather than completing the task through alternative code that avoids the edited symbol.
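One inexpensive first check along these lines is static: parse the generated code and look for a call to the edited symbol. The sketch below uses Python's standard `ast` module; the helper name `calls_function` is an assumption for illustration. A static check cannot see through aliases or dynamic dispatch, so it complements execution-based measurement rather than replacing it.

```python
import ast


def calls_function(source: str, name: str) -> bool:
    """Return True if `source` contains a call to `name`.

    Matches both bare calls (`name(...)`) and attribute calls
    (`module.name(...)`). Code that does not parse counts as no call.
    """
    try:
        tree = ast.parse(source)
    except SyntaxError:
        return False
    for node in ast.walk(tree):
        if isinstance(node, ast.Call):
            func = node.func
            if isinstance(func, ast.Name) and func.id == name:
                return True
            if isinstance(func, ast.Attribute) and func.attr == name:
                return True
    return False
```

For a hypothetical edited function `total`, `calls_function("y = lib.total([1, 2])", "total")` is true, while a solution that sums values in a manual loop yields false even if it passes every test.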

Figures

Figures reproduced from arXiv: 2511.03182 by A.B Siddique, Moghis Fereidouni, Umar Farooq, Vinaik Chhetri.

**Figure 2.** Taxonomy of outcomes when editing code LMs for API evolution.

Original abstract

Large language models (LLMs) for code are increasingly used in software development, but they remain static after pretraining while APIs and software libraries continue to evolve. Model editing offers a lightweight alternative to retraining for incorporating API updates, yet it remains unclear whether existing editing methods can induce correct API migration, generalize that behavior to unseen tasks, and preserve performance on tasks involving unmodified APIs. We present a controlled benchmark for evaluating model editing under API updates in code LLMs, built from HumanEval, MBPP, and APPS, with 2,040 problems spanning 140 unique synthetic API modifications, together with an execution sandbox that enforces edited APIs under standard Python semantics. We evaluate several state-of-the-art editing methods on three code LLMs under both single-edit and successive-edit regimes using execution-based metrics that distinguish successful API adoption from workaround-based task completion. Under single edits, edited models generalize poorly to unseen uses of the modified API, and many apparent successes are workaround-based rather than true API migrations. Performance on tasks involving unmodified APIs also degrades, although memory-based methods and fine-tuning preserve specificity better than locate-then-edit methods. Under successive edits, most method-model combinations collapse to near-zero Pass@k on both generalization and specificity, revealing substantial interference beyond the target edits. A two-factor Shapley decomposition further shows that single-edit failures on generalization include a substantial compilation component, whereas specificity failures are more often post-compilation. Under successive edits, failures become predominantly compilation-driven.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript introduces a controlled benchmark for evaluating model editing in code LLMs under API updates, constructed from 2,040 problems spanning HumanEval, MBPP, and APPS with 140 synthetic API modifications and an execution sandbox enforcing edited APIs under Python semantics. It evaluates state-of-the-art editing methods on three code LLMs in single-edit and successive-edit regimes using execution-based Pass@k metrics that distinguish true API adoption from workarounds. Key claims include poor generalization to unseen uses of modified APIs, prevalence of workaround-based successes, degradation on unmodified APIs (with memory-based methods faring better), and near-total collapse under successive edits; a Shapley decomposition attributes single-edit generalization failures partly to compilation issues and specificity failures to post-compilation errors, with successive-edit failures becoming predominantly compilation-driven.

Significance. If the results hold, the work is significant for providing empirical evidence that current model editing techniques are inadequate for robust API migration in code LLMs, revealing issues of poor generalization, workaround reliance, specificity loss, and edit interference. The benchmark design with execution metrics and post-hoc Shapley decomposition offers a reproducible framework that could steer development of more reliable editing approaches for maintaining LLMs amid evolving libraries.

major comments (3)

[§3] §3 (Benchmark and Sandbox): The central claims rest on the assumption that the 140 synthetic API modifications and sandbox execution correctly separate genuine migrations from workarounds. The manuscript must provide explicit justification or ablation showing how these modifications replicate real API changes (e.g., signature shifts, behavioral semantics, import side effects) rather than allowing artificial workarounds detectable only in the sandbox; without this, the Pass@k distinctions for generalization may not proxy real-world API updates.
[§5] §5 (Single-Edit Experiments): The claim that edited models degrade on tasks involving unmodified APIs is load-bearing for the specificity argument, yet the manuscript should report per-method degradation magnitudes with statistical tests and confirm that the observed differences between memory-based and locate-then-edit methods are not confounded by edit magnitude or hyperparameter choices.
[§6] §6 (Successive-Edit Regime): The reported collapse to near-zero Pass@k on both generalization and specificity under successive edits is a strong negative result, but the paper needs to detail the edit ordering, cumulative interference measurement, and whether failures stem from overwriting prior edits versus other mechanisms, as this directly supports the interference conclusion.

minor comments (2)

[Abstract] The abstract mentions evaluation on 'three code LLMs' without naming them; list the specific models in the abstract and early introduction for immediate clarity.
[Results] Ensure figures or tables presenting Pass@k results include variance estimates or multiple-run statistics to support the reported trends.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback on our manuscript. The comments highlight important areas for strengthening the justification of our benchmark, the statistical rigor of our specificity analysis, and the mechanistic details of interference under successive edits. We address each major comment below and commit to revisions that enhance the paper without altering its core findings.

Referee: [§3] §3 (Benchmark and Sandbox): The central claims rest on the assumption that the 140 synthetic API modifications and sandbox execution correctly separate genuine migrations from workarounds. The manuscript must provide explicit justification or ablation showing how these modifications replicate real API changes (e.g., signature shifts, behavioral semantics, import side effects) rather than allowing artificial workarounds detectable only in the sandbox; without this, the Pass@k distinctions for generalization may not proxy real-world API updates.

Authors: We agree that a clearer justification of the synthetic modifications is required to support the benchmark's ecological validity. In the revised manuscript we will add a new subsection in §3 that (i) categorizes the 140 modifications according to real-world API evolution patterns (signature changes, semantic shifts, import side-effects), (ii) provides explicit mappings to historical changes in libraries such as NumPy, pandas and requests, and (iii) reports an ablation that removes each modification category in turn and measures the resulting change in generalization Pass@k and workaround rates. These additions will demonstrate that the observed distinctions between true migration and workarounds are not artifacts of the sandbox alone. revision: yes
Referee: [§5] §5 (Single-Edit Experiments): The claim that edited models degrade on tasks involving unmodified APIs is load-bearing for the specificity argument, yet the manuscript should report per-method degradation magnitudes with statistical tests and confirm that the observed differences between memory-based and locate-then-edit methods are not confounded by edit magnitude or hyperparameter choices.

Authors: We accept the need for quantitative reporting and controls. The revision will include a new table in §5 that lists, for each method, the mean degradation on unmodified-API tasks together with standard deviations and p-values from paired Wilcoxon signed-rank tests. We will also add a paragraph and appendix sensitivity analysis showing that (a) edit magnitudes (measured by L2 norm of parameter updates) were matched across methods via a common hyperparameter search on a validation split, and (b) the relative advantage of memory-based methods persists across a grid of learning rates and edit strengths. These changes will be incorporated without modifying the original conclusions. revision: yes
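The paired test named in this response can be sketched without external dependencies. The function below computes an exact two-sided Wilcoxon signed-rank p-value by enumerating sign assignments, which is only practical for a small number of pairs; in practice `scipy.stats.wilcoxon` would normally be used. The function name and the pass counts in the usage note are hypothetical.

```python
from itertools import product


def wilcoxon_exact(before, after):
    """Exact two-sided Wilcoxon signed-rank test for paired samples.

    Returns (W_plus, p_value). Zero differences are dropped; tied
    absolute differences receive average ranks.
    """
    diffs = [a - b for b, a in zip(before, after) if a != b]
    n = len(diffs)
    if n == 0:
        raise ValueError("all paired differences are zero")
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        avg_rank = (i + j) / 2 + 1  # average of 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    total = sum(ranks)
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_min = min(w_plus, total - w_plus)
    # Under the null hypothesis every sign pattern is equally likely.
    extreme = 0
    for signs in product((0, 1), repeat=n):
        w = sum(r for r, s in zip(ranks, signs) if s)
        if min(w, total - w) <= w_min:
            extreme += 1
    return w_plus, extreme / 2 ** n
```

With hypothetical per-task pass counts `before = [50, 62, 71, 80, 55, 66]` and `after = [45, 50, 68, 60, 54, 58]`, every difference is negative, so `W_plus` is 0 and the p-value is 2/64 = 0.03125.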
Referee: [§6] §6 (Successive-Edit Regime): The reported collapse to near-zero Pass@k on both generalization and specificity under successive edits is a strong negative result, but the paper needs to detail the edit ordering, cumulative interference measurement, and whether failures stem from overwriting prior edits versus other mechanisms, as this directly supports the interference conclusion.

Authors: We welcome the request for greater transparency on the successive-edit protocol. In the revised §6 we will specify that edit order was randomized per experimental run but fixed by seed for reproducibility; introduce a cumulative interference metric (average performance drop on previously edited APIs after each new edit); and provide a failure-mode breakdown derived from execution logs indicating that overwriting of prior edits accounts for the majority of the observed collapse, with the remainder attributable to rising compilation errors. A supplementary figure will illustrate the progressive degradation trajectory. These details will be added while preserving the reported near-zero Pass@k outcome. revision: yes
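The protocol described in this response can be sketched as follows. This is one reading of "average performance drop on previously edited APIs after each new edit", not the authors' definition; the function names and the layout of `scores` are assumptions.

```python
import random


def edit_order(edit_ids, seed):
    """Shuffle edit ids with a seed-fixed RNG so a run can be reproduced."""
    order = list(edit_ids)
    random.Random(seed).shuffle(order)
    return order


def cumulative_interference(scores):
    """Average drop on earlier edits after each new edit.

    scores[t][j] is the (hypothetical) pass rate on the j-th applied edit,
    measured after t + 1 edits have been applied, for j <= t.
    """
    drops = []
    for t in range(1, len(scores)):
        # Compare each earlier edit's score right after it was applied
        # (scores[j][j]) with its score now (scores[t][j]).
        per_edit = [scores[j][j] - scores[t][j] for j in range(t)]
        drops.append(sum(per_edit) / len(per_edit))
    return drops
```

For made-up scores `[[0.8], [0.6, 0.7], [0.2, 0.3, 0.9]]`, the metric is about 0.2 after the second edit and about 0.5 after the third, a rising trajectory of the kind the promised supplementary figure would plot.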

Circularity Check

0 steps flagged

No circularity: empirical benchmark results with post-hoc attribution

The paper constructs a new benchmark from existing datasets (HumanEval, MBPP, APPS) with synthetic API modifications and measures editing performance via execution-based Pass@k metrics in a sandbox. These are direct empirical observations, not derivations. The two-factor Shapley decomposition is applied after the fact to decompose already-computed pass rates into compilation vs. post-compilation components and does not define or presuppose the success metric. No self-definitional equations, fitted parameters renamed as predictions, or load-bearing self-citations appear in the central claims. The evaluation chain is self-contained against external benchmarks and does not reduce to its own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The evaluation rests on standard assumptions from LLM benchmarking literature that Pass@k and execution-based correctness are valid proxies for real developer utility, plus the assumption that the chosen synthetic modifications capture the difficulty of real API changes.

axioms (1)

Domain assumption: Execution-based metrics in a controlled sandbox accurately reflect whether an edit has produced correct API usage versus a workaround.
Invoked when the paper distinguishes successful API adoption from workaround-based task completion.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear

Relation between the paper passage and the cited Recognition theorem: unclear.

Paper passage: We present a controlled benchmark for evaluating model editing under API updates in code LLMs, built from HumanEval, MBPP, and APPS, with 2,040 problems spanning 140 unique synthetic API modifications, together with an execution sandbox that enforces edited APIs under standard Python semantics.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

Relation between the paper passage and the cited Recognition theorem: unclear.

Paper passage: Under single edits, edited models generalize poorly to unseen uses of the modified API, and many apparent successes are workaround-based rather than true API migrations.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

[1] [1]

Amazon. 2023. Amazon CodeWhisperer: Build applications faster and more securely with your AI coding companion. https://aws.amazon.com/codewhisperer/

work page 2023

[2] [2]

Qwen Technical Report

Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, Binyuan Hui, Luo Ji, Mei Li, Junyang Lin, Runji Lin, Dayiheng Liu, Gao Liu, Chengqiang Lu, Keming Lu, Jianxin Ma, Proc. ACM Softw. Eng., Vol. 1, No. 1, Article . Publication date: November 2025. 24 Vinaik Chhetri, A.B Siddique, and Umar Farooq Rui ...

work page internal anchor Pith review Pith/arXiv arXiv 2025

[3] [3]

Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin...

work page internal anchor Pith review Pith/arXiv arXiv 2020

[4] [4]

Nicola De Cao, Wilker Aziz, and Ivan Titov. 2021. Editing Factual Knowledge in Language Models.Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing (EMNLP2021). https://arxiv.org/abs/2104.08164

work page arXiv 2021

[5] [6]

Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. 2021. Evaluating large language models trained on code.arXiv preprint arXiv:2107.03374(2021)

work page internal anchor Pith review Pith/arXiv arXiv 2021

[6] [7]

OpenJS Foundation / Node.js contributors. 2025. Deprecations — Node.js API (latestv20.x). https://nodejs.org/docs/ latest-v20.x/api/deprecations.html. Accessed: September 20, 2025

work page 2025

[7] [8]

Oracle Corporation. 2025. Deprecated List — Java SE 23 API Documentation. https://docs.oracle.com/en/java/javase/ 23/docs/api/deprecated-list.html. Accessed: September 20, 2025

work page 2025

[8] [9]

NumPy Developers. 2024. NumPy 2.0.0 Release Notes. https://numpy.org/doc/2.0/release/2.0.0-notes.html. Accessed: September 20, 2025

[9] [10]

NumPy Developers. 2025. NumPy. https://numpy.org/. Accessed: September 20, 2025

[10] [11]

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). 4171–4186

[11] [12]

Zhangyin Feng, Daya Guo, Duyu Tang, Nan Duan, Xiaocheng Feng, Ming Gong, Linjun Shou, Bing Qin, Ting Liu, Daxin Jiang, and Ming Zhou. 2020. CodeBERT: A Pre-Trained Model for Programming and Natural Languages. In Findings of the Association for Computational Linguistics: EMNLP 2020, Trevor Cohn, Yulan He, and Yang Liu (Eds.). Association for Computational ...

doi:10.18653/v1/2020.findings-emnlp.139

[12] [13]

Node.js Foundation. 2025. Node.js. https://nodejs.org/. Accessed: September 20, 2025

[13] [14]

Mor Geva, Roei Schuster, Jonathan Berant, and Omer Levy. 2021. Transformer Feed-Forward Layers Are Key-Value Memories. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, Marie-Francine Moens, Xuanjing Huang, Lucia Specia, and Scott Wen-tau Yih (Eds.). Association for Computational Linguistics, Online and Punta Cana, ...

doi:10.18653/v1/2021.emnlp-main.446

[14] [15]

GitHub. 2021. GitHub Copilot: Your AI Pair Programmer. https://copilot.github.com/

[15] [16]

Jian Gu, Aldeida Aleti, Chunyang Chen, and Hongyu Zhang. 2024. Neuron Patching: Semantic-based Neuron-level Language Model Repair for Code Generation. arXiv:2312.05356 [cs.SE] https://arxiv.org/abs/2312.05356

[16] [17]

Daya Guo, Qihao Zhu, Dejian Yang, Zhenda Xie, Kai Dong, Wentao Zhang, Guanting Chen, Xiao Bi, Y. Wu, Y. K. Li, Fuli Luo, Yingfei Xiong, and Wenfeng Liang. 2024. DeepSeek-Coder: When the Large Language Model Meets Programming – The Rise of Code Intelligence. arXiv:2401.14196 [cs.SE] https://arxiv.org/abs/2401.14196

[17] [18]

Akshat Gupta, Anurag Rao, and Gopala Anumanchipalli. 2024. Model Editing at Scale leads to Gradual and Catastrophic Forgetting. In Findings of the Association for Computational Linguistics: ACL 2024, Lun-Wei Ku, Andre Martins, and Vivek Srikumar (Eds.). Association for Computational Linguistics, Bangkok, Thailand, 15202–15232. doi:10.18653/v1/2024.finding...

[18] [19]

Akshat Gupta, Dev Sajnani, and Gopala Anumanchipalli. 2024. A Unified Framework for Model Editing. arXiv:2403.14236 [cs.LG] https://arxiv.org/abs/2403.14236

[19] [20]

Thomas Hartvigsen, Swami Sankaranarayanan, Hamid Palangi, Yoon Kim, and Marzyeh Ghassemi. 2023. Aging with GRACE: Lifelong Model Editing with Discrete Key-Value Adaptors. In Advances in Neural Information Processing ...

[20] [21]

Peter Hase, Mohit Bansal, Been Kim, and Asma Ghandeharioun. 2023. Does Localization Inform Editing? Surprising Differences in Causality-Based Localization vs. Knowledge Editing in Language Models. In Thirty-seventh Conference on Neural Information Processing Systems. https://openreview.net/forum?id=EldbUlZtbd

[21] [22]

Hamel Husain, Ho-Hsiang Wu, Tiferet Gazit, Miltiadis Allamanis, and Marc Brockschmidt. 2020. CodeSearchNet Challenge: Evaluating the State of Semantic Code Search. arXiv:1909.09436 [cs.LG] https://arxiv.org/abs/1909.09436

[22] [23]

Raymond Li, Loubna Ben Allal, Yangtian Zi, Niklas Muennighoff, Denis Kocetkov, Chenghao Mou, Marc Marone, Christopher Akiki, Jia Li, Jenny Chim, Qian Liu, Evgenii Zheltonozhskii, Terry Yue Zhuo, Thomas Wang, Olivier Dehaene, Mishig Davaadorj, Joel Lamy-Poirier, João Monteiro, Oleh Shliazhko, Nicolas Gontier, Nicholas Meade, Armel Zebaze, Ming-Ho Yee, Loge...

[23] [24]

Xiaopeng Li, Shasha Li, Shezheng Song, Huijun Liu, Bin Ji, Xi Wang, Jun Ma, Jie Yu, Xiaodong Liu, Jing Wang, and Weimin Zhang. 2025. SWEA: updating factual knowledge in large language models via subject word embedding altering. In Proceedings of the Thirty-Ninth AAAI Conference on Artificial Intelligence and Thirty-Seventh Conference on Innovative Applicat...

doi:10.1609/aaai.v39i23.34628

[24] [25]

Xiaopeng Li, Shasha Li, Shezheng Song, Jing Yang, Jun Ma, and Jie Yu. 2024. PMET: Precise model editing in a transformer. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 38. 18564–18572

[25] [26]

Xiaopeng Li, Shangwen Wang, Shasha Li, Jun Ma, Jie Yu, Xiaodong Liu, Jing Wang, Bin Ji, and Weimin Zhang. 2025. Model Editing for LLMs4Code: How Far are We?. In Proceedings of the IEEE/ACM 47th International Conference on Software Engineering (Ottawa, Ontario, Canada) (ICSE ’25). IEEE Press, 937–949. doi:10.1109/ICSE55347.2025.00049

[26] [27]

Zeyu Leo Liu, Shrey Pandit, Xi Ye, Eunsol Choi, and Greg Durrett. 2025. CodeUpdateArena: Benchmarking Knowledge Editing on API Updates. arXiv:2407.06249 [cs.CL] https://arxiv.org/abs/2407.06249

[27] [28]

Google LLC. 2024. API Differences Between 34 and 35 — Android Developers. https://developer.android.com/sdk/api_diff/35/changes. Accessed: September 20, 2025

[28] [29]

Google LLC. 2025. Android Developers. https://developer.android.com. Accessed: September 20, 2025

[29] [30]

Kevin Meng, David Bau, Alex Andonian, and Yonatan Belinkov. 2022. Locating and Editing Factual Associations in GPT. Advances in Neural Information Processing Systems 36 (2022). arXiv:2202.05262

[30] [31]

Kevin Meng, Arnab Sen Sharma, Alex Andonian, Yonatan Belinkov, and David Bau. 2023. Mass Editing Memory in a Transformer. The Eleventh International Conference on Learning Representations (ICLR) (2023)

[31] [32]

Eric Mitchell, Charles Lin, Antoine Bosselut, Chelsea Finn, and Christopher D Manning. 2022. Fast Model Editing at Scale. In International Conference on Learning Representations. https://openreview.net/pdf?id=0DcZxeWfOPt

[32] [33]

Eric Mitchell, Charles Lin, Antoine Bosselut, Chelsea Finn, and Christopher D. Manning. 2022. Memory-Based Model Editing at Scale. In International Conference on Machine Learning. https://arxiv.org/pdf/2206.06520.pdf

[33] [34]

Oracle. 2025. Java Platform, Standard Edition Documentation. https://docs.oracle.com/en/java/javase/. Accessed: September 20, 2025

[34] [35]

The pandas development team. 2022. Deprecations — pandas 1.5.0. https://pandas.pydata.org/pandas-docs/version/1.5/whatsnew/v1.5.0.html#deprecations. Accessed: September 20, 2025

[35] [36]

The pandas development team. 2022. pandas: pandas.concat. https://pandas.pydata.org/docs/reference/api/pandas.concat.html. Accessed: September 20, 2025

[36] [37]

The pandas development team. 2022. pandas: pandas.DataFrame.append. https://pandas.pydata.org/pandas-docs/version/1.4/reference/api/pandas.DataFrame.append.html. Accessed: September 20, 2025

[37] [38]

The pandas development team. 2025. pandas — Python Data Analysis Library. https://pandas.pydata.org/. Accessed: September 20, 2025

[38] [39]

Google Research. 2025. mbpp: Mostly Basic Python Problems. https://github.com/google-research/google-research/tree/master/mbpp. Accessed: August 15, 2025

[39] [41]

Code Llama: Open Foundation Models for Code

Baptiste Rozière, Jonas Gehring, Fabian Gloeckle, Sten Sootla, Itai Gat, Xiaoqing Ellen Tan, Yossi Adi, Jingyu Liu, Romain Sauvestre, Tal Remez, Jérémy Rapin, Artyom Kozhevnikov, Ivan Evtimov, Joanna Bitton, Manish Bhatt, Cristian Canton ...

[40] [42]

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. Advances in neural information processing systems 30 (2017)

[41] [43]

Jesse Vig, Sebastian Gehrmann, Yonatan Belinkov, Sharon Qian, Daniel Nevo, Yaron Singer, and Stuart Shieber. 2020. Investigating Gender Bias in Language Models Using Causal Mediation Analysis. In Advances in Neural Information Processing Systems, H. Larochelle, M. Ranzato, R. Hadsell, M.F. Balcan, and H. Lin (Eds.), Vol. 33. Curran Associates, Inc., 12388–...

[42] [44]

Peng Wang, Ningyu Zhang, Bozhong Tian, Zekun Xi, Yunzhi Yao, Ziwen Xu, Mengru Wang, Shengyu Mao, Xiaohan Wang, Siyuan Cheng, Kangwei Liu, Yuansheng Ni, Guozhou Zheng, and Huajun Chen. 2024. EasyEdit: An Easy-to-use Knowledge Editing Framework for Large Language Models. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguist...

doi:10.18653/v1/2024.acl-demos.9

[43] [45]

Yue Wang, Weishi Wang, Shafiq Joty, and Steven C.H. Hoi. 2021. CodeT5: Identifier-aware Unified Pre-trained Encoder-Decoder Models for Code Understanding and Generation. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, Marie-Francine Moens, Xuanjing Huang, Lucia Specia, and Scott Wen-tau Yih (Eds.). Association fo...

doi:10.18653/v1/2021.emnlp-main.685

[44] [46]

Pengcheng Yin, Bowen Deng, Edgar Chen, Bogdan Vasilescu, and Graham Neubig. 2018. Learning to mine aligned code and natural language pairs from stack overflow. In Proceedings of the 15th International Conference on Mining Software Repositories (Gothenburg, Sweden) (MSR ’18). Association for Computing Machinery, New York, NY, USA, 476–486. doi:10.1145/3196398.3196408

[45] [47]

Chen Zhu, Ankit Singh Rawat, Manzil Zaheer, Srinadh Bhojanapalli, Daliang Li, Felix Yu, and Sanjiv Kumar. 2020. Modifying Memories in Transformer Models. arXiv:2012.00363 [cs.CL] https://arxiv.org/abs/2012.00363 Proc. ACM Softw. Eng., Vol. 1, No. 1, Article . Publication date: November 2025