Understanding Robustness of Model Editing in Code LLMs
Pith reviewed 2026-05-18 01:43 UTC · model grok-4.3
The pith
Model editing in code LLMs produces poor generalization to new API uses and degrades performance on unmodified tasks, with successive edits driving most models to near-zero success.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Under single edits, edited models generalize poorly to unseen uses of the modified API, and many apparent successes are workaround-based rather than true API migrations. Performance on tasks involving unmodified APIs also degrades, although memory-based methods and fine-tuning preserve specificity better than locate-then-edit methods. Under successive edits, most method-model combinations collapse to near-zero Pass@k on both generalization and specificity. A two-factor Shapley decomposition further shows that single-edit failures on generalization include a substantial compilation component, whereas specificity failures are more often post-compilation. Under successive edits, failures become predominantly compilation-driven.
What carries the argument
Execution sandbox that enforces edited APIs under standard Python semantics together with execution-based metrics that separate genuine adoption of the new API from workaround solutions that complete the task without using the edit.
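To make the adoption-versus-workaround distinction concrete, here is a minimal sketch of how an execution harness could separate the two outcomes. It is not the paper's sandbox: the edited API `clamp_value` (imagined as having changed from positional to keyword-only bounds), the entry point `solve`, and the three outcome labels are all hypothetical, and plain `exec` provides no security isolation.

```python
def classify_solution(candidate_src, entry_point, tests):
    """Run a candidate against tests and report how it passed.

    Returns "adoption" if the tests pass and the edited API was called,
    "workaround" if the tests pass without calling it, and "fail" otherwise.
    """
    calls = {"new": 0}

    # Hypothetical edited API: bounds are keyword-only after the edit,
    # so code written against the old positional signature raises TypeError.
    def clamp_value(x, *, low, high):
        calls["new"] += 1
        return max(low, min(high, x))

    namespace = {"clamp_value": clamp_value}
    try:
        exec(candidate_src, namespace)  # defines the candidate function
        fn = namespace[entry_point]
        passed = all(fn(*args) == expected for args, expected in tests)
    except Exception:
        return "fail"
    if not passed:
        return "fail"
    return "adoption" if calls["new"] > 0 else "workaround"


TESTS = [((5,), 5), ((-3,), 0), ((42,), 10)]
ADOPTING = "def solve(x):\n    return clamp_value(x, low=0, high=10)\n"
WORKAROUND = "def solve(x):\n    return max(0, min(10, x))\n"
STALE = "def solve(x):\n    return clamp_value(x, 0, 10)\n"

for name, src in [("adopting", ADOPTING), ("workaround", WORKAROUND), ("stale", STALE)]:
    print(name, classify_solution(src, "solve", TESTS))
```

The workaround candidate passes every test yet never touches the edited symbol, which is the case a plain pass/fail metric would count as a success.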
If this is right
- Single edits cannot be assumed to produce reliable API migration because many passing solutions avoid the new API entirely.
- Performance on tasks that continue to use the original API declines after an edit, limiting safe use of edited models in mixed codebases.
- Successive edits trigger broad interference that destroys capability on both edited and unedited APIs for most current methods.
- Memory-based and fine-tuning approaches maintain higher specificity than locate-then-edit methods after a single change.
- Generalization failures contain a large compilation component while specificity failures tend to occur after successful compilation.
Where Pith is reading between the lines
- Real deployment of edited code models would require additional runtime checks or test suites to detect hidden workarounds and unintended side effects on legacy code.
- Editing pipelines may need explicit mechanisms to track interactions between multiple changes if they are to remain viable as libraries evolve over time.
- The observed compilation-driven versus post-compilation failure split points to different intervention points: syntax-level regularization for generalization and semantic consistency checks for specificity.
Load-bearing premise
The synthetic API modifications and the execution-based metrics in the sandbox correctly distinguish genuine API adoption from workaround solutions that would not be possible or detectable in real-world usage of the edited models.
What would settle it
Measuring whether edited models emit code that actually invokes the new API function on fresh test cases that require the updated signature in ways never shown during editing, rather than completing the task through alternative code that avoids the edited symbol.
Original abstract
Large language models (LLMs) for code are increasingly used in software development, but they remain static after pretraining while APIs and software libraries continue to evolve. Model editing offers a lightweight alternative to retraining for incorporating API updates, yet it remains unclear whether existing editing methods can induce correct API migration, generalize that behavior to unseen tasks, and preserve performance on tasks involving unmodified APIs. We present a controlled benchmark for evaluating model editing under API updates in code LLMs, built from HumanEval, MBPP, and APPS, with 2,040 problems spanning 140 unique synthetic API modifications, together with an execution sandbox that enforces edited APIs under standard Python semantics. We evaluate several state-of-the-art editing methods on three code LLMs under both single-edit and successive-edit regimes using execution-based metrics that distinguish successful API adoption from workaround-based task completion. Under single edits, edited models generalize poorly to unseen uses of the modified API, and many apparent successes are workaround-based rather than true API migrations. Performance on tasks involving unmodified APIs also degrades, although memory-based methods and fine-tuning preserve specificity better than locate-then-edit methods. Under successive edits, most method-model combinations collapse to near-zero Pass@k on both generalization and specificity, revealing substantial interference beyond the target edits. A two-factor Shapley decomposition further shows that single-edit failures on generalization include a substantial compilation component, whereas specificity failures are more often post-compilation. Under successive edits, failures become predominantly compilation-driven.
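For readers unfamiliar with Pass@k, the following is a sketch of the unbiased estimator that was introduced alongside HumanEval, where n samples are drawn per problem and c of them pass. The abstract does not state which estimator or which values of n and k the authors use, so treat this as an assumption about the metric rather than a description of their pipeline.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate the probability that at least one of k samples passes.

    n: samples generated for the problem, c: samples that passed,
    k: evaluation budget. Computes 1 - C(n - c, k) / C(n, k).
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    if n - c < k:
        # Every size-k subset contains at least one passing sample.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# One passing sample out of four, budget of two: 1 - C(3,2)/C(4,2) = 0.5
print(pass_at_k(n=4, c=1, k=2))
```

A "near-zero Pass@k" result therefore means that almost no sampled completion passes, not merely that the first sample fails.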
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces a controlled benchmark for evaluating model editing in code LLMs under API updates, constructed from 2,040 problems spanning HumanEval, MBPP, and APPS with 140 synthetic API modifications and an execution sandbox enforcing edited APIs under Python semantics. It evaluates state-of-the-art editing methods on three code LLMs in single-edit and successive-edit regimes using execution-based Pass@k metrics that distinguish true API adoption from workarounds. Key claims include poor generalization to unseen uses of modified APIs, prevalence of workaround-based successes, degradation on unmodified APIs (with memory-based methods faring better), and near-total collapse under successive edits; a Shapley decomposition attributes single-edit generalization failures partly to compilation issues and specificity failures to post-compilation errors, with successive-edit failures becoming predominantly compilation-driven.
Significance. If the results hold, the work is significant for providing empirical evidence that current model editing techniques are inadequate for robust API migration in code LLMs, revealing issues of poor generalization, workaround reliance, specificity loss, and edit interference. The benchmark design with execution metrics and post-hoc Shapley decomposition offers a reproducible framework that could steer development of more reliable editing approaches for maintaining LLMs amid evolving libraries.
major comments (3)
- [§3] §3 (Benchmark and Sandbox): The central claims rest on the assumption that the 140 synthetic API modifications and sandbox execution correctly separate genuine migrations from workarounds. The manuscript must provide explicit justification or ablation showing how these modifications replicate real API changes (e.g., signature shifts, behavioral semantics, import side effects) rather than allowing artificial workarounds detectable only in the sandbox; without this, the Pass@k distinctions for generalization may not proxy real-world API updates.
- [§5] §5 (Single-Edit Experiments): The claim that edited models degrade on tasks involving unmodified APIs is load-bearing for the specificity argument, yet the manuscript should report per-method degradation magnitudes with statistical tests and confirm that the observed differences between memory-based and locate-then-edit methods are not confounded by edit magnitude or hyperparameter choices.
- [§6] §6 (Successive-Edit Regime): The reported collapse to near-zero Pass@k on both generalization and specificity under successive edits is a strong negative result, but the paper needs to detail the edit ordering, cumulative interference measurement, and whether failures stem from overwriting prior edits versus other mechanisms, as this directly supports the interference conclusion.
minor comments (2)
- [Abstract] The abstract mentions evaluation on 'three code LLMs' without naming them; list the specific models in the abstract and early introduction for immediate clarity.
- [Results] Ensure figures or tables presenting Pass@k results include variance estimates or multiple-run statistics to support the reported trends.
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed feedback on our manuscript. The comments highlight important areas for strengthening the justification of our benchmark, the statistical rigor of our specificity analysis, and the mechanistic details of interference under successive edits. We address each major comment below and commit to revisions that enhance the paper without altering its core findings.
-
Referee: [§3] §3 (Benchmark and Sandbox): The central claims rest on the assumption that the 140 synthetic API modifications and sandbox execution correctly separate genuine migrations from workarounds. The manuscript must provide explicit justification or ablation showing how these modifications replicate real API changes (e.g., signature shifts, behavioral semantics, import side effects) rather than allowing artificial workarounds detectable only in the sandbox; without this, the Pass@k distinctions for generalization may not proxy real-world API updates.
Authors: We agree that a clearer justification of the synthetic modifications is required to support the benchmark's ecological validity. In the revised manuscript we will add a new subsection in §3 that (i) categorizes the 140 modifications according to real-world API evolution patterns (signature changes, semantic shifts, import side-effects), (ii) provides explicit mappings to historical changes in libraries such as NumPy, pandas and requests, and (iii) reports an ablation that removes each modification category in turn and measures the resulting change in generalization Pass@k and workaround rates. These additions will demonstrate that the observed distinctions between true migration and workarounds are not artifacts of the sandbox alone. revision: yes
-
Referee: [§5] §5 (Single-Edit Experiments): The claim that edited models degrade on tasks involving unmodified APIs is load-bearing for the specificity argument, yet the manuscript should report per-method degradation magnitudes with statistical tests and confirm that the observed differences between memory-based and locate-then-edit methods are not confounded by edit magnitude or hyperparameter choices.
Authors: We accept the need for quantitative reporting and controls. The revision will include a new table in §5 that lists, for each method, the mean degradation on unmodified-API tasks together with standard deviations and p-values from paired Wilcoxon signed-rank tests. We will also add a paragraph and appendix sensitivity analysis showing that (a) edit magnitudes (measured by L2 norm of parameter updates) were matched across methods via a common hyperparameter search on a validation split, and (b) the relative advantage of memory-based methods persists across a grid of learning rates and edit strengths. These changes will be incorporated without modifying the original conclusions. revision: yes
-
Referee: [§6] §6 (Successive-Edit Regime): The reported collapse to near-zero Pass@k on both generalization and specificity under successive edits is a strong negative result, but the paper needs to detail the edit ordering, cumulative interference measurement, and whether failures stem from overwriting prior edits versus other mechanisms, as this directly supports the interference conclusion.
Authors: We welcome the request for greater transparency on the successive-edit protocol. In the revised §6 we will specify that edit order was randomized per experimental run but fixed by seed for reproducibility; introduce a cumulative interference metric (average performance drop on previously edited APIs after each new edit); and provide a failure-mode breakdown derived from execution logs indicating that overwriting of prior edits accounts for the majority of the observed collapse, with the remainder attributable to rising compilation errors. A supplementary figure will illustrate the progressive degradation trajectory. These details will be added while preserving the reported near-zero Pass@k outcome. revision: yes
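The cumulative interference metric described in this response (average performance drop on previously edited APIs after each new edit) could be computed along the following lines. This is a sketch under assumptions: the triangular `scores` layout, the use of the score measured right after an API's own edit as its baseline, and the example numbers are all illustrative and do not come from the paper.

```python
def cumulative_interference(scores):
    """Average drop on earlier-edited APIs after each new edit.

    scores[t][i] is the score (for example Pass@k) on tasks for API i,
    measured after edit t has been applied, for i <= t (0-indexed).
    The baseline for API i is scores[i][i], its score right after its own edit.
    Returns one value per edit step t >= 1.
    """
    drops = []
    for t in range(1, len(scores)):
        per_api = [scores[i][i] - scores[t][i] for i in range(t)]
        drops.append(sum(per_api) / len(per_api))
    return drops


# Made-up scores for three successive edits.
example = [
    [0.8],
    [0.6, 0.7],
    [0.2, 0.3, 0.5],
]
print(cumulative_interference(example))
```

A value that grows with each step would indicate that later edits increasingly erode earlier ones, which is the overwriting mechanism the authors say dominates.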
Circularity Check
No circularity: empirical benchmark results with post-hoc attribution
The paper constructs a new benchmark from existing datasets (HumanEval, MBPP, APPS) with synthetic API modifications and measures editing performance via execution-based Pass@k metrics in a sandbox. These are direct empirical observations, not derivations. The two-factor Shapley decomposition is applied after the fact to decompose already-computed pass rates into compilation vs. post-compilation components and does not define or presuppose the success metric. No self-definitional equations, fitted parameters renamed as predictions, or load-bearing self-citations appear in the central claims. The evaluation chain is self-contained against external benchmarks and does not reduce to its own inputs by construction.
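A two-factor Shapley decomposition has a closed form, sketched below. The factor names `compile` and `post_compile`, the choice of value function (here, a drop in pass rate attributed to a subset of factors), and the numbers are assumptions for illustration; the summary does not give the paper's exact value function.

```python
def two_factor_shapley(v, a, b):
    """Shapley values for a two-player game.

    v maps each frozenset of factor names (empty, {a}, {b}, {a, b}) to a value.
    Each factor's share is its marginal contribution averaged over both orderings.
    """
    empty = frozenset()
    only_a, only_b, both = frozenset({a}), frozenset({b}), frozenset({a, b})
    phi_a = 0.5 * ((v[only_a] - v[empty]) + (v[both] - v[only_b]))
    phi_b = 0.5 * ((v[only_b] - v[empty]) + (v[both] - v[only_a]))
    return {a: phi_a, b: phi_b}


# Made-up drops in pass rate attributed to each subset of factors.
v = {
    frozenset(): 0.0,
    frozenset({"compile"}): 0.30,
    frozenset({"post_compile"}): 0.10,
    frozenset({"compile", "post_compile"}): 0.50,
}
shares = two_factor_shapley(v, "compile", "post_compile")
print(shares)
```

The two shares always sum to `v[both] - v[empty]`, so the decomposition splits an already-measured total and does not change it, consistent with the point that it is applied after the fact.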
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Execution-based metrics in a controlled sandbox accurately reflect whether an edit has produced correct API usage versus a workaround.
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/RealityFromDistinction.lean, theorem reality_from_one_distinction (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: We present a controlled benchmark for evaluating model editing under API updates in code LLMs, built from HumanEval, MBPP, and APPS, with 2,040 problems spanning 140 unique synthetic API modifications, together with an execution sandbox that enforces edited APIs under standard Python semantics.
- IndisputableMonolith/Cost/FunctionalEquation.lean, theorem washburn_uniqueness_aczel (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: Under single edits, edited models generalize poorly to unseen uses of the modified API, and many apparent successes are workaround-based rather than true API migrations.
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.