pith. machine review for the scientific record.

arxiv: 2604.09515 · v1 · submitted 2026-04-10 · 💻 cs.SE

Recognition: unknown

When LLMs Lag Behind: Knowledge Conflicts from Evolving APIs in Code Generation

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 16:44 UTC · model grok-4.3

classification 💻 cs.SE
keywords LLM code generation · API evolution · knowledge conflicts · context-memory conflict · RAG · code executability · self-reflection · Python libraries

The pith

Large language models often fail to incorporate API updates into their code generation, resulting in low rates of executable output even when current specifications are supplied.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper examines the challenge that arises when software libraries change after an LLM has finished training. It shows that the models' internal knowledge frequently overrides newer external details about API deprecations, modifications, and additions, so that much of the generated code will not run in the updated environment. The authors test this effect on real changes drawn from popular Python libraries and find that better documentation and larger models raise success rates but still leave a substantial fraction of outputs broken. Reasoning techniques provide an extra boost, yet the underlying conflict between stored and supplied knowledge persists. The work matters because developers increasingly rely on these models for coding assistance in a world where libraries evolve continuously.

Core claim

The paper constructs a benchmark of 270 real-world API updates across eight Python libraries and evaluates eleven models from four families on code generation under conditions of deprecation, modification, and addition. It reports that, without comprehensive documentation, only 42.55 percent of the generated examples execute correctly in the target environment. Structured documentation and larger model scales raise this figure to 66.36 percent, while reasoning-based strategies such as Self-Reflection add a further 11 percent improvement in executability. The central observation is that outdated internal patterns continue to influence outputs even when explicit update information is provided.

What carries the argument

Context-memory conflict between an LLM's static parametric knowledge and external API update specifications, measured by whether generated code examples execute successfully in the target environment.
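Executability here is an operational metric: each generated example is run in the updated environment and counted as a success only if it finishes without errors. Below is a minimal sketch of such a harness; the helper names and the choice to promote DeprecationWarning to an error are assumptions of this sketch, not the paper's published setup.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def is_executable(snippet: str, timeout_s: int = 30) -> bool:
    """Run one generated snippet in the installed (target) environment.

    The snippet counts as executable only if the subprocess exits cleanly.
    Treating DeprecationWarning as an error is an assumption made for this
    sketch so that calls to soon-to-be-removed APIs are also flagged.
    """
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "snippet.py"
        path.write_text(snippet, encoding="utf-8")
        try:
            proc = subprocess.run(
                [sys.executable, "-W", "error::DeprecationWarning", str(path)],
                capture_output=True,
                text=True,
                timeout=timeout_s,
            )
        except subprocess.TimeoutExpired:
            return False
    return proc.returncode == 0


def executable_rate(snippets: list[str]) -> float:
    """Fraction of generated examples that run without errors."""
    return sum(is_executable(s) for s in snippets) / len(snippets) if snippets else 0.0
```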

If this is right

  • Structured documentation improves LLMs' adoption of API changes but leaves more than one-third of outputs non-executable.
  • Increasing model scale helps modestly yet does not remove the underlying conflict with outdated internal knowledge.
  • Reasoning strategies such as Self-Reflection deliver an 11 percent gain in executable code on these tasks (a minimal sketch of such a loop follows this list).
  • The persistence of stale patterns indicates a need for benchmarks and techniques explicitly designed around ongoing API evolution.
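One way to picture the Self-Reflection strategy referenced above is as a generate-execute-revise loop in which execution errors are fed back to the model. A minimal sketch, assuming a generic `generate` callable and a local execution harness like the one above; it illustrates the loop shape, not the authors' exact procedure.

```python
import subprocess
import sys
import tempfile
from pathlib import Path
from typing import Callable, Tuple


def run_snippet(snippet: str) -> Tuple[bool, str]:
    """Execute a snippet in the target environment; return (ok, stderr)."""
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "snippet.py"
        path.write_text(snippet, encoding="utf-8")
        proc = subprocess.run(
            [sys.executable, str(path)], capture_output=True, text=True, timeout=30
        )
    return proc.returncode == 0, proc.stderr


def self_reflect(generate: Callable[[str], str], task_prompt: str, max_rounds: int = 3) -> str:
    """Generate code, execute it, and feed any error back for revision.

    `generate` is a placeholder for any prompt-to-code callable (e.g. a thin
    wrapper around an LLM API); it is not a specific interface from the paper.
    """
    code = generate(task_prompt)
    for _ in range(max_rounds):
        ok, stderr = run_snippet(code)
        if ok:
            break
        reflection_prompt = (
            f"{task_prompt}\n\n"
            f"Your previous code failed with:\n{stderr}\n"
            "Revise it so that it runs against the updated API."
        )
        code = generate(reflection_prompt)
    return code
```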

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Teams that integrate LLMs into development pipelines may need extra verification steps whenever libraries they depend on release updates.
  • Training methods that allow continuous incorporation of new facts could reduce reliance on post-hoc retrieval for time-sensitive information.
  • Similar knowledge conflicts are likely to appear in other generative tasks where facts change, such as legal drafting or medical advice.
  • Repeating the evaluation on libraries from additional programming languages would show whether the observed rates are specific to Python or more general.

Load-bearing premise

The 270 updates drawn from eight libraries represent typical API evolution, and the rate at which generated code runs correctly captures the practical impact of knowledge conflicts on development work.

What would settle it

Run the same benchmark on a new set of libraries and models after supplying documentation that is both more complete and formatted differently; if the executable rate remains below 70 percent for the largest models, the conflict persists beyond the tested conditions.

Figures

Figures reproduced from arXiv: 2604.09515 by Ahmed Nusayer Ashik, Muhammad Asaduzzaman, Shaowei Wang, Tse-Hsun Chen, Yuan Tian.

Figure 1. The overall workflow of our study.
Figure 2. Base prompt for code generation. The accompanying text describes a reduced-context condition in which the model receives only the update description (UD) without the full API documentation, compared against the full-context condition (UD + Doc) to quantify the documentation's contribution.
Figure 3. The average adoption rate (left) and executable rate …
Figure 4. Distribution of failure types in LLM-generated code.
Figure 6. Example of hallucination: the LLM invented a non…
Figure 7. Distribution of failure types in LLM-generated code.
Original abstract

The rapid evolution of software libraries creates a significant challenge for Large Language Models (LLMs), whose static parametric knowledge often becomes stale post-training. While retrieval-augmented generation (RAG) is commonly used to provide up-to-date API specifications, "context-memory conflict" arises when external instructions contradict a model's internal parametric knowledge. This paper presents a systematic empirical study of LLM code generation under API evolution (e.g., API deprecation, API modification, and API addition), by constructing a benchmark of 270 real-world updates from eight Python libraries. We evaluate four LLM families of 11 models. Our results show that without comprehensive documentation, LLMs struggle to prioritize external context, averaging only 42.55% of generated code examples are executable in the target environment. While structured documentation and larger model scales improve LLMs' ability to update adoption, they do not fully resolve executability issues with a low 66.36% executable rate. In addition, reasoning-based strategies (e.g., Self-Reflection) significantly boost LLMs' performance with 11% improvement on executable rate. Our findings highlight the persistence of outdated patterns from LLMs, even when API update specifications are provided, and emphasize the need for evolution-aware benchmarks and techniques.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper conducts an empirical study of LLMs' code generation under API evolution in Python libraries, constructing a benchmark of 270 real-world updates (deprecations, modifications, additions) from eight libraries. It evaluates 11 models across four families and reports that LLMs achieve only 42.55% executability without documentation (rising to 66.36% with structured docs), with further gains from scale and reasoning strategies such as self-reflection (11% improvement). The central claim is that LLMs struggle to override stale parametric knowledge with external context even when API updates are provided, motivating evolution-aware benchmarks and techniques.

Significance. If the results hold under a properly validated benchmark, the work provides concrete evidence of a persistent limitation in RAG-augmented code generation for dynamic software ecosystems. The use of real-world API updates across multiple libraries and model families, combined with the evaluation of mitigation strategies, offers actionable insights for improving LLM reliability in software engineering tasks.

major comments (2)
  1. [Benchmark construction and evaluation protocol (likely §3–4)] The interpretation that low executability rates (42.55% without docs, 66.36% with) demonstrate failure to resolve context-memory conflicts requires that the 270 selected updates are breaking changes where old-API code fails in the target environment. The manuscript provides no explicit confirmation or table documenting that each update triggers runtime or deprecation errors under the evaluation setup; without this, the percentages may primarily reflect baseline code-generation quality rather than specific prioritization of external context.
  2. [Abstract and §4 (Experiments)] The abstract and results sections report precise executability percentages and an 11% improvement from self-reflection, yet supply no details on benchmark construction methodology, prompting templates, exact definition of 'executable' (syntax vs. functional correctness), target environment versions, or controls for selection bias in the 270 updates. These omissions make it impossible to assess whether the numbers robustly support the stated conclusions about knowledge conflicts.
minor comments (2)
  1. [§3] The paper would benefit from a table or appendix listing the eight libraries, the distribution of update types (deprecation/modification/addition), and the specific criteria used to verify that each update is a breaking change.
  2. [Evaluation metrics] Clarify whether 'executable' includes runtime success only or also checks for correct functional behavior against expected outputs; this distinction affects how strongly the results speak to practical workflow impact.
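To make these two points concrete: a benchmark item qualifies as a breaking change only if minimal old-API code actually fails in the target environment, and "executable" (runs without error) is a weaker bar than functional correctness. A minimal sketch follows, using a pandas deprecation as a stand-in example; the specific snippet and checks are illustrative assumptions, not items from the paper's benchmark.

```python
import pandas as pd


def is_breaking_change(old_api_snippet: str) -> bool:
    """True if code written against the old API fails in the installed
    (target) environment, i.e. the update is genuinely breaking."""
    try:
        exec(old_api_snippet, {"pd": pd})  # illustrative only
    except Exception:
        return True   # e.g. AttributeError or TypeError from the changed API
    return False


# Hypothetical update: DataFrame.append was removed in pandas 2.x, so this
# old-API snippet raises AttributeError under a current installation.
OLD_API = "pd.DataFrame({'a': [1]}).append(pd.DataFrame({'a': [2]}))"

# "Executable" is weaker than "functionally correct": the replacement below
# runs, but only a check against expected output shows whether it is right.
new_df = pd.concat(
    [pd.DataFrame({"a": [1]}), pd.DataFrame({"a": [2]})], ignore_index=True
)
assert list(new_df["a"]) == [1, 2]  # functional check, beyond mere execution

print("breaking change:", is_breaking_change(OLD_API))
```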

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our empirical study of LLM code generation under API evolution. The comments highlight important aspects of benchmark validation and methodological transparency that we have addressed in the revision.

Point-by-point responses
  1. Referee: [Benchmark construction and evaluation protocol (likely §3–4)] The interpretation that low executability rates (42.55% without docs, 66.36% with) demonstrate failure to resolve context-memory conflicts requires that the 270 selected updates are breaking changes where old-API code fails in the target environment. The manuscript provides no explicit confirmation or table documenting that each update triggers runtime or deprecation errors under the evaluation setup; without this, the percentages may primarily reflect baseline code-generation quality rather than specific prioritization of external context.

    Authors: We agree that confirming the updates as breaking changes is essential to isolate context-memory conflicts from general generation quality. The original selection drew from official release notes and deprecation warnings across the eight libraries, but we did not include per-update error documentation. In the revised manuscript, we have added Table 2 in §3.1, which lists for each update the specific runtime error (e.g., AttributeError, TypeError, or DeprecationWarning treated as failure) observed when executing outdated code in the target environment. We also describe the verification procedure: generating minimal old-API snippets and confirming failure before including the update. This directly supports that the reported executability gaps reflect prioritization of stale parametric knowledge over provided context. revision: yes

  2. Referee: [Abstract and §4 (Experiments)] The abstract and results sections report precise executability percentages and an 11% improvement from self-reflection, yet supply no details on benchmark construction methodology, prompting templates, exact definition of 'executable' (syntax vs. functional correctness), target environment versions, or controls for selection bias in the 270 updates. These omissions make it impossible to assess whether the numbers robustly support the stated conclusions about knowledge conflicts.

    Authors: We acknowledge that the abstract and §4 omitted key methodological specifics needed for reproducibility and evaluation of robustness. The full construction details appear in §3, but we have now expanded the abstract with a concise description of the 270-update benchmark (stratified sampling from release notes of eight libraries, covering deprecation, modification, and addition). We added §3.3 on evaluation protocol, including: (i) exact prompting templates (now in Appendix A), (ii) definition of 'executable' as code that runs to completion without exceptions or deprecation-induced failures in the target environment, (iii) target versions (Python 3.9 with latest stable library releases), and (iv) bias controls via library-stratified and type-balanced sampling. These revisions allow direct assessment of whether the results demonstrate knowledge conflicts. revision: yes
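The evaluation protocol described in this response (prompting templates, documentation conditions) boils down to two templates that differ only in whether the updated API documentation is attached alongside the update description, i.e. the UD + Doc versus UD-only conditions mentioned with Figure 2. A minimal sketch, with invented field names and prompt wording; the paper's exact templates are in its appendix and are not reproduced here.

```python
from dataclasses import dataclass


@dataclass
class ApiUpdate:
    """Hypothetical container for one benchmark item; the field names are
    illustrative shorthand, not the paper's schema."""
    task: str                # natural-language coding task
    update_description: str  # short note on what changed in the API (UD)
    documentation: str       # documentation of the updated API (Doc)


def build_prompt(item: ApiUpdate, with_doc: bool) -> str:
    """Full-context condition (UD + Doc) versus reduced-context condition (UD only)."""
    parts = [
        "Write Python code for the following task using the current library version.",
        f"Task: {item.task}",
        f"API update description: {item.update_description}",
    ]
    if with_doc:
        parts.append(f"Updated API documentation:\n{item.documentation}")
    return "\n\n".join(parts)


# Example item (invented for illustration).
item = ApiUpdate(
    task="Concatenate two DataFrames row-wise.",
    update_description="DataFrame.append was removed; use pandas.concat instead.",
    documentation="pandas.concat(objs, *, axis=0, ignore_index=False, ...) -> DataFrame",
)
full_context_prompt = build_prompt(item, with_doc=True)      # UD + Doc
reduced_context_prompt = build_prompt(item, with_doc=False)  # UD only
```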

Circularity Check

0 steps flagged

No circularity: purely empirical reporting of observed executability rates

full rationale

The paper constructs a benchmark of 270 real-world API updates and reports direct experimental measurements of LLM code executability (42.55% without docs, 66.36% with structured docs, 11% gain from self-reflection). No equations, fitted parameters, derived predictions, or load-bearing self-citations appear in the derivation chain; all central claims are observational outcomes from the evaluation setup rather than reductions to inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The work is an empirical study and rests on standard domain assumptions about LLM knowledge being static post-training and executability serving as a proxy for successful conflict resolution; no free parameters or new entities are introduced.

axioms (2)
  • domain assumption LLMs possess static parametric knowledge that becomes stale after training cutoff
    Explicitly stated in the abstract as the source of the knowledge conflict problem.
  • domain assumption Executability of generated code in the target environment is a valid measure of whether external context has overridden internal outdated knowledge
    Used as the primary reported metric for all conditions and conclusions.

pith-pipeline@v0.9.0 · 5535 in / 1481 out tokens · 51215 ms · 2026-05-10T16:44:27.457059+00:00 · methodology

discussion (0)


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. When Retrieval Hurts Code Completion: A Diagnostic Study of Stale Repository Context

    cs.SE 2026-05 accept novelty 7.0

    Stale repository context in code RAG actively induces models to produce obsolete helper references, raising stale outputs by 76-88 percentage points over current-only retrieval in a 17-sample diagnostic study.

Reference graph

Works this paper leans on

60 extracted references · 23 canonical work pages · cited by 1 Pith paper · 11 internal anchors

  1. [1]

    https://github.com/AhmedNusayer/knowledge-conflict-codegen

    2026. https://github.com/AhmedNusayer/knowledge-conflict-codegen

  2. [2]

     Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie Cai, Michael Terry, Quoc Le, et al. 2021. Program synthesis with large language models. arXiv preprint arXiv:2108.07732 (2021)

  3. [3]

     Aniket Bhattacharyya, Anurag Tripathi, Ujjal Das, Archan Karmakar, Amit Pathak, and Maneesh Gupta. 2025. Information extraction from visually rich documents using LLM-based organization of documents into independent textual segments. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 17241–17256

  4. [4]

    Liguo Chen, Qi Guo, Hongrui Jia, Zhengran Zeng, Xin Wang, Yijiang Xu, Jian Wu, Yidong Wang, Qing Gao, Jindong Wang, et al. 2024. A survey on evaluating large language models in code generation tasks. arXiv preprint arXiv:2408.16498 (2024)

  5. [5]

     Sitao Cheng, Liangming Pan, Xunjian Yin, Xinyi Wang, and William Yang Wang. 2024. Understanding the interplay between parametric and contextual knowledge for large language models. arXiv preprint arXiv:2410.08414 (2024)

  6. [6]

     Interplay of parametric and contextual knowledge: A study of parametric knowledge utilisation in LLMs. arXiv preprint arXiv:2410.08414 (2024)

  7. [7]

    Farbod Daneshyan, Runzhi He, Jianyu Wu, and Minghui Zhou. 2025. Smartnote: An llm-powered, personalised release note generator that just works. Proceedings of the ACM on Software Engineering 2, FSE (2025), 1663–1686

  8. [8]

    Yangruibo Ding, Zijian Wang, Wasi Ahmad, Hantian Ding, Ming Tan, Nihal Jain, Murali Krishna Ramanathan, Ramesh Nallapati, Parminder Bhatia, Dan Roth, et al. 2023. Crosscodeeval: A diverse and multilingual benchmark for cross-file code completion. Advances in Neural Information Processing Systems 36 (2023), 46701–46723

  9. [9]

     Lingyue Fu, Huacan Chai, Shuang Luo, Kounianhua Du, Weiming Zhang, Longteng Fan, Jiayi Lei, Renting Rui, Jianghao Lin, Yuchen Fang, et al. 2023. Codeapex: A bilingual programming evaluation benchmark for large language models. arXiv preprint arXiv:2309.01940 (2023)

  10. [10]

    Pengfei Gao, Zhao Tian, Xiangxin Meng, Xinchen Wang, Ruida Hu, Yuanan Xiao, Yizhou Liu, Zhao Zhang, Junjie Chen, Cuiyun Gao, et al. 2025. Trae agent: An llm-based agent for software engineering with test-time scaling. arXiv preprint arXiv:2507.23370 (2025)

  11. [11]

    Xiaodong Gu, Meng Chen, Yalan Lin, Yuhan Hu, Hongyu Zhang, Chengcheng Wan, Zhao Wei, Yong Xu, and Juhong Wang. 2025. On the effectiveness of large language models in domain-specific code generation. ACM Transactions on Software Engineering and Methodology 34, 3 (2025), 1–22

  12. [12]

     Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shirong Ma, Xiao Bi, et al. 2025. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948 (2025)

  13. [13]

    Daya Guo, Qihao Zhu, Dejian Yang, Zhenda Xie, Kai Dong, Wentao Zhang, Guanting Chen, Xiao Bi, Yifan Wu, YK Li, et al. 2024. DeepSeek-Coder: when the large language model meets programming–the rise of code intelligence. arXiv preprint arXiv:2401.14196 (2024)

  14. [14]

     Daya Guo, Qihao Zhu, Dejian Yang, Zhenda Xie, Kai Dong, Wentao Zhang, Guanting Chen, Xiao Bi, Y. Wu, Y. K. Li, Fuli Luo, Yingfei Xiong, and Wenfeng Liang. 2024. DeepSeek-Coder: When the Large Language Model Meets Programming - The Rise of Code Intelligence. CoRR abs/2401.14196 (2024). https://doi.org/10.48550/arXiv.2401.14196

  15. [15]

     Pengfei He, Shaowei Wang, Shaiful Chowdhury, and Tse-Hsun Chen. 2025. Evaluating the effectiveness and efficiency of demonstration retrievers in rag for coding tasks. In 2025 IEEE International Conference on Software Analysis, Evolution and Reengineering (SANER). IEEE, 500–510

  16. [16]

     Dan Hendrycks, Steven Basart, Saurav Kadavath, Mantas Mazeika, Akul Arora, Ethan Guo, Collin Burns, Samir Puranik, Horace He, Dawn Song, et al. [n. d.]. Measuring Coding Challenge Competence With APPS. In Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2)

  17. [17]

     Binyuan Hui, Jian Yang, Zeyu Cui, Jiaxi Yang, Dayiheng Liu, Lei Zhang, Tianyu Liu, Jiajun Zhang, Bowen Yu, Keming Lu, et al. 2024. Qwen2.5-Coder technical report. arXiv preprint arXiv:2409.12186 (2024)

  18. [18]

    Juyong Jiang, Fan Wang, Jiasi Shen, Sungju Kim, and Sunghun Kim. 2026. A survey on large language models for code generation. ACM Transactions on Software Engineering and Methodology 35, 2 (2026), 1–72

  19. [19]

    Carlos E Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan. 2023. Swe-bench: Can language models resolve real-world github issues? arXiv preprint arXiv:2310.06770 (2023)

  20. [20]

     Ranim Khojah, Francisco Gomes de Oliveira Neto, Mazen Mohamad, and Philipp Leitner. 2025. The impact of prompt programming on function-level code generation. IEEE Transactions on Software Engineering (2025)

  21. [21]

     Sachit Kuhar, Wasi Ahmad, Zijian Wang, Nihal Jain, Haifeng Qian, Baishakhi Ray, Murali Krishna Ramanathan, Xiaofei Ma, and Anoop Deoras. 2025. LibEvolutionEval: A benchmark and study for version-specific code generation. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human L...

  22. [22]

    Maxime Lamothe, Yann-Gaël Guéhéneuc, and Weiyi Shang. 2021. A systematic review of API evolution literature. ACM Computing Surveys (CSUR) 54, 8 (2021), 1–36

  23. [23]

    Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, et al. 2020. Retrieval-augmented generation for knowledge-intensive nlp tasks. Advances in neural information processing systems 33 (2020), 9459–9474

  24. [24]

     Jia Li, Xianjie Shi, Kechi Zhang, Ge Li, Zhi Jin, Lei Li, Huangzhao Zhang, Fang Liu, Yuwei Zhang, Zhengwei Tao, et al. 2025. GraphCodeAgent: Dual Graph-Guided LLM Agent for Retrieval-Augmented Repo-Level Code Generation. arXiv preprint arXiv:2504.10046 (2025)

  25. [25]

    R Li, LB Allal, Y Zi, N Muennighoff, D Kocetkov, C Mou, M Marone, C Akiki, J Li, J Chim, et al. 2023. StarCoder: May the Source be With You! Transactions on machine learning research (2023)

  26. [26]

     Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. 2021. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374 (2021)

  27. [27]

     Linxi Liang, Jing Gong, Mingwei Liu, Chong Wang, Guangsheng Ou, Yanlin Wang, Xin Peng, and Zibin Zheng. 2025. RustEvo^2: An Evolving Benchmark for API Evolution in LLM-based Rust Code Generation. arXiv preprint arXiv:2503.16922 (2025)

  28. [28]

    Ming Liang, Xiaoheng Xie, Gehao Zhang, Xunjin Zheng, Peng Di, Hongwei Chen, Chengpeng Wang, Gang Fan, et al. 2024. Repofuse: Repository-level code completion with fused dual context. arXiv preprint arXiv:2402.14323 (2024)

  29. [29]

     Jiawei Liu, Chunqiu Steven Xia, Yuyao Wang, and Lingming Zhang. 2023. Is your code generated by chatgpt really correct? rigorous evaluation of large language models for code generation. Advances in Neural Information Processing Systems 36 (2023), 21558–21572

  30. [30]

    Jiawei Liu, Chunqiu Steven Xia, Yuyao Wang, and Lingming Zhang. 2023. Is Your Code Generated by ChatGPT Really Correct? Rigorous Evaluation of Large Language Models for Code Generation. In Thirty-seventh Conference on Neural Information Processing Systems. https://openreview.net/forum?id=1qvx610Cu7

  31. [31]

    Pengfei Liu, Weizhe Yuan, Jinlan Fu, Zhengbao Jiang, Hiroaki Hayashi, and Graham Neubig. 2023. Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing. ACM computing surveys 55, 9 (2023), 1–35

  32. [32]

    Tianyang Liu, Canwen Xu, and Julian McAuley. 2023. Repobench: Benchmarking repository-level code auto-completion systems. arXiv preprint arXiv:2306.03091 (2023)

  33. [33]

     Aman Madaan, Niket Tandon, Prakhar Gupta, Skyler Hallinan, Luyu Gao, Sarah Wiegreffe, Uri Alon, Nouha Dziri, Shrimai Prabhumoye, Yiming Yang, et al. 2023. Self-refine: Iterative refinement with self-feedback. Advances in neural information processing systems 36 (2023), 46534–46594

  34. [34]

     Self-refine: Iterative refinement with self-feedback. Advances in neural information processing systems 36 (2023), 46534–46594

  35. [35]

     Diganta Misra, Nizar Islah, Victor May, Brice Rauby, Zihan Wang, Justine Gehring, Antonio Orvieto, Muawiz Chaudhary, Eilif B Muller, Irina Rish, et al. 2025. GitChameleon 2.0: Evaluating AI Code Generation Against Python Library Version Incompatibilities. arXiv preprint arXiv:2507.12367 (2025)

  36. [36]

     OpenAI. 2024. Models – OpenAI Platform Documentation. https://developers.openai.com/api/docs/models/gpt-4o-mini. Accessed: 2026-03-06

  37. [37]

    Sida Peng, Eirini Kalliamvakou, Peter Cihon, and Mert Demirer. 2023. The impact of ai on developer productivity: Evidence from github copilot. arXiv preprint arXiv:2302.06590 (2023)

  38. [38]

     Baptiste Roziere, Jonas Gehring, Fabian Gloeckle, Sten Sootla, Itai Gat, Xiaoqing Ellen Tan, Yossi Adi, Jingyu Liu, Romain Sauvestre, Tal Remez, et al. 2023. Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023)

  39. [39]

    Pranab Sahoo, Ayush Kumar Singh, Sriparna Saha, Vinija Jain, Samrat Mondal, and Aman Chadha. 2024. A systematic survey of prompt engineering in large language models: Techniques and applications. arXiv preprint arXiv:2402.07927 1 (2024)

  40. [40]

     Aaditya Singh, Adam Fry, Adam Perelman, Adam Tart, Adi Ganesh, Ahmed El-Kishky, Aidan McLaughlin, Aiden Low, AJ Ostrow, Akhila Ananthram, et al. 2025. OpenAI GPT-5 system card. arXiv preprint arXiv:2601.03267 (2025)

  41. [41]

     OpenAI GPT-5 System Card. arXiv preprint arXiv:2601.03267 (2025)

  42. [42]

    Kaiser Sun, Fan Bai, and Mark Dredze. 2025. What is seen cannot be unseen: The disruptive effect of knowledge conflict on large language models. arXiv e-prints (2025), arXiv–2506

  43. [43]

    Chong Wang, Kaifeng Huang, Jian Zhang, Yebo Feng, Lyuye Zhang, Yang Liu, and Xin Peng. 2025. LLMs Meet Library Evolution: Evaluating Deprecated API Usage in LLM-Based Code Completion. In 2025 IEEE/ACM 47th International Conference on Software Engineering (ICSE). IEEE, 885–897

  44. [44]

    Chong Wang, Kaifeng Huang, Jian Zhang, Yebo Feng, Lyuye Zhang, Yang Liu, and Xin Peng. 2025. Llms meet library evolution: Evaluating deprecated api usage in llm-based code completion. In 2025 ieee/acm 47th international conference on software engineering (icse). IEEE, 885–897

  45. [45]

     Jiawei Wang, Li Li, Kui Liu, and Haipeng Cai. 2020. Exploring how deprecated python library apis are (not) handled. In Proceedings of the 28th acm joint meeting on european software engineering conference and symposium on the foundations of software engineering. 233–244

  46. [46]

    Xingyao Wang, Boxuan Li, Yufan Song, Frank F Xu, Xiangru Tang, Mingchen Zhuge, Jiayi Pan, Yueqi Song, Bowen Li, Jaskirat Singh, et al. 2024. Openhands: An open platform for ai software developers as generalist agents. arXiv preprint arXiv:2407.16741 (2024)

  47. [47]

    Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. 2022. Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing systems 35 (2022), 24824–24837

  48. [48]

    Tongtong Wu, Weigang Wu, Xingyu Wang, Kang Xu, Suyu Ma, Bo Jiang, Ping Yang, Zhenchang Xing, Yuan-Fang Li, and Gholamreza Haffari. 2024. Versicode: Towards version-controllable code generation. arXiv preprint arXiv:2406.07411 (2024)

  49. [49]

     Yixi Wu, Pengfei He, Zehao Wang, Shaowei Wang, Yuan Tian, and Tse-Hsun Chen. 2024. A comprehensive framework for evaluating api-oriented code generation in large language models. arXiv preprint arXiv:2409.15228 (2024)

  50. [50]

     A comprehensive framework for evaluating api-oriented code generation in large language models. arXiv preprint arXiv:2409.15228 (2024)

  51. [51]

     Laerte Xavier, Aline Brito, Andre Hora, and Marco Tulio Valente. 2017. Historical and impact analysis of API breaking changes: A large-scale study. In 2017 IEEE 24th International Conference on Software Analysis, Evolution and Reengineering (SANER). IEEE, 138–147

  52. [52]

    Jian Xie, Kai Zhang, Jiangjie Chen, Renze Lou, and Yu Su. 2023. Adaptive chameleon or stubborn sloth: Revealing the behavior of large language models in knowledge conflicts. In The Twelfth International Conference on Learning Representations

  53. [53]

    Rongwu Xu, Zehan Qi, Zhijiang Guo, Cunxiang Wang, Hongru Wang, Yue Zhang, and Wei Xu. 2024. Knowledge conflicts for llms: A survey. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing. 8541–8565

  54. [54]

    Xu Yang, Jiayuan Zhou, Michael Pacheco, Wenhan Zhu, Pengfei He, Shaowei Wang, Kui Liu, and Ruiqi Pan. 2025. Lingxi: Repository-Level Issue Resolution Framework Enhanced by Procedural Knowledge Guided Scaling. arXiv preprint arXiv:2510.11838 (2025)

  55. [55]

    Daoguang Zan, Ailun Yu, Bo Shen, Bei Chen, Wei Li, Yongshun Gong, Xiaolin Chen, Yafen Yao, Weihua Luo, Bei Guan, et al. 2024. DiffCoder: Enhancing large language model on API invocation via analogical code exercises. Proceedings of the ACM on Software Engineering 1, FSE (2024), 406–426

  56. [56]

     Kechi Zhang, Jia Li, Ge Li, Xianjie Shi, and Zhi Jin. 2024. Codeagent: Enhancing code generation with tool-integrated agent systems for real-world repo-level coding challenges. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 13643–13658

  57. [57]

    Zhaoxu Zhang, Hengcheng Zhu, Ming Wen, Yida Tao, Yepang Liu, and Yingfei Xiong. 2020. How do python framework apis evolve? an exploratory study. In 2020 ieee 27th international conference on software analysis, evolution and reengineering (saner). IEEE, 81–92

  58. [58]

    Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, et al. 2023. Judging llm-as-a-judge with mt-bench and chatbot arena. Advances in neural information processing systems 36 (2023), 46595–46623

  59. [59]

    Qinkai Zheng, Xiao Xia, Xu Zou, Yuxiao Dong, Shan Wang, Yufei Xue, Lei Shen, Zihan Wang, Andi Wang, Yang Li, et al. 2023. Codegeex: A pre-trained model for code generation with multilingual benchmarking on humaneval-x. In Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining. 5673–5684

  60. [60]

     Albert Ziegler, Eirini Kalliamvakou, X Alice Li, Andrew Rice, Devon Rifkin, Shawn Simister, Ganesh Sittampalam, and Edward Aftandilian. 2024. Measuring github copilot’s impact on productivity. Commun. ACM 67, 3 (2024), 54–63