pith. sign in

arxiv: 2605.17957 · v1 · pith:ZA2FURZWnew · submitted 2026-05-18 · 💻 cs.SE

Contextualized Code Pretraining for Code Generation

Pith reviewed 2026-05-20 09:32 UTC · model grok-4.3

classification 💻 cs.SE
keywords code generationpretrainingcalling contextstatic analysiscaller-callee pairssoftware engineeringrepository-level evaluation
0
0 comments X

The pith

Pretraining code models on calling context improves their ability to generate functions that fit into real projects.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Modern code generation models are trained mostly on natural language descriptions, yet real development happens inside existing codebases where the call site already reveals expected behavior. The paper extracts caller-callee pairs from real repositories using static analysis and uses them to pretrain models with an invocation-aware objective that conditions generation on surrounding usage context. The resulting CallerGen models, sized 220M and 0.5B, reach 16.58 percent and 22.81 percent pass@1 on a new benchmark of realistic call-site scenarios, beating same-scale baselines and staying competitive with much larger models.

Core claim

Contextualized code pretraining integrates calling context into both training and evaluation by automatically mining large-scale caller-callee pairs from repositories. Models trained this way learn to implement a callee function given its actual usage site, producing code that integrates more smoothly with surrounding repository code than models trained only on isolated functions or natural-language prompts.

What carries the argument

Invocation-aware pretraining that conditions generation on extracted caller-callee pairs from static analysis of real code.

If this is right

  • Generated functions integrate more reliably into existing codebases rather than requiring post-hoc fixes.
  • Evaluation shifts from isolated function correctness to repository-level compatibility.
  • Static analysis offers a scalable source of training signals that does not rely on natural-language documentation.
  • Smaller models can close the gap with larger ones when context is explicitly supplied.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same extraction technique could supply context for other tasks such as bug localization or refactoring.
  • IDE code-completion tools might improve by retrieving similar call sites instead of relying solely on language-model priors.
  • If call context proves as useful as shown, future benchmarks should require models to produce code that passes integration tests against real callers.

Load-bearing premise

Static analysis can automatically extract large-scale, high-quality caller-callee pairs from real repositories without significant errors or selection bias.

What would settle it

A manual audit of thousands of extracted pairs that finds frequent mismatches between the documented call site and the actual callee implementation, or a drop in performance when the same models are tested on hand-curated context-aware tasks.

Figures

Figures reproduced from arXiv: 2605.17957 by Chen Liu, Hanwen Zhang, Lu Zhang, Qingyuan Liang, Yakun Zhang, Zeyu Sun.

Figure 1
Figure 1. Figure 1: Web development: high-level logic first, then called functions. [PITH_FULL_IMAGE:figures/full_fig_p003_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Comparison of HumanEval (a) and CoderEval (b). surface dependency-relevant snippets beyond surface similarity. DraCo [10] leverages extended dataflow analysis to guide retrieval augmentation so that the retrieved context better aligns with required program dependencies. CoCoGen [6] uses compiler feedback and static analysis to diag￾nose missing project-specific information and iteratively improve the retri… view at source ↗
Figure 3
Figure 3. Figure 3: Overall training workflow of CallerGen. process to obtain the necessary caller-driven information from large-scale Python repositories. CallerGen is trained using a unified objective that conditions generation on the calling context. We apply this training on two types of model architectures: an encoder-decoder model (CodeT5) and a decoder-only model (Qwen2.5-Coder), allowing CallerGen to generalize across… view at source ↗
Figure 4
Figure 4. Figure 4: (a) A sample of CallerEval. (b) The end-to-end evaluation process of CallerEval. 3.3 Model Architecture CallerGen adopts invocation-aware training on two different model architectures to demonstrate the generality of our approach. For the encoder-decoder architecture CodeT5, we pretrain on both the 60M version (CodeT5-small) and the 220M version (CodeT5-base) by incorporating calling contexts into the trai… view at source ↗
Figure 5
Figure 5. Figure 5: Case 1: a function handles file-descriptor normalization in an event-driven I/O framework. [PITH_FULL_IMAGE:figures/full_fig_p020_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Case 2: a function that converts CSS-style strings into sequences of attribute–value tuples. [PITH_FULL_IMAGE:figures/full_fig_p021_6.png] view at source ↗
read the original abstract

As code generation becomes increasingly central to improving software development efficiency, modern code models are largely trained and evaluated on code with natural-language descriptions. In real projects, developers often implement missing functions under limited project-specific artifacts, while the local call-site context is already available in the surrounding code. This usage context provides actionable cues about expected behavior, but existing models are not explicitly optimized to leverage it reliably, leading to implementations that may not integrate smoothly with surrounding usage in repository settings. In this work, we propose contextualized code pretraining, an invocation-aware framework that integrates calling context into both the training and evaluation of code models. Using static analysis, we automatically extract large-scale caller-callee pairs from real repositories to construct pretraining tasks and benchmarks that condition generation on the calling context. We train CallerGen, the first code models pretrained with invocation-aware objectives spanning multiple sizes, and evaluate them on CallerEval, a new benchmark featuring realistic scenarios. Experiments show that CallerGen outperforms comparable-scale models and remains competitive with larger ones across two benchmarks. Our 220M and 0.5B models achieve 16.58% and 22.81@% pass1, surpassing baselines on CallerEval. These results highlight the importance of calling context in realistic code generation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper proposes contextualized code pretraining, an invocation-aware framework that uses static analysis to automatically extract large-scale caller-callee pairs from real repositories. These pairs are used to construct pretraining objectives and a new benchmark (CallerEval) that conditions code generation on calling context. The authors train CallerGen models (220M and 0.5B parameters) and report that they achieve 16.58% and 22.81% pass@1 on CallerEval, outperforming comparable-scale baselines while remaining competitive with larger models across two benchmarks.

Significance. If the extraction pipeline proves reliable, the work would demonstrate that explicitly modeling invocation context can improve code generation in realistic repository settings beyond natural-language-only training. The scale of automatically mined pairs and the introduction of CallerEval as a usage-context benchmark are potentially valuable contributions to the field of code models.

major comments (3)
  1. [Abstract and experimental results] Abstract and experimental results section: the central performance claims (16.58% and 22.81% pass@1 on CallerEval, outperforming baselines) rest on unreported details including data splits, error bars, statistical significance, controls for post-hoc choices in static analysis, and validation that the extracted caller-callee pairs are free of substantial noise or selection bias.
  2. [Method (pair extraction)] Method section on pair extraction: the claim that static analysis yields large-scale, high-quality caller-callee pairs providing reliable behavioral signals is load-bearing for both pretraining and CallerEval validity, yet no accuracy metrics, manual validation results, or error-rate analysis for unresolved dynamic calls, virtual methods, or incomplete modules are provided.
  3. [Benchmark construction] Benchmark construction: potential train-test leakage arising from using the same static-analysis pipeline for both pretraining data and CallerEval construction is not addressed or quantified, which directly affects whether the reported gains reflect genuine context utilization.
minor comments (2)
  1. [Abstract] Abstract contains a typographical error: '22.81@% pass1' should read '22.81% pass@1'.
  2. [Throughout] Notation for pass@1 is inconsistent between the abstract and later text; standardize to 'pass@1' throughout.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their thorough review and valuable suggestions. We address each of the major comments in detail below and commit to making the necessary revisions to improve the clarity and rigor of the paper.

read point-by-point responses
  1. Referee: [Abstract and experimental results] Abstract and experimental results section: the central performance claims (16.58% and 22.81% pass@1 on CallerEval, outperforming baselines) rest on unreported details including data splits, error bars, statistical significance, controls for post-hoc choices in static analysis, and validation that the extracted caller-callee pairs are free of substantial noise or selection bias.

    Authors: We agree with the referee that the experimental claims require more supporting details for reproducibility and credibility. In the revised version of the manuscript, we will expand the experimental results section to include: a clear description of the train/validation/test splits for both pretraining and CallerEval; error bars and standard deviations from at least three independent runs with different random seeds; statistical significance testing (e.g., McNemar's test or bootstrap methods) for the performance improvements; an analysis of sensitivity to post-hoc choices in the static analysis (such as call graph construction parameters); and results from a manual validation of a random sample of 300 extracted caller-callee pairs, reporting precision and any observed biases. These changes will directly address the concerns about unreported details. revision: yes

  2. Referee: [Method (pair extraction)] Method section on pair extraction: the claim that static analysis yields large-scale, high-quality caller-callee pairs providing reliable behavioral signals is load-bearing for both pretraining and CallerEval validity, yet no accuracy metrics, manual validation results, or error-rate analysis for unresolved dynamic calls, virtual methods, or incomplete modules are provided.

    Authors: The referee correctly identifies that the quality of the extracted pairs is foundational to our claims. The original manuscript focuses on the scale and automation of the extraction but does not provide quantitative validation. We will revise the Method section to include accuracy metrics obtained through manual annotation of a stratified sample of pairs (covering different programming languages and project sizes if applicable), error rates specifically for unresolved dynamic dispatches, virtual method calls, and cases with incomplete type information or missing modules. We will also describe any filtering steps applied to mitigate noise. This addition will substantiate the reliability of the behavioral signals used. revision: yes

  3. Referee: [Benchmark construction] Benchmark construction: potential train-test leakage arising from using the same static-analysis pipeline for both pretraining data and CallerEval construction is not addressed or quantified, which directly affects whether the reported gains reflect genuine context utilization.

    Authors: We appreciate the referee highlighting the risk of train-test leakage, which could inflate the perceived benefits of contextualized pretraining. We will add a dedicated analysis in the Benchmark construction subsection quantifying the overlap: specifically, the fraction of CallerEval instances whose caller-callee pairs or surrounding contexts appear in the pretraining data. To mitigate this, we will either use repository-level disjoint splits or apply aggressive deduplication based on code similarity. The revised manuscript will report the measured leakage rate and, if necessary, present re-evaluated results on a leakage-free version of CallerEval. This will help confirm that the gains stem from better context utilization rather than data contamination. revision: yes

Circularity Check

0 steps flagged

No significant circularity in derivation chain

full rationale

The paper's claims center on an empirical pipeline: static analysis extracts caller-callee pairs from repositories to build pretraining objectives for CallerGen and the CallerEval benchmark, with reported pass@1 scores (16.58% for 220M, 22.81% for 0.5B) outperforming baselines. No equations, self-definitional reductions, fitted parameters renamed as predictions, or load-bearing self-citations appear in the provided text. The performance results are presented as outcomes of training and evaluation rather than forced by construction from the extraction method itself. The approach follows standard pretraining and benchmarking without ansatzes or uniqueness theorems that collapse back to prior author work.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 2 invented entities

The central claim rests on the assumption that static analysis produces sufficiently accurate and representative caller-callee pairs for pretraining; no free parameters are explicitly fitted in the abstract, and no new physical or mathematical entities are postulated beyond the new models and benchmark.

axioms (1)
  • domain assumption Static analysis tools can reliably extract caller-callee relationships at scale from diverse real-world repositories without introducing systematic biases or errors that affect model training.
    Invoked in the description of constructing pretraining tasks and benchmarks from real repositories.
invented entities (2)
  • CallerGen no independent evidence
    purpose: Code models pretrained with invocation-aware objectives
    New family of models introduced in the work; no independent evidence provided beyond the reported benchmark results.
  • CallerEval no independent evidence
    purpose: Benchmark featuring realistic calling-context scenarios
    New evaluation benchmark constructed for the paper; no external validation mentioned.

pith-pipeline@v0.9.0 · 5760 in / 1411 out tokens · 28933 ms · 2026-05-20T09:32:13.516772+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

54 extracted references · 54 canonical work pages · 12 internal anchors

  1. [1]

    ChatGPT: Optimizing Language Models for Dialogue

    2022. ChatGPT: Optimizing Language Models for Dialogue. https://openai.com/blog/chatgpt/ Accessed: 2023-01-16

  2. [2]

    Ali Asgari, Milan de Koning, Pouria Derakhshanfar, and Annibale Panichella. 2025. Metamorphic Testing of Deep Code Models: A Systematic Literature Review.arXiv preprint arXiv:2507.22610(2025)

  3. [3]

    Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie Cai, Michael Terry, Quoc Le, and Charles Sutton. 2021. Program Synthesis with Large Language Models. arXiv:2108.07732 [cs.PL] https://arxiv.org/abs/2108.07732

  4. [4]

    Ruqi Bai, Yao Ji, Zeyu Zhou, and David I Inouye. 2025. From Invariant Representations to Invariant Data: Provable Robustness to Spurious Correlations via Noisy Counterfactual Matching.arXiv preprint arXiv:2505.24843(2025)

  5. [5]

    Xiao Bi, Deli Chen, Guanting Chen, Shanhuang Chen, Damai Dai, Chengqi Deng, Honghui Ding, Kai Dong, Qiushi Du, Zhe Fu, et al . 2024. Deepseek llm: Scaling open-source language models with longtermism.arXiv preprint arXiv:2401.02954(2024)

  6. [6]

    Zhangqian Bi, Yao Wan, Zheng Wang, Hongyu Zhang, Batu Guan, Fangxin Lu, Zili Zhang, Yulei Sui, Hai Jin, and Xuanhua Shi. 2024. Iterative refinement of project-level code context for precise code generation with compiler feedback.arXiv preprint arXiv:2403.16792(2024)

  7. [7]

    CallerGen. 2025. CallerGen. (2025). https://anonymous.4open.science/r/callergen

  8. [8]

    Liguo Chen, Qi Guo, Hongrui Jia, Zhengran Zeng, Xin Wang, Yijiang Xu, Jian Wu, Yidong Wang, Qing Gao, Jindong Wang, et al. 2024. A survey on evaluating large language models in code generation tasks.arXiv preprint arXiv:2408.16498 (2024)

  9. [9]

    Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, Alex Ray, Raul Puri, Gretchen Krueger, Michael Petrov, Heidy Khlaaf, Girish Sastry, Pamela Mishkin, Brooke Chan, Scott Gray, Nick Ryder, Mikhail Pavlov, Alethea Power, Lukasz Kaiser, Mohammad Bavarian...

  10. [10]

    Wei Cheng, Yuhan Wu, and Wei Hu. 2024. Dataflow-guided retrieval augmentation for repository-level code completion. arXiv preprint arXiv:2405.19782(2024). , Vol. 1, No. 1, Article . Publication date: May 2026. Contextualized Code Pretraining for Code Generation 31

  11. [11]

    Steven Cho, Stefano Ruberto, and Valerio Terragni. 2025. Metamorphic testing of large language models for natural language processing. In2025 IEEE International Conference on Software Maintenance and Evolution (ICSME). IEEE, 174–186

  12. [12]

    Fenia Christopoulou, Gerasimos Lampouras, Milan Gritta, Guchun Zhang, Yinpeng Guo, Zhongqi Li, Qi Zhang, Meng Xiao, Bo Shen, Lin Li, et al . 2022. Pangu-coder: Program synthesis with function-level language modeling.arXiv preprint arXiv:2207.11280(2022)

  13. [13]

    Nurit Cohen-Inger, Yehonatan Elisha, Bracha Shapira, Lior Rokach, and Seffi Cohen. 2025. Forget What You Know about LLMs Evaluations–LLMs are Like a Chameleon.arXiv preprint arXiv:2502.07445(2025)

  14. [15]

    Yangruibo Ding, Zijian Wang, Wasi Ahmad, Hantian Ding, Ming Tan, Nihal Jain, Murali Krishna Ramanathan, Ramesh Nallapati, Parminder Bhatia, Dan Roth, et al. 2023. Crosscodeeval: A diverse and multilingual benchmark for cross-file code completion.Advances in Neural Information Processing Systems36 (2023), 46701–46723

  15. [16]

    Xueying Du, Mingwei Liu, Kaixin Wang, Hanlin Wang, Junwei Liu, Yixuan Chen, Jiayi Feng, Chaofeng Sha, Xin Peng, and Yiling Lou. 2024. Evaluating large language models in class-level code generation. InProceedings of the IEEE/ACM 46th International Conference on Software Engineering. 1–13

  16. [17]

    Yali Du and Zhongxing Yu. 2023. Pre-training code representation with semantic flow graph for effective bug localization. InProceedings of the 31st ACM joint European software engineering conference and symposium on the foundations of software engineering. 579–591

  17. [18]

    Cuiyun Gao, Xing Hu, Shan Gao, Xin Xia, and Zhi Jin. 2025. The current challenges of software engineering in the era of large language models.ACM Transactions on Software Engineering and Methodology34, 5 (2025), 1–30

  18. [19]

    Daya Guo, Qihao Zhu, Dejian Yang, Zhenda Xie, Kai Dong, Wentao Zhang, Guanting Chen, Xiao Bi, Yu Wu, YK Li, et al. 2024. DeepSeek-Coder: When the Large Language Model Meets Programming–The Rise of Code Intelligence. arXiv preprint arXiv:2401.14196(2024)

  19. [20]

    Xinyi Hou, Yanjie Zhao, Yue Liu, Zhou Yang, Kailong Wang, Li Li, Xiapu Luo, David Lo, John Grundy, and Haoyu Wang. 2024. Large language models for software engineering: A systematic literature review.ACM Transactions on Software Engineering and Methodology33, 8 (2024), 1–79

  20. [22]

    Binyuan Hui, Jian Yang, Zeyu Cui, Jiaxi Yang, Dayiheng Liu, Lei Zhang, Tianyu Liu, Jiajun Zhang, Bowen Yu, Keming Lu, Kai Dang, Yang Fan, Yichang Zhang, An Yang, Rui Men, Fei Huang, Bo Zheng, Yibo Miao, Shanghaoran Quan, Yunlong Feng, Xingzhang Ren, Xuancheng Ren, Jingren Zhou, and Junyang Lin. 2024. Qwen2.5-Coder Technical Report. arXiv:2409.12186 [cs.CL...

  21. [23]

    Juyong Jiang, Fan Wang, Jiasi Shen, Sungju Kim, and Sunghun Kim. 2024. A survey on large language models for code generation.arXiv preprint arXiv:2406.00515(2024)

  22. [24]

    Magne Jørgensen. 2004. Top-down and bottom-up expert estimation of software development effort.Information and Software Technology46, 1 (2004), 3–16

  23. [25]

    Thomas D LaToza, Maryam Arab, Dastyni Loksa, and Amy J Ko. 2020. Explicit programming strategies.Empirical Software Engineering25, 4 (2020), 2416–2449

  24. [26]

    Nam Le Hai, Dung Manh Nguyen, and Nghi DQ Bui. 2024. Repoexec: Evaluate code generation with a repository-level executable benchmark.arXiv e-prints(2024), arXiv–2406

  25. [27]

    Jia Li, Ge Li, Yunfei Zhao, Yongmin Li, Huanyu Liu, Hao Zhu, Lecheng Wang, Kaibo Liu, Zheng Fang, Lanshen Wang, et al. 2024. Deveval: A manually-annotated code generation benchmark aligned with real-world code repositories. In Findings of the Association for Computational Linguistics: ACL 2024. 3603–3614

  26. [28]

    Yujia Li, David Choi, Junyoung Chung, Nate Kushman, Julian Schrittwieser, Rémi Leblond, Tom Eccles, James Keeling, Felix Gimeno, Agustin Dal Lago, et al. 2022. Competition-level code generation with alphacode.Science378, 6624 (2022), 1092–1097

  27. [29]

    Chin-Yew Lin. 2004. Rouge: A package for automatic evaluation of summaries. InText summarization branches out. 74–81

  28. [30]

    Jiawei Liu, Chunqiu Steven Xia, Yuyao Wang, and Lingming Zhang. 2023. Is your code generated by chatgpt really correct? rigorous evaluation of large language models for code generation.Advances in Neural Information Processing Systems36 (2023), 21558–21572. , Vol. 1, No. 1, Article . Publication date: May 2026. 32 Liu et al

  29. [31]

    Tianyang Liu, Canwen Xu, and Julian McAuley. 2023. Repobench: Benchmarking repository-level code auto-completion systems.arXiv preprint arXiv:2306.03091(2023)

  30. [32]

    Wei Liu, Ailun Yu, Daoguang Zan, Bo Shen, Wei Zhang, Haiyan Zhao, Zhi Jin, and Qianxiang Wang. 2024. Graphcoder: Enhancing repository-level code completion via code context graph-based retrieval and language model.arXiv preprint arXiv:2406.07003(2024)

  31. [33]

    Carma L Mcclure. 2012. Top-down, bottom-up, and structured programming.IEEE Transactions on Software Engineering 4 (2012), 397–403

  32. [34]

    Erik Nijkamp, Bo Pang, Hiroaki Hayashi, Lifu Tu, Huan Wang, Yingbo Zhou, Silvio Savarese, and Caiming Xiong. 2022. Codegen: An open large language model for code with multi-turn program synthesis.arXiv preprint arXiv:2203.13474 (2022)

  33. [35]

    Alec Radford, Karthik Narasimhan, Tim Salimans, Ilya Sutskever, et al. 2018. Improving language understanding by generative pre-training. (2018)

  34. [36]

    S Ren, D Guo, S Lu, L Zhou, S Liu, D Tang, N Sundaresan, M Zhou, A Blanco, and S Codebleu Ma. [n. d.]. A method for automatic evaluation of code synthesis. arXiv 2020.arXiv preprint arXiv:2009.10297([n. d.])

  35. [37]

    Tobias Roehm, Rebecca Tiarks, Rainer Koschke, and Walid Maalej. 2012. How do professional developers comprehend software?. In2012 34th International Conference on Software Engineering (ICSE). IEEE, 255–265

  36. [38]

    Vitalis Salis, Thodoris Sotiropoulos, Panos Louridas, Diomidis Spinellis, and Dimitris Mitropoulos. 2021. Pycg: Practical call graph generation in python. In2021 IEEE/ACM 43rd International Conference on Software Engineering (ICSE). IEEE, 1646–1657

  37. [39]

    Chantal Shaib, Vinith M Suriyakumar, Levent Sagun, Byron C Wallace, and Marzyeh Ghassemi. 2025. Learning the wrong lessons: syntactic-domain spurious correlations in language models.arXiv preprint arXiv:2509.21155(2025)

  38. [40]

    Alexey Svyatkovskiy, Shao Kun Deng, Shengyu Fu, and Neel Sundaresan. 2020. Intellicode compose: Code generation using transformer. InProceedings of the 28th ACM joint meeting on European software engineering conference and symposium on the foundations of software engineering. 1433–1443

  39. [41]

    Gemini Team, Petko Georgiev, Ving Ian Lei, Ryan Burnell, Libin Bai, Anmol Gulati, Garrett Tanzer, Damien Vincent, Zhufeng Pan, Shibo Wang, et al. 2024. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context.arXiv preprint arXiv:2403.05530(2024)

  40. [42]

    Qwen Team. 2024. Qwen2.5: A Party of Foundation Models. https://qwenlm.github.io/blog/qwen2.5/

  41. [43]

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need.Advances in neural information processing systems30 (2017)

  42. [44]

    Victor Veitch, Alexander D’Amour, Steve Yadlowsky, and Jacob Eisenstein. 2021. Counterfactual invariance to spurious correlations: Why and how to pass stress tests.arXiv preprint arXiv:2106.00545(2021)

  43. [45]

    Yue Wang, Weishi Wang, Shafiq Joty, and Steven CH Hoi. 2021. CodeT5: Identifier-aware Unified Pre-trained Encoder- Decoder Models for Code Understanding and Generation. InProceedings of the 2021 Conference on Empirical Methods in Natural Language Processing. 8696–8708

  44. [46]

    Yuxiang Wei, Zhe Wang, Jiawei Liu, Yifeng Ding, and Lingming Zhang. 2024. Magicoder: empowering code generation with OSS-INSTRUCT. InProceedings of the 41st International Conference on Machine Learning. 52632–52657

  45. [47]

    W Ye, L Jiang, E Xie, G Zheng, Y Ma, X Cao, D Guo, D Qi, Z He, Y Tian, et al . 2024. The clever Hans mirage: A comprehensive survey on spurious correlations in machine learning.arXiv: 2402.12715(2024)

  46. [48]

    Hao Yu, Bo Shen, Dezhi Ran, Jiaxin Zhang, Qi Zhang, Yuchi Ma, Guangtai Liang, Ying Li, Qianxiang Wang, and Tao Xie. 2024. Codereval: A benchmark of pragmatic code generation with generative pre-trained models. InProceedings of the 46th IEEE/ACM International Conference on Software Engineering. 1–12

  47. [49]

    Fengji Zhang, Bei Chen, Yue Zhang, Jacky Keung, Jin Liu, Daoguang Zan, Yi Mao, Jian-Guang Lou, and Weizhu Chen. 2023. Repocoder: Repository-level code completion through iterative retrieval and generation.arXiv preprint arXiv:2303.12570(2023)

  48. [50]

    Kechi Zhang, Jia Li, Ge Li, Xianjie Shi, and Zhi Jin. 2024. Codeagent: Enhancing code generation with tool-integrated agent systems for real-world repo-level coding challenges.arXiv preprint arXiv:2401.07339(2024)

  49. [51]

    Quanjun Zhang, Chunrong Fang, Yang Xie, Yaxin Zhang, Yun Yang, Weisong Sun, Shengcheng Yu, and Zhenyu Chen

  50. [52]

    A survey on large language models for software engineering.arXiv preprint arXiv:2312.15223(2023)

  51. [53]

    Ziyao Zhang, Chong Wang, Yanlin Wang, Ensheng Shi, Yuchi Ma, Wanjun Zhong, Jiachi Chen, Mingzhi Mao, and Zibin Zheng. 2025. Llm hallucinations in practical code generation: Phenomena, mechanism, and mitigation.Proceedings of the ACM on Software Engineering2, ISSTA (2025), 481–503

  52. [54]

    Yaowei Zheng, Richong Zhang, Junhao Zhang, Yanhan Ye, Zheyan Luo, Zhangchi Feng, and Yongqiang Ma. 2024. LlamaFactory: Unified Efficient Fine-Tuning of 100+ Language Models. InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations). Association for Computational Linguistics, Bangkok, Thaila...

  53. [55]

    Qihao Zhu, Daya Guo, Zhihong Shao, Dejian Yang, Peiyi Wang, Runxin Xu, Y Wu, Yukun Li, Huazuo Gao, Shirong Ma, et al. 2024. Deepseek-coder-v2: Breaking the barrier of closed-source models in code intelligence.arXiv preprint arXiv:2406.11931(2024)

  54. [56]

    Terry Yue Zhuo, Junda He, Jiamou Sun, Zhenchang Xing, David Lo, John Grundy, and Xiaoning Du. 2026. Identifying and Mitigating API Misuse in Large Language Models.IEEE Transactions on Software Engineering(2026). , Vol. 1, No. 1, Article . Publication date: May 2026