Contextualized Code Pretraining for Code Generation

Chen Liu; Hanwen Zhang; Lu Zhang; Qingyuan Liang; Yakun Zhang; Zeyu Sun

arxiv: 2605.17957 · v1 · pith:ZA2FURZWnew · submitted 2026-05-18 · 💻 cs.SE

Pith reviewed 2026-05-20 09:32 UTC · model grok-4.3

classification 💻 cs.SE

keywords: code generation · pretraining · calling context · static analysis · caller-callee pairs · software engineering · repository-level evaluation

The pith

Pretraining code models on calling context improves their ability to generate functions that fit into real projects.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Modern code generation models are trained and evaluated mostly on code paired with natural-language descriptions, yet real development happens inside existing codebases where the call site already reveals expected behavior. The paper extracts caller-callee pairs from real repositories using static analysis and uses them to pretrain models with an invocation-aware objective that conditions generation on surrounding usage context. The resulting CallerGen models, at 220M and 0.5B parameters, reach 16.58% and 22.81% pass@1 on CallerEval, a new benchmark of realistic call-site scenarios, beating same-scale baselines and staying competitive with much larger models.

Core claim

Contextualized code pretraining integrates calling context into both training and evaluation by automatically mining large-scale caller-callee pairs from repositories. Models trained this way learn to implement a callee function given its actual usage site, producing code that integrates more smoothly with surrounding repository code than models trained only on isolated functions or natural-language prompts.

What carries the argument

Invocation-aware pretraining that conditions generation on extracted caller-callee pairs from static analysis of real code.
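
To make the idea of a caller-callee pair concrete, here is a minimal Python sketch built on the standard `ast` module. It is not the paper's extraction pipeline, which this page does not describe in detail. The function name `extract_caller_callee_pairs` and the sample module are hypothetical, and the sketch only resolves direct by-name calls to top-level functions defined in the same module.

```python
import ast


def extract_caller_callee_pairs(source: str) -> list[tuple[str, str]]:
    """Return (caller, callee) name pairs for top-level functions in `source`.

    Simplifying assumption: only direct calls by bare name to functions
    defined at module level are resolved. Methods, imports, aliases, and
    dynamic dispatch are ignored.
    """
    tree = ast.parse(source)
    defined = {n.name for n in tree.body if isinstance(n, ast.FunctionDef)}
    pairs: list[tuple[str, str]] = []
    for func in tree.body:
        if not isinstance(func, ast.FunctionDef):
            continue
        for node in ast.walk(func):
            # Keep only calls of the form `name(...)`.
            if isinstance(node, ast.Call) and isinstance(node.func, ast.Name):
                callee = node.func.id
                pair = (func.name, callee)
                if callee in defined and callee != func.name and pair not in pairs:
                    pairs.append(pair)
    return pairs


# Hypothetical module: `register` is the caller, `normalize` the callee.
SAMPLE = '''
def normalize(fd):
    return int(fd)

def register(fd, handler):
    key = normalize(fd)
    return key, handler
'''

pairs = extract_caller_callee_pairs(SAMPLE)
```

In an invocation-aware setup, the body of `register` would serve as the context from which a model is asked to implement `normalize`. A production pipeline would need a real call-graph analysis to handle cross-file and attribute calls.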

If this is right

Generated functions integrate more reliably into existing codebases rather than requiring post-hoc fixes.
Evaluation shifts from isolated function correctness to repository-level compatibility.
Static analysis offers a scalable source of training signals that does not rely on natural-language documentation.
Smaller models can close the gap with larger ones when context is explicitly supplied.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same extraction technique could supply context for other tasks such as bug localization or refactoring.
IDE code-completion tools might improve by retrieving similar call sites instead of relying solely on language-model priors.
If call context proves as useful as shown, future benchmarks should require models to produce code that passes integration tests against real callers.

Load-bearing premise

Static analysis can automatically extract large-scale, high-quality caller-callee pairs from real repositories without significant errors or selection bias.

What would settle it

A manual audit of thousands of extracted pairs that finds frequent mismatches between the documented call site and the actual callee implementation, or a drop in performance when the same models are tested on hand-curated context-aware tasks.
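
A manual audit of this kind yields a count of correct pairs out of a sample, and the precision estimate is more useful with an interval around it. The sketch below computes a Wilson score interval. The function name and the example counts are hypothetical and are not results from the paper.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion.

    `successes` is the number of audited pairs judged correct, `n` the
    audit sample size, and `z` the normal quantile (1.96 for roughly 95%).
    """
    if n <= 0 or not 0 <= successes <= n:
        raise ValueError("need n > 0 and 0 <= successes <= n")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


# Illustrative numbers only: 270 of 300 audited pairs judged correct.
low, high = wilson_interval(270, 300)
```

The Wilson interval is preferred here over the plain normal approximation because it stays inside [0, 1] and behaves sensibly when precision is close to 100%.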

Figures

Figures reproduced from arXiv: 2605.17957 by Chen Liu, Hanwen Zhang, Lu Zhang, Qingyuan Liang, Yakun Zhang, Zeyu Sun.

**Figure 2.** Comparison of HumanEval (a) and CoderEval (b). Adjacent text in the paper: "… surface dependency-relevant snippets beyond surface similarity. DraCo [10] leverages extended dataflow analysis to guide retrieval augmentation so that the retrieved context better aligns with required program dependencies. CoCoGen [6] uses compiler feedback and static analysis to diagnose missing project-specific information and iteratively improve the retri…"

**Figure 3.** Overall training workflow of CallerGen. Adjacent text in the paper: "… process to obtain the necessary caller-driven information from large-scale Python repositories. CallerGen is trained using a unified objective that conditions generation on the calling context. We apply this training on two types of model architectures: an encoder-decoder model (CodeT5) and a decoder-only model (Qwen2.5-Coder), allowing CallerGen to generalize across…"

**Figure 4.** (a) A sample of CallerEval. (b) The end-to-end evaluation process of CallerEval. Adjacent text in the paper (Section 3.3, Model Architecture): "CallerGen adopts invocation-aware training on two different model architectures to demonstrate the generality of our approach. For the encoder-decoder architecture CodeT5, we pretrain on both the 60M version (CodeT5-small) and the 220M version (CodeT5-base) by incorporating calling contexts into the trai…"

**Figure 5.** Case 1: a function handles file-descriptor normalization in an event-driven I/O framework.

**Figure 6.** Case 2: a function that converts CSS-style strings into sequences of attribute–value tuples.

Original abstract

As code generation becomes increasingly central to improving software development efficiency, modern code models are largely trained and evaluated on code with natural-language descriptions. In real projects, developers often implement missing functions under limited project-specific artifacts, while the local call-site context is already available in the surrounding code. This usage context provides actionable cues about expected behavior, but existing models are not explicitly optimized to leverage it reliably, leading to implementations that may not integrate smoothly with surrounding usage in repository settings. In this work, we propose contextualized code pretraining, an invocation-aware framework that integrates calling context into both the training and evaluation of code models. Using static analysis, we automatically extract large-scale caller-callee pairs from real repositories to construct pretraining tasks and benchmarks that condition generation on the calling context. We train CallerGen, the first code models pretrained with invocation-aware objectives spanning multiple sizes, and evaluate them on CallerEval, a new benchmark featuring realistic scenarios. Experiments show that CallerGen outperforms comparable-scale models and remains competitive with larger ones across two benchmarks. Our 220M and 0.5B models achieve 16.58% and 22.81@% pass1, surpassing baselines on CallerEval. These results highlight the importance of calling context in realistic code generation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper shows that invocation-aware pretraining on mined caller-callee pairs lifts smaller models on a new benchmark, but the static analysis extraction step lacks reported validation.

The main thing to know is that pretraining with calling context extracted from real repositories produces clear gains on a new benchmark for code generation, though the reliability of that extraction is not yet demonstrated in detail.

The authors shift away from natural-language docstrings toward conditioning on the actual call site in surrounding code. Static analysis pulls caller-callee pairs at scale from repositories, which then serve as both pretraining signals and the basis for CallerEval. Their 220M and 0.5B CallerGen models reach 16.58% and 22.81% pass@1 on that benchmark, beating the baselines they test and staying competitive with larger models on a second one as well. This setup matches how developers often work inside existing codebases, where the local usage context is already present and more informative than a standalone description. Building the data directly from project repositories rather than synthetic or isolated snippets is a practical choice that gives the work a grounded feel.

The soft spot sits in the data construction. Static analysis routinely struggles with dynamic calls, virtual methods, incomplete modules, or library-heavy code, any of which can inject noise or selection bias into the pairs. The abstract states the performance numbers without error rates, manual validation samples, or checks for train-test leakage from the same extraction process. If those pairs contain more artifacts than genuine behavioral signals, the reported improvements could shrink or disappear under tighter controls.

This work is aimed at people building or evaluating code models for repository-level tasks in software engineering. A reader already thinking about context utilization or new benchmarks would find the empirical comparisons and the CallerEval construction worth examining. It has enough of a distinct objective and dataset to merit peer review rather than a desk reject, even if the methods section will need expansion on extraction accuracy and experimental robustness. I would send it out and specifically ask the authors to add validation of the mined pairs and clearer reporting on data splits and variance.

Referee Report

3 major / 2 minor

Summary. The paper proposes contextualized code pretraining, an invocation-aware framework that uses static analysis to automatically extract large-scale caller-callee pairs from real repositories. These pairs are used to construct pretraining objectives and a new benchmark (CallerEval) that conditions code generation on calling context. The authors train CallerGen models (220M and 0.5B parameters) and report that they achieve 16.58% and 22.81% pass@1 on CallerEval, outperforming comparable-scale baselines while remaining competitive with larger models across two benchmarks.

Significance. If the extraction pipeline proves reliable, the work would demonstrate that explicitly modeling invocation context can improve code generation in realistic repository settings beyond natural-language-only training. The scale of automatically mined pairs and the introduction of CallerEval as a usage-context benchmark are potentially valuable contributions to the field of code models.

major comments (3)

[Abstract and experimental results] Abstract and experimental results section: the central performance claims (16.58% and 22.81% pass@1 on CallerEval, outperforming baselines) rest on unreported details including data splits, error bars, statistical significance, controls for post-hoc choices in static analysis, and validation that the extracted caller-callee pairs are free of substantial noise or selection bias.
[Method (pair extraction)] Method section on pair extraction: the claim that static analysis yields large-scale, high-quality caller-callee pairs providing reliable behavioral signals is load-bearing for both pretraining and CallerEval validity, yet no accuracy metrics, manual validation results, or error-rate analysis for unresolved dynamic calls, virtual methods, or incomplete modules are provided.
[Benchmark construction] Benchmark construction: potential train-test leakage arising from using the same static-analysis pipeline for both pretraining data and CallerEval construction is not addressed or quantified, which directly affects whether the reported gains reflect genuine context utilization.
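
The statistical significance requested in the first major comment can be checked with a paired test, because both models are scored on the same benchmark tasks. The sketch below implements an exact two-sided McNemar test over per-task pass/fail outcomes. The function name and the toy outcomes are hypothetical and do not come from the paper.

```python
from math import comb


def mcnemar_exact(model_a: list[bool], model_b: list[bool]) -> float:
    """Exact two-sided McNemar p-value for paired pass/fail outcomes.

    Only discordant tasks matter: those solved by exactly one of the two
    models. Under the null hypothesis each discordant task is equally
    likely to favor either model, so the count follows Binomial(n, 0.5).
    """
    if len(model_a) != len(model_b):
        raise ValueError("outcome lists must cover the same tasks")
    a_only = sum(a and not b for a, b in zip(model_a, model_b))
    b_only = sum(b and not a for a, b in zip(model_a, model_b))
    n = a_only + b_only
    if n == 0:
        return 1.0  # the models never disagree
    tail = sum(comb(n, i) for i in range(min(a_only, b_only) + 1)) / 2**n
    return min(1.0, 2 * tail)


# Toy outcomes: model A solves 8 tasks that model B fails; both fail 2.
p_value = mcnemar_exact([True] * 8 + [False] * 2, [False] * 10)
```

A paired test is chosen because comparing two aggregate pass@1 scores ignores the fact that the models are evaluated on identical tasks, which discards most of the available signal.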

minor comments (2)

[Abstract] Abstract contains a typographical error: '22.81@% pass1' should read '22.81% pass@1'.
[Throughout] Notation for pass@1 is inconsistent between the abstract and later text; standardize to 'pass@1' throughout.
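
For reference, pass@k is commonly computed with the unbiased estimator popularized by the HumanEval benchmark: with n samples per task of which c pass, pass@k = 1 - C(n-c, k) / C(n, k), averaged over tasks. The sketch below assumes this standard definition. This page does not state how many samples per task the paper drew, so the inputs are illustrative.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k for one task: n samples drawn, c of them correct."""
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        return 1.0  # every size-k subset contains at least one correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average pass@k over tasks, given (n, c) per task."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


# Illustrative inputs: three tasks with 10 samples each.
score = benchmark_pass_at_k([(10, 3), (10, 0), (10, 10)], k=1)
```

For k = 1 the estimator reduces to c / n, so a benchmark-level pass@1 is simply the mean fraction of correct samples per task.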

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their thorough review and valuable suggestions. We address each of the major comments in detail below and commit to making the necessary revisions to improve the clarity and rigor of the paper.

Referee: [Abstract and experimental results] Abstract and experimental results section: the central performance claims (16.58% and 22.81% pass@1 on CallerEval, outperforming baselines) rest on unreported details including data splits, error bars, statistical significance, controls for post-hoc choices in static analysis, and validation that the extracted caller-callee pairs are free of substantial noise or selection bias.

Authors: We agree with the referee that the experimental claims require more supporting details for reproducibility and credibility. In the revised version of the manuscript, we will expand the experimental results section to include: a clear description of the train/validation/test splits for both pretraining and CallerEval; error bars and standard deviations from at least three independent runs with different random seeds; statistical significance testing (e.g., McNemar's test or bootstrap methods) for the performance improvements; an analysis of sensitivity to post-hoc choices in the static analysis (such as call graph construction parameters); and results from a manual validation of a random sample of 300 extracted caller-callee pairs, reporting precision and any observed biases. These changes will directly address the concerns about unreported details. revision: yes
Referee: [Method (pair extraction)] Method section on pair extraction: the claim that static analysis yields large-scale, high-quality caller-callee pairs providing reliable behavioral signals is load-bearing for both pretraining and CallerEval validity, yet no accuracy metrics, manual validation results, or error-rate analysis for unresolved dynamic calls, virtual methods, or incomplete modules are provided.

Authors: The referee correctly identifies that the quality of the extracted pairs is foundational to our claims. The original manuscript focuses on the scale and automation of the extraction but does not provide quantitative validation. We will revise the Method section to include accuracy metrics obtained through manual annotation of a stratified sample of pairs (covering different programming languages and project sizes if applicable), error rates specifically for unresolved dynamic dispatches, virtual method calls, and cases with incomplete type information or missing modules. We will also describe any filtering steps applied to mitigate noise. This addition will substantiate the reliability of the behavioral signals used. revision: yes
Referee: [Benchmark construction] Benchmark construction: potential train-test leakage arising from using the same static-analysis pipeline for both pretraining data and CallerEval construction is not addressed or quantified, which directly affects whether the reported gains reflect genuine context utilization.

Authors: We appreciate the referee highlighting the risk of train-test leakage, which could inflate the perceived benefits of contextualized pretraining. We will add a dedicated analysis in the Benchmark construction subsection quantifying the overlap: specifically, the fraction of CallerEval instances whose caller-callee pairs or surrounding contexts appear in the pretraining data. To mitigate this, we will either use repository-level disjoint splits or apply aggressive deduplication based on code similarity. The revised manuscript will report the measured leakage rate and, if necessary, present re-evaluated results on a leakage-free version of CallerEval. This will help confirm that the gains stem from better context utilization rather than data contamination. revision: yes
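
The two mitigations named in this response, a measured overlap rate and repository-level disjoint splits, can be sketched as follows. This is an assumption-laden illustration and not the authors' procedure: all function names are hypothetical, and the overlap check only catches exact matches after light normalization, whereas similarity-based deduplication would also need to catch near-duplicates.

```python
import hashlib
import re


def normalize_code(code: str) -> str:
    """Drop `#` comments and collapse whitespace.

    Caveat: the comment regex is naive and would also strip a `#` that
    appears inside a string literal.
    """
    code = re.sub(r"#.*", "", code)
    return re.sub(r"\s+", " ", code).strip()


def fingerprint(code: str) -> str:
    return hashlib.sha256(normalize_code(code).encode("utf-8")).hexdigest()


def leakage_rate(pretrain: list[str], benchmark: list[str]) -> float:
    """Fraction of benchmark snippets whose fingerprint occurs in pretraining data."""
    if not benchmark:
        return 0.0
    seen = {fingerprint(s) for s in pretrain}
    return sum(fingerprint(s) in seen for s in benchmark) / len(benchmark)


def split_by_repo(repo: str, test_fraction: float = 0.1) -> str:
    """Assign a whole repository to one split, deterministically by name hash."""
    bucket = int(hashlib.sha256(repo.encode("utf-8")).hexdigest(), 16) % 1000
    return "test" if bucket < int(test_fraction * 1000) else "train"


# Toy data: the first benchmark snippet differs from pretraining only by a comment.
pretrain = ["def f(x):\n    return x + 1  # add one\n"]
benchmark = ["def f(x):\n    return x + 1\n", "def g(x):\n    return x * 2\n"]
rate = leakage_rate(pretrain, benchmark)
```

Hashing the repository name rather than the individual file keeps every caller and callee from one project on the same side of the split, which is the property a repository-level disjoint split is meant to guarantee.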

Circularity Check

0 steps flagged

No significant circularity in derivation chain

The paper's claims center on an empirical pipeline: static analysis extracts caller-callee pairs from repositories to build pretraining objectives for CallerGen and the CallerEval benchmark, with reported pass@1 scores (16.58% for 220M, 22.81% for 0.5B) outperforming baselines. No equations, self-definitional reductions, fitted parameters renamed as predictions, or load-bearing self-citations appear in the provided text. The performance results are presented as outcomes of training and evaluation rather than forced by construction from the extraction method itself. The approach follows standard pretraining and benchmarking without ansatzes or uniqueness theorems that collapse back to prior author work.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 2 invented entities

The central claim rests on the assumption that static analysis produces sufficiently accurate and representative caller-callee pairs for pretraining; no free parameters are explicitly fitted in the abstract, and no new physical or mathematical entities are postulated beyond the new models and benchmark.

axioms (1)

domain assumption: Static analysis tools can reliably extract caller-callee relationships at scale from diverse real-world repositories without introducing systematic biases or errors that affect model training.
Invoked in the description of constructing pretraining tasks and benchmarks from real repositories.

invented entities (2)

CallerGen (no independent evidence)
purpose: Code models pretrained with invocation-aware objectives
New family of models introduced in the work; no independent evidence provided beyond the reported benchmark results.

CallerEval (no independent evidence)
purpose: Benchmark featuring realistic calling-context scenarios
New evaluation benchmark constructed for the paper; no external validation mentioned.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean reality_from_one_distinction unclear

Using static analysis, we automatically extract large-scale caller-callee pairs from real repositories to construct pretraining tasks and benchmarks that condition generation on the calling context.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

54 extracted references · 54 canonical work pages · 12 internal anchors

[1] 2022. ChatGPT: Optimizing Language Models for Dialogue. https://openai.com/blog/chatgpt/ Accessed: 2023-01-16

[2] Ali Asgari, Milan de Koning, Pouria Derakhshanfar, and Annibale Panichella. 2025. Metamorphic Testing of Deep Code Models: A Systematic Literature Review. arXiv preprint arXiv:2507.22610 (2025)

[3] Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie Cai, Michael Terry, Quoc Le, and Charles Sutton. 2021. Program Synthesis with Large Language Models. arXiv:2108.07732 [cs.PL] https://arxiv.org/abs/2108.07732

[4] Ruqi Bai, Yao Ji, Zeyu Zhou, and David I Inouye. 2025. From Invariant Representations to Invariant Data: Provable Robustness to Spurious Correlations via Noisy Counterfactual Matching. arXiv preprint arXiv:2505.24843 (2025)

[5] Xiao Bi, Deli Chen, Guanting Chen, Shanhuang Chen, Damai Dai, Chengqi Deng, Honghui Ding, Kai Dong, Qiushi Du, Zhe Fu, et al. 2024. Deepseek llm: Scaling open-source language models with longtermism. arXiv preprint arXiv:2401.02954 (2024)

[6] Zhangqian Bi, Yao Wan, Zheng Wang, Hongyu Zhang, Batu Guan, Fangxin Lu, Zili Zhang, Yulei Sui, Hai Jin, and Xuanhua Shi. 2024. Iterative refinement of project-level code context for precise code generation with compiler feedback. arXiv preprint arXiv:2403.16792 (2024)

[7] CallerGen. 2025. CallerGen. (2025). https://anonymous.4open.science/r/callergen

[8] Liguo Chen, Qi Guo, Hongrui Jia, Zhengran Zeng, Xin Wang, Yijiang Xu, Jian Wu, Yidong Wang, Qing Gao, Jindong Wang, et al. 2024. A survey on evaluating large language models in code generation tasks. arXiv preprint arXiv:2408.16498 (2024)

[9] Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, Alex Ray, Raul Puri, Gretchen Krueger, Michael Petrov, Heidy Khlaaf, Girish Sastry, Pamela Mishkin, Brooke Chan, Scott Gray, Nick Ryder, Mikhail Pavlov, Alethea Power, Lukasz Kaiser, Mohammad Bavarian...

[10] Wei Cheng, Yuhan Wu, and Wei Hu. 2024. Dataflow-guided retrieval augmentation for repository-level code completion. arXiv preprint arXiv:2405.19782 (2024)

[11] Steven Cho, Stefano Ruberto, and Valerio Terragni. 2025. Metamorphic testing of large language models for natural language processing. In 2025 IEEE International Conference on Software Maintenance and Evolution (ICSME). IEEE, 174–186

[12] Fenia Christopoulou, Gerasimos Lampouras, Milan Gritta, Guchun Zhang, Yinpeng Guo, Zhongqi Li, Qi Zhang, Meng Xiao, Bo Shen, Lin Li, et al. 2022. Pangu-coder: Program synthesis with function-level language modeling. arXiv preprint arXiv:2207.11280 (2022)

[13] Nurit Cohen-Inger, Yehonatan Elisha, Bracha Shapira, Lior Rokach, and Seffi Cohen. 2025. Forget What You Know about LLMs Evaluations–LLMs are Like a Chameleon. arXiv preprint arXiv:2502.07445 (2025)

[15] Yangruibo Ding, Zijian Wang, Wasi Ahmad, Hantian Ding, Ming Tan, Nihal Jain, Murali Krishna Ramanathan, Ramesh Nallapati, Parminder Bhatia, Dan Roth, et al. 2023. Crosscodeeval: A diverse and multilingual benchmark for cross-file code completion. Advances in Neural Information Processing Systems 36 (2023), 46701–46723

[16] Xueying Du, Mingwei Liu, Kaixin Wang, Hanlin Wang, Junwei Liu, Yixuan Chen, Jiayi Feng, Chaofeng Sha, Xin Peng, and Yiling Lou. 2024. Evaluating large language models in class-level code generation. In Proceedings of the IEEE/ACM 46th International Conference on Software Engineering. 1–13

[17] Yali Du and Zhongxing Yu. 2023. Pre-training code representation with semantic flow graph for effective bug localization. In Proceedings of the 31st ACM joint European software engineering conference and symposium on the foundations of software engineering. 579–591

[18] Cuiyun Gao, Xing Hu, Shan Gao, Xin Xia, and Zhi Jin. 2025. The current challenges of software engineering in the era of large language models. ACM Transactions on Software Engineering and Methodology 34, 5 (2025), 1–30

[19] Daya Guo, Qihao Zhu, Dejian Yang, Zhenda Xie, Kai Dong, Wentao Zhang, Guanting Chen, Xiao Bi, Yu Wu, YK Li, et al. 2024. DeepSeek-Coder: When the Large Language Model Meets Programming–The Rise of Code Intelligence. arXiv preprint arXiv:2401.14196 (2024)

[20] Xinyi Hou, Yanjie Zhao, Yue Liu, Zhou Yang, Kailong Wang, Li Li, Xiapu Luo, David Lo, John Grundy, and Haoyu Wang. 2024. Large language models for software engineering: A systematic literature review. ACM Transactions on Software Engineering and Methodology 33, 8 (2024), 1–79

[22] Binyuan Hui, Jian Yang, Zeyu Cui, Jiaxi Yang, Dayiheng Liu, Lei Zhang, Tianyu Liu, Jiajun Zhang, Bowen Yu, Keming Lu, Kai Dang, Yang Fan, Yichang Zhang, An Yang, Rui Men, Fei Huang, Bo Zheng, Yibo Miao, Shanghaoran Quan, Yunlong Feng, Xingzhang Ren, Xuancheng Ren, Jingren Zhou, and Junyang Lin. 2024. Qwen2.5-Coder Technical Report. arXiv:2409.12186 [cs.CL...

[23] Juyong Jiang, Fan Wang, Jiasi Shen, Sungju Kim, and Sunghun Kim. 2024. A survey on large language models for code generation. arXiv preprint arXiv:2406.00515 (2024)

[24] Magne Jørgensen. 2004. Top-down and bottom-up expert estimation of software development effort. Information and Software Technology 46, 1 (2004), 3–16

[25] Thomas D LaToza, Maryam Arab, Dastyni Loksa, and Amy J Ko. 2020. Explicit programming strategies. Empirical Software Engineering 25, 4 (2020), 2416–2449

[26] Nam Le Hai, Dung Manh Nguyen, and Nghi DQ Bui. 2024. Repoexec: Evaluate code generation with a repository-level executable benchmark. arXiv e-prints (2024), arXiv–2406

[27] Jia Li, Ge Li, Yunfei Zhao, Yongmin Li, Huanyu Liu, Hao Zhu, Lecheng Wang, Kaibo Liu, Zheng Fang, Lanshen Wang, et al. 2024. Deveval: A manually-annotated code generation benchmark aligned with real-world code repositories. In Findings of the Association for Computational Linguistics: ACL 2024. 3603–3614

[28] Yujia Li, David Choi, Junyoung Chung, Nate Kushman, Julian Schrittwieser, Rémi Leblond, Tom Eccles, James Keeling, Felix Gimeno, Agustin Dal Lago, et al. 2022. Competition-level code generation with alphacode. Science 378, 6624 (2022), 1092–1097

[29] Chin-Yew Lin. 2004. Rouge: A package for automatic evaluation of summaries. In Text summarization branches out. 74–81

[30] Jiawei Liu, Chunqiu Steven Xia, Yuyao Wang, and Lingming Zhang. 2023. Is your code generated by chatgpt really correct? rigorous evaluation of large language models for code generation. Advances in Neural Information Processing Systems 36 (2023), 21558–21572

[31] Tianyang Liu, Canwen Xu, and Julian McAuley. 2023. Repobench: Benchmarking repository-level code auto-completion systems. arXiv preprint arXiv:2306.03091 (2023)

[32] Wei Liu, Ailun Yu, Daoguang Zan, Bo Shen, Wei Zhang, Haiyan Zhao, Zhi Jin, and Qianxiang Wang. 2024. Graphcoder: Enhancing repository-level code completion via code context graph-based retrieval and language model. arXiv preprint arXiv:2406.07003 (2024)

[33] Carma L Mcclure. 2012. Top-down, bottom-up, and structured programming. IEEE Transactions on Software Engineering 4 (2012), 397–403

[34] Erik Nijkamp, Bo Pang, Hiroaki Hayashi, Lifu Tu, Huan Wang, Yingbo Zhou, Silvio Savarese, and Caiming Xiong. 2022. Codegen: An open large language model for code with multi-turn program synthesis. arXiv preprint arXiv:2203.13474 (2022)

[35] Alec Radford, Karthik Narasimhan, Tim Salimans, Ilya Sutskever, et al. 2018. Improving language understanding by generative pre-training. (2018)

[36] S Ren, D Guo, S Lu, L Zhou, S Liu, D Tang, N Sundaresan, M Zhou, A Blanco, and S Ma. 2020. CodeBLEU: A method for automatic evaluation of code synthesis. arXiv preprint arXiv:2009.10297 (2020)

[37] Tobias Roehm, Rebecca Tiarks, Rainer Koschke, and Walid Maalej. 2012. How do professional developers comprehend software?. In 2012 34th International Conference on Software Engineering (ICSE). IEEE, 255–265

[38] Vitalis Salis, Thodoris Sotiropoulos, Panos Louridas, Diomidis Spinellis, and Dimitris Mitropoulos. 2021. Pycg: Practical call graph generation in python. In 2021 IEEE/ACM 43rd International Conference on Software Engineering (ICSE). IEEE, 1646–1657

[39] Chantal Shaib, Vinith M Suriyakumar, Levent Sagun, Byron C Wallace, and Marzyeh Ghassemi. 2025. Learning the wrong lessons: syntactic-domain spurious correlations in language models. arXiv preprint arXiv:2509.21155 (2025)

[40] Alexey Svyatkovskiy, Shao Kun Deng, Shengyu Fu, and Neel Sundaresan. 2020. Intellicode compose: Code generation using transformer. In Proceedings of the 28th ACM joint meeting on European software engineering conference and symposium on the foundations of software engineering. 1433–1443

[41] Gemini Team, Petko Georgiev, Ving Ian Lei, Ryan Burnell, Libin Bai, Anmol Gulati, Garrett Tanzer, Damien Vincent, Zhufeng Pan, Shibo Wang, et al. 2024. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context. arXiv preprint arXiv:2403.05530 (2024)

[42] Qwen Team. 2024. Qwen2.5: A Party of Foundation Models. https://qwenlm.github.io/blog/qwen2.5/

[43] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. Advances in neural information processing systems 30 (2017)

[44] Victor Veitch, Alexander D’Amour, Steve Yadlowsky, and Jacob Eisenstein. 2021. Counterfactual invariance to spurious correlations: Why and how to pass stress tests. arXiv preprint arXiv:2106.00545 (2021)

[45] Yue Wang, Weishi Wang, Shafiq Joty, and Steven CH Hoi. 2021. CodeT5: Identifier-aware Unified Pre-trained Encoder-Decoder Models for Code Understanding and Generation. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing. 8696–8708

[46] Yuxiang Wei, Zhe Wang, Jiawei Liu, Yifeng Ding, and Lingming Zhang. 2024. Magicoder: empowering code generation with OSS-INSTRUCT. In Proceedings of the 41st International Conference on Machine Learning. 52632–52657

[47] W Ye, L Jiang, E Xie, G Zheng, Y Ma, X Cao, D Guo, D Qi, Z He, Y Tian, et al. 2024. The clever Hans mirage: A comprehensive survey on spurious correlations in machine learning. arXiv:2402.12715 (2024)

[48] Hao Yu, Bo Shen, Dezhi Ran, Jiaxin Zhang, Qi Zhang, Yuchi Ma, Guangtai Liang, Ying Li, Qianxiang Wang, and Tao Xie. 2024. Codereval: A benchmark of pragmatic code generation with generative pre-trained models. In Proceedings of the 46th IEEE/ACM International Conference on Software Engineering. 1–12

[49] Fengji Zhang, Bei Chen, Yue Zhang, Jacky Keung, Jin Liu, Daoguang Zan, Yi Mao, Jian-Guang Lou, and Weizhu Chen. 2023. Repocoder: Repository-level code completion through iterative retrieval and generation. arXiv preprint arXiv:2303.12570 (2023)

[50] Kechi Zhang, Jia Li, Ge Li, Xianjie Shi, and Zhi Jin. 2024. Codeagent: Enhancing code generation with tool-integrated agent systems for real-world repo-level coding challenges. arXiv preprint arXiv:2401.07339 (2024)

[51] Quanjun Zhang, Chunrong Fang, Yang Xie, Yaxin Zhang, Yun Yang, Weisong Sun, Shengcheng Yu, and Zhenyu Chen

[52] A survey on large language models for software engineering. arXiv preprint arXiv:2312.15223 (2023)

[53] Ziyao Zhang, Chong Wang, Yanlin Wang, Ensheng Shi, Yuchi Ma, Wanjun Zhong, Jiachi Chen, Mingzhi Mao, and Zibin Zheng. 2025. Llm hallucinations in practical code generation: Phenomena, mechanism, and mitigation. Proceedings of the ACM on Software Engineering 2, ISSTA (2025), 481–503

[54] Yaowei Zheng, Richong Zhang, Junhao Zhang, Yanhan Ye, Zheyan Luo, Zhangchi Feng, and Yongqiang Ma. 2024. LlamaFactory: Unified Efficient Fine-Tuning of 100+ Language Models. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations). Association for Computational Linguistics, Bangkok, Thaila...

[55] Qihao Zhu, Daya Guo, Zhihong Shao, Dejian Yang, Peiyi Wang, Runxin Xu, Y Wu, Yukun Li, Huazuo Gao, Shirong Ma, et al. 2024. Deepseek-coder-v2: Breaking the barrier of closed-source models in code intelligence. arXiv preprint arXiv:2406.11931 (2024)

[56] Terry Yue Zhuo, Junda He, Jiamou Sun, Zhenchang Xing, David Lo, John Grundy, and Xiaoning Du. 2026. Identifying and Mitigating API Misuse in Large Language Models. IEEE Transactions on Software Engineering (2026)

[1] 2022. ChatGPT: Optimizing Language Models for Dialogue. https://openai.com/blog/chatgpt/ Accessed: 2023-01-16

[2] Ali Asgari, Milan de Koning, Pouria Derakhshanfar, and Annibale Panichella. 2025. Metamorphic Testing of Deep Code Models: A Systematic Literature Review. arXiv preprint arXiv:2507.22610 (2025)

[3] Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie Cai, Michael Terry, Quoc Le, and Charles Sutton. 2021. Program Synthesis with Large Language Models. arXiv:2108.07732 [cs.PL] https://arxiv.org/abs/2108.07732

[4] Ruqi Bai, Yao Ji, Zeyu Zhou, and David I Inouye. 2025. From Invariant Representations to Invariant Data: Provable Robustness to Spurious Correlations via Noisy Counterfactual Matching. arXiv preprint arXiv:2505.24843 (2025)

[5] Xiao Bi, Deli Chen, Guanting Chen, Shanhuang Chen, Damai Dai, Chengqi Deng, Honghui Ding, Kai Dong, Qiushi Du, Zhe Fu, et al. 2024. DeepSeek LLM: Scaling open-source language models with longtermism. arXiv preprint arXiv:2401.02954 (2024)

[6] Zhangqian Bi, Yao Wan, Zheng Wang, Hongyu Zhang, Batu Guan, Fangxin Lu, Zili Zhang, Yulei Sui, Hai Jin, and Xuanhua Shi. 2024. Iterative refinement of project-level code context for precise code generation with compiler feedback. arXiv preprint arXiv:2403.16792 (2024)

[7] CallerGen. 2025. CallerGen. (2025). https://anonymous.4open.science/r/callergen

[8] Liguo Chen, Qi Guo, Hongrui Jia, Zhengran Zeng, Xin Wang, Yijiang Xu, Jian Wu, Yidong Wang, Qing Gao, Jindong Wang, et al. 2024. A survey on evaluating large language models in code generation tasks. arXiv preprint arXiv:2408.16498 (2024)

[9] Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, Alex Ray, Raul Puri, Gretchen Krueger, Michael Petrov, Heidy Khlaaf, Girish Sastry, Pamela Mishkin, Brooke Chan, Scott Gray, Nick Ryder, Mikhail Pavlov, Alethea Power, Lukasz Kaiser, Mohammad Bavarian...

[10] Wei Cheng, Yuhan Wu, and Wei Hu. 2024. Dataflow-guided retrieval augmentation for repository-level code completion. arXiv preprint arXiv:2405.19782 (2024)

[11] Steven Cho, Stefano Ruberto, and Valerio Terragni. 2025. Metamorphic testing of large language models for natural language processing. In 2025 IEEE International Conference on Software Maintenance and Evolution (ICSME). IEEE, 174–186

[12] Fenia Christopoulou, Gerasimos Lampouras, Milan Gritta, Guchun Zhang, Yinpeng Guo, Zhongqi Li, Qi Zhang, Meng Xiao, Bo Shen, Lin Li, et al. 2022. PanGu-Coder: Program synthesis with function-level language modeling. arXiv preprint arXiv:2207.11280 (2022)

[13] Nurit Cohen-Inger, Yehonatan Elisha, Bracha Shapira, Lior Rokach, and Seffi Cohen. 2025. Forget What You Know about LLMs Evaluations–LLMs are Like a Chameleon. arXiv preprint arXiv:2502.07445 (2025)

[15] Yangruibo Ding, Zijian Wang, Wasi Ahmad, Hantian Ding, Ming Tan, Nihal Jain, Murali Krishna Ramanathan, Ramesh Nallapati, Parminder Bhatia, Dan Roth, et al. 2023. CrossCodeEval: A diverse and multilingual benchmark for cross-file code completion. Advances in Neural Information Processing Systems 36 (2023), 46701–46723

[16] Xueying Du, Mingwei Liu, Kaixin Wang, Hanlin Wang, Junwei Liu, Yixuan Chen, Jiayi Feng, Chaofeng Sha, Xin Peng, and Yiling Lou. 2024. Evaluating large language models in class-level code generation. In Proceedings of the IEEE/ACM 46th International Conference on Software Engineering. 1–13

[17] Yali Du and Zhongxing Yu. 2023. Pre-training code representation with semantic flow graph for effective bug localization. In Proceedings of the 31st ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering. 579–591

[18] Cuiyun Gao, Xing Hu, Shan Gao, Xin Xia, and Zhi Jin. 2025. The current challenges of software engineering in the era of large language models. ACM Transactions on Software Engineering and Methodology 34, 5 (2025), 1–30

[19] Daya Guo, Qihao Zhu, Dejian Yang, Zhenda Xie, Kai Dong, Wentao Zhang, Guanting Chen, Xiao Bi, Yu Wu, YK Li, et al. 2024. DeepSeek-Coder: When the Large Language Model Meets Programming–The Rise of Code Intelligence. arXiv preprint arXiv:2401.14196 (2024)

[20] Xinyi Hou, Yanjie Zhao, Yue Liu, Zhou Yang, Kailong Wang, Li Li, Xiapu Luo, David Lo, John Grundy, and Haoyu Wang. 2024. Large language models for software engineering: A systematic literature review. ACM Transactions on Software Engineering and Methodology 33, 8 (2024), 1–79

[22] Binyuan Hui, Jian Yang, Zeyu Cui, Jiaxi Yang, Dayiheng Liu, Lei Zhang, Tianyu Liu, Jiajun Zhang, Bowen Yu, Keming Lu, Kai Dang, Yang Fan, Yichang Zhang, An Yang, Rui Men, Fei Huang, Bo Zheng, Yibo Miao, Shanghaoran Quan, Yunlong Feng, Xingzhang Ren, Xuancheng Ren, Jingren Zhou, and Junyang Lin. 2024. Qwen2.5-Coder Technical Report. arXiv:2409.12186 [cs.CL...

[23] Juyong Jiang, Fan Wang, Jiasi Shen, Sungju Kim, and Sunghun Kim. 2024. A survey on large language models for code generation. arXiv preprint arXiv:2406.00515 (2024)

[24] Magne Jørgensen. 2004. Top-down and bottom-up expert estimation of software development effort. Information and Software Technology 46, 1 (2004), 3–16

[25] Thomas D LaToza, Maryam Arab, Dastyni Loksa, and Amy J Ko. 2020. Explicit programming strategies. Empirical Software Engineering 25, 4 (2020), 2416–2449

[26] Nam Le Hai, Dung Manh Nguyen, and Nghi DQ Bui. 2024. RepoExec: Evaluate code generation with a repository-level executable benchmark. arXiv e-prints (2024), arXiv–2406

[27] Jia Li, Ge Li, Yunfei Zhao, Yongmin Li, Huanyu Liu, Hao Zhu, Lecheng Wang, Kaibo Liu, Zheng Fang, Lanshen Wang, et al. 2024. DevEval: A manually-annotated code generation benchmark aligned with real-world code repositories. In Findings of the Association for Computational Linguistics: ACL 2024. 3603–3614

[28] Yujia Li, David Choi, Junyoung Chung, Nate Kushman, Julian Schrittwieser, Rémi Leblond, Tom Eccles, James Keeling, Felix Gimeno, Agustin Dal Lago, et al. 2022. Competition-level code generation with AlphaCode. Science 378, 6624 (2022), 1092–1097

[29] Chin-Yew Lin. 2004. ROUGE: A package for automatic evaluation of summaries. In Text summarization branches out. 74–81

[30] Jiawei Liu, Chunqiu Steven Xia, Yuyao Wang, and Lingming Zhang. 2023. Is your code generated by ChatGPT really correct? Rigorous evaluation of large language models for code generation. Advances in Neural Information Processing Systems 36 (2023), 21558–21572

[31] Tianyang Liu, Canwen Xu, and Julian McAuley. 2023. RepoBench: Benchmarking repository-level code auto-completion systems. arXiv preprint arXiv:2306.03091 (2023)

[32] Wei Liu, Ailun Yu, Daoguang Zan, Bo Shen, Wei Zhang, Haiyan Zhao, Zhi Jin, and Qianxiang Wang. 2024. GraphCoder: Enhancing repository-level code completion via code context graph-based retrieval and language model. arXiv preprint arXiv:2406.07003 (2024)

[33] Carma L McClure. 2012. Top-down, bottom-up, and structured programming. IEEE Transactions on Software Engineering 4 (2012), 397–403

[34] Erik Nijkamp, Bo Pang, Hiroaki Hayashi, Lifu Tu, Huan Wang, Yingbo Zhou, Silvio Savarese, and Caiming Xiong. 2022. CodeGen: An open large language model for code with multi-turn program synthesis. arXiv preprint arXiv:2203.13474 (2022)

[35] Alec Radford, Karthik Narasimhan, Tim Salimans, Ilya Sutskever, et al. 2018. Improving language understanding by generative pre-training. (2018)

[36] S Ren, D Guo, S Lu, L Zhou, S Liu, D Tang, N Sundaresan, M Zhou, A Blanco, and S Ma. 2020. CodeBLEU: A method for automatic evaluation of code synthesis. arXiv preprint arXiv:2009.10297 (2020)

[37] Tobias Roehm, Rebecca Tiarks, Rainer Koschke, and Walid Maalej. 2012. How do professional developers comprehend software?. In 2012 34th International Conference on Software Engineering (ICSE). IEEE, 255–265

[38] Vitalis Salis, Thodoris Sotiropoulos, Panos Louridas, Diomidis Spinellis, and Dimitris Mitropoulos. 2021. PyCG: Practical call graph generation in Python. In 2021 IEEE/ACM 43rd International Conference on Software Engineering (ICSE). IEEE, 1646–1657

[39] Chantal Shaib, Vinith M Suriyakumar, Levent Sagun, Byron C Wallace, and Marzyeh Ghassemi. 2025. Learning the wrong lessons: syntactic-domain spurious correlations in language models. arXiv preprint arXiv:2509.21155 (2025)

[40] Alexey Svyatkovskiy, Shao Kun Deng, Shengyu Fu, and Neel Sundaresan. 2020. IntelliCode Compose: Code generation using transformer. In Proceedings of the 28th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering. 1433–1443

[41] Gemini Team, Petko Georgiev, Ving Ian Lei, Ryan Burnell, Libin Bai, Anmol Gulati, Garrett Tanzer, Damien Vincent, Zhufeng Pan, Shibo Wang, et al. 2024. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context. arXiv preprint arXiv:2403.05530 (2024)

[42] Qwen Team. 2024. Qwen2.5: A Party of Foundation Models. https://qwenlm.github.io/blog/qwen2.5/

[43] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. Advances in Neural Information Processing Systems 30 (2017)

[44] Victor Veitch, Alexander D’Amour, Steve Yadlowsky, and Jacob Eisenstein. 2021. Counterfactual invariance to spurious correlations: Why and how to pass stress tests. arXiv preprint arXiv:2106.00545 (2021)

[45] Yue Wang, Weishi Wang, Shafiq Joty, and Steven CH Hoi. 2021. CodeT5: Identifier-aware Unified Pre-trained Encoder-Decoder Models for Code Understanding and Generation. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing. 8696–8708

[46] Yuxiang Wei, Zhe Wang, Jiawei Liu, Yifeng Ding, and Lingming Zhang. 2024. Magicoder: Empowering code generation with OSS-INSTRUCT. In Proceedings of the 41st International Conference on Machine Learning. 52632–52657

[47] W Ye, L Jiang, E Xie, G Zheng, Y Ma, X Cao, D Guo, D Qi, Z He, Y Tian, et al. 2024. The clever Hans mirage: A comprehensive survey on spurious correlations in machine learning. arXiv:2402.12715 (2024)

[48] Hao Yu, Bo Shen, Dezhi Ran, Jiaxin Zhang, Qi Zhang, Yuchi Ma, Guangtai Liang, Ying Li, Qianxiang Wang, and Tao Xie. 2024. CoderEval: A benchmark of pragmatic code generation with generative pre-trained models. In Proceedings of the 46th IEEE/ACM International Conference on Software Engineering. 1–12

[49] Fengji Zhang, Bei Chen, Yue Zhang, Jacky Keung, Jin Liu, Daoguang Zan, Yi Mao, Jian-Guang Lou, and Weizhu Chen. 2023. RepoCoder: Repository-level code completion through iterative retrieval and generation. arXiv preprint arXiv:2303.12570 (2023)

[50] Kechi Zhang, Jia Li, Ge Li, Xianjie Shi, and Zhi Jin. 2024. CodeAgent: Enhancing code generation with tool-integrated agent systems for real-world repo-level coding challenges. arXiv preprint arXiv:2401.07339 (2024)

[51] Quanjun Zhang, Chunrong Fang, Yang Xie, Yaxin Zhang, Yun Yang, Weisong Sun, Shengcheng Yu, and Zhenyu Chen. 2023. A survey on large language models for software engineering. arXiv preprint arXiv:2312.15223 (2023)

[53] Ziyao Zhang, Chong Wang, Yanlin Wang, Ensheng Shi, Yuchi Ma, Wanjun Zhong, Jiachi Chen, Mingzhi Mao, and Zibin Zheng. 2025. LLM hallucinations in practical code generation: Phenomena, mechanism, and mitigation. Proceedings of the ACM on Software Engineering 2, ISSTA (2025), 481–503

[54] Yaowei Zheng, Richong Zhang, Junhao Zhang, Yanhan Ye, Zheyan Luo, Zhangchi Feng, and Yongqiang Ma. 2024. LlamaFactory: Unified Efficient Fine-Tuning of 100+ Language Models. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations). Association for Computational Linguistics, Bangkok, Thaila...

[55] Qihao Zhu, Daya Guo, Zhihong Shao, Dejian Yang, Peiyi Wang, Runxin Xu, Y Wu, Yukun Li, Huazuo Gao, Shirong Ma, et al. 2024. DeepSeek-Coder-V2: Breaking the barrier of closed-source models in code intelligence. arXiv preprint arXiv:2406.11931 (2024)

[56] Terry Yue Zhuo, Junda He, Jiamou Sun, Zhenchang Xing, David Lo, John Grundy, and Xiaoning Du. 2026. Identifying and Mitigating API Misuse in Large Language Models. IEEE Transactions on Software Engineering (2026)