Contextualized Code Pretraining for Code Generation
Pith reviewed 2026-05-20 09:32 UTC · model grok-4.3
The pith
Pretraining code models on calling context improves their ability to generate functions that fit into real projects.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Contextualized code pretraining integrates calling context into both training and evaluation by automatically mining large-scale caller-callee pairs from repositories. Models trained this way learn to implement a callee function given its actual usage site, producing code that integrates more smoothly with surrounding repository code than models trained only on isolated functions or natural-language prompts.
What carries the argument
Invocation-aware pretraining that conditions generation on extracted caller-callee pairs from static analysis of real code.
If this is right
- Generated functions integrate more reliably into existing codebases rather than requiring post-hoc fixes.
- Evaluation shifts from isolated function correctness to repository-level compatibility.
- Static analysis offers a scalable source of training signals that does not rely on natural-language documentation.
- Smaller models can close the gap with larger ones when context is explicitly supplied.
Where Pith is reading between the lines
- The same extraction technique could supply context for other tasks such as bug localization or refactoring.
- IDE code-completion tools might improve by retrieving similar call sites instead of relying solely on language-model priors.
- If call context proves as useful as shown, future benchmarks should require models to produce code that passes integration tests against real callers.
Load-bearing premise
Static analysis can automatically extract large-scale, high-quality caller-callee pairs from real repositories without significant errors or selection bias.
What would settle it
A manual audit of thousands of extracted pairs that finds frequent mismatches between the documented call site and the actual callee implementation, or a drop in performance when the same models are tested on hand-curated context-aware tasks.
Figures
read the original abstract
As code generation becomes increasingly central to improving software development efficiency, modern code models are largely trained and evaluated on code with natural-language descriptions. In real projects, developers often implement missing functions under limited project-specific artifacts, while the local call-site context is already available in the surrounding code. This usage context provides actionable cues about expected behavior, but existing models are not explicitly optimized to leverage it reliably, leading to implementations that may not integrate smoothly with surrounding usage in repository settings. In this work, we propose contextualized code pretraining, an invocation-aware framework that integrates calling context into both the training and evaluation of code models. Using static analysis, we automatically extract large-scale caller-callee pairs from real repositories to construct pretraining tasks and benchmarks that condition generation on the calling context. We train CallerGen, the first code models pretrained with invocation-aware objectives spanning multiple sizes, and evaluate them on CallerEval, a new benchmark featuring realistic scenarios. Experiments show that CallerGen outperforms comparable-scale models and remains competitive with larger ones across two benchmarks. Our 220M and 0.5B models achieve 16.58% and 22.81@% pass1, surpassing baselines on CallerEval. These results highlight the importance of calling context in realistic code generation.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes contextualized code pretraining, an invocation-aware framework that uses static analysis to automatically extract large-scale caller-callee pairs from real repositories. These pairs are used to construct pretraining objectives and a new benchmark (CallerEval) that conditions code generation on calling context. The authors train CallerGen models (220M and 0.5B parameters) and report that they achieve 16.58% and 22.81% pass@1 on CallerEval, outperforming comparable-scale baselines while remaining competitive with larger models across two benchmarks.
Significance. If the extraction pipeline proves reliable, the work would demonstrate that explicitly modeling invocation context can improve code generation in realistic repository settings beyond natural-language-only training. The scale of automatically mined pairs and the introduction of CallerEval as a usage-context benchmark are potentially valuable contributions to the field of code models.
major comments (3)
- [Abstract and experimental results] Abstract and experimental results section: the central performance claims (16.58% and 22.81% pass@1 on CallerEval, outperforming baselines) rest on unreported details including data splits, error bars, statistical significance, controls for post-hoc choices in static analysis, and validation that the extracted caller-callee pairs are free of substantial noise or selection bias.
- [Method (pair extraction)] Method section on pair extraction: the claim that static analysis yields large-scale, high-quality caller-callee pairs providing reliable behavioral signals is load-bearing for both pretraining and CallerEval validity, yet no accuracy metrics, manual validation results, or error-rate analysis for unresolved dynamic calls, virtual methods, or incomplete modules are provided.
- [Benchmark construction] Benchmark construction: potential train-test leakage arising from using the same static-analysis pipeline for both pretraining data and CallerEval construction is not addressed or quantified, which directly affects whether the reported gains reflect genuine context utilization.
minor comments (2)
- [Abstract] Abstract contains a typographical error: '22.81@% pass1' should read '22.81% pass@1'.
- [Throughout] Notation for pass@1 is inconsistent between the abstract and later text; standardize to 'pass@1' throughout.
Simulated Author's Rebuttal
We thank the referee for their thorough review and valuable suggestions. We address each of the major comments in detail below and commit to making the necessary revisions to improve the clarity and rigor of the paper.
read point-by-point responses
-
Referee: [Abstract and experimental results] Abstract and experimental results section: the central performance claims (16.58% and 22.81% pass@1 on CallerEval, outperforming baselines) rest on unreported details including data splits, error bars, statistical significance, controls for post-hoc choices in static analysis, and validation that the extracted caller-callee pairs are free of substantial noise or selection bias.
Authors: We agree with the referee that the experimental claims require more supporting details for reproducibility and credibility. In the revised version of the manuscript, we will expand the experimental results section to include: a clear description of the train/validation/test splits for both pretraining and CallerEval; error bars and standard deviations from at least three independent runs with different random seeds; statistical significance testing (e.g., McNemar's test or bootstrap methods) for the performance improvements; an analysis of sensitivity to post-hoc choices in the static analysis (such as call graph construction parameters); and results from a manual validation of a random sample of 300 extracted caller-callee pairs, reporting precision and any observed biases. These changes will directly address the concerns about unreported details. revision: yes
-
Referee: [Method (pair extraction)] Method section on pair extraction: the claim that static analysis yields large-scale, high-quality caller-callee pairs providing reliable behavioral signals is load-bearing for both pretraining and CallerEval validity, yet no accuracy metrics, manual validation results, or error-rate analysis for unresolved dynamic calls, virtual methods, or incomplete modules are provided.
Authors: The referee correctly identifies that the quality of the extracted pairs is foundational to our claims. The original manuscript focuses on the scale and automation of the extraction but does not provide quantitative validation. We will revise the Method section to include accuracy metrics obtained through manual annotation of a stratified sample of pairs (covering different programming languages and project sizes if applicable), error rates specifically for unresolved dynamic dispatches, virtual method calls, and cases with incomplete type information or missing modules. We will also describe any filtering steps applied to mitigate noise. This addition will substantiate the reliability of the behavioral signals used. revision: yes
-
Referee: [Benchmark construction] Benchmark construction: potential train-test leakage arising from using the same static-analysis pipeline for both pretraining data and CallerEval construction is not addressed or quantified, which directly affects whether the reported gains reflect genuine context utilization.
Authors: We appreciate the referee highlighting the risk of train-test leakage, which could inflate the perceived benefits of contextualized pretraining. We will add a dedicated analysis in the Benchmark construction subsection quantifying the overlap: specifically, the fraction of CallerEval instances whose caller-callee pairs or surrounding contexts appear in the pretraining data. To mitigate this, we will either use repository-level disjoint splits or apply aggressive deduplication based on code similarity. The revised manuscript will report the measured leakage rate and, if necessary, present re-evaluated results on a leakage-free version of CallerEval. This will help confirm that the gains stem from better context utilization rather than data contamination. revision: yes
Circularity Check
No significant circularity in derivation chain
full rationale
The paper's claims center on an empirical pipeline: static analysis extracts caller-callee pairs from repositories to build pretraining objectives for CallerGen and the CallerEval benchmark, with reported pass@1 scores (16.58% for 220M, 22.81% for 0.5B) outperforming baselines. No equations, self-definitional reductions, fitted parameters renamed as predictions, or load-bearing self-citations appear in the provided text. The performance results are presented as outcomes of training and evaluation rather than forced by construction from the extraction method itself. The approach follows standard pretraining and benchmarking without ansatzes or uniqueness theorems that collapse back to prior author work.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Static analysis tools can reliably extract caller-callee relationships at scale from diverse real-world repositories without introducing systematic biases or errors that affect model training.
invented entities (2)
-
CallerGen
no independent evidence
-
CallerEval
no independent evidence
Lean theorems connected to this paper
-
IndisputableMonolith/Foundation/AbsoluteFloorClosure.leanreality_from_one_distinction unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
Using static analysis, we automatically extract large-scale caller-callee pairs from real repositories to construct pretraining tasks and benchmarks that condition generation on the calling context.
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
ChatGPT: Optimizing Language Models for Dialogue
2022. ChatGPT: Optimizing Language Models for Dialogue. https://openai.com/blog/chatgpt/ Accessed: 2023-01-16
work page 2022
- [2]
-
[3]
Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie Cai, Michael Terry, Quoc Le, and Charles Sutton. 2021. Program Synthesis with Large Language Models. arXiv:2108.07732 [cs.PL] https://arxiv.org/abs/2108.07732
work page internal anchor Pith review Pith/arXiv arXiv 2021
- [4]
-
[5]
Xiao Bi, Deli Chen, Guanting Chen, Shanhuang Chen, Damai Dai, Chengqi Deng, Honghui Ding, Kai Dong, Qiushi Du, Zhe Fu, et al . 2024. Deepseek llm: Scaling open-source language models with longtermism.arXiv preprint arXiv:2401.02954(2024)
work page internal anchor Pith review Pith/arXiv arXiv 2024
- [6]
-
[7]
CallerGen. 2025. CallerGen. (2025). https://anonymous.4open.science/r/callergen
work page 2025
- [8]
-
[9]
Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, Alex Ray, Raul Puri, Gretchen Krueger, Michael Petrov, Heidy Khlaaf, Girish Sastry, Pamela Mishkin, Brooke Chan, Scott Gray, Nick Ryder, Mikhail Pavlov, Alethea Power, Lukasz Kaiser, Mohammad Bavarian...
work page internal anchor Pith review Pith/arXiv arXiv 2021
- [10]
-
[11]
Steven Cho, Stefano Ruberto, and Valerio Terragni. 2025. Metamorphic testing of large language models for natural language processing. In2025 IEEE International Conference on Software Maintenance and Evolution (ICSME). IEEE, 174–186
work page 2025
- [12]
- [13]
-
[15]
Yangruibo Ding, Zijian Wang, Wasi Ahmad, Hantian Ding, Ming Tan, Nihal Jain, Murali Krishna Ramanathan, Ramesh Nallapati, Parminder Bhatia, Dan Roth, et al. 2023. Crosscodeeval: A diverse and multilingual benchmark for cross-file code completion.Advances in Neural Information Processing Systems36 (2023), 46701–46723
work page 2023
-
[16]
Xueying Du, Mingwei Liu, Kaixin Wang, Hanlin Wang, Junwei Liu, Yixuan Chen, Jiayi Feng, Chaofeng Sha, Xin Peng, and Yiling Lou. 2024. Evaluating large language models in class-level code generation. InProceedings of the IEEE/ACM 46th International Conference on Software Engineering. 1–13
work page 2024
-
[17]
Yali Du and Zhongxing Yu. 2023. Pre-training code representation with semantic flow graph for effective bug localization. InProceedings of the 31st ACM joint European software engineering conference and symposium on the foundations of software engineering. 579–591
work page 2023
-
[18]
Cuiyun Gao, Xing Hu, Shan Gao, Xin Xia, and Zhi Jin. 2025. The current challenges of software engineering in the era of large language models.ACM Transactions on Software Engineering and Methodology34, 5 (2025), 1–30
work page 2025
-
[19]
Daya Guo, Qihao Zhu, Dejian Yang, Zhenda Xie, Kai Dong, Wentao Zhang, Guanting Chen, Xiao Bi, Yu Wu, YK Li, et al. 2024. DeepSeek-Coder: When the Large Language Model Meets Programming–The Rise of Code Intelligence. arXiv preprint arXiv:2401.14196(2024)
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[20]
Xinyi Hou, Yanjie Zhao, Yue Liu, Zhou Yang, Kailong Wang, Li Li, Xiapu Luo, David Lo, John Grundy, and Haoyu Wang. 2024. Large language models for software engineering: A systematic literature review.ACM Transactions on Software Engineering and Methodology33, 8 (2024), 1–79
work page 2024
-
[22]
Binyuan Hui, Jian Yang, Zeyu Cui, Jiaxi Yang, Dayiheng Liu, Lei Zhang, Tianyu Liu, Jiajun Zhang, Bowen Yu, Keming Lu, Kai Dang, Yang Fan, Yichang Zhang, An Yang, Rui Men, Fei Huang, Bo Zheng, Yibo Miao, Shanghaoran Quan, Yunlong Feng, Xingzhang Ren, Xuancheng Ren, Jingren Zhou, and Junyang Lin. 2024. Qwen2.5-Coder Technical Report. arXiv:2409.12186 [cs.CL...
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[23]
Juyong Jiang, Fan Wang, Jiasi Shen, Sungju Kim, and Sunghun Kim. 2024. A survey on large language models for code generation.arXiv preprint arXiv:2406.00515(2024)
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[24]
Magne Jørgensen. 2004. Top-down and bottom-up expert estimation of software development effort.Information and Software Technology46, 1 (2004), 3–16
work page 2004
-
[25]
Thomas D LaToza, Maryam Arab, Dastyni Loksa, and Amy J Ko. 2020. Explicit programming strategies.Empirical Software Engineering25, 4 (2020), 2416–2449
work page 2020
-
[26]
Nam Le Hai, Dung Manh Nguyen, and Nghi DQ Bui. 2024. Repoexec: Evaluate code generation with a repository-level executable benchmark.arXiv e-prints(2024), arXiv–2406
work page 2024
-
[27]
Jia Li, Ge Li, Yunfei Zhao, Yongmin Li, Huanyu Liu, Hao Zhu, Lecheng Wang, Kaibo Liu, Zheng Fang, Lanshen Wang, et al. 2024. Deveval: A manually-annotated code generation benchmark aligned with real-world code repositories. In Findings of the Association for Computational Linguistics: ACL 2024. 3603–3614
work page 2024
-
[28]
Yujia Li, David Choi, Junyoung Chung, Nate Kushman, Julian Schrittwieser, Rémi Leblond, Tom Eccles, James Keeling, Felix Gimeno, Agustin Dal Lago, et al. 2022. Competition-level code generation with alphacode.Science378, 6624 (2022), 1092–1097
work page 2022
-
[29]
Chin-Yew Lin. 2004. Rouge: A package for automatic evaluation of summaries. InText summarization branches out. 74–81
work page 2004
-
[30]
Jiawei Liu, Chunqiu Steven Xia, Yuyao Wang, and Lingming Zhang. 2023. Is your code generated by chatgpt really correct? rigorous evaluation of large language models for code generation.Advances in Neural Information Processing Systems36 (2023), 21558–21572. , Vol. 1, No. 1, Article . Publication date: May 2026. 32 Liu et al
work page 2023
-
[31]
Tianyang Liu, Canwen Xu, and Julian McAuley. 2023. Repobench: Benchmarking repository-level code auto-completion systems.arXiv preprint arXiv:2306.03091(2023)
work page internal anchor Pith review Pith/arXiv arXiv 2023
- [32]
-
[33]
Carma L Mcclure. 2012. Top-down, bottom-up, and structured programming.IEEE Transactions on Software Engineering 4 (2012), 397–403
work page 2012
-
[34]
Erik Nijkamp, Bo Pang, Hiroaki Hayashi, Lifu Tu, Huan Wang, Yingbo Zhou, Silvio Savarese, and Caiming Xiong. 2022. Codegen: An open large language model for code with multi-turn program synthesis.arXiv preprint arXiv:2203.13474 (2022)
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[35]
Alec Radford, Karthik Narasimhan, Tim Salimans, Ilya Sutskever, et al. 2018. Improving language understanding by generative pre-training. (2018)
work page 2018
-
[36]
S Ren, D Guo, S Lu, L Zhou, S Liu, D Tang, N Sundaresan, M Zhou, A Blanco, and S Codebleu Ma. [n. d.]. A method for automatic evaluation of code synthesis. arXiv 2020.arXiv preprint arXiv:2009.10297([n. d.])
work page internal anchor Pith review Pith/arXiv arXiv 2020
-
[37]
Tobias Roehm, Rebecca Tiarks, Rainer Koschke, and Walid Maalej. 2012. How do professional developers comprehend software?. In2012 34th International Conference on Software Engineering (ICSE). IEEE, 255–265
work page 2012
-
[38]
Vitalis Salis, Thodoris Sotiropoulos, Panos Louridas, Diomidis Spinellis, and Dimitris Mitropoulos. 2021. Pycg: Practical call graph generation in python. In2021 IEEE/ACM 43rd International Conference on Software Engineering (ICSE). IEEE, 1646–1657
work page 2021
- [39]
-
[40]
Alexey Svyatkovskiy, Shao Kun Deng, Shengyu Fu, and Neel Sundaresan. 2020. Intellicode compose: Code generation using transformer. InProceedings of the 28th ACM joint meeting on European software engineering conference and symposium on the foundations of software engineering. 1433–1443
work page 2020
-
[41]
Gemini Team, Petko Georgiev, Ving Ian Lei, Ryan Burnell, Libin Bai, Anmol Gulati, Garrett Tanzer, Damien Vincent, Zhufeng Pan, Shibo Wang, et al. 2024. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context.arXiv preprint arXiv:2403.05530(2024)
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[42]
Qwen Team. 2024. Qwen2.5: A Party of Foundation Models. https://qwenlm.github.io/blog/qwen2.5/
work page 2024
-
[43]
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need.Advances in neural information processing systems30 (2017)
work page 2017
- [44]
-
[45]
Yue Wang, Weishi Wang, Shafiq Joty, and Steven CH Hoi. 2021. CodeT5: Identifier-aware Unified Pre-trained Encoder- Decoder Models for Code Understanding and Generation. InProceedings of the 2021 Conference on Empirical Methods in Natural Language Processing. 8696–8708
work page 2021
-
[46]
Yuxiang Wei, Zhe Wang, Jiawei Liu, Yifeng Ding, and Lingming Zhang. 2024. Magicoder: empowering code generation with OSS-INSTRUCT. InProceedings of the 41st International Conference on Machine Learning. 52632–52657
work page 2024
- [47]
-
[48]
Hao Yu, Bo Shen, Dezhi Ran, Jiaxin Zhang, Qi Zhang, Yuchi Ma, Guangtai Liang, Ying Li, Qianxiang Wang, and Tao Xie. 2024. Codereval: A benchmark of pragmatic code generation with generative pre-trained models. InProceedings of the 46th IEEE/ACM International Conference on Software Engineering. 1–12
work page 2024
- [49]
- [50]
-
[51]
Quanjun Zhang, Chunrong Fang, Yang Xie, Yaxin Zhang, Yun Yang, Weisong Sun, Shengcheng Yu, and Zhenyu Chen
- [52]
-
[53]
Ziyao Zhang, Chong Wang, Yanlin Wang, Ensheng Shi, Yuchi Ma, Wanjun Zhong, Jiachi Chen, Mingzhi Mao, and Zibin Zheng. 2025. Llm hallucinations in practical code generation: Phenomena, mechanism, and mitigation.Proceedings of the ACM on Software Engineering2, ISSTA (2025), 481–503
work page 2025
-
[54]
Yaowei Zheng, Richong Zhang, Junhao Zhang, Yanhan Ye, Zheyan Luo, Zhangchi Feng, and Yongqiang Ma. 2024. LlamaFactory: Unified Efficient Fine-Tuning of 100+ Language Models. InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations). Association for Computational Linguistics, Bangkok, Thaila...
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[55]
Qihao Zhu, Daya Guo, Zhihong Shao, Dejian Yang, Peiyi Wang, Runxin Xu, Y Wu, Yukun Li, Huazuo Gao, Shirong Ma, et al. 2024. Deepseek-coder-v2: Breaking the barrier of closed-source models in code intelligence.arXiv preprint arXiv:2406.11931(2024)
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[56]
Terry Yue Zhuo, Junda He, Jiamou Sun, Zhenchang Xing, David Lo, John Grundy, and Xiaoning Du. 2026. Identifying and Mitigating API Misuse in Large Language Models.IEEE Transactions on Software Engineering(2026). , Vol. 1, No. 1, Article . Publication date: May 2026
work page 2026
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.