When to Answer and When to Defer: A Decision Framework for Reliable Code Predictions

Ravishka Rathnasuriya; Wei Yang

arxiv: 2605.19369 · v1 · pith:3LCMPTRKnew · submitted 2026-05-19 · 💻 cs.SE

Pith reviewed 2026-05-20 04:49 UTC · model grok-4.3

classification 💻 cs.SE

keywords: code language models · uncertainty estimation · model calibration · selective prediction · abstention · program analysis · reliable deployment

The pith

A unified framework lets code models assign reliable correctness probabilities, abstain when unsure, and call on program analysis for deferred cases.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Code language models often give overconfident wrong answers or underconfident right ones, which makes them hard to trust in practice. The paper builds a single workflow that combines uncertainty estimation, calibration, and external tool calls so the model can judge its own likelihood of being correct and hand off uncertain outputs to lightweight program analysis. Existing calibration methods from other domains reduce probability errors but do not improve how well predictions can be ranked by their actual chance of being right, which selective prediction requires. Treating uncertainty as a trigger for action rather than a passive flag allows controlled coverage where the model only answers when it can be trusted. The result is a deployment pattern that works for both code classification and code generation while keeping risk manageable.

Core claim

The paper claims that a unified framework integrating uncertainty estimation, model calibration, and tool-based abstention handling enables code models to assign reliable correctness probabilities, abstain under uncertainty, and invoke lightweight program analysis procedures to process abstained cases, supporting risk-aware, coverage-controlled use across both classification and generation settings.

What carries the argument

The unified decision framework that turns uncertainty estimates into actionable signals by routing uncertain predictions to lightweight program analysis tools.
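
The routing idea can be illustrated with a minimal Python sketch. This is not the authors' implementation, and the paper as summarized here does not specify its procedures: the `Decision` record, the `route` function, the threshold, and `toy_analyzer` are hypothetical names introduced only for illustration, and the analyzer is a placeholder for whatever lightweight program analysis a deployment would plug in.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Decision:
    action: str                        # "answer" or "defer"
    prediction: str                    # the model's label or generated code
    confidence: float                  # estimated probability of being correct
    tool_report: Optional[str] = None  # filled in only for deferred cases


def route(prediction: str,
          confidence: float,
          threshold: float,
          analyzer: Callable[[str], str]) -> Decision:
    """Answer when confidence clears the threshold; otherwise defer to the analyzer."""
    if confidence >= threshold:
        return Decision("answer", prediction, confidence)
    # Uncertainty acts as a trigger: the abstained output goes to an external check.
    return Decision("defer", prediction, confidence, tool_report=analyzer(prediction))


def toy_analyzer(code: str) -> str:
    # Hypothetical stand-in for a lightweight program analysis step.
    return "empty output" if not code.strip() else "needs review"
```

Raising `threshold` lowers coverage (fewer answers, more deferrals), which is the knob the summary refers to as coverage control.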

If this is right

Predictions can be ranked by estimated correctness so users can select only the most reliable ones.
Abstained cases receive automated validation or repair from program analysis instead of being discarded.
The same workflow applies to both predicting properties of code and generating new code.
Overall system accuracy can be tuned by adjusting how many cases the model is allowed to answer.
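
A small sketch of what tuning accuracy through coverage could look like, assuming each prediction comes with a scalar confidence score and a known correctness label on a held-out set. The function names and the default coverage levels are illustrative assumptions, not taken from the paper.

```python
from typing import List, Sequence, Tuple


def accuracy_at_coverage(confidences: Sequence[float],
                         correct: Sequence[bool],
                         coverage: float) -> float:
    """Accuracy over the most-confident fraction `coverage` of predictions."""
    if not 0.0 < coverage <= 1.0:
        raise ValueError("coverage must be in (0, 1]")
    # Rank predictions from most to least confident.
    order = sorted(range(len(confidences)),
                   key=lambda i: confidences[i], reverse=True)
    k = max(1, round(coverage * len(order)))  # number of predictions answered
    answered = order[:k]
    return sum(correct[i] for i in answered) / k


def coverage_curve(confidences: Sequence[float],
                   correct: Sequence[bool],
                   levels: Sequence[float] = (0.25, 0.5, 0.75, 1.0)
                   ) -> List[Tuple[float, float]]:
    """Pairs of (coverage, accuracy on the answered subset)."""
    return [(c, accuracy_at_coverage(confidences, correct, c)) for c in levels]
```

The curve is only as useful as the ranking behind it: if confidence does not separate correct from incorrect predictions, accuracy stays flat as coverage shrinks.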

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same uncertainty-to-tool pattern could be tested in other domains that already have lightweight verifiers, such as mathematical proofs or formal specifications.
Analysis results returned from deferred cases could be logged to improve future uncertainty estimates without retraining the base model.

Load-bearing premise

That post-hoc calibration reduces probability misalignment but fails to improve ranking of predictions by correctness likelihood for code tasks, and that integrating uncertainty estimation with tool-based handling overcomes this limitation.
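
One way to see how calibration and ranking can come apart is a toy example with a strictly increasing recalibration map applied to a scalar confidence score. The sketch below uses a Platt-style map on the logit of the score; the parameters `a` and `b` and the toy data are assumptions chosen by hand, not fitted values or results from the paper. Because the map preserves order, the AUROC of confidence against correctness is unchanged, while the expected calibration error (ECE) moves.

```python
import math


def auroc(scores, correct):
    """Probability that a correct prediction outranks an incorrect one (ties count 0.5)."""
    pos = [s for s, c in zip(scores, correct) if c]
    neg = [s for s, c in zip(scores, correct) if not c]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def ece(scores, correct, n_bins=5):
    """Expected calibration error with equal-width confidence bins."""
    total, err = len(scores), 0.0
    for b in range(n_bins):
        lo, hi = b / n_bins, (b + 1) / n_bins
        idx = [i for i, s in enumerate(scores)
               if lo < s <= hi or (b == 0 and s == 0.0)]
        if idx:
            conf = sum(scores[i] for i in idx) / len(idx)
            acc = sum(correct[i] for i in idx) / len(idx)
            err += len(idx) / total * abs(acc - conf)
    return err


def recalibrate(p, a=0.5, b=0.0):
    """Strictly increasing Platt-style map on the logit of p (needs a > 0 and 0 < p < 1)."""
    z = math.log(p / (1.0 - p))
    return 1.0 / (1.0 + math.exp(-(a * z + b)))


scores = [0.99, 0.97, 0.95, 0.93, 0.90, 0.88]   # overconfident toy scores
correct = [True, False, True, True, False, False]
scaled = [recalibrate(p) for p in scores]

print("AUROC before/after:", auroc(scores, correct), auroc(scaled, correct))
print("ECE   before/after:", ece(scores, correct), ece(scaled, correct))
```

This illustrates only the order-preserving case. Whether a given calibration method on a given code task behaves this way is the empirical question the premise rests on.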

What would settle it

A direct comparison on a code prediction benchmark showing whether the framework produces higher accuracy than calibrated baselines at any fixed coverage level, or whether the program analysis step correctly resolves a measurable share of the cases the model abstains on.
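
The two quantities named here could be computed along the following lines. This is a sketch under stated assumptions: the `Record` fields, in particular a per-example `tool_correct` flag saying whether the analysis step would produce a correct outcome, are hypothetical and presuppose a labeled benchmark on which both the model and the tool have been run.

```python
from typing import NamedTuple, Sequence


class Record(NamedTuple):
    confidence: float
    model_correct: bool  # was the model's own output correct?
    tool_correct: bool   # would the analysis step yield a correct outcome?


def deferral_report(records: Sequence[Record], threshold: float) -> dict:
    """Summarize answered and deferred cases at a fixed confidence threshold."""
    answered = [r for r in records if r.confidence >= threshold]
    deferred = [r for r in records if r.confidence < threshold]
    answered_ok = sum(r.model_correct for r in answered)
    resolved = sum(r.tool_correct for r in deferred)
    return {
        "coverage": len(answered) / len(records),
        "selective_accuracy": answered_ok / len(answered) if answered else None,
        "resolved_share": resolved / len(deferred) if deferred else None,
        "end_to_end_accuracy": (answered_ok + resolved) / len(records),
    }
```

Running the same report for a calibrated baseline and for the full framework at matched coverage would give the side-by-side comparison described above.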

Figures

Figures reproduced from arXiv: 2605.19369 by Ravishka Rathnasuriya, Wei Yang.

**Figure 1.** Overview of the Proposed Framework. Inputs and model scope. The framework begins with task-specific inputs, each routed to either a classification model or a generative model [4, 8, 11, 22]. Classification tasks include vulnerability detection, defect prediction, or API misuse identification, where the model maps a code representation to a discrete label. Generative tasks include code synthesis, completio…

Original abstract

Code language models are increasingly adopted for both understanding and generative tasks. Despite their success, these models frequently produce overconfident incorrect predictions and underconfident correct predictions, undermining their reliability in deployment. Practical deployment demands three capabilities: accurately estimating the likelihood of correctness, abstaining on uncertain predictions, and invoking external mechanisms to validate or repair abstained outputs. Existing calibration and uncertainty estimation methods, primarily developed for natural language tasks, do not readily transfer to code. Notably, post-hoc calibration techniques often reduce probability misalignment but fail to improve the ranking of predictions by correctness likelihood, a requirement for selective prediction under partial coverage. Furthermore, most approaches treat uncertainty as a passive indicator rather than an actionable signal. This work introduces a unified framework that integrates uncertainty estimation, model calibration, and tool-based abstention handling for code models. The proposed design enables models to assign reliable correctness probabilities, abstain under uncertainty, and invoke lightweight program analysis procedures to process abstained cases. By combining these components within a single deployment-oriented workflow, this framework supports risk-aware, coverage-controlled use of code models across both classification and generation settings.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The paper sketches a framework for code models to abstain and defer to program analysis but stays too high-level on the tools and validation to assess its impact yet.

The key takeaway is that this work outlines a framework for code models to estimate correctness, abstain when unsure, and hand off to program analysis tools, but it remains high-level without backing results. What stands out as new is the specific point that post-hoc calibration reduces misalignment but does not improve how well predictions are ranked by their actual correctness likelihood, at least for code tasks. The authors then propose integrating that with tool-based handling to make the uncertainty signal useful rather than passive. This combination in one workflow for both classification and generation is the main contribution they claim.

The paper does a good job framing the practical barriers to deploying these models, like overconfidence leading to bad reliability in software engineering contexts. It correctly notes that methods from natural language do not transfer directly. Where it falls short is in the execution of the central idea. There is no description of the lightweight program analysis procedures, how they connect to the uncertainty estimates, or evidence that they can handle code-specific issues effectively while staying lightweight. Without experiments, data, or even concrete examples, it is difficult to judge whether the framework actually delivers on the promises. The stress-test note captures this accurately.

This paper is for readers focused on reliable AI applications in software engineering. Someone working on selective prediction or calibration for code might pick up the workflow idea, but it would need more substance to influence broader work. I recommend sending it to peer review, provided the authors add the missing specifics on the tools and some form of validation. The topic is relevant enough to warrant referee input even in its current state.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes a unified framework that integrates uncertainty estimation, model calibration, and tool-based abstention handling for code language models. It claims this design enables reliable assignment of correctness probabilities, abstention on uncertain predictions, and invocation of lightweight program analysis procedures to process abstained cases, supporting risk-aware and coverage-controlled deployment across classification and generation tasks while addressing limitations of existing post-hoc calibration methods that fail to improve ranking by correctness likelihood.

Significance. If the framework's components are shown to integrate effectively and the tool-based handling proves lightweight and reliable, the work would provide a practical deployment workflow that treats uncertainty as an actionable signal rather than a passive indicator. This could meaningfully advance reliable use of code models by enabling selective prediction and external validation/repair mechanisms for code-specific issues.

major comments (2)

[Abstract] The central claim that the framework 'invokes lightweight program analysis procedures to process abstained cases' is load-bearing for overcoming the ranking limitations of prior calibration methods, yet the manuscript provides no description of the specific procedures (e.g., static checkers, symbolic execution, or repair heuristics), their integration with uncertainty signals, or validation that they remain lightweight while addressing semantic mismatches or non-local bugs in code generation and classification.
[Framework description] (likely §3 or equivalent) The assertion that integrating uncertainty estimation with tool-based handling will overcome the failure of post-hoc calibration to improve ranking of predictions by correctness likelihood requires concrete mechanisms and evidence; without them, the actionable signal for abstention remains unsubstantiated for both classification and generation settings.

minor comments (2)

The abstract would be strengthened by briefly outlining the evaluation approach, datasets, or metrics used to demonstrate the framework's benefits over existing methods.
Notation for 'correctness probabilities' and 'selective prediction under partial coverage' should be defined explicitly on first use to improve clarity for readers unfamiliar with selective classification literature.
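
For readers who want the terms pinned down, the following is the notation commonly used in the selective classification literature. It is offered as an assumption about what the referee has in mind, not as the paper's own notation; the symbols $\kappa$, $\tau$, $g_\tau$, $\phi$, and $R$ are introduced here for illustration.

```latex
% Standard selective-classification notation (an assumption; not taken from the paper).
% f: model, \kappa(x) \in [0,1]: confidence score, \tau: abstention threshold,
% \ell: loss function (e.g., 0/1 loss).
\begin{align*}
  g_\tau(x) &= \mathbb{1}\!\left[\kappa(x) \ge \tau\right]
    && \text{(answer if 1, abstain if 0)} \\
  \phi(\tau) &= \mathbb{E}_{x}\!\left[g_\tau(x)\right]
    && \text{(coverage: fraction of inputs answered)} \\
  R(\tau) &= \frac{\mathbb{E}_{(x,y)}\!\left[\ell(f(x), y)\, g_\tau(x)\right]}{\phi(\tau)}
    && \text{(selective risk: expected loss on answered inputs)} \\
  \Pr\!\left[f(x) = y \mid \kappa(x) = p\right] &= p \quad \forall p \in [0,1]
    && \text{(perfect calibration of } \kappa\text{)}
\end{align*}
```

Under this reading, "partial coverage" means operating at some $\phi(\tau) < 1$, and a "correctness probability" is a score $\kappa(x)$ that approximately satisfies the last line.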

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback highlighting the need for greater specificity on the tool-based abstention mechanisms. We address each major comment below and will revise the manuscript accordingly to strengthen the description and evidence.

Referee: [Abstract] The central claim that the framework 'invokes lightweight program analysis procedures to process abstained cases' is load-bearing for overcoming the ranking limitations of prior calibration methods, yet the manuscript provides no description of the specific procedures (e.g., static checkers, symbolic execution, or repair heuristics), their integration with uncertainty signals, or validation that they remain lightweight while addressing semantic mismatches or non-local bugs in code generation and classification.

Authors: We agree that the abstract's claim requires supporting detail to be fully substantiated. The current manuscript provides a high-level overview of tool invocation in Section 3 but does not include the requested specifics on procedures, integration, or lightweight validation. In revision we will expand the abstract for precision and add to Section 3 concrete examples (lightweight static checkers for syntax/type errors and targeted repair heuristics for common code issues), explicit integration logic showing how uncertainty scores trigger tool calls, and new empirical results confirming low overhead relative to model inference while noting limitations on non-local semantic bugs. revision: yes
Referee: [Framework description] The assertion that integrating uncertainty estimation with tool-based handling will overcome the failure of post-hoc calibration to improve ranking of predictions by correctness likelihood requires concrete mechanisms and evidence; without them, the actionable signal for abstention remains unsubstantiated for both classification and generation settings.

Authors: The manuscript presents the integrated framework in Section 3 and reports improved selective-prediction metrics in Sections 4–5, but we acknowledge that the concrete decision mechanisms and direct evidence linking tool handling to ranking gains are not sufficiently explicit. We will add a dedicated subsection detailing the threshold-based abstention logic, conditional tool invocation, and how external validation augments correctness ranking. We will also include targeted ablation results isolating the contribution of the tool component to ranking metrics (e.g., AUC or coverage-accuracy curves) for both classification and generation tasks. revision: yes
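
As an illustration of what a lightweight static checker of the kind promised in the first response might look like, here is a sketch for generated Python code using only the standard-library `ast` module. It is an assumption about the flavor of tool meant, not the authors' procedure: `lightweight_checks` is a hypothetical name, and the empty-body rule is one example heuristic. It covers syntax errors only, not the type errors or non-local semantic bugs the exchange also mentions.

```python
import ast
from typing import List


def lightweight_checks(source: str) -> List[str]:
    """Return issues found by cheap static checks on Python source text."""
    try:
        tree = ast.parse(source)
    except SyntaxError as exc:
        return [f"syntax error at line {exc.lineno}: {exc.msg}"]

    issues = []
    for node in ast.walk(tree):
        # Flag functions whose body is only `pass` or `...`,
        # a common sign of an incomplete generation.
        if isinstance(node, ast.FunctionDef) and len(node.body) == 1:
            stmt = node.body[0]
            is_pass = isinstance(stmt, ast.Pass)
            is_ellipsis = (isinstance(stmt, ast.Expr)
                           and isinstance(stmt.value, ast.Constant)
                           and stmt.value.value is Ellipsis)
            if is_pass or is_ellipsis:
                issues.append(f"function '{node.name}' has an empty body")
    return issues
```

Checks like these parse the code without executing it, which keeps them cheap, but that also bounds what they can catch.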

Circularity Check

0 steps flagged

No significant circularity in framework proposal

The paper presents a high-level design for integrating uncertainty estimation, calibration, and tool-based abstention handling in code models. No equations, fitted parameters, predictions derived from fits, or self-citations are described in the abstract or referenced text. The central claims concern a methodological workflow rather than any derivation that reduces to its own inputs by construction. This qualifies as a self-contained proposal with no load-bearing circular steps.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central proposal rests on the domain assumption that standard calibration methods fail to transfer to code and that a new integrated workflow will succeed where they do not; no free parameters or invented entities are explicitly quantified in the abstract.

axioms (1)

domain assumption: Existing calibration and uncertainty estimation methods developed for natural language do not readily transfer to code.
Explicitly stated in the abstract as motivation for the new framework.

invented entities (1)

Unified framework integrating uncertainty estimation, calibration, and tool-based abstention (no independent evidence)
purpose: To support risk-aware, coverage-controlled use of code models
Proposed as the main contribution but lacks independent evidence or falsifiable predictions in the abstract.

Reference graph

Works this paper leans on

[1] Yihong Dong, Jiazheng Ding, Xue Jiang, Ge Li, Zhuo Li, and Zhi Jin. 2025. CodeScore: Evaluating code generation by learning code execution. ACM Transactions on Software Engineering and Methodology 34, 3 (2025), 1–22
[2] Yarin Gal and Zoubin Ghahramani. 2016. Dropout as a Bayesian approximation: Representing model uncertainty in deep learning. In International Conference on Machine Learning. PMLR, 1050–1059
[3] Chuan Guo, Geoff Pleiss, Yu Sun, and Kilian Q. Weinberger. 2017. On Calibration of Modern Neural Networks. arXiv:1706.04599 [cs.LG]
[4] Daya Guo, Qihao Zhu, Dejian Yang, Zhenda Xie, Kai Dong, Wentao Zhang, Guanting Chen, Xiao Bi, Yu Wu, YK Li, et al. 2024. DeepSeek-Coder: When the Large Language Model Meets Programming–The Rise of Code Intelligence. arXiv preprint arXiv:2401.14196 (2024)
[5] Kartik Gupta, Amir Rahimi, Thalaiyasingam Ajanthan, Thomas Mensink, Cristian Sminchisescu, and Richard Hartley. 2020. Calibration of neural networks using splines. arXiv preprint arXiv:2006.12800 (2020)
[6] Dan Hendrycks and Kevin Gimpel. 2018. A Baseline for Detecting Misclassified and Out-of-Distribution Examples in Neural Networks. arXiv:1610.02136 [cs.NE]
[7] Qiang Hu, Yuejun Guo, Xiaofei Xie, Maxime Cordy, Mike Papadakis, Lei Ma, and Yves Le Traon. 2023. CodeS: towards code model generalization under distribution shift. In International Conference on Software Engineering (ICSE): New Ideas and Emerging Results (NIER)
[8] Binyuan Hui, Jian Yang, Zeyu Cui, Jiaxi Yang, Dayiheng Liu, Lei Zhang, Tianyu Liu, Jiajun Zhang, Bowen Yu, Keming Lu, et al. 2024. Qwen2.5-coder technical report. arXiv preprint arXiv:2409.12186 (2024)
[9] Xue Jiang, Yihong Dong, Lecheng Wang, Zheng Fang, Qiwei Shang, Ge Li, Zhi Jin, and Wenpin Jiao. 2024. Self-planning code generation with large language models. ACM Transactions on Software Engineering and Methodology 33, 7 (2024), 1–30
[10] Balaji Lakshminarayanan, Alexander Pritzel, and Charles Blundell. 2017. Simple and scalable predictive uncertainty estimation using deep ensembles. Advances in Neural Information Processing Systems 30 (2017)
[11] Raymond Li, Loubna Ben Allal, Yangtian Zi, Niklas Muennighoff, Denis Kocetkov, Chenghao Mou, Marc Marone, Christopher Akiki, Jia Li, Jenny Chim, et al. 2023. Starcoder: may the source be with you! arXiv preprint arXiv:2305.06161 (2023)
[12] Yufei Li, Simin Chen, and Wei Yang. 2021. Estimating predictive uncertainty under program data distribution shift. arXiv preprint arXiv:2107.10989 (2021)
[13] Jiawei Liu, Chunqiu Steven Xia, Yuyao Wang, and Lingming Zhang. 2023. Is Your Code Generated by ChatGPT Really Correct? Rigorous Evaluation of Large Language Models for Code Generation. In Thirty-seventh Conference on Neural Information Processing Systems. https://openreview.net/forum?id=1qvx610Cu7
[14] Jiawei Liu, Songrun Xie, Junhao Wang, Yuxiang Wei, Yifeng Ding, and Lingming Zhang. 2024. Evaluating Language Models for Efficient Code Generation. In First Conference on Language Modeling. https://openreview.net/forum?id=IBCBMeAhmC
[15] Shuai Lu, Daya Guo, Shuo Ren, Junjie Huang, Alexey Svyatkovskiy, Ambrosio Blanco, Colin Clement, Dawn Drain, Daxin Jiang, Duyu Tang, et al. 2021. Codexglue: A machine learning benchmark dataset for code understanding and generation. arXiv preprint arXiv:2102.04664 (2021)
[16] Robert Munro Monarch. 2021. Human-in-the-Loop Machine Learning: Active learning and annotation for human-centered AI. Simon and Schuster
[17] Hui Peng, Yan Shoshitaishvili, and Mathias Payer. 2018. T-Fuzz: fuzzing by program transformation. In 2018 IEEE Symposium on Security and Privacy (SP). IEEE, 697–710
[18] Anh Viet Phan and Minh Le Nguyen. 2017. Convolutional neural networks on assembly code for predicting software defects. In 2017 21st Asia Pacific Symposium on Intelligent and Evolutionary Systems (IES). 37–42. doi:10.1109/IESYS.2017.8233558
[19] Ravishka Rathnasuriya. 2025. A Framework for On the Fly Input Refinement for Deep Learning Models. In 2025 IEEE/ACM 47th International Conference on Software Engineering: Companion Proceedings (ICSE-Companion). IEEE, 140–144
[20] Ravishka Rathnasuriya. 2025. On the Fly Input Refinement for Code Language Models. In 2025 IEEE/ACM 47th International Conference on Software Engineering: Companion Proceedings (ICSE-Companion). IEEE, 230–231
[21] Ravishka Rathnasuriya, Zijie Zhao, and Wei Yang. 2025. CodeImprove: Program Adaptation for Deep Code Models. In 2025 IEEE/ACM 47th International Conference on Software Engineering (ICSE). IEEE Computer Society, 676–676
[22] Baptiste Roziere, Jonas Gehring, Fabian Gloeckle, Sten Sootla, Itai Gat, Xiaoqing Ellen Tan, Yossi Adi, Jingyu Liu, Romain Sauvestre, Tal Remez, et al. 2023. Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023)
[23] Anjana Sarkar and Soumyendu Sarkar. 2025. Survey of LLM Agent Communication with MCP: A Software Design Pattern Centric Review. arXiv preprint arXiv:2506.05364 (2025)
[24] Claude Elwood Shannon. 1948. A mathematical theory of communication. The Bell System Technical Journal 27, 3 (1948), 379–423
[25] Claudio Spiess, David Gros, Kunal Suresh Pai, Michael Pradel, Md Rafiqul Islam Rabin, Amin Alipour, Susmit Jha, Prem Devanbu, and Toufique Ahmed. 2024. Calibration and correctness of language models for code. arXiv preprint arXiv:2402.02047 (2024)
[27] Jacob Steinhardt and Percy S Liang. 2016. Unsupervised risk estimation using only conditional independence structure. Advances in Neural Information Processing Systems 29 (2016)
[28] Zhao Tian, Junjie Chen, and Xiangyu Zhang. 2023. On-the-fly Improving Performance of Deep Code Models via Input Denoising. arXiv preprint arXiv:2308.09969 (2023)
[29] Rijnard van Tonder and Claire Le Goues. 2020. Tailoring programs for static analysis via program transformation. In Proceedings of the ACM/IEEE 42nd International Conference on Software Engineering. 824–834
[30] Roman Vashurin, Ekaterina Fadeeva, Artem Vazhentsev, Lyudmila Rvanova, Akim Tsvigun, Daniil Vasilev, Rui Xing, Abdelrahman Boda Sadallah, Kirill Grishchenkov, Sergey Petrakov, et al. 2024. Benchmarking uncertainty quantification methods for large language models with lm-polygraph. arXiv preprint arXiv:2406.15627 (2024)
[31] Ruslan Vasilev and Alexander D’yakonov. 2023. Calibration of neural networks. arXiv preprint arXiv:2303.10761 (2023)
[32] Zhou Yang, Jieke Shi, Junda He, and David Lo. 2022. Natural Attack for Pre-Trained Models of Code. In Proceedings of the 44th International Conference on Software Engineering (Pittsburgh, Pennsylvania) (ICSE ’22). Association for Computing Machinery, New York, NY, USA, 1482–1493. doi:10.1145/3510003.3510146
[33] Noam Yefet, Uri Alon, and Eran Yahav. 2020. Adversarial examples for models of code. Proceedings of the ACM on Programming Languages 4, OOPSLA (2020), 1–30
[34] Weiwei Zhang, Shengjian Guo, Hongyu Zhang, Yulei Sui, Yinxing Xue, and Yun Xu. 2023. Challenging Machine Learning-based Clone Detectors via Semantic-preserving Code Transformations. IEEE Transactions on Software Engineering 49, 5 (May 2023), 3052–3070. doi:10.1109/TSE.2023.3240118 arXiv:2111.10793 [cs]
[35] Zhenhao Zhou, Chaofeng Sha, and Xin Peng. 2024. On calibration of pre-trained code models. In Proceedings of the IEEE/ACM 46th International Conference on Software Engineering. 1–13
