Task Abstention for Large Language Models in Code Generation

Senrong Xu; Taolue Chen; Xiaoxing Ma; Yanke Zhou; Yuan Yao; Yuhao Tan; Zenan Li

arxiv: 2605.17029 · v1 · pith:44ZJFCPEnew · submitted 2026-05-16 · 💻 cs.SE · cs.AI

Yanke Zhou, Yuhao Tan, Senrong Xu, Zenan Li, Yuan Yao, Taolue Chen, Xiaoxing Ma

Pith reviewed 2026-05-19 20:00 UTC · model grok-4.3

classification 💻 cs.SE cs.AI

keywords task abstention, hallucination detection, code generation, large language models, multiple hypothesis testing, execution consistency, abstention rule

The pith

Code-generating LLMs can abstain from tasks likely to produce hallucinations by checking consistency of execution results across multiple generations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops a method for large language models to decide when to skip a code generation task altogether. It grounds the decision in a calibrated abstention rule drawn from multiple hypothesis testing and judges reliability by whether different generated programs produce matching execution outcomes. The rule works without any oracle test cases or external reference databases and supplies a distribution-free guarantee that the abstention choices will meet a chosen error threshold. A reader would care because the approach gives a concrete way to reduce incorrect code outputs while preserving the model's ability to generate code on tasks it can handle.

Core claim

By treating the problem as multiple hypothesis testing and measuring consistency through code execution outcomes, the abstention rule provides a rigorous, distribution-free theoretical guarantee that controls the probability of incorrect abstention decisions.

What carries the argument

The calibrated abstention rule from multiple hypothesis testing that evaluates consistency of code execution outcomes across generations.
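A minimal Python sketch of how consistency of execution outcomes across generations might be scored. It is illustrative only and not the paper's implementation: the helper names `run_program` and `consistency_score`, the use of `exec` without sandboxing, and the choice of "fraction of programs in the largest agreeing group" as the score are all assumptions.

```python
from collections import Counter
from typing import Hashable, Sequence, Tuple


def run_program(source: str, entry_point: str, args: Tuple) -> Hashable:
    """Execute one generated program on one input and return a hashable outcome.

    Hypothetical helper: real use would need sandboxing and timeouts.
    """
    namespace: dict = {}
    try:
        exec(source, namespace)  # defines the candidate function
        return ("ok", repr(namespace[entry_point](*args)))
    except Exception as exc:  # a crash is also an execution outcome
        return ("error", type(exc).__name__)


def consistency_score(
    programs: Sequence[str], entry_point: str, inputs: Sequence[Tuple]
) -> float:
    """Fraction of programs in the largest group that agree on every input."""
    signatures = [
        tuple(run_program(src, entry_point, args) for args in inputs)
        for src in programs
    ]
    largest_group = Counter(signatures).most_common(1)[0][1]
    return largest_group / len(programs)


# Two syntactically different but equivalent programs, and one that disagrees.
candidates = [
    "def add(a, b):\n    return a + b",
    "def add(a, b):\n    total = a\n    total += b\n    return total",
    "def add(a, b):\n    return a - b",
]
probe_inputs = [(1, 2), (0, 0), (5, -3)]
score = consistency_score(candidates, "add", probe_inputs)
```

Because programs are compared by what they return rather than by their text, the first two candidates fall into the same group even though their source differs. An abstention rule would then compare `score` against a calibrated threshold.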

If this is right

Generative models identify and abstain from hallucination-prone tasks more accurately than prior techniques.
Code generation becomes safer and more robust without extra test cases or databases.
The method accommodates syntactic variation among programs that are semantically equivalent.
Abstention decisions carry a proven bound that holds regardless of the underlying data distribution.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same consistency-based rule could be tested on non-code generation tasks where output equivalence is easy to check automatically.
Integration with existing verification tools might further tighten the practical error rate beyond the theoretical guarantee.
Users could tune the significance level of the hypothesis test to match the risk tolerance of a particular deployment.

Load-bearing premise

That agreement in execution results across several independently generated programs is a reliable signal that none of them contains a hallucination.

What would settle it

A dataset of code tasks with hidden ground-truth test cases where the fraction of tasks on which the method abstains deviates from the error rate predicted by its theoretical bound.
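One way such a check could be run, sketched in Python. It assumes the guarantee has the usual two-parameter form (risk at most a tolerance `alpha` with probability at least `1 - delta`), which this page does not state explicitly. The function names and the example risks are hypothetical.

```python
from typing import Sequence


def violation_rate(admission_risks: Sequence[float], alpha: float) -> float:
    """Fraction of trials whose held-out risk exceeds the tolerance alpha."""
    return sum(r > alpha for r in admission_risks) / len(admission_risks)


def guarantee_holds_empirically(
    admission_risks: Sequence[float], alpha: float, delta: float
) -> bool:
    """True when violations are no more frequent than the allowed level delta."""
    return violation_rate(admission_risks, alpha) <= delta


# Made-up risks measured on held-out tasks over ten random calibration splits.
risks = [0.04, 0.08, 0.12, 0.05, 0.09, 0.03, 0.07, 0.06, 0.02, 0.10]
rate = violation_rate(risks, alpha=0.1)
ok = guarantee_holds_empirically(risks, alpha=0.1, delta=0.2)
```

With few trials the observed violation rate is itself noisy, so a real study would use many splits before concluding that the bound fails.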

Figures

Figures reproduced from arXiv: 2605.17029 by Senrong Xu, Taolue Chen, Xiaoxing Ma, Yanke Zhou, Yuan Yao, Yuhao Tan, Zenan Li.

**Figure 1.** The overview of CODEREFUSER. (Adjacent text from Section 3.1, Overview: In this work, we build our task abstention approach upon the Learn Then Test (LTT) framework Angelopoulos et al. [2025]. We choose LTT as it can provide statistical guarantees for machine learning models by simply adding a post-processing step after the model is trained. Specifically, LTT allows us to calibrate a threshold λ using a calibration set Dc…)

**Figure 2.** Admission risk distribution on HumanEval under different risk tolerance

**Figure 3.** Trade-off between abstention rate and admis…

**Figure 4.** Admission risk distribution on HumanEval under different risk tolerance
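The threshold calibration mentioned in the Figure 1 text can be sketched in Python along the lines of the Learn Then Test recipe: test each candidate threshold λ on a calibration set and keep those whose risk is certified below a tolerance. This is a sketch under stated assumptions, not the paper's procedure. The loss (a task is admitted and the model fails on it), the Hoeffding p-value, the Bonferroni correction, the names `hoeffding_p_value` and `calibrate_thresholds`, and the synthetic data are all choices made here for illustration.

```python
import math
from typing import List, Sequence


def hoeffding_p_value(empirical_risk: float, n: int, alpha: float) -> float:
    """p-value for the null 'true risk > alpha', from n losses bounded in [0, 1]."""
    gap = max(alpha - empirical_risk, 0.0)
    return math.exp(-2.0 * n * gap * gap)


def calibrate_thresholds(
    scores: Sequence[float],
    failed: Sequence[bool],
    lambdas: Sequence[float],
    alpha: float,
    delta: float,
) -> List[float]:
    """Return the candidate thresholds certified to keep risk at most alpha."""
    n = len(scores)
    valid = []
    for lam in lambdas:
        # Assumed loss: 1 when a task is admitted (score >= lam) and the model fails.
        losses = [1.0 if (s >= lam and f) else 0.0 for s, f in zip(scores, failed)]
        p = hoeffding_p_value(sum(losses) / n, n, alpha)
        # Bonferroni correction over the grid controls the family-wise error rate.
        if p <= delta / len(lambdas):
            valid.append(lam)
    return valid


# Synthetic calibration set (made up): tasks with low consistency scores fail.
scores = [i / 399 for i in range(400)]
failed = [s < 0.5 for s in scores]
grid = [0.0, 0.25, 0.5, 0.75, 1.0]
valid = calibrate_thresholds(scores, failed, grid, alpha=0.1, delta=0.1)
chosen = min(valid) if valid else None  # admit as many tasks as the test allows
```

Picking the smallest certified threshold keeps the abstention rate as low as the test permits. If no threshold is certified, `chosen` is `None` and the cautious choice would be to abstain on every task.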

Original abstract

Large language models (LLMs) have revolutionized automated code generation. One serious concern, however, is the so-called "hallucination", i.e., LLMs may generate seemingly plausible but functionally incorrect code. In this paper, we study the task abstention problem, i.e., determining whether a given LLM should abstain from performing a specific code generation task to avoid likely hallucination. Our approach features a calibrated abstention rule, grounded in the principles of multiple hypothesis testing. The rule assesses generation consistency through code execution outcomes, allowing it to handle syntactic diversity of semantically equivalent code without reliance on oracle test cases or external databases. We prove that our approach provides a rigorous, distribution-free theoretical guarantee on its abstention decisions. We evaluate our method on benchmark datasets using several open-source code LLMs. Results show that our method allows generative models to more accurately and efficiently identify and abstain from tasks that induce hallucination compared to existing techniques, providing a reliable mechanism for safer and more robust code generation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper introduces an abstention rule for LLM code generation that uses multiple hypothesis testing on execution consistency to get a distribution-free guarantee, but the guarantee only bounds decisions under consistent outcomes and does not directly tie consistency to functional correctness.

The main point is a method that lets code-generating LLMs decide when to abstain by checking whether several generations produce matching execution results. They apply multiple hypothesis testing to calibrate the threshold and prove a distribution-free guarantee on the abstention choices. This avoids any need for oracle test cases or external databases, which makes it more deployable than some earlier approaches that rely on those resources. The benchmark runs on open-source code models show clearer identification of hallucination-prone tasks than the baselines they compare against. That combination of a statistical guarantee and practical execution checks is the clearest addition here.

The soft spot is the assumption that execution consistency serves as a reliable stand-in for correctness. The proof controls error rates when outcomes are exchangeable, yet nothing in the argument rules out cases where multiple generations produce the same incorrect result. Experiments do not include a direct check for how often consistent-but-wrong outputs occur, so the real-world safety improvement rests on how often that proxy holds.

Readers working on reliable LLM tooling in software engineering will find the most use in the calibration procedure and the reported gains. The work is coherent enough on its own terms to justify sending it out for peer review, though referees will probably press on tightening the link between consistency and actual correctness.

Referee Report

1 major / 2 minor

Summary. The paper proposes a task abstention framework for LLMs performing code generation. It introduces a calibrated abstention rule based on multiple hypothesis testing applied to the consistency of code execution outcomes across multiple generations. The method claims to avoid reliance on oracle test cases or external databases while handling syntactic variations in semantically equivalent code. The authors prove a distribution-free theoretical guarantee on the abstention decisions and report improved accuracy and efficiency in identifying hallucination-inducing tasks compared to baselines on benchmark datasets with open-source code LLMs.

Significance. If the consistency of execution outcomes serves as a reliable proxy for functional correctness, the approach offers a practical mechanism for safer LLM-based code generation with a distribution-free guarantee that does not depend on specific data distributions. This could reduce risks from hallucinations in automated programming. The strength lies in the theoretical grounding via multiple testing principles and the empirical evaluation, but significance is tempered by the unproven link between observed consistency and unobserved correctness.

major comments (1)

The central theoretical claim of a rigorous distribution-free guarantee on abstention decisions controls type-I error rates under the null of consistent execution outcomes (via exchangeability assumptions in the multiple testing procedure). However, this does not provide any bound on the probability that consistent executions are functionally incorrect, which is required to support the motivation of abstaining to avoid hallucinations. The manuscript treats execution consistency as a direct proxy without additional analysis or bounds linking it to actual semantic correctness.

minor comments (2)

Clarify in the abstract and introduction that the distribution-free guarantee applies specifically to decisions based on execution consistency rather than directly to functional correctness or hallucination avoidance.
Provide more detail on the exact definition of 'consistent outcomes' (e.g., thresholds for output equivalence) and the hypothesis testing procedure, including how the significance level is chosen, to aid reproducibility.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their insightful review and for identifying an important clarification regarding the scope of our theoretical guarantees. We address the major comment in detail below and outline the revisions we will make to strengthen the manuscript.

Referee: The central theoretical claim of a rigorous distribution-free guarantee on abstention decisions controls type-I error rates under the null of consistent execution outcomes (via exchangeability assumptions in the multiple testing procedure). However, this does not provide any bound on the probability that consistent executions are functionally incorrect, which is required to support the motivation of abstaining to avoid hallucinations. The manuscript treats execution consistency as a direct proxy without additional analysis or bounds linking it to actual semantic correctness.

Authors: We thank the referee for this precise observation. Our theoretical result establishes a distribution-free control on the Type I error of the multiple-testing procedure under the null hypothesis of exchangeable (consistent) execution outcomes across generations. This guarantees that the probability of falsely detecting inconsistency, and thus abstaining, when executions are in fact consistent is bounded at the desired level, without distributional assumptions. We agree that this guarantee does not extend to bounding the conditional probability that executions are functionally incorrect given observed consistency. Execution consistency is introduced as a practical, oracle-free proxy for semantic correctness, motivated by the fact that semantically equivalent programs tend to produce matching outputs on the same test inputs, whereas hallucinations frequently produce divergent or erroneous behavior. The manuscript does not claim a theoretical bound on P(incorrect | consistent), as deriving such a bound would require additional assumptions on the code distribution or access to ground-truth oracles, which our framework explicitly avoids. In the revised manuscript we will add a new subsection in the theoretical analysis that explicitly states the scope of the guarantees (control of false abstention under consistency) and distinguishes it from the empirical correlation between consistency and correctness. We will also augment the experimental section with quantitative analysis of this correlation across the evaluated benchmarks and models. These changes will make the relationship between the theoretical claims and the motivating application fully transparent.

revision: yes
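The quantity P(incorrect | consistent) discussed in this exchange can at least be estimated empirically on a benchmark that has ground-truth tests. A minimal Python sketch, assuming each task has already been labeled with two booleans (whether its generations were judged consistent, and whether the output passed the hidden tests). The function name and the records are hypothetical.

```python
from typing import Optional, Sequence, Tuple


def consistent_but_wrong_rate(
    records: Sequence[Tuple[bool, bool]]
) -> Optional[float]:
    """Estimate P(incorrect | consistent) from (is_consistent, is_correct) pairs.

    Returns None when no task was judged consistent.
    """
    correct_flags = [is_correct for is_consistent, is_correct in records if is_consistent]
    if not correct_flags:
        return None
    return sum(not c for c in correct_flags) / len(correct_flags)


# Made-up labels for six tasks: (consistent?, correct?).
records = [
    (True, True),
    (True, True),
    (True, False),   # consistent across generations, yet wrong
    (False, False),
    (True, True),
    (False, True),
]
rate = consistent_but_wrong_rate(records)
```

This is a point estimate only. It carries no distribution-free guarantee, which is the gap the referee and the authors both acknowledge.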

Circularity Check

0 steps flagged

No circularity: guarantee derived from standard multiple hypothesis testing on consistency

The paper's central derivation applies principles of multiple hypothesis testing to assess consistency of code execution outcomes across generations, yielding a distribution-free guarantee on abstention decisions under exchangeability. This relies on general statistical theory rather than any self-citation chain, fitted parameters renamed as predictions, or redefinition of inputs. The abstention rule is calibrated to control error rates for the consistency null without reducing the theoretical claim to observed data by construction. The interpretive link from consistency to hallucination avoidance is an external assumption, not a load-bearing definitional step in the proof.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on standard statistical principles applied to execution outcomes; no new entities are introduced and free parameters appear limited to the choice of significance level in the testing procedure.

free parameters (1)

significance level for hypothesis testing
The calibrated abstention rule requires choosing a threshold or alpha level to control the error rate in the multiple testing framework.

axioms (1)

standard math: Principles of multiple hypothesis testing yield distribution-free guarantees when applied to consistency checks
Invoked to establish the theoretical guarantee on abstention decisions without reliance on specific data distributions.

pith-pipeline@v0.9.0 · 5715 in / 1208 out tokens · 49033 ms · 2026-05-19T20:00:58.915521+00:00 · methodology

Reference graph

Works this paper leans on

42 extracted references · 42 canonical work pages · 8 internal anchors

[1]

doi: 10.1126/science.abq1158

ISSN 1095-9203. doi: 10.1126/science.abq1158. URL http://dx.doi.org/10.1126/science.abq1158. Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, et al. Evaluating Large Language Models Trained on Code, July
[2]

Code Llama: Open Foundation Models for Code

Baptiste Roziere, Jonas Gehring, Fabian Gloeckle, Sten Sootla, Itai Gat, Xiaoqing Ellen Tan, Yossi Adi, Jingyu Liu, Romain Sauvestre, Tal Remez, et al. Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950,
[3]

A survey on hallucination in large language models: Principles, taxonomy, challenges, and open questions,

Lei Huang, Weijiang Yu, Weitao Ma, Weihong Zhong, Zhangyin Feng, Haotian Wang, Qianglong Chen, Weihua Peng, Xiaocheng Feng, Bing Qin, and Ting Liu. A survey on hallucination in large language models: Principles, taxonomy, challenges, and open questions. ACM Transactions on Information Systems, 43(2):1–55, January 2025a. ISSN 1558-2868. doi: 10.1145/3703155...
[4]

Selfcheckgpt: Zero-resource black-box hallucination detection for generative large language models

Potsawee Manakul, Adian Liusie, and Mark Gales. Selfcheckgpt: Zero-resource black-box hallucination detection for generative large language models. In The 2023 Conference on Empirical Methods in Natural Language Processing,
[5]

CodeT: Code Generation with Generated Tests

URL https://arxiv.org/abs/2207.10397. Lorenz Kuhn, Yarin Gal, and Sebastian Farquhar. Semantic uncertainty: Linguistic invariances for uncertainty estimation in natural language generation. In The Eleventh International Conference on Learning Representations,
[6]

Program Synthesis with Large Language Models

Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie Cai, Michael Terry, Quoc Le, et al. Program synthesis with large language models. arXiv preprint arXiv:2108.07732,
[7]

DeepSeek-Coder: When the Large Language Model Meets Programming -- The Rise of Code Intelligence

Yuheng Huang, Jiayang Song, Zhijie Wang, Shengming Zhao, Huaming Chen, Felix Juefei-Xu, and Lei Ma. Look before you leap: An exploratory study of uncertainty analysis for large language models. IEEE Transactions on Software Engineering, 2025b. Daya Guo, Qihao Zhu, Dejian Yang, Zhenda Xie, Kai Dong, Wentao Zhang, Guanting Chen, Xiao Bi, Yu Wu, YK Li, et al....
[8]

Binyuan Hui, Jian Yang, Zeyu Cui, Jiaxi Yang, Dayiheng Liu, Lei Zhang, Tianyu Liu, Jiajun Zhang, Bowen Yu, Keming Lu, et al. Qwen2.5-coder technical report. arXiv preprint arXiv:2409.12186,
[9]

A Study of LLMs' Preferences for Libraries and Programming Languages

URL https://openreview.net/forum?id=30XanJanJP. Lukas Twist, Jie M Zhang, Mark Harman, Don Syme, Joost Noppen, and Detlef Nauck. Llms love python: A study of llms’ bias for programming languages and libraries. arXiv preprint arXiv:2503.17181,
[10]

Angela Fan, Beliz Gokkaya, Mark Harman, Mitya Lyubarskiy, Shubho Sengupta, Shin Yoo, and Jie M. Zhang. Large language models for software engineering: Survey and open problems. In 2023 IEEE/ACM International Conference on Software Engineering: Future of Software Engineering (ICSE-FoSE), pages 31–53, 2023a. doi: 10.1109/ICSE-FoSE59343.2023.00008. Vibhor Aga...
[11]

De-Hallucinator: Mitigating LLM Hallucinations in Code Generation Tasks via Iterative Grounding, 2024

Aryaz Eghbali and Michael Pradel. De-hallucinator: Mitigating llm hallucinations in code generation tasks via iterative grounding. arXiv preprint arXiv:2401.01701,
[12]

Nguyen, Wenbo Wang, and Shaohua Wang

URL https://openreview.net/forum?id=qPUbKxKvXq. Zhiyu Fan, Xiang Gao, Martin Mirchev, Abhik Roychoudhury, and Shin Hwei Tan. Automated repair of programs from large language models. In Proceedings of the 45th International Conference on Software Engineering, ICSE ’23, page 1469–1481. IEEE Press, 2023b. ISBN 9781665457019. doi: 10.1109/ICSE48619.2023.00128. ...
[13]

Fang Liu, Yang Liu, Lin Shi, Houkun Huang, Ruifeng Wang, Zhen Yang, Li Zhang, Zhongqi Li, and Yuchi Ma

URL https://arxiv.org/abs/2406.08731. Fang Liu, Yang Liu, Lin Shi, Houkun Huang, Ruifeng Wang, Zhen Yang, Li Zhang, Zhongqi Li, and Yuchi Ma. Exploring and evaluating hallucinations in llm-powered code generation. arXiv preprint arXiv:2404.00971, 2024a. Ziyao Zhang, Chong Wang, Yanlin Wang, Ensheng Shi, Yuchi Ma, Wanjun Zhong, Jiachi Chen, Mingzhi Mao, an...
[14]

URL https://doi.org/10.1145/3728894

doi: 10.1145/3728894. URL https://doi.org/10.1145/3728894. Federico Cassano, John Gouwar, Daniel Nguyen, Sydney Nguyen, Luna Phipps-Costin, Donald Pinckney, Ming-Ho Yee, Yangtian Zi, Carolyn Jane Anderson, Molly Q Feldman, Arjun Guha, Michael Greenberg, and Abhinav Jangda. Multipl-e: A scalable and polyglot approach to benchmarking neural code...
[15]

MultiPL-E:

ISSN 0098-5589. doi: 10.1109/TSE.2023.3267446. URL https://doi.org/10.1109/TSE.2023.3267446. Yue Liu, Thanh Le-Cong, Ratnadira Widyasari, Chakkrit Tantithamthavorn, Li Li, Xuan-Bach D. Le, and David Lo. Refining chatgpt-generated code: Characterizing and mitigating code quality issues. ACM Trans. Softw. Eng. Methodol., 33(5), June 2024b. ISSN 1049-331X. d...
[16]

URL https://doi.org/10.1145/3660810

doi: 10.1145/3660810. URL https://doi.org/10.1145/3660810. Bingbing Wen, Jihan Yao, Shangbin Feng, Chenjun Xu, Yulia Tsvetkov, Bill Howe, and Lucy Lu Wang. Know your limits: A survey of abstention in large language models. Transactions of the Association for Computational Linguistics, 13:529–556,
[17]

URL https://aclanthology.org/2025.tacl-1.26/

doi: 10.1162/tacl_a_00754. URL https://aclanthology.org/2025.tacl-1.26/. Neeraj Varshney, Pavel Dolin, Agastya Seth, and Chitta Baral. The art of defending: A systematic evaluation and analysis of LLM defense strategies on safety and over-defensiveness. In Lun-Wei Ku, Andre Martins, and Vivek Srikumar, editors, Findings of the Association for Computational...
[18]

doi: 10.18653/v1/2024.findings-acl.776

Association for Computational Linguistics. doi: 10.18653/v1/2024.findings-acl.776. URL https://aclanthology.org/2024.findings-acl.776/. Yuxia Wang, Haonan Li, Xudong Han, Preslav Nakov, and Timothy Baldwin. Do-not-answer: Evaluating safeguards in LLMs. In Yvette Graham and Matthew Purver, editors, Findings of the Association for Computational Linguistics: ...
[19]

URL https://aclanthology.org/2024.findings-eacl.61/

Association for Computational Linguistics. URL https://aclanthology.org/2024.findings-eacl.61/. Hanning Zhang, Shizhe Diao, Yong Lin, Yi Fung, Qing Lian, Xingyao Wang, Yangyi Chen, Heng Ji, and Tong Zhang. R-tuning: Instructing large language models to say ‘I don’t know’. In Kevin Duh, Helena Gomez, and Steven Bethard, editors, Proceedings of the 2024 Conf...
[20]

doi: 10.18653/v1/2024.naacl-long.394

Association for Computational Linguistics. doi: 10.18653/v1/2024.naacl-long.394. URL https://aclanthology.org/2024.naacl-long.394/. Gustaf Ahdritz, Tian Qin, Nikhil Vyas, Boaz Barak, and Benjamin L. Edelman. Distinguishing the knowable from the unknowable with language models. ICML’24. JMLR.org,

[21]

Minsu Kim and James Thorne. Epistemology of language models: Do language models have holistic knowledge? In Lun-Wei Ku, Andre Martins, and Vivek Srikumar, editors, Findings of the Association for Computational Linguistics: ACL 2024, pages 12644–12669, Bangkok, Thailand, August
[22]

doi: 10.18653/v1/2024.findings-acl.751

Association for Computational Linguistics. doi: 10.18653/v1/2024.findings-acl.751. URL https://aclanthology.org/2024.findings-acl.751/. Lang Cao. Learn to refuse: Making large language models more controllable and reliable through knowledge scope limitation and refusal mechanism. In Yaser Al-Onaizan, Mohit Bansal, and Yun-Nung Chen, editors, Proceedings of ...
[23]

doi: 10.18653/v1/2024.emnlp-main.212

Association for Computational Linguistics. doi: 10.18653/v1/2024.emnlp-main.212. URL https://aclanthology.org/2024.emnlp-main.212/. Ella Neeman, Roee Aharoni, Or Honovich, Leshem Choshen, Idan Szpektor, and Omri Abend. DisentQA: Disentangling parametric and contextual knowledge with counterfactual question answering. In Anna Rogers, Jordan Boyd-Graber, an...

[24]

doi: 10.18653/v1/2023.acl-long.559

Association for Computational Linguistics. doi: 10.18653/v1/2023.acl-long.559. URL https://aclanthology.org/2023.acl-long.559/. Yuqing Yang, Ethan Chern, Xipeng Qiu, Graham Neubig, and Pengfei Liu. Alignment for honesty. In The Thirty-eighth Annual Conference on Neural Information Processing Systems,
[25]

URL https://openreview.net/forum?id=8s8K2UZGTZ

ISSN 2835-8856. URL https://openreview.net/forum?id=8s8K2UZGTZ. Katherine Tian, Eric Mitchell, Allan Zhou, Archit Sharma, Rafael Rafailov, Huaxiu Yao, Chelsea Finn, and Christopher D Manning. Just ask for calibration: Strategies for eliciting calibrated confidence scores from language models fine-tuned with human feedback. In The 2023 Conferen...
[26]

doi: 10.18653/v1/2024.acl-long.786

Association for Computational Linguistics. doi: 10.18653/v1/2024.acl-long.786. URL https://aclanthology.org/2024.acl-long.786/. Bocheng Chen, Advait Paliwal, and Qiben Yan. Jailbreaker in jail: Moving target defense for large language models. In Proceedings of the 10th ACM Workshop on Moving Target Defense, MTD ’23, page 29–32, New York, NY, USA,
[27]

ISBN 9798400702563

Association for Computing Machinery. ISBN 9798400702563. doi: 10.1145/3605760.3623764. URL https://doi.org/10.1145/3605760.3623764. Ryan J. Tibshirani, Rina Foygel Barber, Emmanuel J. Candès, and Aaditya Ramdas. Conformal prediction under covariate shift. Curran Associates Inc., Red Hook, NY, USA,
[28]

URL https://doi.org/10.1214/23-AOS2276

doi: 10.1214/23-AOS2276. URL https://doi.org/10.1214/23-AOS2276. Isaac Gibbs and Emmanuel Candes. Adaptive conformal inference under distribution shift. In A. Beygelzimer, Y. Dauphin, P. Liang, and J. Wortman Vaughan, editors, Advances in Neural Information Processing Systems,
[29]

Anastasios Nikolas Angelopoulos, Stephen Bates, Adam Fisch, Lihua Lei, and Tal Schuster

URL https://openreview.net/forum?id=6vaActvpcp3. Anastasios Nikolas Angelopoulos, Stephen Bates, Adam Fisch, Lihua Lei, and Tal Schuster. Conformal risk control. In The Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11,
[30]

Jiahui Geng, Fengyu Cai, Yuxia Wang, Heinz Koeppl, Preslav Nakov, and Iryna Gurevych

URL https://openreview.net/forum?id=33XGfHLtZg. Jiahui Geng, Fengyu Cai, Yuxia Wang, Heinz Koeppl, Preslav Nakov, and Iryna Gurevych. A survey of confidence estimation and calibration in large language models. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Vo...
[31]

Sparks of Artificial General Intelligence: Early experiments with GPT-4

Sébastien Bubeck, Varun Chandrasekaran, Ronen Eldan, Johannes Gehrke, Eric Horvitz, Ece Kamar, Peter Lee, Yin Tat Lee, Yuanzhi Li, Scott Lundberg, et al. Sparks of artificial general intelligence: Early experiments with GPT-4. arXiv preprint arXiv:2303.12712,
[32]

Language Models (Mostly) Know What They Know

Saurav Kadavath, Tom Conerly, Amanda Askell, Tom Henighan, Dawn Drain, Ethan Perez, Nicholas Schiefer, Zac Hatfield-Dodds, Nova DasSarma, Eli Tran-Johnson, et al. Language models (mostly) know what they know. arXiv preprint arXiv:2207.05221,
[33]

Cycles of thought: Measuring LLM confidence through stable explanations. arXiv preprint arXiv:2406.03441,

Evan Becker and Stefano Soatto. Cycles of thought: Measuring LLM confidence through stable explanations. arXiv preprint arXiv:2406.03441,
[34]

Universal self-consistency for large language models

Xinyun Chen, Renat Aksitov, Uri Alon, Jie Ren, Kefan Xiao, Pengcheng Yin, Sushant Prakash, Charles Sutton, Xuezhi Wang, and Denny Zhou. Universal self-consistency for large language models. In ICML 2024 Workshop on In-Context Learning,
[35]

[2023a], presents a challenge to ensure the accuracy, reliability and security of AI-generated code Agarwal et al

A Related Work. Code Hallucination. The phenomenon of code hallucination, where LLMs generate code that is illogical, incorrect, or unfaithful to user requirements Fan et al. [2023a], presents a challenge to ensure the accuracy, reliability and security of AI-generated code Agarwal et al. [2024], Eghbali and Pradel [2024], Tian et al. [2025]. Existing resea...
[36]

Different from the above work that focuses on the sample-level hallucination problem, we study the task abstention problem

introduced a framework where the LLM proactively asks clarifying questions to help users refine their initial prompts. Different from the above work that focuses on the sample-level hallucination problem, we study the task abstention problem. LLM Abstention. Abstention is increasingly recognized for its potential to mitigate hallucination and enhance safet...
[37]

honesty” alignment datasets by substituting a model’s incorrect response with “I don’t know

construct “honesty” alignment datasets by substituting a model’s incorrect response with “I don’t know” and then fine-tuning on this revised data. At inference time, a common approach is to use post-processing techniques based on model uncertainty. These include calculating the log probability of a ‘True’ token via indirect logit methods Lin et al. [2022]...

[38]

[1998], Blei et al

derives confidence from the probabilities assigned to generated tokens, employing the geometric mean (i.e., perplexity Chen et al. [1998], Blei et al. [2003]) to mitigate sensitivity to output length; verbalized confidence Kadavath et al. [2022], Xiong et al. [2024], Tian et al
[39]

Read the question and give your answer and corresponding confidence score

directly prompts the LLM to explicitly express its confidence alongside its answer (e.g., “Read the question and give your answer and corresponding confidence score”); self-consistency confidence Xiong et al. [2024], Abbasi Yadkori et al. [2024], Becker and Soatto
[40]

[2022], Chen et al

assesses confidence by having the LLM generate multiple answers for the same input and then measuring the consistency among them Wang et al. [2022], Chen et al. [2024], Cheng et al. [2024], with higher consistency indicating greater confidence. B The LTT Framework. The Learn Then Test (LTT) framework Angelopoulos et al
[41]

Consider the task where each instance $x \in \mathcal{X}$ is associated with a ground-truth label $y \in \mathcal{Y}$

is designed to provide statistical guarantees for machine learning models by simply adding a post-processing step on a calibration set after the model is trained. Consider the task where each instance $x \in \mathcal{X}$ is associated with a ground-truth label $y \in \mathcal{Y}$. Let $D_{\text{cal}} = \{(x_i, y_i)\}_{i=1}^{m} \subseteq \mathcal{X} \times \mathcal{Y}$ be a calibration set composed of the input $x$ and its ground-truth lab...
[42]

To evaluate the sensitivity of CODEREFUSER to this hyperparameter, we conducted additional experiments setting k = 5

In the main experiments, we defined the risk based on the pass rate H@k with k = 3. To evaluate the sensitivity of CODEREFUSER to this hyperparameter, we conducted additional experiments setting k = 5. This adjustment implies a slightly more relaxed criterion for success, effectively allowing the model more attempts to yield a correct solution. We re-calib...

[1] [1]

doi: 10.1126/science.abq1158

ISSN 1095-9203. doi: 10.1126/science.abq1158. URLhttp://dx.doi.org/10.1126/science.abq1158. Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, et al. Evaluating Large Language Models Trained on Code, July

work page doi:10.1126/science.abq1158

[2] [2]

Code Llama: Open Foundation Models for Code

Baptiste Roziere, Jonas Gehring, Fabian Gloeckle, Sten Sootla, Itai Gat, Xiaoqing Ellen Tan, Yossi Adi, Jingyu Liu, Romain Sauvestre, Tal Remez, et al. Code llama: Open foundation models for code.arXiv preprint arXiv:2308.12950,

work page internal anchor Pith review Pith/arXiv arXiv

[3]

A survey on hallucination in large language models: Principles, taxonomy, challenges, and open questions

Lei Huang, Weijiang Yu, Weitao Ma, Weihong Zhong, Zhangyin Feng, Haotian Wang, Qianglong Chen, Weihua Peng, Xiaocheng Feng, Bing Qin, and Ting Liu. A survey on hallucination in large language models: Principles, taxonomy, challenges, and open questions. ACM Transactions on Information Systems, 43(2):1–55, January 2025a. ISSN 1558-2868. doi: 10.1145/3703155...

[4]

Selfcheckgpt: Zero-resource black-box hallucination detection for generative large language models

Potsawee Manakul, Adian Liusie, and Mark Gales. Selfcheckgpt: Zero-resource black-box hallucination detection for generative large language models. In The 2023 Conference on Empirical Methods in Natural Language Processing,

[5]

CodeT: Code Generation with Generated Tests

URL https://arxiv.org/abs/2207.10397. Lorenz Kuhn, Yarin Gal, and Sebastian Farquhar. Semantic uncertainty: Linguistic invariances for uncertainty estimation in natural language generation. In The Eleventh International Conference on Learning Representations,

[6]

Program Synthesis with Large Language Models

Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie Cai, Michael Terry, Quoc Le, et al. Program synthesis with large language models. arXiv preprint arXiv:2108.07732,

[7]

DeepSeek-Coder: When the Large Language Model Meets Programming -- The Rise of Code Intelligence

Yuheng Huang, Jiayang Song, Zhijie Wang, Shengming Zhao, Huaming Chen, Felix Juefei-Xu, and Lei Ma. Look before you leap: An exploratory study of uncertainty analysis for large language models. IEEE Transactions on Software Engineering, 2025b. Daya Guo, Qihao Zhu, Dejian Yang, Zhenda Xie, Kai Dong, Wentao Zhang, Guanting Chen, Xiao Bi, Yu Wu, YK Li, et al....

[8]

Binyuan Hui, Jian Yang, Zeyu Cui, Jiaxi Yang, Dayiheng Liu, Lei Zhang, Tianyu Liu, Jiajun Zhang, Bowen Yu, Keming Lu, et al. Qwen2.5-coder technical report. arXiv preprint arXiv:2409.12186,

[9]

A Study of LLMs' Preferences for Libraries and Programming Languages

URL https://openreview.net/forum?id=30XanJanJP. Lukas Twist, Jie M Zhang, Mark Harman, Don Syme, Joost Noppen, and Detlef Nauck. Llms love python: A study of llms’ bias for programming languages and libraries. arXiv preprint arXiv:2503.17181,

[10]

Angela Fan, Beliz Gokkaya, Mark Harman, Mitya Lyubarskiy, Shubho Sengupta, Shin Yoo, and Jie M. Zhang. Large language models for software engineering: Survey and open problems. In 2023 IEEE/ACM International Conference on Software Engineering: Future of Software Engineering (ICSE-FoSE), pages 31–53, 2023a. doi: 10.1109/ICSE-FoSE59343.2023.00008. Vibhor Aga...

[11]

De-Hallucinator: Mitigating LLM Hallucinations in Code Generation Tasks via Iterative Grounding, 2024

Aryaz Eghbali and Michael Pradel. De-hallucinator: Mitigating llm hallucinations in code generation tasks via iterative grounding. arXiv preprint arXiv:2401.01701,

[12]

Nguyen, Wenbo Wang, and Shaohua Wang

URL https://openreview.net/forum?id=qPUbKxKvXq. Zhiyu Fan, Xiang Gao, Martin Mirchev, Abhik Roychoudhury, and Shin Hwei Tan. Automated repair of programs from large language models. In Proceedings of the 45th International Conference on Software Engineering, ICSE ’23, page 1469–1481. IEEE Press, 2023b. ISBN 9781665457019. doi: 10.1109/ICSE48619.2023.00128. ...

[13]

URL https://arxiv.org/abs/2406.08731. Fang Liu, Yang Liu, Lin Shi, Houkun Huang, Ruifeng Wang, Zhen Yang, Li Zhang, Zhongqi Li, and Yuchi Ma. Exploring and evaluating hallucinations in llm-powered code generation. arXiv preprint arXiv:2404.00971, 2024a. Ziyao Zhang, Chong Wang, Yanlin Wang, Ensheng Shi, Yuchi Ma, Wanjun Zhong, Jiachi Chen, Mingzhi Mao, an...

[14]

doi: 10.1145/3728894. URL https://doi.org/10.1145/3728894. Federico Cassano, John Gouwar, Daniel Nguyen, Sydney Nguyen, Luna Phipps-Costin, Donald Pinckney, Ming-Ho Yee, Yangtian Zi, Carolyn Jane Anderson, Molly Q Feldman, Arjun Guha, Michael Greenberg, and Abhinav Jangda. Multipl-e: A scalable and polyglot approach to benchmarking neural code...

[15]

MultiPL-E:

ISSN 0098-5589. doi: 10.1109/TSE.2023.3267446. URL https://doi.org/10.1109/TSE.2023.3267446. Yue Liu, Thanh Le-Cong, Ratnadira Widyasari, Chakkrit Tantithamthavorn, Li Li, Xuan-Bach D. Le, and David Lo. Refining chatgpt-generated code: Characterizing and mitigating code quality issues. ACM Trans. Softw. Eng. Methodol., 33(5), June 2024b. ISSN 1049-331X. d...

[16]

doi: 10.1145/3660810. URL https://doi.org/10.1145/3660810. Bingbing Wen, Jihan Yao, Shangbin Feng, Chenjun Xu, Yulia Tsvetkov, Bill Howe, and Lucy Lu Wang. Know your limits: A survey of abstention in large language models. Transactions of the Association for Computational Linguistics, 13:529–556,

[17]

doi: 10.1162/tacl_a_00754. URL https://aclanthology.org/2025.tacl-1.26/. Neeraj Varshney, Pavel Dolin, Agastya Seth, and Chitta Baral. The art of defending: A systematic evaluation and analysis of LLM defense strategies on safety and over-defensiveness. In Lun-Wei Ku, Andre Martins, and Vivek Srikumar, editors, Findings of the Association for Computational...

[18]

Association for Computational Linguistics. doi: 10.18653/v1/2024.findings-acl.776. URL https://aclanthology.org/2024.findings-acl.776/. Yuxia Wang, Haonan Li, Xudong Han, Preslav Nakov, and Timothy Baldwin. Do-not-answer: Evaluating safeguards in LLMs. In Yvette Graham and Matthew Purver, editors, Findings of the Association for Computational Linguistics: ...

[19]

Association for Computational Linguistics. URL https://aclanthology.org/2024.findings-eacl.61/. Hanning Zhang, Shizhe Diao, Yong Lin, Yi Fung, Qing Lian, Xingyao Wang, Yangyi Chen, Heng Ji, and Tong Zhang. R-tuning: Instructing large language models to say ‘I don’t know’. In Kevin Duh, Helena Gomez, and Steven Bethard, editors, Proceedings of the 2024 Conf...

[20]

Association for Computational Linguistics. doi: 10.18653/v1/2024.naacl-long.394. URL https://aclanthology.org/2024.naacl-long.394/. Gustaf Ahdritz, Tian Qin, Nikhil Vyas, Boaz Barak, and Benjamin L. Edelman. Distinguishing the knowable from the unknowable with language models. ICML’24. JMLR.org,

[21]

Minsu Kim and James Thorne. Epistemology of language models: Do language models have holistic knowledge? In Lun-Wei Ku, Andre Martins, and Vivek Srikumar, editors, Findings of the Association for Computational Linguistics: ACL 2024, pages 12644–12669, Bangkok, Thailand, August

[22]

Association for Computational Linguistics. doi: 10.18653/v1/2024.findings-acl.751. URL https://aclanthology.org/2024.findings-acl.751/. Lang Cao. Learn to refuse: Making large language models more controllable and reliable through knowledge scope limitation and refusal mechanism. In Yaser Al-Onaizan, Mohit Bansal, and Yun-Nung Chen, editors, Proceedings of ...

[23]

Association for Computational Linguistics. doi: 10.18653/v1/2024.emnlp-main.212. URL https://aclanthology.org/2024.emnlp-main.212/. Ella Neeman, Roee Aharoni, Or Honovich, Leshem Choshen, Idan Szpektor, and Omri Abend. DisentQA: Disentangling parametric and contextual knowledge with counterfactual question answering. In Anna Rogers, Jordan Boyd-Graber, an...

[24]

Association for Computational Linguistics. doi: 10.18653/v1/2023.acl-long.559. URL https://aclanthology.org/2023.acl-long.559/. Yuqing Yang, Ethan Chern, Xipeng Qiu, Graham Neubig, and Pengfei Liu. Alignment for honesty. In The Thirty-eighth Annual Conference on Neural Information Processing Systems,

[25]

ISSN 2835-8856. URL https://openreview.net/forum?id=8s8K2UZGTZ. Katherine Tian, Eric Mitchell, Allan Zhou, Archit Sharma, Rafael Rafailov, Huaxiu Yao, Chelsea Finn, and Christopher D Manning. Just ask for calibration: Strategies for eliciting calibrated confidence scores from language models fine-tuned with human feedback. In The 2023 Conferen...

[26]

Association for Computational Linguistics. doi: 10.18653/v1/2024.acl-long.786. URL https://aclanthology.org/2024.acl-long.786/. Bocheng Chen, Advait Paliwal, and Qiben Yan. Jailbreaker in jail: Moving target defense for large language models. In Proceedings of the 10th ACM Workshop on Moving Target Defense, MTD ’23, page 29–32, New York, NY, USA,

[27]

Association for Computing Machinery. ISBN 9798400702563. doi: 10.1145/3605760.3623764. URL https://doi.org/10.1145/3605760.3623764. Ryan J. Tibshirani, Rina Foygel Barber, Emmanuel J. Candès, and Aaditya Ramdas. Conformal prediction under covariate shift. Curran Associates Inc., Red Hook, NY, USA,

[28]

doi: 10.1214/23-AOS2276. URL https://doi.org/10.1214/23-AOS2276. Isaac Gibbs and Emmanuel Candes. Adaptive conformal inference under distribution shift. In A. Beygelzimer, Y. Dauphin, P. Liang, and J. Wortman Vaughan, editors, Advances in Neural Information Processing Systems,

[29]

URL https://openreview.net/forum?id=6vaActvpcp3. Anastasios Nikolas Angelopoulos, Stephen Bates, Adam Fisch, Lihua Lei, and Tal Schuster. Conformal risk control. In The Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11,

[30]

URL https://openreview.net/forum?id=33XGfHLtZg. Jiahui Geng, Fengyu Cai, Yuxia Wang, Heinz Koeppl, Preslav Nakov, and Iryna Gurevych. A survey of confidence estimation and calibration in large language models. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Vo...

[31]

Sparks of Artificial General Intelligence: Early experiments with GPT-4

Sébastien Bubeck, Varun Chandrasekaran, Ronen Eldan, Johannes Gehrke, Eric Horvitz, Ece Kamar, Peter Lee, Yin Tat Lee, Yuanzhi Li, Scott Lundberg, et al. Sparks of artificial general intelligence: Early experiments with GPT-4. arXiv preprint arXiv:2303.12712,

[32]

Language Models (Mostly) Know What They Know

Saurav Kadavath, Tom Conerly, Amanda Askell, Tom Henighan, Dawn Drain, Ethan Perez, Nicholas Schiefer, Zac Hatfield-Dodds, Nova DasSarma, Eli Tran-Johnson, et al. Language models (mostly) know what they know. arXiv preprint arXiv:2207.05221,

[33]

Evan Becker and Stefano Soatto. Cycles of thought: Measuring LLM confidence through stable explanations. arXiv preprint arXiv:2406.03441,

[34]

Universal self-consistency for large language models

Xinyun Chen, Renat Aksitov, Uri Alon, Jie Ren, Kefan Xiao, Pengcheng Yin, Sushant Prakash, Charles Sutton, Xuezhi Wang, and Denny Zhou. Universal self-consistency for large language models. In ICML 2024 Workshop on In-Context Learning,

[35]

A Related Work

Code Hallucination. The phenomenon of code hallucination, where LLMs generate code that is illogical, incorrect, or unfaithful to user requirements Fan et al. [2023a], makes it challenging to ensure the accuracy, reliability, and security of AI-generated code Agarwal et al. [2024], Eghbali and Pradel [2024], Tian et al. [2025]. Existing resea...

[36]

introduced a framework where the LLM proactively asks clarifying questions to help users refine their initial prompts. Unlike the above work, which focuses on the sample-level hallucination problem, we study the task abstention problem.

LLM Abstention. Abstention is increasingly recognized for its potential to mitigate hallucination and enhance safet...

[37]

construct “honesty” alignment datasets by substituting a model’s incorrect response with “I don’t know” and then fine-tuning on this revised data. At inference time, a common approach is to use post-processing techniques based on model uncertainty. These include calculating the log probability of a ‘True’ token via indirect logit methods Lin et al. [2022]...

[38]

derives confidence from the probabilities assigned to generated tokens, employing their geometric mean (equivalently, the reciprocal of perplexity; Chen et al. [1998], Blei et al. [2003]) to mitigate sensitivity to output length; verbalized confidence Kadavath et al. [2022], Xiong et al. [2024], Tian et al
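As a rough illustration of the token-probability confidence described in this entry, the sketch below computes the geometric mean of per-token probabilities from their log-probabilities. The function name `sequence_confidence` and the input format (a list of natural-log token probabilities) are assumptions made for this example, not something taken from the paper.

```python
import math


def sequence_confidence(token_logprobs):
    """Geometric mean of token probabilities, i.e. exp(mean log-probability).

    Hypothetical helper: `token_logprobs` is assumed to hold the natural-log
    probability the model assigned to each generated token.
    """
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    mean_logprob = sum(token_logprobs) / len(token_logprobs)
    # exp of the mean log-probability is the geometric mean of the
    # probabilities; its reciprocal is the perplexity of the sequence.
    return math.exp(mean_logprob)


def sequence_probability(token_logprobs):
    """Plain product of token probabilities, shown only for contrast."""
    return math.exp(sum(token_logprobs))
```

Because the score is averaged per token, a long output is not penalised merely for being long, whereas the plain product shrinks with every additional token.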

[39]

directly prompts the LLM to explicitly express its confidence alongside its answer (e.g., “Read the question and give your answer and corresponding confidence score”); self-consistency confidence Xiong et al. [2024], Abbasi Yadkori et al. [2024], Becker and Soatto

[40]

assesses confidence by having the LLM generate multiple answers for the same input and then measuring the consistency among them Wang et al. [2022], Chen et al. [2024], Cheng et al. [2024], with higher consistency indicating greater confidence.

B The LTT Framework

The Learn Then Test (LTT) framework Angelopoulos et al

[41]

is designed to provide statistical guarantees for machine learning models by simply adding a post-processing step on a calibration set after the model is trained. Consider the task where each instance $x \in \mathcal{X}$ is associated with a ground-truth label $y \in \mathcal{Y}$. Let $\mathcal{D}_{\mathrm{cal}} = \{(x_i, y_i)\}_{i=1}^{m} \subseteq \mathcal{X} \times \mathcal{Y}$ be a calibration set composed of the input x and its ground-truth lab...
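The calibration step that LTT adds after training can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's procedure: the abstention rule is assumed to answer whenever a consistency score reaches a threshold, the loss is assumed to be binary (1 when the model answers and the answer is wrong), the p-value is an exact binomial tail, and the multiple-testing correction is plain Bonferroni. The names `binomial_cdf` and `ltt_select_thresholds` are hypothetical.

```python
from math import comb


def binomial_cdf(k, n, p):
    """P(Bin(n, p) <= k), computed exactly from the probability mass function."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def ltt_select_thresholds(scores, correct, thresholds, alpha, delta):
    """Return the thresholds certified to keep the risk at or below alpha.

    scores     -- per-task confidence scores on the calibration set (assumed)
    correct    -- per-task booleans: was the generated answer correct?
    thresholds -- candidate abstention thresholds (answer iff score >= threshold)
    alpha      -- target risk level
    delta      -- allowed probability that any returned threshold violates alpha
    """
    m = len(scores)
    # Bonferroni correction: split the error budget across all candidates.
    level = delta / len(thresholds)
    certified = []
    for lam in thresholds:
        # Binary loss (an assumption of this sketch): answered and wrong.
        errors = sum(1 for s, c in zip(scores, correct) if s >= lam and not c)
        # p-value for the null hypothesis "risk(lam) > alpha": under the null,
        # seeing this few errors is at most as likely as under Bin(m, alpha).
        p_value = binomial_cdf(errors, m, alpha)
        if p_value <= level:
            certified.append(lam)
    return certified
```

A caller would typically pick the smallest certified threshold so that the model abstains as rarely as possible while still meeting the risk target; if the returned list is empty, no candidate could be certified at the requested levels.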

[42]

In the main experiments, we defined the risk based on the pass rate H@k with $k = 3$. To evaluate the sensitivity of CODEREFUSER to this hyperparameter, we conducted additional experiments with $k = 5$. This adjustment implies a slightly more relaxed criterion for success, effectively allowing the model more attempts to yield a correct solution. We re-calib...