
arxiv: 2605.17029 · v1 · pith:44ZJFCPEnew · submitted 2026-05-16 · 💻 cs.SE · cs.AI

Task Abstention for Large Language Models in Code Generation

Pith reviewed 2026-05-19 20:00 UTC · model grok-4.3

classification 💻 cs.SE cs.AI
keywords task abstention · hallucination detection · code generation · large language models · multiple hypothesis testing · execution consistency · abstention rule

The pith

Code-generating LLMs can abstain from tasks likely to produce hallucinations by checking consistency of execution results across multiple generations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops a method for large language models to decide when to skip a code generation task altogether. It grounds the decision in a calibrated abstention rule drawn from multiple hypothesis testing and judges reliability by whether different generated programs produce matching execution outcomes. The rule works without any oracle test cases or external reference databases and supplies a distribution-free guarantee that the abstention choices will meet a chosen error threshold. A reader would care because the approach gives a concrete way to reduce incorrect code outputs while preserving the model's ability to generate code on tasks it can handle.

Core claim

By treating the problem as multiple hypothesis testing and measuring consistency through code execution outcomes, the abstention rule provides a rigorous, distribution-free theoretical guarantee that controls the probability of incorrect abstention decisions.

What carries the argument

The calibrated abstention rule from multiple hypothesis testing that evaluates consistency of code execution outcomes across generations.
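
A minimal sketch of how consistency of execution outcomes might be scored, assuming each generation is available as a Python callable and that a handful of probe inputs exist. The names `outcome_signature` and `consistency_score`, and the choice of "fraction of candidates in the largest behaviour cluster" as the score, are illustrative assumptions and not the paper's definition.

```python
from collections import Counter
from typing import Any, Callable, Sequence


def outcome_signature(program: Callable[[Any], Any], inputs: Sequence[Any]) -> tuple:
    """Run one candidate on every probe input and record what happened."""
    signature = []
    for x in inputs:
        try:
            signature.append(("ok", repr(program(x))))
        except Exception as exc:  # a crash is also an observable outcome
            signature.append(("error", type(exc).__name__))
    return tuple(signature)


def consistency_score(programs: Sequence[Callable[[Any], Any]], inputs: Sequence[Any]) -> float:
    """Fraction of candidates that fall in the largest behaviour cluster."""
    clusters = Counter(outcome_signature(p, inputs) for p in programs)
    return max(clusters.values()) / len(programs)


# Three hypothetical generations for "return the absolute value of x".
candidates = [
    lambda x: abs(x),
    lambda x: x if x >= 0 else -x,  # syntactically different, same behaviour
    lambda x: x,                    # wrong on negative inputs
]
score = consistency_score(candidates, inputs=[-2, 0, 3])
print(score)
```

Because candidates are grouped by what they output rather than by their source text, the first two candidates land in the same cluster even though they are written differently. Note that a high score only says the candidates agree with each other, not that they are correct.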

If this is right

  • Generative models identify and abstain from hallucination-prone tasks more accurately than prior techniques.
  • Code generation becomes safer and more robust without extra test cases or databases.
  • The method accommodates syntactic variation among programs that are semantically equivalent.
  • Abstention decisions carry a proven bound that holds regardless of the underlying data distribution.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same consistency-based rule could be tested on non-code generation tasks where output equivalence is easy to check automatically.
  • Integration with existing verification tools might further tighten the practical error rate beyond the theoretical guarantee.
  • Users could tune the significance level of the hypothesis test to match the risk tolerance of a particular deployment.

Load-bearing premise

That agreement in execution results across several independently generated programs is a reliable signal that none of them contains a hallucination.

What would settle it

A dataset of code tasks with hidden ground-truth test cases where the fraction of tasks on which the method abstains deviates from the error rate predicted by its theoretical bound.
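
One way such a check could be set up is sketched below. It assumes "admission risk" means the fraction of admitted (non-abstained) tasks whose code fails the hidden tests; that reading, the function names, and the toy numbers are all assumptions made for illustration.

```python
from typing import Sequence


def admission_risk(admitted: Sequence[bool], passed_hidden_tests: Sequence[bool]) -> float:
    """Share of admitted tasks whose generated code fails the hidden tests."""
    results = [ok for was_admitted, ok in zip(admitted, passed_hidden_tests) if was_admitted]
    if not results:
        return 0.0  # nothing admitted, so nothing can be wrong
    return sum(not ok for ok in results) / len(results)


def violation_rate(trial_risks: Sequence[float], alpha: float) -> float:
    """Fraction of repeated calibration/test splits whose risk exceeds alpha."""
    return sum(risk > alpha for risk in trial_risks) / len(trial_risks)


# Toy data: four tasks, three admitted, one of the admitted ones is wrong.
risk = admission_risk(
    admitted=[True, True, False, True],
    passed_hidden_tests=[True, False, False, True],
)
# Hypothetical risks from four random splits, compared against alpha = 0.1.
rate = violation_rate([0.05, 0.12, 0.08, 0.21], alpha=0.1)
print(risk, rate)
```

If the guarantee holds at confidence level 1 − δ, the violation rate over many random splits should stay at or below δ up to sampling noise; a rate well above δ would be evidence against it.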

Figures

Figures reproduced from arXiv: 2605.17029 by Senrong Xu, Taolue Chen, Xiaoxing Ma, Yanke Zhou, Yuan Yao, Yuhao Tan, Zenan Li.

Figure 1: The Overview of CODEREFUSER.
3 Methodology, 3.1 Overview: In this work, we build our task abstention approach upon the Learn Then Test (LTT) framework Angelopoulos et al. [2025]. We choose LTT as it can provide statistical guarantees for machine learning models by simply adding a post-processing step after the model is trained. Specifically, LTT allows us to calibrate a threshold λ using a calibration set Dc…
Figure 2: Admission risk distribution on HumanEval under different risk tolerance
Figure 3: Trade-off between abstention rate and admis…
Figure 4: Admission risk distribution on HumanEval under different risk tolerance
Original abstract

Large language models (LLMs) have revolutionized automated code generation. One serious concern, however, is the so-called "hallucination", i.e., LLMs may generate seemingly plausible but functionally incorrect code. In this paper, we study the task abstention problem, i.e., determining whether a given LLM should abstain from performing a specific code generation task to avoid likely hallucination. Our approach features a calibrated abstention rule, grounded in the principles of multiple hypothesis testing. The rule assesses generation consistency through code execution outcomes, allowing it to handle syntactic diversity of semantically equivalent code without reliance on oracle test cases or external databases. We prove that our approach provides a rigorous, distribution-free theoretical guarantee on its abstention decisions. We evaluate our method on benchmark datasets using several open-source code LLMs. Results show that our method allows generative models to more accurately and efficiently identify and abstain from tasks that induce hallucination compared to existing techniques, providing a reliable mechanism for safer and more robust code generation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper proposes a task abstention framework for LLMs performing code generation. It introduces a calibrated abstention rule based on multiple hypothesis testing applied to the consistency of code execution outcomes across multiple generations. The method claims to avoid reliance on oracle test cases or external databases while handling syntactic variations in semantically equivalent code. The authors prove a distribution-free theoretical guarantee on the abstention decisions and report improved accuracy and efficiency in identifying hallucination-inducing tasks compared to baselines on benchmark datasets with open-source code LLMs.

Significance. If the consistency of execution outcomes serves as a reliable proxy for functional correctness, the approach offers a practical mechanism for safer LLM-based code generation with a distribution-free guarantee that does not depend on specific data distributions. This could reduce risks from hallucinations in automated programming. The strength lies in the theoretical grounding via multiple testing principles and the empirical evaluation, but significance is tempered by the unproven link between observed consistency and unobserved correctness.

major comments (1)
  1. The central theoretical claim of a rigorous distribution-free guarantee on abstention decisions controls type-I error rates under the null of consistent execution outcomes (via exchangeability assumptions in the multiple testing procedure). However, this does not provide any bound on the probability that consistent executions are functionally incorrect, which is required to support the motivation of abstaining to avoid hallucinations. The manuscript treats execution consistency as a direct proxy without additional analysis or bounds linking it to actual semantic correctness.
minor comments (2)
  1. Clarify in the abstract and introduction that the distribution-free guarantee applies specifically to decisions based on execution consistency rather than directly to functional correctness or hallucination avoidance.
  2. Provide more detail on the exact definition of 'consistent outcomes' (e.g., thresholds for output equivalence) and the hypothesis testing procedure, including how the significance level is chosen, to aid reproducibility.
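
The distinction drawn in the major comment can be written out explicitly. The notation below is introduced here for illustration only and is not taken from the paper: C is the event that execution outcomes are consistent, A the event that the rule abstains, W the event that the generated code is functionally incorrect, and α the chosen level.

```latex
\[
\underbrace{\Pr(A \mid C) \le \alpha}_{\text{controlled, according to the referee's reading}}
\qquad \text{versus} \qquad
\underbrace{\Pr(W \mid C)}_{\text{no bound claimed}}
\]
```

The first quantity concerns false abstention when outcomes agree; the second is the one that would need to be small for consistency to imply hallucination avoidance.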

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their insightful review and for identifying an important clarification regarding the scope of our theoretical guarantees. We address the major comment in detail below and outline the revisions we will make to strengthen the manuscript.

  1. Referee: The central theoretical claim of a rigorous distribution-free guarantee on abstention decisions controls type-I error rates under the null of consistent execution outcomes (via exchangeability assumptions in the multiple testing procedure). However, this does not provide any bound on the probability that consistent executions are functionally incorrect, which is required to support the motivation of abstaining to avoid hallucinations. The manuscript treats execution consistency as a direct proxy without additional analysis or bounds linking it to actual semantic correctness.

    Authors: We thank the referee for this precise observation. Our theoretical result establishes a distribution-free control on the Type I error of the multiple-testing procedure under the null hypothesis of exchangeable (consistent) execution outcomes across generations. This guarantees that the probability of falsely detecting inconsistency, and thus abstaining, when executions are in fact consistent is bounded at the desired level, without distributional assumptions. We agree that this guarantee does not extend to bounding the conditional probability that executions are functionally incorrect given observed consistency. Execution consistency is introduced as a practical, oracle-free proxy for semantic correctness, motivated by the fact that semantically equivalent programs tend to produce matching outputs on the same test inputs, whereas hallucinations frequently produce divergent or erroneous behavior. The manuscript does not claim a theoretical bound on P(incorrect | consistent), as deriving such a bound would require additional assumptions on the code distribution or access to ground-truth oracles, which our framework explicitly avoids. In the revised manuscript we will add a new subsection in the theoretical analysis that explicitly states the scope of the guarantees (control of false abstention under consistency) and distinguishes it from the empirical correlation between consistency and correctness. We will also augment the experimental section with quantitative analysis of this correlation across the evaluated benchmarks and models. These changes will make the relationship between the theoretical claims and the motivating application fully transparent.

    revision: yes

Circularity Check

0 steps flagged

No circularity: guarantee derived from standard multiple hypothesis testing on consistency


The paper's central derivation applies principles of multiple hypothesis testing to assess consistency of code execution outcomes across generations, yielding a distribution-free guarantee on abstention decisions under exchangeability. This relies on general statistical theory rather than any self-citation chain, fitted parameters renamed as predictions, or redefinition of inputs. The abstention rule is calibrated to control error rates for the consistency null without reducing the theoretical claim to observed data by construction. The interpretive link from consistency to hallucination avoidance is an external assumption, not a load-bearing definitional step in the proof.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on standard statistical principles applied to execution outcomes; no new entities are introduced and free parameters appear limited to the choice of significance level in the testing procedure.

free parameters (1)
  • significance level for hypothesis testing
    The calibrated abstention rule requires choosing a threshold or alpha level to control the error rate in the multiple testing framework.
axioms (1)
  • [standard math] Principles of multiple hypothesis testing yield distribution-free guarantees when applied to consistency checks
    Invoked to establish the theoretical guarantee on abstention decisions without reliance on specific data distributions.
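
A minimal sketch of how a threshold could be calibrated in the Learn Then Test style that the paper says it builds on. Several pieces are assumptions made for illustration: the per-task loss (1 when a task is admitted and its code is wrong, 0 otherwise), the plain Hoeffding p-value, the fixed-sequence ordering from strictest to loosest threshold, and the toy calibration data. The paper's actual risk definition and testing procedure may differ.

```python
import math
from typing import Optional, Sequence


def hoeffding_p_value(empirical_risk: float, n: int, alpha: float) -> float:
    """p-value for the null 'true risk > alpha', valid for losses in [0, 1]."""
    gap = max(alpha - empirical_risk, 0.0)
    return math.exp(-2.0 * n * gap * gap)


def calibrate_threshold(
    scores: Sequence[float],
    is_wrong: Sequence[bool],
    alpha: float,
    delta: float,
    grid: Sequence[float],
) -> Optional[float]:
    """Return the loosest threshold certified by fixed-sequence testing, or None."""
    n = len(scores)
    chosen = None
    for lam in sorted(grid, reverse=True):  # strictest threshold first
        # Loss is 1 only when a task is admitted (score >= lam) and wrong.
        risk = sum(1.0 for s, w in zip(scores, is_wrong) if s >= lam and w) / n
        if hoeffding_p_value(risk, n, alpha) > delta:
            break  # stop at the first threshold that cannot be certified
        chosen = lam
    return chosen


# Toy calibration set: 200 tasks with a consistency score and a correctness flag.
scores = [1.0] * 120 + [0.6] * 40 + [0.2] * 40
is_wrong = [True] * 6 + [False] * 114 + [True] * 20 + [False] * 20 + [True] * 36 + [False] * 4
lam_hat = calibrate_threshold(scores, is_wrong, alpha=0.2, delta=0.1, grid=[0.1, 0.5, 0.9])
print(lam_hat)
```

At deployment the model would abstain on any task whose score falls below the returned threshold, and abstain on everything if the function returns None. The loss here is normalized by all calibration tasks so that each task contributes a bounded loss; normalizing by admitted tasks only would need a different p-value.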

pith-pipeline@v0.9.0 · 5715 in / 1208 out tokens · 49033 ms · 2026-05-19T20:00:58.915521+00:00 · methodology


Reference graph

Works this paper leans on

42 extracted references · 42 canonical work pages · 8 internal anchors

  1. [1]

    doi: 10.1126/science.abq1158

    ISSN 1095-9203. doi: 10.1126/science.abq1158. URL http://dx.doi.org/10.1126/science.abq1158. Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, et al. Evaluating Large Language Models Trained on Code, July

  2. [2]

    Code Llama: Open Foundation Models for Code

    Baptiste Roziere, Jonas Gehring, Fabian Gloeckle, Sten Sootla, Itai Gat, Xiaoqing Ellen Tan, Yossi Adi, Jingyu Liu, Romain Sauvestre, Tal Remez, et al. Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950,

  3. [3]

    A survey on hallucination in large language models: Principles, taxonomy, challenges, and open questions,

    Lei Huang, Weijiang Yu, Weitao Ma, Weihong Zhong, Zhangyin Feng, Haotian Wang, Qianglong Chen, Weihua Peng, Xiaocheng Feng, Bing Qin, and Ting Liu. A survey on hallucination in large language models: Principles, taxonomy, challenges, and open questions. ACM Transactions on Information Systems, 43(2):1–55, January 2025a. ISSN 1558-2868. doi: 10.1145/3703155...

  4. [4]

    Selfcheckgpt: Zero-resource black-box hallucination detection for generative large language models

    Potsawee Manakul, Adian Liusie, and Mark Gales. Selfcheckgpt: Zero-resource black-box hallucination detection for generative large language models. In The 2023 Conference on Empirical Methods in Natural Language Processing,

  5. [5]

    CodeT: Code Generation with Generated Tests

    URL https://arxiv.org/abs/2207.10397. Lorenz Kuhn, Yarin Gal, and Sebastian Farquhar. Semantic uncertainty: Linguistic invariances for uncertainty estimation in natural language generation. In The Eleventh International Conference on Learning Representations,

  6. [6]

    Program Synthesis with Large Language Models

    Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie Cai, Michael Terry, Quoc Le, et al. Program synthesis with large language models. arXiv preprint arXiv:2108.07732,

  7. [7]

    DeepSeek-Coder: When the Large Language Model Meets Programming -- The Rise of Code Intelligence

    Yuheng Huang, Jiayang Song, Zhijie Wang, Shengming Zhao, Huaming Chen, Felix Juefei-Xu, and Lei Ma. Look before you leap: An exploratory study of uncertainty analysis for large language models. IEEE Transactions on Software Engineering, 2025b. Daya Guo, Qihao Zhu, Dejian Yang, Zhenda Xie, Kai Dong, Wentao Zhang, Guanting Chen, Xiao Bi, Yu Wu, YK Li, et al....

  8. [8]

    Binyuan Hui, Jian Yang, Zeyu Cui, Jiaxi Yang, Dayiheng Liu, Lei Zhang, Tianyu Liu, Jiajun Zhang, Bowen Yu, Keming Lu, et al. Qwen2.5-coder technical report. arXiv preprint arXiv:2409.12186,

  9. [9]

    A Study of LLMs' Preferences for Libraries and Programming Languages

    URL https://openreview.net/forum?id=30XanJanJP. Lukas Twist, Jie M Zhang, Mark Harman, Don Syme, Joost Noppen, and Detlef Nauck. Llms love python: A study of llms’ bias for programming languages and libraries. arXiv preprint arXiv:2503.17181,

  10. [10]

    Angela Fan, Beliz Gokkaya, Mark Harman, Mitya Lyubarskiy, Shubho Sengupta, Shin Yoo, and Jie M. Zhang. Large language models for software engineering: Survey and open problems. In 2023 IEEE/ACM International Conference on Software Engineering: Future of Software Engineering (ICSE-FoSE), pages 31–53, 2023a. doi: 10.1109/ICSE-FoSE59343.2023.00008. Vibhor Aga...

  11. [11]

    De-Hallucinator: Mitigating LLM Hallucinations in Code Generation Tasks via Iterative Grounding, 2024

    Aryaz Eghbali and Michael Pradel. De-hallucinator: Mitigating llm hallucinations in code generation tasks via iterative grounding. arXiv preprint arXiv:2401.01701,

  12. [12]

    Nguyen, Wenbo Wang, and Shaohua Wang

    URL https://openreview.net/forum?id=qPUbKxKvXq. Zhiyu Fan, Xiang Gao, Martin Mirchev, Abhik Roychoudhury, and Shin Hwei Tan. Automated repair of programs from large language models. In Proceedings of the 45th International Conference on Software Engineering, ICSE ’23, page 1469–1481. IEEE Press, 2023b. ISBN 9781665457019. doi: 10.1109/ICSE48619.2023.00128. ...

  13. [13]

    Fang Liu, Yang Liu, Lin Shi, Houkun Huang, Ruifeng Wang, Zhen Yang, Li Zhang, Zhongqi Li, and Yuchi Ma

    URL https://arxiv.org/abs/2406.08731. Fang Liu, Yang Liu, Lin Shi, Houkun Huang, Ruifeng Wang, Zhen Yang, Li Zhang, Zhongqi Li, and Yuchi Ma. Exploring and evaluating hallucinations in llm-powered code generation. arXiv preprint arXiv:2404.00971, 2024a. Ziyao Zhang, Chong Wang, Yanlin Wang, Ensheng Shi, Yuchi Ma, Wanjun Zhong, Jiachi Chen, Mingzhi Mao, an...

  14. [14]

    URL https://doi.org/10.1145/3728894

    doi: 10.1145/3728894. URL https://doi.org/10.1145/3728894. Federico Cassano, John Gouwar, Daniel Nguyen, Sydney Nguyen, Luna Phipps-Costin, Donald Pinckney, Ming-Ho Yee, Yangtian Zi, Carolyn Jane Anderson, Molly Q Feldman, Arjun Guha, Michael Greenberg, and Abhinav Jangda. Multipl-e: A scalable and polyglot approach to benchmarking neural code...

  15. [15]

    MultiPL-E:

    ISSN 0098-5589. doi: 10.1109/TSE.2023.3267446. URL https://doi.org/10.1109/TSE.2023.3267446. Yue Liu, Thanh Le-Cong, Ratnadira Widyasari, Chakkrit Tantithamthavorn, Li Li, Xuan-Bach D. Le, and David Lo. Refining chatgpt-generated code: Characterizing and mitigating code quality issues. ACM Trans. Softw. Eng. Methodol., 33(5), June 2024b. ISSN 1049-331X. d...

  16. [16]

    URL https://doi.org/10.1145/3660810

    doi: 10.1145/3660810. URL https://doi.org/10.1145/3660810. Bingbing Wen, Jihan Yao, Shangbin Feng, Chenjun Xu, Yulia Tsvetkov, Bill Howe, and Lucy Lu Wang. Know your limits: A survey of abstention in large language models. Transactions of the Association for Computational Linguistics, 13:529–556,

  17. [17]

    URL https://aclanthology.org/2025.tacl-1.26/

    doi: 10.1162/tacl_a_00754. URL https://aclanthology.org/2025.tacl-1.26/. Neeraj Varshney, Pavel Dolin, Agastya Seth, and Chitta Baral. The art of defending: A systematic evaluation and analysis of LLM defense strategies on safety and over-defensiveness. In Lun-Wei Ku, Andre Martins, and Vivek Srikumar, editors, Findings of the Association for Computational...

  18. [18]

    doi: 10.18653/v1/2024.findings-acl.776

    Association for Computational Linguistics. doi: 10.18653/v1/2024.findings-acl.776. URL https://aclanthology.org/2024.findings-acl.776/. Yuxia Wang, Haonan Li, Xudong Han, Preslav Nakov, and Timothy Baldwin. Do-not-answer: Evaluating safeguards in LLMs. In Yvette Graham and Matthew Purver, editors, Findings of the Association for Computational Linguistics: ...

  19. [19]

    URL https://aclanthology.org/2024.findings-eacl.61/

    Association for Computational Linguistics. URL https://aclanthology.org/2024.findings-eacl.61/. Hanning Zhang, Shizhe Diao, Yong Lin, Yi Fung, Qing Lian, Xingyao Wang, Yangyi Chen, Heng Ji, and Tong Zhang. R-tuning: Instructing large language models to say ‘I don’t know’. In Kevin Duh, Helena Gomez, and Steven Bethard, editors, Proceedings of the 2024 Conf...

  20. [20]

    doi: 10.18653/v1/2024.naacl-long.394

    Association for Computational Linguistics. doi: 10.18653/v1/2024.naacl-long.394. URL https://aclanthology.org/2024.naacl-long.394/. Gustaf Ahdritz, Tian Qin, Nikhil Vyas, Boaz Barak, and Benjamin L. Edelman. Distinguishing the knowable from the unknowable with language models. ICML’24. JMLR.org,

  21. [21]

    Minsu Kim and James Thorne. Epistemology of language models: Do language models have holistic knowledge? In Lun-Wei Ku, Andre Martins, and Vivek Srikumar, editors, Findings of the Association for Computational Linguistics: ACL 2024, pages 12644–12669, Bangkok, Thailand, August

  22. [22]

    doi: 10.18653/v1/2024.findings-acl.751

    Association for Computational Linguistics. doi: 10.18653/v1/2024.findings-acl.751. URL https://aclanthology.org/2024.findings-acl.751/. Lang Cao. Learn to refuse: Making large language models more controllable and reliable through knowledge scope limitation and refusal mechanism. In Yaser Al-Onaizan, Mohit Bansal, and Yun-Nung Chen, editors, Proceedings of ...

  23. [23]

    doi: 10.18653/v1/2024.emnlp-main.212

    Association for Computational Linguistics. doi: 10.18653/v1/2024.emnlp-main.212. URL https://aclanthology.org/2024.emnlp-main.212/. Ella Neeman, Roee Aharoni, Or Honovich, Leshem Choshen, Idan Szpektor, and Omri Abend. DisentQA: Disentangling parametric and contextual knowledge with counterfactual question answering. In Anna Rogers, Jordan Boyd-Graber, an...

  24. [24]

    doi: 10.18653/v1/2023.acl-long.559

    Association for Computational Linguistics. doi: 10.18653/v1/2023.acl-long.559. URL https://aclanthology.org/2023.acl-long.559/. Yuqing Yang, Ethan Chern, Xipeng Qiu, Graham Neubig, and Pengfei Liu. Alignment for honesty. In The Thirty-eighth Annual Conference on Neural Information Processing Systems,

  25. [25]

    URL https://openreview.net/forum?id=8s8K2UZGTZ

    ISSN 2835-8856. URL https://openreview.net/forum?id=8s8K2UZGTZ. Katherine Tian, Eric Mitchell, Allan Zhou, Archit Sharma, Rafael Rafailov, Huaxiu Yao, Chelsea Finn, and Christopher D Manning. Just ask for calibration: Strategies for eliciting calibrated confidence scores from language models fine-tuned with human feedback. In The 2023 Conferen...

  26. [26]

    doi: 10.18653/v1/2024.acl-long.786

    Association for Computational Linguistics. doi: 10.18653/v1/2024.acl-long.786. URL https://aclanthology.org/2024.acl-long.786/. Bocheng Chen, Advait Paliwal, and Qiben Yan. Jailbreaker in jail: Moving target defense for large language models. In Proceedings of the 10th ACM Workshop on Moving Target Defense, MTD ’23, page 29–32, New York, NY, USA,

  27. [27]

    ISBN 9798400702563

    Association for Computing Machinery. ISBN 9798400702563. doi: 10.1145/3605760.3623764. URL https://doi.org/10.1145/3605760.3623764. Ryan J. Tibshirani, Rina Foygel Barber, Emmanuel J. Candès, and Aaditya Ramdas. Conformal prediction under covariate shift. Curran Associates Inc., Red Hook, NY, USA,

  28. [28]

    URL https://doi.org/10.1214/23-AOS2276

    doi: 10.1214/23-AOS2276. URL https://doi.org/10.1214/23-AOS2276. Isaac Gibbs and Emmanuel Candes. Adaptive conformal inference under distribution shift. In A. Beygelzimer, Y. Dauphin, P. Liang, and J. Wortman Vaughan, editors, Advances in Neural Information Processing Systems,

  29. [29]

    Anastasios Nikolas Angelopoulos, Stephen Bates, Adam Fisch, Lihua Lei, and Tal Schuster

    URL https://openreview.net/forum?id=6vaActvpcp3. Anastasios Nikolas Angelopoulos, Stephen Bates, Adam Fisch, Lihua Lei, and Tal Schuster. Conformal risk control. In The Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11,

  30. [30]

    Jiahui Geng, Fengyu Cai, Yuxia Wang, Heinz Koeppl, Preslav Nakov, and Iryna Gurevych

    URL https://openreview.net/forum?id=33XGfHLtZg. Jiahui Geng, Fengyu Cai, Yuxia Wang, Heinz Koeppl, Preslav Nakov, and Iryna Gurevych. A survey of confidence estimation and calibration in large language models. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Vo...

  31. [31]

    Sparks of Artificial General Intelligence: Early experiments with GPT-4

    Sébastien Bubeck, Varun Chandrasekaran, Ronen Eldan, Johannes Gehrke, Eric Horvitz, Ece Kamar, Peter Lee, Yin Tat Lee, Yuanzhi Li, Scott Lundberg, et al. Sparks of artificial general intelligence: Early experiments with GPT-4. arXiv preprint arXiv:2303.12712,

  32. [32]

    Language Models (Mostly) Know What They Know

    Saurav Kadavath, Tom Conerly, Amanda Askell, Tom Henighan, Dawn Drain, Ethan Perez, Nicholas Schiefer, Zac Hatfield-Dodds, Nova DasSarma, Eli Tran-Johnson, et al. Language models (mostly) know what they know. arXiv preprint arXiv:2207.05221,

  33. [33]

    Cycles of thought: Measuring LLM confidence through stable explanations

    Evan Becker and Stefano Soatto. Cycles of thought: Measuring LLM confidence through stable explanations. arXiv preprint arXiv:2406.03441,

  34. [34]

    Universal self-consistency for large language models

    Xinyun Chen, Renat Aksitov, Uri Alon, Jie Ren, Kefan Xiao, Pengcheng Yin, Sushant Prakash, Charles Sutton, Xuezhi Wang, and Denny Zhou. Universal self-consistency for large language models. In ICML 2024 Workshop on In-Context Learning,

  35. [35]

    [2023a], presents a challenge to ensure the accuracy, reliability and security of AI-generated code Agarwal et al

    A Related Work. Code Hallucination. The phenomenon of code hallucination, where LLMs generate code that is illogical, incorrect, or unfaithful to user requirements Fan et al. [2023a], presents a challenge to ensure the accuracy, reliability and security of AI-generated code Agarwal et al. [2024], Eghbali and Pradel [2024], Tian et al. [2025]. Existing resea...

  36. [36]

    Different from the above work that focuses on the sample-level hallucination problem, we study the task abstention problem

    introduced a framework where the LLM proactively asks clarifying questions to help users refine their initial prompts. Different from the above work that focuses on the sample-level hallucination problem, we study the task abstention problem. LLM Abstention. Abstention is increasingly recognized for its potential to mitigate hallucination and enhance safet...

  37. [37]

    honesty” alignment datasets by substituting a model’s incorrect response with “I don’t know

    construct “honesty” alignment datasets by substituting a model’s incorrect response with “I don’t know” and then fine-tuning on this revised data. At inference time, a common approach is to use post-processing techniques based on model uncertainty. These include calculating the log probability of a ‘True’ token via indirect logit methods Lin et al. [2022]...

  38. [38]

    [1998], Blei et al

    derives confidence from the probabilities assigned to generated tokens, employing the geometric mean (i.e., perplexity Chen et al. [1998], Blei et al. [2003]) to mitigate sensitivity to output length; verbalized confidence Kadavath et al. [2022], Xiong et al. [2024], Tian et al

  39. [39]

    Read the question and give your answer and corresponding confidence score

    directly prompts the LLM to explicitly express its confidence alongside its answer (e.g., “Read the question and give your answer and corresponding confidence score”); self-consistency confidence Xiong et al. [2024], Abbasi Yadkori et al. [2024], Becker and Soatto

  40. [40]

    [2022], Chen et al

    assesses confidence by having the LLM generate multiple answers for the same input and then measuring the consistency among them Wang et al. [2022], Chen et al. [2024], Cheng et al. [2024], with higher consistency indicating greater confidence. B The LTT Framework The Learn Then Test (LTT) framework Angelopoulos et al

  41. [41]

    Consider the task where each instance x ∈ X is associated with a ground-truth label y ∈ Y

    is designed to provide statistical guarantees for machine learning models by simply adding a post-processing step on a calibration set after the model is trained. Consider the task where each instance x ∈ X is associated with a ground-truth label y ∈ Y. Let D_cal = {(x_i, y_i)}_{i=1}^{m} ⊆ X × Y be a calibration set composed of the input x and its ground-truth lab...

  42. [42]

    To evaluate the sensitivity of CODEREFUSER to this hyperparameter, we conducted additional experiments setting k = 5

    In the main experiments, we defined the risk based on the pass rate H@k with k = 3. To evaluate the sensitivity of CODEREFUSER to this hyperparameter, we conducted additional experiments setting k = 5. This adjustment implies a slightly more relaxed criterion for success, effectively allowing the model more attempts to yield a correct solution. We re-calib...