Recognition: 2 theorem links · Lean Theorem
Ensemble-Based Uncertainty Estimation for Code Correctness Estimation
Pith reviewed 2026-05-14 22:49 UTC · model grok-4.3
The pith
Measuring semantic consistency across samples pooled from multiple models detects incorrect code more reliably than consistency among repeated samples from a single model.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Ensemble Semantic Entropy estimates uncertainty for code generation by measuring the consistency of semantics across samples from multiple different models rather than from one model. On LiveCodeBench this measure correlates more strongly with program correctness, raises prediction accuracy by 53.4 percent in selective generation under strict false-positive constraints, and powers a cascading test-time scaling framework that preserves performance while cutting FLOPs by 64.9 percent relative to uniform single-model scaling.
What carries the argument
Ensemble Semantic Entropy (ESE), which quantifies uncertainty by checking semantic consistency of generated programs aggregated across an ensemble of distinct models.
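The review does not reproduce the paper's formal definition, but the idea can be sketched. A minimal illustration, under the assumption that programs are bucketed into semantic-equivalence clusters (e.g., keyed by identical outputs on a shared battery of test inputs; the `semantic_key` mechanism here is hypothetical, not the paper's procedure):

```python
import math
from collections import Counter

def ensemble_semantic_entropy(samples, semantic_key):
    """Shannon entropy over semantic-equivalence classes of programs
    pooled across an ensemble of models (illustrative sketch only).

    samples      -- programs pooled from every model in the ensemble
    semantic_key -- maps a program to a hashable behavior signature,
                    e.g. its outputs on a fixed set of test inputs
    """
    clusters = Counter(semantic_key(s) for s in samples)
    total = sum(clusters.values())
    # H = -sum_c p(c) * log p(c), with p(c) estimated by cluster frequency
    return -sum((n / total) * math.log(n / total)
                for n in clusters.values())

# Toy run: six pooled samples falling into two behavior clusters.
pooled = ["a", "a", "a", "a", "b", "b"]  # stand-ins for generated programs
h = ensemble_semantic_entropy(pooled, semantic_key=lambda p: p)
```

Low entropy means the pooled samples agree semantically; under the paper's premise, high entropy flags likely-incorrect generations.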
If this is right
- Selective acceptance of generated code achieves higher accuracy at the same strict false-positive rate.
- A cascading procedure can apply additional inference steps only when ensemble disagreement is high, cutting total compute while preserving output quality.
- Uncertainty estimates improve when models with different training histories are combined, rather than when repeated samples are drawn from a single model.
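The cascading procedure in the second bullet can be pictured as follows. This is a sketch under assumptions, not the paper's Cas implementation: the model callables, the entropy function, and the threshold `tau` are all hypothetical.

```python
from collections import Counter

def cascade_generate(problem, cheap_models, strong_model, entropy_fn, tau):
    """Escalate to the expensive model only when the cheap ensemble's
    disagreement (measured by entropy_fn) exceeds the threshold tau."""
    samples = [model(problem) for model in cheap_models]
    if entropy_fn(samples) <= tau:
        # Ensemble agrees: return the most common (consensus) sample.
        return Counter(samples).most_common(1)[0][0]
    # High uncertainty: spend the extra inference compute.
    return strong_model(problem)
```

Most problems resolve with the cheap ensemble alone, which is where the claimed FLOPs savings would come from; only high-disagreement problems pay for the stronger model.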
Where Pith is reading between the lines
- The same ensemble disagreement signal could be tested on other structured generation tasks such as theorem proving or data transformation scripts.
- Experiments that vary the degree of model diversity while holding ensemble size fixed would show how much training difference is required for the method to work.
Load-bearing premise
Different models will disagree more on incorrect programs than on correct ones without their architectural or training differences creating new shared mistakes.
What would settle it
A benchmark showing that an ensemble of models produces highly consistent but incorrect code outputs at rates comparable to or higher than a single model would demonstrate that the ensemble adds no reliable correctness signal.
Original abstract
Large language models (LLMs) have demonstrated remarkable capabilities in generating programs from natural language descriptions, yet ensuring their correctness without an external oracle remains a critical challenge. To solve the challenge, existing methods often rely on uncertainty estimation, measuring the consistency of semantics or execution behaviors across multiple samples generated by a single model. However, we observe that a single model can often converge to a consistent but incorrect solution, rendering such consistency-based proxies ineffective. To address this, we propose Ensemble Semantic Entropy (ESE), which estimates uncertainty by evaluating the consistency of samples aggregated across an ensemble of models. Experiments on LiveCodeBench demonstrate that ESE correlates more strongly with program correctness than single-model semantic entropy. Notably, in selective generation tasks with strict false-positive rate constraints, ESE improves prediction accuracy by 53.4%. Furthermore, by leveraging ESE as the decision signal, we propose a cascading test-time scaling framework Cas, which maintains performance while reducing FLOPs by 64.9% compared to single-model scaling, offering a new perspective on balancing parameter and inference scaling.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes Ensemble Semantic Entropy (ESE) as an uncertainty estimator for LLM code generation, computing semantic consistency across samples from an ensemble of models instead of a single model. It claims stronger correlation with program correctness on LiveCodeBench than single-model semantic entropy, a 53.4% accuracy lift in selective generation under strict false-positive-rate constraints, and a cascading test-time scaling method (Cas) that preserves performance while cutting FLOPs by 64.9% relative to single-model scaling.
Significance. If the reported gains are reproducible and not artifacts of unmatched model capabilities, ESE could supply a practical signal for reliable selective prediction and efficient test-time scaling in code generation. The work highlights a concrete limitation of single-model consistency measures and offers an ensemble-based alternative that may better surface semantic errors.
major comments (3)
- [Abstract and §4] Abstract and §4 (Experiments): the 53.4% accuracy improvement and 64.9% FLOPs reduction are stated without any description of the ensemble models (sizes, training distributions, or capability matching), baseline implementations, statistical tests, or error bars. These omissions render the central empirical claims unevaluable from the supplied text.
- [§3 and §5] §3 (ESE definition) and §5 (Cas framework): the claim that ESE isolates semantic diversity (rather than systematic capability differences) is load-bearing for both the correlation result and the Cas efficiency gain, yet no ablation or control is reported that holds model capability fixed while varying ensemble diversity. Without such evidence the reported improvements could be driven by the ensemble simply including stronger models on certain problems.
- [§4.2] §4.2 (selective generation): the strict false-positive-rate constraint is central to the 53.4% figure, but the manuscript supplies neither the exact threshold values used nor the procedure for computing the operating point, preventing verification that the comparison is fair across single-model and ensemble signals.
minor comments (2)
- [§3] Notation for semantic entropy and ESE should be defined once in a single equation block rather than re-introduced in prose.
- [Figures 2-4] Figure captions for the correlation and scaling plots should include the exact number of problems, models, and samples per model.
Simulated Author's Rebuttal
We thank the referee for their thoughtful and constructive review. The comments highlight important areas for improving clarity and reproducibility, which we will address in the revised manuscript. Below we respond point-by-point to the major comments.
Point-by-point responses
-
Referee: [Abstract and §4] Abstract and §4 (Experiments): the 53.4% accuracy improvement and 64.9% FLOPs reduction are stated without any description of the ensemble models (sizes, training distributions, or capability matching), baseline implementations, statistical tests, or error bars. These omissions render the central empirical claims unevaluable from the supplied text.
Authors: We agree that the current text lacks sufficient detail on these elements. In the revision we will expand the abstract and §4 to explicitly list the ensemble models (including sizes and training distributions), describe how capabilities were matched across the ensemble, provide full baseline implementation details, report statistical tests (e.g., paired t-tests with p-values), and include error bars on all key metrics including the 53.4% accuracy lift and 64.9% FLOPs reduction. These additions will make the empirical claims fully evaluable. revision: yes
-
Referee: [§3 and §5] §3 (ESE definition) and §5 (Cas framework): the claim that ESE isolates semantic diversity (rather than systematic capability differences) is load-bearing for both the correlation result and the Cas efficiency gain, yet no ablation or control is reported that holds model capability fixed while varying ensemble diversity. Without such evidence the reported improvements could be driven by the ensemble simply including stronger models on certain problems.
Authors: This is a valid concern. While our ensemble was constructed from models with comparable per-problem performance on LiveCodeBench, we did not include an explicit ablation that holds capability fixed. In the revised §5 we will add a controlled ablation comparing ESE under matched-capability ensembles versus ensembles that deliberately vary capability, demonstrating that the correlation and efficiency gains persist when capability differences are minimized. This will directly address whether the benefits arise from semantic diversity. revision: yes
-
Referee: [§4.2] §4.2 (selective generation): the strict false-positive-rate constraint is central to the 53.4% figure, but the manuscript supplies neither the exact threshold values used nor the procedure for computing the operating point, preventing verification that the comparison is fair across single-model and ensemble signals.
Authors: We will revise §4.2 to specify the exact FPR thresholds (e.g., 0.05 and 0.10) and the full procedure for determining the operating point, including how thresholds were selected on a held-out validation split to enforce the FPR constraint while maximizing accuracy. This will allow direct verification that the single-model versus ensemble comparison is performed under identical constraints. revision: yes
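The threshold-selection procedure the authors describe (pick the operating point on a held-out validation split so the false-positive-rate constraint holds) might look roughly like this. The FPR definition used here, the fraction of accepted programs that are actually incorrect, is our assumption, since the manuscript's exact definition is not quoted:

```python
def pick_threshold(scores, labels, max_fpr):
    """Return the loosest acceptance threshold on uncertainty scores
    (lower score = more confident) whose validation FPR stays within
    max_fpr. A program is accepted when score <= threshold; here a
    false positive is an accepted-but-incorrect program.
    """
    for t in sorted(set(scores), reverse=True):   # loosest threshold first
        accepted = [correct for s, correct in zip(scores, labels) if s <= t]
        if not accepted:
            continue
        fpr = sum(1 for correct in accepted if not correct) / len(accepted)
        if fpr <= max_fpr:
            return t                              # accepts the most programs
    return None  # no threshold satisfies the constraint
```

The chosen threshold would then be frozen and applied to the test split, keeping the single-model and ensemble signals under an identical constraint.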
Circularity Check
No circularity: empirical proposal of ESE metric and Cas framework
Full rationale
The paper proposes Ensemble Semantic Entropy (ESE) as an uncertainty measure based on cross-model sample consistency and reports experimental results on LiveCodeBench showing stronger correlation with program correctness than single-model semantic entropy, plus accuracy gains under FPR constraints and FLOPs reductions via the Cas cascading framework. No mathematical derivation chain, equations, or first-principles results are claimed. The central claims rest on empirical observations from experiments rather than any reduction to fitted inputs, self-definitions, or self-citation chains. The work is self-contained as a novel metric and framework without load-bearing self-referential steps.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: an ensemble of models produces sufficiently diverse samples to expose incorrect-but-consistent solutions that a single model would miss.
invented entities (1)
- Ensemble Semantic Entropy (ESE): no independent evidence
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Unclear: the relation between the paper passage and the cited Recognition theorem.
"we propose Ensemble Semantic Entropy (ESE), which estimates uncertainty by evaluating the consistency of samples aggregated across an ensemble of models... H_ESE(x) := −∑_s p(s|x,M) log p(s|x,M)"
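For readability, the entropy expression quoted above, set in LaTeX (with \(\mathcal{M}\) standing for the pooled ensemble of models, as the passage suggests):

```latex
H_{\mathrm{ESE}}(x) := -\sum_{s} p(s \mid x, \mathcal{M}) \, \log p(s \mid x, \mathcal{M})
```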
- IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
Unclear: the relation between the paper passage and the cited Recognition theorem.
"Experiments on LiveCodeBench demonstrate that ESE correlates more strongly with program correctness than single-model semantic entropy"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
- [1]
-
[2]
Anonymous. 2026. Anonymous Artifact for Ensemble Semantic Entropy. https://anonymous.4open.science/r/Ensemble-Semantic-Entropy-F376. Anonymous repository for peer review.
work page 2026
-
[3]
Anthropic. 2026. Claude Code: AI-powered coding assistant for developers. Product page. https://claude.com/product/claude-code Accessed: 2026-03-24
work page 2026
- [4]
-
[5]
Bei Chen, Fengji Zhang, Anh Nguyen, Daoguang Zan, Zeqi Lin, Jian-Guang Lou, and Weizhu Chen. 2023. CodeT: Code Generation with Generated Tests. In The Eleventh International Conference on Learning Representations. https://openreview.net/forum?id=ktrw68Cmu9c
work page 2023
- [6]
-
[7]
Xinyun Chen, Maxwell Lin, Nathanael Schärli, and Denny Zhou. 2023. Teaching Large Language Models to Self-Debug. arXiv:2304.05128 [cs.CL] https://arxiv.org/abs/2304.05128
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[8]
Xiancai Chen, Zhengwei Tao, Kechi Zhang, Changzhi Zhou, Xinyu Zhang, Wanli Gu, Yuanpeng He, Mengdi Zhang, Xunliang Cai, Haiyan Zhao, and Zhi Jin. Revisit Self-Debugging with Self-Generated Tests for Code Generation. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Wanxiang Che, Joyce Nabende, Ekaterina Shutova, and Mohammad Taher Pilehvar (Eds.). Association for Computational Linguistics, Vienna, Austria, 18003–18023. doi:10.18653/v1...
- [9]
-
[10]
Cursor. 2026. Cursor Agent. Product page. https://cursor.com/product Accessed: 2026-03-24
work page 2026
- [11]
- [12]
-
[13]
Sebastian Farquhar, Jannik Kossen, Lorenz Kuhn, and Yarin Gal. 2024. Detecting hallucinations in large language models using semantic entropy. Nature 630, 8017 (June 2024), 625–630. doi:10.1038/s41586-024-07421-0
- [14]
-
[15]
Lei Huang, Weijiang Yu, Weitao Ma, Weihong Zhong, Zhangyin Feng, Haotian Wang, Qianglong Chen, Weihua Peng, Xiaocheng Feng, Bing Qin, and Ting Liu. 2025. A Survey on Hallucination in Large Language Models: Principles, Taxonomy, Challenges, and Open Questions. ACM Trans. Inf. Syst. 43, 2, Article 42 (Jan. 2025), 55 pages. doi:10.1145/3703155
-
[16]
Naman Jain, King Han, Alex Gu, Wen-Ding Li, Fanjia Yan, Tianjun Zhang, Sida Wang, Armando Solar-Lezama, Koushik Sen, and Ion Stoica. 2024. LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code. arXiv:2403.07974 [cs.SE] https://arxiv.org/abs/2403.07974
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[17]
Adam Tauman Kalai, Ofir Nachum, Santosh S. Vempala, and Edwin Zhang. 2025. Why Language Models Hallucinate. arXiv:2509.04664 [cs.CL] https://arxiv.org/abs/2509.04664
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[18]
Alex Kendall and Yarin Gal. 2017. What Uncertainties Do We Need in Bayesian Deep Learning for Computer Vision? arXiv:1703.04977 [cs.CV] https://arxiv.org/abs/1703.04977
work page internal anchor Pith review Pith/arXiv arXiv 2017
-
[19]
Lorenz Kuhn, Yarin Gal, and Sebastian Farquhar. 2023. Semantic Uncertainty: Linguistic Invariances for Uncertainty Estimation in Natural Language Generation. arXiv:2302.09664 [cs.CL] https://arxiv.org/abs/2302.09664
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[20]
Balaji Lakshminarayanan, Alexander Pritzel, and Charles Blundell. 2017. Simple and Scalable Predictive Uncertainty Estimation using Deep Ensembles. arXiv:1612.01474 [stat.ML] https://arxiv.org/abs/1612.01474
work page internal anchor Pith review Pith/arXiv arXiv 2017
-
[21]
Rémi Leblond, Felix Gimeno, and Florent Altché. 2023. AlphaCode 2 Technical Report. https://storage.googleapis.com/deepmind-media/AlphaCode2/AlphaCode2_Tech_Report.pdf
work page 2023
-
[22]
Dacheng Li, Shiyi Cao, Chengkun Cao, Xiuyu Li, Shangyin Tan, Kurt Keutzer, Jiarong Xing, Joseph E. Gonzalez, and Ion Stoica. 2025. S*: Test Time Scaling for Code Generation. arXiv:2502.14382 [cs.LG] https://arxiv.org/abs/2502.14382
- [23]
-
[24]
Yujia Li, David Choi, Junyoung Chung, Nate Kushman, Julian Schrittwieser, Rémi Leblond, Tom Eccles, James Keeling, Felix Gimeno, Agustin Dal Lago, Thomas Hubert, Peter Choy, Cyprien de Masson d’Autume, Igor Babuschkin, Xinyun Chen, Po-Sen Huang, Johannes Welbl, Sven Gowal, Alexey Cherepanov, James Molloy, Daniel J. Mankowitz, Esme Sutherland Robson, Pushm...
-
[25]
Yunhao Liang, Ruixuan Ying, Takuya Taniguchi, Chengguang Gan, and Zhe Cui. RECODE: Leveraging Reliable Self-generated Tests and Fine-Grained Execution Feedback to Enhance LLM-Based Code Generation. In Advanced Intelligent Computing Technology and Applications, De-Shuang Huang, Bo Li, Haiming Chen, and Chuanlei Zhang (Eds.). Springer Nature Singapore, Singapore, 510–521
- [26]
-
[27]
Jiawei Liu, Chunqiu Steven Xia, Yuyao Wang, and Lingming Zhang. 2023. Is Your Code Generated by ChatGPT Really Correct? Rigorous Evaluation of Large Language Models for Code Generation. arXiv:2305.01210 [cs.SE] https://arxiv.org/abs/2305.01210
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[28]
Aman Madaan, Niket Tandon, Prakhar Gupta, Skyler Hallinan, Luyu Gao, Sarah Wiegreffe, Uri Alon, Nouha Dziri, Shrimai Prabhumoye, Yiming Yang, Shashank Gupta, Bodhisattwa Prasad Majumder, Katherine Hermann, Sean Welleck, Amir Yazdanbakhsh, and Peter Clark. 2023. Self-Refine: Iterative Refinement with Self-Feedback. arXiv:2303.17651 [cs.CL] https://arxiv.or...
work page internal anchor Pith review Pith/arXiv arXiv 2023
- [29]
-
[30]
Niklas Muennighoff, Zitong Yang, Weijia Shi, Xiang Lisa Li, Li Fei-Fei, Hannaneh Hajishirzi, Luke Zettlemoyer, Percy Liang, Emmanuel Candès, and Tatsunori Hashimoto. 2025. s1: Simple test-time scaling. arXiv:2501.19393 [cs.CL] https://arxiv.org/abs/2501.19393
work page internal anchor Pith review Pith/arXiv arXiv 2025
- [31]
-
[32]
OpenAI, Sandhini Agarwal, Lama Ahmad, Jason Ai, Sam Altman, Andy Applebaum, Edwin Arbus, Rahul K. Arora, Yu Bai, Bowen Baker, Haiming Bao, Boaz Barak, Ally Bennett, Tyler Bertao, Nivedita Brett, Eugene Brevdo, Greg Brockman, Sebastien Bubeck, Che Chang, Kai Chen, Mark Chen, Enoch Cheung, Aidan Clark, Dan Cook, Marat Dukhan, Casey Dvorak, Kevin Fives, Vl...
work page internal anchor Pith review Pith/arXiv arXiv 2017
- [33]
-
[34]
Sida Peng, Eirini Kalliamvakou, Peter Cihon, and Mert Demirer. 2023. The Impact of AI on Developer Productivity: Evidence from GitHub Copilot. arXiv. https://www.microsoft.com/en-us/research/publication/the-impact-of-ai-on-developer-productivity-evidence-from-github-copilot/
work page 2023
- [35]
-
[36]
Shuo Ren, Daya Guo, Shuai Lu, Long Zhou, Shujie Liu, Duyu Tang, Neel Sundaresan, Ming Zhou, Ambrosio Blanco, and Shuai Ma. 2020. CodeBLEU: a Method for Automatic Evaluation of Code Synthesis. arXiv:2009.10297 [cs.SE] https://arxiv.org/abs/2009.10297
work page internal anchor Pith review Pith/arXiv arXiv 2020
-
[37]
Mehrzad Samadi, Aleksander Ficek, Sean Narenthiran, Siddhartha Jain, Wasi Uddin Ahmad, Somshubra Majumdar, Vahid Noroozi, and Boris Ginsburg. 2025. Scaling Test-Time Compute to Achieve IOI Gold Medal with Open-Weight Models. arXiv:2510.14232 [cs.LG] https://arxiv.org/abs/2510.14232
work page internal anchor Pith review Pith/arXiv arXiv 2025
- [38]
- [39]
- [40]
-
[41]
Team GLM, Aohan Zeng, Bin Xu, Bowen Wang, Chenhui Zhang, Da Yin, Dan Zhang, Diego Rojas, Guanyu Feng, Hanlin Zhao, Hanyu Lai, Hao Yu, Hongning Wang, Jiadai Sun, Jiajie Zhang, Jiale Cheng, Jiayi Gui, Jie Tang, Jing Zhang, Jingyu Sun, Juanzi Li, Lei Zhao, Lindong Wu, Lucen Zhong, Mingdao Liu, Minlie Huang, Peng Zhang, Qinkai Zheng, Rui Lu, Shuaiqi Duan, Shu...
work page internal anchor Pith review Pith/arXiv arXiv 2024
- [42]
-
[43]
Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed Chi, Quoc Le, and Denny Zhou. 2023. Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. arXiv:2201.11903 [cs.CL] https://arxiv.org/abs/2201.11903
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[44]
An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, Chujie Zheng, Dayiheng Liu, Fan Zhou, Fei Huang, Feng Hu, Hao Ge, Haoran Wei, Huan Lin, Jialong Tang, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jing Zhou, Jingren Zhou, Junyang Lin, Kai Dang, Keqin Bao, Kexin Yang, ...
work page internal anchor Pith review Pith/arXiv arXiv 2025
- [45]
-
[46]
Zhaojian Yu, Yinghao Wu, Yilun Zhao, Arman Cohan, and Xiao-Ping Zhang. Z1: Efficient Test-time Scaling with Code. arXiv:2504.00810 [cs.CL] https://arxiv.org/abs/2504.00810
- [47]
- [48]
-
[49]
Ziyao Zhang, Chong Wang, Yanlin Wang, Ensheng Shi, Yuchi Ma, Wanjun Zhong, Jiachi Chen, Mingzhi Mao, and Zibin Zheng. 2025. LLM Hallucinations in Practical Code Generation: Phenomena, Mechanism, and Mitigation. Proc. ACM Softw. Eng. 2, ISSTA, Article ISSTA022 (June 2025), 23 pages. doi:10.1145/3728894
discussion (0)