pith. machine review for the scientific record.

arxiv: 2603.27098 · v2 · submitted 2026-03-28 · 💻 cs.SE

Recognition: 2 theorem links · Lean Theorem

Ensemble-Based Uncertainty Estimation for Code Correctness Estimation

Authors on Pith: no claims yet

Pith reviewed 2026-05-14 22:49 UTC · model grok-4.3

classification 💻 cs.SE
keywords code generation · uncertainty estimation · ensemble methods · semantic entropy · program correctness · test-time scaling · LLM evaluation

The pith

Aggregating semantic consistency across multiple models detects incorrect code more reliably than repeated samples from one model.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Single large language models often settle on the same wrong program when generating code from a description, which makes consistency checks within one model a weak signal of correctness. The paper instead measures how much the semantic meaning of outputs varies when the same prompt is given to an ensemble of different models. Greater disagreement across the ensemble turns out to track actual program errors more closely than single-model checks do. This improved uncertainty signal allows selective acceptance of generated code with fewer mistakes and supports a cascading procedure that applies extra computation only when the signal indicates high uncertainty.

Core claim

Ensemble Semantic Entropy estimates uncertainty for code generation by measuring the consistency of semantics across samples from multiple different models rather than from one model. On LiveCodeBench this measure correlates more strongly with program correctness, raises prediction accuracy by 53.4 percent in selective generation under strict false-positive constraints, and powers a cascading test-time scaling framework that preserves performance while cutting FLOPs by 64.9 percent relative to uniform single-model scaling.

What carries the argument

Ensemble Semantic Entropy (ESE), which quantifies uncertainty by checking semantic consistency of generated programs aggregated across an ensemble of distinct models.
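
As a concrete illustration, here is a minimal sketch of how such a measure could be computed, assuming semantic equivalence is approximated by agreement of program outputs on a shared set of test inputs. The paper's exact clustering and aggregation rules are not given in the text above, so `run_program`, the signature-based clustering, and the pooling step are all illustrative assumptions, not the authors' implementation.

```python
import math
from collections import Counter

def semantic_signature(program, test_inputs, run_program):
    """Approximate a program's semantics by its outputs on shared test
    inputs; programs that agree on every input fall in the same cluster.
    Outputs must be hashable so the tuple can serve as a dict key."""
    return tuple(run_program(program, x) for x in test_inputs)

def ensemble_semantic_entropy(samples_per_model, test_inputs, run_program):
    """Pool samples from every model in the ensemble, cluster them by
    behavioural signature, and return the Shannon entropy of the cluster
    sizes. High entropy means the ensemble disagrees about the semantics."""
    pooled = [p for samples in samples_per_model for p in samples]
    counts = Counter(
        semantic_signature(p, test_inputs, run_program) for p in pooled
    )
    n = len(pooled)
    return -sum((c / n) * math.log(c / n) for c in counts.values())
```

The single-model baseline is the same computation with `samples_per_model` holding draws from one model only; the ensemble version differs solely in where the pooled samples come from.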

If this is right

  • Selective acceptance of generated code achieves higher accuracy at the same strict false-positive rate.
  • A cascading procedure can apply additional inference steps only when ensemble disagreement is high, cutting total compute while preserving output quality (see the sketch after this list).
  • Uncertainty estimates improve when models trained on different paths are combined rather than when repeated draws are taken from one training path.
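
A minimal sketch of what such a cascade could look like, using the entropy function above as the decision signal. The stage list, the `ese_fn` callable, and the 0.5 cutoff are hypothetical stand-ins; the paper's Cas decision rule is not specified in the text above.

```python
def cascade_generate(prompt, stages, ese_fn, threshold=0.5):
    """Run progressively more expensive generation stages, cheapest first,
    and stop as soon as ensemble disagreement drops below the threshold.
    `stages` is a non-empty list of callables, each returning one list of
    candidate programs per ensemble model; `ese_fn` scores disagreement
    (e.g., the entropy sketch above). The 0.5 cutoff is illustrative."""
    for generate in stages:
        candidates = generate(prompt)     # samples from the ensemble
        score = ese_fn(candidates)        # ensemble semantic entropy
        if score <= threshold:            # low disagreement: accept early
            return candidates, score      # and skip the remaining stages
    return candidates, score              # cascade exhausted; still uncertain
```

The FLOPs savings come from the early returns: confident problems never pay for the expensive later stages.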

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same ensemble disagreement signal could be tested on other structured generation tasks such as theorem proving or data transformation scripts.
  • Experiments that vary the degree of model diversity while holding ensemble size fixed would show how much training difference is required for the method to work.

Load-bearing premise

Different models will disagree more on incorrect programs than on correct ones without their architectural or training differences creating new shared mistakes.

What would settle it

A benchmark showing that an ensemble of models produces highly consistent but incorrect code outputs at rates comparable to or higher than a single model would demonstrate that the ensemble adds no reliable correctness signal.

Figures

Figures reproduced from arXiv: 2603.27098 by Aishan Liu, Jian Yang, Mingfei Cheng, Qiang Hu, Tianlin Li, Xiaoyu Zhang, Yanni Dong, Yunxiang Wei, Yuwei Zheng.

Figure 1. Calculation of Ensemble Semantic Entropy on a motivating example. The problem requires traversing an array […]
Figure 2. Comparison of the distribution of the largest cluster […]
Figure 3. Pearson correlation coefficients between uncertainty […]
Figure 4. Accuracy-cost comparison on LiveCodeBench.
Original abstract

Large language models (LLMs) have demonstrated remarkable capabilities in generating programs from natural language descriptions, yet ensuring their correctness without an external oracle remains a critical challenge. To solve the challenge, existing methods often rely on uncertainty estimation, measuring the consistency of semantics or execution behaviors across multiple samples generated by a single model. However, we observe that a single model can often converge to a consistent but incorrect solution, rendering such consistency-based proxies ineffective. To address this, we propose Ensemble Semantic Entropy (ESE), which estimates uncertainty by evaluating the consistency of samples aggregated across an ensemble of models. Experiments on LiveCodeBench demonstrate that ESE correlates more strongly with program correctness than single-model semantic entropy. Notably, in selective generation tasks with strict false-positive rate constraints, ESE improves prediction accuracy by 53.4%. Furthermore, by leveraging ESE as the decision signal, we propose a cascading test-time scaling framework Cas, which maintains performance while reducing FLOPs by 64.9% compared to single-model scaling, offering a new perspective on balancing parameter and inference scaling.

Editorial analysis

A structured set of objections, weighed in public.

A referee report, a simulated authors' rebuttal, a circularity audit, and an axiom ledger. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript proposes Ensemble Semantic Entropy (ESE) as an uncertainty estimator for LLM code generation, computing semantic consistency across samples from an ensemble of models instead of a single model. It claims stronger correlation with program correctness on LiveCodeBench than single-model semantic entropy, a 53.4% accuracy lift in selective generation under strict false-positive-rate constraints, and a cascading test-time scaling method (Cas) that preserves performance while cutting FLOPs by 64.9% relative to single-model scaling.

Significance. If the reported gains are reproducible and not artifacts of unmatched model capabilities, ESE could supply a practical signal for reliable selective prediction and efficient test-time scaling in code generation. The work highlights a concrete limitation of single-model consistency measures and offers an ensemble-based alternative that may better surface semantic errors.

major comments (3)
  1. [Abstract and §4] Abstract and §4 (Experiments): the 53.4% accuracy improvement and 64.9% FLOPs reduction are stated without any description of the ensemble models (sizes, training distributions, or capability matching), baseline implementations, statistical tests, or error bars. These omissions render the central empirical claims unevaluable from the supplied text.
  2. [§3 and §5] §3 (ESE definition) and §5 (Cas framework): the claim that ESE isolates semantic diversity (rather than systematic capability differences) is load-bearing for both the correlation result and the Cas efficiency gain, yet no ablation or control is reported that holds model capability fixed while varying ensemble diversity. Without such evidence the reported improvements could be driven by the ensemble simply including stronger models on certain problems.
  3. [§4.2] §4.2 (selective generation): the strict false-positive-rate constraint is central to the 53.4% figure, but the manuscript supplies neither the exact threshold values used nor the procedure for computing the operating point, preventing verification that the comparison is fair across single-model and ensemble signals.
minor comments (2)
  1. [§3] Notation for semantic entropy and ESE should be defined once in a single equation block rather than re-introduced in prose.
  2. [Figures 2-4] Figure captions for the correlation and scaling plots should include the exact number of problems, models, and samples per model.

Simulated Authors' Rebuttal

3 responses · 0 unresolved

We thank the referee for their thoughtful and constructive review. The comments highlight important areas for improving clarity and reproducibility, which we will address in the revised manuscript. Below we respond point-by-point to the major comments.

point-by-point responses
  1. Referee: [Abstract and §4] Abstract and §4 (Experiments): the 53.4% accuracy improvement and 64.9% FLOPs reduction are stated without any description of the ensemble models (sizes, training distributions, or capability matching), baseline implementations, statistical tests, or error bars. These omissions render the central empirical claims unevaluable from the supplied text.

    Authors: We agree that the current text lacks sufficient detail on these elements. In the revision we will expand the abstract and §4 to explicitly list the ensemble models (including sizes and training distributions), describe how capabilities were matched across the ensemble, provide full baseline implementation details, report statistical tests (e.g., paired t-tests with p-values), and include error bars on all key metrics including the 53.4% accuracy lift and 64.9% FLOPs reduction. These additions will make the empirical claims fully evaluable. revision: yes

  2. Referee: [§3 and §5] §3 (ESE definition) and §5 (Cas framework): the claim that ESE isolates semantic diversity (rather than systematic capability differences) is load-bearing for both the correlation result and the Cas efficiency gain, yet no ablation or control is reported that holds model capability fixed while varying ensemble diversity. Without such evidence the reported improvements could be driven by the ensemble simply including stronger models on certain problems.

    Authors: This is a valid concern. While our ensemble was constructed from models with comparable per-problem performance on LiveCodeBench, we did not include an explicit ablation that holds capability fixed. In the revised §5 we will add a controlled ablation comparing ESE under matched-capability ensembles versus ensembles that deliberately vary capability, demonstrating that the correlation and efficiency gains persist when capability differences are minimized. This will directly address whether the benefits arise from semantic diversity. revision: yes

  3. Referee: [§4.2] §4.2 (selective generation): the strict false-positive-rate constraint is central to the 53.4% figure, but the manuscript supplies neither the exact threshold values used nor the procedure for computing the operating point, preventing verification that the comparison is fair across single-model and ensemble signals.

    Authors: We will revise §4.2 to specify the exact FPR thresholds (e.g., 0.05 and 0.10) and the full procedure for determining the operating point, including how thresholds were selected on a held-out validation split to enforce the FPR constraint while maximizing accuracy. This will allow direct verification that the single-model versus ensemble comparison is performed under identical constraints. revision: yes
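
For readers unfamiliar with the operating-point procedure the rebuttal describes, one common recipe is sketched below: on a held-out validation split, choose the most permissive acceptance threshold whose false-positive rate stays under the cap. The function name, the label convention (1 = correct program), and the acceptance rule (accept when the uncertainty score is at or below the threshold) are assumptions for illustration, not taken from the paper.

```python
import numpy as np

def pick_operating_point(scores, labels, max_fpr=0.05):
    """On a held-out validation split, return the most permissive
    uncertainty threshold whose false-positive rate stays under `max_fpr`.
    A false positive is an accepted program whose label marks it
    incorrect (label 0)."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels)          # 1 = correct, 0 = incorrect
    n_incorrect = max(int(np.sum(labels == 0)), 1)
    best = None
    for t in np.unique(scores):          # candidate thresholds, ascending
        accepted = scores <= t
        fpr = int(np.sum(accepted & (labels == 0))) / n_incorrect
        if fpr <= max_fpr:
            best = t                     # FPR grows with t; keep the largest
    return best                          # None if no threshold meets the cap
```

Applying the same recipe to both the single-model and ensemble signals is what makes the comparison at a fixed FPR fair.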

Circularity Check

0 steps flagged

No circularity: empirical proposal of ESE metric and Cas framework

full rationale

The paper proposes Ensemble Semantic Entropy (ESE) as an uncertainty measure based on cross-model sample consistency and reports experimental results on LiveCodeBench showing stronger correlation with program correctness than single-model semantic entropy, plus accuracy gains under FPR constraints and FLOPs reductions via the Cas cascading framework. No mathematical derivation chain, equations, or first-principles results are claimed. The central claims rest on empirical observations from experiments rather than any reduction to fitted inputs, self-definitions, or self-citation chains. The work is self-contained as a novel metric and framework without load-bearing self-referential steps.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the domain assumption that ensembles expose semantic diversity better than single models and on empirical results whose details are absent from the abstract.

axioms (1)
  • domain assumption An ensemble of models produces sufficiently diverse samples to expose incorrect but consistent solutions that a single model would miss.
    This premise is required for ESE to outperform single-model semantic entropy and is stated as an observation in the abstract.
invented entities (1)
  • Ensemble Semantic Entropy (ESE) no independent evidence
    purpose: Uncertainty score computed from semantic consistency across multiple models
    Newly introduced metric whose exact aggregation formula is not supplied in the abstract.

pith-pipeline@v0.9.0 · 5502 in / 1453 out tokens · 62109 ms · 2026-05-14T22:49:41.210146+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

49 extracted references · 49 canonical work pages · 15 internal anchors

  1. [1]

    Aradhye Agarwal, Ayan Sengupta, and Tanmoy Chakraborty. 2025. The Art of Scaling Test-Time Compute for Large Language Models. arXiv:2512.02008 [cs.CL] https://arxiv.org/abs/2512.02008

  2. [2]

    Anonymous. 2026. Anonymous Artifact for Ensemble Semantic Entropy. https://anonymous.4open.science/r/Ensemble-Semantic-Entropy-F376. Anonymous repository for peer review

  3. [3]

    Anthropic. 2026. Claude Code: AI-powered coding assistant for developers. Product page. https://claude.com/product/claude-code Accessed: 2026-03-24

  4. [4]

    Bei Chen, Fengji Zhang, Anh Nguyen, Daoguang Zan, Zeqi Lin, Jian-Guang Lou, and Weizhu Chen. 2022. CodeT: Code Generation with Generated Tests. arXiv:2207.10397 [cs.CL] https://arxiv.org/abs/2207.10397

  5. [5]

    Bei Chen, Fengji Zhang, Anh Nguyen, Daoguang Zan, Zeqi Lin, Jian-Guang Lou, and Weizhu Chen. 2023. CodeT: Code Generation with Generated Tests. In The Eleventh International Conference on Learning Representations. https://openreview.net/forum?id=ktrw68Cmu9c

  6. [6]

    Jiefeng Chen, Jie Ren, Xinyun Chen, Chengrun Yang, Ruoxi Sun, Jinsung Yoon, and Sercan Ö Arık. 2025. SETS: Leveraging Self-Verification and Self-Correction for Improved Test-Time Scaling. arXiv:2501.19306 [cs.AI] https://arxiv.org/abs/2501.19306

  7. [7]

    Xinyun Chen, Maxwell Lin, Nathanael Schärli, and Denny Zhou. 2023. Teaching Large Language Models to Self-Debug. arXiv:2304.05128 [cs.CL] https://arxiv.org/abs/2304.05128

  8. [8]–[9]

    Xiancai Chen, Zhengwei Tao, Kechi Zhang, Changzhi Zhou, Xinyu Zhang, Wanli Gu, Yuanpeng He, Mengdi Zhang, Xunliang Cai, Haiyan Zhao, and Zhi Jin. 2025. Revisit Self-Debugging with Self-Generated Tests for Code Generation. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Wanxiang Che, Joyce Nabende, Ekaterina Shutova, and Mohammad Taher Pilehvar (Eds.). Association for Computational Linguistics, Vienna, Austria, 18003–18023. doi:10.18653/v1...

  10. [10]

    Cursor. 2026. Cursor Agent. Product page. https://cursor.com/product Accessed: 2026-03-24

  11. [11]

    Yihan Dai, Sijie Liang, Haotian Xu, Peichu Xie, and Sergey Mechtaev. 2025. Reducing Hallucinations in LLM-Generated Code via Semantic Triangulation. arXiv:2511.12288 [cs.SE] https://arxiv.org/abs/2511.12288

  12. [12]

    Arghavan Moradi Dakhel, Amin Nikanjam, Vahid Majdinasab, Foutse Khomh, and Michel C. Desmarais. 2023. Effective Test Generation Using Pre-trained Large Language Models and Mutation Testing. arXiv:2308.16557 [cs.SE] https://arxiv.org/abs/2308.16557

  13. [13]

    Sebastian Farquhar, Jannik Kossen, Lorenz Kuhn, and Yarin Gal. 2024. Detecting hallucinations in large language models using semantic entropy. Nature 630, 8017 (June 2024), 625–630. doi:10.1038/s41586-024-07421-0

  14. [14]

    Audrey Huang, Adam Block, Qinghua Liu, Nan Jiang, Akshay Krishnamurthy, and Dylan J. Foster. 2025. Is Best-of-N the Best of Them? Coverage, Scaling, and Optimality in Inference-Time Alignment. arXiv:2503.21878 [cs.AI] https://arxiv.org/abs/2503.21878

  15. [15]

    Lei Huang, Weijiang Yu, Weitao Ma, Weihong Zhong, Zhangyin Feng, Haotian Wang, Qianglong Chen, Weihua Peng, Xiaocheng Feng, Bing Qin, and Ting Liu. 2025. A Survey on Hallucination in Large Language Models: Principles, Taxonomy, Challenges, and Open Questions. ACM Trans. Inf. Syst. 43, 2, Article 42 (Jan. 2025), 55 pages. doi:10.1145/3703155

  16. [16]

    Naman Jain, King Han, Alex Gu, Wen-Ding Li, Fanjia Yan, Tianjun Zhang, Sida Wang, Armando Solar-Lezama, Koushik Sen, and Ion Stoica. 2024. LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code. arXiv:2403.07974 [cs.SE] https://arxiv.org/abs/2403.07974

  17. [17]

    Adam Tauman Kalai, Ofir Nachum, Santosh S. Vempala, and Edwin Zhang. 2025. Why Language Models Hallucinate. arXiv:2509.04664 [cs.CL] https://arxiv.org/abs/2509.04664

  18. [18]

    Alex Kendall and Yarin Gal. 2017. What Uncertainties Do We Need in Bayesian Deep Learning for Computer Vision? arXiv:1703.04977 [cs.CV] https://arxiv.org/abs/1703.04977

  19. [19]

    Lorenz Kuhn, Yarin Gal, and Sebastian Farquhar. 2023. Semantic Uncertainty: Linguistic Invariances for Uncertainty Estimation in Natural Language Generation. arXiv:2302.09664 [cs.CL] https://arxiv.org/abs/2302.09664

  20. [20]

    Balaji Lakshminarayanan, Alexander Pritzel, and Charles Blundell. 2017. Simple and Scalable Predictive Uncertainty Estimation using Deep Ensembles. arXiv:1612.01474 [stat.ML] https://arxiv.org/abs/1612.01474

  21. [21]

    Rémi Leblond, Felix Gimeno, and Florent Altché. 2023. AlphaCode 2 Technical Report. https://storage.googleapis.com/deepmind-media/AlphaCode2/AlphaCode2_Tech_Report.pdf

  22. [22]

    Dacheng Li, Shiyi Cao, Chengkun Cao, Xiuyu Li, Shangyin Tan, Kurt Keutzer, Jiarong Xing, Joseph E. Gonzalez, and Ion Stoica. 2025. S*: Test Time Scaling for Code Generation. arXiv:2502.14382 [cs.LG] https://arxiv.org/abs/2502.14382

  23. [23]

    Kefan Li and Yuan Yuan. 2024. Large Language Models as Test Case Generators: Performance Evaluation and Enhancement. arXiv:2404.13340 [cs.SE] https://arxiv.org/abs/2404.13340

  24. [24]

    Yujia Li, David Choi, Junyoung Chung, Nate Kushman, Julian Schrittwieser, Rémi Leblond, Tom Eccles, James Keeling, Felix Gimeno, Agustin Dal Lago, Thomas Hubert, Peter Choy, Cyprien de Masson d’Autume, Igor Babuschkin, Xinyun Chen, Po-Sen Huang, Johannes Welbl, Sven Gowal, Alexey Cherepanov, James Molloy, Daniel J. Mankowitz, Esme Sutherland Robson, Pushmeet Kohli, Nando de Freitas, Koray Kavukcuoglu, and Oriol Vinyals. 2022. Competition-Level Code Generation with AlphaCode. Science 378, 6624 (2022).

  25. [25]–[26]

    Yunhao Liang, Ruixuan Ying, Takuya Taniguchi, Chengguang Gan, and Zhe Cui. RECODE: Leveraging Reliable Self-generated Tests and Fine-Grained Execution Feedback to Enhance LLM-Based Code Generation. In Advanced Intelligent Computing Technology and Applications, De-Shuang Huang, Bo Li, Haiming Chen, and Chuanlei Zhang (Eds.). Springer Nature Singapore, Singapore, 510–521.

  27. [27]

    Jiawei Liu, Chunqiu Steven Xia, Yuyao Wang, and Lingming Zhang. 2023. Is Your Code Generated by ChatGPT Really Correct? Rigorous Evaluation of Large Language Models for Code Generation. arXiv:2305.01210 [cs.SE] https://arxiv.org/abs/2305.01210

  28. [28]

    Aman Madaan, Niket Tandon, Prakhar Gupta, Skyler Hallinan, Luyu Gao, Sarah Wiegreffe, Uri Alon, Nouha Dziri, Shrimai Prabhumoye, Yiming Yang, Shashank Gupta, Bodhisattwa Prasad Majumder, Katherine Hermann, Sean Welleck, Amir Yazdanbakhsh, and Peter Clark. 2023. Self-Refine: Iterative Refinement with Self-Feedback. arXiv:2303.17651 [cs.CL] https://arxiv.org/abs/2303.17651

  29. [29]

    Andrey Malinin and Mark Gales. 2021. Uncertainty Estimation in Autoregressive Structured Prediction. arXiv:2002.07650 [stat.ML] https://arxiv.org/abs/2002.07650

  30. [30]

    Niklas Muennighoff, Zitong Yang, Weijia Shi, Xiang Lisa Li, Li Fei-Fei, Hannaneh Hajishirzi, Luke Zettlemoyer, Percy Liang, Emmanuel Candès, and Tatsunori Hashimoto. 2025. s1: Simple test-time scaling. arXiv:2501.19393 [cs.CL] https://arxiv.org/abs/2501.19393

  31. [31]

    Dang Nguyen, Ali Payani, and Baharan Mirzasoleiman. 2025. Beyond Semantic Entropy: Boosting LLM Uncertainty Quantification with Pairwise Semantic Similarity. arXiv:2506.00245 [cs.LG] https://arxiv.org/abs/2506.00245

  32. [32]

    OpenAI, Sandhini Agarwal, Lama Ahmad, Jason Ai, Sam Altman, Andy Applebaum, Edwin Arbus, Rahul K. Arora, Yu Bai, Bowen Baker, Haiming Bao, Boaz Barak, Ally Bennett, Tyler Bertao, Nivedita Brett, Eugene Brevdo, Greg Brockman, Sebastien Bubeck, Che Chang, Kai Chen, Mark Chen, Enoch Cheung, Aidan Clark, Dan Cook, Marat Dukhan, Casey Dvorak, Kevin Fives, Vl...

  33. [33]

    Ji Won Park and Kyunghyun Cho. 2026. Efficient semantic uncertainty quantification in language models via diversity-steered sampling. arXiv:2510.21310 [cs.CL] https://arxiv.org/abs/2510.21310

  34. [34]

    Sida Peng, Eirini Kalliamvakou, Peter Cihon, and Mert Demirer. 2023. The Impact of AI on Developer Productivity: Evidence from GitHub Copilot. arXiv. https://www.microsoft.com/en-us/research/publication/the-impact-of-ai-on-developer-productivity-evidence-from-github-copilot/

  35. [35]

    Chaitanya Ravuri and Saman Amarasinghe. 2025. Eliminating Hallucination-Induced Errors in LLM Code Generation with Functional Clustering. arXiv:2506.11021 [cs.SE] https://arxiv.org/abs/2506.11021

  36. [36]

    Shuo Ren, Daya Guo, Shuai Lu, Long Zhou, Shujie Liu, Duyu Tang, Neel Sundaresan, Ming Zhou, Ambrosio Blanco, and Shuai Ma. 2020. CodeBLEU: a Method for Automatic Evaluation of Code Synthesis. arXiv:2009.10297 [cs.SE] https://arxiv.org/abs/2009.10297

  37. [37]

    Mehrzad Samadi, Aleksander Ficek, Sean Narenthiran, Siddhartha Jain, Wasi Uddin Ahmad, Somshubra Majumdar, Vahid Noroozi, and Boris Ginsburg. 2025. Scaling Test-Time Compute to Achieve IOI Gold Medal with Open-Weight Models. arXiv:2510.14232 [cs.LG] https://arxiv.org/abs/2510.14232

  38. [38]

    Max Schäfer, Sarah Nadi, Aryaz Eghbali, and Frank Tip. 2023. An Empirical Evaluation of Using Large Language Models for Automated Unit Test Generation. arXiv:2302.06527 [cs.SE] https://arxiv.org/abs/2302.06527

  39. [39]

    Arindam Sharma and Cristina David. 2025. Assessing Correctness in LLM-Based Code Generation via Uncertainty Estimation. arXiv:2502.11620 [cs.SE] https://arxiv.org/abs/2502.11620

  40. [40]

    Adi Simhi, Itay Itzhak, Fazl Barez, Gabriel Stanovsky, and Yonatan Belinkov. 2025. Trust Me, I’m Wrong: LLMs Hallucinate with Certainty Despite Knowing the Answer. arXiv:2502.12964 [cs.CL] https://arxiv.org/abs/2502.12964

  41. [41]

    Team GLM, Aohan Zeng, Bin Xu, Bowen Wang, Chenhui Zhang, Da Yin, Dan Zhang, Diego Rojas, Guanyu Feng, Hanlin Zhao, Hanyu Lai, Hao Yu, Hongning Wang, Jiadai Sun, Jiajie Zhang, Jiale Cheng, Jiayi Gui, Jie Tang, Jing Zhang, Jingyu Sun, Juanzi Li, Lei Zhao, Lindong Wu, Lucen Zhong, Mingdao Liu, Minlie Huang, Peng Zhang, Qinkai Zheng, Rui Lu, Shuaiqi Duan, Shu...

  42. [42]

    Thomas Valentin, Ardi Madadi, Gaetano Sapia, and Marcel Böhme. 2025. Incoherence as Oracle-less Measure of Error in LLM-Based Code Generation. arXiv:2507.00057 [cs.PL] https://arxiv.org/abs/2507.00057

  43. [43]

    Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed Chi, Quoc Le, and Denny Zhou. 2023. Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. arXiv:2201.11903 [cs.CL] https://arxiv.org/abs/2201.11903

  44. [44]

    An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, Chujie Zheng, Dayiheng Liu, Fan Zhou, Fei Huang, Feng Hu, Hao Ge, Haoran Wei, Huan Lin, Jialong Tang, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jing Zhou, Jingren Zhou, Junyang Lin, Kai Dang, Keqin Bao, Kexin Yang, ...

  45. [45]

    Zhuoyi Yang, Xu Guo, Tong Zhang, Huijuan Xu, and Boyang Li. 2025. Test-time Scaling of LLMs: A Survey from A Subproblem Structure Perspective. arXiv:2511.14772 [cs.CL] https://arxiv.org/abs/2511.14772

  46. [46]–[47]

    Zhaojian Yu, Yinghao Wu, Yilun Zhao, Arman Cohan, and Xiao-Ping Zhang. 2025. Z1: Efficient Test-time Scaling with Code. arXiv:2504.00810 [cs.CL] https://arxiv.org/abs/2504.00810

  48. [48]

    Murong Yue, Jie Zhao, Min Zhang, Liang Du, and Ziyu Yao. 2024. Large Language Model Cascades with Mixture of Thoughts Representations for Cost-efficient Reasoning. arXiv:2310.03094 [cs.CL] https://arxiv.org/abs/2310.03094

  49. [49]

    Ziyao Zhang, Chong Wang, Yanlin Wang, Ensheng Shi, Yuchi Ma, Wanjun Zhong, Jiachi Chen, Mingzhi Mao, and Zibin Zheng. 2025. LLM Hallucinations in Practical Code Generation: Phenomena, Mechanism, and Mitigation. Proc. ACM Softw. Eng. 2, ISSTA, Article ISSTA022 (June 2025), 23 pages. doi:10.1145/3728894