arxiv: 2509.09438 · v2 · submitted 2025-09-11 · 💻 cs.CL

GrACE: A Generative Approach to Better Confidence Elicitation and Efficient Test-Time Scaling in Large Language Models

Zhaohan Zhang, Ziquan Liu, Ioannis Patras

Pith reviewed 2026-05-18 17:50 UTC · model grok-4.3

classification 💻 cs.CL

keywords confidence elicitation · large language models · model calibration · test-time scaling · generative confidence · uncertainty estimation


The pith

GrACE trains LLMs to express confidence through similarity of their last hidden state to a special token's embedding.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Large language models need ways to indicate how reliable their outputs are, particularly for critical uses. Existing approaches often require running the model many times or training separate systems, which adds cost and complexity. GrACE instead adds one special token to the vocabulary and fine-tunes the model using accuracy targets. This makes the similarity between the final hidden state and the token embedding serve as a direct, real-time confidence score. According to the paper, the method achieves the best calibration and discrimination among the compared methods on open-ended generation tasks, and it supports more efficient test-time computation by guiding how many samples to draw.

Core claim

GrACE achieves reliable confidence elicitation by fine-tuning the LLM so that the similarity between its last hidden state and the embedding of an appended special token directly indicates the accuracy of the generated output. This generative approach yields superior discriminative capacity and calibration on open-ended tasks compared to prior methods, all without extra sampling or an auxiliary model, and enables confidence-guided strategies that boost final accuracy while reducing required samples.

What carries the argument

The similarity between the last hidden state and the embedding of a special token appended to the vocabulary, calibrated via fine-tuning on accuracy targets.
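As a rough illustration of the mechanism described above, the sketch below scores one generation by comparing a last hidden state with a special-token embedding. The summary does not state which similarity function the paper uses, so cosine similarity, the affine map to [0, 1], and the names `cosine_similarity`, `confidence_from_hidden_state`, and `cnf_embedding` are all assumptions made for this example.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def confidence_from_hidden_state(last_hidden, cnf_embedding):
    """Map cosine similarity from [-1, 1] to [0, 1].

    Assumption: this affine map is illustrative only; the paper may use a
    different similarity or squashing function.
    """
    return 0.5 * (cosine_similarity(last_hidden, cnf_embedding) + 1.0)


# Toy vectors standing in for a real hidden state and token embedding.
last_hidden = [0.2, -0.1, 0.7]
cnf_embedding = [0.1, 0.0, 0.9]
print(confidence_from_hidden_state(last_hidden, cnf_embedding))
```

In a real model both vectors would come from the fine-tuned network, and the score would only be meaningful after the calibration step the page describes.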

If this is right

LLMs can provide on-the-fly confidence estimates during generation without extra sampling or an auxiliary model.
Test-time scaling becomes more efficient by using confidence to limit the number of samples needed.
Open-ended generation tasks gain better calibrated uncertainty measures for decision making.
Deployment in high-stakes domains improves because confidence no longer relies on costly post-processing.
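The efficiency point in the list above can be pictured with a small early-stopping loop. This is a minimal sketch, not one of the paper's two strategies, which this page does not spell out; the function `sample_with_early_stop`, its `threshold` parameter, and the scripted generator are hypothetical.

```python
def sample_with_early_stop(generate, budget, threshold):
    """Draw up to `budget` samples and stop once one reaches `threshold`.

    `generate` is any zero-argument callable returning (answer, confidence).
    Returns the most confident answer seen, its confidence, and the number
    of samples actually drawn.
    """
    if budget < 1:
        raise ValueError("budget must be at least 1")
    best_answer, best_conf, used = None, float("-inf"), 0
    for _ in range(budget):
        answer, conf = generate()
        used += 1
        if conf > best_conf:
            best_answer, best_conf = answer, conf
        if conf >= threshold:
            break  # confident enough: skip the remaining budget
    return best_answer, best_conf, used


# Toy usage: a scripted stream stands in for the model (hypothetical data).
scripted = iter([("Paris", 0.35), ("Lyon", 0.20), ("Paris", 0.91), ("Nice", 0.10)])
answer, conf, used = sample_with_early_stop(lambda: next(scripted), budget=8, threshold=0.8)
print(answer, conf, used)  # Paris 0.91 3
```

The saving comes from `used` being smaller than `budget` whenever a sample clears the threshold early, which is only useful if the confidence score is well calibrated.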

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If the special token embedding learns to represent uncertainty patterns, it could be used to debug or interpret model errors.
This mechanism might extend to other sequence models beyond transformers if the hidden state carries similar information.
Combining GrACE with existing sampling methods could create hybrid systems with even lower overhead.

Load-bearing premise

Fine-tuning with accuracy-associated targets makes the hidden-state similarity to the special token a reliable and generalizable indicator of true model confidence.

What would settle it

A test showing that after fine-tuning, the similarity scores do not correlate with actual output accuracy on held-out open-ended generation tasks or perform worse than baseline methods on standard calibration metrics like expected calibration error.
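For readers who want to run such a check, expected calibration error can be computed as sketched below. The sketch uses the common equal-width binning formulation; the bin count of 10 and the function name are assumptions, and the paper's exact evaluation protocol may differ.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Equal-width-bin ECE: weighted mean of |accuracy - mean confidence|."""
    if len(confidences) != len(correct):
        raise ValueError("inputs must have the same length")
    bins = [[] for _ in range(n_bins)]
    for conf, is_correct in zip(confidences, correct):
        # A confidence of exactly 1.0 falls into the last bin.
        index = min(int(conf * n_bins), n_bins - 1)
        bins[index].append((conf, is_correct))
    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(int(y) for _, y in members) / len(members)
        ece += (len(members) / total) * abs(accuracy - mean_conf)
    return ece


# Toy data: the model is slightly underconfident on correct answers
# and slightly overconfident on wrong ones.
print(expected_calibration_error([0.9, 0.9, 0.1, 0.1], [1, 1, 0, 0]))
```

A score near zero on held-out data would support the premise; a score no better than the baselines would count against it.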

Figures

Figures reproduced from arXiv: 2509.09438 by Ioannis Patras, Zhaohan Zhang, Ziquan Liu.

**Figure 1.** Comparison among the workflow and performance of post-generation methods, verbalized confidence, and GrACE for confidence elicitation. Post-generation methods require an additional evaluation step for calculating confidence and increase the latency in the inference stage. Verbalized confidence methods generate verbal expressions about confidence along with the generation, but those are not calibrated. GrACE g…

**Figure 2.** Illustration of the GrACE framework. We group the calibration data using the model’s self-awareness and use the accuracy of each group as the calibration target (§3.2.2). The model is trained with a combination of calibration loss and supervised fine-tuning loss (§3.2.3). The confidence is elicited from the distance between the hidden state and the <CNF> embedding (§3.2.1). We perform k-fold binning [28] to al…

**Figure 3.** Generalization ability of different confidence elicitation methods. The values in the brackets represent how much the model’s performance changes when moving from its training domain to an unseen domain. Generalization to unseen domain. We study the generalization ability of confidence elicitation methods that need training (i.e., Apricot, ActCab, GrACE). We conduct cross-validation using TriviaQA and Sc…

**Figure 4.** (Left) Accuracy over actual sample size from 2^0 to 2^5. (Right) Distribution of actual sample size T̂ for GrACE-ES with sampling budget T = 8, based on Llama-3.1-8B-Instruct. … encouraging the LLM to internalize the mapping between question-answer pairs and the accuracy. This enables GrACE to generalize more effectively across domains with different semantics. Ablation Study. We conduct an ablation study u…

**Figure 5.** Accuracy over actual sample size T̂ with sampling budget T from 2^0 to 2^5. The results compare adaptive test-time scaling strategies incorporating different confidence estimations. … the final answer. ASC calculates the cumulative frequency of each answer in the TTS process and terminates sampling when the response’s relative frequency reaches the threshold. ESC divides the sampling steps into windows and…

**Figure 6.** The reliability diagram of various confidence elicitation methods applied to Llama2-7B on the TriviaQA test dataset. The color indicates the proportion of total responses contained in each bin. The color close to red suggests a larger value. Panels: ActCab, Seq. Likelihood, Platt Scaling, P(True), Verbal, Apricot, GrACE. Calibration errors as extracted (panel assignment not recoverable): 16.54%, 10.14%, 7.45%, 56.05…

**Figure 7.** The reliability diagram of various confidence elicitation methods applied to Llama2-7B on the SciQ test dataset. The color indicates the proportion of total responses contained in each bin. The color close to red suggests a larger value.

**Figure 8.** The reliability diagram of various confidence elicitation methods applied to Phi3-3.8B on the TriviaQA test dataset. The color indicates the proportion of total responses contained in each bin. The color close to red suggests a larger value. Panels: ActCab, Seq. Likelihood, Platt Scaling, P(True), Verbal, Apricot, GrACE. Calibration errors as extracted (panel assignment not recoverable): 14.48%, 10.85%, 31.84%, 36.52%…

**Figure 9.** The reliability diagram of various confidence elicitation methods applied to Phi3-3.8B on the SciQ test dataset. The color indicates the proportion of total responses contained in each bin. The color close to red suggests a larger value.

**Figure 10.** The reliability diagram of various confidence elicitation methods applied to Llama3.1-8B-Instruct on the TriviaQA test dataset. The color indicates the proportion of total responses contained in each bin. The color close to red suggests a larger value. Panels: ActCab, GrACE, P(True), Verbal, Apricot, Platt Scaling, Seq. … Calibration errors as extracted (panel assignment not recoverable): 9.03%, 17.96%, 21.11%, 17.06%…

**Figure 11.** The reliability diagram of various confidence elicitation methods applied to Llama3.1-8B-Instruct on the SciQ test dataset. The color indicates the proportion of total responses contained in each bin. The color close to red suggests a larger value.

Original abstract

Assessing the reliability of Large Language Models (LLMs) by confidence elicitation is a prominent approach to AI safety in high-stakes applications, such as healthcare and finance. Existing methods either require expensive computational overhead or suffer from poor calibration, making them impractical and unreliable for real-world deployment. In this work, we propose GrACE, a Generative Approach to Confidence Elicitation that enables scalable and reliable confidence elicitation for LLMs. GrACE adopts a novel mechanism in which the model expresses confidence by the similarity between the last hidden state and the embedding of a special token appended to the vocabulary, in real-time. We fine-tune the model for calibrating the confidence with targets associated with accuracy. Extensive experiments show that the confidence produced by GrACE achieves the best discriminative capacity and calibration on open-ended generation tasks without resorting to additional sampling or an auxiliary model. Moreover, we propose two confidence-based strategies for test-time scaling with GrACE, which not only improve the accuracy of the final decision but also significantly reduce the number of required samples, highlighting its potential as a practical solution for deploying LLMs with reliable, on-the-fly confidence estimation.
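The abstract's "targets associated with accuracy" can be read together with the Figure 2 caption, which says calibration data are grouped and each group's accuracy becomes the target, and that training combines a calibration loss with a supervised fine-tuning loss. The sketch below shows one way that could look. How groups are formed, the squared-error form of the calibration term, the `weight` parameter, and both function names are assumptions for illustration, not the paper's stated objective.

```python
from collections import defaultdict


def group_accuracy_targets(group_ids, correct):
    """Give every sample the empirical accuracy of its group as its target."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for group, is_correct in zip(group_ids, correct):
        totals[group] += 1
        hits[group] += int(is_correct)
    accuracy = {group: hits[group] / totals[group] for group in totals}
    return [accuracy[group] for group in group_ids]


def combined_loss(sft_loss, confidences, targets, weight=1.0):
    """SFT loss plus a weighted calibration term.

    Assumption: mean squared error between confidence and target stands in
    for whatever calibration loss the paper actually uses.
    """
    calibration = sum((c - t) ** 2 for c, t in zip(confidences, targets)) / len(targets)
    return sft_loss + weight * calibration


# Toy usage with two hypothetical groups.
targets = group_accuracy_targets(["a", "a", "b", "b", "b"], [1, 0, 1, 1, 1])
print(targets)  # [0.5, 0.5, 1.0, 1.0, 1.0]
print(combined_loss(2.0, [0.5, 0.5, 1.0, 1.0, 1.0], targets))  # 2.0
```

Using group accuracy rather than per-sample 0/1 labels gives the confidence score a graded target, which is the property a calibration metric later checks.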

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

GrACE adds a special token and fine-tunes the model so that the similarity between the last hidden state and that token's embedding serves as a built-in confidence score, then uses that score to cut samples in test-time scaling. The calibration step is the part that needs checking.


The main point is that they add one special token to the vocabulary, fine-tune the model on sequences labeled by accuracy, and then treat the similarity between the final hidden state and that token's embedding as the confidence signal. No extra sampling or separate model is required at inference. They also show two ways to use this score for test-time scaling that improve accuracy while lowering the sample count on open-ended tasks.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes GrACE, a generative approach to confidence elicitation for LLMs. A special token is appended to the vocabulary, and confidence is expressed via the similarity between the last hidden state and the special token's embedding. The model is fine-tuned using targets associated with accuracy to calibrate this signal. The central claims are that GrACE achieves the best discriminative capacity and calibration on open-ended generation tasks without additional sampling or an auxiliary model, and that two proposed confidence-based strategies enable improved accuracy with fewer samples during test-time scaling.

Significance. If the empirical claims hold under rigorous validation, the work could offer a low-overhead, real-time confidence mechanism that addresses practical limitations of sampling-based or auxiliary-model approaches in high-stakes LLM deployment. The test-time scaling component adds value by linking confidence directly to efficiency gains. The approach sits outside standard logit or ensemble methods, so confirmation that the learned similarity generalizes beyond training artifacts would strengthen its contribution to uncertainty quantification.

major comments (2)

[Abstract and §3 (method)] Abstract and fine-tuning description: The claim that fine-tuning on accuracy-associated targets causes the similarity between the last hidden state and the special-token embedding to become a reliable, generalizable confidence proxy is load-bearing. The manuscript does not specify whether the objective is an explicit calibration or ranking loss on the similarity metric itself or merely next-token prediction on accuracy-labeled sequences. Without the former, the similarity risks latching onto dataset-specific patterns rather than epistemic uncertainty, especially on open-ended tasks where accuracy labels are noisy or model-dependent.
[§4 (experiments)] Experimental evaluation: The abstract asserts 'best discriminative capacity and calibration' on open-ended tasks, yet the provided high-level summary lacks concrete metrics (e.g., AUC, ECE, Brier score), baseline implementations, dataset statistics, or error bars. These details are required to substantiate superiority and to rule out that gains arise from the fine-tuning distribution rather than the proposed similarity mechanism.
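Two of the metrics the referee names can be computed in a few lines, which makes the request easy to act on. The sketch below gives the Brier score and a pairwise AUROC for correctness prediction; the function names are hypothetical, and the quadratic-time AUROC is chosen for clarity rather than speed.

```python
def brier_score(confidences, correct):
    """Mean squared gap between confidence and the 0/1 correctness label."""
    if len(confidences) != len(correct):
        raise ValueError("inputs must have the same length")
    return sum((c - int(y)) ** 2 for c, y in zip(confidences, correct)) / len(correct)


def auroc(confidences, correct):
    """Probability that a correct answer outranks an incorrect one (ties count 0.5)."""
    positives = [c for c, y in zip(confidences, correct) if y]
    negatives = [c for c, y in zip(confidences, correct) if not y]
    if not positives or not negatives:
        raise ValueError("need at least one correct and one incorrect answer")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Toy data: confidences separate correct from incorrect answers perfectly.
scores = [0.9, 0.8, 0.3, 0.2]
labels = [1, 1, 0, 0]
print(auroc(scores, labels))  # 1.0
print(brier_score(scores, labels))
```

AUROC measures discrimination only, while the Brier score also penalizes miscalibration, so reporting both addresses the two halves of the claim under review.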

minor comments (2)

[Method] The precise similarity function (cosine or otherwise) and the exact placement of the special token during generation should be formalized with an equation for reproducibility.
[Method] Clarify whether the special token embedding is learned from scratch or initialized from an existing token; this affects the number of free parameters and the interpretation of the 'parameter-free' aspect of inference-time use.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive report. We address each major comment below and have revised the manuscript accordingly to improve clarity and substantiation of our claims.


Referee: [Abstract and §3 (method)] Abstract and fine-tuning description: The claim that fine-tuning on accuracy-associated targets causes the similarity between the last hidden state and the special-token embedding to become a reliable, generalizable confidence proxy is load-bearing. The manuscript does not specify whether the objective is an explicit calibration or ranking loss on the similarity metric itself or merely next-token prediction on accuracy-labeled sequences. Without the former, the similarity risks latching onto dataset-specific patterns rather than epistemic uncertainty, especially on open-ended tasks where accuracy labels are noisy or model-dependent.

Authors: We appreciate this observation and agree that the precise training objective requires explicit description to support the claim of a generalizable confidence proxy. The GrACE fine-tuning uses next-token prediction on accuracy-labeled sequences where the special token is appended and its generation probability is tied to correctness; however, we acknowledge that this alone may not guarantee the similarity acts as an explicit calibration signal. In the revision we have expanded §3 to detail the full objective (including any auxiliary term on the similarity) and added a new ablation showing that the learned similarity generalizes to held-out tasks beyond the fine-tuning distribution. revision: yes
Referee: [§4 (experiments)] Experimental evaluation: The abstract asserts 'best discriminative capacity and calibration' on open-ended tasks, yet the provided high-level summary lacks concrete metrics (e.g., AUC, ECE, Brier score), baseline implementations, dataset statistics, or error bars. These details are required to substantiate superiority and to rule out that gains arise from the fine-tuning distribution rather than the proposed similarity mechanism.

Authors: We agree that concrete metrics, baselines, and statistical details are essential for rigorous validation. The full experimental section already reports AUC, ECE, and Brier scores with comparisons to logit-based, sampling-based, and auxiliary-model baselines, along with dataset sizes and error bars from multiple seeds. To address the concern, we have moved key quantitative results into the abstract, added a table summarizing all metrics with standard deviations, and included an additional ablation that isolates the contribution of the similarity mechanism from the fine-tuning data distribution. revision: yes

Circularity Check

0 steps flagged

No significant circularity: confidence defined via similarity and calibrated via standard fine-tuning on accuracy targets


The derivation introduces a special token whose embedding similarity to the final hidden state serves as the confidence signal, then applies fine-tuning with accuracy-associated targets to align that signal. This constitutes a conventional supervised calibration step rather than a self-definitional loop or a fitted input relabeled as a prediction. No equations or claims reduce the reported discriminative capacity or calibration metrics to quantities that are identical to the training targets by construction. The paper presents empirical results on open-ended generation tasks as external validation, with no load-bearing self-citations, uniqueness theorems imported from prior author work, or ansatzes smuggled via citation. The method is therefore self-contained against its stated benchmarks.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 1 invented entity

The central claim rests on the effectiveness of a newly introduced special token whose embedding is aligned to accuracy via fine-tuning; this introduces one invented entity and one domain assumption about calibration. No explicit free parameters beyond the learned embedding are stated in the abstract.

free parameters (1)

special token embedding
The embedding of the appended special token is adjusted during fine-tuning to serve as the confidence reference.

axioms (1)

domain assumption: Fine-tuning on accuracy-associated targets will make similarity to the special-token embedding a reliable proxy for true model confidence.
Invoked when the abstract states the model is fine-tuned for calibrating the confidence with targets associated with accuracy.

invented entities (1)

special confidence token · no independent evidence
purpose: To provide a real-time confidence signal via embedding similarity to the last hidden state
Introduced as the core novel mechanism in the abstract; no independent evidence outside the paper is supplied.

pith-pipeline@v0.9.0 · 5742 in / 1427 out tokens · 57442 ms · 2026-05-18T17:50:10.444354+00:00 · methodology


Reference graph

Works this paper leans on

58 extracted references · 58 canonical work pages · 14 internal anchors

[1]

GPT-4 Technical Report

Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023
[3]

Claude 3: A conversational ai model

Anthropic. Claude 3: A conversational ai model. 2024

[4]

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025
[5]

Large language models encode clinical knowledge

Karan Singhal, Shekoofeh Azizi, Tao Tu, S Sara Mahdavi, Jason Wei, Hyung Won Chung, Nathan Scales, Ajay Tanwani, Heather Cole-Lewis, Stephen Pfohl, et al. Large language models encode clinical knowledge. Nature, 620(7972):172–180, 2023
[6]

Hallucination-free? assessing the reliability of leading ai legal research tools

Varun Magesh, Faiz Surani, Matthew Dahl, Mirac Suzgun, Christopher D Manning, and Daniel E Ho. Hallucination-free? assessing the reliability of leading ai legal research tools. arXiv preprint arXiv:2405.20362, 2024

[7]

Finben: A holistic financial benchmark for large language models

Qianqian Xie, Weiguang Han, Zhengyu Chen, Ruoyu Xiang, Xiao Zhang, Yueru He, Mengxi Xiao, Dong Li, Yongfu Dai, Duanyu Feng, et al. Finben: A holistic financial benchmark for large language models. Advances in Neural Information Processing Systems, 37:95716–95743, 2024
[8]

Language Models (Mostly) Know What They Know

Saurav Kadavath, Tom Conerly, Amanda Askell, Tom Henighan, Dawn Drain, Ethan Perez, Nicholas Schiefer, Zac Hatfield-Dodds, Nova DasSarma, Eli Tran-Johnson, et al. Language models (mostly) know what they know. arXiv preprint arXiv:2207.05221, 2022
[9]

On calibration of modern neural networks

Chuan Guo, Geoff Pleiss, Yu Sun, and Kilian Q Weinberger. On calibration of modern neural networks. In International conference on machine learning, pages 1321–1330. PMLR, 2017
[10]

Calibration of pre-trained transformers

Shrey Desai and Greg Durrett. Calibration of pre-trained transformers. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 295–302, 2020
[11]

Large language models are miscalibrated in-context learners

Chengzu Li, Han Zhou, Goran Glavaš, Anna Korhonen, and Ivan Vulić. Large language models are miscalibrated in-context learners. In Findings of the Association for Computational Linguistics: ACL 2025, pages 11575–11596, 2025
[12]

Large Language Monkeys: Scaling Inference Compute with Repeated Sampling

Bradley Brown, Jordan Juravsky, Ryan Ehrlich, Ronald Clark, Quoc V Le, Christopher Ré, and Azalia Mirhoseini. Large language monkeys: Scaling inference compute with repeated sampling. arXiv preprint arXiv:2407.21787, 2024
[13]

Atomic calibration of llms in long-form generations

Caiqi Zhang, Ruihan Yang, Zhisong Zhang, Xinting Huang, Sen Yang, Dong Yu, and Nigel Collier. Atomic calibration of llms in long-form generations. arXiv preprint arXiv:2410.13246, 2024
[14]

Calibrating large language models using their generations only

Dennis Thomas Ulmer, Martin Gubri, Hwaran Lee, Sangdoo Yun, and Seong Joon Oh. Calibrating large language models using their generations only. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics, pages 15440–15459. Association for Computational Linguistics, 2024
[15]

Large language models must be taught to know what they don’t know

Sanyam Kapoor, Nate Gruver, Manley Roberts, Katherine M Collins, Arka Pal, Umang Bhatt, Adrian Weller, Samuel Dooley, Micah Goldblum, and Andrew Gordon Wilson. Large language models must be taught to know what they don’t know. In The Thirty-eighth Annual Conference on Neural Information Processing Systems
[16]

Just ask for calibration: Strategies for eliciting calibrated confidence scores from language models fine-tuned with human feedback

Katherine Tian, Eric Mitchell, Allan Zhou, Archit Sharma, Rafael Rafailov, Huaxiu Yao, Chelsea Finn, and Christopher D Manning. Just ask for calibration: Strategies for eliciting calibrated confidence scores from language models fine-tuned with human feedback. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages ...
[17]

Can llms express their uncertainty? an empirical evaluation of confidence elicitation in llms

Miao Xiong, Zhiyuan Hu, Xinyang Lu, Yifei Li, Jie Fu, Junxian He, and Bryan Hooi. Can llms express their uncertainty? an empirical evaluation of confidence elicitation in llms. In The Twelfth International Conference on Learning Representations
[18]

Linguistic calibration of long-form generations

Neil Band, Xuechen Li, Tengyu Ma, and Tatsunori Hashimoto. Linguistic calibration of long-form generations. In Proceedings of the 41st International Conference on Machine Learning, pages 2732–2778, 2024
[19]

Logu: Long-form generation with uncertainty expressions

Ruihan Yang, Caiqi Zhang, Zhisong Zhang, Xinting Huang, Sen Yang, Nigel Collier, Dong Yu, and Deqing Yang. Logu: Long-form generation with uncertainty expressions. arXiv preprint arXiv:2410.14309, 2024
[20]

Uncle: Uncertainty expressions in long-form generation

Ruihan Yang, Caiqi Zhang, Zhisong Zhang, Xinting Huang, Dong Yu, Nigel Collier, and Deqing Yang. Uncle: Uncertainty expressions in long-form generation. arXiv preprint arXiv:2505.16922, 2025
[21]

Calibrating verbal uncertainty as a linear feature to reduce hallucinations

Ziwei Ji, Lei Yu, Yeskendir Koishekenov, Yejin Bang, Anthony Hartshorn, Alan Schelten, Cheng Zhang, Pascale Fung, and Nicola Cancedda. Calibrating verbal uncertainty as a linear feature to reduce hallucinations. arXiv preprint arXiv:2503.14477, 2025
[22]

Lora: Low-rank adaptation of large language models

Edward J Hu, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen, et al. Lora: Low-rank adaptation of large language models. In International Conference on Learning Representations
[23]

Semantic uncertainty: Linguistic invariances for uncertainty estimation in natural language generation

Lorenz Kuhn, Yarin Gal, and Sebastian Farquhar. Semantic uncertainty: Linguistic invariances for uncertainty estimation in natural language generation. In The Eleventh International Conference on Learning Representations
[24]

Get confused cautiously: Textual sequence memorization erasure with selective entropy maximization

Zhaohan Zhang, Ziquan Liu, and Ioannis Patras. Get confused cautiously: Textual sequence memorization erasure with selective entropy maximization. In Proceedings of the 31st International Conference on Computational Linguistics, pages 10924–10939, 2025
[25]

Softmax probabilities (mostly) predict large language model correctness on multiple-choice q&a

Benjamin Plaut, Khanh Nguyen, and Tu Trinh. Softmax probabilities (mostly) predict large language model correctness on multiple-choice q&a. arXiv e-prints, pages arXiv–2402, 2024
[26]

The internal state of an llm knows when it’s lying

Amos Azaria and Tom Mitchell. The internal state of an llm knows when it’s lying. In The 2023 Conference on Empirical Methods in Natural Language Processing
[27]

Semantic Entropy Probes: Robust and Cheap Hallucination Detection in LLMs

Jannik Kossen, Jiatong Han, Muhammed Razzak, Lisa Schut, Shreshth Malik, and Yarin Gal. Semantic entropy probes: Robust and cheap hallucination detection in llms. arXiv preprint arXiv:2406.15927, 2024
[28]

Enhancing language model factuality via activation-based confidence calibration and guided decoding

Xin Liu, Farima Fatahi Bayat, and Lu Wang. Enhancing language model factuality via activation-based confidence calibration and guided decoding. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 10436–10448, 2024
[29]

Uncertainty distillation: Teaching language models to express semantic confidence

Sophia Hager, David Mueller, Kevin Duh, and Nicholas Andrews. Uncertainty distillation: Teaching language models to express semantic confidence. arXiv preprint arXiv:2503.14749, 2025
[30]

I don’t know: Explicit modeling of uncertainty with an [idk] token

Roi Cohen, Konstantin Dobler, Eden Biran, and Gerard de Melo. I don’t know: Explicit modeling of uncertainty with an [idk] token. Advances in Neural Information Processing Systems, 37:10935–10958, 2024
[31]

Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters

Charlie Snell, Jaehoon Lee, Kelvin Xu, and Aviral Kumar. Scaling llm test-time compute optimally can be more effective than scaling model parameters. arXiv preprint arXiv:2408.03314, 2024
[32]

Self-consistency improves chain of thought reasoning in language models

Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc V Le, Ed H Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. Self-consistency improves chain of thought reasoning in language models. In The Eleventh International Conference on Learning Representations
[33]

Do NOT Think That Much for 2+3=? On the Overthinking of o1-Like LLMs

Xingyu Chen, Jiahao Xu, Tian Liang, Zhiwei He, Jianhui Pang, Dian Yu, Linfeng Song, Qiuzhi Liu, Mengfei Zhou, Zhuosheng Zhang, et al. Do not think that much for 2+3=? on the overthinking of o1-like llms. arXiv preprint arXiv:2412.21187, 2024
[34]

s1: Simple test-time scaling

Niklas Muennighoff, Zitong Yang, Weijia Shi, Xiang Lisa Li, Li Fei-Fei, Hannaneh Hajishirzi, Luke Zettlemoyer, Percy Liang, Emmanuel Candès, and Tatsunori Hashimoto. s1: Simple test-time scaling. arXiv preprint arXiv:2501.19393, 2025
[35]

Escape sky-high cost: Early-stopping self-consistency for multi-step reasoning

Yiwei Li, Peiwen Yuan, Shaoxiong Feng, Boyuan Pan, Xinglin Wang, Bin Sun, Heda Wang, and Kan Li. Escape sky-high cost: Early-stopping self-consistency for multi-step reasoning. In The Twelfth International Conference on Learning Representations

[36]

Let’s sample step by step: Adaptive-consistency for efficient reasoning and coding with llms

Pranjal Aggarwal, Aman Madaan, Yiming Yang, et al. Let’s sample step by step: Adaptive-consistency for efficient reasoning and coding with llms. In The 2023 Conference on Empirical Methods in Natural Language Processing
[37]

Scaling evaluation-time compute with reasoning models as process evaluators

Seungone Kim, Ian Wu, Jinu Lee, Xiang Yue, Seongyun Lee, Mingyeong Moon, Kiril Gashteovski, Carolin Lawrence, Julia Hockenmaier, Graham Neubig, et al. Scaling evaluation-time compute with reasoning models as process evaluators. arXiv preprint arXiv:2503.19877, 2025
[38]

Judging llm-as-a-judge with mt-bench and chatbot arena

Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, et al. Judging llm-as-a-judge with mt-bench and chatbot arena. Advances in Neural Information Processing Systems, 36:46595–46623, 2023
[39]

Confidence improves self-consistency in llms

Amir Taubenfeld, Tom Sheffer, Eran Ofek, Amir Feder, Ariel Goldstein, Zorik Gekhman, and Gal Yona. Confidence improves self-consistency in llms. arXiv preprint arXiv:2502.06233, 2025
[40]

Efficient test-time scaling via self-calibration

Chengsong Huang, Langlin Huang, Jixuan Leng, Jiacheng Liu, and Jiaxin Huang. Efficient test-time scaling via self-calibration. arXiv preprint arXiv:2503.00031, 2025
[41]

Deep Think with Confidence

Yichao Fu, Xuewei Wang, Yuandong Tian, and Jiawei Zhao. Deep think with confidence. arXiv preprint arXiv:2508.15260, 2025
[42]

Think before you speak: Training language models with pause tokens

Sachin Goyal, Ziwei Ji, Ankit Singh Rawat, Aditya Krishna Menon, Sanjiv Kumar, and Vaishnavh Nagarajan. Think before you speak: Training language models with pause tokens. In The Twelfth International Conference on Learning Representations

[43]

Xinyi Wang, Lucas Caccia, Oleksiy Ostapenko, Xingdi Yuan, William Yang Wang, and Alessandro Sordoni. Guiding language model reasoning with planning tokens. In First Conference on Language Modeling

[44]

Volodymyr Kuleshov and Percy S Liang. Calibrated structured prediction. Advances in Neural Information Processing Systems, 28, 2015

[45]

Andrey Malinin and Mark Gales. Uncertainty estimation in autoregressive structured prediction. In International Conference on Learning Representations

[46]

John Platt et al. Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. Advances in large margin classifiers, 10(3):61–74, 1999

[47]

Guangya Wan, Yuqi Wu, Jie Chen, and Sheng Li. Reasoning aware self-consistency: Leveraging reasoning paths for efficient llm sampling. In The 2025 Annual Conference of the Nations of the Americas Chapter of the ACL, 2025

[48]

Mandar Joshi, Eunsol Choi, Daniel S Weld, and Luke Zettlemoyer. Triviaqa: A large scale distantly supervised challenge dataset for reading comprehension. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1601–1611, 2017.

[49]

Johannes Welbl, Nelson F Liu, and Matt Gardner. Crowdsourcing multiple choice science questions. In Proceedings of the 3rd Workshop on Noisy User-generated Text, pages 94–106, 2017

[50]

Chin-Yew Lin. Rouge: A package for automatic evaluation of summaries. In Text summarization branches out, pages 74–81, 2004

[51]

Marah Abdin, Jyoti Aneja, Hany Awadalla, Ahmed Awadallah, Ammar Ahmad Awan, Nguyen Bach, Amit Bahree, Arash Bakhtiari, Jianmin Bao, Harkirat Behl, et al. Phi-3 technical report: A highly capable language model locally on your phone. arXiv preprint arXiv:2404.14219, 2024

[52]

Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023

[53]

Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. The llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024

[54]

Mahdi Pakdaman Naeini, Gregory Cooper, and Milos Hauskrecht. Obtaining well calibrated probabilities using bayesian binning. In Proceedings of the AAAI conference on artificial intelligence, volume 29, 2015

[55]

Glenn W Brier. Verification of forecasts expressed in terms of probability. Monthly weather review, 78(1):1–3, 1950

[56]

An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, et al. Qwen2.5 technical report. arXiv preprint arXiv:2412.15115, 2024

[57]

Aida Amini, Saadia Gabriel, Shanchuan Lin, Rik Koncel-Kedziorski, Yejin Choi, and Hannaneh Hajishirzi. Mathqa: Towards interpretable math word problem solving with operation-based formalisms. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and ...

[58]

Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. Think you have solved question answering? try arc, the ai2 reasoning challenge. arXiv preprint arXiv:1803.05457, 2018

[59]

Hexiang Tan, Fei Sun, Sha Liu, Du Su, Qi Cao, Xin Chen, Jingang Wang, Xunliang Cai, Yuanzhuo Wang, Huawei Shen, et al. Too consistent to detect: A study of self-consistent errors in llms. arXiv preprint arXiv:2505.17656, 2025.

A Implementation Details

A.1 Prompt Template

We elaborate on the prompt templates used for open-ended generation and test-time s...
