On Verbalized Confidence Scores for LLMs

Daniel Yang; Makoto Yamada; Yao-Hung Hubert Tsai

arxiv: 2412.14737 · v2 · submitted 2024-12-19 · 💻 cs.CL

Pith reviewed 2026-05-23 06:34 UTC · model grok-4.3

classification 💻 cs.CL

keywords verbalized confidence · uncertainty quantification · large language models · prompt methods · calibration · trustworthiness

The pith

Certain prompt methods let LLMs output well-calibrated verbalized confidence scores.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines whether large language models can assess their own uncertainty by including a numerical confidence score in their generated text. It evaluates this verbalization approach across multiple datasets, models, and prompting techniques to measure how well the reported scores align with actual response accuracy. Results indicate that calibration quality depends heavily on the specific prompt used, yet some methods produce scores that reliably reflect correctness. This matters because verbalized scores require no access to internal model states, extra sampling, or auxiliary models, offering a lightweight route to uncertainty estimates that could support better human trust and agent decision-making.

Core claim

The central claim is that verbalized confidence scores, obtained by directly prompting an LLM to state its certainty as part of its output, can be well-calibrated when the right prompt strategy is chosen, as demonstrated by consistent alignment between reported confidence levels and empirical accuracy across an extensive set of benchmarks.
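
To make "alignment between reported confidence levels and empirical accuracy" concrete, here is a minimal sketch of a binned expected calibration error (ECE) computation. It assumes equal-width bins and confidences already scaled to [0, 1]. The function name, the bin count, and the toy data are illustrative assumptions and are not taken from the paper, whose exact metric definition may differ.

```python
from typing import Sequence


def expected_calibration_error(
    confidences: Sequence[float],
    correct: Sequence[bool],
    n_bins: int = 10,
) -> float:
    """Binned ECE: weighted mean of |accuracy - mean confidence| over bins."""
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    n = len(confidences)
    if n == 0:
        raise ValueError("need at least one sample")

    # Group (confidence, correctness) pairs into equal-width bins on [0, 1].
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        if not 0.0 <= conf <= 1.0:
            raise ValueError(f"confidence {conf} is outside [0, 1]")
        # A confidence of exactly 1.0 is placed in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))

    ece = 0.0
    for members in bins:
        if not members:
            continue  # empty bins contribute nothing
        mean_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / n) * abs(accuracy - mean_conf)
    return ece


# Toy data (made up): four answers at confidence 0.9, three of them correct.
toy_conf = [0.9, 0.9, 0.9, 0.9]
toy_correct = [True, True, True, False]
toy_ece = expected_calibration_error(toy_conf, toy_correct)
```

In the toy data the model claims 90% confidence but is right 75% of the time, so the gap in the single occupied bin is about 0.15. A lower value indicates better calibration under this particular binning choice.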

What carries the argument

Verbalized confidence scores produced by the LLM itself in response to targeted prompts that request a numerical self-assessment of certainty.
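
As an illustration of the mechanism, the sketch below builds a prompt that asks for an answer plus a numerical confidence and then parses the score out of the generated text. The prompt wording, the `Answer:` / `Confidence:` output format, and the helper names are hypothetical and are not the paper's prompt methods. The model call itself is omitted, so a hand-written response string stands in for real model output.

```python
import re
from typing import Optional, Tuple

# Hypothetical prompt template; the paper's actual prompt methods may differ.
PROMPT_TEMPLATE = (
    "Answer the question, then state how confident you are that the answer "
    "is correct as a number between 0 and 100.\n"
    "Question: {question}\n"
    "Respond in the format:\n"
    "Answer: <answer>\n"
    "Confidence: <number>"
)

# Match each field on its own line; [ \t]* avoids matching across newlines.
ANSWER_RE = re.compile(r"^Answer:[ \t]*(.+)$", re.MULTILINE)
CONFIDENCE_RE = re.compile(
    r"^Confidence:[ \t]*(\d+(?:\.\d+)?)[ \t]*%?[ \t]*$", re.MULTILINE
)


def build_prompt(question: str) -> str:
    """Fill the question into the (assumed) prompt template."""
    return PROMPT_TEMPLATE.format(question=question)


def parse_response(text: str) -> Optional[Tuple[str, float]]:
    """Return (answer, confidence in [0, 1]), or None if the format is invalid."""
    answer = ANSWER_RE.search(text)
    confidence = CONFIDENCE_RE.search(text)
    if answer is None or confidence is None:
        return None
    score = float(confidence.group(1))
    if not 0.0 <= score <= 100.0:
        return None  # out-of-range scores are treated as invalid
    return answer.group(1).strip(), score / 100.0


# Stand-in for model output; no real model is called here.
example_response = "Answer: Paris\nConfidence: 85"
parsed = parse_response(example_response)
```

Returning `None` for responses that do not follow the requested format keeps invalid outputs separate from low-confidence ones, which matters when the share of valid responses varies by model and prompt.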

If this is right

Verbalized scores can function as a prompt- and model-agnostic method for uncertainty quantification.
LLM agents can use these scores to make more informed decisions when interacting with each other.
Human users can place greater trust in responses that include reliable self-reported confidence.
The approach avoids the overhead of logit inspection or response sampling.
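
One way an agent or application might act on such scores is selective answering: accept a response only when its verbalized confidence clears a threshold, and otherwise defer. The sketch below is a hypothetical illustration, not something described in the paper. The `Response` type, the threshold value, and the deferral policy are all assumptions, and the policy is only meaningful if the scores are reasonably calibrated.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Response:
    answer: str
    confidence: float  # verbalized confidence scaled to [0, 1]


def select_answer(responses: List[Response], threshold: float = 0.8) -> Optional[str]:
    """Return the most confident answer if it clears the threshold, else None."""
    if not responses:
        return None
    best = max(responses, key=lambda r: r.confidence)
    # None signals "defer", e.g. to a human or a stronger model.
    return best.answer if best.confidence >= threshold else None


candidates = [Response("Paris", 0.95), Response("Lyon", 0.40)]
chosen = select_answer(candidates)
deferred = select_answer([Response("Lyon", 0.40)])
```

Here `chosen` is the high-confidence answer and `deferred` is `None` because the only candidate falls below the assumed threshold of 0.8.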

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Integration into standard chat interfaces could occur with minimal system changes.
The method might extend naturally to multi-turn conversations where confidence evolves.
Further checks on out-of-distribution inputs could clarify the boundary of reliable verbalization.

Load-bearing premise

The benchmark datasets and evaluation metrics used are representative of the uncertainty that matters in downstream LLM applications.

What would settle it

Finding that the same prompt methods produce poorly calibrated scores on a new task domain or dataset outside the evaluated benchmarks would falsify the central claim.

Figures

Figures reproduced from arXiv: 2412.14737 by Daniel Yang, Makoto Yamada, Yao-Hung Hubert Tsai.

**Figure 1.** Uncertainty quantification for LLMs. Existing methods usually quantify the uncertainty based on the consistency of multiple sampled responses (Kuhn et al., 2022; Lin et al., 2023; Manakul et al., 2023; Tanneru et al., 2023; Xiong et al., 2023) or the internal token logits (Kadavath et al., 2022; Si et al., 2022; Ye et al., 2024). These approaches essentially let the LLM self-assess its uncertainty based…

**Figure 2.** Different uncertainty quantification methods for LLMs.

**Figure 6.** Relative number of valid responses over all datasets per model and prompt method.

**Figure 8.** Calibration diagram for gemma1.1-2b. The color intensity of each bar is proportional to the bin size on a log scale. Note that the accuracy is close to uniform regardless of which range of confidence scores it is conditioned on.

Original abstract

The rise of large language models (LLMs) and their tight integration into our daily life make it essential to dedicate efforts towards their trustworthiness. Uncertainty quantification for LLMs can establish more human trust into their responses, but also allows LLM agents to make more informed decisions based on each other's uncertainty. To estimate the uncertainty in a response, internal token logits, task-specific proxy models, or sampling of multiple responses are commonly used. This work focuses on asking the LLM itself to verbalize its uncertainty with a confidence score as part of its output tokens, which is a promising way for prompt- and model-agnostic uncertainty quantification with low overhead. Using an extensive benchmark, we assess the reliability of verbalized confidence scores with respect to different datasets, models, and prompt methods. Our results reveal that the reliability of these scores strongly depends on how the model is asked, but also that it is possible to extract well-calibrated confidence scores with certain prompt methods. We argue that verbalized confidence scores can become a simple but effective and versatile uncertainty quantification method in the future. Our code is available at https://github.com/danielyxyang/llm-verbalized-uq.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

Prompt choice matters for calibrated verbalized confidence in LLMs, but benchmark representativeness is the open question.

The one or two things to know: this paper shows through extensive tests that the way you prompt an LLM for a confidence score strongly affects how well that score matches its actual accuracy, and that some prompts do produce well-calibrated results. The value for practice depends on how representative the test tasks are.

They compare multiple models and datasets with different prompt variations for verbalizing confidence. The results indicate prompt dependence but also that good calibration is achievable without extra compute or models. Making the code available is helpful for anyone wanting to try it. This adds to the literature by providing a wider set of comparisons than earlier studies on the same idea. It's a straightforward empirical exercise with no fitted parameters or derivations.

The soft spots are around the evaluation. The abstract mentions an extensive benchmark but gives no specifics on the calibration measures or how they handled the data. More importantly, the datasets are the usual suspects for QA and such. If those don't capture the uncertainty patterns in actual LLM agent use or under distribution shift, then the positive calibration findings might not carry over. The stress-test concern seems valid based on what's described.

The paper is for applied researchers who want simple ways to get uncertainty from LLMs. Someone looking for model-agnostic methods would get concrete ideas from the prompt comparisons. It deserves a serious referee because the topic matters for trustworthiness and the setup is clear enough to review properly, even if more details on methods would strengthen it. Recommendation: yes, send for peer review after checking the full methods section.

Referee Report

2 major / 2 minor

Summary. The manuscript evaluates verbalized confidence scores produced by LLMs across multiple models, datasets (primarily QA and classification tasks), and prompting strategies. It concludes that certain prompt methods can yield well-calibrated scores, positioning verbalized confidence as a low-overhead, prompt- and model-agnostic uncertainty quantification technique, with code released for reproducibility.

Significance. If the calibration results prove robust, the work offers a practical alternative to logit-based or sampling-based UQ methods for LLM trustworthiness and agentic decision-making. The public code release is a clear strength that supports verification and extension.

major comments (2)

[Abstract; evaluation sections (likely §4–5)] The central claim that certain prompt methods produce well-calibrated verbalized scores rests on results from standard benchmarks (QA, classification, reasoning). These datasets may under-represent the ambiguity, distribution shift, and multi-step dependencies typical of downstream LLM-agent applications; without explicit tests on such tasks, the observed calibration may not transfer (see skeptic concern on benchmark representativeness).
[Abstract] Abstract states an 'extensive benchmark' but provides no details on calibration metrics (e.g., ECE definition), statistical significance tests, or data exclusion rules. This prevents verification of soundness from the provided text and makes it impossible to assess whether the reported calibration improvements are statistically reliable or sensitive to evaluation choices.

minor comments (2)

[Methods] Notation for prompt variants and confidence verbalization formats should be standardized in a table for clarity.
[Results figures] Figure captions could more explicitly link plotted calibration curves to the specific prompt methods and datasets used.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and the recommendation for major revision. We address each major comment below with clarifications and proposed revisions where appropriate.

Referee: [Abstract; evaluation sections (likely §4–5)] The central claim that certain prompt methods produce well-calibrated verbalized scores rests on results from standard benchmarks (QA, classification, reasoning). These datasets may under-represent the ambiguity, distribution shift, and multi-step dependencies typical of downstream LLM-agent applications; without explicit tests on such tasks, the observed calibration may not transfer (see skeptic concern on benchmark representativeness).

Authors: We agree that the evaluated benchmarks (primarily QA and classification tasks) do not fully capture the ambiguity, distribution shifts, or multi-step dependencies common in LLM-agent applications. Our work establishes that certain prompting strategies can yield well-calibrated verbalized scores on these standard tasks as a controlled baseline. We will add a limitations paragraph in the discussion section explicitly noting this scope and recommending future evaluations on agentic tasks to assess transfer. revision: partial
Referee: [Abstract] Abstract states an 'extensive benchmark' but provides no details on calibration metrics (e.g., ECE definition), statistical significance tests, or data exclusion rules. This prevents verification of soundness from the provided text and makes it impossible to assess whether the reported calibration improvements are statistically reliable or sensitive to evaluation choices.

Authors: The abstract is high-level by design, with full details on metrics (ECE defined and computed per Section 3), evaluation procedures, and data handling provided in the methods and results sections. To improve accessibility, we will revise the abstract to briefly name the primary metric (Expected Calibration Error) and direct readers to the relevant sections for definitions, statistical considerations, and exclusion criteria. revision: yes
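
For reference, the commonly used binned form of the Expected Calibration Error named in the exchange above can be written as follows. This is the standard definition, stated here as an assumption; the exact variant the paper defines in its Section 3 is not reproduced in this review and may differ in binning or weighting.

```latex
\mathrm{ECE} \;=\; \sum_{m=1}^{M} \frac{|B_m|}{n}\,
  \bigl|\operatorname{acc}(B_m) - \operatorname{conf}(B_m)\bigr|
```

Here $n$ is the number of evaluated responses, $B_1, \dots, B_M$ are the confidence bins, $\operatorname{acc}(B_m)$ is the fraction of correct responses in bin $B_m$, and $\operatorname{conf}(B_m)$ is the mean verbalized confidence in that bin.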

Circularity Check

0 steps flagged

No derivation chain present; purely empirical evaluation

The paper reports results from an extensive benchmark study comparing verbalized confidence scores under different prompt methods, models, and datasets. No mathematical derivations, fitted parameters, or load-bearing self-citations are used to establish the central claim. All reported outcomes are direct empirical measurements (e.g., calibration metrics on held-out benchmarks) that do not reduce to quantities defined inside the paper itself. The evaluation is therefore self-contained against external data.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No free parameters, axioms, or invented entities are introduced; the work relies on standard empirical benchmarking practices in machine learning.

Forward citations

Cited by 14 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

SciPredict: Can LLMs Predict the Outcomes of Scientific Experiments in Natural Sciences?
cs.AI · 2026-04 · unverdicted · novelty 7.0
LLMs predict outcomes of real scientific experiments at 14-26% accuracy, comparable to human experts, but lack calibration on prediction reliability while humans demonstrate strong calibration.

PromptNCE: Pointwise Mutual Information Predictions Using Only LLMs and Contrastive Estimation Prompts
cs.CL · 2026-05 · unverdicted · novelty 6.0
PromptNCE frames LLM conditional probability estimation as contrastive prompting augmented with an OTHER category, recovering true P(y|x) and achieving up to 0.82 Spearman correlation with human-derived PMI on three datasets.

ASPI: Seeking Ambiguity Clarification Amplifies Prompt Injection Vulnerability in LLM Agents
cs.CR · 2026-05 · conditional · novelty 6.0
Clarification-seeking in LLM agents amplifies prompt injection attack success from ~2% to over 30% across ten frontier models in a new 728-scenario benchmark.

LLMs Know When They Know, but Do Not Act on It: A Metacognitive Harness for Test-time Scaling
cs.LG · 2026-05 · conditional · novelty 6.0
A metacognitive harness uses LLMs' pre- and post-solution self-monitoring signals to control test-time reasoning, raising pooled accuracy from 48.3% to 56.9% on text, code, and multimodal benchmarks.

Verbal Confidence Saturation in 3-9B Open-Weight Instruction-Tuned LLMs: A Pre-Registered Psychometric Validity Screen
cs.CL · 2026-04 · conditional · novelty 6.0
Seven 3-9B instruction-tuned LLMs produce verbal confidence that saturates at high values and fails psychometric validity criteria for Type-2 discrimination under minimal elicitation.

Calibration-Aware Policy Optimization for Reasoning LLMs
cs.LG · 2026-04 · unverdicted · novelty 6.0
CAPO improves LLM calibration by up to 15% while matching or exceeding GRPO accuracy through logistic AUC loss and noise masking, enabling better abstention and scaling performance.

CLSGen: A Dual-Head Fine-Tuning Framework for Joint Probabilistic Classification and Verbalized Explanation
cs.CL · 2026-04 · unverdicted · novelty 6.0
CLSGen is a dual-head LLM fine-tuning framework that enables joint probabilistic classification and verbalized explanation generation without catastrophic forgetting of generative capabilities.

Decoupling Reasoning and Confidence: Resurrecting Calibration in Reinforcement Learning from Verifiable Rewards
cs.LG · 2026-03 · unverdicted · novelty 6.0
DCPO decouples reasoning optimization from calibration in RLVR to fix overconfidence in LLMs without losing accuracy.

Beyond Binary Rewards: Training LMs to Reason About Their Uncertainty
cs.LG · 2025-07 · conditional · novelty 6.0
RLCR augments standard RL rewards for LM reasoning with Brier scores on verbalized confidence, producing models that are both more accurate and better calibrated on in-domain and out-of-domain tasks.

Textual Bayes: Quantifying Prompt Uncertainty in LLM-Based Systems
cs.LG · 2025-06 · unverdicted · novelty 6.0
Introduces a Bayesian framework viewing LLM prompts as textual parameters and proposes MHLP, a novel MCMC algorithm using LLM proposals, to perform inference and improve accuracy plus uncertainty quantification on benchmarks.

Towards Trustworthy Report Generation: A Deep Research Agent with Progressive Confidence Estimation and Calibration
cs.AI · 2026-04 · unverdicted · novelty 4.0
A deep research agent incorporates progressive confidence estimation and calibration to produce trustworthy reports with transparent confidence scores on claims.

Do Small Language Models Know When They're Wrong? Confidence-Based Cascade Scoring for Educational Assessment
cs.CY · 2026-03 · unverdicted · novelty 4.0
Verbalized confidence from small LMs enables cost-effective cascade routing for automated educational scoring, matching large-model accuracy at 76% lower cost when discrimination is strong.

Seven simple steps for log analysis in AI systems
cs.AI · 2026-02 · unverdicted · novelty 4.0
A seven-step pipeline for log analysis in AI systems is outlined with code examples to support rigorous and reproducible evaluation of model capabilities and behaviors.

Self-Reported Confidence of Large Language Models in Gastroenterology: Analysis of Commercial, Open-Source, and Quantized Models
cs.CL · 2025-03 · unverdicted · novelty 4.0
LLMs show improved accuracy on gastroenterology questions but remain overconfident in self-reported certainty across commercial, open-source, and quantized variants.

Reference graph

Works this paper leans on

2 extracted references · 2 canonical work pages · cited by 14 Pith papers · 2 internal anchors

[1] The Falcon Series of Open Language Models

AI@Meta (2024). Llama 3 Model Card. URL: https://github.com/meta-llama/llama3/blob/main/MODEL_CARD.md, visited on 09/17/2024.
Almazrouei, E., Alobeidli, H., Alshamsi, A., Cappelli, A., Cojocaru, R., Debbah, M., et al. (2023). The Falcon Series of Open Language Models. arXiv: 2311.16867.
Bai, J., Bai, S., Chu, Y., Cui, Z., Dang, K., Deng, X., et al. (20...

[2] Language Models (Mostly) Know What They Know

Kadavath, S., Conerly, T., Askell, A., Henighan, T., Drain, D., Perez, E., et al. (2022). Language Models (Mostly) Know What They Know. arXiv: 2207.05221.
Klingbeil, A., Grützner, C., and Schreck, P. (2024). Trust and Reliance on AI — An Experimental Study on the Extent and Costs of Overreliance on AI. Computers in Human Behavior 160, page 108352.
Kuhn, L...
