On Verbalized Confidence Scores for LLMs
Pith reviewed 2026-05-23 06:34 UTC · model grok-4.3
The pith
Certain prompt methods let LLMs output well-calibrated verbalized confidence scores.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that verbalized confidence scores, obtained by prompting an LLM to state its certainty as part of its output, can be well calibrated when the right prompt strategy is chosen. The evidence offered is consistent alignment between reported confidence and empirical accuracy across an extensive set of benchmarks.
What carries the argument
Verbalized confidence scores produced by the LLM itself in response to targeted prompts that request a numerical self-assessment of certainty.
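To make the mechanism concrete, here is a minimal Python sketch of eliciting and parsing a verbalized confidence score. The prompt wording, the `Answer:`/`Confidence:` output format, and the names `PROMPT_SUFFIX` and `parse_response` are illustrative assumptions; they are not the prompt methods evaluated in the paper.

```python
import re
from typing import Optional, Tuple

# Hypothetical instruction appended to a question. The paper's actual
# prompt methods are not reproduced here.
PROMPT_SUFFIX = (
    "Answer the question, then state how confident you are that the answer "
    "is correct as a percentage.\n"
    "Format:\nAnswer: <answer>\nConfidence: <0-100>%"
)

_ANSWER_RE = re.compile(r"^Answer:\s*(.+?)\s*$", re.MULTILINE)
_CONF_RE = re.compile(r"^Confidence:\s*(\d+(?:\.\d+)?)\s*%?\s*$", re.MULTILINE)


def parse_response(text: str) -> Tuple[Optional[str], Optional[float]]:
    """Return (answer, confidence in [0, 1]).

    A field is None when it is missing or, for the confidence,
    when the stated value falls outside 0-100%.
    """
    answer_match = _ANSWER_RE.search(text)
    conf_match = _CONF_RE.search(text)

    answer = answer_match.group(1) if answer_match else None

    confidence = None
    if conf_match:
        value = float(conf_match.group(1)) / 100.0
        if 0.0 <= value <= 1.0:
            confidence = value
    return answer, confidence


if __name__ == "__main__":
    print(parse_response("Answer: Paris\nConfidence: 85%"))
```

Returning `None` for a malformed or out-of-range score, rather than clamping it, keeps unparseable outputs visible so they can be counted or excluded explicitly during evaluation.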
If this is right
- Verbalized scores can function as a prompt- and model-agnostic method for uncertainty quantification.
- LLM agents can use these scores to make more informed decisions when interacting with each other.
- Human users can place greater trust in responses that include reliable self-reported confidence.
- The approach avoids the overhead of logit inspection or response sampling.
Where Pith is reading between the lines
- Integration into standard chat interfaces could occur with minimal system changes.
- The method might extend naturally to multi-turn conversations where confidence evolves.
- Further checks on out-of-distribution inputs could clarify the boundary of reliable verbalization.
Load-bearing premise
The benchmark datasets and evaluation metrics used are representative of the uncertainty that matters in downstream LLM applications.
What would settle it
Finding that the same prompt methods produce poorly calibrated scores on a new task domain or dataset outside the evaluated benchmarks would falsify the central claim.
Original abstract
The rise of large language models (LLMs) and their tight integration into our daily life make it essential to dedicate efforts towards their trustworthiness. Uncertainty quantification for LLMs can establish more human trust into their responses, but also allows LLM agents to make more informed decisions based on each other's uncertainty. To estimate the uncertainty in a response, internal token logits, task-specific proxy models, or sampling of multiple responses are commonly used. This work focuses on asking the LLM itself to verbalize its uncertainty with a confidence score as part of its output tokens, which is a promising way for prompt- and model-agnostic uncertainty quantification with low overhead. Using an extensive benchmark, we assess the reliability of verbalized confidence scores with respect to different datasets, models, and prompt methods. Our results reveal that the reliability of these scores strongly depends on how the model is asked, but also that it is possible to extract well-calibrated confidence scores with certain prompt methods. We argue that verbalized confidence scores can become a simple but effective and versatile uncertainty quantification method in the future. Our code is available at https://github.com/danielyxyang/llm-verbalized-uq.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript evaluates verbalized confidence scores produced by LLMs across multiple models, datasets (primarily QA and classification tasks), and prompting strategies. It concludes that certain prompt methods can yield well-calibrated scores, positioning verbalized confidence as a low-overhead, prompt- and model-agnostic uncertainty quantification technique, with code released for reproducibility.
Significance. If the calibration results prove robust, the work offers a practical alternative to logit-based or sampling-based UQ methods for LLM trustworthiness and agentic decision-making. The public code release is a clear strength that supports verification and extension.
Major comments (2)
- [Abstract; evaluation sections (likely §4–5)] The central claim that certain prompt methods produce well-calibrated verbalized scores rests on results from standard benchmarks (QA, classification, reasoning). These datasets may under-represent the ambiguity, distribution shift, and multi-step dependencies typical of downstream LLM-agent applications; without explicit tests on such tasks, the observed calibration may not transfer (see skeptic concern on benchmark representativeness).
- [Abstract] The abstract mentions an 'extensive benchmark' but provides no details on calibration metrics (e.g., ECE definition), statistical significance tests, or data exclusion rules. This prevents verification of soundness from the provided text and makes it impossible to assess whether the reported calibration improvements are statistically reliable or sensitive to evaluation choices.
Minor comments (2)
- [Methods] Notation for prompt variants and confidence verbalization formats should be standardized in a table for clarity.
- [Results figures] Figure captions could more explicitly link plotted calibration curves to the specific prompt methods and datasets used.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback and the recommendation for major revision. We address each major comment below with clarifications and proposed revisions where appropriate.
- Referee: [Abstract; evaluation sections (likely §4–5)] The central claim that certain prompt methods produce well-calibrated verbalized scores rests on results from standard benchmarks (QA, classification, reasoning). These datasets may under-represent the ambiguity, distribution shift, and multi-step dependencies typical of downstream LLM-agent applications; without explicit tests on such tasks, the observed calibration may not transfer (see skeptic concern on benchmark representativeness).
  Authors: We agree that the evaluated benchmarks (primarily QA and classification tasks) do not fully capture the ambiguity, distribution shifts, or multi-step dependencies common in LLM-agent applications. Our work establishes that certain prompting strategies can yield well-calibrated verbalized scores on these standard tasks as a controlled baseline. We will add a limitations paragraph in the discussion section explicitly noting this scope and recommending future evaluations on agentic tasks to assess transfer.
  Revision: partial
- Referee: [Abstract] The abstract mentions an 'extensive benchmark' but provides no details on calibration metrics (e.g., ECE definition), statistical significance tests, or data exclusion rules. This prevents verification of soundness from the provided text and makes it impossible to assess whether the reported calibration improvements are statistically reliable or sensitive to evaluation choices.
  Authors: The abstract is high-level by design, with full details on metrics (ECE defined and computed per Section 3), evaluation procedures, and data handling provided in the methods and results sections. To improve accessibility, we will revise the abstract to briefly name the primary metric (Expected Calibration Error) and direct readers to the relevant sections for definitions, statistical considerations, and exclusion criteria.
  Revision: yes
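Since the exchange above turns on how Expected Calibration Error is defined, a minimal Python sketch of the common equal-width-bin formulation may help: ECE = sum over bins B of (|B| / n) * |accuracy(B) - mean confidence(B)|. The bin count, the bin boundaries, and the function name `expected_calibration_error` are assumptions made for illustration; the paper's own definition is in its Section 3 and is not reproduced here.

```python
from typing import Sequence


def expected_calibration_error(
    confidences: Sequence[float],
    correct: Sequence[bool],
    n_bins: int = 10,
) -> float:
    """Equal-width-bin ECE for confidences in [0, 1].

    Assumption: bins are [k/n_bins, (k+1)/n_bins), with a confidence of
    exactly 1.0 placed in the last bin. Other binning conventions exist.
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    n = len(confidences)
    if n == 0:
        raise ValueError("at least one sample is required")

    conf_sum = [0.0] * n_bins   # sum of confidences per bin
    hits = [0] * n_bins         # number of correct answers per bin
    count = [0] * n_bins        # number of samples per bin

    for c, ok in zip(confidences, correct):
        if not 0.0 <= c <= 1.0:
            raise ValueError(f"confidence out of range: {c}")
        idx = min(int(c * n_bins), n_bins - 1)
        conf_sum[idx] += c
        hits[idx] += int(ok)
        count[idx] += 1

    ece = 0.0
    for b in range(n_bins):
        if count[b] == 0:
            continue  # empty bins contribute nothing
        accuracy = hits[b] / count[b]
        mean_conf = conf_sum[b] / count[b]
        ece += (count[b] / n) * abs(accuracy - mean_conf)
    return ece


if __name__ == "__main__":
    # Four answers stated at 90% confidence, three of them correct.
    print(expected_calibration_error([0.9] * 4, [True, True, True, False]))
```

In the usage example, all four samples land in one bin with accuracy 0.75 and mean confidence 0.9, so the result is approximately 0.15. Because the value depends on the number of bins and on how out-of-range or unparseable scores are handled, those choices are exactly the kind of detail the referee asks the authors to state.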
Circularity Check
No derivation chain present; purely empirical evaluation
The paper reports results from an extensive benchmark study comparing verbalized confidence scores under different prompt methods, models, and datasets. No mathematical derivations, fitted parameters, or load-bearing self-citations are used to establish the central claim. All reported outcomes are direct empirical measurements (e.g., calibration metrics on held-out benchmarks) that do not reduce to quantities defined inside the paper itself. The evaluation therefore rests on external data rather than on the paper's own constructions.
Forward citations
Cited by 14 Pith papers
- SciPredict: Can LLMs Predict the Outcomes of Scientific Experiments in Natural Sciences?
  LLMs predict outcomes of real scientific experiments at 14-26% accuracy, comparable to human experts, but lack calibration on prediction reliability while humans demonstrate strong calibration.
- PromptNCE: Pointwise Mutual Information Predictions Using Only LLMs and Contrastive Estimation Prompts
  PromptNCE frames LLM conditional probability estimation as contrastive prompting augmented with an OTHER category, recovering true P(y|x) and achieving up to 0.82 Spearman correlation with human-derived PMI on three datasets.
- ASPI: Seeking Ambiguity Clarification Amplifies Prompt Injection Vulnerability in LLM Agents
  Clarification-seeking in LLM agents amplifies prompt injection attack success from ~2% to over 30% across ten frontier models in a new 728-scenario benchmark.
- LLMs Know When They Know, but Do Not Act on It: A Metacognitive Harness for Test-time Scaling
  A metacognitive harness uses LLMs' pre- and post-solution self-monitoring signals to control test-time reasoning, raising pooled accuracy from 48.3% to 56.9% on text, code, and multimodal benchmarks.
- Verbal Confidence Saturation in 3-9B Open-Weight Instruction-Tuned LLMs: A Pre-Registered Psychometric Validity Screen
  Seven 3-9B instruction-tuned LLMs produce verbal confidence that saturates at high values and fails psychometric validity criteria for Type-2 discrimination under minimal elicitation.
- Calibration-Aware Policy Optimization for Reasoning LLMs
  CAPO improves LLM calibration by up to 15% while matching or exceeding GRPO accuracy through logistic AUC loss and noise masking, enabling better abstention and scaling performance.
- CLSGen: A Dual-Head Fine-Tuning Framework for Joint Probabilistic Classification and Verbalized Explanation
  CLSGen is a dual-head LLM fine-tuning framework that enables joint probabilistic classification and verbalized explanation generation without catastrophic forgetting of generative capabilities.
- Decoupling Reasoning and Confidence: Resurrecting Calibration in Reinforcement Learning from Verifiable Rewards
  DCPO decouples reasoning optimization from calibration in RLVR to fix overconfidence in LLMs without losing accuracy.
- Beyond Binary Rewards: Training LMs to Reason About Their Uncertainty
  RLCR augments standard RL rewards for LM reasoning with Brier scores on verbalized confidence, producing models that are both more accurate and better calibrated on in-domain and out-of-domain tasks.
- Textual Bayes: Quantifying Prompt Uncertainty in LLM-Based Systems
  Introduces a Bayesian framework viewing LLM prompts as textual parameters and proposes MHLP, a novel MCMC algorithm using LLM proposals, to perform inference and improve accuracy plus uncertainty quantification on benchmarks.
- Towards Trustworthy Report Generation: A Deep Research Agent with Progressive Confidence Estimation and Calibration
  A deep research agent incorporates progressive confidence estimation and calibration to produce trustworthy reports with transparent confidence scores on claims.
- Do Small Language Models Know When They're Wrong? Confidence-Based Cascade Scoring for Educational Assessment
  Verbalized confidence from small LMs enables cost-effective cascade routing for automated educational scoring, matching large-model accuracy at 76% lower cost when discrimination is strong.
- Seven simple steps for log analysis in AI systems
  A seven-step pipeline for log analysis in AI systems is outlined with code examples to support rigorous and reproducible evaluation of model capabilities and behaviors.
- Self-Reported Confidence of Large Language Models in Gastroenterology: Analysis of Commercial, Open-Source, and Quantized Models
  LLMs show improved accuracy on gastroenterology questions but remain overconfident in self-reported certainty across commercial, open-source, and quantized variants.