On Verbalized Confidence Scores for LLMs
Abstract
The rise of large language models (LLMs) and their tight integration into our daily lives make it essential to dedicate effort to their trustworthiness. Uncertainty quantification for LLMs not only builds human trust in their responses but also allows LLM agents to make more informed decisions based on each other's uncertainty. To estimate the uncertainty in a response, internal token logits, task-specific proxy models, or sampling of multiple responses are commonly used. This work focuses on asking the LLM itself to verbalize its uncertainty as a confidence score within its output tokens, a promising, low-overhead route to prompt- and model-agnostic uncertainty quantification. Using an extensive benchmark, we assess the reliability of verbalized confidence scores across different datasets, models, and prompt methods. Our results reveal that the reliability of these scores strongly depends on how the model is asked, but also that certain prompt methods can extract well-calibrated confidence scores. We argue that verbalized confidence scores can become a simple, effective, and versatile uncertainty quantification method in the future. Our code is available at https://github.com/danielyxyang/llm-verbalized-uq.
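The abstract's two core ingredients, eliciting a confidence score in the output tokens and checking its calibration, are easy to sketch. Below is a minimal, hypothetical illustration: the prompt wording, the parse_confidence helper, and the toy data are assumptions for illustration (the paper benchmarks many prompt variants; see the linked repository for the actual code), and the calibration check uses standard binned expected calibration error (ECE).

```python
import re
import numpy as np

# Hypothetical prompt in the spirit of the paper: ask the model to verbalize
# a confidence score as part of its output tokens. The exact wording is one
# of many prompt variants the paper benchmarks.
PROMPT = (
    "Answer the question, then state your confidence that the answer is "
    "correct as a number between 0.0 and 1.0.\n"
    "Question: {question}\nAnswer: <answer>\nConfidence: <score>"
)

CONF_RE = re.compile(r"confidence:\s*([01](?:\.\d+)?)", re.IGNORECASE)

def parse_confidence(completion: str):
    """Extract the verbalized confidence score from a model completion."""
    m = CONF_RE.search(completion)
    if m is None:
        return None  # model ignored the format; count as a parsing failure
    return min(max(float(m.group(1)), 0.0), 1.0)  # clamp to [0, 1]

def expected_calibration_error(confs, correct, n_bins=10):
    """Standard binned ECE: |accuracy - mean confidence| weighted by bin size."""
    confs = np.asarray(confs, dtype=float)
    correct = np.asarray(correct, dtype=float)
    bins = np.minimum((confs * n_bins).astype(int), n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        mask = bins == b
        if mask.any():
            ece += mask.mean() * abs(correct[mask].mean() - confs[mask].mean())
    return ece

# Toy example with canned completions standing in for real LLM calls.
completions = ["Answer: Paris\nConfidence: 0.9", "Answer: 7\nConfidence: 0.6"]
confs = [parse_confidence(c) for c in completions]
correct = [1, 0]  # ground-truth correctness of each answer
print(confs, expected_calibration_error(confs, correct))
```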
This paper has not been read by Pith yet.
Forward citations
Cited by 8 Pith papers
- SciPredict: Can LLMs Predict the Outcomes of Scientific Experiments in Natural Sciences?
  LLMs predict outcomes of real scientific experiments at 14-26% accuracy, comparable to human experts, but lack calibration on prediction reliability, while humans demonstrate strong calibration.
- LLMs Know When They Know, but Do Not Act on It: A Metacognitive Harness for Test-time Scaling
  A metacognitive harness uses LLMs' pre- and post-solution self-monitoring signals to control test-time reasoning, raising pooled accuracy from 48.3% to 56.9% on text, code, and multimodal benchmarks.
- Verbal Confidence Saturation in 3-9B Open-Weight Instruction-Tuned LLMs: A Pre-Registered Psychometric Validity Screen
  Seven 3-9B instruction-tuned LLMs produce verbal confidence that saturates at high values and fails psychometric validity criteria for Type-2 discrimination under minimal elicitation.
- Calibration-Aware Policy Optimization for Reasoning LLMs
  CAPO improves LLM calibration by up to 15% while matching or exceeding GRPO accuracy through a logistic AUC loss and noise masking, enabling better abstention and scaling performance.
- CLSGen: A Dual-Head Fine-Tuning Framework for Joint Probabilistic Classification and Verbalized Explanation
  CLSGen is a dual-head LLM fine-tuning framework that enables joint probabilistic classification and verbalized explanation generation without catastrophic forgetting of generative capabilities.
- Decoupling Reasoning and Confidence: Resurrecting Calibration in Reinforcement Learning from Verifiable Rewards
  DCPO decouples reasoning optimization from calibration in RLVR to fix overconfidence in LLMs without losing accuracy.
- Towards Trustworthy Report Generation: A Deep Research Agent with Progressive Confidence Estimation and Calibration
  A deep research agent incorporates progressive confidence estimation and calibration to produce trustworthy reports with transparent confidence scores on claims.
- Do Small Language Models Know When They're Wrong? Confidence-Based Cascade Scoring for Educational Assessment
  Verbalized confidence from small LMs enables cost-effective cascade routing for automated educational scoring, matching large-model accuracy at 76% lower cost when discrimination is strong (a minimal sketch of such a cascade follows this list).
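The confidence-gated cascade described in the last entry can be sketched in a few lines. Everything here is a hypothetical stand-in: the model names, the threshold value, and the ask function are illustrative assumptions, not that paper's actual system.

```python
# Hypothetical confidence-based cascade: route to a large model only when the
# small model's verbalized confidence falls below a threshold.
THRESHOLD = 0.8  # illustrative cutoff; real systems would tune this

def ask(model: str, question: str) -> tuple[str, float]:
    """Stand-in for an LLM call returning (answer, verbalized confidence)."""
    return ("stub answer", 0.5 if model == "small-lm" else 0.9)

def cascade(question: str) -> str:
    answer, conf = ask("small-lm", question)
    if conf >= THRESHOLD:
        return answer  # cheap path: small model is confident enough
    answer, _ = ask("large-lm", question)
    return answer  # escalate: small model was uncertain

print(cascade("Score this essay response."))
```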