pith. machine review for the scientific record.

arxiv: 2205.14334 · v2 · submitted 2022-05-28 · 💻 cs.CL · cs.AI · cs.LG

Recognition: 1 theorem link

Teaching Models to Express Their Uncertainty in Words

Authors on Pith: no claims yet

Pith reviewed 2026-05-16 17:31 UTC · model grok-4.3

classification 💻 cs.CL · cs.AI · cs.LG
keywords uncertainty estimation · language models · calibration · verbalized probability · GPT-3 · epistemic uncertainty · natural language generation

The pith

GPT-3 can learn to state its own uncertainty in natural language, and those statements map to well-calibrated probabilities.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that a GPT-3 model can be trained to generate an answer to a question together with a verbal confidence level such as '90% confidence' or 'high confidence'. These verbal expressions convert directly into probabilities that match the model's actual rate of correctness. The approach requires no access to the model's internal logits and maintains moderate calibration when questions come from a shifted distribution. The model also bases its verbal uncertainty on its own knowledge gaps rather than simply copying example patterns from the prompt.

Core claim

A GPT-3 model can be taught to output both an answer and a natural-language expression of uncertainty about that answer, and the expressed levels correspond to probabilities that are well calibrated on the model's actual performance. Calibration generalizes to new distributions, and the behavior arises from pre-trained latent representations that track epistemic uncertainty rather than from surface-level imitation of human examples.

What carries the argument

Verbalized probability: the model generates an answer plus a confidence phrase in words, which is then interpreted as a numerical probability for calibration checks.
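As a concrete sketch of that interpretation step (the regex and the numeric anchors for the coarse labels are illustrative assumptions, not the paper's published scheme):

```python
import re

# Assumed anchor probabilities for coarse verbal labels; the paper's
# exact mapping is not reproduced here.
PHRASE_TO_PROB = {
    "lowest": 0.05, "low": 0.25, "medium": 0.5, "high": 0.75, "highest": 0.95,
}

def parse_confidence(text: str) -> float | None:
    """Turn a verbalized confidence expression into a probability."""
    # Explicit percentage, e.g. "90% confidence".
    match = re.search(r"(\d{1,3})\s*%", text)
    if match:
        return min(int(match.group(1)), 100) / 100.0
    # Coarse verbal label, e.g. "high confidence"; try longer phrases
    # first so "highest" is not shadowed by "high".
    for phrase in sorted(PHRASE_TO_PROB, key=len, reverse=True):
        if phrase in text.lower():
            return PHRASE_TO_PROB[phrase]
    return None  # no recognizable confidence expression

assert parse_confidence("Answer: 42. 90% confidence") == 0.9
assert parse_confidence("Answer: 42. High confidence.") == 0.75
```

Calibration checks then compare these parsed probabilities against the answers' actual correctness, with no access to logits required.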

If this is right

  • Models can communicate uncertainty to users in readable language without exposing internal logits.
  • Calibration of uncertainty can be achieved and tested entirely through generated text.
  • Verbalized uncertainty remains usable under moderate distribution shifts.
  • The method reveals that pre-trained representations already encode information about the model's own knowledge limits.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Systems that must operate without logit access could still report reliable uncertainty to downstream users.
  • Combining verbalized statements with other uncertainty signals might improve overall calibration further (a toy pooling sketch follows this list).
  • The same training approach could be tested on models that have not been instruction-tuned to see how much pre-training alone contributes.
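On the second extension, a toy sketch of one thing "combining signals" could mean; linear pooling with a fixed weight is an assumption for illustration, not something the paper proposes:

```python
def pooled_confidence(verbal_p: float, logit_p: float, w: float = 0.5) -> float:
    """Linear opinion pool of a verbalized probability and a
    logit-derived one; w would need tuning on held-out data."""
    return w * verbal_p + (1.0 - w) * logit_p

print(pooled_confidence(0.9, 0.6))  # 0.75
```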

Load-bearing premise

The verbal confidence phrases actually track the model's real epistemic uncertainty rather than merely imitating patterns seen in training examples or prompts.

What would settle it

Testing whether the model's stated confidence levels match its observed accuracy on a fresh set of questions drawn from a domain where the model has no relevant pre-training knowledge: a systematic mismatch would expose surface imitation, while sustained calibration would support genuine uncertainty tracking.

read the original abstract

We show that a GPT-3 model can learn to express uncertainty about its own answers in natural language -- without use of model logits. When given a question, the model generates both an answer and a level of confidence (e.g. "90% confidence" or "high confidence"). These levels map to probabilities that are well calibrated. The model also remains moderately calibrated under distribution shift, and is sensitive to uncertainty in its own answers, rather than imitating human examples. To our knowledge, this is the first time a model has been shown to express calibrated uncertainty about its own answers in natural language. For testing calibration, we introduce the CalibratedMath suite of tasks. We compare the calibration of uncertainty expressed in words ("verbalized probability") to uncertainty extracted from model logits. Both kinds of uncertainty are capable of generalizing calibration under distribution shift. We also provide evidence that GPT-3's ability to generalize calibration depends on pre-trained latent representations that correlate with epistemic uncertainty over its answers.
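The abstract contrasts two routes to a numeric confidence: parsing the generated text (as sketched above) and reading probabilities off the model's logits. A minimal sketch of the logit route, with made-up per-token log-probabilities; the paper's exact extraction procedure is not restated here:

```python
import math

def logit_confidence(answer_token_logprobs: list[float]) -> float:
    """Probability the model assigned to its own answer string:
    the product of the per-token probabilities. Length-normalized
    variants are also common; which one the paper uses is a detail
    left to the original text."""
    return math.exp(sum(answer_token_logprobs))

# Hypothetical log-probs for a three-token answer.
print(logit_confidence([-0.1, -0.05, -0.2]))  # ~0.705
```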

Editorial analysis

A structured set of objections, weighed in public.

Referee report, simulated author's rebuttal, circularity audit, and an axiom and free-parameter ledger. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that a GPT-3 model can be prompted or trained to generate answers together with natural-language expressions of uncertainty (e.g., '90% confidence' or 'high confidence') whose implied probabilities are well calibrated on a new CalibratedMath suite of math tasks. The verbalized confidences remain moderately calibrated under distribution shift, outperform or match logit-based baselines in generalization, and are argued to arise from pre-trained latent representations that track epistemic uncertainty rather than from surface-level imitation of human examples.

Significance. If the central empirical claims are substantiated, the result is significant: it provides the first demonstration that a large language model can communicate calibrated uncertainty about its own answers in human-readable natural language without access to logits. The CalibratedMath benchmark is a useful addition for studying calibration on arithmetic reasoning, and the evidence that generalization depends on pre-trained representations (rather than prompt artifacts) strengthens the interpretation that the model accesses internal uncertainty signals.

major comments (2)
  1. [Abstract] Abstract and §4 (results on sensitivity to uncertainty): the central claim that verbalized phrases reflect the model's actual epistemic uncertainty rather than surface-level pattern matching on question features (e.g., number magnitude, operator type, digit length) is load-bearing but unsupported by any reported ablation or incremental analysis. No result quantifies the predictive power of latent states over question-only baselines at the moment the confidence phrase is generated.
  2. [§3] §3 (methods) and §5 (evaluation): exact prompting templates, fine-tuning hyperparameters, statistical tests for calibration (e.g., ECE computation details), and data-exclusion rules are not specified at a level that permits reproduction or rules out confounds, leaving the reported calibration numbers and distribution-shift results only moderately supported.
minor comments (2)
  1. [Table 1] Table 1 and Figure 2: axis labels and legend entries should explicitly state whether the plotted probabilities are derived from verbalized phrases or from logits to avoid reader confusion.
  2. [Related Work] Related-work section: add citations to recent work on verbalized uncertainty in smaller models (e.g., post-2021 papers on confidence elicitation) for completeness.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments, which have helped us identify areas for improvement. We address each major point below and will revise the manuscript accordingly to strengthen the presentation of our results.

read point-by-point responses
  1. Referee: [Abstract] Abstract and §4 (results on sensitivity to uncertainty): the central claim that verbalized phrases reflect the model's actual epistemic uncertainty rather than surface-level pattern matching on question features (e.g., number magnitude, operator type, digit length) is load-bearing but unsupported by any reported ablation or incremental analysis. No result quantifies the predictive power of latent states over question-only baselines at the moment the confidence phrase is generated.

    Authors: We appreciate this observation. Our distribution-shift experiments and sensitivity analyses in §4 already indicate that performance is not driven solely by surface features, as calibration holds when question characteristics (e.g., magnitude, operators) change substantially. Nevertheless, we agree that an explicit ablation quantifying the incremental predictive value of latent states over question-only baselines would provide stronger support. We will add this analysis, including a direct comparison of calibration when conditioning only on question features versus internal representations at the point of confidence generation. revision: yes

  2. Referee: [§3] §3 (methods) and §5 (evaluation): exact prompting templates, fine-tuning hyperparameters, statistical tests for calibration (e.g., ECE computation details), and data-exclusion rules are not specified at a level that permits reproduction or rules out confounds, leaving the reported calibration numbers and distribution-shift results only moderately supported.

    Authors: We agree that additional methodological detail is necessary for full reproducibility. In the revised manuscript we will include the complete prompting templates, all fine-tuning hyperparameters, the precise procedure for computing ECE (including binning, any statistical significance tests, and confidence intervals), and explicit data-exclusion criteria. These additions will allow independent replication and help rule out potential confounds. revision: yes
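For reference, a minimal version of the binned ECE the response promises to document. Equal-width bins and bin-size weighting are the standard construction; the paper's exact choices (bin count, binning scheme, any significance tests) are among the details to be specified:

```python
import numpy as np

def expected_calibration_error(confidences, corrects, n_bins: int = 10) -> float:
    """Binned ECE: average |accuracy - mean confidence| per bin,
    weighted by the fraction of predictions landing in each bin."""
    confidences = np.asarray(confidences, dtype=float)
    corrects = np.asarray(corrects, dtype=float)
    # Assign each prediction to an equal-width confidence bin.
    bin_ids = np.minimum((confidences * n_bins).astype(int), n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        if mask.any():
            gap = abs(corrects[mask].mean() - confidences[mask].mean())
            ece += mask.mean() * gap
    return ece

# Toy usage with fabricated numbers: four answers with parsed confidences.
print(expected_calibration_error([0.9, 0.9, 0.5, 0.1], [1, 0, 1, 0]))  # 0.35
```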

Circularity Check

0 steps flagged

No significant circularity: empirical calibration results rely on held-out evaluation rather than definitional reduction

full rationale

The paper's central result is an empirical demonstration that GPT-3, after exposure to CalibratedMath examples, generates verbalized confidence phrases whose implied probabilities are calibrated on held-out tasks and under distribution shift. Calibration is measured against ground-truth correctness on those tasks, not against any quantity defined from the training fit itself. No equation or procedure in the described chain equates the generated confidence level to a fitted parameter or renames a training statistic as a prediction. Self-citations (if present) are not invoked as uniqueness theorems that forbid alternatives; the claim of sensitivity to pre-trained latent representations is supported by generalization experiments rather than by construction. The derivation chain therefore remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on empirical calibration metrics and the assumption that pre-trained representations encode epistemic uncertainty usable for verbal output.

axioms (1)
  • domain assumption GPT-3 possesses pre-trained latent representations that correlate with epistemic uncertainty over its answers
    The paper states this correlation explains the observed generalization of calibration under distribution shift.

pith-pipeline@v0.9.0 · 5465 in / 1156 out tokens · 30108 ms · 2026-05-16T17:31:22.258256+00:00 · methodology

discussion (0)


Forward citations

Cited by 17 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Agent-BRACE: Decoupling Beliefs from Actions in Long-Horizon Tasks via Verbalized State Uncertainty

    cs.CL 2026-05 unverdicted novelty 8.0

    Agent-BRACE improves LLM agent performance on long-horizon partially observable tasks by 5.3-14.5% through a decoupled belief state of verbalized atomic claims with certainty labels that keeps context length constant.

  2. Walking Through Uncertainty: An Empirical Study of Uncertainty Estimation for Audio-Aware Large Language Models

    eess.AS 2026-04 unverdicted novelty 7.0

    Semantic-level and verification-based uncertainty methods outperform token-level baselines for audio reasoning in ALLMs, but their relative performance on hallucination and unanswerable-question benchmarks is model- a...

  3. SciPredict: Can LLMs Predict the Outcomes of Scientific Experiments in Natural Sciences?

    cs.AI 2026-04 unverdicted novelty 7.0

    LLMs predict outcomes of real scientific experiments at 14-26% accuracy, comparable to human experts, but lack calibration on prediction reliability while humans demonstrate strong calibration.

  4. BAS: A Decision-Theoretic Approach to Evaluating Large Language Model Confidence

    cs.CL 2026-04 unverdicted novelty 7.0

    BAS aggregates utility from an answer-or-abstain model across risk thresholds and is uniquely maximized by truthful confidence estimates.

  5. Model-Adaptive Tool Necessity Reveals the Knowing-Doing Gap in LLM Tool Use

    cs.AI 2026-05 unverdicted novelty 6.0

    LLMs show a knowing-doing gap in tool use: they often recognize when tools are needed via internal states but fail to translate that into actual tool calls, with mismatches of 26-54% on arithmetic and factual tasks.

  6. Internalizing Curriculum Judgment for LLM Reinforcement Fine-Tuning

    cs.LG 2026-05 unverdicted novelty 6.0

    METIS internalizes curriculum judgment in LLM reinforcement fine-tuning by predicting within-prompt reward variance via in-context learning and jointly optimizing with a self-judgment reward, yielding superior perform...

  7. Measuring Black-Box Confidence via Reasoning Trajectories: Geometry, Coverage, and Verbalization

    cs.AI 2026-05 unverdicted novelty 6.0

    Trajectory geometry in embedding space fused with coverage and verbalization yields better black-box CoT confidence estimation than self-consistency at lower sample counts across six benchmark-reasoner pairs.

  8. What Single-Prompt Accuracy Misses: A Multi-Variant Reliability Audit of Language Models

    cs.CL 2026-05 unverdicted novelty 6.0

    Multi-variant testing reveals that prompt design and evaluator choices can change apparent model reliability by large margins, with verbal confidence often overstated and robustness uncorrelated with size.

  9. Spatiotemporal Hidden-State Dynamics as a Signature of Internal Reasoning in Large Language Models

    cs.CL 2026-05 unverdicted novelty 6.0

    Large reasoning models show measurable hidden-state dynamics that a new statistic can use to distinguish correct reasoning trajectories without labels.

  10. Hallucinations Undermine Trust; Metacognition is a Way Forward

    cs.CL 2026-05 unverdicted novelty 6.0

    LLMs need metacognition to align expressed uncertainty with their actual knowledge boundaries, moving beyond knowledge expansion to reduce confident errors.

  11. Complementing Self-Consistency with Cross-Model Disagreement for Uncertainty Quantification

    cs.AI 2026-04 unverdicted novelty 6.0

    Cross-model semantic disagreement adds an epistemic uncertainty term that improves total uncertainty estimation over self-consistency alone, helping flag confident errors in LLMs.

  12. Towards an AI co-scientist

    cs.AI 2025-02 unverdicted novelty 6.0

    A multi-agent AI system generates novel biomedical hypotheses that show promising experimental validation in drug repurposing for leukemia, new targets for liver fibrosis, and a bacterial gene transfer mechanism.

  13. Measuring short-form factuality in large language models

    cs.CL 2024-11 unverdicted novelty 6.0

    SimpleQA is a new benchmark of short, single-answer factual questions collected adversarially against GPT-4 to evaluate LLM factuality and confidence calibration.

  14. Aligning Large Multimodal Models with Factually Augmented RLHF

    cs.CV 2023-09 conditional novelty 6.0

    Factually Augmented RLHF aligns large multimodal models to reduce hallucinations, reaching 94% of GPT-4 on LLaVA-Bench and 60% improvement on the new MMHAL-BENCH.

  15. Semantic Uncertainty: Linguistic Invariances for Uncertainty Estimation in Natural Language Generation

    cs.CL 2023-02 unverdicted novelty 6.0

    Semantic entropy improves uncertainty estimation in natural language generation by incorporating semantic equivalences, outperforming standard entropy baselines on predicting model accuracy for question answering.

  16. Language Models (Mostly) Know What They Know

    cs.CL 2022-07 unverdicted novelty 6.0

    Language models show good calibration when asked to estimate the probability that their own answers are correct, with performance improving as models get larger.

  17. Towards Trustworthy Report Generation: A Deep Research Agent with Progressive Confidence Estimation and Calibration

    cs.AI 2026-04 unverdicted novelty 4.0

    A deep research agent incorporates progressive confidence estimation and calibration to produce trustworthy reports with transparent confidence scores on claims.

Reference graph

Works this paper leans on

24 extracted references · 24 canonical work pages · cited by 17 Pith papers · 6 internal anchors
