How do LLMs Compute Verbal Confidence?
Pith reviewed 2026-05-21 10:34 UTC · model grok-4.3
The pith
LLMs automatically compute verbal confidence during answer generation and cache it at the first post-answer position for later retrieval.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Verbal confidence arises from representations that emerge at answer-adjacent positions before they appear at the verbalization site. Information flows via attention from the answer tokens to a cache at the first post-answer position, and from there to the output. These cached representations explain substantial variance in verbal confidence beyond token log-probabilities, which indicates an automatic, sophisticated self-evaluation of answer quality rather than post-hoc reconstruction.
What carries the argument
The cached confidence representation at the first post-answer position, which aggregates information from answer tokens and is later retrieved for verbalization.
If this is right
- Verbal confidence can be directly influenced by intervening at the post-answer cache position.
- Models perform an automatic evaluation of answer quality that is not reducible to simple fluency or token-probability measures.
- Information flow for confidence follows a specific path from answer tokens through the cache to verbal output.
- Calibration improvements could target these internal cached states rather than output prompting alone.
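The first implication can be made concrete with a small sketch. The paper excerpt quoted further down this page says steering vectors were built by contrasting high- and low-confidence trials; the code below shows one common way to do that (a difference of mean activations) and then adds the scaled vector at a single position. The function names, the toy activations, the position index, and the scale `alpha` are all assumptions for illustration, not the authors' implementation.

```python
from statistics import fmean

def mean_vector(vectors):
    """Element-wise mean of a list of equal-length activation vectors."""
    return [fmean(column) for column in zip(*vectors)]

def steering_vector(high_conf_acts, low_conf_acts):
    """Difference of means: high-confidence minus low-confidence activations."""
    high = mean_vector(high_conf_acts)
    low = mean_vector(low_conf_acts)
    return [h - l for h, l in zip(high, low)]

def steer(hidden_states, position, vector, alpha):
    """Return a copy of hidden_states with alpha * vector added at one position."""
    steered = [list(h) for h in hidden_states]
    steered[position] = [x + alpha * v for x, v in zip(steered[position], vector)]
    return steered

# Toy 3-dimensional activations (hypothetical values, not from the paper).
high = [[1.0, 0.0, 2.0], [3.0, 0.0, 2.0]]
low = [[0.0, 1.0, 2.0], [0.0, 3.0, 2.0]]
vec = steering_vector(high, low)

# A sequence of 4 positions; index 2 stands in for the first post-answer position.
states = [[0.0, 0.0, 0.0] for _ in range(4)]
steered = steer(states, position=2, vector=vec, alpha=0.5)
```

In a real model the same addition would be applied to the residual stream at a chosen layer during the forward pass; the sketch only shows the arithmetic.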
Where Pith is reading between the lines
- Similar automatic caching may occur for other internal self-assessments beyond confidence.
- Direct access to the post-answer cache could provide uncertainty estimates without requiring explicit verbalization prompts.
- Testing whether the same position serves as a cache in non-transformer architectures would clarify generality.
Load-bearing premise
The activation steering, patching, noising, and attention blocking interventions reveal the model's natural confidence computation without introducing artifacts or altering representations in ways that differ from normal forward passes.
What would settle it
Finding that verbal confidence output remains unchanged when the cached representations at the first post-answer position are disrupted under otherwise normal generation conditions.
Original abstract
Verbal confidence -- prompting LLMs to state their confidence as a number or category -- is widely used to extract uncertainty estimates from black-box models. However, how LLMs internally generate such scores remains unknown. We address two questions: first, when confidence is computed -- just-in-time when requested, or automatically during answer generation and cached for later retrieval; and second, what verbal confidence represents -- token log-probabilities, or a richer evaluation of answer quality? Focusing on Gemma 3 27B (across TriviaQA, BigMath, and MMLU), Qwen 2.5 7B, and the reasoning model Magistral Small 24B, we provide convergent evidence for cached retrieval. Activation steering, patching, noising, and swap experiments reveal that confidence representations emerge at answer-adjacent positions before appearing at the verbalization site. Attention blocking pinpoints the information flow: confidence is gathered from answer tokens, cached at the first post-answer position, then retrieved for output. Critically, linear probing and variance partitioning reveal that these cached representations explain substantial variance in verbal confidence beyond token log-probabilities, suggesting a richer answer-quality evaluation rather than a simple fluency readout. These findings demonstrate that verbal confidence reflects automatic, sophisticated self-evaluation -- not post-hoc reconstruction -- with implications for understanding metacognition in LLMs and improving calibration.
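As a rough illustration of the attention-blocking idea in the abstract, the sketch below builds a causal attention mask and disables the direct edges from the confidence-producing position to the answer tokens, while leaving the edge to the first post-answer position open. The token layout, position indices, and helper names are hypothetical; this page does not specify which edges, layers, or heads the authors blocked.

```python
def causal_mask(n):
    """mask[q][k] is True when query position q may attend to key position k."""
    return [[k <= q for k in range(n)] for q in range(n)]

def block_attention(mask, query_positions, key_positions):
    """Return a copy of mask with the given query -> key edges disabled."""
    blocked = [row[:] for row in mask]
    for q in query_positions:
        for k in key_positions:
            blocked[q][k] = False
    return blocked

# Hypothetical layout: [question x3][answer x2][post-answer][prompt x2][confidence]
answer_positions = [3, 4]
cache_position = 5        # first post-answer position
confidence_position = 8   # where the confidence token is produced

mask = causal_mask(9)
# Cut the direct route from the answer tokens to the verbalization site,
# leaving the route through the post-answer position intact.
blocked = block_attention(mask, [confidence_position], answer_positions)
```

If confidence output survives this block but degrades when the edge to `cache_position` is also cut, that pattern would be consistent with the cached-retrieval account described above.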
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper examines how LLMs generate verbal confidence scores. It claims these scores arise from automatic computation and caching during answer generation rather than just-in-time reconstruction, with representations emerging at answer-adjacent positions (specifically the first post-answer token) before retrieval at verbalization. Experiments on Gemma 3 27B, Qwen 2.5 7B, and Magistral Small 24B across TriviaQA, BigMath, and MMLU use activation steering, patching, noising, swap experiments, and attention blocking to trace information flow from answer tokens to the cache site. Linear probing and variance partitioning further indicate that the cached representations capture richer answer-quality information beyond token log-probabilities.
Significance. If the central claims hold, the work advances understanding of metacognition and internal self-evaluation in LLMs by showing that verbal confidence reflects sophisticated, automatic processes rather than post-hoc readout. The convergent evidence across multiple intervention techniques, models, and datasets, combined with the demonstration of additional explanatory power beyond log-probabilities, provides a mechanistic basis for improving uncertainty estimation and calibration. Strengths include the use of causal interventions alongside correlational probing and the focus on falsifiable predictions about information flow timing.
major comments (2)
- [Methods (activation steering and attention blocking experiments)] Methods section on activation steering, patching, noising, and attention blocking: The evidence for automatic caching at the first post-answer position and subsequent retrieval rests entirely on these interventions. However, the manuscript does not demonstrate that the observed information flow or representations occur during unperturbed forward passes; interventions can induce or amplify patterns absent in normal generation. This is load-bearing for the claim that the model performs this caching by default rather than only under experimental conditions.
- [Results (variance partitioning and probing)] Results on linear probing and variance partitioning: The claim that cached representations explain 'substantial variance' beyond token log-probabilities requires explicit reporting of effect sizes, cross-validation details, and controls for overlap between probed features and log-probability baselines. Without these, it is unclear whether the richer-evaluation interpretation is supported or whether the additional variance is marginal or artifactual.
minor comments (2)
- [Abstract] Abstract: The term 'swap experiments' is mentioned but not defined or referenced in the main text summary; ensure all experimental variants are consistently described and cited.
- [Figures] Figure clarity: Attention maps or probing accuracy plots should include explicit legends, error bars, and statistical significance markers to aid interpretation of the information-flow claims.
Simulated Author's Rebuttal
We thank the referee for the constructive comments on methodological rigor and reporting standards. We address each major point below, providing additional analyses and clarifications while preserving the core claims supported by our convergent evidence.
- Referee: [Methods (activation steering and attention blocking experiments)] Methods section on activation steering, patching, noising, and attention blocking: The evidence for automatic caching at the first post-answer position and subsequent retrieval rests entirely on these interventions. However, the manuscript does not demonstrate that the observed information flow or representations occur during unperturbed forward passes; interventions can induce or amplify patterns absent in normal generation. This is load-bearing for the claim that the model performs this caching by default rather than only under experimental conditions.
Authors: We agree that establishing the presence of the representations in unperturbed forward passes is essential. While our linear probing and variance partitioning analyses are performed on activations from standard, unperturbed generations (as described in the Results section), we have added a new subsection in the revised Methods and Results that directly examines hidden-state correlations at the first post-answer position during normal forward passes without any steering, patching, or noising. These baseline analyses show that the post-answer representations already encode information predictive of verbal confidence prior to any intervention, with the causal experiments then used to establish necessity and directionality of the flow. This combination supports that caching occurs by default. revision: yes
- Referee: [Results (variance partitioning and probing)] Results on linear probing and variance partitioning: The claim that cached representations explain 'substantial variance' beyond token log-probabilities requires explicit reporting of effect sizes, cross-validation details, and controls for overlap between probed features and log-probability baselines. Without these, it is unclear whether the richer-evaluation interpretation is supported or whether the additional variance is marginal or artifactual.
Authors: We appreciate the request for greater transparency. In the revised manuscript we now report incremental R² values from the hierarchical variance partitioning (showing 12–28% additional variance explained across models and datasets after accounting for log-probabilities), specify that all probes use 5-fold cross-validation with held-out test sets, and include an orthogonalization control in which log-probability features are residualized from the cached representations before probing. These additions confirm that the probed representations capture explanatory power beyond token-level probabilities, consistent with a richer answer-quality evaluation. revision: yes
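The incremental R² mentioned in this response can be illustrated with a short sketch: fit verbal confidence on log-probability alone, then on log-probability plus a probe-derived feature, and take the difference. Everything here is an assumption for illustration: the data are synthetic, the fit is in-sample ordinary least squares rather than the 5-fold cross-validated procedure the authors describe, and the helper names are invented.

```python
def solve(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def r_squared(features, y):
    """In-sample R^2 of an ordinary least squares fit with an intercept."""
    rows = [[1.0] + list(f) for f in features]
    p = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * t for r, t in zip(rows, y)) for i in range(p)]
    beta = solve(xtx, xty)
    pred = [sum(b * x for b, x in zip(beta, r)) for r in rows]
    mean_y = sum(y) / len(y)
    ss_res = sum((t - q) ** 2 for t, q in zip(y, pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y)
    return 1.0 - ss_res / ss_tot

# Synthetic data (hypothetical): confidence depends on both predictors.
logprob = [-2.0, -1.5, -1.0, -0.5, -0.2, -0.1]
probe = [0.1, 0.9, 0.2, 0.8, 0.3, 0.7]
confidence = [0.5 * lp + 1.0 * pr for lp, pr in zip(logprob, probe)]

r2_logprob = r_squared([[lp] for lp in logprob], confidence)
r2_full = r_squared([[lp, pr] for lp, pr in zip(logprob, probe)], confidence)
incremental_r2 = r2_full - r2_logprob  # variance explained beyond log-probability
```

Because the baseline model is fitted first, any variance shared between the two predictors is credited to log-probability, so `incremental_r2` is a conservative estimate of what the probe feature adds.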
Circularity Check
No circularity: claims rest on empirical interventions and external baselines
The paper's central claims—that verbal confidence representations are cached at the first post-answer position and explain variance beyond token log-probabilities—are supported by activation steering, patching, noising, attention blocking, linear probing, and variance partitioning experiments across multiple models and datasets. These methods compare internal activations to observed verbal outputs and to independent log-probability baselines rather than defining the target quantity in terms of itself or fitting parameters that are then relabeled as predictions. No equations, uniqueness theorems, or ansatzes are invoked that reduce the result to the input by construction. Self-citations, if present, are not load-bearing for the core mechanism; the evidence is externally falsifiable via the described interventions.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: Activation steering and patching interventions isolate causal contributions to confidence without creating spurious representations absent in normal inference.
- domain assumption: Linear probes and variance partitioning accurately measure the information content of internal activations relevant to verbal confidence.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean: washburn_uniqueness_aczel (unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Passage: Activation steering, patching, noising, and swap experiments reveal that confidence representations emerge at answer-adjacent positions
- IndisputableMonolith/Foundation/ArithmeticFromLogic.lean: absolute_floor_iff_bare_distinguishability (unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Passage: linear probing and variance partitioning reveal that these cached representations explain substantial variance beyond token log-probabilities
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 6 Pith papers
- The Hidden Signal of Verifier Strictness: Controlling and Improving Step-Wise Verification via Selective Latent Steering
  VerifySteer selectively steers hidden states at paragraph boundaries using latent correctness signals to control verifier strictness and outperform baselines on ProcessBench and Hard2Verify with lower compute.
- Internalizing Curriculum Judgment for LLM Reinforcement Fine-Tuning
  METIS internalizes curriculum judgment in LLM reinforcement fine-tuning by predicting within-prompt reward variance via in-context learning and jointly optimizing with a self-judgment reward, yielding superior perform...
- Hypothesis generation and updating in large language models
  LLMs exhibit Bayesian-like hypothesis updating with strong-sampling bias and an evaluation-generation gap but generalize poorly outside observed data.
- How LLMs Detect and Correct Their Own Errors: The Role of Internal Confidence Signals
  LLMs implement a second-order confidence architecture where the PANL activation encodes both error likelihood and the ability to correct it, beyond verbal confidence or log-probabilities.
- Verbal Confidence Saturation in 3-9B Open-Weight Instruction-Tuned LLMs: A Pre-Registered Psychometric Validity Screen
  Seven 3-9B instruction-tuned LLMs produce verbal confidence that saturates at high values and fails psychometric validity criteria for Type-2 discrimination under minimal elicitation.
- Distilling Self-Consistency into Verbal Confidence: A Pre-Registered Negative Result and Post-Hoc Rescue on Gemma 3 4B
  Fine-tuning Gemma 3 4B on unfiltered self-consistency targets produces a binary verbal correctness discriminator with AUROC 0.774 on TriviaQA, outperforming logit entropy after a modal-filtered pre-registration failed.
Reference graph
Works this paper leans on
- [1]
- [2] Anthropic Research Report. Azaria, A. and Mitchell, T. The internal state of an LLM knows when it's lying. arXiv preprint arXiv:2304.13734.
- [3] Burns, C., Ye, H., Klein, D., and Steinhardt, J. Discovering latent knowledge in language models without supervision. arXiv preprint arXiv:2212.03827.
- [4] Devic, S., Peale, C., Bradley, A., Williamson, S., Nakkiran, P., and Gollakota, A. Trace length is a simple uncertainty signal in reasoning models. arXiv preprint arXiv:2510.10409.
- [5] Geng, J., Cai, F., Wang, Y., Koeppl, H., Nakov, P., and Gurevych, I. A survey of language model confidence estimation and calibration. arXiv preprint arXiv:2311.08298.
- [6] Geva, M., Bastings, J., Filippova, K., and Globerson, A. Dissecting recall of factual associations in auto-regressive language models. arXiv preprint arXiv:2304.14767.
- [7] Heimersheim, S. and Nanda, N. How to use and interpret activation patching. arXiv preprint arXiv:2404.15255.
- [8] Hua, T. T., Qin, A., Marks, S., and Nanda, N. Steering evaluation-aware language models to act like they are deployed. arXiv preprint arXiv:2510.20487.
- [9] Joshi, M., Choi, E., Weld, D. S., and Zettlemoyer, L. TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension. arXiv preprint arXiv:1705.03551.
- [10] Kadavath, S., Conerly, T., Askell, A., Henighan, T., Drain, D., Perez, E., Schiefer, N., Hatfield-Dodds, Z., DasSarma, N., Tran-Johnson, E., et al. Language models (mostly) know what they know. arXiv preprint arXiv:2207.05221.
- [11] URL https://transformer-circuits.pub/2025/attribution-graphs/biology.html. Liu, J., Jain, J., Diab, M., and Subramani, N. LLM microscope: What model internals reveal about answer correctness and context utilization. arXiv preprint arXiv:2510.04013.
- [12] Mei, Z., Zhang, C., Yin, T., Lidard, J., Shorinwa, O., and Majumdar, A. Reasoning about uncertainty: Do reasoning models know when they don't know? arXiv preprint arXiv:2506.18183.
- [13] Panickssery, N., Gabrieli, N., Schulz, J., Tong, M., Hubinger, E., and Turner, A. M. Steering Llama 2 via contrastive activation addition. arXiv preprint arXiv:2312.06681.
- [14] Rai, D., Zhou, Y., Feng, S., Saparov, A., and Yao, Z. A practical review of mechanistic interpretability for transformer-based language models. arXiv preprint arXiv:2407.02646.
- [15] Steyvers, M., Belem, C., and Smyth, P. Improving metacognition and uncertainty communication in language models. arXiv preprint arXiv:2510.05126, 2025a. Steyvers, M., Tejeda, H., Kumar, A., Belem, C., Karny, S., Hu, X., Mayer, L. W., and Smyth, P. What large language models know and what people think they know. Nature Machine Intelligence, pp. 1–11, 202...
- [16]
- [17] Turner, A. M., Thiergart, L., Leech, G., Udell, D., Vazquez, J. J., Mini, U., and MacDiarmid, M. Steering language models with activation engineering. arXiv preprint arXiv:2308.10248.
- [18] Varshney, N., Yao, W., Zhang, H., Chen, J., and Yu, D. A stitch in time saves nine: Detecting and mitigating hallucinations of LLMs by validating low-confidence generation. arXiv preprint arXiv:2307.03987.
- [19] Venhoff, C., Arcuschin, I., Torr, P., Conmy, A., and Nanda, N. Base models know how to reason, thinking models learn when. arXiv preprint arXiv:2510.07364.
- [20] Xiong, M., Hu, Z., Lu, X., Li, Y., Fu, J., He, J., and Hooi, B. Can LLMs express their uncertainty? An empirical evaluation of confidence elicitation in LLMs. arXiv preprint arXiv:2306.13063.
- [21] Yoon, D., Kim, S., Yang, S., Kim, S., Kim, S., Kim, Y., Choi, E., Kim, Y., and Seo, M. Reasoning models better express their confidence. arXiv preprint arXiv:2505.14489.
- [22] Zhang, F. and Nanda, N. Towards best practices of activation patching in language models: Metrics and methods. arXiv preprint arXiv:2309.16042.
- [23] Appendix Overview
  • Appendix A: Related Work (§A): Summary of related literature
  • Appendix B: Supplemental Figures (§B): Prompts, calibration plots, and additional experimental results
  • Appendix C: Supplemental Methods (§C)
    – C.1 Experiments with Categorical Confidence Prompt in Gemma 3 27B (§C.1)
      * C.1.1 Technical D...
- [24] were used, since we were focussed on understanding the generation of Gemma's raw verbal confidence signals. The model's performance was 77.4%; this was determined by having GPT4o-mini mark questions (B) Distribution of Gemma's confidence responses across the 10 classes. n = 7858 questions from the TriviaQA dataset (Joshi et al., 2017). ...
- [25] the TriviaQA dataset (Joshi et al., 2017). Figure 15. Calibration and Distribution of Categorical Confidence Ratings in Qwen 2.5 7B. (A) Calibration of Qwen: Expected Calibration Error (ECE) = 0.06, AUROC = 0.65. No procedures such as temperature scaling (Guo et al.,
- [26] We created high and low confidence vectors by contrasting high and low confidence trials (all trials that the model scored correctly), following standard procedures in activation steering (Turner et al., 2023; Stolfo et al., 2024a; Panickssery et al., 2023; Hua et al., 2025). Creation of high- and low-confidence steering vectors: We constructed steeri...
- [27] – 1/2 of these trials were randomly sampled from the top 3 confidence classes and 1/2 from the bottom 3 classes. C.1.4. Activation Patching. Corruption of Answer Tokens via Mean Ablation. To test whether specific position-layer combinations are sufficient for confidence computation, we use a corrupt-and-restore procedure following Meng et al. (2022); Heime...
- [28] C.1.5. Metrics Used in Patching and Other Experiments. Logit Difference. As a generalization of (Wang et al., 2023), we define logit difference as the logit of the original confidence class minus the mean logit of alternative confidence classes:
  $$\Delta_{\text{logit}} = z_{y^*} - \frac{1}{K-1} \sum_{k \neq y^*} z_k \tag{3}$$
  where $z_{y^*}$ is the logit of the clean trial's confidence class, $z_k$ are logits o...
- [29] and verbal confidence ratings from both Phase 0 (same run) and Phase 1 (different run with identical questions but answers provided in the prompt). Logprobs explained only 4.9% of variance in within-run verbal confidence ($r = 0.23$, $R^2_{\text{CV}} = 0.049$) and 8.4% in cross-run verbal confidence ($r = 0.29$, $R^2_{\text{CV}} = 0.084$). These low values confirm that verbal c...
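The logit-difference metric quoted in entry [28] above (the target confidence class's logit minus the mean logit of the other K − 1 classes) is simple enough to sketch directly. The function name and the example logits are hypothetical, and how the authors extract the per-class logits from the model is not shown on this page.

```python
def logit_difference(logits, target_class):
    """Target-class logit minus the mean logit of all other classes."""
    others = [z for k, z in enumerate(logits) if k != target_class]
    return logits[target_class] - sum(others) / len(others)

# Hypothetical logits over K = 4 confidence classes; class 2 is the clean-trial class.
logits = [1.0, 2.0, 5.0, 3.0]
delta = logit_difference(logits, target_class=2)  # 5.0 - (1.0 + 2.0 + 3.0) / 3
```

A positive value means the model still favours the clean trial's confidence class over the alternatives; a patching or noising intervention that drives it toward zero has removed that preference.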