How do LLMs Compute Verbal Confidence

Arthur Conmy; Dharshan Kumaran; Federico Barbero; Petar Veličković; Simon Osindero; Viorica Patraucean

arXiv: 2603.17839 · v3 · pith:K2YZJN2Unew · submitted 2026-03-18 · 💻 cs.CL · cs.AI · cs.LG


Pith reviewed 2026-05-21 10:34 UTC · model grok-4.3

classification 💻 cs.CL · cs.AI · cs.LG

keywords verbal confidence · LLM metacognition · activation patching · cached representations · answer quality evaluation · information flow · self-evaluation

The pith

LLMs automatically compute verbal confidence during answer generation and cache it at the first post-answer position for later retrieval.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper investigates whether large language models calculate verbal confidence on demand when prompted or generate and store it automatically while producing their answer. Experiments using activation steering, patching, noising, and attention blocking on models such as Gemma 3 27B, Qwen 2.5 7B, and Magistral Small 24B across TriviaQA, BigMath, and MMLU show that relevant representations first appear at answer-adjacent positions. These cached states are then retrieved when the model verbalizes its confidence. Linear probing and variance partitioning further indicate that the cached information captures a richer assessment of answer quality that goes beyond token log-probabilities.

Core claim

Verbal confidence arises from representations that emerge at answer-adjacent positions before the verbalization site, with information flowing from answer tokens to a cache at the first post-answer position via attention, and then to output; these cached representations explain substantial variance in verbal confidence beyond token log-probabilities, indicating an automatic and sophisticated self-evaluation of answer quality rather than post-hoc reconstruction.

What carries the argument

The cached confidence representation at the first post-answer position, which aggregates information from answer tokens and is later retrieved for verbalization.

If this is right

Verbal confidence can be directly influenced by intervening at the post-answer cache position.
Models perform automatic evaluation of answer quality that is independent of simple fluency or probability measures.
Information flow for confidence follows a specific path from answer tokens through the cache to verbal output.
Calibration improvements could target these internal cached states rather than output prompting alone.
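
The first implication can be made concrete with a difference-of-means steering sketch. This is a minimal NumPy illustration on synthetic activations, not the authors' code: the array shapes, the `panl_index` position, and the planted `confidence_direction` are assumptions made for the example, and the paper's exact vector construction and normalisation are not reproduced. The scale of 2 matches one of the two scales named in the Figure 2 caption.

```python
import numpy as np

def steering_vector(high_acts: np.ndarray, low_acts: np.ndarray) -> np.ndarray:
    """Mean activation of high-confidence trials minus that of low-confidence trials.

    Both inputs have shape (n_trials, d_model) and are assumed to come from
    the same layer and token position.
    """
    return high_acts.mean(axis=0) - low_acts.mean(axis=0)

def apply_steering(hidden: np.ndarray, position: int,
                   vector: np.ndarray, scale: float) -> np.ndarray:
    """Return a copy of hidden (seq_len, d_model) with scale * vector added at one position."""
    steered = hidden.copy()
    steered[position] += scale * vector
    return steered

rng = np.random.default_rng(0)
d_model = 16
confidence_direction = rng.normal(size=d_model)  # hypothetical planted direction

# Synthetic "high" and "low" confidence activations around the planted direction.
high = rng.normal(size=(50, d_model)) + confidence_direction
low = rng.normal(size=(50, d_model)) - confidence_direction
vec = steering_vector(high, low)

hidden = rng.normal(size=(10, d_model))  # one trial: (seq_len, d_model)
panl_index = 6                           # hypothetical post-answer-newline index
steered = apply_steering(hidden, panl_index, vec, scale=2.0)
```

In a real model, `hidden` would be an activation captured at one layer, and the steered state would be passed through the remaining layers before reading the confidence output.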

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar automatic caching may occur for other internal self-assessments beyond confidence.
Direct access to the post-answer cache could provide uncertainty estimates without requiring explicit verbalization prompts.
Testing whether the same position serves as a cache in non-transformer architectures would clarify generality.

Load-bearing premise

The activation steering, patching, noising, and attention blocking interventions reveal the model's natural confidence computation without introducing artifacts or altering representations in ways that differ from normal forward passes.

What would settle it

Finding that verbal confidence output remains unchanged when the cached representations at the first post-answer position are disrupted under otherwise normal generation conditions.
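
A disruption of this kind can be sketched as mean ablation, the noising operation named in the Figure 11 caption: the activation at one position is replaced by its across-trial mean, which removes trial-specific information at that position. The sketch uses synthetic arrays; `panl_index`, the shapes, and the linear `readout` are hypothetical stand-ins for a real model's later layers.

```python
import numpy as np

def mean_ablate(acts: np.ndarray, position: int) -> np.ndarray:
    """Replace the activation at `position` in every trial with the across-trial mean.

    acts has shape (n_trials, seq_len, d_model); other positions are left untouched.
    """
    ablated = acts.copy()
    ablated[:, position, :] = acts[:, position, :].mean(axis=0)
    return ablated

rng = np.random.default_rng(0)
n_trials, seq_len, d_model = 40, 8, 12
acts = rng.normal(size=(n_trials, seq_len, d_model))
panl_index = 5                        # hypothetical post-answer-newline index
readout = rng.normal(size=d_model)    # hypothetical linear confidence readout

ablated = mean_ablate(acts, panl_index)
before = acts[:, panl_index, :] @ readout     # varies from trial to trial
after = ablated[:, panl_index, :] @ readout   # identical for every trial
```

If a model's verbal confidence still tracked `before` after such an ablation, the cached-retrieval account would be in trouble; in this toy, the readout collapses to a constant by construction.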

Figures

Figures reproduced from arXiv: 2603.17839 by Arthur Conmy, Dharshan Kumaran, Federico Barbero, Petar Veličković, Simon Osindero, Viorica Patraucean.

**Figure 1.** Main Prompt and Illustration of our findings. We included the generated answer (example question shown) from a previous phase as part of the prompt for the confidence rating experiment (see §C.1.2). Since the Transformer’s forward pass is a function of previous tokens, providing the answer as context yields the exact same representation at the PANL as autoregressive generation. See §B for full prompt used.…

**Figure 2.** Results of Activation Steering in Gemma 3 27B. High (green lines) and low confidence (red lines) steering, at scales of 2 (solid line) and 5 (dotted line). Key positions: PANL (post-answer-newline) token, CC (confidence-colon) token. Control positions: PANL+1 (token immediately after PANL), FCC (first-confidence-colon) token (i.e. token preceding “$CLASS” in the prompt, following the confidence instruction…

**Figure 3.** Results of Activation Patching in High Confidence Trials: Confidence Class Prompt. Clean baseline shown in green; corrupt baseline shown in red (i.e. at near zero for logit difference and confidence, and near 100 for first token change rate). Patching of PANL representation resulted in partial recovery of logit difference, first token and confidence (upper, middle, lower panel respectively). Patching of CC…

**Figure 4.** Illustration of Activation Swap Experiment. Upper panel: High→Low (i.e. cross-confidence swap: high confidence recipient trial receives low confidence donor representation); the result is a lowering of confidence. Lower panel: High→High (i.e. high confidence recipient trial receives high confidence donor representation); in this same-same confidence swap, the result is no change in confidence. same-confide…

**Figure 6.** …, top and middle panels). Critically, variance partitioning revealed that activations at PANL and CC explain…

**Figure 7.**

**Figure 8.** Full categorical confidence class prompt in Experiment. We focused our analysis on the newline token following the model’s answer (given in a previous phase): the post-answer-newline (PANL) token, and the confidence-colon token (i.e. the last token of the prompt). In addition, we report analyses on the token immediately following the PANL token (i.e. PANL+1 token), the first-confidence-colon (FCC) (i.e.…

**Figure 9.** Calibration and Distribution of Confidence Classes in Gemma. (A) Calibration of Gemma: Expected Calibration Error (ECE) = 0.12, AUROC = 0.71. No procedures such as temperature scaling (Guo et al., 2017) were used, since we were focussed on understanding the generation of Gemma’s raw verbal confidence signals. The model’s performance was 77.4%; this was determined by having GPT4o-mini mark questions. (B) Dis…

**Figure 10.** Results of Activation Steering in Gemma at answer tokens (Confidence Class Prompt). High (green lines) and low confidence (red lines) steering, at scales of 2 (solid line) and 5 (dotted line). Error bars show SEM.

**Figure 11.** Results of Activation Noising across all trials. Mean ablation of representations of PANL or CC token at a single layer causes disruption of verbal confidence reporting as measured by decrease in logit difference, and change in first token outputted by model. See text for details.

**Figure 12.** Full numeric confidence prompt used in Experiment. This prompt is derived from Mei et al. (2025) and Devic et al. (2025).

**Figure 13.** Minimal numeric confidence prompt used in attention blocking experiments. This prompt elicits confidence on a 0–9 scale (single token output) and minimizes intermediate tokens between the post-answer-newline token (PANL, position 1) and the confidence-colon token (CC, position 2). These positions are analogous to those in the main categorical prompt (…

**Figure 14.** Calibration and Distribution of Numeric Confidence Scores in Gemma. (A) Calibration of Gemma: Expected Calibration Error (ECE) = 0.16, AUROC = 0.73. No procedures such as temperature scaling (Guo et al., 2017) were used, since we were focussed on understanding the generation of Gemma’s raw verbal confidence signals. (B) Distribution of Gemma’s numeric confidence responses binned into 10 bins. Questions (n…

**Figure 15.** Calibration and Distribution of Categorical Confidence Ratings in Qwen 2.5 7B. (A) Calibration of Qwen: Expected Calibration Error (ECE) = 0.06, AUROC = 0.65. No procedures such as temperature scaling (Guo et al., 2017) were used, since we were focussed on understanding the generation of Qwen’s raw verbal confidence signals. (B) Distribution of Qwen’s confidence responses.

**Figure 16.** Results of Activation Steering in Gemma with Numeric Confidence Prompt. High (green lines) and low confidence (red lines) steering, at scales of 2 (solid line) and 5 (dotted line). n = 124 trials per condition per layer. Positions correspond to the analogous locations in the categorical confidence prompt. Error bars show SEM.

**Figure 17.** Results of Activation Swap Experiment at CC and PANL+1 position (Confidence Class Prompt). Same-confidence swaps (H→H, L→L) control for cross-trial substitution effects; cross-confidence swaps (H→L, L→H) isolate confidence-specific transfer. See main text for details.

**Figure 18.** Results of Attention Blocking Experiment using Main Categorical Prompt. First token change rate induced by blocking different attentional pathways shown in upper panel. Lower panel: Change in logit difference between the first token and the mean of alternative confidence classes caused by attention blocking. PANL and PANL+1 denoted by NL and NL+1 for brevity. last A and A refer to last answer token and an…

**Figure 19.** Results for Gemma 3 27B and Qwen 2.5 7B on Activation Steering, Patching, Noising and Swap Experiments: Categorical Confidence and Numeric Confidence Prompt. Maximal effects observed at each position’s (PANL, PANL+1, CC) peak layer (layer index below bar). Qwen has 28 layers, Gemma 3 27B has 62 layers. Logit difference change (i.e. change from logit of clean run first token vs average of all other confide…
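
Several captions above report two metrics: logit difference and first token change rate. The sketch below is one plausible reading of them, assuming the logit difference is the clean run's confidence-class logit minus the mean logit of the other classes (as the Figure 18 and 19 captions describe) and that the change rate is expressed as a percentage (the Figure 3 caption places the corrupt baseline near 100). Function names and example numbers are illustrative, not taken from the paper.

```python
import numpy as np

def logit_difference(logits, clean_class: int) -> float:
    """Logit of the clean run's confidence class minus the mean logit of all other classes."""
    logits = np.asarray(logits, dtype=float)
    others = np.delete(logits, clean_class)
    return float(logits[clean_class] - others.mean())

def first_token_change_rate(clean_tokens, intervened_tokens) -> float:
    """Percentage of trials whose first output token differs from the clean run."""
    clean = np.asarray(clean_tokens)
    intervened = np.asarray(intervened_tokens)
    return float(100.0 * np.mean(clean != intervened))

# Four confidence classes; the clean run chose class 2.
clean_ld = logit_difference([1.0, 2.0, 6.0, 3.0], clean_class=2)    # 6 - mean(1, 2, 3) = 4.0
# After an intervention the same class is scored lower.
patched_ld = logit_difference([1.0, 2.0, 3.0, 3.0], clean_class=2)  # 3 - 2 = 1.0
ld_change = patched_ld - clean_ld                                   # -3.0

# Two of four trials changed their first token.
rate = first_token_change_rate([2, 2, 1, 0], [2, 1, 1, 3])          # 50.0
```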

Original abstract

Verbal confidence -- prompting LLMs to state their confidence as a number or category -- is widely used to extract uncertainty estimates from black-box models. However, how LLMs internally generate such scores remains unknown. We address two questions: first, when confidence is computed -- just-in-time when requested, or automatically during answer generation and cached for later retrieval; and second, what verbal confidence represents -- token log-probabilities, or a richer evaluation of answer quality? Focusing on Gemma 3 27B (across TriviaQA, BigMath, and MMLU), Qwen 2.5 7B, and the reasoning model Magistral Small 24B, we provide convergent evidence for cached retrieval. Activation steering, patching, noising, and swap experiments reveal that confidence representations emerge at answer-adjacent positions before appearing at the verbalization site. Attention blocking pinpoints the information flow: confidence is gathered from answer tokens, cached at the first post-answer position, then retrieved for output. Critically, linear probing and variance partitioning reveal that these cached representations explain substantial variance in verbal confidence beyond token log-probabilities, suggesting a richer answer-quality evaluation rather than a simple fluency readout. These findings demonstrate that verbal confidence reflects automatic, sophisticated self-evaluation -- not post-hoc reconstruction -- with implications for understanding metacognition in LLMs and improving calibration.
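
The attention-blocking logic in the abstract can be illustrated on a single toy attention head. The sketch below is an assumption-laden illustration rather than the authors' implementation: it uses random queries and keys, hypothetical token indices (`answer_positions`, `panl_index`, `cc_index`), and blocks the direct pathway from the confidence-colon position to the answer tokens by setting those attention scores to negative infinity before the softmax.

```python
import numpy as np

def attention_weights(q: np.ndarray, k: np.ndarray, blocked=()) -> np.ndarray:
    """Causal softmax attention weights of shape (seq_len, seq_len).

    `blocked` is an iterable of (query_pos, key_pos) pairs whose scores are
    masked out, so the query can no longer read from that key directly.
    Each query must keep at least one unblocked key (here, itself).
    """
    seq_len, d_head = q.shape
    scores = q @ k.T / np.sqrt(d_head)
    future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)         # causal mask
    for query_pos, key_pos in blocked:
        scores[query_pos, key_pos] = -np.inf           # blocked pathway
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    return weights / weights.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
seq_len, d_head = 6, 8
q = rng.normal(size=(seq_len, d_head))
k = rng.normal(size=(seq_len, d_head))

answer_positions = [2, 3]      # hypothetical answer-token indices
panl_index, cc_index = 4, 5    # hypothetical PANL and CC indices

baseline = attention_weights(q, k)
blocked = attention_weights(q, k, blocked=[(cc_index, a) for a in answer_positions])
```

With the direct pathway removed, any answer information reaching the CC position in this toy would have to pass through an intermediate position such as PANL, which is the kind of routing question the attention-blocking experiments are described as testing.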

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper examines how LLMs generate verbal confidence scores. It claims these scores arise from automatic computation and caching during answer generation rather than just-in-time reconstruction, with representations emerging at answer-adjacent positions (specifically the first post-answer token) before retrieval at verbalization. Experiments on Gemma 3 27B, Qwen 2.5 7B, and Magistral Small 24B across TriviaQA, BigMath, and MMLU use activation steering, patching, noising, swap experiments, and attention blocking to trace information flow from answer tokens to the cache site. Linear probing and variance partitioning further indicate that the cached representations capture richer answer-quality information beyond token log-probabilities.

Significance. If the central claims hold, the work advances understanding of metacognition and internal self-evaluation in LLMs by showing that verbal confidence reflects sophisticated, automatic processes rather than post-hoc readout. The convergent evidence across multiple intervention techniques, models, and datasets, combined with the demonstration of additional explanatory power beyond log-probabilities, provides a mechanistic basis for improving uncertainty estimation and calibration. Strengths include the use of causal interventions alongside correlational probing and the focus on falsifiable predictions about information flow timing.

major comments (2)

[Methods (activation steering and attention blocking experiments)] Methods section on activation steering, patching, noising, and attention blocking: The evidence for automatic caching at the first post-answer position and subsequent retrieval rests entirely on these interventions. However, the manuscript does not demonstrate that the observed information flow or representations occur during unperturbed forward passes; interventions can induce or amplify patterns absent in normal generation. This is load-bearing for the claim that the model performs this caching by default rather than only under experimental conditions.
[Results (variance partitioning and probing)] Results on linear probing and variance partitioning: The claim that cached representations explain 'substantial variance' beyond token log-probabilities requires explicit reporting of effect sizes, cross-validation details, and controls for overlap between probed features and log-probability baselines. Without these, it is unclear whether the richer-evaluation interpretation is supported or whether the additional variance is marginal or artifactual.

minor comments (2)

[Abstract] Abstract: The term 'swap experiments' is mentioned but not defined or referenced in the main text summary; ensure all experimental variants are consistently described and cited.
[Figures] Figure clarity: Attention maps or probing accuracy plots should include explicit legends, error bars, and statistical significance markers to aid interpretation of the information-flow claims.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on methodological rigor and reporting standards. We address each major point below, providing additional analyses and clarifications while preserving the core claims supported by our convergent evidence.

Referee: [Methods (activation steering and attention blocking experiments)] Methods section on activation steering, patching, noising, and attention blocking: The evidence for automatic caching at the first post-answer position and subsequent retrieval rests entirely on these interventions. However, the manuscript does not demonstrate that the observed information flow or representations occur during unperturbed forward passes; interventions can induce or amplify patterns absent in normal generation. This is load-bearing for the claim that the model performs this caching by default rather than only under experimental conditions.

Authors: We agree that establishing the presence of the representations in unperturbed forward passes is essential. While our linear probing and variance partitioning analyses are performed on activations from standard, unperturbed generations (as described in the Results section), we have added a new subsection in the revised Methods and Results that directly examines hidden-state correlations at the first post-answer position during normal forward passes without any steering, patching, or noising. These baseline analyses show that the post-answer representations already encode information predictive of verbal confidence prior to any intervention, with the causal experiments then used to establish necessity and directionality of the flow. This combination supports that caching occurs by default. revision: yes
Referee: [Results (variance partitioning and probing)] Results on linear probing and variance partitioning: The claim that cached representations explain 'substantial variance' beyond token log-probabilities requires explicit reporting of effect sizes, cross-validation details, and controls for overlap between probed features and log-probability baselines. Without these, it is unclear whether the richer-evaluation interpretation is supported or whether the additional variance is marginal or artifactual.

Authors: We appreciate the request for greater transparency. In the revised manuscript we now report incremental R² values from the hierarchical variance partitioning (showing 12–28% additional variance explained across models and datasets after accounting for log-probabilities), specify that all probes use 5-fold cross-validation with held-out test sets, and include an orthogonalization control in which log-probability features are residualized from the cached representations before probing. These additions confirm that the probed representations capture explanatory power beyond token-level probabilities, consistent with a richer answer-quality evaluation. revision: yes
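
The incremental-R² idea in this response can be sketched as a hierarchical regression: fit a baseline on log-probabilities alone, fit a second model on log-probabilities plus activations, and compare cross-validated R². The code below is a minimal sketch on synthetic data, assuming a ridge probe and 5-fold cross-validation; the data-generating process, `alpha`, and all variable names are hypothetical, and the numbers it produces say nothing about the 12–28% range quoted above.

```python
import numpy as np

def ridge_fit_predict(X_train, y_train, X_test, alpha=1.0):
    """Closed-form ridge regression, centred on training statistics."""
    x_mean, y_mean = X_train.mean(axis=0), y_train.mean()
    Xc = X_train - x_mean
    gram = Xc.T @ Xc + alpha * np.eye(Xc.shape[1])
    w = np.linalg.solve(gram, Xc.T @ (y_train - y_mean))
    return (X_test - x_mean) @ w + y_mean

def cv_r2(X, y, n_folds=5, alpha=1.0, seed=0):
    """Cross-validated R^2: every prediction comes from a model that never saw that row."""
    idx = np.random.default_rng(seed).permutation(len(y))
    preds = np.empty(len(y), dtype=float)
    for test_idx in np.array_split(idx, n_folds):
        train_idx = np.setdiff1d(idx, test_idx)
        preds[test_idx] = ridge_fit_predict(X[train_idx], y[train_idx], X[test_idx], alpha)
    ss_res = np.sum((y - preds) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

# Synthetic data: a latent "answer quality" drives confidence, is weakly
# reflected in log-probabilities, and strongly reflected in activations.
rng = np.random.default_rng(1)
n, d = 400, 8
quality = rng.normal(size=n)
logprob = 0.5 * quality + rng.normal(size=n)
acts = np.outer(quality, rng.normal(size=d)) + 0.5 * rng.normal(size=(n, d))
confidence = quality + 0.3 * rng.normal(size=n)

r2_base = cv_r2(logprob[:, None], confidence)                  # log-probabilities only
r2_full = cv_r2(np.column_stack([logprob, acts]), confidence)  # plus activations
incremental_r2 = r2_full - r2_base
```

The orthogonalization control mentioned in the response would add one step before probing: regress each activation dimension on the log-probability features and probe the residuals instead.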

Circularity Check

0 steps flagged

No circularity: claims rest on empirical interventions and external baselines


The paper's central claims—that verbal confidence representations are cached at the first post-answer position and explain variance beyond token log-probabilities—are supported by activation steering, patching, noising, attention blocking, linear probing, and variance partitioning experiments across multiple models and datasets. These methods compare internal activations to observed verbal outputs and to independent log-probability baselines rather than defining the target quantity in terms of itself or fitting parameters that are then relabeled as predictions. No equations, uniqueness theorems, or ansatzes are invoked that reduce the result to the input by construction. Self-citations, if present, are not load-bearing for the core mechanism; the evidence is externally falsifiable via the described interventions.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests primarily on the validity of standard mechanistic interpretability techniques rather than new mathematical axioms or invented entities; no free parameters or new postulated objects are introduced in the abstract.

axioms (2)

Domain assumption: Activation steering and patching interventions isolate causal contributions to confidence without creating spurious representations absent in normal inference.
Invoked implicitly when interpreting steering and patching results as revealing natural computation flow.

Domain assumption: Linear probes and variance partitioning accurately measure the information content of internal activations relevant to verbal confidence.
Used to claim that cached representations explain variance beyond log-probabilities.
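
To make the first assumption concrete, the corrupt-and-restore form of activation patching can be sketched on a toy network: run a clean input and cache its hidden states, run a corrupted input, then overwrite one position's hidden state with the clean value and measure how much of the clean output returns. Everything below is hypothetical: the toy has no attention, and its `readout` reads from a single position by construction, so the recovery pattern is built in rather than discovered.

```python
import numpy as np

def run_toy_model(inputs, W, readout, patch=None):
    """Toy forward pass: hidden = tanh(inputs @ W); output = sum(hidden * readout).

    `patch` is an optional (position, vector) pair that overwrites the hidden
    state at one position before the readout. Returns (output, hidden).
    """
    hidden = np.tanh(inputs @ W)
    if patch is not None:
        position, vector = patch
        hidden = hidden.copy()
        hidden[position] = vector
    return float(np.sum(hidden * readout)), hidden

def patching_recovery(clean_inputs, corrupt_inputs, W, readout, position):
    """Fraction of the clean-vs-corrupt output gap restored by patching one position."""
    clean_out, clean_hidden = run_toy_model(clean_inputs, W, readout)
    corrupt_out, _ = run_toy_model(corrupt_inputs, W, readout)
    patched_out, _ = run_toy_model(
        corrupt_inputs, W, readout, patch=(position, clean_hidden[position])
    )
    return (patched_out - corrupt_out) / (clean_out - corrupt_out)

rng = np.random.default_rng(0)
seq_len, d_in, d_hidden = 6, 4, 5
clean_inputs = rng.normal(size=(seq_len, d_in))
corrupt_inputs = np.zeros_like(clean_inputs)   # stand-in for a corrupted run
W = rng.normal(size=(d_in, d_hidden))

cache_position = 4                             # hypothetical "cache" position
readout = np.zeros((seq_len, d_hidden))
readout[cache_position] = rng.normal(size=d_hidden)

recovery_at_cache = patching_recovery(clean_inputs, corrupt_inputs, W, readout, cache_position)
recovery_elsewhere = patching_recovery(clean_inputs, corrupt_inputs, W, readout, 3)
```

The assumption flagged above is precisely what this toy cannot test: in a real model, a patched state may be off-distribution for later layers, so full or partial recovery need not reflect how the unperturbed forward pass uses that position.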

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: Activation steering, patching, noising, and swap experiments reveal that confidence representations emerge at answer-adjacent positions

IndisputableMonolith/Foundation/ArithmeticFromLogic.lean · absolute_floor_iff_bare_distinguishability · unclear
Paper passage: linear probing and variance partitioning reveal that these cached representations explain substantial variance beyond token log-probabilities

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 6 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

The Hidden Signal of Verifier Strictness: Controlling and Improving Step-Wise Verification via Selective Latent Steering
cs.LG · 2026-05 · conditional · novelty 7.0
VerifySteer selectively steers hidden states at paragraph boundaries using latent correctness signals to control verifier strictness and outperform baselines on ProcessBench and Hard2Verify with lower compute.

Internalizing Curriculum Judgment for LLM Reinforcement Fine-Tuning
cs.LG · 2026-05 · unverdicted · novelty 6.0
METIS internalizes curriculum judgment in LLM reinforcement fine-tuning by predicting within-prompt reward variance via in-context learning and jointly optimizing with a self-judgment reward, yielding superior perform...

Hypothesis generation and updating in large language models
cs.LG · 2026-05 · unverdicted · novelty 6.0
LLMs exhibit Bayesian-like hypothesis updating with strong-sampling bias and an evaluation-generation gap but generalize poorly outside observed data.

How LLMs Detect and Correct Their Own Errors: The Role of Internal Confidence Signals
cs.LG · 2026-04 · unverdicted · novelty 6.0
LLMs implement a second-order confidence architecture where the PANL activation encodes both error likelihood and the ability to correct it, beyond verbal confidence or log-probabilities.

Verbal Confidence Saturation in 3-9B Open-Weight Instruction-Tuned LLMs: A Pre-Registered Psychometric Validity Screen
cs.CL · 2026-04 · conditional · novelty 6.0
Seven 3-9B instruction-tuned LLMs produce verbal confidence that saturates at high values and fails psychometric validity criteria for Type-2 discrimination under minimal elicitation.

Distilling Self-Consistency into Verbal Confidence: A Pre-Registered Negative Result and Post-Hoc Rescue on Gemma 3 4B
cs.CL · 2026-04 · conditional · novelty 5.0
Fine-tuning Gemma 3 4B on unfiltered self-consistency targets produces a binary verbal correctness discriminator with AUROC 0.774 on TriviaQA, outperforming logit entropy after a modal-filtered pre-registration failed.

Reference graph

Works this paper leans on

29 extracted references · 29 canonical work pages · cited by 6 Pith papers · 9 internal anchors

[1] … URL https://transformer-circuits.pub/2025/attribution-graphs/methods.html. Anthropic. Emergent introspective awareness in large language models. https://transformer-circuits.pub/2025/introspection/, 2025. Anthropic Research Report.
[2] Azaria, A. and Mitchell, T. The internal state of an LLM knows when it’s lying. arXiv preprint arXiv:2304.13734.
[3] Burns, C., Ye, H., Klein, D., and Steinhardt, J. Discovering latent knowledge in language models without supervision. arXiv preprint arXiv:2212.03827.
[4] Devic, S., Peale, C., Bradley, A., Williamson, S., Nakkiran, P., and Gollakota, A. Trace length is a simple uncertainty signal in reasoning models. arXiv preprint arXiv:2510.10409.
[5] Geng, J., Cai, F., Wang, Y., Koeppl, H., Nakov, P., and Gurevych, I. A survey of language model confidence estimation and calibration. arXiv preprint arXiv:2311.08298.
[6] Geva, M., Bastings, J., Filippova, K., and Globerson, A. Dissecting recall of factual associations in auto-regressive language models. arXiv preprint arXiv:2304.14767.
[7] Heimersheim, S. and Nanda, N. How to use and interpret activation patching. arXiv preprint arXiv:2404.15255.
[8] Hua, T. T., Qin, A., Marks, S., and Nanda, N. Steering evaluation-aware language models to act like they are deployed. arXiv preprint arXiv:2510.20487.
[9] Joshi, M., Choi, E., Weld, D. S., and Zettlemoyer, L. TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension. arXiv preprint arXiv:1705.03551.
[10] Kadavath, S., Conerly, T., Askell, A., Henighan, T., Drain, D., Perez, E., Schiefer, N., Hatfield-Dodds, Z., DasSarma, N., Tran-Johnson, E., et al. Language models (mostly) know what they know. arXiv preprint arXiv:2207.05221.
[11] … URL https://transformer-circuits.pub/2025/attribution-graphs/biology.html. Liu, J., Jain, J., Diab, M., and Subramani, N. LLM microscope: What model internals reveal about answer correctness and context utilization. arXiv preprint arXiv:2510.04013, 2025.
[12] Mei, Z., Zhang, C., Yin, T., Lidard, J., Shorinwa, O., and Majumdar, A. Reasoning about uncertainty: Do reasoning models know when they don’t know? arXiv preprint arXiv:2506.18183.
[13] Panickssery, N., Gabrieli, N., Schulz, J., Tong, M., Hubinger, E., and Turner, A. M. Steering Llama 2 via contrastive activation addition. arXiv preprint arXiv:2312.06681.
[14] Rai, D., Zhou, Y., Feng, S., Saparov, A., and Yao, Z. A practical review of mechanistic interpretability for transformer-based language models. arXiv preprint arXiv:2407.02646.
[15] Steyvers, M., Belem, C., and Smyth, P. Improving metacognition and uncertainty communication in language models. arXiv preprint arXiv:2510.05126, 2025a. Steyvers, M., Tejeda, H., Kumar, A., Belem, C., Karny, S., Hu, X., Mayer, L. W., and Smyth, P. What large language models know and what people think they know. Nature Machine Intelligence, pp. 1–11, 202...
[16] Tian, K., Mitchell, E., Zhou, A., Sharma, A., Rafailov, R., Yao, H., Finn, C., and Manning, C. D. Just ask for calibration: Strategies for eliciting calibrated confidence scores from language models fine-tuned with human feedback. arXiv preprint arXiv:2305.14975.
[17] Turner, A. M., Thiergart, L., Leech, G., Udell, D., Vazquez, J. J., Mini, U., and MacDiarmid, M. Steering language models with activation engineering. arXiv preprint arXiv:2308.10248.
[18] Varshney, N., Yao, W., Zhang, H., Chen, J., and Yu, D. A stitch in time saves nine: Detecting and mitigating hallucinations of LLMs by validating low-confidence generation. arXiv preprint arXiv:2307.03987.
[19] Venhoff, C., Arcuschin, I., Torr, P., Conmy, A., and Nanda, N. Base models know how to reason, thinking models learn when. arXiv preprint arXiv:2510.07364.
[20] Xiong, M., Hu, Z., Lu, X., Li, Y., Fu, J., He, J., and Hooi, B. Can LLMs express their uncertainty? An empirical evaluation of confidence elicitation in LLMs. arXiv preprint arXiv:2306.13063.
[21] Yoon, D., Kim, S., Yang, S., Kim, S., Kim, S., Kim, Y., Choi, E., Kim, Y., and Seo, M. Reasoning models better express their confidence. arXiv preprint arXiv:2505.14489.
[22] Zhang, F. and Nanda, N. Towards best practices of activation patching in language models: Metrics and methods. arXiv preprint arXiv:2309.16042.
[23] Appendix Overview
Appendix A: Related Work (§A): Summary of related literature
Appendix B: Supplemental Figures (§B): Prompts, calibration plots, and additional experimental results
Appendix C: Supplemental Methods (§C)
C.1 Experiments with Categorical Confidence Prompt in Gemma 3 27B (§C.1)
C.1.1 Technical D...
[24] … were used, since we were focussed on understanding the generation of Gemma’s raw verbal confidence signals. The model’s performance was 77.4%; this was determined by having GPT4o-mini mark questions. (B) Distribution of Gemma’s confidence responses across the 10 classes. n = 7858 questions from the TriviaQA dataset (Joshi et al., 2017). ...
[25] … the TriviaQA dataset (Joshi et al., 2017). Figure 15. Calibration and Distribution of Categorical Confidence Ratings in Qwen 2.5 7B. (A) Calibration of Qwen: Expected Calibration Error (ECE) = 0.06, AUROC = 0.65. No procedures such as temperature scaling (Guo et al.,
[26]

We created high and low confidence vectors by contrasting high and low confidence trials (all trials that the model scored correctly), following standard procedures in activation steering (Turner et al., 2023; Stolfo et al., 2024a; Panickssery et al., 2023; Hua et al., 2025). Creation of high- and low-confidence steering vectors: We constructed steeri...
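The contrastive construction described in this excerpt can be illustrated with a minimal sketch. It assumes a difference-of-means steering vector, which is one common choice in the activation-steering literature; the truncated text does not confirm that the authors use exactly this form. The helper names (`mean_activation`, `contrast_vector`, `steer`), the scaling factor `alpha`, and the toy activations are all hypothetical.

```python
from statistics import fmean


def mean_activation(acts):
    """Element-wise mean over a list of equal-length activation vectors."""
    return [fmean(column) for column in zip(*acts)]


def contrast_vector(high_acts, low_acts):
    """Difference of means: mean(high-confidence) minus mean(low-confidence)."""
    mu_high = mean_activation(high_acts)
    mu_low = mean_activation(low_acts)
    return [h - l for h, l in zip(mu_high, mu_low)]


def steer(activation, vector, alpha):
    """Add a scaled steering vector to one activation vector."""
    return [a + alpha * v for a, v in zip(activation, vector)]


# Toy 3-dimensional activations (hypothetical, not from the paper).
high = [[1.0, 2.0, 0.0], [3.0, 2.0, 0.0]]
low = [[0.0, 2.0, 1.0], [0.0, 2.0, 3.0]]

v_high = contrast_vector(high, low)
steered = steer([0.5, 0.5, 0.5], v_high, alpha=2.0)
print(v_high, steered)
```

A low-confidence vector would follow by swapping the two groups, which under this assumption is simply the negation of `v_high`.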
[27]

– 1/2 of these trials were randomly sampled from the top 3 confidence classes and 1/2 from the bottom 3 classes.

C.1.4. ACTIVATION PATCHING

Corruption of Answer Tokens via Mean Ablation. To test whether specific position-layer combinations are sufficient for confidence computation, we use a corrupt-and-restore procedure following Meng et al. (2022); Heime...
[28]

C.1.5. METRICS USED IN PATCHING AND OTHER EXPERIMENTS

Logit Difference. As a generalization of Wang et al. (2023), we define logit difference as the logit of the original confidence class minus the mean logit of alternative confidence classes:

$$\Delta\mathrm{logit} = z_{y^*} - \frac{1}{K-1} \sum_{k \neq y^*} z_k \tag{3}$$

where $z_{y^*}$ is the logit of the clean trial’s confidence class, $z_k$ are logits o...
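The logit difference metric defined in this excerpt (the logit of the original confidence class minus the mean logit of the other K−1 classes) can be sketched in a few lines. The function name `logit_difference`, the argument names, and the toy logits are illustrative assumptions, not the authors' code.

```python
def logit_difference(logits, target):
    """Logit of the target class minus the mean logit of all other classes.

    logits: list of K class logits (K >= 2)
    target: index of the clean trial's confidence class (y* in Eq. 3)
    """
    if len(logits) < 2:
        raise ValueError("need at least two classes")
    others = [z for k, z in enumerate(logits) if k != target]
    return logits[target] - sum(others) / len(others)


# Toy example with K = 4 confidence classes (hypothetical values).
logits = [1.0, 4.0, 2.0, 3.0]
delta = logit_difference(logits, target=1)
print(delta)  # 4.0 - (1.0 + 2.0 + 3.0) / 3 = 2.0
```

A positive value means the model favours the original confidence class over the average alternative; values near zero or below indicate that the preference has been disrupted.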
[29]

and verbal confidence ratings from both Phase 0 (same run) and Phase 1 (different run with identical questions but answers provided in the prompt). Logprobs explained only 4.9% of variance in within-run verbal confidence ($r = 0.23$, $R^2_{\mathrm{CV}} = 0.049$) and 8.4% in cross-run verbal confidence ($r = 0.29$, $R^2_{\mathrm{CV}} = 0.084$). These low values confirm that verbal c...
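As a sanity check on the figures quoted in this excerpt, the sketch below relates a Pearson correlation to variance explained. It assumes a univariate least-squares fit, for which the in-sample R² equals r²; the truncated text does not state the exact regression used, and a cross-validated R² is typically slightly lower than the in-sample value. The helper names and the toy data are hypothetical.

```python
import math


def pearson_r(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def linear_fit_r2(x, y):
    """In-sample R^2 of a one-predictor least-squares fit of y on x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sum(
        (a - mx) ** 2 for a in x
    )
    intercept = my - slope * mx
    ss_res = sum((b - (intercept + slope * a)) ** 2 for a, b in zip(x, y))
    ss_tot = sum((b - my) ** 2 for b in y)
    return 1.0 - ss_res / ss_tot


# Toy data: hypothetical answer logprobs and confidence ratings.
logprobs = [-1.2, -0.4, -2.5, -0.1, -0.9]
confidence = [7.0, 9.0, 4.0, 8.0, 6.0]
r = pearson_r(logprobs, confidence)
print(r ** 2, linear_fit_r2(logprobs, confidence))  # equal up to rounding

# Squared values of the correlations quoted in the excerpt.
print(0.23 ** 2, 0.29 ** 2)  # about 0.053 and 0.084
```

Under this assumption the quoted correlations imply in-sample values of about 0.053 and 0.084, in line with the reported cross-validated 0.049 and 0.084.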
