pith. machine review for the scientific record.

arxiv: 2604.22215 · v1 · submitted 2026-04-24 · 💻 cs.CL · cs.AI

Recognition: unknown

Verbal Confidence Saturation in 3-9B Open-Weight Instruction-Tuned LLMs: A Pre-Registered Psychometric Validity Screen

Authors on Pith: no claims yet

Pith reviewed 2026-05-08 12:04 UTC · model grok-4.3

classification 💻 cs.CL cs.AI
keywords verbal confidence · LLMs · uncertainty estimation · psychometric validity · Type-2 discrimination · TriviaQA · open-weight models · instruction-tuned

The pith

Seven 3-9B instruction-tuned LLMs fail to produce valid verbal confidence ratings that discriminate correct from incorrect answers.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether small open-weight LLMs can express internal uncertainty through simple verbal confidence ratings under greedy decoding. It applies a pre-registered psychometric validity screen to numeric and categorical outputs across 524 TriviaQA items in seven models, revealing that all produce near-ceiling numeric ratings, with a mean ceiling rate of 91.7 percent, and show no useful alignment with token log-probabilities. Categorical formats also fail the validity criteria and often reduce answer accuracy. These patterns indicate that standard verbal elicitation does not transmit uncertainty signals to the output in this model-size range. The work stresses that invalid verbal outputs do not rule out the presence of internal representations.
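
As a rough illustration of one check in such a screen, the sketch below computes a ceiling rate for numeric confidence ratings. The data layout, the 0-100 scale, and the threshold at the scale maximum are assumptions for illustration; the exact pre-registered criteria live in the paper's OSF protocol.

```python
# Minimal sketch of a ceiling-rate check for numeric verbal confidence.
# The 0-100 scale and the toy data are illustrative assumptions.
import numpy as np

def ceiling_rate(confidences, ceiling=100):
    """Fraction of trials whose verbal confidence sits at the scale maximum."""
    c = np.asarray(confidences, dtype=float)
    return float(np.mean(c >= ceiling))

# Toy example: a model that answers almost every item at 100% confidence.
rng = np.random.default_rng(0)
ratings = rng.choice([100, 95, 90], size=524, p=[0.90, 0.07, 0.03])
print(f"ceiling rate: {ceiling_rate(ratings):.1%}")
```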

Core claim

All seven instruction-tuned models were classified invalid on numeric confidence elicitation, with a mean ceiling rate of 91.7 percent; categorical formats also failed to produce valid Type-2 discrimination and instead lowered task accuracy below 5 percent in six models. Token log-probabilities showed mean cross-validated R² below 0.01 when predicting verbal confidence, while reasoning-trace length in the distilled model correlated negatively with confidence (rho = -0.36).
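
A hedged sketch of how the log-probability result could be checked: regress verbal confidence on a token log-probability feature with cross-validation and read off the mean R². The single-feature linear model, the 5-fold split, and the synthetic near-ceiling ratings below are illustrative assumptions, not the paper's pipeline.

```python
# Sketch of a cross-validated R^2 check for log-probabilities predicting
# verbal confidence, under the assumption of a single-feature linear model.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(1)
n_items = 524
# Hypothetical feature: mean token log-probability of each generated answer.
logprob = rng.normal(-1.5, 0.8, size=(n_items, 1))
# Near-ceiling verbal confidence that barely varies with that feature.
confidence = np.clip(95 + rng.normal(0, 3, size=n_items), 0, 100)

r2_scores = cross_val_score(LinearRegression(), logprob, confidence,
                            cv=5, scoring="r2")
print(f"mean cross-validated R^2: {r2_scores.mean():.3f}")  # near zero, can be negative
```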

What carries the argument

The pre-registered psychometric validity screen that checks for item-level Type-2 discrimination between correct and incorrect answers under numeric and categorical confidence formats.
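
A minimal sketch of item-level Type-2 discrimination, assuming the screen reduces to an AUROC-style question: does confidence rank correct answers above incorrect ones? The paper's exact pre-registered cutoff is not reproduced here.

```python
# Sketch of a Type-2 discrimination (AUROC2) check; the toy saturated-
# confidence data below is an illustrative assumption.
import numpy as np
from sklearn.metrics import roc_auc_score

def type2_auroc(correct, confidence):
    """AUROC of confidence as a ranker of answer correctness (chance = 0.5)."""
    correct = np.asarray(correct, dtype=int)
    if correct.min() == correct.max():   # all-correct or all-wrong: undefined
        return float("nan")
    return roc_auc_score(correct, np.asarray(confidence, dtype=float))

# Toy example: saturated confidence carries almost no rank information.
rng = np.random.default_rng(2)
correct = rng.random(524) < 0.6
confidence = np.where(rng.random(524) < 0.9, 100, rng.integers(80, 100, size=524))
print(f"AUROC2 = {type2_auroc(correct, confidence):.3f}")   # close to 0.5
```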

If this is right

  • Verbal confidence ratings from these models cannot be treated as reliable uncertainty estimates in applications without prior validity screening.
  • Categorical confidence formats do not improve discrimination and can degrade overall task performance.
  • Token log-probabilities do not serve as a useful proxy for verbalized confidence under the observed output-variance regime.
  • In reasoning-distilled models, longer reasoning traces are associated with lower verbal confidence ratings.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Non-verbal methods of extracting uncertainty may be required to access internal signals in models of this size.
  • Confidence saturation may help explain why smaller LLMs produce overconfident errors in deployed settings.
  • Repeating the screen on larger models or with varied decoding strategies could identify whether the failure is size-dependent.

Load-bearing premise

The chosen psychometric criteria for Type-2 discrimination, combined with numeric and categorical elicitation plus greedy decoding, are sufficient to detect whether internal uncertainty signals reach the verbal output.

What would settle it

A demonstration of statistically significant positive item-level correlations between verbal confidence ratings and answer accuracy in the same models, formats, and decoding conditions would falsify the invalidity result.
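
A hedged sketch of that falsification test, assuming a point-biserial correlation with a conventional alpha of .05; the paper's pre-registered test statistic may differ.

```python
# Sketch of the item-level correlation test that would count against the
# Invalid classification. The synthetic "valid" model is an assumption.
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
correct = (rng.random(524) < 0.6).astype(int)
# Hypothetical valid model: confidence genuinely tracks correctness.
confidence = np.clip(60 + 25 * correct + rng.normal(0, 10, size=524), 0, 100)

r, p = stats.pointbiserialr(correct, confidence)
print(f"r = {r:.2f}, p = {p:.2g}")
if r > 0 and p < 0.05:
    print("significant positive item-level association: evidence against the Invalid call")
```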

Figures

Figures reproduced from arXiv: 2604.22215 by Jon-Paul Cacioli.

Figure 1. Confidence ceiling rates by model and elicitation format. The dashed line shows the H1 …
Figure 2. Type-2 AUROC2 by model and condition (bootstrap 95% CI). NUM values are modestly above chance with tight CIs. CAT values for M4, M5, M6, M8 show wide CIs reflecting unreliable estimates from extreme base-rate imbalance.
Figure 3. M8 (DeepSeek-R1-Distill): reasoning-trace length vs. verbalised confidence (NUM).
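
For the uncertainty bands in Figure 2, a minimal sketch of a nonparametric bootstrap CI for AUROC2 is given below; 2,000 resamples and the percentile interval are illustrative choices, not necessarily those used in the paper.

```python
# Sketch of a percentile-bootstrap 95% CI for Type-2 AUROC, resampling
# items with replacement. Resample count and toy data are assumptions.
import numpy as np
from sklearn.metrics import roc_auc_score

def bootstrap_auroc_ci(correct, confidence, n_boot=2000, seed=0):
    rng = np.random.default_rng(seed)
    correct = np.asarray(correct, dtype=int)
    confidence = np.asarray(confidence, dtype=float)
    aurocs = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(correct), size=len(correct))
        if correct[idx].min() == correct[idx].max():
            continue                      # degenerate resample (one class only): skip
        aurocs.append(roc_auc_score(correct[idx], confidence[idx]))
    lo, hi = np.percentile(aurocs, [2.5, 97.5])
    return lo, hi

# Usage on toy data: wide intervals signal unreliable point estimates.
rng = np.random.default_rng(5)
correct = (rng.random(524) < 0.6).astype(int)
confidence = rng.integers(80, 101, size=524)
print("95% CI:", bootstrap_auroc_ci(correct, confidence))
```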
Original abstract

Verbal confidence elicitation is widely used to extract uncertainty estimates from LLMs. We tested whether seven instruction-tuned open-weight models (3-9B parameters, four families) produce verbalised confidence that meets minimal validity criteria for item-level Type-2 discrimination under minimal numeric elicitation with greedy decoding. In a pre-registered study (OSF: osf.io/azbvx), 524 TriviaQA items were administered under numeric (0-100) and categorical (10-class) elicitation to eight models at Q5_K_M quantisation on consumer hardware, yielding 8,384 deterministic trials. A psychometric validity screen was applied to each model-format cell. All seven instruct models were classified Invalid on numeric confidence (H2 confirmed, 7/7 vs. predicted >=4/7), with a mean ceiling rate of 91.7% (H1 confirmed). Categorical elicitation did not rescue validity. Instead, it disrupted task performance in six of seven models, producing accuracy below 5% (H4 not confirmed). Token-level logprobability did not usefully predict verbalised confidence under the observed variance regime (H5 confirmed, mean cross-validated R^2 < 0.01). Within the reasoning-distilled model, reasoning-trace length showed a strong negative partial correlation with confidence (rho = -0.36, p < .001), consistent with the Reasoning Contamination Effect. These results do not imply that internal uncertainty representations are absent. They show that minimal verbal elicitation fails to preserve internal signals at the output interface in this model-size regime. Psychometric screening should precede any downstream use of such signals.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

1 major / 3 minor

Summary. The manuscript reports results from a pre-registered study (OSF: osf.io/azbvx) testing verbal confidence elicitation in seven 3-9B instruction-tuned open-weight LLMs across four families. Using 524 TriviaQA items under numeric (0-100) and categorical (10-class) formats with greedy decoding at Q5_K_M quantization (8,384 deterministic trials), the authors apply a psychometric validity screen for item-level Type-2 discrimination. All seven models are classified Invalid on numeric confidence (mean ceiling rate 91.7%), categorical elicitation disrupts accuracy below 5% in six models, token log-probabilities show near-zero predictive power (mean cross-validated R² < 0.01), and reasoning-trace length shows a negative partial correlation (rho = -0.36) with confidence in the reasoning-distilled model. The central claim is that minimal verbal elicitation fails to preserve internal uncertainty signals at the output interface, without implying absence of internal representations.

Significance. If the results hold, the work demonstrates that standard verbal confidence prompts do not meet basic metacognitive validity criteria in this model-size regime, with direct implications for uncertainty-aware applications of LLMs. Strengths include the pre-registered design, large deterministic trial count, explicit comparison to token log-probabilities, and the careful caveat distinguishing elicitation failure from internal representation absence. These elements provide falsifiable, non-circular support for the output-interface claim and support the recommendation for psychometric screening prior to downstream use.

major comments (1)
  1. The validity screen for item-level Type-2 discrimination (numeric and categorical formats) is load-bearing for the Invalid classification and the claim that signals are not preserved at the output interface. The manuscript should explicitly state the exact discrimination threshold, how it was pre-registered, and whether an ablation or simulation was performed to confirm the screen would detect preserved internal signals if present under greedy decoding.
minor comments (3)
  1. Abstract and results sections: specify which of the seven models is the 'reasoning-distilled model' when reporting the rho = -0.36 correlation with reasoning-trace length.
  2. Results for H5: report the exact mean cross-validated R² value and its range or standard deviation across the eight model-format cells rather than only the inequality < 0.01.
  3. Methods: clarify whether the 10-class categorical format used the same 0-100 numeric scale for scoring or a separate mapping, as this affects interpretation of the accuracy drop below 5%.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the positive assessment, recognition of the pre-registered design and other strengths, and the recommendation for minor revision. We address the single major comment below.

read point-by-point responses
  1. Referee: The validity screen for item-level Type-2 discrimination (numeric and categorical formats) is load-bearing for the Invalid classification and the claim that signals are not preserved at the output interface. The manuscript should explicitly state the exact discrimination threshold, how it was pre-registered, and whether an ablation or simulation was performed to confirm the screen would detect preserved internal signals if present under greedy decoding.

    Authors: We agree that the validity screen is central to the Invalid classification and the output-interface interpretation. The exact discrimination threshold is the pre-registered criterion for item-level Type-2 discrimination (a statistically significant positive association between verbal confidence and accuracy, with the precise statistical test and cutoff specified in the OSF protocol). We will revise the manuscript to state this threshold explicitly in the Methods section and to include a direct citation to the pre-registration (osf.io/azbvx). No ablation or simulation of the screen's sensitivity under hypothetical preserved signals and greedy decoding was performed, as the pre-registered protocol applied the screen directly to the empirical outputs of the tested models. The screen follows standard psychometric criteria for metacognitive sensitivity; the observed mean ceiling rate of 91.7% across models supplies indirect support for its appropriateness in this regime. We will add a brief note acknowledging the absence of such a simulation to the Limitations section. revision: partial
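
A hedged sketch of the sensitivity check the referee requests and the authors acknowledge omitting: simulate a model whose verbal confidence does preserve an internal signal and confirm the screen would return Valid. The signal model and the Valid/Invalid decision rule below are illustrative assumptions, not the pre-registered protocol.

```python
# Sketch of a screen-sensitivity simulation: inject a known internal signal
# into verbal confidence and check that the screen classifies it Valid.
# The beta-distributed signal and the decision rule are assumptions.
import numpy as np
from scipy import stats
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(4)
n = 524
internal_p_correct = rng.beta(2, 2, size=n)             # hypothetical internal signal
correct = (rng.random(n) < internal_p_correct).astype(int)
# A model that faithfully verbalises that signal, plus a little noise.
confidence = np.clip(100 * internal_p_correct + rng.normal(0, 5, size=n), 0, 100)

auroc = roc_auc_score(correct, confidence)
r, p = stats.pointbiserialr(correct, confidence)
verdict = "Valid" if (auroc > 0.5 and r > 0 and p < 0.05) else "Invalid"
print(f"AUROC2 = {auroc:.2f}, r = {r:.2f}, p = {p:.2g} -> screen verdict: {verdict}")
```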

Circularity Check

0 steps flagged

No significant circularity: purely empirical pre-registered psychometric screen

full rationale

The paper reports a pre-registered empirical study (OSF: osf.io/azbvx) administering 524 TriviaQA items to seven LLMs under numeric and categorical confidence elicitation formats, applying standard psychometric validity criteria (item-level Type-2 discrimination) to classify models as Invalid or Valid. All claims rest on direct data outcomes such as 91.7% mean ceiling rate, accuracy drops below 5%, cross-validated R² < 0.01 with log-probabilities, and partial correlations, none of which are derived from or reduce to the paper's own definitions, fitted parameters, or self-citations. No mathematical derivation chain, ansatz, uniqueness theorem, or renaming of known results is present; the study is self-contained against external benchmarks and falsifiable hypotheses.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The analysis rests on standard psychometric domain assumptions about validity criteria without introducing new free parameters or postulated entities.

axioms (2)
  • domain assumption Psychometric validity for Type-2 discrimination requires above-chance accuracy in distinguishing correct from incorrect responses based on reported confidence.
    This is the core criterion used to classify models as Invalid.
  • domain assumption Minimal verbal elicitation with greedy decoding is a fair test of whether internal uncertainty signals reach the output interface.
    Invoked when interpreting the failure of numeric and categorical formats.

pith-pipeline@v0.9.0 · 10661 in / 1531 out tokens · 90307 ms · 2026-05-08T12:04:43.479637+00:00 · methodology


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Distilling Self-Consistency into Verbal Confidence: A Pre-Registered Negative Result and Post-Hoc Rescue on Gemma 3 4B

    cs.CL 2026-04 conditional novelty 5.0

    Fine-tuning Gemma 3 4B on unfiltered self-consistency targets produces a binary verbal correctness discriminator with AUROC 0.774 on TriviaQA, outperforming logit entropy after a modal-filtered pre-registration failed.

Reference graph

Works this paper leans on

23 extracted references · 17 canonical work pages · cited by 1 Pith paper · 8 internal anchors

  1. [1]

Ben-Porath, Y. S. and Tellegen, A. (2020). MMPI-3: Manual for administration, scoring, and interpretation. University of Minnesota Press.

  2. [2]

Cacioli, J. P. (2026a). Type-2 signal detection for LLM metacognition. arXiv:2603.14893.

  3. [3]

Cacioli, J. P. (2026b). Meta-d′ and M-ratio for LLM metacognition. arXiv:2603.25112.

  4. [4]

Cacioli, J. P. (2026c). The metacognitive monitoring battery: A cross-domain benchmark for LLM self-monitoring. arXiv:2604.15702.

  5. [5]

Cacioli, J. P. (2026d). Before you interpret the profile: Validity scaling for LLM metacognitive self-report. arXiv:2604.17707.

  6. [6]

Cacioli, J. P. (2026e). Screen before you interpret: A portable validity protocol for benchmark-based LLM confidence signals. arXiv:2604.17714.

  7. [7]

Cacioli, J. P. (2026f). Concurrent criterion validation of a validity screen for LLM confidence signals via selective prediction. arXiv:2604.17716.

  8. [8]

Cacioli, J. P. (2026g). AUROC2 is format-stable; M-ratio is not. arXiv:2604.08976.

  9. [9]

    Geng, J. et al. (2024). A survey of confidence estimation and calibration in large language models. arXiv:2311.08298

  10. [10]

Joshi, M., Choi, E., Weld, D. S., and Zettlemoyer, L. (2017). TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension. In Proceedings of ACL.

  11. [11]

Kim, S. (2026). Knowing before speaking: In-computation metacognition precedes verbal confidence in large language models. Preprints.org, doi:10.20944/preprints202604.0078.v2.

  12. [12]

Kumaran, V. et al. (2026). How do LLMs compute verbal confidence? arXiv:2603.17839.

  13. [13]

Miao, N. and Ungar, L. (2026). Closing the confidence-faithfulness gap in large language models. arXiv:2603.25052.

  14. [14]

Morey, L. C. (1991). Personality Assessment Inventory professional manual. Psychological Assessment Resources.

  15. [15]

Rust, J., Kosinski, M., and Stillwell, D. (2021). Modern Psychometrics (4th ed.). Routledge.

  16. [16]

    Seo, J. et al. (2026). ADVICE: Answer-dependent verbalized confidence estimation. arXiv:2510.10913

  17. [17]

Steyvers, M. and Peters, M. A. K. (2025). Metacognition and uncertainty communication in humans and large language models. Current Directions in Psychological Science.

  18. [18]

    Steyvers, M., Belem, C., and Smyth, P. (2025). Calibration of LLM confidence. Manuscript

  19. [19]

Tian, K. et al. (2023). Just ask for calibration. arXiv:2305.14975.

  20. [20]

Wang, A. and Stengel-Eskin, E. (2026). Calibrating verbalized confidence with self-generated distractors (DiNCo). arXiv:2509.25532.

  21. [21]

Xiong, M. et al. (2023). Can LLMs express their uncertainty? arXiv:2306.13063.

  22. [22]

Yang, Z. (2024). Can LLMs express confidence? The impact of prompt and generation on verbalized confidence. arXiv:2412.14737.

  23. [23]

Zhao, Y. et al. (2026). Wired for overconfidence. arXiv:2604.01457.