Asking Is Not Enough: Protocol Sensitivity in LLM Confidence Calibration
Pith reviewed 2026-06-29 16:47 UTC · model grok-4.3
The pith
LLM confidence calibration comparisons between token probabilities and verbalized answers change sign or size based on conditioning context and token readout choices.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Holding verbalized-confidence elicitation fixed while varying the answer string scored for token probability, the token readout method, and the conditioning context produces ECE gaps whose sign or magnitude changes across settings and models. Under the default generated-answer bare-context protocol Instruct models reach near parity rather than showing a large calibration gain for verbalized confidence. In supplied-answer tests, surface-plausible wrong answers receive nearly the same verbalized confidence as gold answers.
What carries the argument
Protocol axes of conditioning context, token-probability readout from answer tokens, and ECE estimation when comparing verbalized versus token confidence signals on QA tasks.
If this is right
- Calibration papers must specify conditioning context, scored answer string, and token readout to allow comparison.
- Claims that Instruct tuning improves verbalized calibration over token probabilities may not generalize beyond particular protocols.
- Verbalized confidence encodes answer plausibility and provenance in addition to correctness.
- Both token and verbalized signals should be treated as measurements that depend on the chosen protocol.
Where Pith is reading between the lines
- Studies could standardize on generated-answer bare-context to reduce apparent differences across model families.
- Similar protocol sensitivity could affect other tasks that compare internal and expressed uncertainty such as hallucination detection.
- Users may need to select protocols according to whether they want to measure internal token uncertainty or surface-expressed confidence.
Load-bearing premise
That fixing only the verbalized prompt while varying token-probability axes and conditioning context isolates those effects without introducing other uncontrolled differences in model outputs or benchmark behavior.
What would settle it
Re-running the same benchmarks and models under multiple conditioning contexts and finding the ECE gap sign stays consistent rather than flipping.
Figures
read the original abstract
LLM confidence calibration is often evaluated by comparing two signals: token-probability scores and verbalized confidence. These signals are sometimes treated as direct readouts of model uncertainty, but their comparison depends on measurement choices that are rarely made explicit. In the main analysis, we hold the verbalized-confidence elicitation fixed: a single prompt template, probability scale, and output format. We then vary the measurement axes that define the verbalized-vs-token comparison: which answer string receives the token-probability score, how that score is read from the answer tokens, and under which conditioning context it is measured. We evaluate this design on four QA benchmarks across three open 7--8B base/Instruct model families, with larger Qwen2.5 variants as same-family robustness checks. The resulting comparison is sensitive to these choices: conditioning context changes the sign or magnitude of the ECE gap across settings, token readout produces smaller but still sign-moving changes, and changing the ECE estimator has little effect. Under the default generated-answer, bare-context protocol, Instruct settings are close to parity rather than showing a large calibration gain for verbalized confidence. In a separate supplied-answer analysis, surface-plausible wrong answers receive nearly the same confidence as supplied gold answers, suggesting that verbalized confidence also reflects answer plausibility and provenance rather than correctness alone. We argue that both confidence signals should be treated as protocol-dependent behavioral measurements, and provide a reporting checklist covering elicitation provenance, scored answer, token-probability readout, and conditioning context.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that comparisons of token-probability scores versus verbalized confidence for LLM calibration are sensitive to measurement protocols. Holding the verbalized elicitation prompt, scale, and format fixed, the authors vary conditioning context, which answer string is scored, and token readout method across four QA benchmarks and three 7-8B model families (with Qwen2.5 robustness checks). They report that context alters the sign or magnitude of ECE gaps, token readout produces smaller sign-moving changes, ECE estimator choice has little effect, Instruct models are near parity under the default generated-answer bare-context protocol, and verbalized confidence on supplied answers tracks plausibility rather than correctness alone. They conclude both signals are protocol-dependent and supply a reporting checklist.
Significance. If the empirical patterns hold, the work usefully demonstrates that apparent calibration advantages are not robust to standard but rarely documented measurement choices, supporting more careful protocol reporting in the field. The checklist directly addresses a practical gap. The multi-benchmark, multi-family design and explicit separation of verbalized elicitation from the varied axes are strengths that make the sensitivity claim falsifiable and extensible.
major comments (2)
- [Abstract / main analysis] Abstract and main analysis: the central claim that conditioning context produces sign or magnitude changes in the ECE gap rests on the assumption that fixing the verbalized prompt isolates the token-probability axes; however, the manuscript does not report whether answer-generation distributions or token-score baselines remain stable across bare vs. supplied-answer contexts, leaving open the possibility that uncontrolled shifts in model behavior contribute to the observed flips independently of the intended protocol axes.
- [supplied-answer analysis] Supplied-answer analysis: the observation that surface-plausible wrong answers receive nearly the same verbalized confidence as gold answers is load-bearing for the claim that verbalized confidence reflects plausibility and provenance; the manuscript should define and operationalize 'surface-plausible' (e.g., via similarity metric or human rating) and report the exact confidence values or distributions to allow readers to assess the magnitude of the effect.
minor comments (2)
- The manuscript should include a table or figure explicitly listing the exact prompt template, probability scale, and output format used for verbalized elicitation so that the 'fixed' condition can be reproduced.
- The robustness checks with larger Qwen2.5 variants are mentioned but not detailed; adding a short subsection or appendix table showing whether sign changes replicate would strengthen the multi-family claim.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address each major comment below and outline revisions to improve clarity and rigor.
read point-by-point responses
-
Referee: [Abstract / main analysis] Abstract and main analysis: the central claim that conditioning context produces sign or magnitude changes in the ECE gap rests on the assumption that fixing the verbalized prompt isolates the token-probability axes; however, the manuscript does not report whether answer-generation distributions or token-score baselines remain stable across bare vs. supplied-answer contexts, leaving open the possibility that uncontrolled shifts in model behavior contribute to the observed flips independently of the intended protocol axes.
Authors: We agree that reporting stability is important for isolating the protocol axes. The design holds verbalized elicitation fixed while varying conditioning context (among other token axes), but the manuscript does not include explicit checks on generation distributions or token baselines across bare vs. supplied contexts. In revision we will add these checks (e.g., comparing answer token distributions and baseline token scores) to confirm that observed ECE sign/magnitude changes are attributable to the measured protocol variations. revision: yes
-
Referee: [supplied-answer analysis] Supplied-answer analysis: the observation that surface-plausible wrong answers receive nearly the same verbalized confidence as gold answers is load-bearing for the claim that verbalized confidence reflects plausibility and provenance; the manuscript should define and operationalize 'surface-plausible' (e.g., via similarity metric or human rating) and report the exact confidence values or distributions to allow readers to assess the magnitude of the effect.
Authors: We accept this point. The manuscript uses the term 'surface-plausible' without a formal definition or quantitative reporting. In the revision we will operationalize it via a similarity metric (e.g., cosine similarity of embeddings or normalized edit distance) and report the exact verbalized confidence means, distributions, or histograms comparing gold answers to these plausible incorrect answers, allowing direct assessment of the effect size. revision: yes
Circularity Check
Empirical protocol-sensitivity study with no derivation chain or fitted predictions
full rationale
The paper performs an empirical comparison of ECE gaps under varied conditioning contexts, token readouts, and answer sources while holding the verbalized elicitation prompt fixed. No equations, first-principles derivations, parameter fits, or predictions appear in the provided text. Claims rest on direct experimental measurements across benchmarks and models rather than any reduction of outputs to inputs by construction. No self-citation load-bearing steps, uniqueness theorems, or ansatzes are invoked. This is a standard non-circular empirical measurement study.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
-
[1]
Measuring massive multitask language under- standing. InICLR. ArXiv:2009.03300. Xinmeng Huang, Shuo Li, Mengxin Yu, Matteo Sesia, Hamed Hassani, Insup Lee, Osbert Bastani, and Edgar Dobriban. 2024. Uncertainty in language models: Assessment through rank-calibration. In EMNLP, pages 284–312. ArXiv:2404.03163. Albert Q. Jiang, Alexandre Sablayrolles, Arthur...
work page internal anchor Pith review Pith/arXiv arXiv 2009
-
[2]
Calibrating LLM confidence by probing per- turbed representation stability. InEMNLP, pages 10448–10514. ArXiv:2505.21772. Simon Kornblith, Mohammad Norouzi, Honglak Lee, and Geoffrey Hinton. 2019. Similarity of neu- ral network representations revisited. InICML. ArXiv:1905.00414. Ananya Kumar, Percy S. Liang, and Tengyu Ma. 2019. Verified uncertainty cali...
-
[3]
Mix-n-Match: Ensemble and compositional methods for uncertainty calibration in deep learning. InICML. ArXiv:2003.07329. Zihao Zhao, Eric Wallace, Shi Feng, Dan Klein, and Sameer Singh. 2021. Calibrate before use: Im- proving few-shot performance of language mod- els. InICML, pages 12697–12706. PMLR 139; arXiv:2102.09690. 11 A Full A–D Protocol Grid by Set...
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.