Asking Is Not Enough: Protocol Sensitivity in LLM Confidence Calibration

Hankyeol Kim; Pilsung Kang

arXiv: 2605.27752 · v2 · pith:46VPSFFFnew · submitted 2026-05-26 · cs.AI

Pith reviewed 2026-06-29 16:47 UTC · model grok-4.3

keywords: LLM calibration, expected calibration error, verbalized confidence, token probability, protocol sensitivity, question answering, Instruct models

The pith

LLM confidence calibration comparisons between token probabilities and verbalized answers change sign or size based on conditioning context and token readout choices.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether token-probability scores and verbalized confidence can be compared as direct uncertainty measures by holding the verbalized prompt fixed and varying only the scoring and context axes. It evaluates this on multiple QA benchmarks and model families, finding that the expected calibration error gap shifts with which answer is scored, how token probabilities are read out, and especially the conditioning context. Under the default generated-answer bare-context setup, Instruct models show near parity rather than a clear verbalized advantage. The work also shows verbalized confidence assigns similar scores to plausible wrong answers as to correct ones, indicating it tracks surface features beyond correctness alone. This implies both signals are protocol-dependent behavioral outputs that require explicit reporting of measurement choices.

Core claim

Holding verbalized-confidence elicitation fixed while varying the answer string scored for token probability, the token readout method, and the conditioning context produces ECE gaps whose sign or magnitude changes across settings and models. Under the default generated-answer bare-context protocol Instruct models reach near parity rather than showing a large calibration gain for verbalized confidence. In supplied-answer tests, surface-plausible wrong answers receive nearly the same verbalized confidence as gold answers.
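The ECE gap named in the claim can be sketched as follows. This is a minimal illustration, assuming equal-width binning with 10 bins and a gap defined as token ECE minus verbalized ECE; the function names `ece` and `ece_gap`, the bin count, and the sign convention are assumptions for illustration, not details taken from the paper.

```python
from typing import Sequence


def ece(confidences: Sequence[float], correct: Sequence[bool], n_bins: int = 10) -> float:
    """Equal-width binned expected calibration error (illustrative sketch)."""
    n = len(confidences)
    if n == 0 or n != len(correct):
        raise ValueError("need equally sized, non-empty inputs")
    conf_sum = [0.0] * n_bins
    hit_sum = [0.0] * n_bins
    for c, y in zip(confidences, correct):
        # Clamp so that c == 1.0 (or slightly out-of-range values) stays in a valid bin.
        b = min(max(int(c * n_bins), 0), n_bins - 1)
        conf_sum[b] += c
        hit_sum[b] += 1.0 if y else 0.0
    # Each bin contributes (bin_size / n) * |accuracy - mean confidence|,
    # which simplifies to |hits - summed confidence| / n.
    return sum(abs(h - c) for h, c in zip(hit_sum, conf_sum)) / n


def ece_gap(verbalized: Sequence[float], token: Sequence[float],
            correct: Sequence[bool], n_bins: int = 10) -> float:
    """Assumed convention: positive means verbalized confidence is better calibrated."""
    return ece(token, correct, n_bins) - ece(verbalized, correct, n_bins)
```

Under this convention, a gap near zero corresponds to the "near parity" case, and a sign flip means the better-calibrated signal changes with the protocol.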

What carries the argument

The argument rests on the protocol axes involved in comparing verbalized and token confidence signals on QA tasks: conditioning context, token-probability readout from the answer tokens, and ECE estimation.
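To make the token-readout axis concrete, the sketch below shows four common ways of turning per-token log-probabilities of an answer into a single confidence score. The readout names (`joint`, `mean`, `min`, `first`) are assumptions chosen for illustration; the summary above does not say which readouts the paper actually compares.

```python
import math
from typing import Sequence


def token_confidence(logprobs: Sequence[float], readout: str = "joint") -> float:
    """Collapse answer-token log-probabilities into one score in [0, 1].

    The readout options here are hypothetical examples, not the paper's list.
    """
    if not logprobs:
        raise ValueError("answer must contain at least one token")
    if readout == "joint":   # probability of the whole answer string
        return math.exp(sum(logprobs))
    if readout == "mean":    # length-normalized (geometric mean of token probabilities)
        return math.exp(sum(logprobs) / len(logprobs))
    if readout == "min":     # least confident token
        return math.exp(min(logprobs))
    if readout == "first":   # first answer token only
        return math.exp(logprobs[0])
    raise ValueError(f"unknown readout: {readout}")
```

For a two-token answer with token probabilities 0.5 and 0.8, the joint readout gives 0.4 while the minimum and first-token readouts both give 0.5, so the same answer can land in different calibration bins depending on the readout alone.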

If this is right

Calibration papers must specify conditioning context, scored answer string, and token readout to allow comparison.
Claims that Instruct tuning improves verbalized calibration over token probabilities may not generalize beyond particular protocols.
Verbalized confidence encodes answer plausibility and provenance in addition to correctness.
Both token and verbalized signals should be treated as measurements that depend on the chosen protocol.
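One way to act on the reporting requirement in the list above is to record the protocol as a small structured object alongside every calibration number. The sketch below is hypothetical: the field names follow the four checklist items named in the abstract (elicitation provenance, scored answer, token-probability readout, conditioning context), while the class name, the helper method, and the example values are assumptions.

```python
from dataclasses import asdict, dataclass
from typing import List


@dataclass(frozen=True)
class CalibrationProtocol:
    """Hypothetical record of the measurement choices behind one ECE number."""
    elicitation_provenance: str   # where the verbalized-confidence prompt comes from
    scored_answer: str            # which answer string receives the token score
    token_readout: str            # how the score is read from the answer tokens
    conditioning_context: str     # context under which the score is measured

    def missing_fields(self) -> List[str]:
        """Return the names of fields left blank, so gaps are visible before reporting."""
        return [name for name, value in asdict(self).items() if not value.strip()]


# Example values are illustrative only.
protocol = CalibrationProtocol(
    elicitation_provenance="single fixed prompt template",
    scored_answer="generated answer",
    token_readout="joint",
    conditioning_context="bare",
)
```

Calling `protocol.missing_fields()` before publishing a result gives a cheap check that no axis was left unspecified.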

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Studies could standardize on generated-answer bare-context to reduce apparent differences across model families.
Similar protocol sensitivity could affect other tasks that compare internal and expressed uncertainty such as hallucination detection.
Users may need to select protocols according to whether they want to measure internal token uncertainty or surface-expressed confidence.

Load-bearing premise

That fixing only the verbalized prompt while varying token-probability axes and conditioning context isolates those effects without introducing other uncontrolled differences in model outputs or benchmark behavior.

What would settle it

Re-running the same benchmarks and models under multiple conditioning contexts and finding the ECE gap sign stays consistent rather than flipping.
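The test described here reduces to checking whether the ECE gap keeps one sign across conditioning contexts. A minimal sketch, assuming the gaps have already been computed per context and that gaps within a tolerance of zero count as parity; the function names, the tolerance, and the numbers in the usage note are made up for illustration.

```python
from typing import Dict


def gap_sign(gap: float, tol: float = 0.0) -> int:
    """Return +1, -1, or 0 (parity) for one ECE gap, given a parity tolerance."""
    if gap > tol:
        return 1
    if gap < -tol:
        return -1
    return 0


def sign_is_consistent(gaps_by_context: Dict[str, float], tol: float = 0.0) -> bool:
    """True if no two contexts show gaps of opposite sign (parity is ignored)."""
    signs = {gap_sign(g, tol) for g in gaps_by_context.values()}
    signs.discard(0)
    return len(signs) <= 1
```

With invented values such as `{"bare": 0.03, "supplied_answer": -0.02}` the check returns `False`, which is the flipping pattern the paper reports; a consistently `True` result across benchmarks and models would count against the sensitivity claim.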

Figures

Figures reproduced from arXiv: 2605.27752 by Hankyeol Kim, Pilsung Kang.

Original abstract

LLM confidence calibration is often evaluated by comparing two signals: token-probability scores and verbalized confidence. These signals are sometimes treated as direct readouts of model uncertainty, but their comparison depends on measurement choices that are rarely made explicit. In the main analysis, we hold the verbalized-confidence elicitation fixed: a single prompt template, probability scale, and output format. We then vary the measurement axes that define the verbalized-vs-token comparison: which answer string receives the token-probability score, how that score is read from the answer tokens, and under which conditioning context it is measured. We evaluate this design on four QA benchmarks across three open 7-8B base/Instruct model families, with larger Qwen2.5 variants as same-family robustness checks. The resulting comparison is sensitive to these choices: conditioning context changes the sign or magnitude of the ECE gap across settings, token readout produces smaller but still sign-moving changes, and changing the ECE estimator has little effect. Under the default generated-answer, bare-context protocol, Instruct settings are close to parity rather than showing a large calibration gain for verbalized confidence. In a separate supplied-answer analysis, surface-plausible wrong answers receive nearly the same confidence as supplied gold answers, suggesting that verbalized confidence also reflects answer plausibility and provenance rather than correctness alone. We argue that both confidence signals should be treated as protocol-dependent behavioral measurements, and provide a reporting checklist covering elicitation provenance, scored answer, token-probability readout, and conditioning context.
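The abstract notes that changing the ECE estimator has little effect. As an illustration of what an alternative estimator can look like, the sketch below implements an equal-mass (quantile-binned) ECE, in which each bin holds roughly the same number of examples. Which estimators the paper actually compares is not stated in the abstract, so this choice and the function name are assumptions.

```python
from typing import Sequence


def ece_equal_mass(confidences: Sequence[float], correct: Sequence[bool],
                   n_bins: int = 10) -> float:
    """Expected calibration error with equal-mass bins (illustrative sketch)."""
    n = len(confidences)
    if n == 0 or n != len(correct):
        raise ValueError("need equally sized, non-empty inputs")
    # Sort example indices by confidence, then cut into n_bins contiguous chunks.
    order = sorted(range(n), key=lambda i: confidences[i])
    total = 0.0
    for b in range(n_bins):
        chunk = order[b * n // n_bins:(b + 1) * n // n_bins]
        if not chunk:
            continue  # more bins than examples leaves some bins empty
        conf = sum(confidences[i] for i in chunk)
        hits = sum(1 for i in chunk if correct[i])
        # (chunk_size / n) * |accuracy - mean confidence| == |hits - conf| / n
        total += abs(hits - conf) / n
    return total
```

Running the same confidence scores through this estimator and an equal-width one is a simple way to check whether a reported gap depends on the binning scheme.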

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper shows calibration comparisons flip with protocol choices on context and token readout, with a useful checklist, but the isolation of those effects from other model behaviors is not fully convincing.

The main thing to know is that this work finds the ECE gap between verbalized confidence and token probabilities changes sign or size depending on conditioning context and how the token score is read out, even when the verbalized prompt stays fixed. Under the default generated-answer bare-context setup, Instruct models end up near parity instead of showing a clear verbalized advantage.

They do a solid job running the comparison across four QA benchmarks and three 7-8B model families, with same-family larger variants for checks. Holding the verbalized elicitation constant while varying the other axes is a clean way to surface the sensitivity, and the supplied-answer analysis adds the point that verbalized scores track plausibility as much as correctness. The reporting checklist on elicitation, scored answer, readout, and context is practical and directly usable.

The soft spot is the assumption that these variations cleanly isolate the measurement axes. Different conditioning contexts can shift answer generation dynamics or token distributions in uncontrolled ways, which might drive some of the reported sign changes independently of the intended protocol effect. The abstract does not describe explicit checks for those interactions, so the magnitude of the sensitivity could be overstated. The ECE estimator change having little effect is noted but not explored in depth.

This is for people designing or comparing LLM calibration benchmarks. It flags a real methodological issue without overclaiming. The empirical setup shows honest engagement with how these measurements actually work in practice. It deserves peer review so referees can examine the full methods, data splits, and any robustness tests against the isolation concern.

Referee Report

2 major / 2 minor

Summary. The paper claims that comparisons of token-probability scores versus verbalized confidence for LLM calibration are sensitive to measurement protocols. Holding the verbalized elicitation prompt, scale, and format fixed, the authors vary conditioning context, which answer string is scored, and token readout method across four QA benchmarks and three 7-8B model families (with Qwen2.5 robustness checks). They report that context alters the sign or magnitude of ECE gaps, token readout produces smaller sign-moving changes, ECE estimator choice has little effect, Instruct models are near parity under the default generated-answer bare-context protocol, and verbalized confidence on supplied answers tracks plausibility rather than correctness alone. They conclude both signals are protocol-dependent and supply a reporting checklist.

Significance. If the empirical patterns hold, the work usefully demonstrates that apparent calibration advantages are not robust to standard but rarely documented measurement choices, supporting more careful protocol reporting in the field. The checklist directly addresses a practical gap. The multi-benchmark, multi-family design and explicit separation of verbalized elicitation from the varied axes are strengths that make the sensitivity claim falsifiable and extensible.

major comments (2)

[Abstract / main analysis] The central claim that conditioning context produces sign or magnitude changes in the ECE gap rests on the assumption that fixing the verbalized prompt isolates the token-probability axes; however, the manuscript does not report whether answer-generation distributions or token-score baselines remain stable across bare vs. supplied-answer contexts, leaving open the possibility that uncontrolled shifts in model behavior contribute to the observed flips independently of the intended protocol axes.
[Supplied-answer analysis] The observation that surface-plausible wrong answers receive nearly the same verbalized confidence as gold answers is load-bearing for the claim that verbalized confidence reflects plausibility and provenance; the manuscript should define and operationalize 'surface-plausible' (e.g., via similarity metric or human rating) and report the exact confidence values or distributions to allow readers to assess the magnitude of the effect.

minor comments (2)

The manuscript should include a table or figure explicitly listing the exact prompt template, probability scale, and output format used for verbalized elicitation so that the 'fixed' condition can be reproduced.
The robustness checks with larger Qwen2.5 variants are mentioned but not detailed; adding a short subsection or appendix table showing whether sign changes replicate would strengthen the multi-family claim.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and outline revisions to improve clarity and rigor.

Referee: [Abstract / main analysis] The central claim that conditioning context produces sign or magnitude changes in the ECE gap rests on the assumption that fixing the verbalized prompt isolates the token-probability axes; however, the manuscript does not report whether answer-generation distributions or token-score baselines remain stable across bare vs. supplied-answer contexts, leaving open the possibility that uncontrolled shifts in model behavior contribute to the observed flips independently of the intended protocol axes.

Authors: We agree that reporting stability is important for isolating the protocol axes. The design holds verbalized elicitation fixed while varying conditioning context (among other token axes), but the manuscript does not include explicit checks on generation distributions or token baselines across bare vs. supplied contexts. In revision we will add these checks (e.g., comparing answer token distributions and baseline token scores) to confirm that observed ECE sign/magnitude changes are attributable to the measured protocol variations.
Revision: yes

Referee: [Supplied-answer analysis] The observation that surface-plausible wrong answers receive nearly the same verbalized confidence as gold answers is load-bearing for the claim that verbalized confidence reflects plausibility and provenance; the manuscript should define and operationalize 'surface-plausible' (e.g., via similarity metric or human rating) and report the exact confidence values or distributions to allow readers to assess the magnitude of the effect.

Authors: We accept this point. The manuscript uses the term 'surface-plausible' without a formal definition or quantitative reporting. In the revision we will operationalize it via a similarity metric (e.g., cosine similarity of embeddings or normalized edit distance) and report the exact verbalized confidence means, distributions, or histograms comparing gold answers to these plausible incorrect answers, allowing direct assessment of the effect size.
Revision: yes
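Of the two metrics mentioned in this response, normalized edit distance is the simpler to sketch. The example below computes a Levenshtein-based similarity in [0, 1] between a gold answer and a candidate wrong answer. It is only one possible operationalization of 'surface-plausible'; the function names and any threshold applied to the score are assumptions, not something the paper or the rebuttal specifies.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(
                prev[j] + 1,               # delete ca
                cur[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),  # substitute, free if characters match
            ))
        prev = cur
    return prev[-1]


def normalized_edit_similarity(a: str, b: str) -> float:
    """1.0 for identical strings, 0.0 when every character must change."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest
```

A wrong answer with high similarity to the gold string would count as surface-plausible under this reading, and the verbalized confidence assigned to the two could then be compared directly.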

Circularity Check

0 steps flagged

Empirical protocol-sensitivity study with no derivation chain or fitted predictions

The paper performs an empirical comparison of ECE gaps under varied conditioning contexts, token readouts, and answer sources while holding the verbalized elicitation prompt fixed. No equations, first-principles derivations, parameter fits, or predictions appear in the provided text. Claims rest on direct experimental measurements across benchmarks and models rather than any reduction of outputs to inputs by construction. No self-citation load-bearing steps, uniqueness theorems, or ansatzes are invoked. This is a standard non-circular empirical measurement study.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Empirical measurement study with no mathematical derivations, free parameters, or new postulated entities described.

pith-pipeline@v0.9.1-grok · 5800 in / 1059 out tokens · 42692 ms · 2026-06-29T16:47:03.806688+00:00 · methodology
