
arxiv: 2505.24778 · v3 · submitted 2025-05-30 · 💻 cs.CL

Revisiting Epistemic Markers in Confidence Estimation: Can Markers Accurately Reflect Large Language Models' Uncertainty?

Pith reviewed 2026-05-19 12:41 UTC · model grok-4.3

keywords: epistemic markers · confidence estimation · large language models · uncertainty quantification · out-of-distribution · question answering · in-distribution generalization

The pith

Epistemic markers reflect LLM confidence reliably only within the same distribution but become inconsistent out of distribution.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper asks whether verbal markers such as 'fairly confident' in LLM answers track the model's actual uncertainty. It defines the confidence of a marker as the accuracy achieved on the answers in which the model produces that marker, then checks whether this value stays stable when models answer questions from the same distribution or from new ones. Experiments across question-answering datasets show that the accuracy tied to each marker holds up within the same distribution for both open-source and proprietary models, yet shifts noticeably when questions come from different distributions. This matters for high-stakes uses where people rely on the markers to decide how much to trust an answer. The results indicate that current marker usage does not provide a stable signal of intrinsic uncertainty across contexts.

Core claim

We define marker confidence as the observed accuracy when a model employs an epistemic marker. We evaluate its stability across multiple question-answering datasets in both in-distribution and out-of-distribution settings for open-source and proprietary LLMs. Our results show that while markers generalize well within the same distribution, their confidence is inconsistent in out-of-distribution scenarios. These findings raise significant concerns about the reliability of epistemic markers for confidence estimation, underscoring the need for improved alignment between marker based confidence and actual model uncertainty.

What carries the argument

marker confidence, defined as the observed accuracy when the model uses a particular epistemic marker in its generated answer
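
As a minimal sketch of this definition, marker confidence can be computed as a per-marker accuracy over labeled answers. The function name, the record layout, and the toy data below are assumptions made for illustration; they are not taken from the paper or its released code.

```python
from collections import defaultdict


def marker_confidence(records):
    """Map each marker to the accuracy of the answers that used it.

    `records` is an iterable of (marker, is_correct) pairs. This layout
    is a hypothetical choice for the sketch, not the paper's format.
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    for marker, is_correct in records:
        totals[marker] += 1
        hits[marker] += int(is_correct)
    return {marker: hits[marker] / totals[marker] for marker in totals}


# Toy data, invented purely for illustration.
toy_records = [
    ("fairly confident", True),
    ("fairly confident", True),
    ("fairly confident", False),
    ("fairly confident", True),
    ("not sure", False),
    ("not sure", True),
]
conf = marker_confidence(toy_records)
```

On the toy data, three of the four 'fairly confident' answers are correct, so that marker's confidence is 0.75, and 'not sure' comes out at 0.5.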

If this is right

  • Marker-based confidence estimates must be treated as dependent on the data distribution rather than universal.
  • In unfamiliar domains the same verbal marker can correspond to different levels of actual answer accuracy.
  • Improved training or calibration methods are required to make marker usage track internal uncertainty more consistently.
  • Both open-source and proprietary models display comparable patterns of marker inconsistency out of distribution.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Fine-tuning on deliberately mixed distributions could reduce the observed shift in marker accuracy.
  • Explicit numerical confidence scores extracted from model logits may prove more stable than verbal markers across domains.
  • Application developers may need to monitor and adjust how markers are generated separately for each new domain.

Load-bearing premise

The argument assumes that the accuracy observed when a marker appears is a faithful proxy for the model's intrinsic uncertainty, and that the chosen datasets cleanly separate in-distribution from out-of-distribution regimes.

What would settle it

Finding that the accuracy associated with each epistemic marker stays essentially the same on new out-of-distribution datasets as it does on the original in-distribution sets would show the claimed inconsistency does not hold.
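
One way to make this check concrete is to compare per-marker accuracies between the two settings and look at the gaps. The sketch below assumes the accuracies have already been computed and stored as dictionaries. The helper name and all numbers are hypothetical, and the paper may use a different stability metric.

```python
def marker_gaps(id_conf, ood_conf):
    """Absolute accuracy gap for every marker observed in both settings."""
    shared = id_conf.keys() & ood_conf.keys()
    return {m: abs(id_conf[m] - ood_conf[m]) for m in shared}


# Invented numbers, for illustration only.
id_conf = {"fairly confident": 0.80, "not sure": 0.40}
ood_conf = {"fairly confident": 0.55, "not sure": 0.35, "certain": 0.90}

gaps = marker_gaps(id_conf, ood_conf)
mean_gap = sum(gaps.values()) / len(gaps)
```

Markers that appear in only one setting (here 'certain') are skipped, because there is nothing to compare them against. Gaps close to zero on genuinely new datasets would count against the claimed inconsistency.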

Original abstract

As large language models (LLMs) are increasingly used in high-stakes domains, accurately assessing their confidence is crucial. Humans typically express confidence through epistemic markers (e.g., "fairly confident") instead of numerical values. However, it remains unclear whether LLMs consistently use these markers to reflect their intrinsic confidence due to the difficulty of quantifying uncertainty associated with various markers. To address this gap, we first define marker confidence as the observed accuracy when a model employs an epistemic marker. We evaluate its stability across multiple question-answering datasets in both in-distribution and out-of-distribution settings for open-source and proprietary LLMs. Our results show that while markers generalize well within the same distribution, their confidence is inconsistent in out-of-distribution scenarios. These findings raise significant concerns about the reliability of epistemic markers for confidence estimation, underscoring the need for improved alignment between marker based confidence and actual model uncertainty. Our code is available at https://github.com/HKUST-KnowComp/MarConf.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript defines marker confidence as the empirical accuracy of LLM answers conditional on the presence of specific epistemic markers (e.g., 'fairly confident'). It evaluates the stability of these per-marker accuracies across multiple QA datasets in both in-distribution and out-of-distribution regimes for open-source and proprietary models, reporting that markers generalize reliably within a distribution but exhibit inconsistency across distributions. The authors conclude that this undermines the use of epistemic markers as faithful readouts of intrinsic model uncertainty.

Significance. If the central empirical pattern is robust to the concerns below, the result would be moderately significant for LLM calibration research: it supplies concrete evidence that natural-language confidence expressions fail to track accuracy under distribution shift, which is directly relevant to high-stakes deployment. The public release of code is a clear strength that supports reproducibility and follow-up experiments.

major comments (2)
  1. [§3] §3 (Definition of marker confidence): the central claim equates observed accuracy conditional on marker presence with a readout of intrinsic uncertainty. In OOD regimes this proxy is load-bearing for the inconsistency result, yet the paper does not rule out that accuracy drops arise from shifts in answer phrasing, marker frequency, or dataset-specific lexical patterns rather than from mis-calibrated internal uncertainty. An ablation that holds answer content fixed while varying only marker insertion (or that conditions on marker while controlling for question difficulty and generation length) is needed to support the attribution.
  2. [§4.2] §4.2 and Table 3: the reported OOD inconsistency is presented without statistical tests or confidence intervals on the per-marker accuracy differences. Given that the abstract and results emphasize 'inconsistency,' the absence of significance testing leaves the strength of the cross-distribution claim only partially supported.
minor comments (2)
  1. [Abstract / §3] The abstract and §3 do not specify the exact list of epistemic markers or the string-matching rules used for extraction; adding this list (perhaps as an appendix table) would improve replicability.
  2. [Figure 2] Figure 2 caption should explicitly state the number of samples per dataset and whether error bars represent standard error or standard deviation.
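
To illustrate the kind of extraction rule the first minor comment asks the authors to document, here is a sketch of case-insensitive, whole-phrase matching. The marker list is hypothetical, and nothing here reflects the rules the paper actually uses.

```python
import re

# Hypothetical marker list; the paper's actual list is not given in this review.
MARKERS = ["fairly confident", "not sure", "almost certain", "confident"]

# Longest markers first, so that when two markers could start at the same
# position the longer phrase is preferred.
_PATTERN = re.compile(
    "|".join(
        r"\b" + re.escape(m) + r"\b"
        for m in sorted(MARKERS, key=len, reverse=True)
    ),
    re.IGNORECASE,
)


def extract_marker(answer):
    """Return the first marker found in `answer` (lowercased), or None."""
    match = _PATTERN.search(answer)
    return match.group(0).lower() if match else None


first = extract_marker("I am fairly confident the answer is Paris.")
second = extract_marker("The answer is 4.")
```

This sketch does not handle negation ('not confident' would be read as 'confident'), which is exactly the sort of detail a published rule list would need to pin down.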

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. The comments highlight important aspects for strengthening the attribution of results to model uncertainty and for improving statistical support. We address each major comment below and have revised the manuscript to incorporate the suggested analyses.

Point-by-point responses
  1. Referee: [§3] §3 (Definition of marker confidence): the central claim equates observed accuracy conditional on marker presence with a readout of intrinsic uncertainty. In OOD regimes this proxy is load-bearing for the inconsistency result, yet the paper does not rule out that accuracy drops arise from shifts in answer phrasing, marker frequency, or dataset-specific lexical patterns rather than from mis-calibrated internal uncertainty. An ablation that holds answer content fixed while varying only marker insertion (or that conditions on marker while controlling for question difficulty and generation length) is needed to support the attribution.

    Authors: We agree that additional controls are necessary to better isolate the contribution of epistemic markers from potential confounds such as lexical patterns or answer phrasing shifts. In the revised manuscript we have added a new ablation in Section 3 (with details in the appendix) that re-uses the same model-generated answers across conditions and systematically varies only the presence and type of epistemic marker. We further condition on question difficulty (via baseline accuracy without markers) and control for generation length. The updated results continue to show OOD inconsistency after these controls, supporting the original interpretation while addressing the referee's concern. revision: yes

  2. Referee: [§4.2] §4.2 and Table 3: the reported OOD inconsistency is presented without statistical tests or confidence intervals on the per-marker accuracy differences. Given that the abstract and results emphasize 'inconsistency,' the absence of significance testing leaves the strength of the cross-distribution claim only partially supported.

    Authors: We acknowledge that the original submission lacked formal statistical assessment of the cross-distribution differences. In the revised version we have updated Table 3 and the accompanying text to include 95% bootstrap confidence intervals for each per-marker accuracy as well as paired statistical tests (McNemar’s test for accuracy differences) between in-distribution and out-of-distribution settings. The reported inconsistencies remain statistically significant (p < 0.05) for the majority of markers and models, which we now explicitly state. revision: yes
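
The simulated reply mentions 95% bootstrap confidence intervals but does not describe the procedure. As a sketch under that caveat, a percentile bootstrap over the 0/1 correctness outcomes for a single marker could look like the following. The function name, resample count, seed, and toy outcomes are all assumptions.

```python
import random


def bootstrap_ci(outcomes, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the mean of 0/1 outcomes."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    n = len(outcomes)
    # Resample with replacement and record the accuracy of each resample.
    means = sorted(
        sum(rng.choices(outcomes, k=n)) / n for _ in range(n_resamples)
    )
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Toy outcomes for one marker: 30 correct answers and 10 incorrect ones.
outcomes = [1] * 30 + [0] * 10
point = sum(outcomes) / len(outcomes)
lower, upper = bootstrap_ci(outcomes)
```

Non-overlapping intervals for the same marker in the in-distribution and out-of-distribution settings would be informal evidence of a shift. The paired test named in the reply is a separate analysis and is not shown here.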

Circularity Check

0 steps flagged

No circularity: purely empirical computation of marker confidence from held-out accuracy

Full rationale

The paper defines marker confidence explicitly as observed accuracy conditional on epistemic marker presence and computes this quantity directly on held-out QA datasets for both ID and OOD regimes. No equations, fitted parameters, or derivations are presented; results are reported as raw empirical observations. No self-citations are invoked to justify uniqueness or load-bearing premises, and the central claim (inconsistency in OOD) follows from straightforward conditional accuracy measurements rather than reducing to any input by construction. The study is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the assumption that marker presence can be automatically detected and that the chosen QA datasets form representative in- and out-of-distribution splits; no free parameters or invented entities are introduced.

axioms (1)
  • domain assumption: Epistemic markers can be reliably identified in LLM-generated text, and their conditional accuracy measures intrinsic uncertainty.
    Invoked when defining marker confidence and when interpreting accuracy differences as evidence of unreliable uncertainty signaling.

