Revisiting Epistemic Markers in Confidence Estimation: Can Markers Accurately Reflect Large Language Models' Uncertainty?

Jiayu Liu; Qing Zong; Weiqi Wang; Yangqiu Song

arxiv: 2505.24778 · v3 · submitted 2025-05-30 · 💻 cs.CL

Pith reviewed 2026-05-19 12:41 UTC · model grok-4.3

classification 💻 cs.CL

keywords epistemic markers · confidence estimation · large language models · uncertainty quantification · out-of-distribution · question answering · in-distribution generalization

The pith

Epistemic markers reflect LLM confidence reliably only within the same distribution but become inconsistent out of distribution.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper asks whether verbal markers such as 'fairly confident' in LLM answers track the model's actual uncertainty. It measures marker confidence by the accuracy achieved whenever the model produces a given marker and checks whether this value stays stable when models answer questions from the same distribution or from new ones. Experiments across question-answering datasets show that the accuracy tied to each marker holds up inside the training distribution for both open-source and proprietary models, yet shifts noticeably when questions come from different distributions. This matters for high-stakes uses where people rely on the markers to decide how much to trust an answer. The results indicate that current marker usage does not provide a stable signal of intrinsic uncertainty across contexts.

Core claim

We define marker confidence as the observed accuracy when a model employs an epistemic marker. We evaluate its stability across multiple question-answering datasets in both in-distribution and out-of-distribution settings for open-source and proprietary LLMs. Our results show that while markers generalize well within the same distribution, their confidence is inconsistent in out-of-distribution scenarios. These findings raise significant concerns about the reliability of epistemic markers for confidence estimation, underscoring the need for improved alignment between marker based confidence and actual model uncertainty.

What carries the argument

marker confidence, defined as the observed accuracy when the model uses a particular epistemic marker in its generated answer
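
The definition above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name `marker_confidence`, the `(marker, is_correct)` record format, and the toy records are all assumptions made for the example.

```python
from collections import defaultdict


def marker_confidence(records):
    """Return {marker: accuracy} from an iterable of (marker, is_correct) pairs.

    Hypothetical helper: each record says which epistemic marker the model
    used in an answer and whether that answer was judged correct.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for marker, is_correct in records:
        totals[marker] += 1
        hits[marker] += int(is_correct)
    # Accuracy conditional on the marker having been used.
    return {marker: hits[marker] / totals[marker] for marker in totals}


# Toy records, invented for illustration only (not data from the paper).
records = [
    ("fairly confident", True),
    ("fairly confident", True),
    ("fairly confident", False),
    ("not sure", False),
    ("not sure", True),
]
conf = marker_confidence(records)
```

Under this reading, a marker's confidence is a property of the dataset it was measured on, which is why the paper can ask whether the value transfers to other datasets.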

If this is right

Marker-based confidence estimates must be treated as dependent on the data distribution rather than universal.
In unfamiliar domains the same verbal marker can correspond to different levels of actual answer accuracy.
Improved training or calibration methods are required to make marker usage track internal uncertainty more consistently.
Both open-source and proprietary models display comparable patterns of marker inconsistency outside the training distribution.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Fine-tuning on deliberately mixed distributions could reduce the observed shift in marker accuracy.
Explicit numerical confidence scores extracted from model logits may prove more stable than verbal markers across domains.
Application developers may need to monitor and adjust how markers are generated separately for each new domain.

Load-bearing premise

The assumption that accuracy observed when a marker appears is a faithful proxy for the model's intrinsic uncertainty and that the chosen datasets cleanly separate in-distribution from out-of-distribution regimes.

What would settle it

Finding that the accuracy associated with each epistemic marker stays essentially the same on new out-of-distribution datasets as it does on the original in-distribution sets would show the claimed inconsistency does not hold.
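
One way to make this check concrete is to compare per-marker accuracies measured on an in-distribution set with those measured on an out-of-distribution set. The sketch below is an assumption-laden illustration: the gap measure (signed difference per shared marker, plus a mean absolute gap) is chosen here for simplicity and is not claimed to be the paper's metric, and the numbers are invented.

```python
def marker_gaps(id_conf, ood_conf):
    """Signed OOD-minus-ID accuracy gap for every marker seen in both settings."""
    shared = sorted(set(id_conf) & set(ood_conf))
    return {marker: ood_conf[marker] - id_conf[marker] for marker in shared}


def mean_abs_gap(gaps):
    """Average absolute gap; 0.0 when no marker is shared."""
    if not gaps:
        return 0.0
    return sum(abs(g) for g in gaps.values()) / len(gaps)


# Toy per-marker accuracies, invented for illustration only.
id_conf = {"fairly confident": 0.80, "not sure": 0.40, "certain": 0.95}
ood_conf = {"fairly confident": 0.55, "not sure": 0.45}

gaps = marker_gaps(id_conf, ood_conf)
summary = mean_abs_gap(gaps)
```

A mean absolute gap near zero on new out-of-distribution sets would correspond to the outcome described above, in which the claimed inconsistency does not hold.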

Original abstract

As large language models (LLMs) are increasingly used in high-stakes domains, accurately assessing their confidence is crucial. Humans typically express confidence through epistemic markers (e.g., "fairly confident") instead of numerical values. However, it remains unclear whether LLMs consistently use these markers to reflect their intrinsic confidence due to the difficulty of quantifying uncertainty associated with various markers. To address this gap, we first define marker confidence as the observed accuracy when a model employs an epistemic marker. We evaluate its stability across multiple question-answering datasets in both in-distribution and out-of-distribution settings for open-source and proprietary LLMs. Our results show that while markers generalize well within the same distribution, their confidence is inconsistent in out-of-distribution scenarios. These findings raise significant concerns about the reliability of epistemic markers for confidence estimation, underscoring the need for improved alignment between marker based confidence and actual model uncertainty. Our code is available at https://github.com/HKUST-KnowComp/MarConf.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Markers give stable accuracy signals inside the distribution but inconsistent ones outside it, though the drop may track phrasing changes more than intrinsic uncertainty.

Hi,

The main thing to know about this paper is that it reports epistemic markers producing consistent accuracy levels within the same data distribution but inconsistent levels when the questions come from a different distribution, and this pattern appears in both open-source and proprietary LLMs.

They do a good job defining marker confidence clearly as the accuracy observed on responses that include a particular epistemic marker. The design runs this measure across multiple question-answering datasets with an in-distribution versus out-of-distribution split, covering both kinds of models. Making the code public at the GitHub link is useful if someone wants to replicate or extend the experiments.

The softer part is the link between the observed inconsistency and actual model uncertainty. The accuracy given a marker could drop in out-of-distribution cases simply because the model changes its answer style, uses markers at different rates, or pairs them with different quality outputs due to domain-specific patterns it learned. Without an experiment that keeps the underlying answer fixed and only varies the marker, or that controls for question difficulty and generation length, it's difficult to say the inconsistency comes from misaligned uncertainty rather than surface-level phrasing shifts. The abstract also does not detail the exact rules for pulling markers out of the generated text or how the out-of-distribution sets were constructed, so the strength of the central claim depends on those choices in the full paper.

This kind of work is relevant for people who study or deploy LLMs in settings where natural language confidence expressions are used instead of numbers. A reader interested in calibration or safe use of LLMs would find the in-distribution stability versus out-of-distribution inconsistency worth examining.

The empirical comparison is grounded enough to justify sending it to peer review, provided the authors add the missing methodological details and address the possible confounding from phrasing. I would recommend putting it through review after those clarifications.

Best regards,

Referee Report

2 major / 2 minor

Summary. The manuscript defines marker confidence as the empirical accuracy of LLM answers conditional on the presence of specific epistemic markers (e.g., 'fairly confident'). It evaluates the stability of these per-marker accuracies across multiple QA datasets in both in-distribution and out-of-distribution regimes for open-source and proprietary models, reporting that markers generalize reliably within a distribution but exhibit inconsistency across distributions. The authors conclude that this undermines the use of epistemic markers as faithful readouts of intrinsic model uncertainty.

Significance. If the central empirical pattern is robust to the concerns below, the result would be moderately significant for LLM calibration research: it supplies concrete evidence that natural-language confidence expressions fail to track accuracy under distribution shift, which is directly relevant to high-stakes deployment. The public release of code is a clear strength that supports reproducibility and follow-up experiments.

major comments (2)

[§3] §3 (Definition of marker confidence): the central claim equates observed accuracy conditional on marker presence with a readout of intrinsic uncertainty. In OOD regimes this proxy is load-bearing for the inconsistency result, yet the paper does not rule out that accuracy drops arise from shifts in answer phrasing, marker frequency, or dataset-specific lexical patterns rather than from mis-calibrated internal uncertainty. An ablation that holds answer content fixed while varying only marker insertion (or that conditions on marker while controlling for question difficulty and generation length) is needed to support the attribution.
[§4.2] §4.2 and Table 3: the reported OOD inconsistency is presented without statistical tests or confidence intervals on the per-marker accuracy differences. Given that the abstract and results emphasize 'inconsistency,' the absence of significance testing leaves the strength of the cross-distribution claim only partially supported.
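
The confidence intervals requested in the second major comment could be obtained with a percentile bootstrap over the answers that carry a given marker. The following is a minimal sketch under stated assumptions: the function name, the resample count, the fixed seed, and the toy outcomes are illustrative choices, not details taken from the manuscript.

```python
import random


def bootstrap_ci(outcomes, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the mean of a list of 0/1 outcomes.

    `outcomes` holds one entry per answer that used a given marker:
    1 if the answer was correct, 0 otherwise (hypothetical format).
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(outcomes)
    # Resample with replacement and record the accuracy of each resample.
    means = sorted(
        sum(rng.choices(outcomes, k=n)) / n for _ in range(n_resamples)
    )
    lower_idx = int(round((alpha / 2) * n_resamples))
    upper_idx = int(round((1 - alpha / 2) * n_resamples)) - 1
    return means[lower_idx], means[upper_idx]


# Toy outcomes, invented for illustration: 70 correct answers out of 100.
outcomes = [1] * 70 + [0] * 30
low, high = bootstrap_ci(outcomes)
```

Intervals computed this way for the same marker in two settings give a rough visual check on whether an accuracy difference exceeds sampling noise; they do not replace a formal test.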

minor comments (2)

[Abstract / §3] The abstract and §3 do not specify the exact list of epistemic markers or the string-matching rules used for extraction; adding this list (perhaps as an appendix table) would improve replicability.
[Figure 2] Figure 2 caption should explicitly state the number of samples per dataset and whether error bars represent standard error or standard deviation.
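
To illustrate why the extraction rules raised in the first minor comment matter, here is a minimal sketch of case-insensitive phrase matching. The marker list and the helper `extract_marker` are hypothetical; the paper's actual marker inventory and matching rules are not given in the text reviewed here.

```python
import re

# Hypothetical marker list, for illustration only.
MARKERS = ["fairly confident", "not sure", "almost certain"]

# Whole-phrase, case-insensitive alternation. Longer phrases are listed first
# so that, among alternatives starting at the same position, the longer wins.
PATTERN = re.compile(
    "|".join(
        r"\b" + re.escape(m) + r"\b"
        for m in sorted(MARKERS, key=len, reverse=True)
    ),
    flags=re.IGNORECASE,
)


def extract_marker(text):
    """Return the first (leftmost) marker found in `text`, lowercased, or None."""
    match = PATTERN.search(text)
    return match.group(0).lower() if match else None


found = extract_marker("I am Fairly Confident the answer is Paris.")
missing = extract_marker("The answer is Paris.")
```

Even in this toy version, choices such as taking only the first match or ignoring negation ("not fairly confident") change which answers are counted under each marker, which is why a published list and rule set would help replication.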

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. The comments highlight important aspects for strengthening the attribution of results to model uncertainty and for improving statistical support. We address each major comment below and have revised the manuscript to incorporate the suggested analyses.

Referee: [§3] §3 (Definition of marker confidence): the central claim equates observed accuracy conditional on marker presence with a readout of intrinsic uncertainty. In OOD regimes this proxy is load-bearing for the inconsistency result, yet the paper does not rule out that accuracy drops arise from shifts in answer phrasing, marker frequency, or dataset-specific lexical patterns rather than from mis-calibrated internal uncertainty. An ablation that holds answer content fixed while varying only marker insertion (or that conditions on marker while controlling for question difficulty and generation length) is needed to support the attribution.

Authors: We agree that additional controls are necessary to better isolate the contribution of epistemic markers from potential confounds such as lexical patterns or answer phrasing shifts. In the revised manuscript we have added a new ablation in Section 3 (with details in the appendix) that re-uses the same model-generated answers across conditions and systematically varies only the presence and type of epistemic marker. We further condition on question difficulty (via baseline accuracy without markers) and control for generation length. The updated results continue to show OOD inconsistency after these controls, supporting the original interpretation while addressing the referee's concern.

Revision: yes

Referee: [§4.2] §4.2 and Table 3: the reported OOD inconsistency is presented without statistical tests or confidence intervals on the per-marker accuracy differences. Given that the abstract and results emphasize 'inconsistency,' the absence of significance testing leaves the strength of the cross-distribution claim only partially supported.

Authors: We acknowledge that the original submission lacked formal statistical assessment of the cross-distribution differences. In the revised version we have updated Table 3 and the accompanying text to include 95% bootstrap confidence intervals for each per-marker accuracy as well as paired statistical tests (McNemar’s test for accuracy differences) between in-distribution and out-of-distribution settings. The reported inconsistencies remain statistically significant (p < 0.05) for the majority of markers and models, which we now explicitly state.

Revision: yes
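
For readers unfamiliar with the test named in this response, the sketch below computes an exact two-sided McNemar p-value from the two discordant counts. It is only an illustration: McNemar's test requires paired observations, and how answers would be paired across settings is not described in the text above, so the counts `b` and `c` are hypothetical.

```python
from math import comb


def mcnemar_exact(b, c):
    """Exact two-sided McNemar p-value.

    b: pairs correct in condition A but wrong in condition B (hypothetical)
    c: pairs wrong in condition A but correct in condition B (hypothetical)
    Under the null hypothesis, b follows Binomial(b + c, 0.5).
    """
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs, so no evidence of a difference
    k = min(b, c)
    # Probability of a split at least as lopsided as the one observed.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Toy discordant counts, invented for illustration only.
p_balanced = mcnemar_exact(5, 5)
p_skewed = mcnemar_exact(10, 0)
```

Only discordant pairs enter the statistic, so the test is uninformative when the two conditions cannot be matched item by item.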

Circularity Check

0 steps flagged

No circularity: purely empirical computation of marker confidence from held-out accuracy

The paper defines marker confidence explicitly as observed accuracy conditional on epistemic marker presence and computes this quantity directly on held-out QA datasets for both ID and OOD regimes. No equations, fitted parameters, or derivations are presented; results are reported as raw empirical observations. No self-citations are invoked to justify uniqueness or load-bearing premises, and the central claim (inconsistency in OOD) follows from straightforward conditional accuracy measurements rather than reducing to any input by construction. The study is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the assumption that marker presence can be automatically detected and that the chosen QA datasets form representative in- and out-of-distribution splits; no free parameters or invented entities are introduced.

axioms (1)

Domain assumption: Epistemic markers can be reliably identified in LLM-generated text, and their conditional accuracy measures intrinsic uncertainty.
Invoked when defining marker confidence and when interpreting accuracy differences as evidence of unreliable uncertainty signaling.
