Calibration Is Not Enough: Evaluating Confidence Estimation Under Language Variations

Benjamin Roth; Dennis Ulmer; Hinrich Schütze; Terra Blevins; Yihong Liu; Yuxi Xia

arxiv: 2601.08064 · v2 · pith:7PZYTFEGnew · submitted 2026-01-12 · 💻 cs.CL

Yuxi Xia, Dennis Ulmer, Terra Blevins, Yihong Liu, Hinrich Schütze, Benjamin Roth

classification 💻 cs.CL

keywords confidence · answers · semantically · language · answer · correctness · different

Confidence estimation (CE) indicates how reliable the answers of large language models are and impacts user trust and decision-making. Existing evaluations mainly concern the alignment between confidence and correctness, but ignore the variability of language: confidence estimates should remain consistent under semantically equivalent prompts or answer variations, while changing when answer meaning differs, as this may indicate a change in correctness. Therefore, we introduce a novel evaluation framework based on three complementary properties: robustness to prompt perturbations, stability across semantically equivalent answers, and sensitivity to semantically different answers. We show that these metrics are largely independent from existing CE metrics, and that common CE methods often fail on them: while most methods achieve high robustness and stability, they struggle to distinguish semantically different answers, potentially because they do not effectively leverage generation-side information. Overall, our framework exposes overlooked limitations of current CE evaluations and provides guidance for selecting confidence estimators for real-world applications.
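
To make the three properties concrete, here is a minimal Python sketch. It is not the paper's metric: the abstract does not give formulas, so the function names (spread, robustness, stability, sensitivity) and the use of mean absolute deviation are illustrative assumptions. It only shows the intuition that confidence should barely move across equivalent prompts or answers, and should move when the answer's meaning changes.

```python
from statistics import mean


def spread(confidences):
    """Mean absolute deviation of a list of confidences from their mean.

    Assumption: this is one simple way to quantify 'how much confidence
    moves'; the paper may use a different measure.
    """
    center = mean(confidences)
    return mean(abs(c - center) for c in confidences)


def robustness(conf_per_prompt_variant):
    """Hypothetical score: 1 minus the spread of confidence across
    semantically equivalent prompt perturbations (higher is better)."""
    return 1.0 - spread(conf_per_prompt_variant)


def stability(conf_per_equivalent_answer):
    """Hypothetical score: 1 minus the spread of confidence across
    semantically equivalent rewordings of the same answer."""
    return 1.0 - spread(conf_per_equivalent_answer)


def sensitivity(conf_original, conf_per_different_answer):
    """Hypothetical score: average absolute change in confidence when the
    answer is replaced by a semantically different one (higher is better)."""
    return mean(abs(conf_original - c) for c in conf_per_different_answer)


# Made-up confidences for illustration only (not results from the paper).
prompt_variants = [0.90, 0.88, 0.91]      # same question, paraphrased prompts
equivalent_answers = [0.90, 0.89, 0.90]   # same answer, reworded
different_answers = [0.88, 0.86]          # answers with a different meaning

print("robustness :", round(robustness(prompt_variants), 3))
print("stability  :", round(stability(equivalent_answers), 3))
print("sensitivity:", round(sensitivity(0.90, different_answers), 3))
```

With these made-up numbers the estimator looks robust and stable but hardly reacts to a change in answer meaning, which is the failure pattern the abstract describes for common CE methods.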

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Beyond Accuracy: Policy Invariance as a Reliability Test for LLM Safety Judges
cs.AI 2026-05 unverdicted novelty 6.0

LLM safety judges flip verdicts on equivalent policy rewrites up to 9.1% of the time and cannot distinguish meaningful from meaningless changes, requiring new invariance-based reliability metrics.

Enhancing Confidence Estimation in Telco LLMs via Twin-Pass CoT-Ensembling
cs.LG 2026-04 unverdicted novelty 6.0

Twin-Pass Chain-of-Thought Ensembling cuts Expected Calibration Error by up to 88% in Gemma-3 models on TeleQnA, ORANBench, and srsRANBench.