Semantic Uncertainty: Linguistic Invariances for Uncertainty Estimation in Natural Language Generation
Pith reviewed 2026-05-12 17:55 UTC · model grok-4.3
The pith
Semantic entropy, which groups model outputs by shared meaning before measuring uncertainty, predicts answer accuracy more reliably than token-level entropy on question answering tasks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The authors introduce semantic entropy as an entropy measure over semantic equivalence classes of generated sentences rather than over individual token sequences. Sentences are grouped into classes that share the same meaning through an unsupervised procedure that queries the language model itself; the entropy is then taken with respect to the total probability mass assigned to each class. This construction is invariant to linguistic rephrasings that preserve meaning and requires no model modifications, additional training data, or auxiliary models. Ablation studies on multiple question answering benchmarks show that semantic entropy is more predictive of model accuracy than comparable token
What carries the argument
Semantic entropy: entropy computed over clusters of semantically equivalent generations identified unsupervised by the model itself.
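As a rough illustration of the quantity described above, the following Python sketch computes an entropy over meaning clusters from a handful of sampled answers. It assumes each sample already comes with a sequence log-probability and a cluster label. The function names, the toy numbers, and the choice to renormalise over the sampled clusters are assumptions of this sketch, not details taken from the paper.

```python
import math
from collections import defaultdict


def logsumexp(values):
    """Numerically stable log(sum(exp(v))) for a non-empty list."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def semantic_entropy(log_probs, cluster_ids):
    """Entropy over meaning clusters, estimated from sampled generations.

    log_probs[i]   : sequence log-probability of sample i (assumed given)
    cluster_ids[i] : label of the meaning cluster that sample i belongs to
    """
    by_cluster = defaultdict(list)
    for lp, cid in zip(log_probs, cluster_ids):
        by_cluster[cid].append(lp)
    # Log of the total probability mass assigned to each cluster.
    cluster_log_mass = [logsumexp(lps) for lps in by_cluster.values()]
    # Assumption of this sketch: renormalise over the sampled clusters.
    log_z = logsumexp(cluster_log_mass)
    probs = [math.exp(lm - log_z) for lm in cluster_log_mass]
    return -sum(p * math.log(p) for p in probs if p > 0)


# Toy example: two paraphrases of one answer plus one different answer.
log_probs = [math.log(0.4), math.log(0.4), math.log(0.2)]
merged = semantic_entropy(log_probs, ["paris", "paris", "rome"])
unmerged = semantic_entropy(log_probs, [0, 1, 2])  # every string its own class
print(merged < unmerged)
```

In the toy example, merging the two paraphrases into one class lowers the entropy relative to treating every string as distinct, which is the invariance to rephrasing that the measure is meant to capture.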
Load-bearing premise
Semantic equivalence classes among generated sentences can be reliably identified in an unsupervised manner using the language model itself.
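One way such classes could be formed is sketched below, assuming that two answers count as equivalent when each entails the other in the context of the question. The `entails` callable is hypothetical: it stands in for whatever judge is actually queried, and `toy_entails` is only a string-matching placeholder so the example runs.

```python
from typing import Callable, List


def cluster_by_meaning(
    question: str,
    answers: List[str],
    entails: Callable[[str, str, str], bool],
) -> List[int]:
    """Assign each answer a cluster id using bidirectional entailment.

    `entails(question, premise, hypothesis)` is a hypothetical callable
    wrapping the equivalence judge; it is not a real library API.
    """
    representatives: List[str] = []  # first member seen for each cluster
    labels: List[int] = []
    for answer in answers:
        for cid, rep in enumerate(representatives):
            # Same meaning only if entailment holds in both directions.
            if entails(question, rep, answer) and entails(question, answer, rep):
                labels.append(cid)
                break
        else:
            # No existing cluster matched, so open a new one.
            representatives.append(answer)
            labels.append(len(representatives) - 1)
    return labels


def toy_entails(question: str, premise: str, hypothesis: str) -> bool:
    # Placeholder judge: answers "entail" each other if their last word matches.
    last = lambda s: s.lower().rstrip(".").split()[-1]
    return last(premise) == last(hypothesis)


labels = cluster_by_meaning(
    "What is the capital of France?", ["Paris", "It's Paris.", "Rome"], toy_entails
)
```

Comparing each new answer against a single representative per cluster keeps the number of judge calls low, at the cost of relying on the judge being roughly transitive. That reliance is exactly the premise flagged above.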
What would settle it
A dataset or experiment in which the unsupervised clustering places semantically distinct answers into the same class (or vice versa) and semantic entropy loses its advantage in predicting accuracy over baselines.
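Such an experiment needs a score for how well each uncertainty measure separates wrong answers from right ones. The sketch below uses AUROC for that purpose. The choice of metric and all names are assumptions of this sketch, and the implementation is a plain pairwise count rather than a library call.

```python
def auroc(scores, labels):
    """Probability that a random positive scores above a random negative.

    Ties count as one half. `labels` are truthy for the positive class.
    """
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def compare_measures(semantic_scores, baseline_scores, is_correct):
    """Score both uncertainty measures on the same set of questions."""
    # High uncertainty should flag wrong answers, so errors are the positives.
    is_error = [not c for c in is_correct]
    return {
        "semantic": auroc(semantic_scores, is_error),
        "baseline": auroc(baseline_scores, is_error),
    }
```

Running `compare_measures` separately on questions where the clustering is known to be wrong and on questions where it is right would show whether the advantage survives clustering errors.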
Original abstract
We introduce a method to measure uncertainty in large language models. For tasks like question answering, it is essential to know when we can trust the natural language outputs of foundation models. We show that measuring uncertainty in natural language is challenging because of "semantic equivalence" -- different sentences can mean the same thing. To overcome these challenges we introduce semantic entropy -- an entropy which incorporates linguistic invariances created by shared meanings. Our method is unsupervised, uses only a single model, and requires no modifications to off-the-shelf language models. In comprehensive ablation studies we show that the semantic entropy is more predictive of model accuracy on question answering data sets than comparable baselines.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces semantic entropy, an uncertainty measure for natural language generation that incorporates linguistic invariances arising from semantic equivalence among different phrasings. The approach is unsupervised, relies on a single off-the-shelf language model without modifications, and is evaluated via ablation studies claiming superior predictive power for model accuracy on question-answering datasets relative to standard baselines.
Significance. If the central empirical claim holds after addressing the clustering validation, the work would offer a practical advance in uncertainty estimation for NLG by handling semantic equivalence without external supervision or model changes. The unsupervised single-model design is a notable strength that could facilitate broader adoption in reliability-critical applications.
major comments (2)
- [Ablation studies] The ablation studies' claim of superior predictive performance for semantic entropy depends on the reliability of the unsupervised semantic equivalence clustering step, yet no details are supplied on the exact prompting/embedding procedure used to form clusters or on any independent validation of cluster quality (e.g., human agreement rates stratified by model confidence level).
- [Method] Because equivalence judgments are obtained from the same language model whose uncertainty is being quantified, the clustering step risks producing unreliable or inconsistent partitions precisely when the model is uncertain about the answer; this directly affects the entropy calculation and could inflate the reported advantage over baselines.
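The validation asked for in the first major comment could look roughly like the sketch below: compare the model's clusters with human-labelled clusters per question, then average the agreement within confidence bands. The bin edges, the data layout, and the use of pairwise agreement (the Rand index) are assumptions of this sketch, not a procedure from the manuscript.

```python
from collections import defaultdict
from itertools import combinations


def pairwise_agreement(model_labels, human_labels):
    """Fraction of answer pairs on which two clusterings agree (Rand index)."""
    pairs = list(combinations(range(len(model_labels)), 2))
    if not pairs:
        return 1.0  # zero or one answer: nothing to disagree about
    agree = sum(
        (model_labels[i] == model_labels[j]) == (human_labels[i] == human_labels[j])
        for i, j in pairs
    )
    return agree / len(pairs)


def agreement_by_confidence(items, edges=(0.33, 0.66)):
    """Mean agreement per confidence band.

    items: iterable of (confidence, model_labels, human_labels), one per question.
    Returns {band: mean agreement}, where band 0 is the lowest confidence.
    """
    buckets = defaultdict(list)
    for confidence, model_labels, human_labels in items:
        band = sum(confidence >= e for e in edges)
        buckets[band].append(pairwise_agreement(model_labels, human_labels))
    return {band: sum(v) / len(v) for band, v in sorted(buckets.items())}
```

A marked drop in agreement in the lowest band would support the second major comment, since it would mean the partitions degrade exactly where the entropy estimate matters most.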
minor comments (1)
- [Abstract] The abstract states empirical superiority on QA datasets but omits any mention of the statistical tests employed or controls for confounding factors such as generation length or sampling temperature.
Simulated Author's Rebuttal
We thank the referee for their constructive comments, which help clarify important aspects of our work on semantic entropy. We address each major point below and will revise the manuscript to improve transparency and robustness.
Point-by-point responses
-
Referee: [Ablation studies] The ablation studies' claim of superior predictive performance for semantic entropy depends on the reliability of the unsupervised semantic equivalence clustering step, yet no details are supplied on the exact prompting/embedding procedure used to form clusters or on any independent validation of cluster quality (e.g., human agreement rates stratified by model confidence level).
Authors: We agree that greater detail on the clustering procedure is needed for reproducibility. In the revised manuscript, we will expand the methods section to fully specify the prompting strategy for equivalence judgments and the embedding approach used to form clusters. We will also add a human evaluation of cluster quality, reporting agreement rates and stratifying results by model confidence levels to directly validate this component of the method.
Revision: yes
-
Referee: [Method] Because equivalence judgments are obtained from the same language model whose uncertainty is being quantified, the clustering step risks producing unreliable or inconsistent partitions precisely when the model is uncertain about the answer; this directly affects the entropy calculation and could inflate the reported advantage over baselines.
Authors: This is a substantive methodological concern. Using the same model for equivalence judgments introduces a potential dependency that could affect cluster reliability in low-confidence regimes. We will add a dedicated discussion section in the revision addressing this limitation, including analysis of how the entropy measure behaves under varying confidence levels and why the observed performance gains are not solely attributable to this effect.
Revision: partial
Circularity Check
No significant circularity detected in semantic entropy derivation
Full rationale
The paper defines semantic entropy by extending standard entropy to group generations into semantic equivalence classes identified unsupervised via the same model. No equations or steps in the provided text reduce the final measure to a fitted parameter, self-referential definition, or load-bearing self-citation by construction. The method is explicitly described as model-agnostic and unsupervised without modifications, and ablation results are presented as empirical comparisons to baselines rather than forced outcomes. This satisfies the default expectation of a non-circular paper; the clustering step is a methodological choice whose quality is not shown to be tautological with the uncertainty output.
Axiom & Free-Parameter Ledger
Forward citations
Cited by 31 Pith papers
-
REALISTA: Realistic Latent Adversarial Attacks that Elicit LLM Hallucinations
REALISTA optimizes continuous combinations of valid editing directions in latent space to produce realistic adversarial prompts that elicit hallucinations more effectively than prior methods, including on large reason...
-
Inducing Artificial Uncertainty in Language Models
Inducing artificial uncertainty on trivial tasks allows training probes that achieve higher calibration on hard data than standard approaches while retaining performance on easy data.
-
Utility-Oriented Visual Evidence Selection for Multimodal Retrieval-Augmented Generation
Evidence utility is defined as information gain on the model's output distribution, with ranking by gain on a latent helpfulness variable shown equivalent to answer-space utility under mild assumptions, enabling a tra...
-
Grounded or Guessing? LVLM Confidence Estimation via Blind-Image Contrastive Ranking
BICR uses blind-image contrastive ranking on frozen LVLM hidden states to train a lightweight probe that penalizes confidence on blacked-out inputs, yielding top calibration and discrimination across five models and m...
-
Task-Aware Calibration: Provably Optimal Decoding in LLMs
Task calibration aligns LLM distributions in latent task spaces to make MBR decoding provably optimal and improve generation quality.
-
Elicitation Matters: How Prompts and Query Protocols Shape LLM Surrogates under Sparse Observations
LLM surrogate beliefs under sparse observations depend on prompts and query protocols, with structural prompts as priors, pointwise vs joint querying producing different beliefs, and sequential evidence causing non-mo...
-
Are LLM Uncertainty and Correctness Encoded by the Same Features? A Functional Dissociation via Sparse Autoencoders
Uncertainty and correctness in LLMs are encoded by distinct feature populations, with suppression of confounded features improving accuracy and reducing entropy.
-
Beyond One Output: Visualizing and Comparing Distributions of Language Model Generations
GROVE visualizes distributions of language model generations as overlapping paths through a text graph, with user studies showing that graph summaries aid structural judgments like diversity assessment while raw outpu...
-
Structural Anchors and Reasoning Fragility: Understanding CoT Robustness in LLM4Code
CoT prompting in LLM4Code shows mixed robustness that depends on model family, task structure, and perturbations destabilizing structural anchors, leading to trajectory deformations like lengthening, branching, and si...
-
Bridging the Missing-Modality Gap: Improving Text-Only Calibration of Vision Language Models
A new Latent Imagination Module uses cross-attention to predict latent visual embeddings from text, improving accuracy and calibration of vision-language models on text-only inputs.
-
LLMs as Implicit Imputers: Uncertainty Should Scale with Missing Information
Response entropy in LLMs rises with missing context on SQuAD while sampling-based confidence stays high, supporting the multiple imputation criterion and introducing a diagnostic for uncertainty reduction by context level.
-
Uncertainty Quantification for LLM-based Code Generation
RisCoSet applies multiple hypothesis testing to construct risk-controlling partial-program prediction sets for LLM code generation, achieving up to 24.5% less code removal than prior methods at equivalent risk levels.
-
Active Testing of Large Language Models via Approximate Neyman Allocation
Active testing via surrogate semantic entropy stratification and approximate Neyman allocation reduces MSE by up to 28% versus uniform sampling and saves about 23% of the labeling budget on language and multimodal benchmarks.
-
Annotations Mitigate Post-Training Mode Collapse
Annotation-anchored training reduces semantic diversity collapse in post-trained language models by a factor of six compared to standard supervised fine-tuning while preserving instruction-following and improving with scale.
-
The Geometry of Forgetting: Temporal Knowledge Drift as an Independent Axis in LLM Representations
Temporal knowledge drift is encoded as a geometrically orthogonal direction in LLM residual streams, independent of correctness and uncertainty.
-
Measuring Black-Box Confidence via Reasoning Trajectories: Geometry, Coverage, and Verbalization
Trajectory geometry in embedding space fused with coverage and verbalization yields better black-box CoT confidence estimation than self-consistency at lower sample counts across six benchmark-reasoner pairs.
-
Estimating the Black-box LLM Uncertainty with Distribution-Aligned Adversarial Distillation
DisAAD trains a 1%-sized proxy model via adversarial distillation to quantify uncertainty in black-box LLMs by aligning with their output distributions.
-
Geometry-Calibrated Conformal Abstention for Language Models
Geometry-calibrated conformal abstention lets language models abstain from uncertain queries with finite-sample guarantees on both participation rate and conditional correctness of answers.
-
Trace-Level Analysis of Information Contamination in Multi-Agent Systems
Agent workflows can diverge substantially from contaminated inputs yet recover correct answers, or stay similar while failing, as measured by trace divergence on GAIA tasks.
-
The Surprising Universality of LLM Outputs: A Real-Time Verification Primitive
LLM token rank-frequency distributions converge to a shared Mandelbrot distribution across models and domains, enabling a microsecond-scale statistical primitive for provenance verification and black-box anomaly triage.
-
Complementing Self-Consistency with Cross-Model Disagreement for Uncertainty Quantification
Cross-model semantic disagreement adds an epistemic uncertainty term that improves total uncertainty estimation over self-consistency alone, helping flag confident errors in LLMs.
-
CLSGen: A Dual-Head Fine-Tuning Framework for Joint Probabilistic Classification and Verbalized Explanation
CLSGen is a dual-head LLM fine-tuning framework that enables joint probabilistic classification and verbalized explanation generation without catastrophic forgetting of generative capabilities.
-
Rag Performance Prediction for Question Answering
A novel supervised predictor modeling semantic relationships among question, retrieved passages, and generated answer best forecasts when RAG improves QA performance.
-
Ensemble-Based Uncertainty Estimation for Code Correctness Estimation
Ensemble Semantic Entropy improves correlation with code correctness over single-model methods and powers a cascading scaling system that cuts FLOPs by 64.9% while preserving performance on LiveCodeBench.
-
Decodable but Not Corrected by Fixed Residual-Stream Linear Steering: Evidence from Medical LLM Failure Regimes
Overthinking in medical QA is linearly decodable at 71.6% accuracy yet fixed residual-stream steering yields no correction across 29 configurations, while enabling selective abstention with AUROC 0.610.
-
BALAR: A Bayesian Agentic Loop for Active Reasoning
BALAR is a task-agnostic Bayesian loop that maintains structured beliefs over latent states, selects questions via expected mutual information, and expands its state space when needed, delivering 14.6-38.5% accuracy g...
-
Uncertainty-Aware Exploratory Direct Preference Optimization for Multimodal Large Language Models
UE-DPO quantifies epistemic uncertainty from grounding failures to direct more learning pressure on hard visual tokens in preferred samples while easing penalties on dispreferred ones.
-
LLMs Uncertainty Quantification via Adaptive Conformal Semantic Entropy
ACSE estimates LLM prompt uncertainty via adaptive clustering of semantic entropy across multiple responses and uses conformal prediction to bound error rates on accepted answers with distribution-free guarantees.
-
U-Define: Designing User Workflows for Hard and Soft Constraints in LLM-Based Planning
U-Define improves user control in LLM planning by letting people define hard rules and soft preferences in natural language with matching verification methods, raising usefulness and satisfaction scores.
-
Learning Uncertainty from Sequential Internal Dispersion in Large Language Models
SIVR detects LLM hallucinations by learning from token-wise and layer-wise variance patterns in internal hidden states, outperforming baselines with better generalization and less training data.
-
Reducing Hallucination in Enterprise AI Workflows via Hybrid Utility Minimum Bayes Risk (HUMBR)
HUMBR reduces LLM hallucinations in enterprise workflows by using a hybrid semantic-lexical utility within minimum Bayes risk decoding to identify consensus outputs, with derived error bounds and reported outperforman...
Reference graph
Works this paper leans on
-
[1]
Language models are few-shot learners
Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in Neural Information Processing Systems, 33:1877–1901,
-
[2]
PaLM: Scaling Language Modeling with Pathways
Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, et al. PaLM: Scaling language modeling with pathways. arXiv preprint arXiv:2204.02311,
-
[3]
Calibration of pre-trained transformers
Shrey Desai and Greg Durrett. Calibration of pre-trained transformers. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 295–302,
-
[4]
Unsupervised quality estimation for neural machine translation
Published as a conference paper at ICLR 2023
Marina Fomicheva, Shuo Sun, Lisa Yankovskaya, Frédéric Blain, Francisco Guzmán, Mark Fishel, Nikolaos Aletras, Vishrav Chaudhary, and Lucia Specia. Unsupervised quality estimation for neural machine translation. Transactions of the Association for Computational Linguistics, 8:539–555,
-
[5]
Uncertainty-aware machine translation evaluation
Taisiya Glushkova, Chrysoula Zerva, Ricardo Rei, and André FT Martins. Uncertainty-aware machine translation evaluation. arXiv preprint arXiv:2109.06352,
-
[6]
DeBERTa: Decoding-enhanced BERT with Disentangled Attention
Pengcheng He, Xiaodong Liu, Jianfeng Gao, and Weizhu Chen. DeBERTa: Decoding-enhanced BERT with disentangled attention. arXiv preprint arXiv:2006.03654, 2020a.
Ruining He, Anirudh Ravula, Bhargav Kanagal, and Joshua Ainslie. RealFormer: Transformer likes residual attention. arXiv preprint arXiv:2012.11747, 2020b.
Dan Hendrycks, Nicholas Carlini, Joh...
-
[7]
Training Compute-Optimal Large Language Models
Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, et al. Training compute-optimal large language models. arXiv preprint arXiv:2203.15556,
-
[8]
Abstract meaning representation for paraphrase detection
Fuad Issa, Marco Damonte, Shay B Cohen, Xiaohui Yan, and Yi Chang. Abstract meaning representation for paraphrase detection. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pp. 442–452,
-
[9]
Deup: Direct epistemic uncertainty prediction
Moksh Jain, Salem Lahlou, Hadi Nekoei, Victor Butoi, Paul Bertin, Jarrid Rector-Brooks, Maksym Korablyov, and Yoshua Bengio. DEUP: Direct epistemic uncertainty prediction. arXiv preprint arXiv:2102.08501,
-
[10]
TriviaQA: A Large Scale Distantly Supervised Challenge Dataset for Reading Comprehension
Mandar Joshi, Eunsol Choi, Daniel S Weld, and Luke Zettlemoyer. TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension. arXiv preprint arXiv:1705.03551,
-
[11]
Language Models (Mostly) Know What They Know
Saurav Kadavath, Tom Conerly, Amanda Askell, Tom Henighan, Dawn Drain, Ethan Perez, Nicholas Schiefer, Zac Hatfield Dodds, Nova DasSarma, Eli Tran-Johnson, et al. Language models (mostly) know what they know. arXiv preprint arXiv:2207.05221,
-
[12]
Teaching models to express their uncertainty in words
Stephanie Lin, Jacob Hilton, and Owain Evans. Teaching models to express their uncertainty in words. arXiv preprint arXiv:2205.14334, 2022a.
Zi Lin, Jeremiah Zhe Liu, and Jingbo Shang. Towards collaborative neural-symbolic graph semantic parsing via uncertainty. In Findings of the Association for C...
-
[13]
Uncertainty estimation in autoregressive structured prediction
Andrey Malinin and Mark Gales. Uncertainty estimation in autoregressive structured prediction. arXiv preprint arXiv:2002.07650,
-
[14]
Andrey Malinin, Sergey Chervontsev, Ivan Provilkov, and Mark Gales. Regression prior networks. arXiv preprint arXiv:2006.11590,
-
[15]
Sabrina J Mielke, Arthur Szlam, Y-Lan Boureau, and Emily Dinan. Linguistic calibration through metacognition: aligning dialogue agent responses with expected correctness. arXiv preprint arXiv:2012.14983,
-
[16]
Correcting length bias in neural machine translation
Kenton Murray and David Chiang. Correcting length bias in neural machine translation. arXiv preprint arXiv:1808.10006,
-
[17]
Charformer: Fast character transformers via gradient-based subword tokenization
Yi Tay, Vinh Q Tran, Sebastian Ruder, Jai Gupta, Hyung Won Chung, Dara Bahri, Zhen Qin, Simon Baumgartner, Cong Yu, and Donald Metzler. Charformer: Fast character transformers via gradient-based subword tokenization. arXiv preprint arXiv:2106.12672,
-
[18]
Entailment as few-shot learner
Sinong Wang, Han Fang, Madian Khabsa, Hanzi Mao, and Hao Ma. Entailment as few-shot learner. arXiv preprint arXiv:2104.14690,
-
[19]
Bilateral multi-perspective matching for natural language sentences
Zhiguo Wang, Wael Hamza, and Radu Florian. Bilateral multi-perspective matching for natural language sentences. arXiv preprint arXiv:1702.03814,
-
[20]
A Broad-Coverage Challenge Corpus for Sentence Understanding through Inference
Adina Williams, Nikita Nangia, and Samuel R Bowman. A broad-coverage challenge corpus for sentence understanding through inference. arXiv preprint arXiv:1704.05426,
-
[21]
Deep learning for answer sentence selection
Lei Yu, Karl Moritz Hermann, Phil Blunsom, and Stephen Pulman. Deep learning for answer sentence selection. arXiv preprint arXiv:1412.1632,
-
[22]
OPT: Open Pre-trained Transformer Language Models
Susan Zhang, Stephen Roller, Naman Goyal, Mikel Artetxe, Moya Chen, Shuohui Chen, Christopher Dewan, Mona Diab, Xian Li, Xi Victoria Lin, et al. OPT: Open pre-trained transformer language models. arXiv preprint arXiv:2205.01068,
-
[23]
Table 3: Illustration of semantic, syntactic, and lexical equivalence. Work with foundation models implicitly focuses on lexical equivalence, which entails the others, but we usually care about semantic equivalence. Equivalence Sentence A Sentence B Lexical Syntactic Semantic Paris is the capital of F...
-
[24]
<g/>”,x, s(m))) ⊿ Does old sequence entail new one? right ← M(cat(x, s(m), “<g/>
Lexically equivalent sequences use exactly the same symbols. They are always also semantically and syntactically equivalent (in a given context). Syntactically equivalent sentences have the same grammatical form. But they can have different meanings (not semantically equivalent) and can use different symbols (not lexically equivalent). Semantically equi...
-
[25]
As in the main body of the paper, we measure diversity as the average lexical overlap of the answers in the answer set. Additionally, we investigate why the semantic entropy underperforms the length-normalised entropy at high temperatures. To that end, we manually inspect and label 100 classifications of our semantic equivalence method at T=1.5, and we fin...
-
[26]
We find that on CoQA, we obtain accurate model results with zero-shot prompting
We use the following prompts on CoQA and TriviaQA. We find that on CoQA, we obtain accurate model results with zero-shot prompting, while we have to use few-shot prompting to obtain accurate answers on closed-book TriviaQA. We use the following prompts for each of the settings: CoQA: [The provided context paragraph] [additional question-answer pairs] Q: [P...
-
[27]
except for the exact matching accuracy criterion which is too demanding because of the much larger variety of possible answers for this task.
Table 7: CoQA: the exact choice of the accuracy metric for the free-form open-book QA task has little effect on the assessment of the quality of the uncertainty measur...