Semantic Entropy Probes: Robust and Cheap Hallucination Detection in LLMs

Jannik Kossen, Jiatong Han, Muhammed Razzak, Lisa Schut, Shreshth Malik, Yarin Gal · 2024 · cs.CL · arXiv 2406.15927

31 Pith papers cite this work. Polarity classification is still indexing.

abstract

We propose semantic entropy probes (SEPs), a cheap and reliable method for uncertainty quantification in Large Language Models (LLMs). Hallucinations, which are plausible-sounding but factually incorrect and arbitrary model generations, present a major challenge to the practical adoption of LLMs. Recent work by Farquhar et al. (2024) proposes semantic entropy (SE), which can detect hallucinations by estimating uncertainty in the space of semantic meaning for a set of model generations. However, the 5-to-10-fold increase in computation cost associated with SE computation hinders practical adoption. To address this, we propose SEPs, which directly approximate SE from the hidden states of a single generation. SEPs are simple to train and do not require sampling multiple model generations at test time, reducing the overhead of semantic uncertainty quantification to almost zero. We show that SEPs retain high performance for hallucination detection and generalize better to out-of-distribution data than previous probing methods that directly predict model accuracy. Our results across models and tasks suggest that model hidden states capture SE, and our ablation studies give further insights into the token positions and model layers for which this is the case.
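The idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation, and everything beyond "approximate SE from hidden states" is an assumption made for illustration: semantic entropy is taken as the discrete entropy over meaning-cluster labels of sampled generations, the entropy is binarized at a hypothetical `threshold`, the probe is a plain logistic regression, and the "hidden states" are synthetic 2-dimensional vectors rather than real model activations. All function names are hypothetical.

```python
import math
import random
from collections import Counter


def semantic_entropy(cluster_ids):
    """Entropy (in nats) of the empirical distribution over meaning clusters."""
    n = len(cluster_ids)
    return sum(-(c / n) * math.log(c / n) for c in Counter(cluster_ids).values())


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def probe_score(w, b, hidden):
    """Probe's estimated probability that `hidden` comes from a high-entropy case."""
    return sigmoid(sum(wi * hi for wi, hi in zip(w, hidden)) + b)


def train_probe(features, labels, lr=0.1, epochs=200):
    """Fit a logistic-regression probe with full-batch gradient descent."""
    dim, n = len(features[0]), len(features)
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in zip(features, labels):
            err = probe_score(w, b, x) - y
            grad_b += err
            for j in range(dim):
                grad_w[j] += err * x[j]
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def make_toy_dataset(n_prompts=200, seed=0, threshold=0.5):
    """Synthetic stand-in for (hidden state, sampled-generation clusters) pairs."""
    rng = random.Random(seed)
    features, labels = [], []
    for i in range(n_prompts):
        uncertain = i % 2 == 1
        # Uncertain prompts: 10 samples scattered over 5 meanings; else one meaning.
        clusters = [rng.randrange(5) for _ in range(10)] if uncertain else [0] * 10
        # Assumption: the first hidden dimension carries the uncertainty signal.
        hidden = [rng.gauss(2.0 if uncertain else -2.0, 1.0), rng.gauss(0.0, 1.0)]
        features.append(hidden)
        # Training label: binarized semantic entropy (needs sampling, train time only).
        labels.append(1 if semantic_entropy(clusters) > threshold else 0)
    return features, labels


features, labels = make_toy_dataset()
w, b = train_probe(features, labels)
```

In this sketch the expensive multi-sample entropy is computed only while building training labels. At test time, `probe_score(w, b, hidden)` needs the hidden state of a single generation, which mirrors the cost argument made in the abstract.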

citation-role summary

baseline 2

citation-polarity summary

baseline 2

representative citing papers

Not All Uncertainty Is Equal: How Uncertainty Granularity Shapes Human Verification in LLM-Assisted Decision Making

cs.HC · 2026-05-27 · unverdicted · novelty 7.0

A between-subjects experiment (N=192) finds that token-level uncertainty increases agreement with LLM answers while relation-level uncertainty reduces external verification in medical decision tasks.

Risk-aware Selective Prompting for Hallucination Mitigation in Large Vision-Language Models

cs.CL · 2026-05-27 · unverdicted · novelty 7.0

Verification prompting in LVLMs is input-dependent and risk-bearing; RSP selectively triggers it via pre-generation uncertainty to avoid performance degradation on easy cases.

Inducing Artificial Uncertainty in Language Models

cs.CL · 2026-05-13 · unverdicted · novelty 7.0

Inducing artificial uncertainty on trivial tasks allows training probes that achieve higher calibration on hard data than standard approaches while retaining performance on easy data.

Attractor Geometry of Transformer Memory: From Conflict Arbitration to Confident Hallucination

cs.AI · 2026-05-07 · unverdicted · novelty 7.0 · 2 refs

Attractor basins in transformer hidden states unify conflict and hallucination as basin competition or absence, with geometric margin outperforming entropy for detection and a scaling law governing confident hallucination rates.

Clotho: Measuring Task-Specific Pre-Generation Test Adequacy for LLM Inputs

cs.SE · 2025-09-22 · unverdicted · novelty 7.0

Clotho ranks LLM test inputs by failure likelihood using pre-generation hidden states and GMMs, achieving 0.716 ROC-AUC after labeling 5.4% of inputs on average across eight tasks and three models, with transfer to proprietary models.

Grad Detect: Gradient-Based Hallucination Detection in LLMs

cs.LG · 2026-06-23 · unverdicted · novelty 6.0

Grad Detect uses internal gradient patterns from one inference pass to predict LLM hallucinations and abstention, outperforming confidence and sampling baselines on Q&A benchmarks with most signal in the final five layers.

Knowledge Dependency Estimation for Reliable Question Answering

cs.CL · 2026-05-27 · unverdicted · novelty 6.0

Knot estimates QA model sensitivity to candidate knowledge via subset counterfactual training and latent factor coverage, yielding unit rankings that outperform baselines without extra model calls.

Hallucination as Commitment Failure: Larger LLMs Misfire Despite Knowing the Answer

cs.CL · 2026-05-21 · unverdicted · novelty 6.0

Larger LLMs hallucinate more often despite having the correct concept available because instruction tuning causes probability mass to disperse across alternative surface forms instead of concentrating on one.

Reading Calibrated Uncertainty from Language Model Trajectories

cs.LG · 2026-05-19 · unverdicted · novelty 6.0

Geometric features from per-layer MLP update trajectories fed to a sparse linear probe outperform maximum softmax probability for uncertainty quantification under selective abstention, with gains up to 21 AURC points.

The Geometry of Forgetting: Temporal Knowledge Drift as an Independent Axis in LLM Representations

cs.AI · 2026-05-09 · unverdicted · novelty 6.0

Temporal knowledge drift is encoded as a geometrically orthogonal direction in LLM residual streams, independent of correctness and uncertainty.

Hallucination as an Anomaly: Dynamic Intervention via Probabilistic Circuits

cs.CL · 2026-05-07 · unverdicted · novelty 6.0

Probabilistic circuits detect LLM hallucinations as residual-stream anomalies with up to 99% AUROC and enable dynamic correction that raises truthfulness scores while cutting unnecessary output corruption.

Zero-Shot Confidence Estimation for Small LLMs: When Supervised Baselines Aren't Worth Training

cs.AI · 2026-05-04 · conditional · novelty 6.0

Average token log-probability provides a zero-shot confidence signal for small LLMs that matches supervised baselines in-distribution and outperforms them out-of-distribution, with a new retrieval-conditional variant improving further at lower latency.

The Surprising Universality of LLM Outputs: A Real-Time Verification Primitive

cs.CR · 2026-04-28 · unverdicted · novelty 6.0

LLM token rank-frequency distributions converge to a shared Mandelbrot distribution across models and domains, enabling a microsecond-scale statistical primitive for provenance verification and black-box anomaly triage.

Process Supervision of Confidence Margin for Calibrated LLM Reasoning

cs.LG · 2026-04-25 · unverdicted · novelty 6.0

RLCM trains LLMs with a margin-enhanced process reward that widens the gap between correct and incorrect reasoning steps, improving calibration on math, code, logic, and science tasks without hurting accuracy.

Convergent Evolution: How Different Language Models Learn Similar Number Representations

cs.CL · 2026-04-22 · unverdicted · novelty 6.0

Diverse language models converge on similar periodic number features with a two-tier hierarchy of Fourier sparsity and geometric separability, acquired via language co-occurrences or multi-token arithmetic.

Temporal Difference Calibration in Sequential Tasks: Application to Vision-Language-Action Models

cs.RO · 2026-04-22 · unverdicted · novelty 6.0

Temporal difference calibration aligns uncertainty estimates in vision-language-action models with their value functions for better sequential performance.

Unsupervised Confidence Calibration for Reasoning LLMs from a Single Generation

cs.LG · 2026-04-21 · unverdicted · novelty 6.0

Unsupervised single-generation confidence calibration for reasoning LLMs via offline self-consistency proxy distillation outperforms baselines on math and QA tasks and improves selective prediction.

Mind the Unseen Mass: Unmasking LLM Hallucinations via Soft-Hybrid Alphabet Estimation

cs.CL · 2026-04-21 · unverdicted · novelty 6.0

SHADE adaptively combines coverage and spectral signals to estimate semantic alphabet size from few LLM samples, yielding better performance than baselines in low-sample regimes for alphabet estimation and QA error detection.

Complementing Self-Consistency with Cross-Model Disagreement for Uncertainty Quantification

cs.AI · 2026-04-18 · unverdicted · novelty 6.0

Cross-model semantic disagreement adds an epistemic uncertainty term that improves total uncertainty estimation over self-consistency alone, helping flag confident errors in LLMs.

Do Hallucination Neurons Generalize? Evidence from Cross-Domain Transfer in LLMs

cs.CL · 2026-03-27 · unverdicted · novelty 6.0

Hallucination neurons in LLMs are domain-specific, with cross-domain classifiers dropping from AUROC 0.783 within-domain to 0.563 across domains.

GrACE: A Generative Approach to Better Confidence Elicitation and Efficient Test-Time Scaling in Large Language Models

cs.CL · 2025-09-11 · unverdicted · novelty 6.0

GrACE is a fine-tuned generative method that uses similarity to a special token embedding for real-time calibrated confidence in LLMs and enables efficient confidence-based test-time scaling.

FLaG: Fine-Grained Latent Grouping for Hallucination Detection

cs.LG · 2026-05-29 · unverdicted · novelty 5.0

FLaG models hallucination detection via latent evidence groups and energy-based routing with log-marginal aggregation, claiming SOTA results and a theoretical link to Bayes-optimal detection under heterogeneous mechanisms.

Capability Self-Assessment: Teaching LLMs to Know Their Limits

cs.AI · 2026-05-29 · unverdicted · novelty 5.0

Reinforcement learning teaches LLMs to assess their own capabilities more effectively than supervised fine-tuning, preserves original skills, generalizes out of distribution, and aids local-cloud routing and data selection.

Reverse Probing: Supervised Token-level Uncertainty Quantification for Large Language Models in Clinical Text

cs.CL · 2026-05-27 · unverdicted · novelty 5.0

Reverse Probing extracts token-level uncertainty from LLM internal activations on labeled clinical summaries, outperforming eight baselines with up to 4x higher AUPRC on two expert-annotated datasets while lowering compute costs.

citing papers explorer

Clotho: Measuring Task-Specific Pre-Generation Test Adequacy for LLM Inputs

cs.SE · 2025-09-22 · unverdicted · none · ref 24 · internal anchor

Clotho ranks LLM test inputs by failure likelihood using pre-generation hidden states and GMMs, achieving 0.716 ROC-AUC after labeling 5.4% of inputs on average across eight tasks and three models, with transfer to proprietary models.
