arxiv: 2605.28264 · v1 · pith:LETX6BEXnew · submitted 2026-05-27 · 💻 cs.AI

Entropy Distribution as a Fingerprint for Hallucinations in Generative Models

Pith reviewed 2026-06-29 12:08 UTC · model grok-4.3

classification 💻 cs.AI
keywords hallucination detection · token entropy distribution · calibrated entropy score · single forward pass · black-box access · large language models · statistical hypothesis test

The pith

The distribution of token-level entropies serves as a fingerprint for hallucinations, enabling single-pass detection via a calibrated score that combines mean and maximum signals.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that hallucinations leave a signature in the full spread of token entropies, not merely their average as captured by perplexity. Shape and tail behavior of this distribution carry independent information that allows formal detection as a hypothesis test. The authors introduce the Calibrated Entropy Score, which normalizes mean and peak entropy against a reference cumulative distribution to produce comparable outputs across models. The method needs only one forward pass and black-box logit access, yet supplies finite-sample calibration bounds and exponential convergence of detection probability with length. Empirical results across eight benchmarks and ten models show it matches the accuracy of far more expensive multi-sample approaches.

Core claim

We provide theoretical background and empirical evidence that the distribution of token-level entropies, beyond the mean captured by perplexity or length-normalised entropy, serves as a fingerprint of hallucination, with distributional shape and tail behaviour carrying independent signal. We formalize hallucination detection as a statistical hypothesis test and propose the Calibrated Entropy Score (CES), a lightweight algorithm requiring only a single forward pass and black-box access to token logits. CES combines the mean signal with the maximum signal of the generated entropy through a calibrated reference CDF, producing scores that are directly comparable across models and tasks. We establish finite-sample calibration guarantees via a novel random-length Dvoretzky–Kiefer–Wolfowitz inequality, and also prove that CES detects hallucinations with probability converging to one exponentially fast in the generation length.

What carries the argument

The Calibrated Entropy Score (CES), formed by mapping the mean and maximum of the per-token entropy sequence onto a reference cumulative distribution function to yield a comparable hallucination score.
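
A minimal sketch of how such a score could be computed, assuming the reference CDF is an empirical CDF over token entropies taken from faithful calibration generations (this summary does not specify the construction). The names `token_entropies`, `ecdf`, and `ces_sketch` are hypothetical, and the geometric-mean combination follows the Figure 1 caption rather than the paper's exact algorithm.

```python
import bisect
import math


def token_entropies(logit_rows):
    """Shannon entropy (in nats) of the softmax of each row of logits."""
    entropies = []
    for row in logit_rows:
        peak = max(row)  # subtract the max for numerical stability
        exps = [math.exp(x - peak) for x in row]
        total = sum(exps)
        probs = [e / total for e in exps]
        entropies.append(-sum(p * math.log(p) for p in probs if p > 0.0))
    return entropies


def ecdf(sorted_reference, value):
    """Empirical CDF: fraction of reference values less than or equal to value."""
    return bisect.bisect_right(sorted_reference, value) / len(sorted_reference)


def ces_sketch(logit_rows, reference_entropies):
    """Geometric mean of the CDF-mapped mean and maximum token entropy.

    ASSUMPTION: reference_entropies stands in for the calibration sample
    behind the reference CDF; the paper's own estimator may differ.
    """
    ref = sorted(reference_entropies)
    h = token_entropies(logit_rows)
    u_mean = ecdf(ref, sum(h) / len(h))
    u_max = ecdf(ref, max(h))
    return math.sqrt(u_mean * u_max)
```

Under this convention, scores near 1 mean that both the mean and the peak entropy sit in the upper tail of the reference distribution.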

If this is right

  • CES achieves the highest detection performance among single-pass black-box methods on eight QA benchmarks and ten generator models.
  • Finite-sample calibration guarantees are provided by a random-length Dvoretzky–Kiefer–Wolfowitz inequality.
  • Detection probability converges exponentially to one as generation length grows.
  • Produced scores are directly comparable across open-source and API-access models without retraining.
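
For orientation, the classical fixed-sample-size DKW inequality bounds the worst-case gap between an empirical CDF F_n and the true CDF F by P(sup |F_n − F| > ε) ≤ 2·exp(−2nε²). The sketch below inverts that bound to get a confidence band. It is the textbook version for n i.i.d. samples, not the paper's random-length variant, and the function names are illustrative.

```python
import math


def dkw_epsilon(n, alpha):
    """Half-width eps such that P(sup_x |F_n(x) - F(x)| > eps) <= alpha."""
    return math.sqrt(math.log(2.0 / alpha) / (2.0 * n))


def dkw_band(sample, x, alpha):
    """Confidence band (lower, upper) for F(x), clipped to [0, 1]."""
    n = len(sample)
    f_hat = sum(1 for v in sample if v <= x) / n  # empirical CDF at x
    eps = dkw_epsilon(n, alpha)
    return max(0.0, f_hat - eps), min(1.0, f_hat + eps)
```

The half-width shrinks like 1/sqrt(n), so quadrupling the calibration set roughly halves the band.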

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The approach could be inserted into decoding loops to steer away from high-entropy token sequences in real time.
  • A broader reference CDF built from mixed domains might remove the need for task-specific recalibration.
  • If tail behavior dominates, then monitoring only the highest-entropy tokens could yield cheaper approximations.

Load-bearing premise

A single reference CDF can calibrate the mean and maximum entropy signals so that the resulting scores stay comparable and the detection guarantees remain valid across different models, tasks, and generation lengths.

What would settle it

A controlled test on a held-out model and task showing that CES scores lose calibration or fail to separate hallucinated from correct outputs at rates predicted by the exponential convergence bound.

Figures

Figures reproduced from arXiv: 2605.28264 by Akshay Seshadri, Mattia J. Villani, Niraj Kumar, Pranav Deshpande, Romina Yalovetzky.

Figure 1
Figure 1: Calibrated Entropy Score (CES) hallucination detection. A single forward pass produces token distributions p^(t), from which entropies h^(t) are computed. Two summary statistics (the mean entropy h̄ and the maximum entropy h_max) are mapped through the calibrated CDF F̂_0 and combined via a geometric mean. The reference CDF, F̂_0, is estimated offline from an oracle-labeled calibration set (dashed path; Algo…
Figure 2
Figure 2 illustrates the separation. Panel (a) shows representative ECDFs for a median-effect-size experiment, where faithful and hallucinated entropy distributions are visibly offset. Panel (b) displays the distribution of d_KS distances across all 80 experiments, with the rejection threshold annotated. Panel (c) shows that the mean-centred shape signal is universally significant, confirming distributional shape di…
Figure 3
Figure 3: Benchmark comparison of confabulation detection methods across all model–dataset experiments. (a) Average rank (lower is better) for each method, aggregated over all 80 open-weight and API model experiments. Bar annotations indicate the average rank and number of first-place finishes. (b) Nemenyi Critical Difference (CD) diagram following [7]. CES (unsupervised) ranks among the top clique without requiring…
Figure 4
Figure 4: Per-experiment AUROC for all 17 methods across 10 models × 8 datasets. Each bar represents one method’s AUROC; models are grouped by row and datasets by column. CES (unsupervised, starred) achieves a median AUROC of 0.653, comparable to Embedding Regression (0.665) and KLE variants (0.647–0.651). Setup. We evaluate 7 detection methods across our full 10 × 8 experimental grid: CES (supervised and unsupervis…
Figure 5
Figure 5: Token entropy autocorrelation analysis across 18,705 sequences. (a) Distribution of lag-1 ACF; median ρ_1 = 0.061, well below the threshold for concern. (b) Lag-1 vs. lag-2 autocorrelation scatter; second-order dependence is weaker (median ρ_2 ≈ 0.03). (c) Effective sample size ratio n_eff/n; most sequences retain >80% effective samples. Only 15.8% of sequences have |ρ_1| > 0.3, yielding substantially reduced …
Figure 6
Figure 6: KS test power vs. sample size across all 80 experiments (10 models × 8 datasets). Each cell shows the log10 p-value as a function of subsample size for one model–dataset pair. Significance threshold (α = 0.05) is shown as a horizontal line. Power increases monotonically with sample size; 65% of all 3,200 resamples achieve significance.
Figure 7
Figure 7: Per-dataset ECDF gallery. Faithful (blue) vs. hallucinated (red) entropy distributions for each of 8 datasets, pooled across all 10 models. d_KS and p-values are annotated per panel. Visual separation directly corresponds to detection performance: datasets with clearly offset curves (SVAMP, TriviaQA) yield the highest AUROC. E.6 Generation Length and Test Power. Setup. The KS test’s power depends on sample s…
Figure 8
Figure 9
Figure 9: Combinatorial statistic ranking (top 10 of 44 variants). Ranked by average rank across 80 experiments (lower = better). The CES formula geom(mean, max) achieves the best average rank (10.3), confirming that combining location (mean) and tail (max) entropy statistics via geometric aggregation optimally captures both distributional features. E.8 Calibration Contamination. Setup. The supervised CES variant us…
Figure 10
Figure 11
Figure 12
Figure 12: Nemenyi Critical Difference diagram for 17 methods on 80 experiments. Methods connected by a thick bar are not significantly different (α = 0.05, CD = 2.78). CES (unsupervised) at rank 6.29 is within the top clique and statistically indistinguishable from the best-ranked methods (KLE (heat) at 6.16).
Figure 13
Figure 13: CES pairwise win rates against each of 16 benchmark methods. CES wins 854/1279 pairwise comparisons (66.8%) overall and beats 12/16 methods at >50% win rate. Median AUROC advantage (∆) annotated for each opponent. E.11 API vs. Open-Weight Model Comparison. Setup. A key practical question is whether entropy-based detection generalises to API-only models where only token-level log-probabilities (not full log…
Figure 14
Figure 14: Per-dataset CES AUROC comparison: API (red) vs. open-weight (blue) models. Significant differences (p < 0.05) are marked with asterisks. 4/8 datasets show statistically significant differences, with the direction varying by task type. [Plot residue: panel (a) CES Score Distributions; x-axis CES (supervised) score, y-axis Density; Open-Weight (n=21559), API (n=14924).]
Figure 15
Figure 15: Distributional properties: API vs. open-weight models. (a) AUROC distributions overlap substantially. (b) Cohen’s d distributions are comparable (API median 0.494, open 0.515). (c) Dataset difficulty correlation between model types (Pearson r = −0.49), suggesting that difficult datasets for API models are not necessarily difficult for open-weight models and vice versa.
Figure 16
Figure 16: Method ranking comparison: internal methods ranked by median AUROC separately for API and open-weight models. Spearman rank correlation ρ = 0.673 (p = 0.098) indicates moderate agreement in method effectiveness across model types. E.12 Empirical Verification of CES Error Bounds. Setup. Theorem 5 establishes that the Type I and Type II errors of the CES test decay exponentially with generation length m. Spe…
Figure 17
Figure 17: Empirical verification of CES error bounds (Theorem 5). Type I error (false positive rate) and Type II error (false negative rate) as a function of generation length m, measured on synthetic i.i.d. generations drawn from pooled faithful and hallucinated token entropy distributions. Dashed lines show the theoretical exponential upper bounds. Both error rates decay exponentially with m, and the empirical cu…
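
Several captions above report a KS distance between the faithful and hallucinated entropy ECDFs. As a rough illustration of that quantity, the following sketch computes the two-sample Kolmogorov-Smirnov statistic in plain Python. The function name and the toy inputs are assumptions for illustration; the paper's own pipeline, thresholds, and p-value computation are not reproduced here.

```python
def ks_distance(sample_a, sample_b):
    """Two-sample KS statistic: largest vertical gap between the two ECDFs."""
    a, b = sorted(sample_a), sorted(sample_b)
    i = j = 0
    gap = 0.0
    while i < len(a) and j < len(b):
        x = min(a[i], b[j])
        # Advance both pointers past every value <= x so ties are handled
        # consistently before the ECDFs are compared.
        while i < len(a) and a[i] <= x:
            i += 1
        while j < len(b) and b[j] <= x:
            j += 1
        gap = max(gap, abs(i / len(a) - j / len(b)))
    return gap


# Toy entropies (illustrative values only, not data from the paper).
faithful = [0.1, 0.2, 0.3, 0.4]
hallucinated = [0.3, 0.4, 0.9, 1.2]
d_ks = ks_distance(faithful, hallucinated)
```

A distance of 0 means the two ECDFs coincide; a distance of 1 means the samples do not overlap at all.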
Original abstract

Large Language Models (LLMs) often generate factually incorrect outputs, commonly termed hallucinations, that undermine trust and limit deployment in high-stakes settings. Existing hallucination detection methods typically require multiple forward passes or access to model internals. In this work, we provide theoretical background and empirical evidence that the distribution of token-level entropies, beyond the mean captured by perplexity or length-normalised entropy, serves as a fingerprint of hallucination, with distributional shape and tail behaviour carrying independent signal. We formalize hallucination detection as a statistical hypothesis test and propose the Calibrated Entropy Score (CES), a lightweight algorithm requiring only a single forward pass and black-box access to token logits. CES combines the mean signal with the maximum signal of the generated entropy through a calibrated reference CDF, producing scores that are directly comparable across models and tasks. We establish finite-sample calibration guarantees via a novel random-length Dvoretzky–Kiefer–Wolfowitz inequality, and also prove that CES detects hallucinations with probability converging to one exponentially fast in the generation length. Across eight QA benchmarks and ten generator models spanning open-source and API access models, CES achieves the highest detection performance among all single-pass black-box methods while providing formal error guarantees that existing heuristics lack. Remarkably, CES is statistically indistinguishable from multi-sample methods that require far greater computational cost, closing the gap between lightweight and expensive detection and making it suitable for real-time, large-scale deployment.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that the distribution of token-level entropies (shape and tails beyond the mean) serves as a fingerprint for hallucinations in LLMs. It formalizes detection as a hypothesis test and introduces the Calibrated Entropy Score (CES), a single-pass black-box method that combines mean and max entropy signals via a calibrated reference CDF. Theoretical contributions include finite-sample guarantees from a novel random-length Dvoretzky–Kiefer–Wolfowitz inequality and exponential convergence of detection probability with generation length. Empirically, CES outperforms other single-pass methods on eight QA benchmarks across ten models (open-source and API) and is statistically indistinguishable from multi-sample baselines.

Significance. If the single-reference-CDF calibration transfers without model-specific bias, the result would be significant: it supplies the first lightweight detector with explicit finite-sample and exponential-rate guarantees, closing the performance gap to expensive multi-sample methods while remaining deployable at scale.

major comments (2)
  1. [Abstract and §3 (CES definition)] The central cross-model claim requires that one fixed reference CDF calibrates both mean and max entropy signals so that CES remains comparable and the DKW-based finite-sample guarantees hold for all ten models. Token-level entropy distributions are shaped by architecture, scale, tokenizer, and training objective; if these induce different families, the quantile mapping introduces model-specific bias that invalidates both the hypothesis-test interpretation and the claim of matching multi-sample methods without per-model recalibration.
  2. [Theoretical development (novel random-length DKW inequality)] The inequality assumes the underlying distribution is fixed, yet the cross-model empirical claims implicitly treat the reference CDF as universal. No evidence is provided that the inequality’s assumptions survive the shift from reference-construction models to the ten evaluation models.
minor comments (2)
  1. Clarify whether the reference CDF was constructed from a held-out set of models/tasks independent of the ten evaluated generators.
  2. Add explicit statistical tests (e.g., paired significance) for the claim that CES is indistinguishable from multi-sample methods.
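
One way the paired test requested in minor comment 2 could be run is a sign-flip permutation test on per-experiment AUROC differences. The sketch below is an illustration only: the function name, the resample count, and the idea of pairing by model-dataset experiment are assumptions, not the paper's procedure.

```python
import random


def paired_permutation_pvalue(scores_a, scores_b, n_resamples=10000, seed=0):
    """Two-sided sign-flip permutation test on the mean paired difference."""
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    observed = abs(sum(diffs) / len(diffs))
    rng = random.Random(seed)  # seeded so the result is reproducible
    hits = 0
    for _ in range(n_resamples):
        # Under the null, each paired difference is equally likely to carry
        # either sign, so flip signs at random and recompute the mean.
        flipped = sum(d if rng.random() < 0.5 else -d for d in diffs)
        if abs(flipped / len(diffs)) >= observed - 1e-12:
            hits += 1
    # Add-one correction keeps the estimated p-value strictly positive.
    return (hits + 1) / (n_resamples + 1)
```

A large p-value here only means the test failed to detect a difference. Supporting a claim of indistinguishability would additionally need an equivalence margin or a confidence interval on the paired difference.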

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive comments on the cross-model calibration of CES and the scope of the theoretical guarantees. We address each major comment below.

Point-by-point responses
  1. Referee: [Abstract and §3 (CES definition)] The central cross-model claim requires that one fixed reference CDF calibrates both mean and max entropy signals so that CES remains comparable and the DKW-based finite-sample guarantees hold for all ten models. Token-level entropy distributions are shaped by architecture, scale, tokenizer, and training objective; if these induce different families, the quantile mapping introduces model-specific bias that invalidates both the hypothesis-test interpretation and the claim of matching multi-sample methods without per-model recalibration.

    Authors: Empirical results across ten diverse models (open- and closed-source) and eight benchmarks show that a single reference CDF yields comparable CES values and performance statistically indistinguishable from multi-sample baselines without per-model recalibration. This supports the practical validity of the hypothesis-test framing. We will revise §3 to detail the reference CDF construction from a multi-model calibration set and add a brief discussion of observed robustness to architectural differences.
    Revision: partial

  2. Referee: [Theoretical development (novel random-length DKW inequality)] The inequality assumes the underlying distribution is fixed, yet the cross-model empirical claims implicitly treat the reference CDF as universal. No evidence is provided that the inequality’s assumptions survive the shift from reference-construction models to the ten evaluation models.

    Authors: The random-length DKW inequality is applied to the fixed reference CDF once constructed, supplying finite-sample guarantees for quantile estimation on subsequent generations from any model. While a formal transfer theorem under arbitrary model shift is not derived, the evaluation includes models held out from reference construction, and the observed performance provides supporting empirical evidence. We will add a clarifying paragraph in the theoretical section on the conditional nature of the guarantees and the role of the empirical validation.
    Revision: partial

Circularity Check

0 steps flagged

No circularity; derivation relies on novel inequality and independent statistical formalization

full rationale

The paper formalizes detection as a hypothesis test, defines CES via mean/max entropy signals combined through a reference CDF for cross-model comparability, and derives finite-sample guarantees plus exponential convergence from a novel random-length DKW inequality. No quoted step reduces a claimed prediction or result to a fitted input by construction, nor does any load-bearing premise collapse to self-citation or ansatz smuggling. The reference CDF construction is not shown to be data-dependent in a way that forces the detection claims, and the theoretical results are presented as independent of the empirical benchmarks.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on the premise that hallucination status systematically alters the entropy distribution in a calibratable way; the reference CDF is the main unstated element whose construction is not detailed in the abstract.

free parameters (1)
  • reference CDF construction parameters
    The reference cumulative distribution function used for calibration is necessarily chosen or fitted from some data or model outputs to enable cross-model comparability.
axioms (1)
  • domain assumption: Hallucinated generations exhibit distinguishable entropy distribution shape and tail behaviour compared to factual generations in a manner consistent enough for a single reference CDF to calibrate across models and tasks.
    This is the load-bearing premise that allows the fingerprint and the CES guarantees to hold.

pith-pipeline@v0.9.1-grok · 5809 in / 1401 out tokens · 54215 ms · 2026-06-29T12:08:26.155191+00:00 · methodology

Reference graph

Works this paper leans on

56 extracted references · 22 canonical work pages · 7 internal anchors

  1. [1]

    The Falcon Series of Open Language Models

    Ebtesam Almazrouei, Hamza Alobeidli, Abdulaziz Alshamsi, Alessandro Cappelli, Ruxandra Cojocaru, Mérouane Debbah, Étienne Goffinet, Daniel Hesslow, Julien Launay, Quentin Malartic, et al. The falcon series of open language models. arXiv preprint arXiv:2311.16867, 2023

  2. [2]

    The internal state of an llm knows when it’s lying

    Amos Azaria and Tom Mitchell. The internal state of an llm knows when it’s lying. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 967–976, 2023

  3. [3]

    Inside: Llms’ internal states retain the power of hallucination detection

    Chao Chen, Kai Liu, Ze Chen, Yi Gu, Yue Wu, Mingyuan Tao, Zhihang Fu, and Jieping Ye. Inside: Llms’ internal states retain the power of hallucination detection. arXiv preprint arXiv:2402.03744, 2024

  4. [4]

    Going beyond neural network feature similarity: The network feature complexity and its interpretation using category theory

    Yiting Chen, Zhanpeng Zhou, and Junchi Yan. Going beyond neural network feature similarity: The network feature complexity and its interpretation using category theory. arXiv preprint arXiv:2310.06756, 2023

  5. [5]

    Lookback lens: Detecting and mitigating contextual hallucinations in large language models using only attention maps

    Yung-Sung Chuang, Linlu Qiu, Cheng-Yu Hsieh, Ranjay Krishna, Yoon Kim, and James Glass. Lookback lens: Detecting and mitigating contextual hallucinations in large language models using only attention maps. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 1419–1436, 2024

  6. [6]

    Training Verifiers to Solve Math Word Problems

    Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021

  7. [7]

    Statistical comparisons of classifiers over multiple data sets

    Janez Demšar. Statistical comparisons of classifiers over multiple data sets. Journal of Machine learning research, 7(Jan):1–30, 2006

  8. [8]

    Haloscope: Harnessing unlabeled llm generations for hallucination detection

    Xuefeng Du, Chaowei Xiao, and Yixuan Li. Haloscope: Harnessing unlabeled llm generations for hallucination detection. Advances in Neural Information Processing Systems, 37:102948–102972, 2024

  9. [9]

    Drop: A reading comprehension benchmark requiring discrete reasoning over paragraphs

    Dheeru Dua, Yizhong Wang, Pradeep Dasigi, Gabriel Stanovsky, Sameer Singh, and Matt Gardner. Drop: A reading comprehension benchmark requiring discrete reasoning over paragraphs. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Paper...

  10. [10]

    Do llms know about hallucination? an empirical investigation of llm’s hidden states

    Hanyu Duan, Yi Yang, and Kar Yan Tam. Do llms know about hallucination? an empirical investigation of llm’s hidden states. arXiv preprint arXiv:2402.09733, 2024

  11. [11]

    Shifting attention to relevance: Towards the predictive uncertainty quantification of free-form large language models

    Jinhao Duan, Hao Cheng, Shiqi Wang, Alex Zavalny, Chenan Wang, Renjing Xu, Bhavya Kailkhura, and Kaidi Xu. Shifting attention to relevance: Towards the predictive uncertainty quantification of free-form large language models. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 5050–5063,...

  12. [12]

    Detecting hallucinations in large language models using semantic entropy

    Sebastian Farquhar, Jannik Kossen, Lorenz Kuhn, and Yarin Gal. Detecting hallucinations in large language models using semantic entropy. Nature, 630(8017):625–630, 2024

  13. [13]

    A survey on hallucination in large language models: Principles, taxonomy, challenges, and open questions

    Lei Huang, Weijiang Yu, Weitao Ma, Weihong Zhong, Zhangyin Feng, Haotian Wang, Qianglong Chen, Weihua Peng, Xiaocheng Feng, Bing Qin, et al. A survey on hallucination in large language models: Principles, taxonomy, challenges, and open questions. ACM Transactions on Information Systems, 43(2):1–55, 2025

  14. [14]

    Look before you leap: An exploratory study of uncertainty measurement for large language models

    Yuheng Huang, Jiayang Song, Zhijie Wang, Shengming Zhao, Huaming Chen, Felix Juefei-Xu, and Lei Ma. Look before you leap: An exploratory study of uncertainty measurement for large language models. arXiv preprint arXiv:2307.10236, 2023

  15. [15]

    Look before you leap: An exploratory study of uncertainty analysis for large language models

    Yuheng Huang, Jiayang Song, Zhijie Wang, Shengming Zhao, Huaming Chen, Felix Juefei-Xu, and Lei Ma. Look before you leap: An exploratory study of uncertainty analysis for large language models. IEEE Transactions on Software Engineering, 51(2):413–429, 2025

  16. [16]

    The illusion of progress: Re-evaluating hallucination detection in llms

    Denis Janiak, Jakub Binkowski, Albert Sawczyn, Bogdan Gabrys, Ravid Shwartz-Ziv, and Tomasz Kajdanowicz. The illusion of progress: Re-evaluating hallucination detection in llms. arXiv preprint arXiv:2508.08285, 2025

  17. [17]

    Perplexity—a measure of the difficulty of speech recognition tasks

    Fred Jelinek, Robert L Mercer, Lalit R Bahl, and James K Baker. Perplexity—a measure of the difficulty of speech recognition tasks. The journal of the Acoustical Society of America, 62(S1):S63–S63, 1977

  18. [18]

    Survey of hallucination in natural language generation

    Ziwei Ji, Nayeon Lee, Rita Frieske, Tiezheng Yu, Dan Su, Yan Xu, Etsuko Ishii, Ye Jin Bang, Andrea Madotto, and Pascale Fung. Survey of hallucination in natural language generation. ACM computing surveys, 55(12):1–38, 2023

  19. [19]

    Albert Q. Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, Lélio Renard Lavaud, Marie-Anne Lachaux, Pierre Stock, Teven Le Scao, Thibaut Lavril, Thomas Wang, Timothée Lacroix, and William El Sayed. Mistral 7b, 2023. URL https://arxi...

  20. [20]

    Triviaqa: A large scale distantly supervised challenge dataset for reading comprehension

    Mandar Joshi, Eunsol Choi, Daniel S Weld, and Luke Zettlemoyer. Triviaqa: A large scale distantly supervised challenge dataset for reading comprehension. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1601–1611, 2017

  21. [21]

    Language Models (Mostly) Know What They Know

    Saurav Kadavath, Tom Conerly, Amanda Askell, Tom Henighan, Dawn Drain, Ethan Perez, Nicholas Schiefer, Zac Hatfield-Dodds, Nova DasSarma, Eli Tran-Johnson, et al. Language models (mostly) know what they know. arXiv preprint arXiv:2207.05221, 2022

  22. [22]

    Why Language Models Hallucinate

    Adam Tauman Kalai, Ofir Nachum, Santosh S Vempala, and Edwin Zhang. Why language models hallucinate. arXiv preprint arXiv:2509.04664, 2025

  23. [23]

    Bioasq-qa: A manually curated corpus for biomedical question answering

    Anastasia Krithara, Anastasios Nentidis, Konstantinos Bougiatiotis, and Georgios Paliouras. Bioasq-qa: A manually curated corpus for biomedical question answering. Scientific data, 10(1):170, 2023

  24. [24]

    Conformal prediction with large language models for multi-choice question answering

    Bhawesh Kumar, Charlie Lu, Gauri Gupta, Anil Palepu, David Bellamy, Ramesh Raskar, and Andrew Beam. Conformal prediction with large language models for multi-choice question answering. arXiv preprint arXiv:2305.18404, 2023

  25. [25]

    Natural questions: a benchmark for question answering research

    Tom Kwiatkowski, Jennimaria Palomaki, Olivia Redfield, Michael Collins, Ankur Parikh, Chris Alberti, Danielle Epstein, Illia Polosukhin, Jacob Devlin, Kenton Lee, et al. Natural questions: a benchmark for question answering research. Transactions of the Association for Computational Linguistics, 7:453–466, 2019

  26. [26]

    Uncertainty estimation in autoregressive structured prediction

    Andrey Malinin and Mark Gales. Uncertainty estimation in autoregressive structured prediction. arXiv preprint arXiv:2002.07650, 2020

  27. [27]

    Selfcheckgpt: Zero-resource black-box hallucination detection for generative large language models

    Potsawee Manakul, Adian Liusie, and Mark Gales. Selfcheckgpt: Zero-resource black-box hallucination detection for generative large language models. In Proceedings of the 2023 conference on empirical methods in natural language processing, pages 9004–9017, 2023

  28. [28]

    On faithfulness and factuality in abstractive summarization

    Joshua Maynez, Shashi Narayan, Bernd Bohnet, and Ryan McDonald. On faithfulness and factuality in abstractive summarization. arXiv preprint arXiv:2005.00661, 2020

  29. [29]

    Fine-grained hallucination detection and editing for language models

    Abhika Mishra, Akari Asai, Vidhisha Balachandran, Yizhong Wang, Graham Neubig, Yulia Tsvetkov, and Hannaneh Hajishirzi. Fine-grained hallucination detection and editing for language models. arXiv preprint arXiv:2401.06855, 2024

  30. [30]

    Learned hallucination detection in black-box llms using token-level entropy production rate

    Charles Moslonka, Hicham Randrianarivo, Arthur Garnier, and Emmanuel Malherbe. Learned hallucination detection in black-box llms using token-level entropy production rate. In European Conference on Information Retrieval, pages 115–130. Springer, 2026

  31. [31]

    Kernel language entropy: Fine-grained uncertainty quantification for llms from semantic similarities

    Alexander Nikitin, Jannik Kossen, Yarin Gal, and Pekka Marttinen. Kernel language entropy: Fine-grained uncertainty quantification for llms from semantic similarities. Advances in Neural Information Processing Systems, 37:8901–8929, 2024

  32. [32]

    Robust hallucination detection in llms via adaptive token selection

    Mengjia Niu, Hamed Haddadi, and Guansong Pang. Robust hallucination detection in llms via adaptive token selection. arXiv preprint arXiv:2504.07863, 2025

  33. [33]

    Real-time detection of hallucinated entities in long-form generation

    Oscar Obeso, Andy Arditi, Javier Ferrando, Joshua Freeman, Cameron Holmes, and Neel Nanda. Real-time detection of hallucinated entities in long-form generation. arXiv preprint arXiv:2509.03531, 2025

  34. [34]

    Hallucination detection using multi-view attention features

    Yuya Ogasa and Yuki Arase. Hallucination detection using multi-view attention features. arXiv e-prints, pages arXiv–2504, 2025

  35. [35]

    Arkil Patel, Satwik Bhattamishra, and Navin Goyal. Are nlp models really able to solve simple math word problems? In Proceedings of the 2021 conference of the North American chapter of the association for computational linguistics: human language technologies, pages 2080–2094, 2021

  36. [36]

    Large language model confidence estimation via black-box access

    Tejaswini Pedapati, Amit Dhurandhar, Soumya Ghosh, Soham Dan, and Prasanna Sattigeri. Large language model confidence estimation via black-box access. arXiv preprint arXiv:2406.04370, 2024

  37. [37]

    Mauve: Measuring the gap between neural text and human text using divergence frontiers

    Krishna Pillutla, Swabha Swayamdipta, Rowan Zellers, John Thickstun, Sean Welleck, Yejin Choi, and Zaid Harchaoui. Mauve: Measuring the gap between neural text and human text using divergence frontiers. Advances in Neural Information Processing Systems, 34:4816–4828, 2021

  38. [38]

    Victor Quach, Adam Fisch, Tal Schuster, Adam Yala, Jae Ho Sohn, Tommi S Jaakkola, and Regina Barzilay. Conformal language modeling. arXiv preprint arXiv:2306.10193, 2023

  39. [39]

    Pranav Rajpurkar, Robin Jia, and Percy Liang. Know what you don’t know: Unanswerable questions for SQuAD. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 784–789, 2018

  40. [40]

    Siva Reddy, Danqi Chen, and Christopher D Manning. CoQA: A conversational question answering challenge. Transactions of the Association for Computational Linguistics, 7:249–266, 2019

  41. [41]

    Jie Ren, Jiaming Luo, Yao Zhao, Kundan Krishna, Mohammad Saleh, Balaji Lakshminarayanan, and Peter J Liu. Out-of-distribution detection and selective generation for conditional language models. arXiv preprint arXiv:2209.15558, 2022

  42. [42]

    Arjun Shah, Hetansh Shah, Vedica Bafna, Charmi Khandor, and Sindhu Nair. Validation and extraction of reliable information through automated scraping and natural language inference. Engineering Applications of Artificial Intelligence, 147:110284, 2025

  43. [43]

    Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023

  44. [44]

    George Tsatsaronis, Georgios Balikas, Prodromos Malakasiotis, Ioannis Partalas, Matthias Zschunke, Michael R Alvers, Dirk Weissenborn, Anastasia Krithara, Sergios Petridis, Dimitris Polychronopoulos, et al. An overview of the BioASQ large-scale biomedical semantic indexing and question answering competition. BMC Bioinformatics, 16(1):138, 2015

  45. [45]

    Neeraj Varshney, Wenlin Yao, Hongming Zhang, Jianshu Chen, and Dong Yu. A stitch in time saves nine: Detecting and mitigating hallucinations of LLMs by validating low-confidence generation. arXiv preprint arXiv:2307.03987, 2023

  46. [46]

    Jerry Wei, Chengrun Yang, Xinying Song, Yifeng Lu, Nathan Hu, Jie Huang, Dustin Tran, Daiyi Peng, Ruibo Liu, Da Huang, et al. Long-form factuality in large language models. Advances in Neural Information Processing Systems, 37:80756–80827, 2024

  47. [47]

    Zhiqiu Xia, Jinxuan Xu, Yuqian Zhang, and Hang Liu. A survey of uncertainty estimation methods on large language models. In Findings of the Association for Computational Linguistics: ACL 2025, pages 21381–21396, 2025

  48. [48]

    Ziwei Xu, Sanjay Jain, and Mohan Kankanhalli. Hallucination is inevitable: An innate limitation of large language models. arXiv preprint arXiv:2401.11817, 2024

    A Theory for Calibration

    In this section, we expand on the theory supporting our calibration algorithm, and prove the corresponding theorems stated in the main text. We begin by setting up the pr...

  49. [49]

    Full Experimental Results: We report the full experimental grids with interquartile ranges (IQRs) for each experiment in Appendix E.2

  50. [50]

    Independence Assumption: In Appendix E.3, we show that the median lag-1 autocorrelation is ρ1 = 0.061; 15.8% of the sequences exhibit |ρ1| > 0.3, while 80.5% fall within white-noise bounds, indicating low autocorrelation for most samples

  51. [51]

    Consistently Significant KS Test for Hallucinated versus Non-Hallucinated Generations: In Appendix E.4, we find that the power of the KS test increases monotonically with sample size; 65% of all 3,200 resamples achieve significance. Specifically, we show that for all datasets and model pairs, as we increase the i.i.d. sample size, hallucinated vs. faithful gene...

  52. [52]

    CES Variants: In Appendix E.7, we exhaustively evaluate 44 variants combining different entropy summaries (mean, median, max, q25, q75) under both arithmetic and geometric aggregation. We find that CES outperforms the other combinations

  53. [53]

    Robustness Results: In Appendix E.8 and Appendix E.9, we analyze the impact of adding noisy samples to the calibration distribution and find that the performance of CES remains robust

  54. [54]

    CES is competitive: In Appendix E.10, we conclude that CES (unsupervised) ranks 3rd out of 17 methods by average rank (6.29) across all 80 experiments, belongs to the top statistical clique (Nemenyi CD = 2.78), and is significantly better than 7 of the 16 other methods after Holm–Bonferroni correction. Unlike the top-performing methods (KLE, Embedding Regression) t...

  55. [55]

    API model generalisation: CES transfers effectively to API-only models (GPT-4.1 family), achieving median AUROC 0.669 on API models versus 0.642 on open-weight models (Mann-Whitney p = 0.060, KS D = 0.323, p = 0.031). This is analyzed in Appendix E.11

  56. [56]

    Empirical Validation of Error Bounds: In Appendix E.12, we validate Theorem 5, confirming the theorem’s prediction of exponential decay rates for Type I and Type II errors.

    9. Example Outputs: We show example outputs from our generations in Appendix E.13.

    E.2 Main Experimental Grid

    Llama-2-13b-chat-hf, Llama-2-7b-chat-hf, Llama-3.2-1B-Instruct, Meta-Llama-3-8B-In...