Mind the Unseen Mass: Unmasking LLM Hallucinations via Soft-Hybrid Alphabet Estimation
Pith reviewed 2026-05-10 02:28 UTC · model grok-4.3
The pith
A hybrid estimator fuses coverage statistics with graph spectral traces to better count distinct meanings in small samples from language models.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
SHADE estimates semantic alphabet size by adaptively fusing Generalized Good-Turing coverage with the heat-kernel trace of the normalized Laplacian on an entailment-weighted graph over sampled responses. High coverage triggers a convex combination of the two signals; low coverage applies LogSumExp fusion to emphasize missing modes. A finite-sample correction stabilizes the cardinality estimate, which then yields a coverage-adjusted semantic entropy score for black-box uncertainty quantification.
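Read literally, this coverage-gated fusion admits a minimal sketch like the following. The threshold `tau`, weight `lam`, and temperature `beta` are illustrative placeholders, not values from the paper, and the two input counts stand in for the Good-Turing and spectral signals the method actually computes.

```python
import math

def good_turing_coverage(counts):
    """Good-Turing coverage estimate: 1 - (#singletons / n).

    `counts` maps each observed semantic cluster to its frequency.
    """
    n = sum(counts.values())
    f1 = sum(1 for c in counts.values() if c == 1)
    return 1.0 - f1 / n

def shade_cardinality(coverage, n_freq, n_spec, tau=0.8, lam=0.5, beta=1.0):
    """Fuse a frequency-based count and a spectral count.

    High coverage: convex combination of the two signals.
    Low coverage: LogSumExp fusion, which is pulled toward the
    larger signal and so emphasizes possibly missing modes.
    tau, lam, beta are illustrative assumptions.
    """
    if coverage >= tau:
        return lam * n_freq + (1 - lam) * n_spec
    # LogSumExp at temperature beta; always >= max(n_freq, n_spec)
    return math.log(math.exp(beta * n_freq) + math.exp(beta * n_spec)) / beta

counts = {"paris": 4, "london": 1, "rome": 1}   # 2 singletons among 6 samples
cov = good_turing_coverage(counts)               # 1 - 2/6
est = shade_cardinality(cov, n_freq=3.0, n_spec=4.0)
```

Note the asymmetry this encodes: the convex rule can only interpolate between the two counts, while the LogSumExp rule can exceed both, which is the intended behavior when much probability mass is unseen.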
What carries the argument
The entailment-weighted graph over sampled responses, whose normalized Laplacian supplies a heat-kernel trace that is fused with Generalized Good-Turing coverage; the coverage value itself selects between convex-combination and LogSumExp fusion rules.
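The paper does not spell out the spectral quantity, but a standard reading of "heat-kernel trace of the normalized Laplacian" is sketched below; the diffusion time `t` is an assumed parameter. For well-separated semantic clusters, each connected component contributes an eigenvalue near zero, so the trace approximates the number of clusters.

```python
import numpy as np

def heat_kernel_trace(W, t=1.0):
    """Trace of exp(-t * L_sym) for a symmetric weighted adjacency W.

    L_sym = I - D^{-1/2} W D^{-1/2} is the normalized Laplacian.
    Each near-zero eigenvalue (one per connected component of the
    entailment graph) contributes ~1 to the trace; larger eigenvalues
    are damped by exp(-t * lambda).
    """
    d = W.sum(axis=1)
    d_inv_sqrt = np.zeros_like(d)
    nz = d > 0
    d_inv_sqrt[nz] = d[nz] ** -0.5
    L = np.eye(len(W)) - d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]
    eigvals = np.linalg.eigvalsh(L)   # L is symmetric
    return float(np.exp(-t * eigvals).sum())

# Two disconnected pairs of mutually entailing responses:
W = np.array([[0., 1., 0., 0.],
              [1., 0., 0., 0.],
              [0., 0., 0., 1.],
              [0., 0., 1., 0.]])
trace = heat_kernel_trace(W, t=5.0)  # close to 2 for large t
```

This also illustrates the referee's minor comment: without the paper stating `t` and any edge-weight scaling, the spectral count is underdetermined.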
If this is right
- Uncertainty scores derived from SHADE improve detection of incorrect QA answers most strongly when sampling budgets are tight.
- The hybrid fusion reduces undercounting of rare semantic modes compared with pure frequency or pure spectral estimators.
- As sample size grows, the advantage over simpler estimators shrinks, consistent with the method targeting the low-sample regime.
- The coverage-adjusted semantic entropy provides a practical proxy for downstream risk under black-box access constraints.
Where Pith is reading between the lines
- The same coverage-triggered fusion logic could be tested on other generative tasks such as code or image captioning where semantic modes are costly to sample exhaustively.
- Adaptive sampling strategies might stop early once estimated coverage exceeds a threshold, reducing query cost while preserving estimate quality.
- If the entailment graph construction generalizes across model families, the estimator could serve as a lightweight post-hoc check on consistency without retraining.
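The early-stopping idea in the second bullet can be sketched as a coverage-gated sampling loop. Everything here is hypothetical scaffolding: `sample_fn`, `cluster_fn`, and the thresholds are not from the paper.

```python
def sample_until_covered(sample_fn, cluster_fn, cov_threshold=0.9,
                         min_n=3, max_n=50):
    """Draw responses until Good-Turing coverage exceeds a threshold.

    sample_fn() returns one response; cluster_fn(responses) returns a
    list of semantic-cluster labels, one per response. All names and
    thresholds are illustrative assumptions.
    """
    responses = []
    while len(responses) < max_n:
        responses.append(sample_fn())
        if len(responses) < min_n:
            continue
        labels = cluster_fn(responses)
        counts = {}
        for lab in labels:
            counts[lab] = counts.get(lab, 0) + 1
        singletons = sum(1 for c in counts.values() if c == 1)
        coverage = 1.0 - singletons / len(labels)
        if coverage >= cov_threshold:
            break
    return responses
```

A query whose responses keep collapsing into one cluster stops at `min_n`, while a query that keeps producing new singleton meanings runs to `max_n`, which is exactly the budget-allocation behavior the bullet describes.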
Load-bearing premise
The entailment-weighted graph built from the sampled responses correctly identifies distinct semantic modes, and the coverage estimate accurately chooses the right fusion rule without biasing the final count of meanings.
What would settle it
Collect a very large reference set of responses for the same queries to establish ground-truth semantic mode counts, then check whether SHADE's low-sample estimates deviate systematically from those counts in the small-sample regime.
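The proposed test can be phrased as measuring the signed bias of a small-sample estimator against the mode count of a large reference pool. The sketch below uses a deliberately naive estimator (distinct labels in the sample) to show the undercounting pattern SHADE is meant to correct; the pool composition and sample size are invented for illustration.

```python
import random
import statistics

def reference_mode_count(reference_labels):
    """Ground-truth number of semantic modes in a large reference pool."""
    return len(set(reference_labels))

def small_sample_bias(reference_labels, estimator, n_small=5,
                      trials=200, seed=0):
    """Mean signed error of a small-sample estimator against the
    reference mode count. A systematically negative value indicates
    undercounting of rare modes in the small-sample regime."""
    rng = random.Random(seed)
    truth = reference_mode_count(reference_labels)
    errors = []
    for _ in range(trials):
        sample = [rng.choice(reference_labels) for _ in range(n_small)]
        errors.append(estimator(sample) - truth)
    return statistics.mean(errors)

# One dominant mode and two rare ones, as in a skewed answer distribution:
pool = ["a"] * 90 + ["b"] * 8 + ["c"] * 2
bias = small_sample_bias(pool, lambda s: len(set(s)))  # negative: undercounts
```

Running the same protocol with SHADE in place of the naive estimator, across the small-n grid, is the deviation check this section calls for.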
Original abstract
This paper studies uncertainty quantification for large language models (LLMs) under black-box access, where only a small number of responses can be sampled for each query. In this setting, estimating the effective semantic alphabet size--that is, the number of distinct meanings expressed in the sampled responses--provides a useful proxy for downstream risk. However, frequency-based estimators tend to undercount rare semantic modes when the sample size is small, while graph-spectral quantities alone are not designed to estimate semantic occupancy accurately. To address this issue, we propose SHADE (Soft-Hybrid Alphabet Dynamic Estimator), a simple and interpretable estimator that combines Generalized Good-Turing coverage with a heat-kernel trace of the normalized Laplacian constructed from an entailment-weighted graph over sampled responses. The estimated coverage adaptively determines the fusion rule: under high coverage, SHADE uses a convex combination of the two signals, while under low coverage it applies a LogSumExp fusion to emphasize missing or weakly observed semantic modes. A finite-sample correction is then introduced to stabilize the resulting cardinality estimate before converting it into a coverage-adjusted semantic entropy score. Experiments on pooled semantic alphabet-size estimation against large-sample references and on QA incorrectness detection show that SHADE achieves the strongest improvements in the most sample-limited regime, while the performance gap narrows as the number of samples increases. These results suggest that hybrid semantic occupancy estimation is particularly beneficial when black-box uncertainty quantification must operate under tight sampling budgets.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes SHADE (Soft-Hybrid Alphabet Dynamic Estimator), a hybrid method for estimating the effective semantic alphabet size (number of distinct meanings) in LLM responses under black-box access with small sample sizes. It fuses a Generalized Good-Turing coverage estimate with the heat-kernel trace of the normalized Laplacian on an entailment-weighted graph over sampled responses; coverage level selects between convex-combination fusion (high coverage) and LogSumExp fusion (low coverage), followed by a finite-sample correction to yield a cardinality estimate that is converted into a coverage-adjusted semantic entropy score. Experiments on pooled alphabet-size estimation against large-sample references and on QA incorrectness detection report the largest gains in the most sample-limited regimes, with the gap narrowing as sample size grows.
Significance. If the central claims hold, the work addresses a practically important gap in black-box LLM uncertainty quantification by providing an interpretable estimator that targets unseen semantic modes when sampling budgets are tight; this could improve hallucination proxies via better semantic-occupancy estimates. The adaptive fusion of frequency-based and graph-spectral signals is a reasonable idea for small-n settings, and the empirical focus on sample-limited regimes is well-motivated. However, the absence of explicit formulas, derivations, or stability analysis for the coverage-driven fusion and finite-sample correction makes it difficult to assess whether the reported gains are robust or artifactual.
Major comments (3)
- [Abstract] The finite-sample correction is described only at a high level, with no explicit formula, derivation, or bias analysis. Because this correction is applied after the coverage-dependent fusion and is claimed to stabilize the cardinality estimate precisely in the small-n regime where gains are largest, its absence prevents verification that the estimator does not reduce to a fitted or self-referential quantity.
- [Abstract, §3 (method)] The coverage estimate, obtained via Generalized Good-Turing on the same small sample used to build the entailment graph, selects between convex and LogSumExp fusion; yet coverage-estimation variance is highest precisely when n is smallest, so mis-selection can systematically bias the final cardinality in the direction opposite to the intended correction for unseen modes. No analytic bound, threshold-stability analysis, or ablation for n ≤ 10 is provided, undermining the central claim that SHADE achieves its strongest improvements in the sample-limited regime.
- [Experiments] The pooled semantic alphabet-size results and the QA incorrectness detection both rely on the entailment-weighted graph accurately capturing distinct semantic modes and on the coverage scalar reliably indicating the fusion rule without introducing bias. This load-bearing assumption is not probed: no sensitivity analysis or alternative graph-construction ablations are reported to confirm that the observed gains survive perturbations to the entailment threshold or graph construction.
Minor comments (3)
- [Abstract] The acronym SHADE is introduced without an initial expansion in the title or first sentence.
- Notation: the precise definition of the heat-kernel trace and the normalized Laplacian on the entailment graph should be given explicitly (including any temperature or scaling parameters) rather than left at the level of 'heat-kernel trace of the normalized Laplacian'.
- Missing references: prior work on Good-Turing estimators for unseen species and on graph-based semantic clustering for LLM responses should be cited to clarify the incremental contribution.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address each major comment point by point below and have revised the manuscript to improve clarity and add supporting analyses where feasible.
Point-by-point responses
-
Referee: [Abstract] The finite-sample correction is described only at a high level, with no explicit formula, derivation, or bias analysis. Because this correction is applied after the coverage-dependent fusion and is claimed to stabilize the cardinality estimate precisely in the small-n regime where gains are largest, its absence prevents verification that the estimator does not reduce to a fitted or self-referential quantity.
Authors: We agree that the finite-sample correction was presented only at a high level in the original abstract. In the revised manuscript we have added the explicit formula, its derivation based on the expected unseen mass, and a short bias analysis to Section 3. The correction is a post-fusion multiplicative adjustment that depends only on the coverage scalar and is therefore not self-referential. Revision: yes.
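The rebuttal's description (a post-fusion multiplicative adjustment depending only on the coverage scalar) admits a minimal reading like the following; the reciprocal-coverage form is an assumption for illustration, not the paper's stated formula.

```python
def corrected_cardinality(fused_estimate, coverage, eps=1e-6):
    """Post-fusion multiplicative correction: inflate the fused count
    by the reciprocal of estimated coverage, so that low coverage
    (much unseen probability mass) raises the estimate. The 1/C form
    is an illustrative assumption, not the paper's exact formula;
    eps guards against a degenerate zero-coverage estimate.
    """
    return fused_estimate / max(coverage, eps)

# Coverage 0.8 means ~20% unseen mass, so the count is inflated 1.25x.
```

Whatever the exact form, the referee's verification concern is addressed only if the correction's inputs are, as here, limited to quantities computed before fusion.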
-
Referee: [Abstract, §3 (method)] The coverage estimate, obtained via Generalized Good-Turing on the same small sample used to build the entailment graph, selects between convex and LogSumExp fusion; yet coverage-estimation variance is highest precisely when n is smallest, so mis-selection can systematically bias the final cardinality in the direction opposite to the intended correction for unseen modes. No analytic bound, threshold-stability analysis, or ablation for n ≤ 10 is provided, undermining the central claim that SHADE achieves its strongest improvements in the sample-limited regime.
Authors: We acknowledge the risk of mis-selection arising from variance in the coverage estimate at small n. We have added an ablation study restricted to n ≤ 10 that compares the adaptive rule against fixed convex and fixed LogSumExp fusions and reports the observed selection frequencies. These results show that SHADE retains its advantage even under noisy coverage estimates. A full analytic bound on selection stability is non-trivial and is noted as future work. Revision: partial.
-
Referee: [Experiments] The pooled semantic alphabet-size results and the QA incorrectness detection both rely on the entailment-weighted graph accurately capturing distinct semantic modes and on the coverage scalar reliably indicating the fusion rule without introducing bias. This load-bearing assumption is not probed: no sensitivity analysis or alternative graph-construction ablations are reported to confirm that the observed gains survive perturbations to the entailment threshold or graph construction.
Authors: We agree that sensitivity to graph construction merits explicit verification. The revised Experiments section now includes ablations that vary the entailment threshold over [0.6, 0.95] and substitute an embedding-similarity graph for the entailment graph. The performance gains of SHADE remain consistent across these variants. Revision: yes.
- Remaining gap (deferred to future work): a full analytic bound or threshold-stability analysis for coverage-driven fusion selection at n ≤ 10.
Circularity Check
No significant circularity in SHADE derivation chain
Full rationale
The abstract and described method define SHADE as an explicit combination of Generalized Good-Turing coverage (estimated from samples) with heat-kernel trace on an entailment graph, using the coverage scalar to select between convex combination and LogSumExp fusion before a finite-sample correction. No equation or step reduces the final cardinality estimate to a fitted parameter, self-referential quantity, or prior self-citation by construction. The fusion rule is data-driven but defined externally to the target estimate, and no load-bearing uniqueness theorem or ansatz is imported via self-citation. The derivation remains self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
Axioms (2)
- Domain assumption: An entailment-weighted graph over sampled responses meaningfully represents semantic occupancy.
- Ad hoc to this paper: Coverage level is a reliable indicator for choosing between convex and LogSumExp fusion without introducing systematic bias.
Reference graph
Works this paper leans on
- [1] Thu Bui, Carola-Bibiane Schönlieb, Bruno Ribeiro, Beatrice Bevilacqua, and Moshe Eliasof. On the effectiveness of random weights in graph neural networks. arXiv preprint arXiv:2502.00190, 2025.
- [2] Anne Chao and Tsung-Jen Shen. Nonparametric estimation of Shannon's index of diversity when there are unseen species in a sample. Environmental and Ecological Statistics, 10:429–443, 2003.
- [3] Chao Chen, Kai Liu, Ze Chen, Yi Gu, Yue Wu, Mingyuan Tao, Zhihang Fu, and Jieping Ye. INSIDE: LLMs' internal states retain the power of hallucination detection. arXiv preprint arXiv:2402.03744, 2024.
- [4] Jiuhai Chen and Jonas Mueller. Quantifying uncertainty in answers from any language model and enhancing their trustworthiness. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 5186–5200, 2024.
- [5] Fan R. K. Chung. Spectral Graph Theory. Number 92 in CBMS Regional Conference Series in Mathematics. American Mathematical Society, Providence, RI, 1997.
- [6] Kangning Cui, Ruoning Li, Sam L. Polk, Yinyi Lin, Hongsheng Zhang, James M. Murphy, Robert J. Plemmons, and Raymond H. Chan. Superpixel-based and spatially regularized diffusion learning for unsupervised hyperspectral image clustering. IEEE Transactions on Geoscience and Remote Sensing, 62:1–18, 2024.
- [7] Kangning Cui, Wei Tang, Rongkun Zhu, Manqi Wang, Gregory D. Larsen, Victor P. Pauca, Sarra Alqahtani, Fan Yang, David Segurado, Paul Fine, et al. Efficient localization and spatial distribution modeling of canopy palms using UAV imagery. IEEE Transactions on Geoscience and Remote Sensing, 2025.
- [8] Tyler Derr, Yao Ma, and Jiliang Tang. Signed graph convolutional networks. In 2018 IEEE International Conference on Data Mining (ICDM), pages 929–934, 2018.
- [9] Sebastian Farquhar, Jannik Kossen, Lorenz Kuhn, and Yarin Gal. Detecting hallucinations in large language models using semantic entropy. Nature, 630:625–630, 2024.
- [10] Wenzheng Feng, Jie Zhang, Yuxiao Dong, Yu Han, Huanbo Luan, Qian Xu, Qiang Yang, Evgeny Kharlamov, and Jie Tang. Graph random neural networks for semi-supervised learning on graphs. arXiv: Learning, 2020.
- [11] I. J. Good. The population frequencies of species and the estimation of population parameters. Biometrika, 40(3/4):237–264, 1953.
- [12] Pengcheng He, Jianfeng Gao, and Weizhu Chen. DeBERTaV3: Improving DeBERTa using ELECTRA-style pre-training with gradient-disentangled embedding sharing. CoRR, abs/2111.09543, 2021.
- [13] Changqin Huang, Ming Li, Feilong Cao, Hamido Fujita, Zhao Li, and Xindong Wu. Are graph convolutional networks with random weights feasible? IEEE Transactions on Pattern Analysis and Machine Intelligence, 45:2751–2768, 2022.
- [14] Ziwei Ji, Nayeon Lee, Rita Frieske, Tiezheng Yu, Dan Su, Yan Xu, Etsuko Ishii, Ye Jin Bang, Andrea Madotto, and Pascale Fung. Survey of hallucination in natural language generation. ACM Computing Surveys, 55(12):1–38, 2023.
- [15] Albert Q. Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de Las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, Lélio Renard Lavaud, Marie-Anne Lachaux, Pierre Stock, Teven Le Scao, Thibaut Lavril, Thomas Wang, Timothée Lacroix, and William El Sayed. Mistral 7B. CoRR, abs/2310.06825, 2023.
- [16] Mandar Joshi, Eunsol Choi, Daniel S. Weld, and Luke Zettlemoyer. TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension. https://doi.org/10.21227/de50-f985, April 2025.
- [17] Jannik Kossen, Jiatong Han, Muhammed Razzak, Lisa Schut, Shreshth A. Malik, and Yarin Gal. Semantic entropy probes: Robust and cheap hallucination detection in LLMs. arXiv preprint arXiv:2406.15927, 2024.
- [18] Lorenz Kuhn, Yarin Gal, and Sebastian Farquhar. Semantic uncertainty: Linguistic invariances for uncertainty estimation in natural language generation. In The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1–5, 2023. OpenReview.net, 2023.
- [19] Lucie Kunitomo-Jacquin, Edison Marrese-Taylor, Ken Fukuda, and Masahiro Hamasaki. Evidential semantic entropy for LLM uncertainty quantification. In Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers), pages 7107–7122, 2026.
- [20] Tom Kwiatkowski, Jennimaria Palomaki, Olivia Redfield, Michael Collins, Ankur P. Parikh, Chris Alberti, Danielle Epstein, Illia Polosukhin, Jacob Devlin, Kenton Lee, Kristina Toutanova, Llion Jones, Matthew Kelcey, Ming-Wei Chang, Andrew M. Dai, Jakob Uszkoreit, Quoc Le, and Slav Petrov. Natural Questions: a benchmark for question answering research. Trans. Assoc. Comput. Linguistics, 2019.
- [21] Zhaoye Li, Siyuan Shen, Wenjing Yang, Ruochun Jin, Huan Chen, Ligong Cao, and Jing Ren. Enhancing uncertainty quantification in large language models through semantic graph density. In Conference on Uncertainty in Artificial Intelligence, 2025.
- [22] Xiaoou Liu, Tiejin Chen, Longchao Da, Chacha Chen, Zhen-Yu Lin, and Hua Wei. Uncertainty quantification and confidence calibration in large language models: A survey. In Proceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining V.2, 2025.
- [23] Lucas H. McCabe, Rimon Melamed, Thomas Hartvigsen, and H. Howie Huang. Estimating semantic alphabet size for LLM uncertainty quantification. arXiv preprint arXiv:2509.14478, 2025.
- [24] Microsoft. Phi-3 technical report: A highly capable language model locally on your phone. CoRR, abs/2404.14219, 2024.
- [25] George A. Miller. Note on the bias of information estimates. IRE Transactions on Information Theory, 2(2):190–190, 1955.
- [26] Dang Nguyen, Ali Payani, and Baharan Mirzasoleiman. Beyond semantic entropy: Boosting LLM uncertainty quantification with pairwise semantic similarity. In Findings of the Association for Computational Linguistics: ACL 2025, pages 4530–4540, 2025.
- [27] Alexander Nikitin, Jannik Kossen, Yarin Gal, and Pekka Marttinen. Kernel language entropy: Fine-grained uncertainty quantification for LLMs from semantic similarities. arXiv preprint arXiv:2405.20003, 2024.
- [28] Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. SQuAD: 100,000+ questions for machine comprehension of text. In Jian Su, Xavier Carreras, and Kevin Duh, editors, Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, EMNLP 2016, Austin, Texas, USA, November 1–4, 2016, pages 2383–2392. The Association for Computational Linguistics, 2016.
- [29] Vipula Rawte, Amit Sheth, and Amitava Das. A survey of hallucination in large foundation models. arXiv preprint arXiv:2309.05922, 2023.
- [30] Siva Reddy, Danqi Chen, and Christopher D. Manning. CoQA: A conversational question answering challenge. Trans. Assoc. Comput. Linguistics, 7:249–266, 2019.
- [31] Olaoluwa Shorinwa, Zhiting Mei, Justin Lidard, Allen Z. Ren, and Anirudha Majumdar. A survey on uncertainty quantification of large language models: Taxonomy, open research challenges, and future directions. ACM Computing Surveys, 58:1–38, 2024.
- [32] Qiyao Sun, Xingming Li, Xixiang He, Ao Cheng, Xuanyu Ji, Hailun Lu, Runke Huang, and Qingyong Hu. Efficient hallucination detection: Adaptive Bayesian estimation of semantic entropy with guided semantic exploration. In Sven Koenig, Chad Jenkins, and Matthew E. Taylor, editors, Fortieth AAAI Conference on Artificial Intelligence, 2026.
- [33] Wei Tang, Kangning Cui, Raymond H. Chan, and Jean-Michel Morel. Bilateral signal warping for left ventricular hypertrophy diagnosis. In 2025 IEEE 22nd International Symposium on Biomedical Imaging (ISBI), pages 1–5. IEEE, 2025.
- [34] Yihao Xue, Kristjan H. Greenewald, Youssef Mroueh, and Baharan Mirzasoleiman. Verify when uncertain: Beyond self-consistency in black box hallucination detection. arXiv preprint arXiv:2502.15845, 2025.
- [35] Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William W. Cohen, Ruslan Salakhutdinov, and Christopher D. Manning. HotpotQA: A dataset for diverse, explainable multi-hop question answering. In Ellen Riloff, David Chiang, Julia Hockenmaier, and Jun'ichi Tsujii, editors, Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, 2018.
- [36] Taeseong Yoon and Heeyoung Kim. Uncertainty estimation by flexible evidential deep learning. arXiv preprint arXiv:2510.18322, 2025.
- [37] Yangchen Zeng, Zhenyu Yu, Dongming Jiang, Wenbo Zhang, Yifan Hong, Zhanhua Hu, Jiao Luo, and Kangning Cui. Learning where to embed: Noise-aware positional embedding for query retrieval in small-object detection. arXiv preprint arXiv:2604.15065, 2026.
- [38] Susan Zhang, Stephen Roller, Naman Goyal, Mikel Artetxe, Moya Chen, Shuohui Chen, Christopher Dewan, Mona T. Diab, Xian Li, Xi Victoria Lin, Todor Mihaylov, Myle Ott, Sam Shleifer, Kurt Shuster, Daniel Simig, Punit Singh Koura, Anjali Sridhar, Tianlu Wang, and Luke Zettlemoyer. OPT: Open pre-trained transformer language models. CoRR, abs/2205.01068, 2022.