pith. machine review for the scientific record.

arxiv: 2604.17112 · v1 · submitted 2026-04-18 · 💻 cs.AI

Complementing Self-Consistency with Cross-Model Disagreement for Uncertainty Quantification

Pith reviewed 2026-05-10 06:24 UTC · model grok-4.3

classification 💻 cs.AI
keywords uncertainty quantification · large language models · epistemic uncertainty · aleatoric uncertainty · self-consistency · black-box access · selective abstention · ensemble disagreement

The pith

Cross-model disagreement supplies a missing epistemic uncertainty signal when self-consistency alone is low.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Large language models often repeat the same wrong answer across repeated samples, so self-consistency measures of aleatoric uncertainty fail to flag those errors. The paper shows that disagreement across a small ensemble of models is systematically higher on incorrect outputs precisely in that low-self-consistency regime. It therefore defines an epistemic uncertainty term as the gap between how similarly the models answer each other versus how similarly each model answers itself, adds the two terms to obtain total uncertainty, and reports that the combined score improves ranking calibration and selective abstention across five 7-9B models and ten long-form tasks.
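
To make the decomposition concrete: a minimal sketch, assuming a sentence-embedding cosine similarity stands in for the paper's sequence-semantic similarity (the referee report below notes the exact function is not specified) and assuming the sign convention EU = intra-model minus inter-model similarity. The embedder choice, helper names, and the clipping at zero are illustrative, not the paper's.

import numpy as np
from itertools import combinations
from sentence_transformers import SentenceTransformer

_encoder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedder, not the paper's

def _mean_pairwise_sim(texts_a, texts_b=None):
    # Mean cosine similarity over all pairs within one set (texts_b is None)
    # or across two sets; assumes at least two samples per model.
    if texts_b is None:
        emb = _encoder.encode(texts_a, normalize_embeddings=True)
        sims = [float(emb[i] @ emb[j]) for i, j in combinations(range(len(emb)), 2)]
    else:
        ea = _encoder.encode(texts_a, normalize_embeddings=True)
        eb = _encoder.encode(texts_b, normalize_embeddings=True)
        sims = (ea @ eb.T).ravel().tolist()
    return float(np.mean(sims))

def uncertainties(samples_per_model):
    # samples_per_model: one list of generated answers per model;
    # model 0 is treated as the reference model.
    ref = samples_per_model[0]
    au = 1.0 - _mean_pairwise_sim(ref)  # aleatoric: 1 - self-consistency
    # Epistemic: gap between how similarly each model answers itself (intra)
    # and how similarly the models answer each other (inter).
    intra = np.mean([_mean_pairwise_sim(s) for s in samples_per_model])
    inter = np.mean([_mean_pairwise_sim(ref, aux) for aux in samples_per_model[1:]])
    eu = max(0.0, float(intra - inter))
    return au, eu, au + eu  # total uncertainty TU = AU + EU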

Core claim

In the black-box setting, epistemic uncertainty is estimated from the difference between inter-model and intra-model sequence-semantic similarity; adding this term to self-consistency-based aleatoric uncertainty produces a total uncertainty that ranks answers more reliably and enables better selective abstention, while also surfacing confident failures that aleatoric uncertainty alone misses.

What carries the argument

The epistemic uncertainty term, computed as the gap between inter-model and intra-model sequence-semantic similarity and used as a proxy that activates when self-consistency is low.

If this is right

  • Total uncertainty improves ranking calibration and selective abstention relative to aleatoric uncertainty alone (a risk-coverage sketch follows this list).
  • The epistemic term flags confident failures where aleatoric uncertainty is low.
  • The method requires only generated text from a scale-matched ensemble and works without token probabilities.
  • Agreement and complementarity diagnostics identify the regimes where the added term contributes most.
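
A minimal sketch of the risk-coverage sweep behind the abstention claim, under the assumption that questions are answered in order of increasing total uncertainty; the array names tu and correct are hypothetical.

import numpy as np

def risk_coverage(tu, correct):
    # Sort questions from most to least confident (lowest TU first).
    order = np.argsort(tu)
    err = 1.0 - np.asarray(correct, dtype=float)[order]
    n = len(err)
    coverage = np.arange(1, n + 1) / n            # fraction of questions answered
    risk = np.cumsum(err) / np.arange(1, n + 1)   # error rate among those answered
    return coverage, risk

A better uncertainty score traces a lower risk curve at every coverage level, which is the comparison Figure 7 reports for TU versus AU.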

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same inter-versus-intra similarity gap could be tested on short-form or multiple-choice tasks to check whether the pattern holds beyond long-form generation.
  • Replacing semantic similarity with other cheap distance measures might preserve the signal while lowering compute.
  • An ensemble of three to five models appears sufficient, suggesting the approach scales without requiring dozens of models.

Load-bearing premise

Cross-model semantic disagreement is higher on incorrect answers exactly when self-consistency is low.

What would settle it

A dataset in which models disagree more on correct answers than on incorrect answers whenever self-consistency is low would show the epistemic term adds no value or harms calibration.
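
One hedged sketch of how that check could be run on any labeled evaluation set; au, eu, and correct are hypothetical per-question arrays, and the low-AU cutoff (25th percentile) is an illustrative choice, not the paper's.

import numpy as np

def eu_gap_in_low_au_regime(au, eu, correct, au_quantile=0.25):
    # Restrict to the low-self-consistency regime, then compare mean EU
    # on incorrect versus correct answers. A positive gap supports the
    # load-bearing premise; zero or negative would undercut it.
    au, eu, correct = (np.asarray(x, dtype=float) for x in (au, eu, correct))
    low_au = au <= np.quantile(au, au_quantile)
    eu_wrong = eu[low_au & (correct == 0)].mean()
    eu_right = eu[low_au & (correct == 1)].mean()
    return eu_wrong - eu_right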

Figures

Figures reproduced from arXiv: 2604.17112 by Kimia Hamidieh, Marzyeh Ghassemi, Mikhail Yurochkin, Veronika Thost, Walter Gerych.

Figure 1: (a) Two models confidently produce distinct, incorrect answers to a factual question, which …
Figure 2: Based on the distribution of EU across samples with different AU values, we find that EU …
Figure 3: Epistemic uncertainty AUROC versus dataset-level redundancy …
Figure 4: Under matched sample budgets, TU (AU+EU) consistently shows higher AUROC than AU …
Figure 5: Using Mistral-7B-Instruct-v0.3 as the reference model, TU attains the best mean AUROC (0.72) and outperforms the strongest baseline (closeness centrality, 0.64) across almost all datasets. Per-task results appear in …
Figure 6: The reference model is kept fixed as Mistral-7B while the size of the single auxiliary model varies. TU achieves higher AUROC than AU, even when the auxiliary model is smaller than (×0.43) or roughly the same size as (×1) the reference model. The improvements are larger with bigger, more capable auxiliary models on TriviaQA. …
Figure 7: Risk-coverage analysis shows that TU consistently improves selective prediction across …
Figure 8: AUROC for each model separately, comparing aleatoric and total uncertainty. TU …
Figure 9: AUROC improvement obtained by adding EU to AU across all samples per dataset …
Figure 10: Accuracy per model-dataset pair; ROC curves …
Figure 11: ROC curves between aleatoric and total uncertainty aggregated across all models and …
Figure 12: ROC curves comparing aleatoric and total uncertainty across individual datasets. TU …
Figure 13: Results on TriviaQA using two model families (Gemma 3 (Team et al., 2025) and Qwen2.5 (Yang et al., 2024)) of various sizes. As the size of the reference model increases, both aleatoric and total uncertainty AUROC scores tend to decrease, but total uncertainty has consistently higher AUROC across different model sizes. This holds even when the reference model is substantially stronger than any mod…
Figure 14: Uncertainty calibration for experiments where the auxiliary model set for each model is …
Figure 15: AUROC as a function of the number of auxiliary models used to compute total …
Figure 16: AUROC of total uncertainty as a function of the number of samples per model. Even with …
Figure 17: Distribution of EU across different levels of AU and correctness. Across all models, we …
Figure 18: Distribution of EU across different levels of AU and correctness. Across all benchmarks, …
Original abstract

Large language models (LLMs) often produce confident yet incorrect responses, and uncertainty quantification is one potential solution to more robust usage. Recent works routinely rely on self-consistency to estimate aleatoric uncertainty (AU), yet this proxy collapses when models are overconfident and produce the same incorrect answer across samples. We analyze this regime and show that cross-model semantic disagreement is higher on incorrect answers precisely when AU is low. Motivated by this, we introduce an epistemic uncertainty (EU) term that operates in the black-box access setting: EU uses only generated text from a small, scale-matched ensemble and is computed as the gap between inter-model and intra-model sequence-semantic similarity. We then define total uncertainty (TU) as the sum of AU and EU. In a comprehensive study across five 7-9B instruction-tuned models and ten long-form tasks, TU improves ranking calibration and selective abstention relative to AU, and EU reliably flags confident failures where AU is low. We further characterize when EU is most useful via agreement and complementarity diagnostics.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

0 major / 3 minor

Summary. The paper proposes complementing self-consistency-based aleatoric uncertainty (AU) with a black-box epistemic uncertainty (EU) term defined as the gap between inter-model and intra-model sequence-semantic similarity over a small ensemble of 7-9B instruction-tuned models. Total uncertainty is TU = AU + EU. Motivated by the observation that cross-model disagreement rises on incorrect answers when AU is low, the authors report that TU improves ranking calibration and selective abstention relative to AU alone across five models and ten long-form tasks, while EU specifically flags confident failures; they also provide agreement and complementarity diagnostics.

Significance. If the reported gains hold, the work meaningfully extends uncertainty quantification for LLMs beyond self-consistency by providing a practical, black-box proxy for epistemic uncertainty that targets the overconfident regime. The multi-model, multi-task empirical scope and explicit diagnostics for when EU adds value are strengths; the approach requires only generated text and a modest ensemble, which increases applicability.

minor comments (3)
  1. [§3] §3 (Methods): the precise semantic similarity function (embedding model, pooling, or judge LLM) used to compute sequence-level inter- and intra-model similarities should be stated explicitly, including any hyperparameters, so that EU is fully reproducible.
  2. [Results] Results tables: report the exact calibration metrics (e.g., ECE, Brier score, or ranking AUROC) and abstention curves with confidence intervals or statistical tests across the ten tasks; the abstract claims improvement, but the quantitative deltas are not summarized in the provided text (a sketch of the AUROC computation follows this list).
  3. [§4.3] §4.3 (Diagnostics): the agreement and complementarity plots would benefit from a short formal definition of the plotted quantities (e.g., how 'agreement' between AU and EU is quantified) to avoid ambiguity.
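
On comment 2, the ranking metric most of the paper's figures report is AUROC of an uncertainty score at predicting incorrectness; a minimal sketch of that computation, assuming binary correctness labels and any scalar uncertainty score.

from sklearn.metrics import roc_auc_score

def uncertainty_auroc(uncertainty, correct):
    # Higher uncertainty should rank incorrect answers above correct ones;
    # 0.5 is chance, 1.0 is a perfect ranking. Requires both classes present.
    incorrect = [1 - c for c in correct]
    return roc_auc_score(incorrect, uncertainty)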

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive summary, significance assessment, and recommendation of minor revision. The report accurately reflects the core contribution of our work: showing that cross-model semantic disagreement provides a practical black-box epistemic uncertainty term that complements self-consistency-based aleatoric uncertainty, particularly in the overconfident regime. As the report contains no specific major comments, we have no points requiring rebuttal or targeted revision at this time. We will incorporate any minor suggestions during the revision process.

Circularity Check

0 steps flagged

No significant circularity

full rationale

The paper's core contribution is an empirical study: it first observes, via analysis, that cross-model semantic disagreement rises on incorrect answers precisely when self-consistency-based AU is low; it then explicitly defines EU as the computable gap between inter-model and intra-model sequence-semantic similarity on generated text, sets TU = AU + EU, and reports that TU improves ranking calibration and selective abstention over AU alone across five models and ten tasks. None of these steps reduces, by the paper's own equations or definitions, to a fitted parameter, a renamed input, or a self-citation chain; the definitions are direct, and the claims rest on external experimental outcomes rather than a tautological re-derivation of the inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

Based on abstract only; no explicit free parameters are stated. The approach rests on standard assumptions about semantic similarity metrics and the validity of small ensembles for epistemic uncertainty.

axioms (1)
  • domain assumption: Sequence-semantic similarity can be measured reliably from generated text alone to distinguish intra-model from inter-model agreement.
    Central to computing the EU gap; invoked when defining the epistemic term from similarity scores.
invented entities (1)
  • Epistemic uncertainty (EU) term · no independent evidence
    purpose: to quantify model disagreement across an ensemble when self-consistency is low
    Newly defined as the gap between inter- and intra-model similarities; no independent falsifiable evidence is provided beyond the empirical study.

pith-pipeline@v0.9.0 · 5498 in / 1424 out tokens · 47708 ms · 2026-05-10T06:24:35.129984+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

57 extracted references · 50 canonical work pages · 14 internal anchors

  1. Lukas Aichberger, Kajetan Schweighofer, Mykyta Ielanskyi, and Sepp Hochreiter. Semantically diverse language generation for uncertainty estimation in language models. arXiv preprint arXiv:2406.04306.
  2. Neil Band, Tim GJ Rudner, Qixuan Feng, Angelos Filos, Zachary Nado, Michael W Dusenberry, Ghassen Jerfel, Dustin Tran, and Yarin Gal. Benchmarking Bayesian deep learning on diabetic retinopathy detection tasks. arXiv preprint arXiv:2211.12717.
  3. Ondrej Bojar, Rajen Chatterjee, Christian Federmann, Yvette Graham, Barry Haddow, Matthias Huck, Antonio Jimeno Yepes, Philipp Koehn, Varvara Logacheva, Christof Monz, et al. Findings of the 2016 conference on machine translation (WMT16). In First Conference on Machine Translation, pp. 131–198. Association for Computational Linguistics.
  4. Jianhao Chen, Zishuo Xun, Bocheng Zhou, Han Qi, Qiaosheng Zhang, Yang Chen, Wei Hu, Yuzhong Qu, Wanli Ouyang, and Shuyue Hu. Do we truly need so many samples? Multi-LLM repeated sampling efficiently scales test-time compute. arXiv preprint arXiv:2504.00762.
  5. Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168.
  6. Ben Cottier, Robi Rahman, Loredana Fattorini, Nestor Maslej, Tamay Besiroglu, and David Owen. The rising costs of training frontier AI models. arXiv preprint arXiv:2405.21015.
  7. Pradeep Dasigi, Kyle Lo, Iz Beltagy, Arman Cohan, Noah A Smith, and Matt Gardner. A dataset of information-seeking questions and answers anchored in research papers. arXiv preprint arXiv:2105.03011.
  8. Prasenjit Dey, Srujana Merugu, and Sivaramakrishnan Kaveri. Uncertainty-aware fusion: An ensemble framework for mitigating hallucinations in large language models. arXiv preprint arXiv:2503.05757.
  9. Ekaterina Fadeeva, Roman Vashurin, Akim Tsvigun, Artem Vazhentsev, Sergey Petrakov, Kirill Fedyanin, Daniil Vasilev, Elizaveta Goncharova, Alexander Panchenko, Maxim Panov, et al. LM-Polygraph: Uncertainty estimation for language models. arXiv preprint arXiv:2311.07383.
  10. Xiang Gao, Jiaxin Zhang, Lalla Mouatadid, and Kamalika Das. SPUQ: Perturbation-based uncertainty quantification for large language models. arXiv preprint arXiv:2403.02509.
  11. Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. The Llama 3 herd of models. arXiv preprint arXiv:2407.21783.
  12. HW Chung, L Hou, S Longpre, B Zoph, Y Tay, W Fedus, Y Li, X Wang, M Dehghani, S Brahma, and A Webson. Scaling instruction-finetuned language models. arXiv preprint arXiv:2210.11416.
  13. Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, et al. GPT-4o system card. arXiv preprint arXiv:2410.21276.
  14. Albert Q Jiang, A Sablayrolles, A Mensch, C Bamford, D Singh Chaplot, D de las Casas, F Bressand, G Lengyel, G Lample, L Saulnier, et al. Mistral 7B. arXiv preprint arXiv:2310.06825.
  15. Daniel D Johnson, Daniel Tarlow, David Duvenaud, and Chris J Maddison. Experts don't cheat: Learning what you don't know by predicting pairs. arXiv preprint arXiv:2402.08733.
  16. Mandar Joshi, Eunsol Choi, Daniel S Weld, and Luke Zettlemoyer. TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension. arXiv preprint arXiv:1705.03551.
  17. Saurav Kadavath, Tom Conerly, Amanda Askell, Tom Henighan, Dawn Drain, Ethan Perez, Nicholas Schiefer, Zac Hatfield-Dodds, Nova DasSarma, Eli Tran-Johnson, et al. Language models (mostly) know what they know. arXiv preprint arXiv:2207.05221.
  18. Zhewei Kang, Xuandong Zhao, and Dawn Song. Scalable best-of-n selection for large language models via self-certainty. arXiv preprint arXiv:2502.18581.
  19. Andreas Kirsch. (Implicit) ensembles of ensembles: Epistemic uncertainty collapse in large models. arXiv preprint arXiv:2409.02628.
  20. Jannik Kossen, Jiatong Han, Muhammed Razzak, Lisa Schut, Shreshth Malik, and Yarin Gal. Semantic entropy probes: Robust and cheap hallucination detection in LLMs. arXiv preprint arXiv:2406.15927.
  21. Lorenz Kuhn, Yarin Gal, and Sebastian Farquhar. Semantic uncertainty: Linguistic invariances for uncertainty estimation in natural language generation. In ICLR, 2023; arXiv preprint arXiv:2302.09664.
  22. Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with PagedAttention. In Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles.
  23. Shalev Lifshitz, Sheila A McIlraith, and Yilun Du. Multi-agent verification: Scaling test-time compute with multiple verifiers. arXiv preprint arXiv:2502.20379.
  24. Stephanie Lin, Jacob Hilton, and Owain Evans. TruthfulQA: Measuring how models mimic human falsehoods. arXiv preprint arXiv:2109.07958.
  25. Stephanie Lin, Jacob Hilton, and Owain Evans. Teaching models to express their uncertainty in words. arXiv preprint arXiv:2205.14334.
  26. Zhen Lin, Shubhendu Trivedi, and Jimeng Sun. Generating with confidence: Uncertainty quantification for black-box large language models. arXiv preprint arXiv:2305.19187.
  27. Linyu Liu, Yu Pan, Xiaocheng Li, and Guanting Chen. Uncertainty estimation and quantification for LLMs: A simple supervised approach. arXiv preprint arXiv:2404.15993.
  28. Litian Liu, Reza Pourreza, Sunny Panchal, Apratim Bhattacharyya, Yao Qin, and Roland Memisevic. Enhancing hallucination detection through noise injection. arXiv preprint arXiv:2502.03799.
  29. Jinliang Lu, Ziliang Pang, Min Xiao, Yaochen Zhu, Rui Xia, and Jiajun Zhang. Merge, ensemble, and cooperate! A survey on collaborative strategies in the era of large language models. arXiv preprint arXiv:2407.06089.
  30. Huan Ma, Jingdong Chen, Guangyu Wang, and Changqing Zhang. Estimating LLM uncertainty with logits. arXiv preprint arXiv:2502.00290.
  31. Potsawee Manakul, Adian Liusie, and Mark JF Gales. SelfCheckGPT: Zero-resource black-box hallucination detection for generative large language models. arXiv preprint arXiv:2303.08896.
  32. Sewon Min, Julian Michael, Hannaneh Hajishirzi, and Luke Zettlemoyer. AmbigQA: Answering ambiguous open-domain questions. arXiv preprint arXiv:2004.10645.
  33. Shashi Narayan, Shay B Cohen, and Mirella Lapata. Don't give me the details, just the summary! Topic-aware convolutional neural networks for extreme summarization. arXiv preprint arXiv:1808.08745.
  34. Jianmo Ni, Gustavo Hernandez Abrego, Noah Constant, Ji Ma, Keith B Hall, Daniel Cer, and Yinfei Yang. Sentence-T5: Scalable sentence encoders from pre-trained text-to-text models. arXiv preprint arXiv:2108.08877.
  35. Shiyu Ni, Keping Bi, Jiafeng Guo, and Xueqi Cheng. When do LLMs need retrieval augmentation? Mitigating LLMs' overconfidence helps retrieval augmentation. arXiv preprint arXiv:2402.11457.
  36. Andrea Santilli, Adam Golinski, Michael Kirchhof, Federico Danieli, Arno Blaas, Miao Xiong, Luca Zappella, and Sinead Williamson. Revisiting uncertainty quantification evaluation in language models: Spurious interactions with response length bias results. arXiv preprint arXiv:2504.13677.
  37. Kajetan Schweighofer, Lukas Aichberger, Mykyta Ielanskyi, and Sepp Hochreiter. Introducing an improved information-theoretic measure of predictive uncertainty. arXiv preprint arXiv:2311.08309.
  38. Tal Shnitzer, Anthony Ou, Mírian Silva, Kate Soule, Yuekai Sun, Justin Solomon, Neil Thompson, and Mikhail Yurochkin. Large language model routing with benchmark datasets. arXiv preprint arXiv:2309.15789.
  39. Ola Shorinwa, Zhiting Mei, Justin Lidard, Allen Z Ren, and Anirudha Majumdar. A survey on uncertainty quantification of large language models: Taxonomy, open research challenges, and future directions. arXiv preprint arXiv:2412.05563.
  40. Adi Simhi, Itay Itzhak, Fazl Barez, Gabriel Stanovsky, and Yonatan Belinkov. Trust me, I'm wrong: High-certainty hallucinations in LLMs. arXiv preprint arXiv:2502.12964.
  41. Mirac Suzgun, Nathan Scales, Nathanael Schärli, Sebastian Gehrmann, Yi Tay, Hyung Won Chung, Aakanksha Chowdhery, Quoc V Le, Ed H Chi, Denny Zhou, et al. Challenging BIG-Bench tasks and whether chain-of-thought can solve them. arXiv preprint arXiv:2210.09261.
  42. Gemma Team, Morgane Riviere, Shreya Pathak, Pier Giuseppe Sessa, Cassidy Hardin, Surya Bhupatiraju, Léonard Hussenot, Thomas Mesnard, Bobak Shahriari, Alexandre Ramé, et al. Gemma 2: Improving open language models at a practical size. arXiv preprint arXiv:2408.00118.
  43. Gemma Team, Aishwarya Kamath, Johan Ferret, Shreya Pathak, Nino Vieillard, Ramona Merhej, Sarah Perrin, Tatiana Matejovicova, Alexandre Ramé, Morgane Rivière, et al. Gemma 3 technical report. arXiv preprint arXiv:2503.19786.
  44. Xi Wang, Laurence Aitchison, and Maja Rudolph. LoRA ensembles for large language model fine-tuning. arXiv preprint arXiv:2310.00035.
  45. Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. Self-consistency improves chain of thought reasoning in language models. arXiv preprint arXiv:2203.11171.
  46. Jason Wei, Nguyen Karina, Hyung Won Chung, Yunxin Joy Jiao, Spencer Papay, Amelia Glaese, John Schulman, and William Fedus. Measuring short-form factuality in large language models. arXiv preprint arXiv:2411.04368.
  47. Zhiqiu Xia, Jinxuan Xu, Yuqian Zhang, and Hang Liu. A survey of uncertainty estimation methods on large language models. arXiv preprint arXiv:2503.00172.
  48. Yihao Xue, Kristjan Greenewald, Youssef Mroueh, and Baharan Mirzasoleiman. Verify when uncertain: Beyond self-consistency in black box hallucination detection. arXiv preprint arXiv:2502.15845.
  49. An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, et al. Qwen2.5 technical report. arXiv preprint arXiv:2412.15115.
  50. Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William W Cohen, Ruslan Salakhutdinov, and Christopher D Manning. HotpotQA: A dataset for diverse, explainable multi-hop question answering. arXiv preprint arXiv:1809.09600.

Internal anchors (excerpts from the paper):

  51. [Appendix A.1, theoretical interpretations of EU] Kernel and variational interpretation of D(ω‖ω*): assume the similarity function s(·,·) is a symmetric positive definite kernel k, and denote the predictive distributions by P_ω := p(· | x; ω) and P_{ω*}. Their kernel mean embeddings in the reproducing k…
  52. [evaluation setup] …as the judge model. Following prior work (Lin et al., 2023), correctness is computed using only the first sampled response from each model. All evaluation is inference-only; no training or fine-tuning is performed. For each dataset, 10 responses per model are sampled for the first 100 prompts; AU is computed using all 10 responses. To match…
  53. [Appendix A.4, additional results on total uncertainty] For datasets not originally supported by lm-eval-harness, the paper follows its prompt formatting conventions and includes code for these additions in the supplementary material. Figure 8 reports the AUROC of aleatoric and total uncertainty across all model–dataset pairs, and Figure 9 shows the corresponding…
  54. TU achieves higher AUROC across most tasks, particularly on HotpotQA, WMT16-de-en, and CoQA, where models exhibit confident failures.
  55. …and Qwen2.5 (Yang et al., 2024) of various sizes. As the size of the reference model increases, both aleatoric and total uncertainty AUROC scores tend to decrease, but total uncertainty has consistently higher AUROC across different model sizes. This holds even when the reference model is substantially stronger than any model in the auxiliary set (e.g., Q…
  57. [correctness_score] The benchmark is converted into a long-form QA format with chain-of-thought answering: Boolean Expressions, Disambiguation QA, and Word Sorting, with models prompted to justify their answers rather than selecting from multiple choices directly. Uncertainty scores are then evaluated over the full responses using the same semantic similarity pipeline…