pith. machine review for the scientific record.

arxiv: 2604.17112 · v1 · submitted 2026-04-18 · 💻 cs.AI

Complementing Self-Consistency with Cross-Model Disagreement for Uncertainty Quantification

Pith reviewed 2026-05-10 06:24 UTC · model grok-4.3

classification 💻 cs.AI
keywords uncertainty quantification · large language models · epistemic uncertainty · aleatoric uncertainty · self-consistency · black-box access · selective abstention · ensemble disagreement

The pith

Cross-model disagreement supplies a missing epistemic uncertainty signal when self-consistency alone is low.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Large language models often repeat the same wrong answer across repeated samples, so self-consistency measures of aleatoric uncertainty fail to flag those errors. The paper shows that disagreement across a small ensemble of models is systematically higher on incorrect outputs precisely in that low-self-consistency regime. It therefore defines an epistemic uncertainty term as the gap between how similarly the models answer each other versus how similarly each model answers itself, adds the two terms to obtain total uncertainty, and reports that the combined score improves ranking calibration and selective abstention across five 7-9B models and ten long-form tasks.
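
To make the decomposition concrete: a minimal sketch, assuming a sentence-embedding cosine similarity stands in for the paper's sequence-semantic similarity (the referee report below notes the exact function is not specified) and assuming the sign convention EU = intra-model minus inter-model similarity. The embedder choice, helper names, and the clipping at zero are illustrative, not the paper's.

import numpy as np
from itertools import combinations
from sentence_transformers import SentenceTransformer

_encoder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedder, not the paper's

def _mean_pairwise_sim(texts_a, texts_b=None):
    # Mean cosine similarity over all pairs within one set (texts_b is None)
    # or across two sets; assumes at least two samples per model.
    if texts_b is None:
        emb = _encoder.encode(texts_a, normalize_embeddings=True)
        sims = [float(emb[i] @ emb[j]) for i, j in combinations(range(len(emb)), 2)]
    else:
        ea = _encoder.encode(texts_a, normalize_embeddings=True)
        eb = _encoder.encode(texts_b, normalize_embeddings=True)
        sims = (ea @ eb.T).ravel().tolist()
    return float(np.mean(sims))

def uncertainties(samples_per_model):
    # samples_per_model: one list of generated answers per model;
    # model 0 is treated as the reference model.
    ref = samples_per_model[0]
    au = 1.0 - _mean_pairwise_sim(ref)  # aleatoric: 1 - self-consistency
    # Epistemic: gap between how similarly each model answers itself (intra)
    # and how similarly the models answer each other (inter).
    intra = np.mean([_mean_pairwise_sim(s) for s in samples_per_model])
    inter = np.mean([_mean_pairwise_sim(ref, aux) for aux in samples_per_model[1:]])
    eu = max(0.0, float(intra - inter))
    return au, eu, au + eu  # total uncertainty TU = AU + EU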

Core claim

In the black-box setting, epistemic uncertainty is estimated from the difference between inter-model and intra-model sequence-semantic similarity; adding this term to self-consistency-based aleatoric uncertainty produces a total uncertainty that ranks answers more reliably and enables better selective abstention, while also surfacing confident failures that aleatoric uncertainty alone misses.

What carries the argument

The epistemic uncertainty term, computed as the gap between inter-model and intra-model sequence-semantic similarity and used as a proxy that activates when self-consistency is low.

If this is right

  • Total uncertainty improves ranking calibration and selective abstention relative to aleatoric uncertainty alone (a risk-coverage sketch follows this list).
  • The epistemic term flags confident failures where aleatoric uncertainty is low.
  • The method requires only generated text from a scale-matched ensemble and works without token probabilities.
  • Agreement and complementarity diagnostics identify the regimes where the added term contributes most.
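
A minimal sketch of the risk-coverage sweep behind the abstention claim, under the assumption that questions are answered in order of increasing total uncertainty; the array names tu and correct are hypothetical.

import numpy as np

def risk_coverage(tu, correct):
    # Sort questions from most to least confident (lowest TU first).
    order = np.argsort(tu)
    err = 1.0 - np.asarray(correct, dtype=float)[order]
    n = len(err)
    coverage = np.arange(1, n + 1) / n            # fraction of questions answered
    risk = np.cumsum(err) / np.arange(1, n + 1)   # error rate among those answered
    return coverage, risk

A better uncertainty score traces a lower risk curve at every coverage level, which is the comparison Figure 7 reports for TU versus AU.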

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same inter-versus-intra similarity gap could be tested on short-form or multiple-choice tasks to check whether the pattern holds beyond long-form generation.
  • Replacing semantic similarity with other cheap distance measures might preserve the signal while lowering compute.
  • An ensemble of three to five models appears sufficient, suggesting the approach scales without requiring dozens of models.

Load-bearing premise

Cross-model semantic disagreement is higher on incorrect answers exactly when self-consistency is low.

What would settle it

A dataset in which models disagree more on correct answers than on incorrect answers whenever self-consistency is low would show the epistemic term adds no value or harms calibration.
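
One hedged sketch of how that check could be run on any labeled evaluation set; au, eu, and correct are hypothetical per-question arrays, and the low-AU cutoff (25th percentile) is an illustrative choice, not the paper's.

import numpy as np

def eu_gap_in_low_au_regime(au, eu, correct, au_quantile=0.25):
    # Restrict to the low-self-consistency regime, then compare mean EU
    # on incorrect versus correct answers. A positive gap supports the
    # load-bearing premise; zero or negative would undercut it.
    au, eu, correct = (np.asarray(x, dtype=float) for x in (au, eu, correct))
    low_au = au <= np.quantile(au, au_quantile)
    eu_wrong = eu[low_au & (correct == 0)].mean()
    eu_right = eu[low_au & (correct == 1)].mean()
    return eu_wrong - eu_right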

Figures

Figures reproduced from arXiv: 2604.17112 by Kimia Hamidieh, Marzyeh Ghassemi, Mikhail Yurochkin, Veronika Thost, Walter Gerych.

Figure 1: (a) Two models confidently produce distinct, incorrect answers to a factual question, which …
Figure 2: Based on the distribution of EU across samples with different AU values, we find that EU …
Figure 3: Epistemic uncertainty AUROC versus dataset-level redundancy …
Figure 4: Under matched sample budgets, TU (AU+EU) consistently shows higher AUROC than AU …
Figure 5: Using Mistral-7B-Instruct-v0.3 as the reference model, TU attains the best mean AUROC (0.72) and outperforms the strongest baseline (closeness centrality, 0.64) across almost all datasets. Per-task results appear in …
Figure 6: The reference model is kept fixed as Mistral-7B while the size of the single auxiliary model varies. TU achieves higher AUROC than AU, even when the auxiliary model is smaller than (×0.43) or roughly the same size as (×1) the reference model. The improvements are larger with bigger, more capable auxiliary models on TriviaQA. …
Figure 7: Risk-coverage analysis shows that TU consistently improves selective prediction across …
Figure 8: AUROC for each model separately, comparing aleatoric and total uncertainty. TU …
Figure 9: AUROC improvement obtained by adding EU to AU across all samples per dataset …
Figure 10: Accuracy per model-dataset pair; ROC curves …
Figure 11: ROC curves between aleatoric and total uncertainty aggregated across all models and …
Figure 12: ROC curves comparing aleatoric and total uncertainty across individual datasets. TU …
Figure 13: Results on TriviaQA using two model families (Gemma 3 (Team et al., 2025) and Qwen2.5 (Yang et al., 2024)) of various sizes. As the size of the reference model increases, both aleatoric and total uncertainty AUROC scores tend to decrease, but total uncertainty has consistently higher AUROC across different model sizes. This holds even when the reference model is substantially stronger than any mod…
Figure 14: Uncertainty calibration for experiments where the auxiliary model set for each model is …
Figure 15: AUROC as a function of the number of auxiliary models used to compute total …
Figure 16: AUROC of total uncertainty as a function of the number of samples per model. Even with …
Figure 17: Distribution of EU across different levels of AU and correctness. Across all models, we …
Figure 18: Distribution of EU across different levels of AU and correctness. Across all benchmarks, …
Original abstract

Large language models (LLMs) often produce confident yet incorrect responses, and uncertainty quantification is one potential solution to more robust usage. Recent works routinely rely on self-consistency to estimate aleatoric uncertainty (AU), yet this proxy collapses when models are overconfident and produce the same incorrect answer across samples. We analyze this regime and show that cross-model semantic disagreement is higher on incorrect answers precisely when AU is low. Motivated by this, we introduce an epistemic uncertainty (EU) term that operates in the black-box access setting: EU uses only generated text from a small, scale-matched ensemble and is computed as the gap between inter-model and intra-model sequence-semantic similarity. We then define total uncertainty (TU) as the sum of AU and EU. In a comprehensive study across five 7-9B instruction-tuned models and ten long-form tasks, TU improves ranking calibration and selective abstention relative to AU, and EU reliably flags confident failures where AU is low. We further characterize when EU is most useful via agreement and complementarity diagnostics.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

0 major / 3 minor

Summary. The paper proposes complementing self-consistency-based aleatoric uncertainty (AU) with a black-box epistemic uncertainty (EU) term defined as the gap between inter-model and intra-model sequence-semantic similarity over a small ensemble of 7-9B instruction-tuned models. Total uncertainty is TU = AU + EU. Motivated by the observation that cross-model disagreement rises on incorrect answers when AU is low, the authors report that TU improves ranking calibration and selective abstention relative to AU alone across five models and ten long-form tasks, while EU specifically flags confident failures; they also provide agreement and complementarity diagnostics.

Significance. If the reported gains hold, the work meaningfully extends uncertainty quantification for LLMs beyond self-consistency by providing a practical, black-box proxy for epistemic uncertainty that targets the overconfident regime. The multi-model, multi-task empirical scope and explicit diagnostics for when EU adds value are strengths; the approach requires only generated text and a modest ensemble, which increases applicability.

minor comments (3)
  1. [§3] §3 (Methods): the precise semantic similarity function (embedding model, pooling, or judge LLM) used to compute sequence-level inter- and intra-model similarities should be stated explicitly, including any hyperparameters, so that EU is fully reproducible.
  2. [Results] Results tables: report the exact calibration metrics (e.g., ECE, Brier score, or ranking AUROC) and abstention curves with confidence intervals or statistical tests across the ten tasks; the abstract claims improvement, but the quantitative deltas are not summarized in the provided text (a sketch of the AUROC computation follows this list).
  3. [§4.3] §4.3 (Diagnostics): the agreement and complementarity plots would benefit from a short formal definition of the plotted quantities (e.g., how 'agreement' between AU and EU is quantified) to avoid ambiguity.
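
On comment 2, the ranking metric most of the paper's figures report is AUROC of an uncertainty score at predicting incorrectness; a minimal sketch of that computation, assuming binary correctness labels and any scalar uncertainty score.

from sklearn.metrics import roc_auc_score

def uncertainty_auroc(uncertainty, correct):
    # Higher uncertainty should rank incorrect answers above correct ones;
    # 0.5 is chance, 1.0 is a perfect ranking. Requires both classes present.
    incorrect = [1 - c for c in correct]
    return roc_auc_score(incorrect, uncertainty)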

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive summary, significance assessment, and recommendation of minor revision. The report accurately reflects the core contribution of our work: showing that cross-model semantic disagreement provides a practical black-box epistemic uncertainty term that complements self-consistency-based aleatoric uncertainty, particularly in the overconfident regime. As the report contains no specific major comments, we have no points requiring rebuttal or targeted revision at this time. We will incorporate any minor suggestions during the revision process.

Circularity Check

0 steps flagged

No significant circularity

full rationale

The paper's core contribution is an empirical study: it first observes, via analysis, that cross-model semantic disagreement rises on incorrect answers precisely when self-consistency-based AU is low; it then explicitly defines EU as the computable gap between inter-model and intra-model sequence-semantic similarity on generated text, sets TU = AU + EU, and reports that TU improves ranking calibration and selective abstention over AU alone across five models and ten tasks. None of these steps reduces, by the paper's own equations or definitions, to a fitted parameter, a renamed input, or a self-citation chain; the definitions are direct, and the claims rest on external experimental outcomes rather than a tautological re-derivation of the inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

Based on abstract only; no explicit free parameters are stated. The approach rests on standard assumptions about semantic similarity metrics and the validity of small ensembles for epistemic uncertainty.

axioms (1)
  • domain assumption: Sequence-semantic similarity can be measured reliably from generated text alone to distinguish intra-model from inter-model agreement.
    Central to computing the EU gap; invoked when defining the epistemic term from similarity scores.
invented entities (1)
  • Epistemic uncertainty (EU) term · no independent evidence
    purpose: to quantify model disagreement across an ensemble when self-consistency is low
    Newly defined as the gap between inter- and intra-model similarities; no independent falsifiable evidence is provided beyond the empirical study.

pith-pipeline@v0.9.0 · 5498 in / 1424 out tokens · 47708 ms · 2026-05-10T06:24:35.129984+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

57 extracted references · 50 canonical work pages · 14 internal anchors

  1. Lukas Aichberger, Kajetan Schweighofer, Mykyta Ielanskyi, and Sepp Hochreiter. Semantically diverse language generation for uncertainty estimation in language models. arXiv preprint arXiv:2406.04306.
  2. Neil Band, Tim GJ Rudner, Qixuan Feng, Angelos Filos, Zachary Nado, Michael W Dusenberry, Ghassen Jerfel, Dustin Tran, and Yarin Gal. Benchmarking Bayesian deep learning on diabetic retinopathy detection tasks. arXiv preprint arXiv:2211.12717.
  3. Ondrej Bojar, Rajen Chatterjee, Christian Federmann, Yvette Graham, Barry Haddow, Matthias Huck, Antonio Jimeno Yepes, Philipp Koehn, Varvara Logacheva, Christof Monz, et al. Findings of the 2016 conference on machine translation (WMT16). In First Conference on Machine Translation, pp. 131–198. Association for Computational Linguistics.
  4. Jianhao Chen, Zishuo Xun, Bocheng Zhou, Han Qi, Qiaosheng Zhang, Yang Chen, Wei Hu, Yuzhong Qu, Wanli Ouyang, and Shuyue Hu. Do we truly need so many samples? Multi-LLM repeated sampling efficiently scales test-time compute. arXiv preprint arXiv:2504.00762.
  5. Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168.
  6. Ben Cottier, Robi Rahman, Loredana Fattorini, Nestor Maslej, Tamay Besiroglu, and David Owen. The rising costs of training frontier AI models. arXiv preprint arXiv:2405.21015.
  7. Pradeep Dasigi, Kyle Lo, Iz Beltagy, Arman Cohan, Noah A Smith, and Matt Gardner. A dataset of information-seeking questions and answers anchored in research papers. arXiv preprint arXiv:2105.03011.
  8. Prasenjit Dey, Srujana Merugu, and Sivaramakrishnan Kaveri. Uncertainty-aware fusion: An ensemble framework for mitigating hallucinations in large language models. arXiv preprint arXiv:2503.05757.
  9. Ekaterina Fadeeva, Roman Vashurin, Akim Tsvigun, Artem Vazhentsev, Sergey Petrakov, Kirill Fedyanin, Daniil Vasilev, Elizaveta Goncharova, Alexander Panchenko, Maxim Panov, et al. LM-Polygraph: Uncertainty estimation for language models. arXiv preprint arXiv:2311.07383.
  10. Xiang Gao, Jiaxin Zhang, Lalla Mouatadid, and Kamalika Das. SPUQ: Perturbation-based uncertainty quantification for large language models. arXiv preprint arXiv:2403.02509.
  11. Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. The Llama 3 herd of models. arXiv preprint arXiv:2407.21783.
  12. HW Chung, L Hou, S Longpre, B Zoph, Y Tay, W Fedus, Y Li, X Wang, M Dehghani, S Brahma, and A Webson. Scaling instruction-finetuned language models. arXiv preprint arXiv:2210.11416.
  13. Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, et al. GPT-4o system card. arXiv preprint arXiv:2410.21276.
  14. Albert Q Jiang, A Sablayrolles, A Mensch, C Bamford, D Singh Chaplot, D de las Casas, F Bressand, G Lengyel, G Lample, L Saulnier, et al. Mistral 7B. arXiv preprint arXiv:2310.06825.
  15. Daniel D Johnson, Daniel Tarlow, David Duvenaud, and Chris J Maddison. Experts don't cheat: Learning what you don't know by predicting pairs. arXiv preprint arXiv:2402.08733.
  16. Mandar Joshi, Eunsol Choi, Daniel S Weld, and Luke Zettlemoyer. TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension. arXiv preprint arXiv:1705.03551.
  17. Saurav Kadavath, Tom Conerly, Amanda Askell, Tom Henighan, Dawn Drain, Ethan Perez, Nicholas Schiefer, Zac Hatfield-Dodds, Nova DasSarma, Eli Tran-Johnson, et al. Language models (mostly) know what they know. arXiv preprint arXiv:2207.05221.
  18. Zhewei Kang, Xuandong Zhao, and Dawn Song. Scalable best-of-n selection for large language models via self-certainty. arXiv preprint arXiv:2502.18581.
  19. Andreas Kirsch. (Implicit) ensembles of ensembles: Epistemic uncertainty collapse in large models. arXiv preprint arXiv:2409.02628.
  20. Jannik Kossen, Jiatong Han, Muhammed Razzak, Lisa Schut, Shreshth Malik, and Yarin Gal. Semantic entropy probes: Robust and cheap hallucination detection in LLMs. arXiv preprint arXiv:2406.15927.
  21. Lorenz Kuhn, Yarin Gal, and Sebastian Farquhar. Semantic uncertainty: Linguistic invariances for uncertainty estimation in natural language generation. In ICLR, 2023; arXiv preprint arXiv:2302.09664.
  22. Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with PagedAttention. In Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles.
  23. Shalev Lifshitz, Sheila A McIlraith, and Yilun Du. Multi-agent verification: Scaling test-time compute with multiple verifiers. arXiv preprint arXiv:2502.20379.
  24. Stephanie Lin, Jacob Hilton, and Owain Evans. TruthfulQA: Measuring how models mimic human falsehoods. arXiv preprint arXiv:2109.07958.
  25. Stephanie Lin, Jacob Hilton, and Owain Evans. Teaching models to express their uncertainty in words. arXiv preprint arXiv:2205.14334.
  26. Zhen Lin, Shubhendu Trivedi, and Jimeng Sun. Generating with confidence: Uncertainty quantification for black-box large language models. arXiv preprint arXiv:2305.19187.
  27. Linyu Liu, Yu Pan, Xiaocheng Li, and Guanting Chen. Uncertainty estimation and quantification for LLMs: A simple supervised approach. arXiv preprint arXiv:2404.15993.
  28. Litian Liu, Reza Pourreza, Sunny Panchal, Apratim Bhattacharyya, Yao Qin, and Roland Memisevic. Enhancing hallucination detection through noise injection. arXiv preprint arXiv:2502.03799.
  29. Jinliang Lu, Ziliang Pang, Min Xiao, Yaochen Zhu, Rui Xia, and Jiajun Zhang. Merge, ensemble, and cooperate! A survey on collaborative strategies in the era of large language models. arXiv preprint arXiv:2407.06089.
  30. Huan Ma, Jingdong Chen, Guangyu Wang, and Changqing Zhang. Estimating LLM uncertainty with logits. arXiv preprint arXiv:2502.00290.
  31. Potsawee Manakul, Adian Liusie, and Mark JF Gales. SelfCheckGPT: Zero-resource black-box hallucination detection for generative large language models. arXiv preprint arXiv:2303.08896.
  32. Sewon Min, Julian Michael, Hannaneh Hajishirzi, and Luke Zettlemoyer. AmbigQA: Answering ambiguous open-domain questions. arXiv preprint arXiv:2004.10645.
  33. Shashi Narayan, Shay B Cohen, and Mirella Lapata. Don't give me the details, just the summary! Topic-aware convolutional neural networks for extreme summarization. arXiv preprint arXiv:1808.08745.
  34. Jianmo Ni, Gustavo Hernandez Abrego, Noah Constant, Ji Ma, Keith B Hall, Daniel Cer, and Yinfei Yang. Sentence-T5: Scalable sentence encoders from pre-trained text-to-text models. arXiv preprint arXiv:2108.08877.
  35. Shiyu Ni, Keping Bi, Jiafeng Guo, and Xueqi Cheng. When do LLMs need retrieval augmentation? Mitigating LLMs' overconfidence helps retrieval augmentation. arXiv preprint arXiv:2402.11457.
  36. Andrea Santilli, Adam Golinski, Michael Kirchhof, Federico Danieli, Arno Blaas, Miao Xiong, Luca Zappella, and Sinead Williamson. Revisiting uncertainty quantification evaluation in language models: Spurious interactions with response length bias results. arXiv preprint arXiv:2504.13677.
  37. Kajetan Schweighofer, Lukas Aichberger, Mykyta Ielanskyi, and Sepp Hochreiter. Introducing an improved information-theoretic measure of predictive uncertainty. arXiv preprint arXiv:2311.08309.
  38. Tal Shnitzer, Anthony Ou, Mírian Silva, Kate Soule, Yuekai Sun, Justin Solomon, Neil Thompson, and Mikhail Yurochkin. Large language model routing with benchmark datasets. arXiv preprint arXiv:2309.15789.
  39. Ola Shorinwa, Zhiting Mei, Justin Lidard, Allen Z Ren, and Anirudha Majumdar. A survey on uncertainty quantification of large language models: Taxonomy, open research challenges, and future directions. arXiv preprint arXiv:2412.05563.
  40. Adi Simhi, Itay Itzhak, Fazl Barez, Gabriel Stanovsky, and Yonatan Belinkov. Trust me, I'm wrong: High-certainty hallucinations in LLMs. arXiv preprint arXiv:2502.12964.
  41. Mirac Suzgun, Nathan Scales, Nathanael Schärli, Sebastian Gehrmann, Yi Tay, Hyung Won Chung, Aakanksha Chowdhery, Quoc V Le, Ed H Chi, Denny Zhou, et al. Challenging BIG-Bench tasks and whether chain-of-thought can solve them. arXiv preprint arXiv:2210.09261.
  42. Gemma Team, Morgane Riviere, Shreya Pathak, Pier Giuseppe Sessa, Cassidy Hardin, Surya Bhupatiraju, Léonard Hussenot, Thomas Mesnard, Bobak Shahriari, Alexandre Ramé, et al. Gemma 2: Improving open language models at a practical size. arXiv preprint arXiv:2408.00118.
  43. Gemma Team, Aishwarya Kamath, Johan Ferret, Shreya Pathak, Nino Vieillard, Ramona Merhej, Sarah Perrin, Tatiana Matejovicova, Alexandre Ramé, Morgane Rivière, et al. Gemma 3 technical report. arXiv preprint arXiv:2503.19786.
  44. Xi Wang, Laurence Aitchison, and Maja Rudolph. LoRA ensembles for large language model fine-tuning. arXiv preprint arXiv:2310.00035.
  45. Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. Self-consistency improves chain of thought reasoning in language models. arXiv preprint arXiv:2203.11171.
  46. Jason Wei, Nguyen Karina, Hyung Won Chung, Yunxin Joy Jiao, Spencer Papay, Amelia Glaese, John Schulman, and William Fedus. Measuring short-form factuality in large language models. arXiv preprint arXiv:2411.04368.
  47. Zhiqiu Xia, Jinxuan Xu, Yuqian Zhang, and Hang Liu. A survey of uncertainty estimation methods on large language models. arXiv preprint arXiv:2503.00172.
  48. Yihao Xue, Kristjan Greenewald, Youssef Mroueh, and Baharan Mirzasoleiman. Verify when uncertain: Beyond self-consistency in black box hallucination detection. arXiv preprint arXiv:2502.15845.
  49. An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, et al. Qwen2.5 technical report. arXiv preprint arXiv:2412.15115.
  50. Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William W Cohen, Ruslan Salakhutdinov, and Christopher D Manning. HotpotQA: A dataset for diverse, explainable multi-hop question answering. arXiv preprint arXiv:1809.09600.

Internal anchors (excerpts from the paper):

  51. [Appendix A.1, theoretical interpretations of EU] Kernel and variational interpretation of D(ω‖ω*): assume the similarity function s(·,·) is a symmetric positive definite kernel k, and denote the predictive distributions by P_ω := p(· | x; ω) and P_{ω*}. Their kernel mean embeddings in the reproducing k…
  52. [evaluation setup] …as the judge model. Following prior work (Lin et al., 2023), correctness is computed using only the first sampled response from each model. All evaluation is inference-only; no training or fine-tuning is performed. For each dataset, 10 responses per model are sampled for the first 100 prompts; AU is computed using all 10 responses. To match…
  53. [Appendix A.4, additional results on total uncertainty] For datasets not originally supported by lm-eval-harness, the paper follows its prompt formatting conventions and includes code for these additions in the supplementary material. Figure 8 reports the AUROC of aleatoric and total uncertainty across all model–dataset pairs, and Figure 9 shows the corresponding…
  54. TU achieves higher AUROC across most tasks, particularly on HotpotQA, WMT16-de-en, and CoQA, where models exhibit confident failures.
  55. …and Qwen2.5 (Yang et al., 2024) of various sizes. As the size of the reference model increases, both aleatoric and total uncertainty AUROC scores tend to decrease, but total uncertainty has consistently higher AUROC across different model sizes. This holds even when the reference model is substantially stronger than any model in the auxiliary set (e.g., Q…
  57. [correctness_score] The benchmark is converted into a long-form QA format with chain-of-thought answering: Boolean Expressions, Disambiguation QA, and Word Sorting, with models prompted to justify their answers rather than selecting from multiple choices directly. Uncertainty scores are then evaluated over the full responses using the same semantic similarity pipeline…