Recognition: unknown
Complementing Self-Consistency with Cross-Model Disagreement for Uncertainty Quantification
Pith reviewed 2026-05-10 06:24 UTC · model grok-4.3
The pith
Cross-model disagreement supplies a missing epistemic uncertainty signal when self-consistency alone is low.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
In the black-box setting, epistemic uncertainty is estimated as the gap between inter-model and intra-model sequence-semantic similarity; adding this term to self-consistency-based aleatoric uncertainty yields a total uncertainty that ranks answers more reliably and supports better selective abstention, while also surfacing confident failures that aleatoric uncertainty alone misses.
What carries the argument
An epistemic uncertainty term computed as the gap between inter-model and intra-model sequence-semantic similarity, serving as a proxy that becomes informative precisely when self-consistency is low.
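A minimal sketch of how such a term could be computed from generated text alone. The embedding model, the mean pairwise cosine aggregation, and the sign convention (EU rises as inter-model similarity falls below intra-model similarity) are illustrative assumptions, not the paper's stated choices.

```python
# Minimal sketch (not the paper's exact pipeline): epistemic uncertainty as
# the gap between intra-model and inter-model sequence-semantic similarity.
# Assumptions: sentence-transformer embeddings, mean pairwise cosine
# aggregation, and the sign convention EU = max(0, intra - inter).
from itertools import combinations, product

import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed similarity backbone


def _mean_pairwise_sim(pairs):
    sims = [float(np.dot(a, b)) for a, b in pairs]
    return float(np.mean(sims)) if sims else 0.0


def epistemic_uncertainty(samples_per_model):
    """samples_per_model[i] is the list of text samples from model i
    (assumes at least two samples per model)."""
    embs = [encoder.encode(s, normalize_embeddings=True) for s in samples_per_model]
    # Intra-model similarity: sample pairs drawn from the same model.
    intra = float(np.mean([
        _mean_pairwise_sim(combinations(e, 2)) for e in embs if len(e) > 1
    ]))
    # Inter-model similarity: sample pairs drawn from different models.
    inter = _mean_pairwise_sim(
        (a, b)
        for i, j in combinations(range(len(embs)), 2)
        for a, b in product(embs[i], embs[j])
    )
    # The term is large when each model agrees with itself but the models
    # disagree with one another, i.e. exactly when self-consistency looks confident.
    return max(0.0, intra - inter)
```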
If this is right
- Total uncertainty improves ranking calibration and selective abstention relative to aleatoric uncertainty alone.
- The epistemic term flags confident failures where aleatoric uncertainty is low.
- The method requires only generated text from a scale-matched ensemble and works without token probabilities.
- Agreement and complementarity diagnostics identify the regimes where the added term contributes most.
Where Pith is reading between the lines
- The same inter-versus-intra similarity gap could be tested on short-form or multiple-choice tasks to check whether the pattern holds beyond long-form generation.
- Replacing semantic similarity with other cheap distance measures might preserve the signal while lowering compute.
- An ensemble of three to five models appears sufficient, suggesting the approach scales without requiring dozens of models.
Load-bearing premise
Cross-model semantic disagreement is higher on incorrect answers exactly when self-consistency is low.
What would settle it
A dataset in which models disagree more on correct answers than on incorrect ones whenever self-consistency is low would show that the epistemic term adds no value, or even harms calibration.
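A hedged sketch of how that check could be run on labeled data: restrict to low-AU items and compare cross-model disagreement on incorrect versus correct answers. The AU threshold, the disagreement proxy (one minus inter-model similarity), and the correctness labels are placeholders rather than the paper's protocol.

```python
# Sketch of the settling test: in the low-AU (overconfident) regime, is
# cross-model disagreement higher on incorrect answers than on correct ones?
# Threshold and disagreement proxy are illustrative choices.
import numpy as np


def premise_holds(au, inter_sim, correct, au_threshold=0.2):
    """Per-item arrays: aleatoric uncertainty, inter-model similarity,
    and correctness labels (1 = correct, 0 = incorrect)."""
    au, inter_sim, correct = map(np.asarray, (au, inter_sim, correct))
    low_au = au < au_threshold            # the self-consistent, overconfident regime
    disagreement = 1.0 - inter_sim        # assumed cross-model disagreement proxy
    wrong = disagreement[low_au & (correct == 0)]
    right = disagreement[low_au & (correct == 1)]
    if wrong.size == 0 or right.size == 0:
        return None                       # regime too sparse to decide either way
    # Load-bearing premise: disagreement on wrong answers exceeds that on right ones.
    return bool(wrong.mean() > right.mean())
```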
Original abstract
Large language models (LLMs) often produce confident yet incorrect responses, and uncertainty quantification is one potential solution to more robust usage. Recent works routinely rely on self-consistency to estimate aleatoric uncertainty (AU), yet this proxy collapses when models are overconfident and produce the same incorrect answer across samples. We analyze this regime and show that cross-model semantic disagreement is higher on incorrect answers precisely when AU is low. Motivated by this, we introduce an epistemic uncertainty (EU) term that operates in the black-box access setting: EU uses only generated text from a small, scale-matched ensemble and is computed as the gap between inter-model and intra-model sequence-semantic similarity. We then define total uncertainty (TU) as the sum of AU and EU. In a comprehensive study across five 7-9B instruction-tuned models and ten long-form tasks, TU improves ranking calibration and selective abstention relative to AU, and EU reliably flags confident failures where AU is low. We further characterize when EU is most useful via agreement and complementarity diagnostics.
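The abstract's pieces can be wired together as follows, under assumed aggregations: AU as one minus the mean pairwise similarity among a single model's samples, TU = AU + EU, and selective abstention on the highest-TU fraction of queries. Neither these exact estimators nor the abstention rule is confirmed by the text above; this is an illustrative composition.

```python
# Illustrative composition of AU, EU, and TU (assumed estimators, not the
# paper's exact ones), plus a simple fixed-budget abstention rule.
import numpy as np


def aleatoric_uncertainty(intra_sim: float) -> float:
    # Self-consistency proxy: low agreement among a model's own samples
    # means high aleatoric uncertainty.
    return 1.0 - intra_sim


def total_uncertainty(intra_sim: float, inter_sim: float) -> float:
    au = aleatoric_uncertainty(intra_sim)
    eu = max(0.0, intra_sim - inter_sim)  # epistemic gap, assumed sign convention
    return au + eu


def abstain_mask(tu_scores, abstain_frac=0.2):
    """Abstain on the abstain_frac of queries with the highest total uncertainty."""
    tu = np.asarray(tu_scores, dtype=float)
    k = int(round(abstain_frac * len(tu)))
    if k == 0:
        return np.zeros(len(tu), dtype=bool)
    cutoff = np.partition(tu, -k)[-k]     # k-th largest TU value
    return tu >= cutoff
```

Under this composition, the abstract's ranking-calibration and abstention claims amount to saying that sorting by TU separates correct from incorrect responses better than sorting by AU alone.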
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes complementing self-consistency-based aleatoric uncertainty (AU) with a black-box epistemic uncertainty (EU) term defined as the gap between inter-model and intra-model sequence-semantic similarity over a small ensemble of 7-9B instruction-tuned models. Total uncertainty is TU = AU + EU. Motivated by the observation that cross-model disagreement rises on incorrect answers when AU is low, the authors report that TU improves ranking calibration and selective abstention relative to AU alone across five models and ten long-form tasks, while EU specifically flags confident failures; they also provide agreement and complementarity diagnostics.
Significance. If the reported gains hold, the work meaningfully extends uncertainty quantification for LLMs beyond self-consistency by providing a practical, black-box proxy for epistemic uncertainty that targets the overconfident regime. The multi-model, multi-task empirical scope and explicit diagnostics for when EU adds value are strengths; the approach requires only generated text and a modest ensemble, which increases applicability.
minor comments (3)
- [§3] §3 (Methods): the precise semantic similarity function (embedding model, pooling, or judge LLM) used to compute sequence-level inter- and intra-model similarities should be stated explicitly, including any hyperparameters, so that EU is fully reproducible.
- [Results] Results tables: report the exact calibration metrics (e.g., ECE, Brier score, or ranking AUC) and abstention curves with confidence intervals or statistical tests across the ten tasks; the abstract claims improvement but the quantitative deltas are not summarized in the provided text.
- [§4.3] §4.3 (Diagnostics): the agreement and complementarity plots would benefit from a short formal definition of the plotted quantities (e.g., how 'agreement' between AU and EU is quantified) to avoid ambiguity.
Simulated Author's Rebuttal
We thank the referee for the positive summary, significance assessment, and recommendation of minor revision. The report accurately reflects the core contribution of our work: showing that cross-model semantic disagreement provides a practical black-box epistemic uncertainty term that complements self-consistency-based aleatoric uncertainty, particularly in the overconfident regime. As the report contains no specific major comments, we have no points requiring rebuttal or targeted revision at this time. We will incorporate any minor suggestions during the revision process.
Circularity Check
No significant circularity
full rationale
The paper's core contribution is an empirical study: it first observes (via analysis) that cross-model semantic disagreement rises on incorrect answers precisely when self-consistency-based AU is low, then explicitly defines EU as the computable gap between inter-model and intra-model sequence-semantic similarity on generated text, sets TU = AU + EU, and reports that TU improves ranking calibration and selective abstention over AU alone across five models and ten tasks. None of these steps reduces by the paper's own equations or definitions to a fitted parameter, renamed input, or self-citation chain; the definitions are direct and the claims rest on external experimental outcomes rather than tautological re-derivation of the inputs.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Sequence-semantic similarity can be measured reliably from generated text alone to distinguish intra-model from inter-model agreement.
invented entities (1)
- Epistemic uncertainty (EU) term: no independent evidence.
Reference graph
Works this paper leans on
- [1] Lukas Aichberger, Kajetan Schweighofer, Mykyta Ielanskyi, and Sepp Hochreiter. Semantically diverse language generation for uncertainty estimation in language models. arXiv preprint arXiv:2406.04306.
- [2] Neil Band, Tim G. J. Rudner, Qixuan Feng, Angelos Filos, Zachary Nado, Michael W. Dusenberry, Ghassen Jerfel, Dustin Tran, and Yarin Gal. Benchmarking bayesian deep learning on diabetic retinopathy detection tasks. arXiv preprint arXiv:2211.12717.
- [3] Ondrej Bojar, Rajen Chatterjee, Christian Federmann, Yvette Graham, Barry Haddow, Matthias Huck, Antonio Jimeno Yepes, Philipp Koehn, Varvara Logacheva, Christof Monz, et al. Findings of the 2016 Conference on Machine Translation (WMT16). In First Conference on Machine Translation, pp. 131–198. Association for Computational Linguistics, 2016.
- [4] Jianhao Chen, Zishuo Xun, Bocheng Zhou, Han Qi, Qiaosheng Zhang, Yang Chen, Wei Hu, Yuzhong Qu, Wanli Ouyang, and Shuyue Hu. Do we truly need so many samples? Multi-LLM repeated sampling efficiently scale test-time compute. arXiv preprint arXiv:2504.00762.
- [5] Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168.
- [6] Ben Cottier, Robi Rahman, Loredana Fattorini, Nestor Maslej, Tamay Besiroglu, and David Owen. The rising costs of training frontier AI models. arXiv preprint arXiv:2405.21015.
- [7] Pradeep Dasigi, Kyle Lo, Iz Beltagy, Arman Cohan, Noah A. Smith, and Matt Gardner. A dataset of information-seeking questions and answers anchored in research papers. arXiv preprint arXiv:2105.03011.
- [8] Prasenjit Dey, Srujana Merugu, and Sivaramakrishnan Kaveri. Uncertainty-aware fusion: An ensemble framework for mitigating hallucinations in large language models. arXiv preprint arXiv:2503.05757.
- [9] Ekaterina Fadeeva, Roman Vashurin, Akim Tsvigun, Artem Vazhentsev, Sergey Petrakov, Kirill Fedyanin, Daniil Vasilev, Elizaveta Goncharova, Alexander Panchenko, Maxim Panov, et al. LM-Polygraph: Uncertainty estimation for language models. arXiv preprint arXiv:2311.07383.
- [10] Xiang Gao, Jiaxin Zhang, Lalla Mouatadid, and Kamalika Das. SPUQ: Perturbation-based uncertainty quantification for large language models. arXiv preprint arXiv:2403.02509.
- [11] Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. The Llama 3 herd of models. arXiv preprint arXiv:2407.21783.
- [12] H. W. Chung, L. Hou, S. Longpre, B. Zoph, Y. Tay, W. Fedus, Y. Li, X. Wang, M. Dehghani, S. Brahma, and A. Webson. Scaling instruction-finetuned language models. arXiv preprint arXiv:2210.11416.
- [13] Aaron Hurst, Adam Lerer, Adam P. Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, et al. GPT-4o system card. arXiv preprint arXiv:2410.21276.
- [14] Albert Q. Jiang, A. Sablayrolles, A. Mensch, C. Bamford, D. Singh Chaplot, D. de las Casas, F. Bressand, G. Lengyel, G. Lample, L. Saulnier, et al. Mistral 7B. arXiv preprint arXiv:2310.06825.
- [15] Daniel D. Johnson, Daniel Tarlow, David Duvenaud, and Chris J. Maddison. Experts don't cheat: Learning what you don't know by predicting pairs. arXiv preprint arXiv:2402.08733.
- [16] Mandar Joshi, Eunsol Choi, Daniel S. Weld, and Luke Zettlemoyer. TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension. arXiv preprint arXiv:1705.03551.
- [17] Saurav Kadavath, Tom Conerly, Amanda Askell, Tom Henighan, Dawn Drain, Ethan Perez, Nicholas Schiefer, Zac Hatfield-Dodds, Nova DasSarma, Eli Tran-Johnson, et al. Language models (mostly) know what they know. arXiv preprint arXiv:2207.05221.
- [18] Zhewei Kang, Xuandong Zhao, and Dawn Song. Scalable best-of-n selection for large language models via self-certainty. arXiv preprint arXiv:2502.18581.
- [19] Andreas Kirsch. (Implicit) ensembles of ensembles: Epistemic uncertainty collapse in large models. arXiv preprint arXiv:2409.02628.
- [20] Jannik Kossen, Jiatong Han, Muhammed Razzak, Lisa Schut, Shreshth Malik, and Yarin Gal. Semantic entropy probes: Robust and cheap hallucination detection in LLMs. arXiv preprint arXiv:2406.15927.
- [21] Lorenz Kuhn, Yarin Gal, and Sebastian Farquhar. Semantic uncertainty: Linguistic invariances for uncertainty estimation in natural language generation. arXiv preprint arXiv:2302.09664; also published at ICLR 2023.
- [22] Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with PagedAttention. In Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles.
- [23] Shalev Lifshitz, Sheila A. McIlraith, and Yilun Du. Multi-agent verification: Scaling test-time compute with multiple verifiers. arXiv preprint arXiv:2502.20379.
- [24] Stephanie Lin, Jacob Hilton, and Owain Evans. TruthfulQA: Measuring how models mimic human falsehoods. arXiv preprint arXiv:2109.07958.
- [25] Stephanie Lin, Jacob Hilton, and Owain Evans. Teaching models to express their uncertainty in words. arXiv preprint arXiv:2205.14334.
- [26] Zhen Lin, Shubhendu Trivedi, and Jimeng Sun. Generating with confidence: Uncertainty quantification for black-box large language models. arXiv preprint arXiv:2305.19187.
- [27] Linyu Liu, Yu Pan, Xiaocheng Li, and Guanting Chen. Uncertainty estimation and quantification for LLMs: A simple supervised approach. arXiv preprint arXiv:2404.15993.
- [28] Litian Liu, Reza Pourreza, Sunny Panchal, Apratim Bhattacharyya, Yao Qin, and Roland Memisevic. Enhancing hallucination detection through noise injection. arXiv preprint arXiv:2502.03799.
- [29] Jinliang Lu, Ziliang Pang, Min Xiao, Yaochen Zhu, Rui Xia, and Jiajun Zhang. Merge, ensemble, and cooperate! A survey on collaborative strategies in the era of large language models. arXiv preprint arXiv:2407.06089.
- [30] Huan Ma, Jingdong Chen, Guangyu Wang, and Changqing Zhang. Estimating LLM uncertainty with logits. arXiv preprint arXiv:2502.00290.
- [31] Potsawee Manakul, Adian Liusie, and Mark J. F. Gales. SelfCheckGPT: Zero-resource black-box hallucination detection for generative large language models. arXiv preprint arXiv:2303.08896.
- [32] Sewon Min, Julian Michael, Hannaneh Hajishirzi, and Luke Zettlemoyer. AmbigQA: Answering ambiguous open-domain questions. arXiv preprint arXiv:2004.10645.
- [33] Shashi Narayan, Shay B. Cohen, and Mirella Lapata. Don't give me the details, just the summary! Topic-aware convolutional neural networks for extreme summarization. arXiv preprint arXiv:1808.08745.
- [34] Jianmo Ni, Gustavo Hernandez Abrego, Noah Constant, Ji Ma, Keith B. Hall, Daniel Cer, and Yinfei Yang. Sentence-T5: Scalable sentence encoders from pre-trained text-to-text models. arXiv preprint arXiv:2108.08877.
- [35] Shiyu Ni, Keping Bi, Jiafeng Guo, and Xueqi Cheng. When do LLMs need retrieval augmentation? Mitigating LLMs' overconfidence helps retrieval augmentation. arXiv preprint arXiv:2402.11457.
- [36] Andrea Santilli, Adam Golinski, Michael Kirchhof, Federico Danieli, Arno Blaas, Miao Xiong, Luca Zappella, and Sinead Williamson. Revisiting uncertainty quantification evaluation in language models: Spurious interactions with response length bias results. arXiv preprint arXiv:2504.13677.
- [37] Kajetan Schweighofer, Lukas Aichberger, Mykyta Ielanskyi, and Sepp Hochreiter. Introducing an improved information-theoretic measure of predictive uncertainty. arXiv preprint arXiv:2311.08309.
- [38] Tal Shnitzer, Anthony Ou, Mírian Silva, Kate Soule, Yuekai Sun, Justin Solomon, Neil Thompson, and Mikhail Yurochkin. Large language model routing with benchmark datasets. arXiv preprint arXiv:2309.15789.
- [39] Ola Shorinwa, Zhiting Mei, Justin Lidard, Allen Z. Ren, and Anirudha Majumdar. A survey on uncertainty quantification of large language models: Taxonomy, open research challenges, and future directions. arXiv preprint arXiv:2412.05563.
- [40] Adi Simhi, Itay Itzhak, Fazl Barez, Gabriel Stanovsky, and Yonatan Belinkov. Trust me, I'm wrong: High-certainty hallucinations in LLMs. arXiv preprint arXiv:2502.12964.
- [41] Mirac Suzgun, Nathan Scales, Nathanael Schärli, Sebastian Gehrmann, Yi Tay, Hyung Won Chung, Aakanksha Chowdhery, Quoc V. Le, Ed H. Chi, Denny Zhou, et al. Challenging BIG-Bench tasks and whether chain-of-thought can solve them. arXiv preprint arXiv:2210.09261.
- [42] Gemma Team, Morgane Riviere, Shreya Pathak, Pier Giuseppe Sessa, Cassidy Hardin, Surya Bhupatiraju, Léonard Hussenot, Thomas Mesnard, Bobak Shahriari, Alexandre Ramé, et al. Gemma 2: Improving open language models at a practical size. arXiv preprint arXiv:2408.00118.
- [43] Gemma Team, Aishwarya Kamath, Johan Ferret, Shreya Pathak, Nino Vieillard, Ramona Merhej, Sarah Perrin, Tatiana Matejovicova, Alexandre Ramé, Morgane Rivière, et al. Gemma 3 technical report. arXiv preprint arXiv:2503.19786.
- [44] Xi Wang, Laurence Aitchison, and Maja Rudolph. LoRA ensembles for large language model fine-tuning. arXiv preprint arXiv:2310.00035.
- [45] Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. Self-consistency improves chain of thought reasoning in language models. arXiv preprint arXiv:2203.11171.
- [46] Jason Wei, Nguyen Karina, Hyung Won Chung, Yunxin Joy Jiao, Spencer Papay, Amelia Glaese, John Schulman, and William Fedus. Measuring short-form factuality in large language models. arXiv preprint arXiv:2411.04368.
- [47] Zhiqiu Xia, Jinxuan Xu, Yuqian Zhang, and Hang Liu. A survey of uncertainty estimation methods on large language models. arXiv preprint arXiv:2503.00172.
- [48] Yihao Xue, Kristjan Greenewald, Youssef Mroueh, and Baharan Mirzasoleiman. Verify when uncertain: Beyond self-consistency in black box hallucination detection. arXiv preprint arXiv:2502.15845.
- [49] An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, et al. Qwen2.5 technical report. arXiv preprint arXiv:2412.15115.
- [50] Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William W. Cohen, Ruslan Salakhutdinov, and Christopher D. Manning. HotpotQA: A dataset for diverse, explainable multi-hop question answering. arXiv preprint arXiv:1809.09600.