
arxiv: 2606.22792 · v1 · pith:PXMELEJAnew · submitted 2026-06-22 · 💻 cs.AI

The Origins of Stochasticity: Comprehensive Investigations on Uncertainty Quantification for Large Language Models

Pith reviewed 2026-06-26 09:02 UTC · model grok-4.3

classification 💻 cs.AI
keywords uncertainty quantification · large language models · stochasticity · consensus-based methods · scaling law · uncertainty taxonomy · LLM evaluation · model uncertainty

The pith

Consensus-based uncertainty quantification methods outperform other approaches for large language models, with larger models showing lower uncertainty.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes a granular taxonomy that attributes LLM uncertainty to four sources: input-level, parameter-level, token-level, and decoding-process. It categorizes UQ methods into Bayesian, ensemble, consensus-based, and single-pass approaches based on this taxonomy. Empirical evaluations of 21 methods across multiple LLMs and benchmarks reveal that consensus-based methods consistently perform best and that uncertainty decreases with model scale. This provides a practical way to assess and manage uncertainty in LLM applications.
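To make the single-pass category concrete, here is a minimal sketch of one generic single-pass score: the length-normalized negative log-likelihood of a generated sequence. It is an illustration only. The function name and the example probabilities are assumptions, and nothing in this summary says whether this exact score is among the 21 evaluated methods.

```python
import math


def length_normalized_nll(token_probs):
    """Average negative log-probability of the generated tokens.

    token_probs: probabilities the model assigned to each token it emitted
    (hypothetical input; a real pipeline would read these from the model's
    per-token log-probabilities). Higher values mean higher uncertainty.
    """
    if not token_probs:
        raise ValueError("need at least one token probability")
    if any(p <= 0.0 or p > 1.0 for p in token_probs):
        raise ValueError("probabilities must lie in (0, 1]")
    return -sum(math.log(p) for p in token_probs) / len(token_probs)


# Made-up probabilities for illustration, not values from the paper.
confident = length_normalized_nll([0.95, 0.90, 0.99])
hesitant = length_normalized_nll([0.40, 0.35, 0.60])
print(confident < hesitant)
```

The score needs only one forward pass, which is what separates it from consensus-based methods that sample several responses.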

Core claim

The paper claims that its four-source uncertainty taxonomy allows for a systematic categorization of UQ methods, and that experiments demonstrate consensus-based methods outperform others while larger model scales correlate with lower uncertainty estimates, suggesting an empirical scaling law.
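The scaling part of this claim is a statement about a monotonic trend, which a rank correlation can check. The sketch below computes Spearman's rho between model size and mean uncertainty in plain Python. The function names are assumptions, and the sizes and uncertainty values are made-up placeholders, not results from the paper.

```python
import math


def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length samples with at least 2 points")
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical placeholders: parameter counts (billions) and mean uncertainty.
params_billion = [1, 3, 8, 70]
mean_uncertainty = [0.62, 0.55, 0.47, 0.31]
print(spearman(params_billion, mean_uncertainty))
```

A strongly negative rho would support "larger models, lower uncertainty", but it says nothing about functional form; a scaling law in the stricter sense would also need a fitted curve and a check on held-out model sizes.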

What carries the argument

The four-source uncertainty taxonomy (input, parameter, token, decoding-process) that supports categorizing and evaluating UQ methods.

If this is right

  • Effectiveness of UQ methods is sensitive to task types and generation settings.
  • Consensus-based methods like Deg and EigV consistently outperform other UQ approaches.
  • Larger model scales correlate with lower uncertainty estimates.
  • This indicates an empirical scaling law for LLM uncertainty.
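The summary names Deg and EigV without defining them. Assuming Deg follows the degree-matrix definition used by Lin, Trivedi, and Sun in their work on black-box UQ (sample m responses, build a pairwise similarity matrix W, and score uncertainty as trace(mI - D) / m^2, where D holds the row sums of W), a minimal sketch looks like this. The token-overlap similarity is a crude stand-in chosen to keep the example self-contained; the function names and sample responses are hypothetical.

```python
def jaccard(a, b):
    """Token-overlap similarity in [0, 1]; a stand-in for a learned similarity."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def degree_uncertainty(responses, similarity=jaccard):
    """Deg-style consensus uncertainty over m sampled responses.

    Returns 0.0 when all responses agree and approaches 1.0 as they diverge.
    """
    m = len(responses)
    if m < 2:
        raise ValueError("need at least two sampled responses")
    # W[j][k] is the pairwise similarity; the diagonal is 1 (self-agreement).
    w = [[similarity(responses[j], responses[k]) for k in range(m)]
         for j in range(m)]
    degrees = [sum(row) for row in w]  # diagonal of the degree matrix D
    # trace(m*I - D) / m^2
    return sum(m - d for d in degrees) / (m * m)


# Hypothetical samples for one question.
agreeing = ["Paris", "paris", "Paris", "Paris"]
scattered = ["Paris", "Lyon", "Marseille", "Nice"]
print(degree_uncertainty(agreeing), degree_uncertainty(scattered))
```

Under the same assumption, EigV would replace the degree sum with a sum of max(0, 1 - λ) over the eigenvalues λ of the normalized graph Laplacian of W; it is omitted here because it needs an eigenvalue solver.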

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The taxonomy might enable targeted improvements in UQ by addressing specific sources separately.
  • The scaling observation could imply that uncertainty issues diminish naturally with model advancement.
  • Consensus methods may be preferred in applications where reliability is critical.

Load-bearing premise

The proposed four-source taxonomy systematically and non-overlappingly attributes uncertainty sources in LLM generation.
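One way to probe this premise is to tabulate, for each method, every source it could plausibly be attributed to, then flag any method with zero or several. The sketch below assumes a hand-written attribution table; the method names and attributions are hypothetical placeholders, not the paper's mapping.

```python
SOURCES = frozenset({"input", "parameter", "token", "decoding"})


def audit_attribution(attribution):
    """Return (unassigned, overlapping) method names for a source mapping.

    attribution: dict mapping a method name to the set of sources it targets.
    A strict partition would yield two empty lists.
    """
    for method, sources in attribution.items():
        extra = set(sources) - SOURCES
        if extra:
            raise ValueError(f"{method}: unknown sources {sorted(extra)}")
    unassigned = sorted(m for m, s in attribution.items() if len(s) == 0)
    overlapping = sorted(m for m, s in attribution.items() if len(s) > 1)
    return unassigned, overlapping


# Hypothetical table: names and attributions are illustrative only.
example = {
    "method_a": {"parameter"},
    "method_b": {"token", "parameter"},
    "method_c": set(),
}
print(audit_attribution(example))
```

Any method landing in the second list is a case where the taxonomy does not partition cleanly.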

What would settle it

An observation that the uncertainty sources overlap significantly or that consensus-based methods fail to outperform on new tasks would falsify the main results.

Figures

Figures reproduced from arXiv: 2606.22792 by Jing Wang, Rong-Hao Huang, Shao-Qun Zhang, Shuang Liang, Xiang-Jun Ou, Xin-Yu Hu.

Figure 1: Timeline of remarkable uncertainty quantification methods, including Bayesian, ensemble, … [PITH_FULL_IMAGE:figures/full_fig_p003_1.png]
Figure 2: Illustration of four uncertainty sources that originate from the stochasticity of inputs, model … [PITH_FULL_IMAGE:figures/full_fig_p009_2.png]
Figure 3: Overall evaluation ranking of the investigated LLM UQ methods in the answer-only generation … [PITH_FULL_IMAGE:figures/full_fig_p021_3.png]
Figure 4: Overall evaluation ranking of the investigated LLM UQ methods in the reasoning-augmented … [PITH_FULL_IMAGE:figures/full_fig_p032_4.png]
Figure 5: Plots of output uncertainty versus accuracy across parameter scales. [PITH_FULL_IMAGE:figures/full_fig_p034_5.png]
Figure 6: Evaluation ranking of the investigated LLM UQ methods under the answer-only generation … [PITH_FULL_IMAGE:figures/full_fig_p038_6.png]
Figure 7: Evaluation ranking of the investigated LLM UQ methods under the reasoning-augmented … [PITH_FULL_IMAGE:figures/full_fig_p039_7.png]
Original abstract

Recent advancements in Large Language Models (LLMs) have enabled sophisticated reasoning and content generation, yet their inherent stochasticity poses significant challenges for ensuring predictive credibility. While traditional uncertainty taxonomy paradigms, such as the dichotomy of aleatoric and epistemic uncertainties, provide conceptual foundations, they often fail to capture the multi-component and multi-stage nature of LLM generation and struggle to evaluate the effectiveness of various Uncertainty Quantification (UQ) methods. In this paper, we propose a granular uncertainty taxonomy that systematically attributes LLM uncertainty into input-level, parameter-level, token-level, and decoding-process sources. Correspondingly, we categorize existing UQ methods into Bayesian, ensemble, consensus-based, and single-pass approaches. Furthermore, we introduce a comprehensive evaluation framework covering diverse generation settings and metrics. We empirically evaluate 21 typical UQ methods across three prominent LLM families, including Qwen3, Llama 3.2, and DeepSeek-V3, on benchmarks such as TriviaQA, GSM8K, and HumanEval. Our experimental results demonstrate that (i) the effectiveness of UQ methods is sensitive to task types and generation settings; (ii) consensus-based methods, typified by Deg and EigV, consistently outperform other UQ approaches; and (iii) larger model scales correlate with lower uncertainty estimates, suggesting an empirical scaling law for LLM uncertainty. This work bridges the gap between theoretical origins and practical deployment, providing a versatile diagnostic tool for systematically quantifying uncertainty in LLM applications.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes a four-source taxonomy for uncertainty in LLMs (input-level, parameter-level, token-level, decoding-process sources) and correspondingly categorizes 21 UQ methods into Bayesian, ensemble, consensus-based, and single-pass approaches. It presents an evaluation framework and reports empirical results on Qwen3, Llama 3.2, and DeepSeek-V3 across TriviaQA, GSM8K, and HumanEval, claiming that (i) UQ effectiveness is sensitive to task type and generation settings, (ii) consensus-based methods (Deg, EigV) consistently outperform the other categories, and (iii) larger model scales correlate with lower uncertainty estimates, suggesting an empirical scaling law.

Significance. If the taxonomy can be shown to provide a non-overlapping partition and the empirical comparisons are placed on a statistically sound footing, the work would supply a structured diagnostic lens for LLM uncertainty that could guide method selection and connect theoretical sources to practical UQ performance. The reported scaling observation would also be of interest if replicated.

major comments (2)
  1. [Abstract and taxonomy section] The claim that the four sources 'systematically attribute' uncertainty without overlap is load-bearing for the subsequent categorization of the 21 methods and for the interpretation that consensus-based methods outperform because they target a distinct source; no formal disjointness argument, exhaustive mapping, or check for re-interpretability (e.g., a single-pass method also being parameter-level) is supplied.
  2. [Experimental results section] The statement that Deg and EigV 'consistently outperform' other approaches requires, at minimum, per-benchmark tables with means, standard deviations or error bars, and a statistical test across the three model families; the abstract supplies none of these, leaving open whether observed gaps are significant or artifacts of implementation details within each category.
minor comments (2)
  1. [Notation and tables] Ensure every abbreviation (Deg, EigV, etc.) is defined on first use and that the exact assignment of each of the 21 methods to one of the four categories is tabulated for reproducibility.
  2. [Evaluation framework] Clarify the precise generation settings (temperature, top-p, etc.) and the exact metrics used for each benchmark so that the sensitivity claim in (i) can be verified.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which help clarify the presentation of our taxonomy and strengthen the empirical claims. We address each major comment below.

Point-by-point responses
  1. Referee: [Abstract and taxonomy section] The claim that the four sources 'systematically attribute' uncertainty without overlap is load-bearing for the subsequent categorization of the 21 methods and for the interpretation that consensus-based methods outperform because they target a distinct source; no formal disjointness argument, exhaustive mapping, or check for re-interpretability (e.g., a single-pass method also being parameter-level) is supplied.

    Authors: We agree that the manuscript does not supply a formal proof of disjointness or an exhaustive re-interpretability check. The taxonomy is motivated by the sequential stages of LLM generation (input encoding, parameter sampling during inference, per-token distribution, and decoding strategy), which we treat as primary attribution sources. While conceptual overlaps are possible in edge cases, the categorization of the 21 methods follows these primary attributions. We will add a dedicated subsection in the taxonomy section that discusses potential overlaps, provides an explicit mapping table, and acknowledges limitations in strict disjointness.
    Revision: yes

  2. Referee: [Experimental results section] The statement that Deg and EigV 'consistently outperform' other approaches requires, at minimum, per-benchmark tables with means, standard deviations or error bars, and a statistical test across the three model families; the abstract supplies none of these, leaving open whether observed gaps are significant or artifacts of implementation details within each category.

    Authors: We acknowledge that the current presentation of results does not include the requested statistical rigor in the reported tables or abstract. The full experimental section contains per-benchmark scores, but we will revise it to include (i) expanded tables with means and standard deviations computed over multiple runs, (ii) error bars in figures, and (iii) paired statistical tests (e.g., Wilcoxon signed-rank) across the three model families to assess whether performance gaps are significant. These additions will be reflected in both the results section and a revised abstract.
    Revision: yes
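The paired test mentioned in the second response can be sketched in plain Python for small samples. The version below computes the Wilcoxon signed-rank statistic and an exact two-sided p-value by enumerating all sign assignments. It is a sketch under stated assumptions: the function name is hypothetical, the scores are made-up placeholders rather than the paper's numbers, and a real analysis would more likely call a statistics library such as SciPy's `scipy.stats.wilcoxon`.

```python
from itertools import product


def wilcoxon_signed_rank_exact(a, b, max_pairs=20):
    """Return (W_plus, two-sided exact p-value) for paired samples a and b.

    Zero differences are dropped; tied absolute differences share their
    average rank. Enumeration is exponential, hence the max_pairs guard.
    """
    diffs = [x - y for x, y in zip(a, b) if x != y]
    n = len(diffs)
    if n == 0:
        raise ValueError("all paired differences are zero")
    if n > max_pairs:
        raise ValueError("too many pairs for exact enumeration")
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        avg = (i + j) / 2 + 1  # average 1-based rank of the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    center = sum(ranks) / 2  # expected W_plus under the null hypothesis
    observed = abs(w_plus - center)
    extreme = 0
    for signs in product((0, 1), repeat=n):
        w = sum(r for r, s in zip(ranks, signs) if s)
        if abs(w - center) >= observed - 1e-12:
            extreme += 1
    return w_plus, extreme / 2 ** n


# Hypothetical per-setting scores for two UQ methods (placeholders only).
consensus = [0.81, 0.77, 0.84, 0.79, 0.88, 0.74]
other = [0.78, 0.71, 0.80, 0.80, 0.79, 0.72]
print(wilcoxon_signed_rank_exact(consensus, other))
```

With only three model families, pairing would have to be done per family-benchmark-setting combination to reach a usable sample size; with very few pairs the smallest attainable p-value is 2 / 2^n, which can rule out significance before any data is seen.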

Circularity Check

0 steps flagged

No significant circularity; empirical claims rest on external benchmarks and standard models

Full rationale

The paper proposes a four-source taxonomy and corresponding four-way categorization of UQ methods, then reports empirical performance on public benchmarks (TriviaQA, GSM8K, HumanEval) and standard model families (Qwen3, Llama 3.2, DeepSeek-V3). No equations, fitted parameters, or self-citations are shown to reduce the central claims (consensus-based methods outperform; scaling law) to the taxonomy by construction. The taxonomy is presented as a proposed attribution scheme rather than a self-definitional mapping, and the evaluation framework uses independent data and metrics. This satisfies the default expectation of a self-contained empirical study against external references.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The paper rests on the domain assumption that LLM generation can be decomposed into the four listed uncertainty sources; no free parameters or new physical entities are introduced.

axioms (1)
  • Domain assumption: LLM generation involves distinct and attributable uncertainty sources at input, parameter, token, and decoding stages.
    Invoked to motivate the granular taxonomy and method categorization.


