arXiv: 2605.19220 · v1 · pith:VWKEUBJGnew · submitted 2026-05-19 · 💻 cs.CL · cs.AI · cs.LG

Position: Uncertainty Quantification in LLMs is Just Unsupervised Clustering

Pith reviewed 2026-05-20 06:49 UTC · model grok-4.3

classification 💻 cs.CL · cs.AI · cs.LG
keywords uncertainty quantification · large language models · hallucinations · internal consistency · unsupervised clustering · factual correctness · confident errors

The pith

Mainstream uncertainty quantification for LLMs measures only internal consistency of generations and misses confident factual errors.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that typical methods for estimating uncertainty in large language models function as unsupervised clustering routines. These routines group similar model outputs to assess how stable the generations are, without reference to any external standard of correctness. As a result the methods cannot flag cases in which a model produces a consistently wrong answer with high apparent confidence. This limitation leaves deployments vulnerable to undetected hallucinations and creates an unreliable sense of safety around model outputs.

Core claim

Mainstream UQ methods for LLMs are just unsupervised clustering algorithms that quantify the internal consistency of the model's generations rather than their external correctness and therefore fail to detect confident hallucinations.

What carries the argument

The reframing of UQ techniques as unsupervised clustering of model generations that captures only internal consistency rather than factual accuracy.
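
To make the reframing concrete, here is a minimal Python sketch of a consistency-only uncertainty score. It is not the paper's implementation: `normalize` is a hypothetical stand-in for a real semantic-equivalence check (such as an entailment model), and the sample answers are invented for illustration. The score is the entropy over clusters of sampled answers, and nothing in it consults a ground-truth label.

```python
import math
from collections import Counter

def normalize(answer: str) -> str:
    """Crude stand-in for a semantic-equivalence check (assumption):
    lowercase the answer and drop punctuation."""
    kept = "".join(ch for ch in answer.lower() if ch.isalnum() or ch.isspace())
    return " ".join(kept.split())

def cluster_entropy(samples):
    """Entropy of the cluster distribution over sampled answers.
    Only the samples are used; no external truth enters the score."""
    counts = Counter(normalize(s) for s in samples)
    n = len(samples)
    return sum(-(c / n) * math.log(c / n) for c in counts.values())

# Invented samples for "What is the capital of Australia?" (correct: Canberra).
consistent_wrong = ["Sydney", "sydney.", "Sydney", "SYDNEY", "Sydney"]
split = ["Canberra", "Sydney", "Melbourne", "Canberra", "Sydney"]

print(cluster_entropy(consistent_wrong))
print(cluster_entropy(split))
```

In this toy run the unanimous but wrong answer set scores zero entropy, while the mixed set that contains the correct answer scores higher. That is the failure mode the paper calls a confident hallucination.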

If this is right

  • UQ methods exhibit hyperparameter sensitivity that makes safe deployment difficult.
  • Evaluation loops equate output stability with truth and therefore cannot validate correctness.
  • Absence of ground truth forces reliance on unstable proxy metrics for assessing uncertainty quality.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Future UQ systems would need explicit anchoring to external verification sources such as retrieved facts or executable checks.
  • Domains with clear objective answers, such as arithmetic or database queries, could serve as minimal test beds to expose the gap between consistency and correctness.
  • Native model mechanisms that expose calibration to real-world outcomes rather than token-level agreement might replace clustering-based proxies.

Load-bearing premise

Internal consistency of a model's generations cannot serve as a useful proxy for external factual correctness under any practical deployment condition.

What would settle it

A controlled test on a factual benchmark with known ground truth where an internal-consistency UQ score remains high for generations that are verifiably incorrect.
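
A sketch of how such a test could be scored, assuming each benchmark item carries several sampled answers and one gold answer. The record layout, the `min_agreement` cutoff of 0.8, and the exact-match comparison are all assumptions made for illustration; a real study would need a proper equivalence check.

```python
from collections import Counter

def majority_and_agreement(samples):
    """Return the most frequent answer and the share of samples that agree with it."""
    counts = Counter(s.strip().lower() for s in samples)
    answer, count = counts.most_common(1)[0]
    return answer, count / len(samples)

def confident_errors(records, min_agreement=0.8):
    """List items whose samples agree strongly on an answer that differs from the gold label."""
    flagged = []
    for rec in records:
        answer, agreement = majority_and_agreement(rec["samples"])
        if agreement >= min_agreement and answer != rec["gold"].strip().lower():
            flagged.append((rec["question"], answer, agreement))
    return flagged

# Toy records, invented for illustration only.
records = [
    {"question": "Capital of Australia?", "samples": ["Sydney"] * 5, "gold": "Canberra"},
    {"question": "7 * 8?", "samples": ["56", "56", "54", "56", "56"], "gold": "56"},
    {"question": "Largest planet?",
     "samples": ["Jupiter", "Saturn", "Jupiter", "Neptune", "Saturn"], "gold": "Jupiter"},
]

flagged = confident_errors(records)
print(flagged)
```

Any item returned by `confident_errors` is a case where a consistency-based score would report high confidence in a verifiably wrong answer. The rate of such items on a real labeled benchmark is the quantity the test would need to report.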

Figures

Figures reproduced from arXiv: 2605.19220 by Hua Wei, Longchao Da, Tiejin Chen, Xiaoou Liu.

Figure 1: The common UQ methods for LLM and its representative work (name with *) for inductive discussions in Section 2.
… argues that this is the wrong level. Instead, it aggregates sequences that share the same meaning into classes, effectively treating each semantic cluster C_i as a distinct “Answer Class”. Each answer class has a unique semantic meaning (e.g., “Paris” vs. “The capital of France” will be in the s…
Figure 2: PCA visualization of Qwen2.5-32b-Instruct hidden states during P(true) estimation on the QASC dataset. The visualization demonstrates that the model’s internal states during P(true) are geometrically partitioned into distinct belief clusters, empirically validating that P(true) functions as an implicit clustering.
… concentration around a single dominant mode. Conversely, when responses express multiple inco…
Figure 3: The effect of correctness threshold τ on UQ method evaluation consistency. As the threshold varies, method rankings become unstable. Figure adapted from Liu et al. (2025b).
… particularly for open-ended generation tasks. Obtaining accurate correctness labels is inherently challenging because correct answers are not unique. Semantically equivalent responses may differ substantially in surface form. Differen…
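
The ranking instability described in the Figure 3 caption can be sketched with toy numbers. Everything below is invented for illustration and is not data from the paper: `quality` is a hypothetical graded correctness score per item, `method_a` and `method_b` are uncertainty scores from two hypothetical UQ methods, and an item counts as correct when its quality reaches the threshold `tau`.

```python
def auroc(uncertainty, is_correct):
    """Probability that a random incorrect item gets higher uncertainty
    than a random correct one (ties count as half)."""
    wrong = [u for u, c in zip(uncertainty, is_correct) if not c]
    right = [u for u, c in zip(uncertainty, is_correct) if c]
    wins = sum((w > r) + 0.5 * (w == r) for w in wrong for r in right)
    return wins / (len(wrong) * len(right))

# Toy data (assumption): graded correctness and two methods' uncertainty scores.
quality  = [0.9, 0.8, 0.6, 0.5, 0.2, 0.1]
method_a = [0.3, 0.4, 0.1, 0.2, 0.8, 0.9]
method_b = [0.1, 0.2, 0.8, 0.9, 0.3, 0.4]

results = {}
for tau in (0.3, 0.7):
    labels = [q >= tau for q in quality]
    results[tau] = (auroc(method_a, labels), auroc(method_b, labels))
    print(tau, results[tau])
```

With these numbers method A looks perfect and method B looks like chance at the lower threshold, and the order reverses at the higher one, even though neither method's scores changed. Only the labeling rule moved.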

Original abstract

Uncertainty Quantification (UQ) is widely regarded as the primary safeguard for deploying Large Language Models (LLMs) in high-stakes domains. However, we argue that the field suffers from a category error: mainstream UQ methods for LLMs are just unsupervised clustering algorithms. We demonstrate that most current approaches inherently quantify the internal consistency of the model's generations rather than their external correctness. Consequently, current methods are fundamentally blind to factual reality and fail to detect “confident hallucinations,” where models exhibit high confidence in stable but incorrect answers. Therefore, the current UQ methods may create a deceptive sense of safety when deploying the models with uncertainty. In detail, we identify three critical pathologies resulting from this dependence on internal state: a hyperparameter sensitivity crisis that renders deployment unsafe, an internal evaluation cycle that conflates stability with truth, and a fundamental lack of ground truth that forces reliance on unstable proxy metrics to evaluate uncertainty. To resolve this impasse, we advocate for a paradigm shift to UQ and outline a roadmap for the research community to adopt better evaluation metrics and settings, implement mechanism changes for native uncertainty, and anchor verification in objective truth, ensuring that model confidence serves as a reliable proxy for reality.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript is a position paper claiming that mainstream uncertainty quantification (UQ) methods for LLMs are equivalent to unsupervised clustering algorithms. These methods, the authors argue, measure only the internal consistency of the model's generations rather than their external factual correctness, rendering them unable to detect confident hallucinations. The paper identifies three resulting pathologies: a hyperparameter sensitivity crisis, an internal evaluation cycle that conflates stability with truth, and a fundamental lack of ground truth that forces reliance on unstable proxy metrics. It concludes with a call for a paradigm shift involving improved evaluation metrics, mechanism changes for native uncertainty, and anchoring verification in objective truth.

Significance. If the central analogy is substantiated, the paper would offer a useful conceptual reframing that could redirect UQ research away from internal-consistency proxies toward methods with stronger external grounding. The explicit listing of three pathologies and the proposed roadmap provide concrete targets for future work on trustworthy LLM deployment. As a position piece without new derivations or experiments, its influence would depend on how well the clustering equivalence is demonstrated and how it engages existing evaluation practices.

major comments (2)
  1. [Abstract] Abstract: The claim that current UQ methods 'inherently quantify the internal consistency of the model's generations rather than their external correctness' is presented as a category error without concrete mappings from specific techniques (e.g., semantic entropy or self-consistency sampling) to unsupervised clustering algorithms or counter-examples showing where the analogy breaks. This interpretive step is load-bearing for all three pathologies.
  2. [Abstract] Abstract (paragraph beginning 'Consequently, current methods are fundamentally blind...'): The assertion that methods are 'fundamentally blind to factual reality' does not address standard UQ evaluation protocols that test uncertainty scores against ground-truth correctness labels via AUROC or correlation on labeled datasets such as TriviaQA or Natural Questions. Engaging these empirical checks is required to support the claim that internal consistency cannot proxy external correctness under any deployment condition.
minor comments (2)
  1. [Abstract] The abstract would be strengthened by naming one or two concrete UQ methods and briefly indicating how each reduces to a clustering operation.
  2. Clarify the distinction between 'hyperparameter sensitivity crisis' and ordinary sensitivity by providing a short illustrative example of how a small hyperparameter change alters UQ behavior in a way that affects deployment safety.
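
One way the requested illustration could look, as a hedged sketch rather than anything taken from the manuscript: samples are grouped greedily by token-overlap (Jaccard) similarity, which is an assumed stand-in for the entailment or embedding similarity a real method would use. The sentences and both threshold values are invented.

```python
import math

def jaccard(a, b):
    """Token-set overlap between two answers (assumed similarity measure)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb)

def greedy_clusters(samples, threshold):
    """Assign each sample to the first cluster whose first member is similar enough."""
    clusters = []
    for s in samples:
        for c in clusters:
            if jaccard(s, c[0]) >= threshold:
                c.append(s)
                break
        else:
            clusters.append([s])
    return clusters

def entropy(clusters):
    """Entropy of the cluster-size distribution."""
    n = sum(len(c) for c in clusters)
    return sum(-(len(c) / n) * math.log(len(c) / n) for c in clusters)

samples = [
    "the capital is paris",
    "paris is the capital",
    "the capital is lyon",
    "lyon is the capital",
    "the capital city is paris",
]

for threshold in (0.65, 0.55):
    clusters = greedy_clusters(samples, threshold)
    print(threshold, len(clusters), round(entropy(clusters), 3))
```

Lowering the threshold from 0.65 to 0.55 merges the contradictory "paris" and "lyon" answers into a single cluster, so the entropy drops to zero and the same set of generations is reported as fully certain.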

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback on our position paper. We have carefully considered each major comment and revised the manuscript to provide greater clarity and engagement with existing practices while preserving the core argument.

Point-by-point responses
  1. Referee: [Abstract] Abstract: The claim that current UQ methods 'inherently quantify the internal consistency of the model's generations rather than their external correctness' is presented as a category error without concrete mappings from specific techniques (e.g., semantic entropy or self-consistency sampling) to unsupervised clustering algorithms or counter-examples showing where the analogy breaks. This interpretive step is load-bearing for all three pathologies.

    Authors: We agree that the central analogy requires explicit substantiation through concrete mappings. In the revised manuscript we have expanded the abstract and added a new subsection (2.1) that directly maps representative methods to clustering operations. Semantic entropy clusters sentence embeddings of sampled generations and derives uncertainty from the entropy of the resulting cluster distribution; self-consistency sampling identifies the size of the largest agreeing cluster in answer space. We also include brief counter-examples (e.g., retrieval-augmented or externally calibrated methods) where the pure internal-clustering characterization weakens. These additions make the interpretive step explicit and thereby reinforce the three pathologies. revision: yes

  2. Referee: [Abstract] Abstract (paragraph beginning 'Consequently, current methods are fundamentally blind...'): The assertion that methods are 'fundamentally blind to factual reality' does not address standard UQ evaluation protocols that test uncertainty scores against ground-truth correctness labels via AUROC or correlation on labeled datasets such as TriviaQA or Natural Questions. Engaging these empirical checks is required to support the claim that internal consistency cannot proxy external correctness under any deployment condition.

    Authors: We acknowledge the importance of engaging standard evaluation protocols. The revised manuscript now includes a dedicated paragraph in the abstract and an expanded discussion section that directly addresses AUROC-style evaluations on TriviaQA and Natural Questions. We argue that observed correlations remain post-hoc and do not alter the fact that, at inference time, the uncertainty signal is computed exclusively from internal generation statistics without access to external ground truth. Consequently, these proxies cannot reliably flag confident hallucinations in deployment settings where labels are unavailable. We retain the stronger claim while clarifying its scope relative to existing empirical checks. revision: yes
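
The distinction the authors draw here can be sketched in a few lines of Python. The numbers are invented, and the pairwise AUROC below is a plain reference implementation rather than the evaluation code of any cited benchmark. The uncertainty scores are fixed before any labels are seen; the labels enter only in the after-the-fact evaluation.

```python
def auroc(uncertainty, is_correct):
    """Post-hoc metric: probability that a random incorrect item has higher
    uncertainty than a random correct one (ties count as half)."""
    wrong = [u for u, c in zip(uncertainty, is_correct) if not c]
    right = [u for u, c in zip(uncertainty, is_correct) if c]
    wins = sum((w > r) + 0.5 * (w == r) for w in wrong for r in right)
    return wins / (len(wrong) * len(right))

# Toy scores (assumption), computed at inference time without labels.
uncertainty = [0.05, 0.2, 0.3, 0.7, 0.9]
# Ground-truth labels, available only on a labeled evaluation set.
is_correct = [False, True, True, False, False]

score = auroc(uncertainty, is_correct)
most_confident = min(range(len(uncertainty)), key=uncertainty.__getitem__)
print(round(score, 3), is_correct[most_confident])
```

In this toy case the aggregate AUROC is above chance, yet the single most confident item is wrong. An aggregate correlation measured on labeled data therefore does not by itself show that a given low-uncertainty answer is correct once labels are unavailable.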

Circularity Check

0 steps flagged

No significant circularity in the position paper's conceptual argument

full rationale

The paper advances a position that mainstream UQ methods amount to unsupervised clustering by virtue of measuring internal generation consistency rather than external factual correctness. This reclassification is supported by identifying three resulting pathologies (hyperparameter sensitivity, internal evaluation cycles, and lack of ground truth) but does not rely on any equations, fitted parameters, or self-citation chains that reduce the central claim to its own inputs by construction. No load-bearing step equates a derived quantity to a prior definition or renames a known result via ansatz; the argument is interpretive and stands on analysis of existing literature without forcing the conclusion through definitional closure. The derivation chain is therefore self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The position rests on the domain assumption that uncertainty quantification must ultimately be validated against external objective truth rather than model-internal statistics; no free parameters or invented entities are introduced.

axioms (1)
  • domain assumption: Uncertainty quantification for LLMs must be anchored in external factual correctness rather than internal generation consistency.
    Invoked in the abstract when stating that current methods are 'fundamentally blind to factual reality' and when advocating 'anchor verification in objective truth'.

pith-pipeline@v0.9.0 · 5748 in / 1283 out tokens · 43909 ms · 2026-05-20T06:49:51.109473+00:00 · methodology
