Position: Uncertainty Quantification in LLMs is Just Unsupervised Clustering
Pith reviewed 2026-05-20 06:49 UTC · model grok-4.3
The pith
Mainstream uncertainty quantification for LLMs measures only internal consistency of generations and misses confident factual errors.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Mainstream UQ methods for LLMs are just unsupervised clustering algorithms that quantify the internal consistency of the model's generations rather than their external correctness and therefore fail to detect confident hallucinations.
What carries the argument
The reframing of UQ techniques as unsupervised clustering of model generations that captures only internal consistency rather than factual accuracy.
If this is right
- UQ methods exhibit hyperparameter sensitivity that makes safe deployment difficult.
- Evaluation loops equate output stability with truth and therefore cannot validate correctness.
- Absence of ground truth forces reliance on unstable proxy metrics for assessing uncertainty quality.
Where Pith is reading between the lines
- Future UQ systems would need explicit anchoring to external verification sources such as retrieved facts or executable checks.
- Domains with clear objective answers, such as arithmetic or database queries, could serve as minimal test beds to expose the gap between consistency and correctness.
- Native model mechanisms that expose calibration to real-world outcomes rather than token-level agreement might replace clustering-based proxies.
Load-bearing premise
Internal consistency of a model's generations cannot serve as a useful proxy for external factual correctness under any practical deployment condition.
What would settle it
A controlled test on a factual benchmark with known ground truth where an internal-consistency UQ score remains high for generations that are verifiably incorrect.
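A minimal sketch of such a test, assuming exact-match agreement among sampled answers as the internal-consistency score. The function name `consistency_confidence` and the hard-coded samples are hypothetical and not taken from the paper; the samples stand in for repeated generations from a model that stably prefers a wrong answer.

```python
from collections import Counter


def consistency_confidence(samples):
    """Return (majority answer, fraction of samples that agree with it).

    Agreement is exact match after trimming and lower-casing, which is an
    assumed stand-in for whatever equivalence check a real method uses.
    """
    counts = Counter(s.strip().lower() for s in samples)
    answer, votes = counts.most_common(1)[0]
    return answer, votes / len(samples)


# Hypothetical generations for "What is the capital of Australia?"
samples = ["Sydney", "sydney", "Sydney ", "Sydney", "Canberra"]
ground_truth = "canberra"  # external label, never seen by the score above

answer, confidence = consistency_confidence(samples)
is_correct = answer == ground_truth
print(answer, confidence, is_correct)
```

The score is computed from the samples alone, so a stable wrong answer receives high confidence; only the comparison against `ground_truth` reveals the error.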
Original abstract
Uncertainty Quantification (UQ) is widely regarded as the primary safeguard for deploying Large Language Models (LLMs) in high-stakes domains. However, we argue that the field suffers from a category error: mainstream UQ methods for LLMs are just unsupervised clustering algorithms. We demonstrate that most current approaches inherently quantify the internal consistency of the model's generations rather than their external correctness. Consequently, current methods are fundamentally blind to factual reality and fail to detect "confident hallucinations," where models exhibit high confidence in stable but incorrect answers. Therefore, the current UQ methods may create a deceptive sense of safety when deploying the models with uncertainty. In detail, we identify three critical pathologies resulting from this dependence on internal state: a hyperparameter sensitivity crisis that renders deployment unsafe, an internal evaluation cycle that conflates stability with truth, and a fundamental lack of ground truth that forces reliance on unstable proxy metrics to evaluate uncertainty. To resolve this impasse, we advocate for a paradigm shift to UQ and outline a roadmap for the research community to adopt better evaluation metrics and settings, implement mechanism changes for native uncertainty, and anchor verification in objective truth, ensuring that model confidence serves as a reliable proxy for reality.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript is a position paper claiming that mainstream uncertainty quantification (UQ) methods for LLMs are equivalent to unsupervised clustering algorithms. These methods, the authors argue, measure only the internal consistency of the model's generations rather than their external factual correctness, rendering them unable to detect confident hallucinations. The paper identifies three resulting pathologies: a hyperparameter sensitivity crisis, an internal evaluation cycle that conflates stability with truth, and a fundamental lack of ground truth that forces reliance on unstable proxy metrics. It concludes with a call for a paradigm shift involving improved evaluation metrics, mechanism changes for native uncertainty, and anchoring verification in objective truth.
Significance. If the central analogy is substantiated, the paper would offer a useful conceptual reframing that could redirect UQ research away from internal-consistency proxies toward methods with stronger external grounding. The explicit listing of three pathologies and the proposed roadmap provide concrete targets for future work on trustworthy LLM deployment. As a position piece without new derivations or experiments, its influence would depend on how well the clustering equivalence is demonstrated and how it engages existing evaluation practices.
major comments (2)
- [Abstract] Abstract: The claim that current UQ methods 'inherently quantify the internal consistency of the model's generations rather than their external correctness' is presented as a category error without concrete mappings from specific techniques (e.g., semantic entropy or self-consistency sampling) to unsupervised clustering algorithms or counter-examples showing where the analogy breaks. This interpretive step is load-bearing for all three pathologies.
- [Abstract] Abstract (paragraph beginning 'Consequently, current methods are fundamentally blind...'): The assertion that methods are 'fundamentally blind to factual reality' does not address standard UQ evaluation protocols that test uncertainty scores against ground-truth correctness labels via AUROC or correlation on labeled datasets such as TriviaQA or Natural Questions. Engaging these empirical checks is required to support the claim that internal consistency cannot proxy external correctness under any deployment condition.
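The evaluation protocol referred to in the second major comment can be sketched as follows: each answer gets an uncertainty score, is labelled correct or incorrect against ground truth, and the AUROC measures how well uncertainty ranks incorrect answers above correct ones. The function name `auroc` and the toy scores are illustrative assumptions; no data from the paper or from TriviaQA or Natural Questions is used.

```python
def auroc(uncertainty, is_wrong):
    """AUROC of uncertainty as a detector of wrong answers.

    Equals the probability that a randomly chosen wrong answer has higher
    uncertainty than a randomly chosen correct one (ties count as half).
    """
    wrong = [u for u, w in zip(uncertainty, is_wrong) if w]
    right = [u for u, w in zip(uncertainty, is_wrong) if not w]
    if not wrong or not right:
        raise ValueError("need at least one correct and one incorrect answer")
    wins = sum((p > n) + 0.5 * (p == n) for p in wrong for n in right)
    return wins / (len(wrong) * len(right))


# Toy values: the last answer is wrong but has very low uncertainty,
# i.e. a confident hallucination.
uncertainty = [0.9, 0.7, 0.1, 0.2, 0.05]
is_wrong = [True, True, False, False, True]

score = auroc(uncertainty, is_wrong)
print(round(score, 3))
```

The metric needs `is_wrong`, so it can only be computed on labelled data; the single confident hallucination in the toy values is what keeps the score below 1.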
minor comments (2)
- [Abstract] The abstract would be strengthened by naming one or two concrete UQ methods and briefly indicating how each reduces to a clustering operation.
- Clarify the distinction between 'hyperparameter sensitivity crisis' and ordinary sensitivity by providing a short illustrative example of how a small hyperparameter change alters UQ behavior in a way that affects deployment safety.
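One way to picture the kind of example requested in the last minor comment is a clustering-based score whose value depends on a similarity threshold. The sketch below is an assumption-laden illustration, not the paper's method: it uses token-level Jaccard similarity, greedy clustering against each cluster's first member, and entropy over cluster sizes. All function names and the sample answers are hypothetical.

```python
import math


def jaccard(a, b):
    """Token-set Jaccard similarity of two strings."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not (ta | tb):
        return 1.0  # two empty strings are treated as identical
    return len(ta & tb) / len(ta | tb)


def cluster(answers, threshold):
    """Greedy clustering: join the first cluster whose representative
    (its first member) is at least `threshold` similar."""
    clusters = []
    for ans in answers:
        for c in clusters:
            if jaccard(ans, c[0]) >= threshold:
                c.append(ans)
                break
        else:
            clusters.append([ans])
    return clusters


def cluster_entropy(clusters):
    """Shannon entropy (nats) of the cluster-size distribution."""
    n = sum(len(c) for c in clusters)
    return -sum(len(c) / n * math.log(len(c) / n) for c in clusters)


answers = ["paris", "paris france", "paris", "lyon"]
loose = cluster(answers, threshold=0.4)
strict = cluster(answers, threshold=0.6)
print(len(loose), cluster_entropy(loose))
print(len(strict), cluster_entropy(strict))
```

With these toy answers, moving the threshold from 0.4 to 0.6 splits "paris france" into its own cluster and raises the entropy, even though the generations themselves are unchanged. Whether a shift of this size matters for deployment depends on the abstention threshold applied downstream.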
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed feedback on our position paper. We have carefully considered each major comment and revised the manuscript to provide greater clarity and engagement with existing practices while preserving the core argument.
Point-by-point responses
- Referee: [Abstract] Abstract: The claim that current UQ methods 'inherently quantify the internal consistency of the model's generations rather than their external correctness' is presented as a category error without concrete mappings from specific techniques (e.g., semantic entropy or self-consistency sampling) to unsupervised clustering algorithms or counter-examples showing where the analogy breaks. This interpretive step is load-bearing for all three pathologies.
  Authors: We agree that the central analogy requires explicit substantiation through concrete mappings. In the revised manuscript we have expanded the abstract and added a new subsection (2.1) that directly maps representative methods to clustering operations. Semantic entropy groups sampled generations into clusters of equivalent meaning (via bidirectional entailment in its original formulation) and derives uncertainty from the entropy of the resulting cluster distribution; self-consistency sampling identifies the size of the largest agreeing cluster in answer space. We also include brief counter-examples (e.g., retrieval-augmented or externally calibrated methods) where the pure internal-clustering characterization weakens. These additions make the interpretive step explicit and thereby reinforce the three pathologies.
  Revision: yes
- Referee: [Abstract] Abstract (paragraph beginning 'Consequently, current methods are fundamentally blind...'): The assertion that methods are 'fundamentally blind to factual reality' does not address standard UQ evaluation protocols that test uncertainty scores against ground-truth correctness labels via AUROC or correlation on labeled datasets such as TriviaQA or Natural Questions. Engaging these empirical checks is required to support the claim that internal consistency cannot proxy external correctness under any deployment condition.
  Authors: We acknowledge the importance of engaging standard evaluation protocols. The revised manuscript now includes a dedicated paragraph in the abstract and an expanded discussion section that directly addresses AUROC-style evaluations on TriviaQA and Natural Questions. We argue that observed correlations remain post-hoc and do not alter the fact that, at inference time, the uncertainty signal is computed exclusively from internal generation statistics without access to external ground truth. Consequently, these proxies cannot reliably flag confident hallucinations in deployment settings where labels are unavailable. We retain the stronger claim while clarifying its scope relative to existing empirical checks.
  Revision: yes
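The mapping described in the responses above (cluster the sampled generations, then read uncertainty off the cluster distribution) can be sketched as below. This is an illustration under stated assumptions, not the authors' code: the equivalence check is plain normalised string equality, standing in for whatever semantic check a real method uses, and all function names and samples are hypothetical.

```python
import math


def cluster_by_equivalence(samples, equivalent):
    """Group samples; each joins the first cluster whose first member
    it is judged equivalent to."""
    clusters = []
    for s in samples:
        for c in clusters:
            if equivalent(s, c[0]):
                c.append(s)
                break
        else:
            clusters.append([s])
    return clusters


def entropy_over_clusters(clusters):
    """Entropy (nats) of the cluster-size distribution: a
    semantic-entropy-style uncertainty."""
    n = sum(len(c) for c in clusters)
    return -sum(len(c) / n * math.log(len(c) / n) for c in clusters)


def largest_cluster_fraction(clusters):
    """Share of samples in the biggest cluster: a
    self-consistency-style confidence."""
    n = sum(len(c) for c in clusters)
    return max(len(c) for c in clusters) / n


def same_normalised_string(a, b):
    # Assumed stand-in for a semantic equivalence check.
    return a.strip().lower() == b.strip().lower()


samples = ["1912", "1912", " 1912", "1911"]  # hypothetical generations
clusters = cluster_by_equivalence(samples, same_normalised_string)
uncertainty = entropy_over_clusters(clusters)
agreement = largest_cluster_fraction(clusters)
print(len(clusters), uncertainty, agreement)
```

Neither score takes a reference answer as input, which is the sense in which the signal is computed only from the generations themselves.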
Circularity Check
No significant circularity in the position paper's conceptual argument
The paper advances a position that mainstream UQ methods amount to unsupervised clustering by virtue of measuring internal generation consistency rather than external factual correctness. This reclassification is supported by identifying three resulting pathologies (hyperparameter sensitivity, internal evaluation cycles, and lack of ground truth) but does not rely on any equations, fitted parameters, or self-citation chains that reduce the central claim to its own inputs by construction. No load-bearing step equates a derived quantity to a prior definition or renames a known result via ansatz; the argument is interpretive and stands on analysis of existing literature without forcing the conclusion through definitional closure. The derivation chain is therefore self-contained.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Uncertainty quantification for LLMs must be anchored in external factual correctness rather than internal generation consistency.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean, washburn_uniqueness_aczel (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Passage: mainstream UQ methods for LLMs are just unsupervised clustering algorithms that quantify the internal consistency of the model's generations rather than their external correctness
- IndisputableMonolith/Foundation/RealityFromDistinction.lean, reality_from_one_distinction (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Passage: three critical pathologies: hyperparameter sensitivity crisis, internal evaluation cycle, fundamental lack of ground truth
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.