Position: Uncertainty Quantification in LLMs is Just Unsupervised Clustering

Hua Wei; Longchao Da; Tiejin Chen; Xiaoou Liu

arxiv: 2605.19220 · v1 · pith:VWKEUBJGnew · submitted 2026-05-19 · 💻 cs.CL · cs.AI · cs.LG

Tiejin Chen, Longchao Da, Xiaoou Liu, Hua Wei

Pith reviewed 2026-05-20 06:49 UTC · model grok-4.3

classification 💻 cs.CL · cs.AI · cs.LG

keywords uncertainty quantification · large language models · hallucinations · internal consistency · unsupervised clustering · factual correctness · confident errors

The pith

Mainstream uncertainty quantification for LLMs measures only internal consistency of generations and misses confident factual errors.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that typical methods for estimating uncertainty in large language models function as unsupervised clustering routines. These routines group similar model outputs to assess how stable the generations are, without reference to any external standard of correctness. As a result the methods cannot flag cases in which a model produces a consistently wrong answer with high apparent confidence. This limitation leaves deployments vulnerable to undetected hallucinations and creates an unreliable sense of safety around model outputs.

Core claim

Mainstream UQ methods for LLMs are just unsupervised clustering algorithms that quantify the internal consistency of the model's generations rather than their external correctness and therefore fail to detect confident hallucinations.
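
The claim can be illustrated with a minimal sketch, assuming the sampled generations have already been grouped into answer clusters by some upstream step. The function name `cluster_entropy` and the toy cluster labels are hypothetical and are not taken from the paper. The score is computed from cluster frequencies alone, so a stable wrong answer and a stable right answer receive the same value.

```python
import math
from collections import Counter

def cluster_entropy(cluster_ids):
    """Shannon entropy (in nats) of the empirical distribution over answer clusters."""
    counts = Counter(cluster_ids)
    n = len(cluster_ids)
    # p * log(1/p) summed over clusters; uses only frequencies, never a gold label.
    return sum((c / n) * math.log(n / c) for c in counts.values())

# Hypothetical cluster assignments for 10 sampled generations per question.
stable_and_correct = ["paris"] * 10
stable_but_wrong = ["lyon"] * 10  # a confident hallucination
scattered = ["paris", "lyon", "nice", "paris", "lille",
             "lyon", "nice", "paris", "lille", "lyon"]

h_correct = cluster_entropy(stable_and_correct)
h_wrong = cluster_entropy(stable_but_wrong)
h_scattered = cluster_entropy(scattered)

# The two stable cases are indistinguishable to the score.
print(h_correct, h_wrong, h_scattered)
```

Only the scattered case is flagged as uncertain; nothing in the computation refers to which answer is true.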

What carries the argument

The reframing of UQ techniques as unsupervised clustering of model generations that captures only internal consistency rather than factual accuracy.

If this is right

UQ methods exhibit hyperparameter sensitivity that makes safe deployment difficult.
Evaluation loops equate output stability with truth and therefore cannot validate correctness.
Absence of ground truth forces reliance on unstable proxy metrics for assessing uncertainty quality.
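
The third consequence can be sketched with a toy calculation. The example assumes correctness is obtained by thresholding a similarity-to-reference score at a value `tau`, and that two hypothetical UQ methods (`conf_a`, `conf_b`) assign confidence scores to the same six responses. All numbers are invented for illustration and do not come from the paper.

```python
def auroc(scores, labels):
    """Probability that a random positive outscores a random negative (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Invented similarity-to-reference scores for six responses.
similarity = [0.9, 0.8, 0.6, 0.5, 0.2, 0.1]
# Invented confidence scores from two hypothetical UQ methods.
conf_a = [0.6, 0.5, 0.9, 0.8, 0.2, 0.1]
conf_b = [0.9, 0.8, 0.2, 0.1, 0.6, 0.5]

results = {}
for tau in (0.3, 0.7):
    # A response counts as "correct" when its similarity reaches the threshold.
    labels = [1 if s >= tau else 0 for s in similarity]
    results[tau] = (auroc(conf_a, labels), auroc(conf_b, labels))

for tau, (a, b) in results.items():
    print(f"tau={tau}: method A AUROC={a:.2f}, method B AUROC={b:.2f}")
```

With these toy numbers the ranking of the two methods flips between the two thresholds, even though neither method nor any response has changed.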

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Future UQ systems would need explicit anchoring to external verification sources such as retrieved facts or executable checks.
Domains with clear objective answers, such as arithmetic or database queries, could serve as minimal test beds to expose the gap between consistency and correctness.
Native model mechanisms that expose calibration to real-world outcomes rather than token-level agreement might replace clustering-based proxies.

Load-bearing premise

Internal consistency of a model's generations cannot serve as a useful proxy for external factual correctness under any practical deployment condition.

What would settle it

A controlled test on a factual benchmark with known ground truth where an internal-consistency UQ score remains high for generations that are verifiably incorrect.
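
One way such a test could be organized is sketched below, assuming each benchmark item comes with a gold answer and a set of sampled answers that have already been normalized to comparable strings. The helper names, the 0.8 agreement cutoff, and the toy items are assumptions made for illustration; majority agreement stands in for whichever internal-consistency score is under test.

```python
from collections import Counter

def majority_agreement(answers):
    """Return (majority answer, fraction of samples that agree with it)."""
    answer, count = Counter(answers).most_common(1)[0]
    return answer, count / len(answers)

def confident_errors(items, min_agreement=0.8):
    """Items whose samples agree strongly on an answer that differs from the gold label."""
    flagged = []
    for question, samples, gold in items:
        answer, agreement = majority_agreement(samples)
        if agreement >= min_agreement and answer != gold:
            flagged.append((question, answer, agreement))
    return flagged

# Toy items: (question id, sampled answers, gold answer).
items = [
    ("q1", ["canberra"] * 10, "canberra"),                 # consistent and correct
    ("q2", ["sydney"] * 9 + ["canberra"], "canberra"),     # consistent but wrong
    ("q3", ["a", "b", "c", "a", "d", "e", "b", "f", "g", "h"], "a"),  # inconsistent
]

flagged = confident_errors(items)
print(flagged)
```

Any item returned by `confident_errors` is a case where the consistency score is high while the answer is verifiably incorrect, which is the pattern the test is looking for.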

Figures

Figures reproduced from arXiv: 2605.19220 by Hua Wei, Longchao Da, Tiejin Chen, Xiaoou Liu.

**Figure 1.** The common UQ methods for LLM and its representative work (name with *) for inductive discussions in Section 2.

**Figure 2.** PCA visualization of Qwen2.5-32b-Instruct hidden states during P(true) estimation on the QASC dataset. The visualization demonstrates that the model’s internal states during P(true) are geometrically partitioned into distinct belief clusters, empirically validating that P(true) functions as an implicit clustering.

**Figure 3.** The effect of correctness threshold τ on UQ method evaluation consistency. As the threshold varies, method rankings become unstable. Figure adapted from Liu et al. (2025b).

Original abstract

Uncertainty Quantification (UQ) is widely regarded as the primary safeguard for deploying Large Language Models (LLMs) in high-stakes domains. However, we argue that the field suffers from a category error: mainstream UQ methods for LLMs are just unsupervised clustering algorithms. We demonstrate that most current approaches inherently quantify the internal consistency of the model's generations rather than their external correctness. Consequently, current methods are fundamentally blind to factual reality and fail to detect "confident hallucinations," where models exhibit high confidence in stable but incorrect answers. Therefore, the current UQ methods may create a deceptive sense of safety when deploying the models with uncertainty. In detail, we identify three critical pathologies resulting from this dependence on internal state: a hyperparameter sensitivity crisis that renders deployment unsafe, an internal evaluation cycle that conflates stability with truth, and a fundamental lack of ground truth that forces reliance on unstable proxy metrics to evaluate uncertainty. To resolve this impasse, we advocate for a paradigm shift to UQ and outline a roadmap for the research community to adopt better evaluation metrics and settings, implement mechanism changes for native uncertainty, and anchor verification in objective truth, ensuring that model confidence serves as a reliable proxy for reality.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript is a position paper claiming that mainstream uncertainty quantification (UQ) methods for LLMs are equivalent to unsupervised clustering algorithms. These methods, the authors argue, measure only the internal consistency of the model's generations rather than their external factual correctness, rendering them unable to detect confident hallucinations. The paper identifies three resulting pathologies: a hyperparameter sensitivity crisis, an internal evaluation cycle that conflates stability with truth, and a fundamental lack of ground truth that forces reliance on unstable proxy metrics. It concludes with a call for a paradigm shift involving improved evaluation metrics, mechanism changes for native uncertainty, and anchoring verification in objective truth.

Significance. If the central analogy is substantiated, the paper would offer a useful conceptual reframing that could redirect UQ research away from internal-consistency proxies toward methods with stronger external grounding. The explicit listing of three pathologies and the proposed roadmap provide concrete targets for future work on trustworthy LLM deployment. As a position piece without new derivations or experiments, its influence would depend on how well the clustering equivalence is demonstrated and how it engages existing evaluation practices.

major comments (2)

[Abstract] Abstract: The claim that current UQ methods 'inherently quantify the internal consistency of the model's generations rather than their external correctness' is presented as a category error without concrete mappings from specific techniques (e.g., semantic entropy or self-consistency sampling) to unsupervised clustering algorithms or counter-examples showing where the analogy breaks. This interpretive step is load-bearing for all three pathologies.
[Abstract] Abstract (paragraph beginning 'Consequently, current methods are fundamentally blind...'): The assertion that methods are 'fundamentally blind to factual reality' does not address standard UQ evaluation protocols that test uncertainty scores against ground-truth correctness labels via AUROC or correlation on labeled datasets such as TriviaQA or Natural Questions. Engaging these empirical checks is required to support the claim that internal consistency cannot proxy external correctness under any deployment condition.

minor comments (2)

[Abstract] The abstract would be strengthened by naming one or two concrete UQ methods and briefly indicating how each reduces to a clustering operation.
Clarify the distinction between 'hyperparameter sensitivity crisis' and ordinary sensitivity by providing a short illustrative example of how a small hyperparameter change alters UQ behavior in a way that affects deployment safety.
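
The kind of illustrative example requested in the second minor comment might look like the following sketch. It assumes a greedy clustering step driven by a single similarity threshold, with word-level Jaccard overlap used as a deliberately crude stand-in for whatever semantic similarity a real method would use. The function names and sample answers are hypothetical.

```python
def jaccard(a, b):
    """Word-level Jaccard similarity between two strings."""
    ta, tb = set(a.split()), set(b.split())
    return len(ta & tb) / len(ta | tb)

def greedy_clusters(answers, threshold):
    """Assign each answer to the first cluster whose representative is similar enough."""
    representatives = []  # first member of each cluster
    labels = []
    for ans in answers:
        for idx, rep in enumerate(representatives):
            if jaccard(ans, rep) >= threshold:
                labels.append(idx)
                break
        else:
            # No existing cluster is close enough, so open a new one.
            representatives.append(ans)
            labels.append(len(representatives) - 1)
    return labels

answers = [
    "paris",
    "the city of paris",
    "paris france",
    "lyon",
    "the city of lyon",
]

for threshold in (0.2, 0.4, 0.7):
    labels = greedy_clusters(answers, threshold)
    print(threshold, labels, "clusters:", len(set(labels)))
```

On this toy input the same five answers fall into two, three, or five clusters depending on the threshold, and at 0.4 "the city of lyon" is grouped with "the city of paris". Any uncertainty score derived from the cluster distribution inherits that dependence on a setting chosen without reference to correctness.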

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback on our position paper. We have carefully considered each major comment and revised the manuscript to provide greater clarity and engagement with existing practices while preserving the core argument.

Referee: [Abstract] Abstract: The claim that current UQ methods 'inherently quantify the internal consistency of the model's generations rather than their external correctness' is presented as a category error without concrete mappings from specific techniques (e.g., semantic entropy or self-consistency sampling) to unsupervised clustering algorithms or counter-examples showing where the analogy breaks. This interpretive step is load-bearing for all three pathologies.

Authors: We agree that the central analogy requires explicit substantiation through concrete mappings. In the revised manuscript we have expanded the abstract and added a new subsection (2.1) that directly maps representative methods to clustering operations. Semantic entropy groups sampled generations into clusters of semantically equivalent answers and derives uncertainty from the entropy of the resulting cluster distribution; self-consistency sampling identifies the size of the largest agreeing cluster in answer space. We also include brief counter-examples (e.g., retrieval-augmented or externally calibrated methods) where the pure internal-clustering characterization weakens. These additions make the interpretive step explicit and thereby reinforce the three pathologies. revision: yes

Referee: [Abstract] Abstract (paragraph beginning 'Consequently, current methods are fundamentally blind...'): The assertion that methods are 'fundamentally blind to factual reality' does not address standard UQ evaluation protocols that test uncertainty scores against ground-truth correctness labels via AUROC or correlation on labeled datasets such as TriviaQA or Natural Questions. Engaging these empirical checks is required to support the claim that internal consistency cannot proxy external correctness under any deployment condition.

Authors: We acknowledge the importance of engaging standard evaluation protocols. The revised manuscript now includes a dedicated paragraph in the abstract and an expanded discussion section that directly addresses AUROC-style evaluations on TriviaQA and Natural Questions. We argue that observed correlations remain post-hoc and do not alter the fact that, at inference time, the uncertainty signal is computed exclusively from internal generation statistics without access to external ground truth. Consequently, these proxies cannot reliably flag confident hallucinations in deployment settings where labels are unavailable. We retain the stronger claim while clarifying its scope relative to existing empirical checks. revision: yes

Circularity Check

0 steps flagged

No significant circularity in the position paper's conceptual argument

The paper advances a position that mainstream UQ methods amount to unsupervised clustering by virtue of measuring internal generation consistency rather than external factual correctness. This reclassification is supported by identifying three resulting pathologies (hyperparameter sensitivity, internal evaluation cycles, and lack of ground truth) but does not rely on any equations, fitted parameters, or self-citation chains that reduce the central claim to its own inputs by construction. No load-bearing step equates a derived quantity to a prior definition or renames a known result via ansatz; the argument is interpretive and stands on analysis of existing literature without forcing the conclusion through definitional closure. The derivation chain is therefore self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The position rests on the domain assumption that uncertainty quantification must ultimately be validated against external objective truth rather than model-internal statistics; no free parameters or invented entities are introduced.

axioms (1)

domain assumption: Uncertainty quantification for LLMs must be anchored in external factual correctness rather than internal generation consistency.
Invoked in the abstract when stating that current methods are 'fundamentally blind to factual reality' and when advocating 'anchor verification in objective truth'.

pith-pipeline@v0.9.0 · 5748 in / 1283 out tokens · 43909 ms · 2026-05-20T06:49:51.109473+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: mainstream UQ methods for LLMs are just unsupervised clustering algorithms that quantify the internal consistency of the model's generations rather than their external correctness

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
Paper passage: three critical pathologies: hyperparameter sensitivity crisis, internal evaluation cycle, fundamental lack of ground truth

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

140 extracted references · 140 canonical work pages · 14 internal anchors

[1]

arXiv preprint arXiv:2502.14268 , year=

MCQA-Eval: Efficient Confidence Evaluation in NLG with Gold-Standard Correctness Labels , author=. arXiv preprint arXiv:2502.14268 , year=

work page arXiv
[2]

Scaling Learning Algorithms Towards

Bengio, Yoshua and LeCun, Yann , booktitle =. Scaling Learning Algorithms Towards

work page
[3]

and Osindero, Simon and Teh, Yee Whye , journal =

Hinton, Geoffrey E. and Osindero, Simon and Teh, Yee Whye , journal =. A Fast Learning Algorithm for Deep Belief Nets , volume =

work page
[4]

2016 , publisher=

Deep learning , author=. 2016 , publisher=

work page 2016
[5]

Bowman, and Shi Feng

Llm evaluators recognize and favor their own generations , author=. arXiv preprint arXiv:2404.13076 , year=

work page arXiv
[6]

Uncertainty in Language Models: Assessment through Rank-Calibration

Huang, Xinmeng and Li, Shuo and Yu, Mengxin and Sesia, Matteo and Hassani, Hamed and Lee, Insup and Bastani, Osbert and Dobriban, Edgar. Uncertainty in Language Models: Assessment through Rank-Calibration. Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing. 2024. doi:10.18653/v1/2024.emnlp-main.18

work page doi:10.18653/v1/2024.emnlp-main.18 2024
[7]

arXiv preprint arXiv:2410.12831 , year=

Segment as You Wish--Free-Form Language-Based Segmentation for Medical Images , author=. arXiv preprint arXiv:2410.12831 , year=

work page arXiv
[8]

arXiv preprint arXiv:2401.00125 , year=

Llm-assist: Enhancing closed-loop planning with language-based reasoning , author=. arXiv preprint arXiv:2401.00125 , year=

work page arXiv
[9]

Advances in Neural Information Processing Systems , volume=

Toolqa: A dataset for llm question answering with external tools , author=. Advances in Neural Information Processing Systems , volume=

work page
[10]

arXiv preprint arXiv:2405.06652 , year=

Large language model (llm) ai text generation detection based on transformer deep learning algorithm , author=. arXiv preprint arXiv:2405.06652 , year=

work page arXiv
[11]

RACE: Large-scale ReAding Comprehension Dataset From Examinations

Race: Large-scale reading comprehension dataset from examinations , author=. arXiv preprint arXiv:1704.04683 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[12]

The Eleventh International Conference on Learning Representations , year=

Semantic Uncertainty: Linguistic Invariances for Uncertainty Estimation in Natural Language Generation , author=. The Eleventh International Conference on Learning Representations , year=

work page
[13]

Transactions on Machine Learning Research , issn=

Generating with Confidence: Uncertainty Quantification for Black-box Large Language Models , author=. Transactions on Machine Learning Research , issn=. 2024 , url=

work page 2024
[14]

Contextualized Sequence Likelihood: Enhanced Confidence Scores for Natural Language Generation

Lin, Zhen and Trivedi, Shubhendu and Sun, Jimeng. Contextualized Sequence Likelihood: Enhanced Confidence Scores for Natural Language Generation. Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing. 2024. doi:10.18653/v1/2024.emnlp-main.578

work page doi:10.18653/v1/2024.emnlp-main.578 2024
[15]

arXiv preprint arXiv:2410.14368 , year=

CoMAL: Collaborative Multi-Agent Large Language Models for Mixed-Autonomy Traffic , author=. arXiv preprint arXiv:2410.14368 , year=

work page arXiv
[16]

Language Models (Mostly) Know What They Know

Language models (mostly) know what they know , author=. arXiv preprint arXiv:2207.05221 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[17]

, title =

Vinh, Nguyen Xuan and Houle, Michael E. , title =. Proceedings of the 14th Pacific-Asia Conference on Advances in Knowledge Discovery and Data Mining - Volume Part I , pages =. 2010 , isbn =. doi:10.1007/978-3-642-13657-3_4 , abstract =

work page doi:10.1007/978-3-642-13657-3_4 2010
[18]

Position: Uncertainty Quantification Needs Reassessment for Large Language Model Agents , author=

work page
[19]

I Can't Believe It's Not Better: Failure Modes in the Age of Foundation Models

Self-Evaluation Improves Selective Generation in Large Language Models , author =. Proceedings on "I Can't Believe It's Not Better: Failure Modes in the Age of Foundation Models" at NeurIPS 2023 Workshops , pages =. 2023 , editor =

work page 2023
[20]

Shifting Attention to Relevance: Towards the Predictive Uncertainty Quantification of Free-Form Large Language Models

Duan, Jinhao and Cheng, Hao and Wang, Shiqi and Zavalny, Alex and Wang, Chenan and Xu, Renjing and Kailkhura, Bhavya and Xu, Kaidi. Shifting Attention to Relevance: Towards the Predictive Uncertainty Quantification of Free-Form Large Language Models. Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Pa...

work page doi:10.18653/v1/2024.acl-long.276 2024
[21]

Combining Confidence Elicitation and Sample-based Methods for Uncertainty Quantification in Misinformation Mitigation

Rivera, Mauricio and Godbout, Jean-Fran c ois and Rabbany, Reihaneh and Pelrine, Kellin. Combining Confidence Elicitation and Sample-based Methods for Uncertainty Quantification in Misinformation Mitigation. Proceedings of the 1st Workshop on Uncertainty-Aware NLP (UncertaiNLP 2024). 2024

work page 2024
[22]

S elf C heck GPT : Zero-Resource Black-Box Hallucination Detection for Generative Large Language Models

Manakul, Potsawee and Liusie, Adian and Gales, Mark. S elf C heck GPT : Zero-Resource Black-Box Hallucination Detection for Generative Large Language Models. Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing. 2023. doi:10.18653/v1/2023.emnlp-main.557

work page doi:10.18653/v1/2023.emnlp-main.557 2023
[23]

Proceedings of the 34th International Conference on Machine Learning , pages =

On Calibration of Modern Neural Networks , author =. Proceedings of the 34th International Conference on Machine Learning , pages =. 2017 , editor =

work page 2017
[24]

Proceedings of the third International Workshop on Machine Learning in Systems Biology , pages =

Accuracy-Rejection Curves (ARCs) for Comparing Classification Methods with a Reject Option , author =. Proceedings of the third International Workshop on Machine Learning in Systems Biology , pages =. 2009 , editor =

work page 2009
[25]

The Twelfth International Conference on Learning Representations , year=

Conformal Language Modeling , author=. The Twelfth International Conference on Learning Representations , year=

work page
[26]

The Thirty-eighth Annual Conference on Neural Information Processing Systems , year=

Conformal Alignment: Knowing When to Trust Foundation Models with Guarantees , author=. The Thirty-eighth Annual Conference on Neural Information Processing Systems , year=

work page
[27]

The Thirty-eighth Annual Conference on Neural Information Processing Systems , year=

Large language model validity via enhanced conformal prediction methods , author=. The Thirty-eighth Annual Conference on Neural Information Processing Systems , year=

work page
[28]

Proceedings of the 41st International Conference on Machine Learning , articleno =

Mohri, Christopher and Hashimoto, Tatsunori , title =. Proceedings of the 41st International Conference on Machine Learning , articleno =. 2024 , publisher =

work page 2024
[29]

The Thirty-eighth Annual Conference on Neural Information Processing Systems , year=

Selective Generation for Controllable Language Models , author=. The Thirty-eighth Annual Conference on Neural Information Processing Systems , year=

work page
[30]

arXiv preprint arXiv:2405.01563 , year=

Mitigating LLM Hallucinations via Conformal Abstention , author=. arXiv preprint arXiv:2405.01563 , year=

work page arXiv
[31]

Selectively Answering Ambiguous Questions

Cole, Jeremy and Zhang, Michael and Gillick, Daniel and Eisenschlos, Julian and Dhingra, Bhuwan and Eisenstein, Jacob. Selectively Answering Ambiguous Questions. Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing. 2023. doi:10.18653/v1/2023.emnlp-main.35

work page doi:10.18653/v1/2023.emnlp-main.35 2023
[32]

Advances in neural information processing systems , volume=

Selective classification for deep neural networks , author=. Advances in neural information processing systems , volume=

work page
[33]

, author=

On the Foundations of Noise-free Selective Classification. , author=. Journal of Machine Learning Research , volume=

work page
[34]

arXiv preprint arXiv:2401.17072 , year=

SemScore: Automated Evaluation of Instruction-Tuned LLMs based on Semantic Textual Similarity , author=. arXiv preprint arXiv:2401.17072 , year=

work page arXiv
[35]

JudgeBench: A Benchmark for Evaluating LLM-based Judges

Judgebench: A benchmark for evaluating llm-based judges , author=. arXiv preprint arXiv:2410.12784 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[36]

arXiv preprint arXiv:2411.16594 , year=

From generation to judgment: Opportunities and challenges of llm-as-a-judge , author=. arXiv preprint arXiv:2411.16594 , year=

work page arXiv
[37]

IEEE Transactions on Neural Networks and Learning Systems , year=

Survey on large language model-enhanced reinforcement learning: Concept, taxonomy, and methods , author=. IEEE Transactions on Neural Networks and Learning Systems , year=

work page
[38]

arXiv preprint arXiv:2404.09135 , year=

Unveiling LLM Evaluation Focused on Metrics: Challenges and Solutions , author=. arXiv preprint arXiv:2404.09135 , year=

work page arXiv
[39]

19th IEEE International Conference on Tools with Artificial Intelligence (ICTAI 2007) , volume=

Conformal prediction with neural networks , author=. 19th IEEE International Conference on Tools with Artificial Intelligence (ICTAI 2007) , volume=. 2007 , organization=

work page 2007
[40]

Why We Need New Evaluation Metrics for NLG

Why we need new evaluation metrics for NLG , author=. arXiv preprint arXiv:1707.06875 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[41]

Journal of Artificial Intelligence Research , volume=

Survey of the state of the art in natural language generation: Core tasks, applications and evaluation , author=. Journal of Artificial Intelligence Research , volume=

work page
[42]

Machine Learning and Knowledge Discovery in Databases: European Conference, ECML PKDD 2013, Prague, Czech Republic, September 23-27, 2013, Proceedings, Part III 13 , pages=

Area under the precision-recall curve: point estimates and confidence intervals , author=. Machine Learning and Knowledge Discovery in Databases: European Conference, ECML PKDD 2013, Prague, Czech Republic, September 23-27, 2013, Proceedings, Part III 13 , pages=. 2013 , organization=

work page 2013
[43]

arXiv preprint arXiv:2407.00994 , year=

Llm uncertainty quantification through directional entailment graph and claim level response augmentation , author=. arXiv preprint arXiv:2407.00994 , year=

work page arXiv
[44]

arXiv preprint arXiv:2311.08298 , year=

A survey of language model confidence estimation and calibration , author=. arXiv preprint arXiv:2311.08298 , year=

work page arXiv
[45]

arXiv preprint arXiv:2206.09034 , year=

Towards better selective classification , author=. arXiv preprint arXiv:2206.09034 , year=

work page arXiv
[46]

and Szlam, Arthur and Dinan, Emily and Boureau, Y-Lan

Mielke, Sabrina J. and Szlam, Arthur and Dinan, Emily and Boureau, Y-Lan. Reducing Conversational Agents ' Overconfidence Through Linguistic Calibration. Transactions of the Association for Computational Linguistics. 2022. doi:10.1162/tacl_a_00494

work page doi:10.1162/tacl_a_00494 2022
[47]

Re-Examining Calibration: The Case of Question Answering

Si, Chenglei and Zhao, Chen and Min, Sewon and Boyd-Graber, Jordan. Re-Examining Calibration: The Case of Question Answering. Findings of the Association for Computational Linguistics: EMNLP 2022. 2022

work page 2022
[48]

Miao Xiong and Zhiyuan Hu and Xinyang Lu and YIFEI LI and Jie Fu and Junxian He and Bryan Hooi , booktitle=. Can. 2024 , url=

work page 2024
[49]

The 2023 Conference on Empirical Methods in Natural Language Processing , year=

On the Calibration of Large Language Models and Alignment , author=. The 2023 Conference on Empirical Methods in Natural Language Processing , year=

work page 2023
[50]

Uncertainty Estimation in Autoregressive Structured Prediction , author=

work page
[51]

C ommonsense QA : A Question Answering Challenge Targeting Commonsense Knowledge

Talmor, Alon and Herzig, Jonathan and Lourie, Nicholas and Berant, Jonathan. C ommonsense QA : A Question Answering Challenge Targeting Commonsense Knowledge. Proceedings of the 2019 Conference of the North A merican Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). 2019. doi:10.18653/...

work page doi:10.18653/v1/n19-1421 2019
[52]

Proceedings of the AAAI Conference on Artificial Intelligence , volume=

Qasc: A dataset for question answering via sentence composition , author=. Proceedings of the AAAI Conference on Artificial Intelligence , volume=

work page
[53]

Applied Sciences , volume=

What disease does this patient have? a large-scale open domain question answering dataset from medical exams , author=. Applied Sciences , volume=. 2021 , publisher=

work page 2021
[54]

RACE : Large-scale R e A ding Comprehension Dataset From Examinations

Lai, Guokun and Xie, Qizhe and Liu, Hanxiao and Yang, Yiming and Hovy, Eduard. RACE : Large-scale R e A ding Comprehension Dataset From Examinations. Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing. 2017. doi:10.18653/v1/D17-1082

work page doi:10.18653/v1/d17-1082 2017
[55]

Llama 2: Open Foundation and Fine-Tuned Chat Models

Llama 2: Open foundation and fine-tuned chat models , author=. arXiv preprint arXiv:2307.09288 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[56]

Phi-4 Technical Report

Phi-4 technical report , author=. arXiv preprint arXiv:2412.08905 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[57]

Qwen2.5 Technical Report

Qwen2. 5 technical report , author=. arXiv preprint arXiv:2412.15115 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[58]

Proceedings of the 29th Symposium on Operating Systems Principles , pages=

Efficient memory management for large language model serving with pagedattention , author=. Proceedings of the 29th Symposium on Operating Systems Principles , pages=

work page
[59]

International Conference on Learning Representations , year=

BERTScore: Evaluating Text Generation with BERT , author=. International Conference on Learning Representations , year=

work page
[60]

International Conference on Learning Representations , year=

DeBERTa: Decoding-enhanced BERT with Disentangled Attention , author=. International Conference on Learning Representations , year=

work page
[61]

The Eleventh International Conference on Learning Representations , year=

Out-of-Distribution Detection and Selective Generation for Conditional Language Models , author=. The Eleventh International Conference on Learning Representations , year=

work page
[62]

The Internal State of an LLM Knows When It ' s Lying

Azaria, Amos and Mitchell, Tom. The Internal State of an LLM Knows When It`s Lying. Findings of the Association for Computational Linguistics: EMNLP 2023. 2023. doi:10.18653/v1/2023.findings-emnlp.68

work page doi:10.18653/v1/2023.findings-emnlp.68 2023
[63]

Thirty-seventh Conference on Neural Information Processing Systems , year=

Inference-Time Intervention: Eliciting Truthful Answers from a Language Model , author=. Thirty-seventh Conference on Neural Information Processing Systems , year=

work page
[64]

Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages=

TruthfulQA: Measuring How Models Mimic Human Falsehoods , author=. Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages=

work page
[65]

Journal of Machine Learning Research , year =

Vojtech Franc and Daniel Prusa and Vaclav Voracek , title =. Journal of Machine Learning Research , year =

work page
[66]

C o QA : A Conversational Question Answering Challenge

Reddy, Siva and Chen, Danqi and Manning, Christopher D. C o QA : A Conversational Question Answering Challenge. Transactions of the Association for Computational Linguistics. 2019. doi:10.1162/tacl_a_00266

work page doi:10.1162/tacl_a_00266 2019
[67]

Are humans good intuitive statisticians after all? Rethinking some conclusions from the literature on judgment under uncertainty. Cognition. 1996.
[68]

Beyond temperature scaling: Obtaining well-calibrated multi-class probabilities with Dirichlet calibration. Advances in Neural Information Processing Systems.
[69]

Zhen Lin and Shubhendu Trivedi and Jimeng Sun. Taking a Step Back with. 2023.
[70]

Meta-Cal: Well-controlled Post-hoc Calibration by Ranking. Proceedings of the 38th International Conference on Machine Learning. 2021.
[71]

Mix-n-Match: Ensemble and Compositional Methods for Uncertainty Calibration in Deep Learning. Proceedings of the 37th International Conference on Machine Learning. 2020.
[72]

Zadrozny, Bianca and Elkan, Charles. 2001. doi:10.1145/502512.502540
[73]

Nixon, Jeremy and Dusenberry, Michael W. and Zhang, Linchuan and Jerfel, Ghassen and Tran, Dustin. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops.
[74]

Preference Leakage: A Contamination Problem in LLM-as-a-judge. arXiv preprint arXiv:2502.01534.
[75]

LLMs-as-Judges: A Comprehensive Survey on LLM-based Evaluation Methods. arXiv preprint arXiv:2412.05579.
[76]

Benchmarking LLMs via uncertainty quantification. Advances in Neural Information Processing Systems.
[77]

Benchmarking uncertainty quantification methods for large language models with LM-Polygraph. Transactions of the Association for Computational Linguistics. 2025.
[78]

Generating with confidence: Uncertainty quantification for black-box large language models. arXiv preprint arXiv:2305.19187.
[79]

DeBERTa: Decoding-enhanced BERT with Disentangled Attention. arXiv preprint arXiv:2006.03654.
[80]

On spectral clustering: Analysis and an algorithm. Advances in Neural Information Processing Systems.
