pith. sign in

arxiv: 2606.30814 · v1 · pith:SONURQ4Onew · submitted 2026-06-29 · 💻 cs.CL

When Calibration Rankings Reverse: Accuracy-Controlled Evaluation for Fair Comparison of LLMs

Pith reviewed 2026-07-01 02:14 UTC · model grok-4.3

classification 💻 cs.CL
keywords LLM calibrationaccuracy-controlled evaluationExpected Calibration Errorranking reversalmodel comparisonconfidence alignmentBrier Score
0
0 comments X

The pith

Raw global calibration metrics for LLMs are confounded by accuracy differences, making cross-model comparisons unreliable without accuracy control.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that standard global metrics such as Expected Calibration Error and Brier Score produce distorted comparisons of calibration across large language models because differences in overall accuracy systematically favor higher-accuracy models. To enable fairer evaluation, the authors introduce the ACE framework, which supplies three accuracy-controlled views of calibration. Experiments on multiple benchmarks and model families demonstrate that previously reported calibration advantages often weaken or disappear once accuracy is held constant. Rankings between models reverse frequently under the controlled views compared with raw metrics. This finding indicates that accuracy-aware evaluation is required for robust cross-model calibration assessment.

Core claim

Comparisons of LLM calibration using raw global metrics such as Expected Calibration Error and Brier Score are confounded by model accuracy differences. For fair comparison, the ACE framework provides three complementary accuracy-controlled views of calibration: Instance-Aligned, Distribution-Aligned, and Candidate-Aligned. Across benchmarks, model families, and confidence elicitation methods, accuracy control substantially weakens previously reported calibration advantages and produces frequent ranking reversals between models favored by raw metrics and those favored after control.

What carries the argument

The ACE accuracy-controlled evaluation framework with its three complementary views (Instance-Aligned, Distribution-Aligned, Candidate-Aligned) that adjust comparisons to isolate calibration quality from accuracy differences.

If this is right

  • Many previously reported calibration advantages weaken substantially after accuracy control.
  • Ranking reversal is frequent: models favored by raw metrics often cease to be favored once accuracy is controlled.
  • Comparisons along axes such as small versus large models and thinking versus non-thinking models change under ACE.
  • Raw global calibration metrics are not robust for cross-model comparison.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Benchmarking practices for LLMs should shift toward accuracy-controlled methods to avoid systematically biased conclusions about calibration.
  • The same accuracy-control principle could apply to other secondary properties such as uncertainty quantification or refusal behavior where primary performance differences may confound results.
  • Model selection decisions that rely on published calibration numbers may need re-examination using controlled views.

Load-bearing premise

The three views in ACE isolate calibration quality from accuracy differences without introducing new confounding factors or selection biases.

What would settle it

Re-running the same model comparisons with the three ACE views and finding that model rankings remain unchanged from raw global metrics across multiple benchmarks would show that accuracy control is unnecessary.

Figures

Figures reproduced from arXiv: 2606.30814 by Caiqi Zhang, Chengzu Li, Deqing Yang, Nigel Collier, Ruihan Yang, Zhichao Yang.

Figure 1
Figure 1. Figure 1: Accuracy differences strongly confound raw ECE comparison. Each point is one dataset-level average for a model pair (ρ=−0.670, n= 40, 95% CI [−0.81, −0.45], p < 10−5 ; Appendix A.2). Larger accuracy gaps correspond to larger raw calibration gaps; our accuracy-controlled views substantially reduce this confound. the magnitude of a calibration gap, controlling for accuracy can overturn the conclusion about w… view at source ↗
Figure 2
Figure 2. Figure 2: Calibration gap between small and large models within the same family, plotted before ( [PITH_FULL_IMAGE:figures/full_fig_p007_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Calibration gap between non-thinking and thinking models, using the same setup as Figure [PITH_FULL_IMAGE:figures/full_fig_p007_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Overall reversal-combination distribution rel [PITH_FULL_IMAGE:figures/full_fig_p008_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Reversal rate by accuracy-gap bin. The combined figure presents the same pattern from two complementary [PITH_FULL_IMAGE:figures/full_fig_p017_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Histogram comparison of confidence distributions on both-wrong and both-right examples across three [PITH_FULL_IMAGE:figures/full_fig_p017_6.png] view at source ↗
read the original abstract

Calibration evaluates whether a model confidence aligns with its empirical accuracy. Existing studies often compare the calibration of different large language models using global calibration metrics such as Expected Calibration Error and Brier Score. We begin by showing, both theoretically and empirically, that such comparisons are confounded by differences in model accuracy. For fairer cross-model comparison, we then propose ACE, an accuracy-controlled evaluation framework with three complementary views: Instance-Aligned, Distribution-Aligned, and Candidate-Aligned calibration. Across multiple benchmarks, model families, and confidence elicitation methods, we use ACE to study two practically important comparison axes, small versus large models and thinking versus non-thinking models. We find that many previously reported calibration advantages under raw global metrics weaken substantially after accuracy control. We also find that ranking reversal is frequent: models favored by raw metrics often cease to be favored once accuracy is controlled. Our results show that raw global calibration metrics are not robust for cross-model comparison, and that fair calibration comparison requires accuracy-aware evaluation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 3 minor

Summary. The paper claims that global calibration metrics such as ECE and Brier Score are confounded by model accuracy differences when comparing LLMs, as shown both theoretically and empirically. It proposes the ACE framework with three complementary views (Instance-Aligned, Distribution-Aligned, and Candidate-Aligned calibration) to enable accuracy-controlled evaluation. Experiments across benchmarks, model families, and confidence elicitation methods reveal frequent ranking reversals, with many reported calibration advantages weakening after accuracy control, implying that raw global metrics are not robust for cross-model comparison.

Significance. If the confounding result and the isolation provided by the three ACE views hold, the work would require re-examination of prior LLM calibration comparisons and establish accuracy-aware evaluation as a standard practice. The empirical demonstration of ranking reversals for small-vs-large and thinking-vs-non-thinking models supplies concrete, falsifiable evidence on two practically relevant axes.

minor comments (3)
  1. [§3] §3 (ACE definitions): the three alignment views are introduced via prose; adding a short algorithmic outline or pseudocode for each would improve reproducibility of the accuracy-control procedure.
  2. [Table 2, Figure 3] Table 2 and Figure 3: axis labels and legend entries use inconsistent abbreviations for the three ACE views; standardize notation across text, tables, and figures.
  3. [§5.2] §5.2 (ranking reversal analysis): the statistical test for reversal significance is described only in the caption; move the exact procedure and multiple-comparison correction into the main text.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive summary, significance assessment, and recommendation of minor revision. No major comments were raised in the report.

Circularity Check

0 steps flagged

No significant circularity identified

full rationale

The paper first establishes confounding of global metrics (ECE, Brier) by accuracy via separate theoretical and empirical arguments, then introduces ACE as a subsequent control framework. No quoted step reduces the central claim to a self-definition, fitted input renamed as prediction, or load-bearing self-citation chain. The three ACE views are presented as new constructs rather than tautological with the confounding demonstration. The derivation is self-contained and does not rely on ansatz smuggling or renaming of known results.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The work relies on standard definitions of calibration and accuracy; no free parameters or invented entities are described in the abstract.

axioms (1)
  • domain assumption Calibration is the alignment between a model's reported confidence and its empirical accuracy on held-out data.
    Standard definition invoked to motivate the confounding issue.

pith-pipeline@v0.9.1-grok · 5714 in / 1098 out tokens · 27505 ms · 2026-07-01T02:14:35.003959+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

300 extracted references · 51 canonical work pages · 13 internal anchors

  1. [1]

    A Survey of Large Language Models

    A Survey of Large Language Models , year =. arXiv , author =:2303.18223 , primaryclass =

  2. [2]

    2025 , eprint=

    Beyond the Final Layer: Intermediate Representations for Better Multilingual Calibration in Large Language Models , author=. 2025 , eprint=

  3. [3]

    2023 , eprint=

    LMRL Gym: Benchmarks for Multi-Turn Reinforcement Learning with Language Models , author=. 2023 , eprint=

  4. [4]

    A survey on evaluation of large language models , volume =

    Chang, Yupeng and Wang, Xu and Wang, Jindong and Wu, Yuan and Yang, Linyi and Zhu, Kaijie and Chen, Hao and Yi, Xiaoyuan and Wang, Cunxiang and Wang, Yidong and others , journal =. A survey on evaluation of large language models , volume =

  5. [5]

    A survey on large language models: Applications, challenges, limitations, and practical usage , year =

    Hadi, Muhammad Usman and Qureshi, Rizwan and Shah, Abbas and Irfan, Muhammad and Zafar, Anas and Shaikh, Muhammad Bilal and Akhtar, Naveed and Wu, Jia and Mirjalili, Seyedali and others , journal =. A survey on large language models: Applications, challenges, limitations, and practical usage , year =

  6. [6]

    Siren's song in the AI ocean: a survey on hallucination in large language models , url =

    Zhang, Yue and Li, Yafu and Cui, Leyang and Cai, Deng and Liu, Lemao and Fu, Tingchen and Huang, Xinting and Zhao, Enbo and Zhang, Yu and Chen, Yulong and others , journal =. Siren's song in the AI ocean: a survey on hallucination in large language models , url =

  7. [7]

    A Survey on Hallucination in Large Language Models: Principles, Taxonomy, Challenges, and Open Questions

    A Survey on Hallucination in Large Language Models: Principles, Taxonomy, Challenges, and Open Questions , year =. arXiv , author =:2311.05232 , primaryclass =

  8. [8]

    Weinberger , bibsource =

    Chuan Guo and Geoff Pleiss and Yu Sun and Kilian Q. Weinberger , bibsource =. On Calibration of Modern Neural Networks , url =. Proceedings of the 34th International Conference on Machine Learning,

  9. [9]

    arXiv , author =:2405.20974 , primaryclass =

    SaySelf: Teaching LLMs to Express Confidence with Self-Reflective Rationales , year =. arXiv , author =:2405.20974 , primaryclass =

  10. [10]

    Mistral 7B

    Mistral 7. arXiv , author =:2310.06825 , primaryclass =

  11. [11]

    Llama 2: Open Foundation and Fine-Tuned Chat Models

    Llama 2: Open Foundation and Fine-Tuned Chat Models , year =. arXiv , author =:2307.09288 , primaryclass =

  12. [12]

    arXiv , author =:2307.15703 , primaryclass =

    Uncertainty in Natural Language Generation: From Theory to Applications , year =. arXiv , author =:2307.15703 , primaryclass =

  13. [13]

    arXiv , author =:2404.03163 , primaryclass =

    Uncertainty in Language Models: Assessment through Rank-Calibration , year =. arXiv , author =:2404.03163 , primaryclass =

  14. [14]

    Andrey Malinin and Mark J. F. Gales , bibsource =. Uncertainty Estimation in Autoregressive Structured Prediction , url =. 9th International Conference on Learning Representations,

  15. [15]

    The measurement of observer agreement for categorical data , year =

    Landis, J Richard and Koch, Gary G , journal =. The measurement of observer agreement for categorical data , year =

  16. [16]

    arXiv , author =:2402.06544 , primaryclass =

    Calibrating Long-form Generations from Large Language Models , year =. arXiv , author =:2402.06544 , primaryclass =

  17. [17]

    Verification of forecasts expressed in terms of probability , volume =

    Brier, Glenn W , journal =. Verification of forecasts expressed in terms of probability , volume =

  18. [18]

    Cooper and Milos Hauskrecht , bibsource =

    Mahdi Pakdaman Naeini and Gregory F. Cooper and Milos Hauskrecht , bibsource =. Obtaining Well Calibrated Probabilities Using Bayesian Binning , url =. Proceedings of the Twenty-Ninth

  19. [19]

    arXiv , author =:2306.04459 , primaryclass =

    Uncertainty in Natural Language Processing: Sources, Quantification, and Applications , year =. arXiv , author =:2306.04459 , primaryclass =

  20. [20]

    GPT-4 Technical Report , url =

    OpenAI , journal =. GPT-4 Technical Report , url =

  21. [21]

    A survey of uncertainty in deep neural networks , url =

    Gawlikowski, Jakob and Tassi, Cedrique Rovile Njieutcheu and Ali, Mohsin and Lee, Jongseok and Humt, Matthias and Feng, Jianxiang and Kruspe, Anna and Triebel, Rudolph and Jung, Peter and Roscher, Ribana and others , journal =. A survey of uncertainty in deep neural networks , url =

  22. [22]

    What Uncertainties Do We Need in Bayesian Deep Learning for Computer Vision? , url =

    Alex Kendall and Yarin Gal , bibsource =. What Uncertainties Do We Need in Bayesian Deep Learning for Computer Vision? , url =. Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, December 4-9, 2017, Long Beach, CA,

  23. [23]

    Aleatoric and epistemic uncertainty in machine learning: An introduction to concepts and methods , url =

    H. Aleatoric and epistemic uncertainty in machine learning: An introduction to concepts and methods , url =. Machine Learning , pages =

  24. [24]

    Look before you leap: An exploratory study of uncertainty measurement for large language models , url =

    Huang, Yuheng and Song, Jiayang and Wang, Zhijie and Chen, Huaming and Ma, Lei , journal =. Look before you leap: An exploratory study of uncertainty measurement for large language models , url =

  25. [25]

    Aleatory and epistemic uncertainty in probability elicitation with an example from hazardous waste management , url =

    Hora, Stephen C , journal =. Aleatory and epistemic uncertainty in probability elicitation with an example from hazardous waste management , url =

  26. [26]

    WildHallucinations: Evaluating Long-form Factuality in LLMs with Real-World Entity Queries , url =

    Wenting Zhao and Tanya Goyal and Yu Ying Chiu and Liwei Jiang and Benjamin Newman and Abhilasha Ravichander and Khyathi Chandu and Ronan Le Bras and Claire Cardie and Yuntian Deng and Yejin Choi , journal =. WildHallucinations: Evaluating Long-form Factuality in LLMs with Real-World Entity Queries , url =

  27. [27]

    Aleatory or epistemic? Does it matter? , url =

    Der Kiureghian, Armen and Ditlevsen, Ove , journal =. Aleatory or epistemic? Does it matter? , url =

  28. [28]

    Qwen2 Technical Report , url =

    An Yang and Baosong Yang and Binyuan Hui and Bo Zheng and Bowen Yu and Chang Zhou and Chengpeng Li and Chengyuan Li and Dayiheng Liu and Fei Huang and Guanting Dong and Haoran Wei and Huan Lin and Jialong Tang and Jialin Wang and Jian Yang and Jianhong Tu and Jianwei Zhang and Jianxin Ma and Jianxin Yang and Jin Xu and Jingren Zhou and Jinze Bai and Jinzh...

  29. [29]

    Linguistic Calibration of Long-Form Generations , url =

    Neil Band and Xuechen Li and Tengyu Ma and Tatsunori Hashimoto , bibsource =. Linguistic Calibration of Long-Form Generations , url =. Forty-first International Conference on Machine Learning,

  30. [30]

    LoGU: Long-form Generation with Uncertainty Expressions , url =

    Ruihan Yang and Caiqi Zhang and Zhisong Zhang and Xinting Huang and Sen Yang and Nigel Collier and Dong Yu and Deqing Yang , journal =. LoGU: Long-form Generation with Uncertainty Expressions , url =

  31. [31]

    Language Models (Mostly) Know What They Know

    Language Models (Mostly) Know What They Know , year =. arXiv , author =:2207.05221 , primaryclass =

  32. [32]

    Self-critiquing models for assisting human evaluators

    Self-critiquing models for assisting human evaluators , year =. arXiv , author =:2206.05802 , primaryclass =

  33. [33]

    Atomic Calibration of LLMs in Long-Form Generations , url =

    Caiqi Zhang and Ruihan Yang and Zhisong Zhang and Xinting Huang and Sen Yang and Dong Yu and Nigel Collier , journal =. Atomic Calibration of LLMs in Long-Form Generations , url =

  34. [34]

    arXiv , author =:2406.19276 , primaryclass =

    VERISCORE: Evaluating the factuality of verifiable claims in long-form text generation , year =. arXiv , author =:2406.19276 , primaryclass =

  35. [35]

    Le , bibsource =

    Jerry Wei and Chengrun Yang and Xinying Song and Yifeng Lu and Nathan Hu and Jie Huang and Dustin Tran and Daiyi Peng and Ruibo Liu and Da Huang and Cosmo Du and Quoc V. Le , bibsource =. Long-form factuality in large language models , url =. Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2...

  36. [36]

    Metrics for evaluating performance and uncertainty of Bayesian network models , volume =

    Marcot, Bruce G , journal =. Metrics for evaluating performance and uncertainty of Bayesian network models , volume =

  37. [37]

    Classification uncertainty of deep neural networks based on gradient information , url =

    Oberdiek, Philipp and Rottmann, Matthias and Gottschalk, Hanno , journal =. Classification uncertainty of deep neural networks based on gradient information , url =

  38. [38]

    Gradients as a measure of uncertainty in neural networks , url =

    Lee, Jinsol and AlRegib, Ghassan , journal =. Gradients as a measure of uncertainty in neural networks , url =

  39. [39]

    Dropout as a Bayesian Approximation: Representing Model Uncertainty in Deep Learning , url =

    Yarin Gal and Zoubin Ghahramani , bibsource =. Dropout as a Bayesian Approximation: Representing Model Uncertainty in Deep Learning , url =. Proceedings of the 33nd International Conference on Machine Learning,

  40. [40]

    Bayesian Learning via Stochastic Gradient Langevin Dynamics , url =

    Max Welling and Yee Whye Teh , bibsource =. Bayesian Learning via Stochastic Gradient Langevin Dynamics , url =. Proceedings of the 28th International Conference on Machine Learning,

  41. [41]

    Estimating Model Uncertainty of Neural Networks in Sparse Information Form , url =

    Jongseok Lee and Matthias Humt and Jianxiang Feng and Rudolph Triebel , bibsource =. Estimating Model Uncertainty of Neural Networks in Sparse Information Form , url =. Proceedings of the 37th International Conference on Machine Learning,

  42. [42]

    BatchEnsemble: an Alternative Approach to Efficient Ensemble and Lifelong Learning , url =

    Yeming Wen and Dustin Tran and Jimmy Ba , bibsource =. BatchEnsemble: an Alternative Approach to Efficient Ensemble and Lifelong Learning , url =. 8th International Conference on Learning Representations,

  43. [43]

    Simple and Scalable Predictive Uncertainty Estimation using Deep Ensembles , url =

    Balaji Lakshminarayanan and Alexander Pritzel and Charles Blundell , bibsource =. Simple and Scalable Predictive Uncertainty Estimation using Deep Ensembles , url =. Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, December 4-9, 2017, Long Beach, CA,

  44. [44]

    Vetrov , bibsource =

    Alexander Lyzhov and Yuliya Molchanova and Arsenii Ashukha and Dmitry Molchanov and Dmitry P. Vetrov , bibsource =. Greedy Policy Search:. Proceedings of the Thirty-Sixth Conference on Uncertainty in Artificial Intelligence,

  45. [45]

    Learning Loss for Test-Time Augmentation , url =

    Ildoo Kim and Younghoon Kim and Sungwoong Kim , bibsource =. Learning Loss for Test-Time Augmentation , url =. Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual , editor =

  46. [46]

    A Survey of Language Model Confidence Estimation and Calibration , url =

    Geng, Jiahui and Cai, Fengyu and Wang, Yuxia and Koeppl, Heinz and Nakov, Preslav and Gurevych, Iryna , journal =. A Survey of Language Model Confidence Estimation and Calibration , url =

  47. [47]

    arXiv , author =:2310.07521 , primaryclass =

    Survey on Factuality in Large Language Models: Knowledge, Retrieval and Domain-Specificity , year =. arXiv , author =:2310.07521 , primaryclass =

  48. [48]

    Semantic Uncertainty: Linguistic Invariances for Uncertainty Estimation in Natural Language Generation , url =

    Lorenz Kuhn and Yarin Gal and Sebastian Farquhar , bibsource =. Semantic Uncertainty: Linguistic Invariances for Uncertainty Estimation in Natural Language Generation , url =. The Eleventh International Conference on Learning Representations,

  49. [49]

    Shifting attention to relevance: Towards the uncertainty estimation of large language models , url =

    Duan, Jinhao and Cheng, Hao and Wang, Shiqi and Wang, Chenan and Zavalny, Alex and Xu, Renjing and Kailkhura, Bhavya and Xu, Kaidi , journal =. Shifting attention to relevance: Towards the uncertainty estimation of large language models , url =

  50. [50]

    Teaching models to express their uncertainty in words , url =

    Lin, Stephanie and Hilton, Jacob and Evans, Owain , journal =. Teaching models to express their uncertainty in words , url =

  51. [51]

    Can LLMs Express Their Uncertainty? An Empirical Evaluation of Confidence Elicitation in LLMs , url =

    Miao Xiong and Zhiyuan Hu and Xinyang Lu and Yifei Li and Jie Fu and Junxian He and Bryan Hooi , bibsource =. Can LLMs Express Their Uncertainty? An Empirical Evaluation of Confidence Elicitation in LLMs , url =. The Twelfth International Conference on Learning Representations,

  52. [52]

    Combining Confidence Elicitation and Sample-based Methods for Uncertainty Quantification in Misinformation Mitigation , url =

    Rivera, Mauricio and Godbout, Jean-Fran. Combining Confidence Elicitation and Sample-based Methods for Uncertainty Quantification in Misinformation Mitigation , url =. Proceedings of the 1st Workshop on Uncertainty-Aware NLP (UncertaiNLP 2024) , editor =

  53. [53]

    arXiv , author =:2305.19187 , primaryclass =

    Generating with Confidence: Uncertainty Quantification for Black-box Large Language Models , year =. arXiv , author =:2305.19187 , primaryclass =

  54. [54]

    A Survey of Large Language Models , url =

    Zhao, Wayne Xin and Zhou, Kun and Li, Junyi and Tang, Tianyi and Wang, Xiaolei and Hou, Yupeng and Min, Yingqian and Zhang, Beichen and Zhang, Junjie and Dong, Zican and Du, Yifan and Yang, Chen and Chen, Yushuo and Chen, Zhipeng and Jiang, Jinhao and Ren, Ruiyang and Li, Yifan and Tang, Xinyu and Liu, Zikang and Liu, Peiyu and Nie, Jian-Yun and Wen, Ji-R...

  55. [55]

    A survey on evaluation of large language models , url =

    Chang, Yupeng and Wang, Xu and Wang, Jindong and Wu, Yuan and Yang, Linyi and Zhu, Kaijie and Chen, Hao and Yi, Xiaoyuan and Wang, Cunxiang and Wang, Yidong and others , journal =. A survey on evaluation of large language models , url =

  56. [56]

    Know the Unknown: An Uncertainty-Sensitive Method for LLM Instruction Tuning , url =

    Jiaqi Li and Yixuan Tang and Yi Yang , journal =. Know the Unknown: An Uncertainty-Sensitive Method for LLM Instruction Tuning , url =

  57. [57]

    Sheng. Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2024, NeurIPS 2024, Vancouver, BC, Canada, December 10 - 15, 2024 , editor =

  58. [58]

    Gemini: A Family of Highly Capable Multimodal Models , type =

  59. [59]

    Introducing Claude 2.1 , year =

    Anthropic , note =. Introducing Claude 2.1 , year =

  60. [60]

    arXiv , author =:2308.16175 , primaryclass =

    Quantifying Uncertainty in Answers from any Language Model and Enhancing their Trustworthiness , year =. arXiv , author =:2308.16175 , primaryclass =

  61. [61]

    Weinberger and Yoav Artzi , bibsource =

    Tianyi Zhang and Varsha Kishore and Felix Wu and Kilian Q. Weinberger and Yoav Artzi , bibsource =. BERTScore: Evaluating Text Generation with. 8th International Conference on Learning Representations,

  62. [62]

    A Baseline for Detecting Misclassified and Out-of-Distribution Examples in Neural Networks , url =

    Dan Hendrycks and Kevin Gimpel , bibsource =. A Baseline for Detecting Misclassified and Out-of-Distribution Examples in Neural Networks , url =. 5th International Conference on Learning Representations,

  63. [63]

    Selective Classification for Deep Neural Networks , url =

    Yonatan Geifman and Ran El. Selective Classification for Deep Neural Networks , url =. Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, December 4-9, 2017, Long Beach, CA,

  64. [64]

    Comparison of values of Pearson's and Spearman's correlation coefficients on the same sets of data , volume =

    Hauke, Jan and Kossowski, Tomasz , journal =. Comparison of values of Pearson's and Spearman's correlation coefficients on the same sets of data , volume =

  65. [65]

    DeBERTaV3: Improving DeBERTa using ELECTRA-Style Pre-Training with Gradient-Disentangled Embedding Sharing , url =

    Pengcheng He and Jianfeng Gao and Weizhu Chen , bibsource =. DeBERTaV3: Improving DeBERTa using ELECTRA-Style Pre-Training with Gradient-Disentangled Embedding Sharing , url =. The Eleventh International Conference on Learning Representations,

  66. [66]

    A Broad-Coverage Challenge Corpus for Sentence Understanding through Inference , url =

    Williams, Adina and Nangia, Nikita and Bowman, Samuel , booktitle =. A Broad-Coverage Challenge Corpus for Sentence Understanding through Inference , url =. doi:10.18653/v1/N18-1101 , editor =

  67. [67]

    Correlation coefficients: appropriate use and interpretation , url =

    Schober, Patrick and Boer, Christa and Schwarte, Lothar A , journal =. Correlation coefficients: appropriate use and interpretation , url =

  68. [68]

    arXiv , author =:2311.15451 , primaryclass =

    Uncertainty-aware Language Modeling for Selective Question Answering , year =. arXiv , author =:2311.15451 , primaryclass =

  69. [69]

    ChatGPT Blog Post , year =

    OpenAI , howpublished =. ChatGPT Blog Post , year =

  70. [70]

    Building the Next Generation of Open-Source and Bilingual LLMs , year =

  71. [71]

    arXiv , author =:2311.10702 , primaryclass =

    Camels in a Changing Climate: Enhancing LM Adaptation with Tulu 2 , year =. arXiv , author =:2311.10702 , primaryclass =

  72. [72]

    Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena , url =

    Lianmin Zheng and Wei. Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena , url =. Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, December 10 - 16, 2023 , editor =

  73. [73]

    Unfamiliar finetuning examples control how language models hallucinate , url =

    Kang, Katie and Wallace, Eric and Tomlin, Claire and Kumar, Aviral and Levine, Sergey , journal =. Unfamiliar finetuning examples control how language models hallucinate , url =

  74. [74]

    Detecting hallucinations in large language models using semantic entropy , volume =

    Farquhar, Sebastian and Kossen, Jannik and Kuhn, Lorenz and Gal, Yarin , journal =. Detecting hallucinations in large language models using semantic entropy , volume =

  75. [75]

    Manning and Chelsea Finn , bibsource =

    Katherine Tian and Eric Mitchell and Huaxiu Yao and Christopher D. Manning and Chelsea Finn , bibsource =. Fine-Tuning Language Models for Factuality , url =. The Twelfth International Conference on Learning Representations,

  76. [76]

    Alignment for Honesty , url =

    Yuqing Yang and Ethan Chern and Xipeng Qiu and Graham Neubig and Pengfei Liu , bibsource =. Alignment for Honesty , url =. Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2024, NeurIPS 2024, Vancouver, BC, Canada, December 10 - 15, 2024 , editor =

  77. [77]

    Manning and Stefano Ermon and Chelsea Finn , bibsource =

    Rafael Rafailov and Archit Sharma and Eric Mitchell and Christopher D. Manning and Stefano Ermon and Chelsea Finn , bibsource =. Direct Preference Optimization: Your Language Model is Secretly a Reward Model , url =. Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, New Orl...

  78. [78]

    Self-Refine: Iterative Refinement with Self-Feedback , url =

    Aman Madaan and Niket Tandon and Prakhar Gupta and Skyler Hallinan and Luyu Gao and Sarah Wiegreffe and Uri Alon and Nouha Dziri and Shrimai Prabhumoye and Yiming Yang and Shashank Gupta and Bodhisattwa Prasad Majumder and Katherine Hermann and Sean Welleck and Amir Yazdanbakhsh and Peter Clark , bibsource =. Self-Refine: Iterative Refinement with Self-Fe...

  79. [79]

    Kingma and Jimmy Ba , bibsource =

    Diederik P. Kingma and Jimmy Ba , bibsource =. Adam:. 3rd International Conference on Learning Representations,

  80. [80]

    Editing Factual Knowledge in Language Models , url =

    De Cao, Nicola and Aziz, Wilker and Titov, Ivan , booktitle =. Editing Factual Knowledge in Language Models , url =. doi:10.18653/v1/2021.emnlp-main.522 , editor =

Showing first 80 references.