pith. sign in

arxiv: 2412.02904 · v2 · submitted 2024-12-03 · 💻 cs.CL · cs.AI· cs.LG

Enhancing Trust in Large Language Models via Uncertainty-Calibrated Fine-Tuning

Pith reviewed 2026-05-23 07:43 UTC · model grok-4.3

classification 💻 cs.CL cs.AIcs.LG
keywords uncertainty calibrationlarge language modelsfine-tuninghallucination detectiondecision theorycausal language modeling
0
0 comments X

The pith

An uncertainty-aware loss during fine-tuning produces better-calibrated probability estimates from large language models without hurting task accuracy.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes replacing the usual next-token prediction loss with a new loss that incorporates uncertainty calibration during fine-tuning. The new loss comes from decision theory and is designed so that the model learns to assign probabilities that better reflect its true error rate on free-form generation tasks. Experiments on several question-answering datasets and models show that the resulting models keep the same accuracy while giving uncertainty scores that more reliably flag hallucinations and out-of-domain inputs.

Core claim

Replacing the standard causal language modeling loss with an uncertainty-aware variant derived from decision theory during fine-tuning produces probability estimates whose calibration error is lower on natural language generation tasks, while task accuracy remains comparable; the same models also become better at detecting hallucinations and out-of-domain prompts.

What carries the argument

The uncertainty-aware causal language modeling loss, obtained by applying decision-theoretic principles to the usual next-token objective.

If this is right

  • The same loss can be plugged into any autoregressive fine-tuning pipeline without changing the model architecture.
  • Better-calibrated uncertainty scores become available for downstream filtering or abstention decisions on generated text.
  • Hallucination detectors that rely on token probabilities gain reliability when the underlying model was fine-tuned with this loss.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The approach might generalize to tasks outside free-form QA if the decision-theoretic loss is applied to other autoregressive objectives.
  • If the loss improves calibration on the training distribution, it could also reduce over on shifted distributions that share similar token statistics.

Load-bearing premise

Optimizing the new decision-theoretic loss on the fine-tuning data improves calibration on future data drawn from the same distribution and does not trade off against accuracy.

What would settle it

On a held-out free-form QA dataset, a model fine-tuned with the new loss shows higher expected calibration error or lower hallucination-detection AUC than the same model fine-tuned with the ordinary causal language modeling loss.

Figures

Figures reproduced from arXiv: 2412.02904 by Omesh Tickoo, Piyush Khanna, Ranganath Krishnan.

Figure 1
Figure 1. Figure 1: The proposed Uncertainty-aware Causal Language Modeling (UA-CLM) outperforms [PITH_FULL_IMAGE:figures/full_fig_p006_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Uncertainty calibration analysis: Spearman’s rank correlation coefficient and Pearson [PITH_FULL_IMAGE:figures/full_fig_p008_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Accuracy versus Expected Calibration Error (ECE) comparison between UA-CLM, CLM, [PITH_FULL_IMAGE:figures/full_fig_p009_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Sample from CoQA (Reddy et al., 2019) illustrating the co-reference chain of conversa￾tional questions. 15 [PITH_FULL_IMAGE:figures/full_fig_p015_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Data samples from OK-VQA (Marino et al., 2019) across different knowledge categories. BioASQ The BioASQ (Krithara et al., 2023) challenge, conducted every year, focuses on tech￾niques in large-scale biomedical semantic indexing and question answering (QA). For our exper￾iments, we utilize Task B ( [PITH_FULL_IMAGE:figures/full_fig_p016_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Ablation study: Effect of temperature value on the quality of generated text and the quality of uncertainty estimates evaluated with AUROC for hallucination detection. The study was performed on pre￾trained Llama-2-7B model with CoQA dataset. Based on this study, we selected temperature T=0.3 as it results in optimal AUROC and ROUGE-L scores. A.3 UNCERTAINTY ESTIMATION METRICS We assess uncertainty in natu… view at source ↗
Figure 7
Figure 7. Figure 7: Analysis of Correct and Incorrect Token Counts in mini-batch during fine-tuning with CLM and UA-CLM. Both CLM and UA-CLM show increase in correct tokens and a decrease in incorrect tokens as fine￾tuning progresses. (a) CLM (b) UA-CLM [PITH_FULL_IMAGE:figures/full_fig_p019_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: Analysis of Token Uncertainty associated with Correct and Incorrect tokens in the mini-batch dur￾ing fine-tuning with CLM and UA-CLM. A well-calibrated model should provide low uncertainty for correct tokens and higher uncertainty for incorrect tokens. With standard CLM Loss, uncertainty for both correct and incorrect tokens decreases, indicating overconfidence even on incorrect tokens. In contract, with U… view at source ↗
Figure 9
Figure 9. Figure 9: Analysis of Token Softmax Probability associated with Correct and Incorrect tokens during fine￾tuning with CLM and UA-CLM. A well-calibrated model should assign high probability to correct tokens and lower probability to incorrect tokens. With standard CLM loss, probabilities for both correct and incorrect tokens increase as fine-tuning progress, indicating overconfidence. In contrast, UA-CLM fine-tuning r… view at source ↗
Figure 10
Figure 10. Figure 10: Llama-2-7B: Loss convergence and uncertainty values associated with correct and incorrect tokens. (i) CLM (ii) UA-CLM [PITH_FULL_IMAGE:figures/full_fig_p020_10.png] view at source ↗
Figure 11
Figure 11. Figure 11: Llama-2-13B: Loss convergence and uncertainty values for correct and incorrect tokens [PITH_FULL_IMAGE:figures/full_fig_p020_11.png] view at source ↗
Figure 12
Figure 12. Figure 12: Llava-1.5: Loss convergence and uncertainty values associated with correct and incorrect tokens. 20 [PITH_FULL_IMAGE:figures/full_fig_p020_12.png] view at source ↗
Figure 13
Figure 13. Figure 13: Accuracy versus Expected Calibration Error (ECE) comparison between UA-CLM, CLM, and pre-trained baseline across different LLM architectures on CoQA dataset. The ideal model should have high accuracy and low expected calibration error, indicating accurate predictions with well-calibrated uncertainty quantification (top-left of the Accuracy vs ECE plot). When evaluating three different model architectures,… view at source ↗
Figure 14
Figure 14. Figure 14: Accuracy versus Expected Calibration Error (ECE) comparison between UA-CLM, CLM, and pre-trained baseline across different LLM architectures on TriviaQA dataset. The ideal model should have high accuracy and low expected calibration error, indicating accurate predictions with well-calibrated uncertainty quantification (top-left of the Accuracy vs ECE plot). When evaluating three different model architectu… view at source ↗
read the original abstract

Large language models (LLMs) have revolutionized the field of natural language processing with their impressive reasoning and question-answering capabilities. However, these models are sometimes prone to generating credible-sounding but incorrect information, a phenomenon known as LLM hallucinations. Reliable uncertainty estimation in LLMs is essential for fostering trust in their generated responses and serves as a critical tool for the detection and prevention of erroneous or hallucinated outputs. To achieve reliable and well-calibrated uncertainty quantification in open-ended and free-form natural language generation, we propose an uncertainty-aware fine-tuning approach for LLMs. This approach enhances the model's ability to provide reliable uncertainty estimates without compromising accuracy, thereby guiding them to produce more trustworthy responses. We introduce a novel uncertainty-aware causal language modeling loss function, grounded in the principles of decision theory. Through rigorous evaluation on multiple free-form question-answering datasets and models, we demonstrate that our uncertainty-aware fine-tuning approach yields better calibrated uncertainty estimates in natural language generation tasks than fine-tuning with the standard causal language modeling loss. Furthermore, the experimental results show that the proposed method significantly improves the model's ability to detect hallucinations and identify out-of-domain prompts.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes an uncertainty-aware fine-tuning approach for LLMs that replaces the standard causal language modeling loss with a novel loss derived from decision theory. The central claim is that this yields better-calibrated uncertainty estimates on free-form QA tasks than standard fine-tuning, while preserving accuracy and improving hallucination detection and OOD identification, with supporting results reported across multiple datasets and models.

Significance. If the empirical claims hold with proper quantitative support, the work would be significant for LLM reliability: a decision-theoretic loss that improves calibration without accuracy trade-offs could directly aid hallucination mitigation and user trust. The grounding in decision theory and the focus on open-ended generation are positive differentiators from heuristic calibration methods.

major comments (2)
  1. [Abstract] Abstract: the claim that the method 'yields better calibrated uncertainty estimates' and 'significantly improves' hallucination detection is presented without any numerical results, baselines, metrics (e.g., ECE, AUROC), statistical tests, or implementation details. This absence makes the data-to-claim link unevaluable and is load-bearing for the central empirical contribution.
  2. [Evaluation / Experiments] The weakest assumption (that the new loss improves calibration while preserving accuracy and generalizing) is stated but not yet supported by evidence in the provided description; the evaluation section must supply the missing quantitative comparisons and controls to substantiate it.
minor comments (2)
  1. [Method] Clarify whether the decision-theoretic loss introduces any new hyperparameters and how they are chosen or shown to be robust.
  2. [Experiments] Add explicit comparison to existing calibration techniques (e.g., temperature scaling, post-hoc methods) in addition to the standard LM loss baseline.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive report. We address the two major comments below. We agree that the abstract requires quantitative support and will revise it; the evaluation section already contains the requested comparisons but will be expanded for clarity.

read point-by-point responses
  1. Referee: [Abstract] Abstract: the claim that the method 'yields better calibrated uncertainty estimates' and 'significantly improves' hallucination detection is presented without any numerical results, baselines, metrics (e.g., ECE, AUROC), statistical tests, or implementation details. This absence makes the data-to-claim link unevaluable and is load-bearing for the central empirical contribution.

    Authors: We agree that the abstract would be strengthened by including key numerical results. In the revised version we will add specific metrics (e.g., ECE reductions, AUROC gains for hallucination detection), mention the main baselines, models, and datasets, and note that statistical significance was assessed. This change directly addresses the data-to-claim concern while preserving the abstract's brevity. revision: yes

  2. Referee: [Evaluation / Experiments] The weakest assumption (that the new loss improves calibration while preserving accuracy and generalizing) is stated but not yet supported by evidence in the provided description; the evaluation section must supply the missing quantitative comparisons and controls to substantiate it.

    Authors: The full manuscript already reports results across multiple free-form QA datasets and models, showing improved calibration (via ECE and related metrics), maintained accuracy, and gains in hallucination detection and OOD identification. To make this support more explicit, we will add further quantitative tables, additional baseline comparisons, statistical tests, and implementation details in the evaluation section. revision: partial

Circularity Check

0 steps flagged

No significant circularity identified

full rationale

The paper presents a novel uncertainty-aware causal language modeling loss derived from decision theory principles, with empirical evaluations on multiple datasets showing improved calibration over standard fine-tuning. No load-bearing step reduces by construction to fitted inputs, self-citations, or renamings; the derivation chain and performance claims remain independent of the reported results themselves. The abstract and described approach contain no self-definitional or fitted-prediction patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no free parameters, axioms, or invented entities are specified in the provided text.

pith-pipeline@v0.9.0 · 5737 in / 968 out tokens · 53199 ms · 2026-05-23T07:43:44.782203+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

70 extracted references · 70 canonical work pages · 5 internal anchors

  1. [1]

    A review of uncertainty quantification in deep learning: Techniques, applications and challenges

    Moloud Abdar, Farhad Pourpanah, Sadiq Hussain, Dana Rezazadegan, Li Liu, Mohammad Ghavamzadeh, Paul Fieguth, Xiaochun Cao, Abbas Khosravi, U Rajendra Acharya, et al. A review of uncertainty quantification in deep learning: Techniques, applications and challenges. Information fusion, 76: 0 243--297, 2021

  2. [2]

    GPT-4 Technical Report

    Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023

  3. [3]

    Creating trustworthy llms: Dealing with hallucinations in healthcare ai

    Muhammad Aurangzeb Ahmad, Ilker Yaramis, and Taposh Dutta Roy. Creating trustworthy llms: Dealing with hallucinations in healthcare ai. arXiv preprint arXiv:2311.01463, 2023

  4. [4]

    Concrete Problems in AI Safety

    Dario Amodei, Chris Olah, Jacob Steinhardt, Paul Christiano, John Schulman, and Dan Man \'e . Concrete problems in ai safety. arXiv preprint arXiv:1606.06565, 2016

  5. [5]

    Linguistic calibration of long-form generations

    Neil Band, Xuechen Li, Tengyu Ma, and Tatsunori Hashimoto. Linguistic calibration of long-form generations. In Forty-first International Conference on Machine Learning, 2024. URL https://openreview.net/forum?id=rJVjQSQ8ye

  6. [6]

    Pearson correlation coefficient

    Jacob Benesty, Jingdong Chen, Yiteng Huang, and Israel Cohen. Pearson correlation coefficient. In Noise reduction in speech processing, pp.\ 37--40. Springer, 2009

  7. [7]

    Confabulations: a conceptual history

    German E Berrios. Confabulations: a conceptual history. Journal of the History of the Neurosciences, 7 0 (3): 0 225--241, 1998

  8. [8]

    Weight uncertainty in neural network

    Charles Blundell, Julien Cornebise, Koray Kavukcuoglu, and Daan Wierstra. Weight uncertainty in neural network. In International conference on machine learning, pp.\ 1613--1622. PMLR, 2015

  9. [9]

    The relationship between precision-recall and roc curves

    Jesse Davis and Mark Goadrich. The relationship between precision-recall and roc curves. In Proceedings of the 23rd international conference on Machine learning, pp.\ 233--240, 2006

  10. [10]

    Calibration of pre-trained transformers

    Shrey Desai and Greg Durrett. Calibration of pre-trained transformers. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp.\ 295--302, 2020

  11. [11]

    Determinants of llm-assisted decision-making

    Eva Eigner and Thorsten H \"a ndler. Determinants of llm-assisted decision-making. arXiv preprint arXiv:2402.17385, 2024

  12. [12]

    Lm-polygraph: Uncertainty estimation for language models

    Ekaterina Fadeeva, Roman Vashurin, Akim Tsvigun, Artem Vazhentsev, Sergey Petrakov, Kirill Fedyanin, Daniil Vasilev, Elizaveta Goncharova, Alexander Panchenko, Maxim Panov, et al. Lm-polygraph: Uncertainty estimation for language models. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp.\ ...

  13. [13]

    Detecting hallucinations in large language models using semantic entropy

    Sebastian Farquhar, Jannik Kossen, Lorenz Kuhn, and Yarin Gal. Detecting hallucinations in large language models using semantic entropy. Nature, 630 0 (8017): 0 625--630, 2024

  14. [14]

    Unsupervised quality estimation for neural machine translation

    Marina Fomicheva, Shuo Sun, Lisa Yankovskaya, Fr \'e d \'e ric Blain, Francisco Guzm \'a n, Mark Fishel, Nikolaos Aletras, Vishrav Chaudhary, and Lucia Specia. Unsupervised quality estimation for neural machine translation. Transactions of the Association for Computational Linguistics, 8: 0 539--555, 2020

  15. [15]

    A survey of uncertainty in deep neural networks

    Jakob Gawlikowski, Cedrique Rovile Njieutcheu Tassi, Mohsin Ali, Jongseok Lee, Matthias Humt, Jianxiang Feng, Anna Kruspe, Rudolph Triebel, Peter Jung, Ribana Roscher, et al. A survey of uncertainty in deep neural networks. Artificial Intelligence Review, 56 0 (Suppl 1): 0 1513--1589, 2023

  16. [16]

    Gemma: Open Models Based on Gemini Research and Technology

    Gemma, Thomas Mesnard, Cassidy Hardin, Robert Dadashi, Surya Bhupatiraju, Shreya Pathak, Laurent Sifre, Morgane Rivi \`e re, Mihir Sanjay Kale, Juliette Love, et al. Gemma: Open models based on gemini research and technology. arXiv preprint arXiv:2403.08295, 2024

  17. [17]

    A survey of confidence estimation and calibration in large language models

    Jiahui Geng, Fengyu Cai, Yuxia Wang, Heinz Koeppl, Preslav Nakov, and Iryna Gurevych. A survey of confidence estimation and calibration in large language models. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pp.\ 6577--6595, 2024

  18. [18]

    On calibration of modern neural networks

    Chuan Guo, Geoff Pleiss, Yu Sun, and Kilian Q Weinberger. On calibration of modern neural networks. In International conference on machine learning, pp.\ 1321--1330. PMLR, 2017

  19. [19]

    Augmix: A simple data processing method to improve robustness and uncertainty

    Dan Hendrycks, Norman Mu, Ekin Dogus Cubuk, Barret Zoph, Justin Gilmer, and Balaji Lakshminarayanan. Augmix: A simple data processing method to improve robustness and uncertainty. In International Conference on Learning Representations, 2020

  20. [20]

    Parameter-efficient transfer learning for nlp

    Neil Houlsby, Andrei Giurgiu, Stanislaw Jastrzebski, Bruna Morrone, Quentin De Laroussilhe, Andrea Gesmundo, Mona Attariyan, and Sylvain Gelly. Parameter-efficient transfer learning for nlp. In International conference on machine learning, pp.\ 2790--2799. PMLR, 2019

  21. [21]

    Lora: Low-rank adaptation of large language models

    Edward J Hu, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen, et al. Lora: Low-rank adaptation of large language models. In International Conference on Learning Representations, 2022

  22. [22]

    Survey of hallucination in natural language generation

    Ziwei Ji, Nayeon Lee, Rita Frieske, Tiezheng Yu, Dan Su, Yan Xu, Etsuko Ishii, Ye Jin Bang, Andrea Madotto, and Pascale Fung. Survey of hallucination in natural language generation. ACM Computing Surveys, 55 0 (12): 0 1--38, 2023

  23. [23]

    How can we know when language models know? on the calibration of language models for question answering

    Zhengbao Jiang, Jun Araki, Haibo Ding, and Graham Neubig. How can we know when language models know? on the calibration of language models for question answering. Transactions of the Association for Computational Linguistics, 9: 0 962--977, 2021

  24. [24]

    Triviaqa: A large scale distantly supervised challenge dataset for reading comprehension

    Mandar Joshi, Eunsol Choi, Daniel S Weld, and Luke Zettlemoyer. Triviaqa: A large scale distantly supervised challenge dataset for reading comprehension. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp.\ 1601--1611, 2017

  25. [25]

    Language Models (Mostly) Know What They Know

    Saurav Kadavath, Tom Conerly, Amanda Askell, Tom Henighan, Dawn Drain, Ethan Perez, Nicholas Schiefer, Zac Hatfield-Dodds, Nova DasSarma, Eli Tran-Johnson, et al. Language models (mostly) know what they know. arXiv preprint arXiv:2207.05221, 2022

  26. [26]

    Calibration-tuning: Teaching large language models to know what they don’t know

    Sanyam Kapoor, Nate Gruver, Manley Roberts, Arka Pal, Samuel Dooley, Micah Goldblum, and Andrew Wilson. Calibration-tuning: Teaching large language models to know what they don’t know. In Proceedings of the 1st Workshop on Uncertainty-Aware NLP (UncertaiNLP 2024), pp.\ 1--14, 2024

  27. [27]

    Soft calibration objectives for neural networks

    Archit Karandikar, Nicholas Cain, Dustin Tran, Balaji Lakshminarayanan, Jonathon Shlens, Michael C Mozer, and Becca Roelofs. Soft calibration objectives for neural networks. Advances in Neural Information Processing Systems, 34: 0 29768--29779, 2021

  28. [28]

    Calibrated language model fine-tuning for in-and out-of-distribution data

    Lingkai Kong, Haoming Jiang, Yuchen Zhuang, Jie Lyu, Tuo Zhao, and Chao Zhang. Calibrated language model fine-tuning for in-and out-of-distribution data. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp.\ 1326--1340, 2020

  29. [29]

    Improving model calibration with accuracy versus uncertainty optimization

    Ranganath Krishnan and Omesh Tickoo. Improving model calibration with accuracy versus uncertainty optimization. Advances in Neural Information Processing Systems, 33: 0 18237--18248, 2020

  30. [30]

    Bioasq-qa: A manually curated corpus for biomedical question answering

    Anastasia Krithara, Anastasios Nentidis, Konstantinos Bougiatiotis, and Georgios Paliouras. Bioasq-qa: A manually curated corpus for biomedical question answering. Scientific Data, 10 0 (1): 0 170, 2023

  31. [31]

    Semantic uncertainty: Linguistic invariances for uncertainty estimation in natural language generation

    Lorenz Kuhn, Yarin Gal, and Sebastian Farquhar. Semantic uncertainty: Linguistic invariances for uncertainty estimation in natural language generation. In The Eleventh International Conference on Learning Representations, 2023

  32. [32]

    Beta calibration: a well-founded and easily implemented improvement on logistic calibration for binary classifiers

    Meelis Kull, Telmo Silva Filho, and Peter Flach. Beta calibration: a well-founded and easily implemented improvement on logistic calibration for binary classifiers. In Artificial intelligence and statistics, pp.\ 623--631. PMLR, 2017

  33. [33]

    Trainable calibration measures for neural networks from kernel mean embeddings

    Aviral Kumar, Sunita Sarawagi, and Ujjwal Jain. Trainable calibration measures for neural networks from kernel mean embeddings. In International Conference on Machine Learning, pp.\ 2805--2814. PMLR, 2018

  34. [34]

    arXiv preprint arXiv:2305.18404 , year=

    Bhawesh Kumar, Charlie Lu, Gauri Gupta, Anil Palepu, David Bellamy, Ramesh Raskar, and Andrew Beam. Conformal prediction with large language models for multi-choice question answering. arXiv preprint arXiv:2305.18404, 2023

  35. [35]

    Simple and scalable predictive uncertainty estimation using deep ensembles

    Balaji Lakshminarayanan, Alexander Pritzel, and Charles Blundell. Simple and scalable predictive uncertainty estimation using deep ensembles. Advances in neural information processing systems, 30, 2017

  36. [36]

    Automatic evaluation of machine translation quality using longest common subsequence and skip-bigram statistics

    Chin-Yew Lin and Franz Josef Och. Automatic evaluation of machine translation quality using longest common subsequence and skip-bigram statistics. In Proceedings of the 42nd annual meeting of the association for computational linguistics (ACL-04), pp.\ 605--612, 2004

  37. [37]

    Generating with confidence: Uncertainty quantification for black-box large language models

    Zhen Lin, Shubhendu Trivedi, and Jimeng Sun. Generating with confidence: Uncertainty quantification for black-box large language models. Transactions on Machine Learning Research, 2024

  38. [38]

    Visual instruction tuning

    Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. Advances in neural information processing systems, 36, 2024 a

  39. [39]

    Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing

    Pengfei Liu, Weizhe Yuan, Jinlan Fu, Zhengbao Jiang, Hiroaki Hayashi, and Graham Neubig. Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing. ACM Computing Surveys, 55 0 (9): 0 1--35, 2023

  40. [40]

    Litcab: Lightweight language model calibration over short-and long-form responses

    Xin Liu, Muhammad Khalifa, and Lu Wang. Litcab: Lightweight language model calibration over short-and long-form responses. In The Twelfth International Conference on Learning Representations, 2024 b

  41. [41]

    Decoupled weight decay regularization

    Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In International Conference on Learning Representations, 2019

  42. [42]

    Hallucination-free? assessing the reliability of leading ai legal research tools

    Varun Magesh, Faiz Surani, Matthew Dahl, Mirac Suzgun, Christopher D Manning, and Daniel E Ho. Hallucination-free? assessing the reliability of leading ai legal research tools. arXiv preprint arXiv:2405.20362, 2024

  43. [43]

    Uncertainty estimation in autoregressive structured prediction

    Andrey Malinin and Mark Gales. Uncertainty estimation in autoregressive structured prediction. In International Conference on Learning Representations, 2021

  44. [44]

    Peft: State-of-the-art parameter-efficient fine-tuning methods

    Sourab Mangrulkar, Sylvain Gugger, Lysandre Debut, Younes Belkada, Sayak Paul, and Benjamin Bossan. Peft: State-of-the-art parameter-efficient fine-tuning methods. https://github.com/huggingface/peft, 2022

  45. [45]

    Ok-vqa: A visual question answering benchmark requiring external knowledge

    Kenneth Marino, Mohammad Rastegari, Ali Farhadi, and Roozbeh Mottaghi. Ok-vqa: A visual question answering benchmark requiring external knowledge. In Proceedings of the IEEE/cvf conference on computer vision and pattern recognition, pp.\ 3195--3204, 2019

  46. [46]

    Factscore: Fine-grained atomic evaluation of factual precision in long form text generation

    Sewon Min, Kalpesh Krishna, Xinxi Lyu, Mike Lewis, Wen-tau Yih, Pang Wei Koh, Mohit Iyyer, Luke Zettlemoyer, and Hannaneh Hajishirzi. Factscore: Fine-grained atomic evaluation of factual precision in long form text generation. In The 2023 Conference on Empirical Methods in Natural Language Processing, 2023

  47. [47]

    Revisiting the calibration of modern neural networks

    Matthias Minderer, Josip Djolonga, Rob Romijnders, Frances Hubis, Xiaohua Zhai, Neil Houlsby, Dustin Tran, and Mario Lucic. Revisiting the calibration of modern neural networks. Advances in Neural Information Processing Systems, 34: 0 15682--15694, 2021

  48. [48]

    Calibrating deep neural networks using focal loss

    Jishnu Mukhoti, Viveka Kulharia, Amartya Sanyal, Stuart Golodetz, Philip Torr, and Puneet Dokania. Calibrating deep neural networks using focal loss. Advances in Neural Information Processing Systems, 33: 0 15288--15299, 2020

  49. [49]

    Machine learning: a probabilistic perspective

    Kevin P Murphy. Machine learning: a probabilistic perspective. MIT press, 2012

  50. [50]

    Accuracy-rejection curves (arcs) for comparing classification methods with a reject option

    Malik Sajjad Ahmed Nadeem, Jean-Daniel Zucker, and Blaise Hanczar. Accuracy-rejection curves (arcs) for comparing classification methods with a reject option. In Machine Learning in Systems Biology, pp.\ 65--81. PMLR, 2009

  51. [51]

    Obtaining well calibrated probabilities using bayesian binning

    Mahdi Pakdaman Naeini, Gregory Cooper, and Milos Hauskrecht. Obtaining well calibrated probabilities using bayesian binning. In Proceedings of the AAAI conference on artificial intelligence, volume 29, 2015

  52. [52]

    Measuring calibration in deep learning

    Jeremy Nixon, Michael W Dusenberry, Linchuan Zhang, Ghassen Jerfel, and Dustin Tran. Measuring calibration in deep learning. In CVPR workshops, 2019

  53. [53]

    Softmax probabilities (mostly) predict large language model correctness on multiple-choice q&a

    Benjamin Plaut, Khanh Nguyen, and Tu Trinh. Softmax probabilities (mostly) predict large language model correctness on multiple-choice q&a. arXiv preprint arXiv:2402.13213, 2024

  54. [54]

    Coqa: A conversational question answering challenge

    Siva Reddy, Danqi Chen, and Christopher D Manning. Coqa: A conversational question answering challenge. Transactions of the Association for Computational Linguistics, 7: 0 249--266, 2019

  55. [55]

    Out-of-distribution detection and selective generation for conditional language models

    Jie Ren, Jiaming Luo, Yao Zhao, Kundan Krishna, Mohammad Saleh, Balaji Lakshminarayanan, and Peter J Liu. Out-of-distribution detection and selective generation for conditional language models. In The Eleventh International Conference on Learning Representations, 2022

  56. [56]

    The precision-recall plot is more informative than the roc plot when evaluating binary classifiers on imbalanced datasets

    Takaya Saito and Marc Rehmsmeier. The precision-recall plot is more informative than the roc plot when evaluating binary classifiers on imbalanced datasets. PloS one, 10 0 (3): 0 e0118432, 2015

  57. [57]

    On mixup training: Improved calibration and predictive uncertainty for deep neural networks

    Sunil Thulasidasan, Gopinath Chennupati, Jeff A Bilmes, Tanmoy Bhattacharya, and Sarah Michalak. On mixup training: Improved calibration and predictive uncertainty for deep neural networks. Advances in neural information processing systems, 32, 2019

  58. [58]

    Just ask for calibration: Strategies for eliciting calibrated confidence scores from language models fine-tuned with human feedback

    Katherine Tian, Eric Mitchell, Allan Zhou, Archit Sharma, Rafael Rafailov, Huaxiu Yao, Chelsea Finn, and Christopher D Manning. Just ask for calibration: Strategies for eliciting calibrated confidence scores from language models fine-tuned with human feedback. In The 2023 Conference on Empirical Methods in Natural Language Processing, 2023

  59. [59]

    Llama 2: Open Foundation and Fine-Tuned Chat Models

    Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023

  60. [60]

    Adamix: Mixture-of-adaptations for parameter-efficient model tuning

    Yaqing Wang, Sahaj Agarwal, Subhabrata Mukherjee, Xiaodong Liu, Jing Gao, Ahmed Hassan, and Jianfeng Gao. Adamix: Mixture-of-adaptations for parameter-efficient model tuning. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pp.\ 5744--5760, 2022

  61. [61]

    Neural text generation with unlikelihood training

    Sean Welleck, Ilia Kulikov, Stephen Roller, Emily Dinan, Kyunghyun Cho, and Jason Weston. Neural text generation with unlikelihood training. In International Conference on Learning Representations, 2020

  62. [62]

    Quantifying uncertainties in natural language processing tasks

    Yijun Xiao and William Yang Wang. Quantifying uncertainties in natural language processing tasks. In Proceedings of the AAAI conference on artificial intelligence, pp.\ 7322--7329, 2019

  63. [63]

    Can llms express their uncertainty? an empirical evaluation of confidence elicitation in llms

    Miao Xiong, Zhiyuan Hu, Xinyang Lu, YIFEI LI, Jie Fu, Junxian He, and Bryan Hooi. Can llms express their uncertainty? an empirical evaluation of confidence elicitation in llms. In The Twelfth International Conference on Learning Representations, 2024

  64. [64]

    Can we trust llms? mitigate overconfidence bias in llms through knowledge transfer

    Haoyan Yang, Yixuan Wang, Xingyin Xu, Hanyuan Zhang, and Yirong Bian. Can we trust llms? mitigate overconfidence bias in llms through knowledge transfer. arXiv preprint arXiv:2405.16856, 2024

  65. [65]

    Bertscore: Evaluating text generation with bert

    Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q Weinberger, and Yoav Artzi. Bertscore: Evaluating text generation with bert. In International Conference on Learning Representations, 2020

  66. [66]

    CRC standard probability and statistics tables and formulae

    Daniel Zwillinger and Stephen Kokoska. CRC standard probability and statistics tables and formulae. Crc Press, 1999

  67. [67]

    write newline

    " write newline "" before.all 'output.state := FUNCTION n.dashify 't := "" t empty not t #1 #1 substring "-" = t #1 #2 substring "--" = not "--" * t #2 global.max substring 't := t #1 #1 substring "-" = "-" * t #2 global.max substring 't := while if t #1 #1 substring * t #2 global.max substring 't := if while FUNCTION format.date year duplicate empty "emp...

  68. [68]

    @esa (Ref

    \@ifxundefined[1] #1\@undefined \@firstoftwo \@secondoftwo \@ifnum[1] #1 \@firstoftwo \@secondoftwo \@ifx[1] #1 \@firstoftwo \@secondoftwo [2] @ #1 \@temptokena #2 #1 @ \@temptokena \@ifclassloaded agu2001 natbib The agu2001 class already includes natbib coding, so you should not add it explicitly Type <Return> for now, but then later remove the command n...

  69. [69]

    \@lbibitem[] @bibitem@first@sw\@secondoftwo \@lbibitem[#1]#2 \@extra@b@citeb \@ifundefined br@#2\@extra@b@citeb \@namedef br@#2 \@nameuse br@#2\@extra@b@citeb \@ifundefined b@#2\@extra@b@citeb @num @parse #2 @tmp #1 NAT@b@open@#2 NAT@b@shut@#2 \@ifnum @merge>\@ne @bibitem@first@sw \@firstoftwo \@ifundefined NAT@b*@#2 \@firstoftwo @num @NAT@ctr \@secondoft...

  70. [70]

    @open @close @open @close and [1] URL: #1 \@ifundefined chapter * \@mkboth \@ifxundefined @sectionbib * \@mkboth * \@mkboth\@gobbletwo \@ifclassloaded amsart * \@ifclassloaded amsbook * \@ifxundefined @heading @heading NAT@ctr thebibliography [1] @ \@biblabel @NAT@ctr \@bibsetup #1 @NAT@ctr @ @openbib .11em \@plus.33em \@minus.07em 4000 4000 `\.\@m @bibit...