Large Language Models Are Overconfident in Their Own Responses

Katharina von der Wense; Manuel Mager; Mario Sanz-Guerrero

arxiv: 2606.03437 · v1 · pith:ADMVMVNOnew · submitted 2026-06-02 · 💻 cs.CL

Large Language Models Are Overconfident in Their Own Responses

Mario Sanz-Guerrero , Manuel Mager , Katharina von der Wense This is my paper

Pith reviewed 2026-06-28 10:12 UTC · model grok-4.3

classification 💻 cs.CL

keywords large language modelsmodel calibrationoverconfidenceownership biaschat templateinstruction tuningconfidence elicitation

0 comments

The pith

Instruction-tuned LLMs assign up to 26% higher confidence to answers they generated themselves than to identical answers framed as user input.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Instruction-tuned large language models are known to be poorly calibrated, but the conversational chat template adds a separate problem. Models exhibit an ownership bias in which they treat responses they produced as more credible than the exact same text presented by a user. Across multiple models, benchmarks, and confidence-elicitation methods, this bias produces up to 26% higher reported confidence. A minimal change at inference time—presenting the model's own answer as if it came from the user—removes most of the excess confidence and brings calibration closer to that of base models.

Core claim

Instruction tuning harms calibration, yet the chat template aggravates the problem through an ownership bias: models assign significantly higher confidence to their own responses than to identical responses attributed to a user. This effect reaches 26% across six open-weight LLMs, three benchmarks, and three elicitation methods. Reframing the model's answer as user input during confidence elicitation reduces overconfidence and improves calibration by up to 26% without retraining.

What carries the argument

Ownership bias: the systematic elevation of reported confidence when an answer is framed as the model's own output rather than as user-provided text.

If this is right

Models assign up to 26% higher confidence to their own responses than to identical user-provided answers.
Reframing the model's answer as user input during confidence elicitation cuts overconfidence.
The same reframing improves calibration by up to 26%.
The improvement narrows the calibration gap between base and instruction-tuned models.
The ownership bias can be addressed at inference time without retraining.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same framing change may reduce over in other self-evaluation tasks such as self-correction or uncertainty quantification.
The bias could be measured in closed-source models that expose only chat interfaces.
If the effect traces to patterns in chat data, it may appear even when different prompts are used to elicit confidence.

Load-bearing premise

The tested models and benchmarks allow the chat-template effect to be isolated from the separate effects of instruction tuning.

What would settle it

A direct comparison in which the identical answer receives statistically indistinguishable confidence scores when framed once as the model's own response and once as user input would falsify the ownership bias.

Figures

Figures reproduced from arXiv: 2606.03437 by Katharina von der Wense, Manuel Mager, Mario Sanz-Guerrero.

**Figure 1.** Figure 1: LLMs are overconfident in their own answers, regardless of whether they are correct or not, leading to miscalibration. The figure represents real outputs from Llama 3.1 (8B). We address this question in four steps: 1) We investigate whether the reduced calibration stems from the training algorithm or the prompting style by isolating the effects of instruction tuning and the chat template (frequently introd… view at source ↗

**Figure 2.** Figure 2: Prompts used to evaluate models in Section [PITH_FULL_IMAGE:figures/full_fig_p002_2.png] view at source ↗

**Figure 3.** Figure 3: Prompts used to measure confidence in an [PITH_FULL_IMAGE:figures/full_fig_p005_3.png] view at source ↗

**Figure 4.** Figure 4: Reliability diagrams of all models using three [PITH_FULL_IMAGE:figures/full_fig_p006_4.png] view at source ↗

**Figure 5.** Figure 5: Distribution of confidence scores for answers provided by the assistant and by the user, aggregated across [PITH_FULL_IMAGE:figures/full_fig_p007_5.png] view at source ↗

**Figure 7.** Figure 7: Average total confidence summed across all [PITH_FULL_IMAGE:figures/full_fig_p007_7.png] view at source ↗

**Figure 8.** Figure 8: Prompts used to measure confidence in an [PITH_FULL_IMAGE:figures/full_fig_p012_8.png] view at source ↗

**Figure 9.** Figure 9: Prompts used to measure confidence in an [PITH_FULL_IMAGE:figures/full_fig_p012_9.png] view at source ↗

read the original abstract

Prior work has shown that instruction-tuned large language models (LLMs) are less well calibrated than their base pre-trained counterparts. However, little is known about the frequently used chat template's effect on the calibration of conversational LLMs. In this work, we investigate the mechanisms driving this miscalibration by decoupling the effects of the post-training algorithm and the chat format. We find that, while instruction tuning fundamentally harms calibration, the chat template aggravates the issue through an "ownership bias" -- models are significantly more confident in their own answers than in identical answers provided by a user. Extensive experiments across six recent open-weight LLMs, three benchmarks, and three confidence elicitation methods show that models assign up to 26% higher confidence to their own responses. Leveraging this insight, we propose a simple inference-time strategy: framing the model's answer as user input during confidence elicitation. This approach significantly reduces overconfidence and improves calibration by up to 26% without the need for retraining, narrowing the gap between base and instruction-tuned models.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Chat templates create a measurable ownership bias that worsens calibration beyond instruction tuning alone, and the paper's reframing fix is a practical mitigation worth checking.

read the letter

The main point is that models give their own answers up to 26% higher than identical text presented as user input, and treating the model's output as user input at inference time cuts much of that overconfidence. The work separates the chat template from instruction tuning and tests the effect across six open-weight models, three benchmarks, and three confidence methods.

What stands out is the clean framing of the ownership bias and the fact that the proposed fix needs no retraining. Running the same idea on multiple recent models and elicitation styles gives some breadth, and the practical angle—narrowing the gap to base models without changing weights—is useful for anyone deploying these systems.

The decoupling step is the part that needs the closest look. The claim rests on showing the template effect holds after controlling for tuning, which requires the four combinations to be runnable. If base models cannot take the chat template without side changes, the attribution to the template alone is harder to isolate. The abstract states the result, but the actual tables and controls will decide whether the 26% gap is cleanly tied to ownership rather than other factors.

This is the sort of paper that matters for people who run chat models in production and care about calibration. It does not rewrite the literature on overconfidence, but it gives a testable lever that can be tried immediately. The experiments are broad enough and the intervention simple enough that a referee could verify the numbers without heroic effort.

I would send it for peer review.

Referee Report

3 major / 1 minor

Summary. The paper claims that while instruction tuning harms calibration in LLMs, the chat template further aggravates miscalibration via an 'ownership bias' in which models assign up to 26% higher confidence to their own responses than to identical content framed as user input. Across six open-weight LLMs, three benchmarks, and three elicitation methods, the authors isolate this effect by decoupling post-training from chat format and propose a simple inference-time intervention (reframing the model's answer as user input) that reduces overconfidence and improves calibration by up to 26%.

Significance. If the isolation of the chat-template effect holds, the work supplies a concrete, training-free mechanism for a known calibration gap and a practical fix that narrows the base-vs-tuned difference. The empirical scale (six models, multiple benchmarks and elicitation protocols) and the falsifiable prediction that reframing reduces the bias are strengths.

major comments (3)

[Experiments] Experiments section (and any supplementary tables): the central attribution of the 26% confidence gap to the chat template alone requires explicit verification that all four combinations (base vs. instruction-tuned × with vs. without chat template) were tested on every model. If base models cannot accept the template without additional prompting changes, the decoupling is incomplete and the ownership-bias claim is not isolated.
[Results] Results (the 26% figures): the reported improvements must be accompanied by per-benchmark, per-elicitation-method error bars or confidence intervals and by statistical tests against the no-reframing baseline; without these, it is impossible to judge whether the reduction is robust or driven by a subset of conditions.
[Proposed method] § on the proposed inference-time strategy: the manuscript should report whether the reframing intervention changes the actual answer content or only the elicited confidence score, because any change in answer distribution would confound the calibration improvement claim.

minor comments (1)

[Abstract] Clarify in the abstract and introduction whether the three confidence elicitation methods are applied identically to base and tuned models or whether prompt formatting differs.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. The comments highlight important aspects of experimental clarity, statistical reporting, and methodological transparency that we will address in revision. Below we respond point-by-point to the major comments.

read point-by-point responses

Referee: [Experiments] Experiments section (and any supplementary tables): the central attribution of the 26% confidence gap to the chat template alone requires explicit verification that all four combinations (base vs. instruction-tuned × with vs. without chat template) were tested on every model. If base models cannot accept the template without additional prompting changes, the decoupling is incomplete and the ownership-bias claim is not isolated.

Authors: The decoupling of post-training and chat format is central to our claims, and the experimental design did evaluate the four combinations wherever the base models could process the chat template (using the identical template strings as the tuned versions, with no additional system prompts). Results for these conditions appear in the main experiments and supplementary tables, though we agree the presentation could make the four-way breakdown more explicit. In the revision we will add a dedicated table (or expanded supplementary table) that explicitly lists results for base-with-template, base-without-template, tuned-with-template, and tuned-without-template for each model and benchmark. revision: yes
Referee: [Results] Results (the 26% figures): the reported improvements must be accompanied by per-benchmark, per-elicitation-method error bars or confidence intervals and by statistical tests against the no-reframing baseline; without these, it is impossible to judge whether the reduction is robust or driven by a subset of conditions.

Authors: We agree that error bars and statistical tests would strengthen the presentation of the 26% figures. The current manuscript reports aggregate improvements but does not include per-benchmark/per-method confidence intervals or formal tests. In the revised version we will add standard-error bars (computed across the three benchmarks or repeated runs where applicable) to all relevant figures and tables, and we will report paired statistical tests (e.g., Wilcoxon signed-rank or t-tests) comparing the reframing condition against the no-reframing baseline for each elicitation method. revision: yes
Referee: [Proposed method] § on the proposed inference-time strategy: the manuscript should report whether the reframing intervention changes the actual answer content or only the elicited confidence score, because any change in answer distribution would confound the calibration improvement claim.

Authors: The reframing step occurs strictly after answer generation: the model first produces its response using the standard chat template, and only the subsequent confidence-elicitation prompt is modified to present that same answer as user input. Consequently the answer distribution itself is unchanged; only the numeric confidence score is affected. We will insert a clarifying sentence (and, if space permits, a short illustrative example) in the proposed-method section to make this separation explicit. revision: yes

Circularity Check

0 steps flagged

Empirical measurement study with no derivations or self-referential predictions

full rationale

The paper conducts direct experimental measurements of confidence scores across six LLMs, three benchmarks, and three elicitation methods, comparing responses under different template and tuning conditions. No equations, first-principles derivations, or fitted parameters are used to define or predict the target quantities (e.g., the 26% confidence gap or calibration improvement). The 'ownership bias' is reported as an observed empirical pattern, not derived from prior self-citations or ansatzes. The proposed inference-time framing strategy is a straightforward application of the measured effect rather than a circular redefinition. This is a standard non-circular empirical study; the decoupling experiments, while potentially incomplete per external critique, do not reduce to self-definition or fitted-input renaming.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Empirical study; no free parameters, axioms, or invented entities are introduced. The work relies on standard assumptions of calibration measurement in machine learning.

pith-pipeline@v0.9.1-grok · 5713 in / 1149 out tokens · 18968 ms · 2026-06-28T10:12:31.067931+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

34 extracted references · 12 canonical work pages

[1]

Glenn W. Brier. 1950. https://doi.org/10.1175/1520-0493(1950)078<0001:vofeit>2.0.co;2 Verification of forecasts expressed in terms of probability . Monthly Weather Review, 78(1):1--3

work page doi:10.1175/1520-0493(1950)078 1950
[2]

Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. 2021. https://arxiv.org/abs/2110.14168 Training verifiers to solve math word problems . Preprint, arXiv:2110.14168

Pith/arXiv arXiv 2021
[3]

Bradley Efron and Robert J Tibshirani. 1994. https://doi.org/10.1201/9780429246593 An introduction to the bootstrap . Chapman and Hall/CRC

work page doi:10.1201/9780429246593 1994
[4]

Gemma Team , Aishwarya Kamath, Johan Ferret, Shreya Pathak, Nino Vieillard, Ramona Merhej, Sarah Perrin, Tatiana Matejovicova, Alexandre Ramé, Morgane Rivière, Louis Rouillard, Thomas Mesnard, Geoffrey Cideron, Jean bastien Grill, Sabela Ramos, Edouard Yvinec, Michelle Casbon, Etienne Pot, Ivo Penchev, and 197 others. 2025. https://arxiv.org/abs/2503.1978...

Pith/arXiv arXiv 2025
[5]

Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, Amy Yang, Angela Fan, Anirudh Goyal, Anthony Hartshorn, Aobo Yang, Archi Mitra, Archie Sravankumar, Artem Korenev, Arthur Hinsvark, and 542 others. 2024. https://arxiv.org/abs/2407.21783 The Llama 3...

Pith/arXiv arXiv 2024
[6]

Weinberger

Chuan Guo, Geoff Pleiss, Yu Sun, and Kilian Q. Weinberger. 2017. https://proceedings.mlr.press/v70/guo17a.html On calibration of modern neural networks . In Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 1321--1330. PMLR

2017
[7]

Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. 2021. https://openreview.net/forum?id=d7KBjmI3GmQ Measuring massive multitask language understanding . In International Conference on Learning Representations

2021
[8]

Saurav Kadavath, Tom Conerly, Amanda Askell, Tom Henighan, Dawn Drain, Ethan Perez, Nicholas Schiefer, Zac Hatfield-Dodds, Nova DasSarma, Eli Tran-Johnson, Scott Johnston, Sheer El-Showk, Andy Jones, Nelson Elhage, Tristan Hume, Anna Chen, Yuntao Bai, Sam Bowman, Stanislav Fort, and 17 others. 2022. https://arxiv.org/abs/2207.05221 Language models (mostly...

Pith/arXiv arXiv 2022
[9]

Vempala, and Edwin Zhang

Adam Tauman Kalai, Ofir Nachum, Santosh S. Vempala, and Edwin Zhang. 2025. https://arxiv.org/abs/2509.04664 Why language models hallucinate . Preprint, arXiv:2509.04664

Pith/arXiv arXiv 2025
[10]

Jixuan Leng, Chengsong Huang, Banghua Zhu, and Jiaxin Huang. 2025. https://openreview.net/forum?id=l0tg0jzsdL Taming overconfidence in LLM s: Reward calibration in RLHF . In The Thirteenth International Conference on Learning Representations

2025
[11]

Rensis Likert. 1932. A technique for the measurement of attitudes. Archives of Psychology, 22(140):55

1932
[12]

Stephanie Lin, Jacob Hilton, and Owain Evans. 2022 a . https://openreview.net/forum?id=8s8K2UZGTZ Teaching models to express their uncertainty in words . Transactions on Machine Learning Research

2022
[13]

Stephanie Lin, Jacob Hilton, and Owain Evans. 2022 b . https://doi.org/10.18653/v1/2022.acl-long.229 T ruthful QA : Measuring how models mimic human falsehoods . In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 3214--3252, Dublin, Ireland. Association for Computational Linguistics

work page doi:10.18653/v1/2022.acl-long.229 2022
[14]

Beier Luo, Shuoyuan Wang, Sharon Li, and Hongxin Wei. 2025. https://openreview.net/forum?id=I4PJYZvfW5 Your pre-trained LLM is secretly an unsupervised confidence calibrator . In The Thirty-ninth Annual Conference on Neural Information Processing Systems

2025
[15]

Mielke, Arthur Szlam, Emily Dinan, and Y-Lan Boureau

Sabrina J. Mielke, Arthur Szlam, Emily Dinan, and Y-Lan Boureau. 2022. https://doi.org/10.1162/tacl_a_00494 Reducing conversational agents' overconfidence through linguistic calibration . Transactions of the Association for Computational Linguistics, 10:857--872

work page doi:10.1162/tacl_a_00494 2022
[16]

Preetum Nakkiran, Arwen Bradley, Adam Goliński, Eugene Ndiaye, Michael Kirchhof, and Sinead Williamson. 2025. https://arxiv.org/abs/2511.04869 Trained on tokens, calibrated on concepts: The emergence of semantic calibration in LLM s . Preprint, arXiv:2511.04869

arXiv 2025
[17]

OpenAI, Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, Red Avila, Igor Babuschkin, Suchir Balaji, Valerie Balcom, Paul Baltescu, Haiming Bao, Mohammad Bavarian, Jeff Belgum, and 262 others. 2024. https://arxiv.org/abs/2303.08774 GPT -4 technical ...

Pith/arXiv arXiv 2024
[18]

Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul F Christiano, Jan Leike, and Ryan Lowe. 2022. https://proceedings.neurips.cc/paper_files/paper/2022/file/b...

2022
[19]

Mahdi Pakdaman Naeini, Gregory Cooper, and Milos Hauskrecht. 2015. https://doi.org/10.1609/aaai.v29i1.9602 Obtaining well calibrated probabilities using bayesian binning . Proceedings of the AAAI Conference on Artificial Intelligence, 29(1)

work page doi:10.1609/aaai.v29i1.9602 2015
[20]

Ethan Perez, Sam Ringer, Kamile Lukosiute, Karina Nguyen, Edwin Chen, Scott Heiner, Craig Pettit, Catherine Olsson, Sandipan Kundu, Saurav Kadavath, Andy Jones, Anna Chen, Benjamin Mann, Brian Israel, Bryan Seethor, Cameron McKinnon, Christopher Olah, Da Yan, Daniela Amodei, and 44 others. 2023. https://doi.org/10.18653/v1/2023.findings-acl.847 Discoverin...

work page doi:10.18653/v1/2023.findings-acl.847 2023
[21]

Mario Sanz-Guerrero, Minh Duc Bui, and Katharina von der Wense. 2025. https://doi.org/10.18653/v1/2025.emnlp-main.988 Mind the gap: A closer look at tokenization for multiple-choice question answering with LLM s . In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 19573--19583, Suzhou, China. Association for C...

work page doi:10.18653/v1/2025.emnlp-main.988 2025
[22]

Mario Sanz-Guerrero and Katharina von der Wense. 2025. https://doi.org/10.18653/v1/2025.ijcnlp-long.78 Mitigating label length bias in large language models . In Proceedings of the 14th International Joint Conference on Natural Language Processing and the 4th Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics, pages 14...

work page doi:10.18653/v1/2025.ijcnlp-long.78 2025
[23]

John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. 2017. https://arxiv.org/abs/1707.06347 Proximal policy optimization algorithms . Preprint, arXiv:1707.06347

Pith/arXiv arXiv 2017
[24]

Sagi Shaier, Mario Sanz-Guerrero, and Katharina von der Wense. 2025. https://arxiv.org/abs/2412.07923 Asking again and again: Exploring llm robustness to repeated questions . Preprint, arXiv:2412.07923

arXiv 2025
[25]

Mrinank Sharma, Meg Tong, Tomasz Korbak, David Duvenaud, Amanda Askell, Samuel R. Bowman, Esin Durmus, Zac Hatfield-Dodds, Scott R Johnston, Shauna M Kravec, Timothy Maxwell, Sam McCandlish, Kamal Ndousse, Oliver Rausch, Nicholas Schiefer, Da Yan, Miranda Zhang, and Ethan Perez. 2024. https://openreview.net/forum?id=tvhaxkMKAn Towards understanding sycoph...

2024
[26]

Katherine Tian, Eric Mitchell, Allan Zhou, Archit Sharma, Rafael Rafailov, Huaxiu Yao, Chelsea Finn, and Christopher Manning. 2023. https://doi.org/10.18653/v1/2023.emnlp-main.330 Just ask for calibration: Strategies for eliciting calibrated confidence scores from language models fine-tuned with human feedback . In Proceedings of the 2023 Conference on Em...

work page doi:10.18653/v1/2023.emnlp-main.330 2023
[27]

Dennis Ulmer, Martin Gubri, Hwaran Lee, Sangdoo Yun, and Seong Oh. 2024. https://doi.org/10.18653/v1/2024.acl-long.824 Calibrating large language models using their generations only . In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 15440--15459, Bangkok, Thailand. Association for Co...

work page doi:10.18653/v1/2024.acl-long.824 2024
[28]

Xinpeng Wang, Bolei Ma, Chengzhi Hu, Leon Weber-Genzel, Paul R \"o ttger, Frauke Kreuter, Dirk Hovy, and Barbara Plank. 2024. https://doi.org/10.18653/v1/2024.findings-acl.441 `` My answer is C '': First-token probabilities do not match text answers in instruction-tuned language models . In Findings of the Association for Computational Linguistics: ACL 20...

work page doi:10.18653/v1/2024.findings-acl.441 2024
[29]

Jerry Wei, Da Huang, Yifeng Lu, Denny Zhou, and Quoc V. Le. 2024. https://arxiv.org/abs/2308.03958 Simple synthetic data reduces sycophancy in large language models . Preprint, arXiv:2308.03958

Pith/arXiv arXiv 2024
[30]

Frank Wilcoxon. 1945. http://www.jstor.org/stable/3001968 Individual comparisons by ranking methods . Biometrics Bulletin, 1(6):80--83

arXiv 1945
[31]

Jiancong Xiao, Bojian Hou, Zhanliang Wang, Ruochen Jin, Qi Long, Weijie J Su, and Li Shen. 2025. https://openreview.net/forum?id=51tMpvPNSm Restoring calibration for aligned large language models: A calibration-aware fine-tuning approach . In Forty-second International Conference on Machine Learning

2025
[32]

Miao Xiong, Zhiyuan Hu, Xinyang Lu, Yifei Li, Jie Fu, Junxian He, and Bryan Hooi. 2024. https://openreview.net/forum?id=gjeQKFxFpZ Can LLM s express their uncertainty? An empirical evaluation of confidence elicitation in LLM s . In The Twelfth International Conference on Learning Representations

2024
[33]

An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, Chujie Zheng, Dayiheng Liu, Fan Zhou, Fei Huang, Feng Hu, Hao Ge, Haoran Wei, Huan Lin, Jialong Tang, and 41 others. 2025. https://arxiv.org/abs/2505.09388 Qwen3 technical report . Preprint, arXiv:2505.09388

Pith/arXiv arXiv 2025
[34]

Chiwei Zhu, Benfeng Xu, Quan Wang, Yongdong Zhang, and Zhendong Mao. 2023. https://doi.org/10.18653/v1/2023.findings-emnlp.654 On the calibration of large language models and alignment . In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 9778--9795, Singapore. Association for Computational Linguistics

work page doi:10.18653/v1/2023.findings-emnlp.654 2023

[1] [1]

Glenn W. Brier. 1950. https://doi.org/10.1175/1520-0493(1950)078<0001:vofeit>2.0.co;2 Verification of forecasts expressed in terms of probability . Monthly Weather Review, 78(1):1--3

work page doi:10.1175/1520-0493(1950)078 1950

[2] [2]

Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. 2021. https://arxiv.org/abs/2110.14168 Training verifiers to solve math word problems . Preprint, arXiv:2110.14168

Pith/arXiv arXiv 2021

[3] [3]

Bradley Efron and Robert J Tibshirani. 1994. https://doi.org/10.1201/9780429246593 An introduction to the bootstrap . Chapman and Hall/CRC

work page doi:10.1201/9780429246593 1994

[4] [4]

Gemma Team , Aishwarya Kamath, Johan Ferret, Shreya Pathak, Nino Vieillard, Ramona Merhej, Sarah Perrin, Tatiana Matejovicova, Alexandre Ramé, Morgane Rivière, Louis Rouillard, Thomas Mesnard, Geoffrey Cideron, Jean bastien Grill, Sabela Ramos, Edouard Yvinec, Michelle Casbon, Etienne Pot, Ivo Penchev, and 197 others. 2025. https://arxiv.org/abs/2503.1978...

Pith/arXiv arXiv 2025

[5] [5]

Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, Amy Yang, Angela Fan, Anirudh Goyal, Anthony Hartshorn, Aobo Yang, Archi Mitra, Archie Sravankumar, Artem Korenev, Arthur Hinsvark, and 542 others. 2024. https://arxiv.org/abs/2407.21783 The Llama 3...

Pith/arXiv arXiv 2024

[6] [6]

Weinberger

Chuan Guo, Geoff Pleiss, Yu Sun, and Kilian Q. Weinberger. 2017. https://proceedings.mlr.press/v70/guo17a.html On calibration of modern neural networks . In Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 1321--1330. PMLR

2017

[7] [7]

Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. 2021. https://openreview.net/forum?id=d7KBjmI3GmQ Measuring massive multitask language understanding . In International Conference on Learning Representations

2021

[8] [8]

Saurav Kadavath, Tom Conerly, Amanda Askell, Tom Henighan, Dawn Drain, Ethan Perez, Nicholas Schiefer, Zac Hatfield-Dodds, Nova DasSarma, Eli Tran-Johnson, Scott Johnston, Sheer El-Showk, Andy Jones, Nelson Elhage, Tristan Hume, Anna Chen, Yuntao Bai, Sam Bowman, Stanislav Fort, and 17 others. 2022. https://arxiv.org/abs/2207.05221 Language models (mostly...

Pith/arXiv arXiv 2022

[9] [9]

Vempala, and Edwin Zhang

Adam Tauman Kalai, Ofir Nachum, Santosh S. Vempala, and Edwin Zhang. 2025. https://arxiv.org/abs/2509.04664 Why language models hallucinate . Preprint, arXiv:2509.04664

Pith/arXiv arXiv 2025

[10] [10]

Jixuan Leng, Chengsong Huang, Banghua Zhu, and Jiaxin Huang. 2025. https://openreview.net/forum?id=l0tg0jzsdL Taming overconfidence in LLM s: Reward calibration in RLHF . In The Thirteenth International Conference on Learning Representations

2025

[11] [11]

Rensis Likert. 1932. A technique for the measurement of attitudes. Archives of Psychology, 22(140):55

1932

[12] [12]

Stephanie Lin, Jacob Hilton, and Owain Evans. 2022 a . https://openreview.net/forum?id=8s8K2UZGTZ Teaching models to express their uncertainty in words . Transactions on Machine Learning Research

2022

[13] [13]

Stephanie Lin, Jacob Hilton, and Owain Evans. 2022 b . https://doi.org/10.18653/v1/2022.acl-long.229 T ruthful QA : Measuring how models mimic human falsehoods . In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 3214--3252, Dublin, Ireland. Association for Computational Linguistics

work page doi:10.18653/v1/2022.acl-long.229 2022

[14] [14]

Beier Luo, Shuoyuan Wang, Sharon Li, and Hongxin Wei. 2025. https://openreview.net/forum?id=I4PJYZvfW5 Your pre-trained LLM is secretly an unsupervised confidence calibrator . In The Thirty-ninth Annual Conference on Neural Information Processing Systems

2025

[15] [15]

Mielke, Arthur Szlam, Emily Dinan, and Y-Lan Boureau

Sabrina J. Mielke, Arthur Szlam, Emily Dinan, and Y-Lan Boureau. 2022. https://doi.org/10.1162/tacl_a_00494 Reducing conversational agents' overconfidence through linguistic calibration . Transactions of the Association for Computational Linguistics, 10:857--872

work page doi:10.1162/tacl_a_00494 2022

[16] [16]

Preetum Nakkiran, Arwen Bradley, Adam Goliński, Eugene Ndiaye, Michael Kirchhof, and Sinead Williamson. 2025. https://arxiv.org/abs/2511.04869 Trained on tokens, calibrated on concepts: The emergence of semantic calibration in LLM s . Preprint, arXiv:2511.04869

arXiv 2025

[17] [17]

OpenAI, Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, Red Avila, Igor Babuschkin, Suchir Balaji, Valerie Balcom, Paul Baltescu, Haiming Bao, Mohammad Bavarian, Jeff Belgum, and 262 others. 2024. https://arxiv.org/abs/2303.08774 GPT -4 technical ...

Pith/arXiv arXiv 2024

[18] [18]

Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul F Christiano, Jan Leike, and Ryan Lowe. 2022. https://proceedings.neurips.cc/paper_files/paper/2022/file/b...

2022

[19] [19]

Mahdi Pakdaman Naeini, Gregory Cooper, and Milos Hauskrecht. 2015. https://doi.org/10.1609/aaai.v29i1.9602 Obtaining well calibrated probabilities using bayesian binning . Proceedings of the AAAI Conference on Artificial Intelligence, 29(1)

work page doi:10.1609/aaai.v29i1.9602 2015

[20] [20]

Ethan Perez, Sam Ringer, Kamile Lukosiute, Karina Nguyen, Edwin Chen, Scott Heiner, Craig Pettit, Catherine Olsson, Sandipan Kundu, Saurav Kadavath, Andy Jones, Anna Chen, Benjamin Mann, Brian Israel, Bryan Seethor, Cameron McKinnon, Christopher Olah, Da Yan, Daniela Amodei, and 44 others. 2023. https://doi.org/10.18653/v1/2023.findings-acl.847 Discoverin...

work page doi:10.18653/v1/2023.findings-acl.847 2023

[21] [21]

Mario Sanz-Guerrero, Minh Duc Bui, and Katharina von der Wense. 2025. https://doi.org/10.18653/v1/2025.emnlp-main.988 Mind the gap: A closer look at tokenization for multiple-choice question answering with LLM s . In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 19573--19583, Suzhou, China. Association for C...

work page doi:10.18653/v1/2025.emnlp-main.988 2025

[22] [22]

Mario Sanz-Guerrero and Katharina von der Wense. 2025. https://doi.org/10.18653/v1/2025.ijcnlp-long.78 Mitigating label length bias in large language models . In Proceedings of the 14th International Joint Conference on Natural Language Processing and the 4th Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics, pages 14...

work page doi:10.18653/v1/2025.ijcnlp-long.78 2025

[23] [23]

John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. 2017. https://arxiv.org/abs/1707.06347 Proximal policy optimization algorithms . Preprint, arXiv:1707.06347

Pith/arXiv arXiv 2017

[24] [24]

Sagi Shaier, Mario Sanz-Guerrero, and Katharina von der Wense. 2025. https://arxiv.org/abs/2412.07923 Asking again and again: Exploring llm robustness to repeated questions . Preprint, arXiv:2412.07923

arXiv 2025

[25] [25]

Mrinank Sharma, Meg Tong, Tomasz Korbak, David Duvenaud, Amanda Askell, Samuel R. Bowman, Esin Durmus, Zac Hatfield-Dodds, Scott R Johnston, Shauna M Kravec, Timothy Maxwell, Sam McCandlish, Kamal Ndousse, Oliver Rausch, Nicholas Schiefer, Da Yan, Miranda Zhang, and Ethan Perez. 2024. https://openreview.net/forum?id=tvhaxkMKAn Towards understanding sycoph...

2024

[26] [26]

Katherine Tian, Eric Mitchell, Allan Zhou, Archit Sharma, Rafael Rafailov, Huaxiu Yao, Chelsea Finn, and Christopher Manning. 2023. https://doi.org/10.18653/v1/2023.emnlp-main.330 Just ask for calibration: Strategies for eliciting calibrated confidence scores from language models fine-tuned with human feedback . In Proceedings of the 2023 Conference on Em...

work page doi:10.18653/v1/2023.emnlp-main.330 2023

[27] [27]

Dennis Ulmer, Martin Gubri, Hwaran Lee, Sangdoo Yun, and Seong Oh. 2024. https://doi.org/10.18653/v1/2024.acl-long.824 Calibrating large language models using their generations only . In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 15440--15459, Bangkok, Thailand. Association for Co...

work page doi:10.18653/v1/2024.acl-long.824 2024

[28] [28]

Xinpeng Wang, Bolei Ma, Chengzhi Hu, Leon Weber-Genzel, Paul R \"o ttger, Frauke Kreuter, Dirk Hovy, and Barbara Plank. 2024. https://doi.org/10.18653/v1/2024.findings-acl.441 `` My answer is C '': First-token probabilities do not match text answers in instruction-tuned language models . In Findings of the Association for Computational Linguistics: ACL 20...

work page doi:10.18653/v1/2024.findings-acl.441 2024

[29] [29]

Jerry Wei, Da Huang, Yifeng Lu, Denny Zhou, and Quoc V. Le. 2024. https://arxiv.org/abs/2308.03958 Simple synthetic data reduces sycophancy in large language models . Preprint, arXiv:2308.03958

Pith/arXiv arXiv 2024

[30] [30]

Frank Wilcoxon. 1945. http://www.jstor.org/stable/3001968 Individual comparisons by ranking methods . Biometrics Bulletin, 1(6):80--83

arXiv 1945

[31] [31]

Jiancong Xiao, Bojian Hou, Zhanliang Wang, Ruochen Jin, Qi Long, Weijie J Su, and Li Shen. 2025. https://openreview.net/forum?id=51tMpvPNSm Restoring calibration for aligned large language models: A calibration-aware fine-tuning approach . In Forty-second International Conference on Machine Learning

2025

[32] [32]

Miao Xiong, Zhiyuan Hu, Xinyang Lu, Yifei Li, Jie Fu, Junxian He, and Bryan Hooi. 2024. https://openreview.net/forum?id=gjeQKFxFpZ Can LLM s express their uncertainty? An empirical evaluation of confidence elicitation in LLM s . In The Twelfth International Conference on Learning Representations

2024

[33] [33]

An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, Chujie Zheng, Dayiheng Liu, Fan Zhou, Fei Huang, Feng Hu, Hao Ge, Haoran Wei, Huan Lin, Jialong Tang, and 41 others. 2025. https://arxiv.org/abs/2505.09388 Qwen3 technical report . Preprint, arXiv:2505.09388

Pith/arXiv arXiv 2025

[34] [34]

Chiwei Zhu, Benfeng Xu, Quan Wang, Yongdong Zhang, and Zhendong Mao. 2023. https://doi.org/10.18653/v1/2023.findings-emnlp.654 On the calibration of large language models and alignment . In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 9778--9795, Singapore. Association for Computational Linguistics

work page doi:10.18653/v1/2023.findings-emnlp.654 2023