Multiple Choice Questions: Reasoning Makes Large Language Models (LLMs) More Self-Confident, Especially When They are Wrong
Pith reviewed 2026-05-23 05:35 UTC · model grok-4.3
The pith
Reasoning before answering makes large language models assign higher probability to their choices, especially when those choices are wrong.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Models are systematically more confident when they provide reasoning before answering, and this confidence increase is larger when the selected answer is incorrect than when it is correct. The authors hypothesize that the reasoning process alters token probabilities: because the final answer prediction depends jointly on the question and the model's self-generated reasoning, confidence estimates become inflated. Chain-of-Thought prompting degrades calibration by increasing the proportion of high-confidence wrong answers.
What carries the argument
The change in the probability assigned to the chosen answer token when the input to the final prediction includes the model's own preceding reasoning chain versus when it does not.
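A minimal sketch of how this quantity could be computed, assuming confidence is the softmax probability of the chosen option, normalized over the answer-option tokens only. The paper's exact normalization is not stated here, and the logit values and the helper name `option_confidence` are hypothetical.

```python
import math

def option_confidence(logits):
    """Softmax over answer-option logits; return (chosen option, its probability)."""
    top = max(logits.values())  # subtract the max for numerical stability
    exps = {opt: math.exp(v - top) for opt, v in logits.items()}
    total = sum(exps.values())
    probs = {opt: e / total for opt, e in exps.items()}
    chosen = max(probs, key=probs.get)
    return chosen, probs[chosen]

# Hypothetical logits for the answer token of one question.
# Direct condition: the context is the question only.
direct_logits = {"A": 1.2, "B": 2.0, "C": 0.3, "D": 0.1}
# CoT condition: the context is the question plus the model's own reasoning.
cot_logits = {"A": 0.5, "B": 4.5, "C": 0.2, "D": 0.0}

direct_choice, direct_conf = option_confidence(direct_logits)
cot_choice, cot_conf = option_confidence(cot_logits)

# The quantity the argument rests on: change in chosen-answer probability.
confidence_shift = cot_conf - direct_conf
print(direct_choice, cot_choice, confidence_shift)
```

In this toy case the chosen option is the same in both conditions; when the choice changes after reasoning, the shift compares the probabilities of two different answers, which is one reason comparability is questioned later in the review.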
If this is right
- LLM-estimated probabilities should be used with caution as a basis for evaluation when Chain-of-Thought prompting is applied.
- The share of high-confidence errors rises under reasoning prompts.
- Standard calibration metrics such as Expected Calibration Error and Brier score register worse performance once reasoning is inserted before the answer choice.
- Metacognitive mechanisms that rely on these probabilities become less reliable in multiple-choice settings that use Chain-of-Thought.
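The calibration quantities named in the list above can be sketched as follows. This assumes the top-label form of both metrics (confidence of the chosen answer against a 0/1 correctness label), ten equal-width bins for Expected Calibration Error, and a hypothetical 0.9 threshold for "high confidence"; the toy data are invented for illustration and are not results from the paper.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE with equal-width bins over [0, 1]."""
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 falls in the last bin
        bins[idx].append((conf, ok))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(ok for _, ok in members) / len(members)
        ece += (len(members) / n) * abs(accuracy - avg_conf)
    return ece

def brier_score(confidences, correct):
    """Mean squared gap between confidence and the 0/1 correctness label."""
    return sum((c - ok) ** 2 for c, ok in zip(confidences, correct)) / len(confidences)

def high_confidence_error_rate(confidences, correct, threshold=0.9):
    """Share of all answers that are wrong yet at or above the threshold."""
    hits = sum(1 for c, ok in zip(confidences, correct) if c >= threshold and not ok)
    return hits / len(confidences)

# Invented toy data: same correctness pattern, higher confidence under CoT.
correct = [1, 0, 1, 0]
direct_conf = [0.55, 0.55, 0.65, 0.65]
cot_conf = [0.95, 0.97, 0.96, 0.92]

for name, conf in (("direct", direct_conf), ("cot", cot_conf)):
    print(name,
          expected_calibration_error(conf, correct),
          brier_score(conf, correct),
          high_confidence_error_rate(conf, correct))
```

With accuracy held fixed, pushing confidence toward 1.0 raises all three numbers, which is the pattern the bullets describe.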
Where Pith is reading between the lines
- Evaluations that mix direct-answer and reasoning-answer runs may need separate calibration baselines for each style.
- Prompt designs that aim to improve self-assessment could test whether shortening or constraining the reasoning step reduces the extra inflation on wrong answers.
- The same joint-dependence mechanism might appear in other generation tasks where an intermediate text is produced before a final numeric or categorical output.
Load-bearing premise
The probability assigned to the chosen answer token remains a comparable measure of confidence across direct-answer and reasoning-before-answer conditions, even though the input context to the final prediction differs.
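The premise can be written compactly. The notation below is introduced here for illustration and is not taken from the paper: $q$ is the question, $\mathcal{A}$ the set of answer options, $r$ the model's self-generated reasoning, and $P_\theta$ the model's next-token distribution.

```latex
\begin{align}
c_{\text{direct}}(q) &= \max_{a \in \mathcal{A}} P_\theta(a \mid q), \\
c_{\text{CoT}}(q)    &= \max_{a \in \mathcal{A}} P_\theta(a \mid q, r),
  \qquad r \sim P_\theta(\cdot \mid q).
\end{align}
```

The premise is that $c_{\text{direct}}$ and $c_{\text{CoT}}$ can be read on the same scale, as estimates of the probability that the chosen answer is correct, even though the second is conditioned on text the model itself produced.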
What would settle it
A controlled comparison in which the increase in chosen-token probability after reasoning is shown to be no larger for incorrect answers than for correct answers, or in which Expected Calibration Error and Brier score remain unchanged or improve under Chain-of-Thought prompting.
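The first half of this test can be sketched as a comparison of mean confidence gains, grouped by whether the answer was correct. This is an illustrative sketch, not the paper's analysis code: the records are invented, the function names are hypothetical, and correctness is assumed to be judged separately within each condition.

```python
from statistics import mean

def mean_confidence_by_outcome(records):
    """records: list of (confidence, is_correct). Returns (mean on correct, mean on wrong).

    Raises statistics.StatisticsError if either group is empty.
    """
    on_correct = [c for c, ok in records if ok]
    on_wrong = [c for c, ok in records if not ok]
    return mean(on_correct), mean(on_wrong)

def confidence_inflation(direct, cot):
    """Mean confidence gain from direct to CoT, for correct and for wrong answers."""
    direct_ok, direct_wrong = mean_confidence_by_outcome(direct)
    cot_ok, cot_wrong = mean_confidence_by_outcome(cot)
    return cot_ok - direct_ok, cot_wrong - direct_wrong

# Invented records for illustration only.
direct = [(0.80, True), (0.70, True), (0.50, False), (0.40, False)]
cot = [(0.95, True), (0.90, True), (0.90, False), (0.85, False)]

gain_correct, gain_wrong = confidence_inflation(direct, cot)

# The paper's claim predicts gain_wrong > gain_correct;
# the settling result would be gain_wrong <= gain_correct.
claim_holds = gain_wrong > gain_correct
print(gain_correct, gain_wrong, claim_holds)
```

A point comparison like this says nothing about uncertainty, so a real test would also need interval estimates or a significance test on the difference between the two gains.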
Original abstract
Multiple Choice Question (MCQ) tests are among the most used methods for evaluating large language models (LLMs). Besides checking the correctness of the selected answer, evaluations often consider the model's confidence through the probability assigned to its response. In this work, we investigate how LLM confidence is influenced by the answering approach when the model answers directly or reasons before responding. Experiments on a general knowledge benchmark, covering 57 subjects and seven LLMs, show that models are systematically more confident when providing reasoning before answering, and that this confidence increase is larger when the selected answer is incorrect than when it is correct. We hypothesize that the reasoning process alters token probabilities, as the final answer prediction depends jointly on the question and the model's self-generated reasoning, leading to inflated confidence estimates. Using standard calibration metrics such as Expected Calibration Error and Brier score, we further show that Chain-of-Thought (CoT) prompting degrades calibration by increasing the proportion of high-confidence wrong answers. These findings indicate that, in MCQ evaluation settings with CoT prompting, LLM-estimated probabilities should be used with caution as a basis for evaluation and metacognitive mechanisms.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that in multiple-choice question evaluations, LLMs assign systematically higher token probabilities to their selected answers when they first generate reasoning (CoT) than when answering directly; this increase is larger for incorrect answers than correct ones. Across 57 subjects and seven models, CoT is shown to worsen calibration (via ECE and Brier score) by raising the share of high-confidence errors. The authors hypothesize that self-generated reasoning alters the conditional distribution used for the final answer token.
Significance. If the central measurement is shown to be robust, the result would caution against treating token probabilities as stable confidence estimates under CoT prompting in MCQ settings and would affect how calibration and metacognition are assessed in LLM evaluations. The breadth of the empirical study (57 subjects, seven models) is a clear strength; the work supplies falsifiable, replicable patterns rather than parameter-fitted derivations.
major comments (2)
- [Methods / Experimental Setup] The central claim requires that the probability of the chosen answer token remains a comparable confidence proxy in the direct condition P(token | question) versus the CoT condition P(token | question + model-generated reasoning). No controls (fixed reasoning, length-matched contexts, or alternative verbalized-certainty measures) are described to isolate epistemic change from the shift in conditioning context; this assumption is load-bearing for interpreting the reported increase as genuine overconfidence rather than an artifact.
- [Results] The reported degradation in Expected Calibration Error and Brier score under CoT is attributed to an increased proportion of high-confidence errors, yet the manuscript provides no detail on binning procedure, whether the same probability thresholds are applied across conditions, or statistical tests confirming the effect size difference between correct and incorrect answers; without these, the calibration claim cannot be fully evaluated.
minor comments (2)
- [Abstract] Abstract: the phrase 'alters token probabilities' is used without a short parenthetical clarifying that the final prediction is conditioned on the model's own preceding tokens.
- [Introduction] The title uses 'Self-Confident'; a brief footnote or parenthetical in the introduction would help readers distinguish token-probability confidence from verbalized self-assessment.
Simulated Author's Rebuttal
We thank the referee for the constructive comments, which highlight important aspects of our experimental design and reporting. We address each point below and will revise the manuscript accordingly to improve clarity and robustness.
- Referee: [Methods / Experimental Setup] The central claim requires that the probability of the chosen answer token remains a comparable confidence proxy in the direct condition P(token | question) versus the CoT condition P(token | question + model-generated reasoning). No controls (fixed reasoning, length-matched contexts, or alternative verbalized-certainty measures) are described to isolate epistemic change from the shift in conditioning context; this assumption is load-bearing for interpreting the reported increase as genuine overconfidence rather than an artifact.
  Authors: We agree that the lack of explicit controls makes it harder to fully isolate the effect of self-generated reasoning from changes in conditioning context. In the revision we will add two new control experiments: (1) length-matched direct prompts that append neutral filler text of similar length to the CoT output, and (2) fixed-reasoning conditions where the same model-generated reasoning chain is prepended to both correct and incorrect answer tokens. We will also expand the discussion to acknowledge that the current results reflect the joint effect of reasoning content and context shift, consistent with our stated hypothesis that the final answer depends on the question plus self-generated reasoning.
  Revision: yes
- Referee: [Results] The reported degradation in Expected Calibration Error and Brier score under CoT is attributed to an increased proportion of high-confidence errors, yet the manuscript provides no detail on binning procedure, whether the same probability thresholds are applied across conditions, or statistical tests confirming the effect size difference between correct and incorrect answers; without these, the calibration claim cannot be fully evaluated.
  Authors: We appreciate this observation. The revised manuscript will include: (i) an explicit description of the ECE binning procedure (equal-width bins over [0,1] with M=10 bins, as is standard), (ii) confirmation that identical probability thresholds are used for both direct and CoT conditions, and (iii) statistical tests (bootstrap confidence intervals and paired Wilcoxon tests) on the difference in ECE, Brier score, and the proportion of high-confidence errors between correct and incorrect answers. These additions will allow readers to fully evaluate the calibration results.
  Revision: yes
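One way the promised bootstrap interval could look is sketched below: a percentile bootstrap on the ECE difference between conditions, resampling questions in pairs so both conditions always see the same questions. This is an assumption about the procedure, not the authors' code; the function names, the 2000 resamples, and the synthetic data are all hypothetical.

```python
import random

def ece(pairs, n_bins=10):
    """ECE with equal-width bins over [0, 1]; pairs are (confidence, is_correct)."""
    sums = [[0.0, 0.0, 0] for _ in range(n_bins)]  # confidence sum, correct sum, count
    for conf, ok in pairs:
        b = min(int(conf * n_bins), n_bins - 1)
        sums[b][0] += conf
        sums[b][1] += ok
        sums[b][2] += 1
    # (count / n) * |accuracy - mean confidence| simplifies to |correct sum - conf sum| / n
    return sum(abs(s[1] - s[0]) / len(pairs) for s in sums if s[2])

def bootstrap_ece_difference(direct, cot, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile CI for ECE(cot) - ECE(direct), resampling question indices in pairs."""
    assert len(direct) == len(cot)
    rng = random.Random(seed)
    n = len(direct)
    diffs = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        diffs.append(ece([cot[i] for i in idx]) - ece([direct[i] for i in idx]))
    diffs.sort()
    lower = diffs[int((alpha / 2) * n_resamples)]
    upper = diffs[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper

# Synthetic data: 70% accuracy in both conditions, higher confidence under CoT.
direct = [(0.70, i % 10 < 7) for i in range(200)]
cot = [(0.95, i % 10 < 7) for i in range(200)]

lower, upper = bootstrap_ece_difference(direct, cot)
print(lower, upper)
```

An interval that excludes zero would support a calibration difference between conditions. The paired Wilcoxon test the authors mention is not shown here; `scipy.stats.wilcoxon` is one commonly used implementation.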
Circularity Check
No circularity: purely empirical measurements with no derivations or fitted predictions
The paper reports direct experimental results comparing token probabilities assigned to answers under direct vs. reasoning-before-answer conditions across 7 LLMs and 57 subjects. No equations, parameter fittings, uniqueness theorems, or ansatzes are invoked. The central claim (higher confidence with CoT, larger when wrong) and the hypothesis about altered conditional distributions are presented as empirical observations and interpretation, not as outputs derived from or equivalent to inputs by construction. Standard calibration metrics (ECE, Brier) are applied without self-referential fitting. This matches the default case of a self-contained empirical study.
Forward citations
Cited by 1 Pith paper
- Does Self-Consistency Improve the Recall of Encyclopedic Knowledge?
Self-consistency improves LLM performance on encyclopedic knowledge recall as well as symbolic reasoning, setting a new 89% accuracy record on MMLU with GPT-4o.