arxiv: 2501.09775 · v3 · submitted 2025-01-16 · 💻 cs.CL · cs.AI

Multiple Choice Questions: Reasoning Makes Large Language Models (LLMs) More Self-Confident, Especially When They are Wrong

Pith reviewed 2026-05-23 05:35 UTC · model grok-4.3

classification 💻 cs.CL cs.AI
keywords large language models · chain of thought · confidence calibration · multiple choice questions · token probability · expected calibration error · Brier score · reasoning prompts

The pith

Reasoning before answering makes large language models assign higher probability to their choices, especially when those choices are wrong.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether the method of answering multiple-choice questions changes how large language models express confidence through token probabilities. It compares direct selection of an answer against first producing a reasoning chain and then selecting. Across seven models and a broad benchmark, the probability given to the chosen answer rises when reasoning is required first. The rise is larger for answers that turn out incorrect than for answers that turn out correct. Because the final prediction now depends on both the original question and the model's own earlier text, the resulting probabilities no longer track actual correctness as well, and standard calibration scores worsen.

Core claim

Models are systematically more confident when providing reasoning before answering, and this confidence increase is larger when the selected answer is incorrect than when it is correct. The reasoning process alters token probabilities, as the final answer prediction depends jointly on the question and the model's self-generated reasoning, leading to inflated confidence estimates. Chain-of-Thought prompting degrades calibration by increasing the proportion of high-confidence wrong answers.

What carries the argument

The change in the probability assigned to the chosen answer token when the input to the final prediction includes the model's own preceding reasoning chain versus when it does not.
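
The quantity can be written down compactly. The notation below is ours, not the paper's: q is the question, r is a reasoning chain sampled from the model, and \mathcal{A} is the set of option labels. Treating the chosen answer as the highest-probability option is an assumption made for illustration.

```latex
\begin{align}
  \hat{a}_{\mathrm{direct}} &= \arg\max_{a \in \mathcal{A}} P(a \mid q),
  & c_{\mathrm{direct}} &= P(\hat{a}_{\mathrm{direct}} \mid q), \\
  \hat{a}_{\mathrm{CoT}} &= \arg\max_{a \in \mathcal{A}} P(a \mid q, r),
  & c_{\mathrm{CoT}} &= P(\hat{a}_{\mathrm{CoT}} \mid q, r),
  \qquad r \sim P(\cdot \mid q), \\
  \Delta &= c_{\mathrm{CoT}} - c_{\mathrm{direct}}.
\end{align}
```

In these terms, the paper's finding is that the average of \Delta is positive, and larger over questions answered incorrectly than over questions answered correctly.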

If this is right

  • LLM-estimated probabilities should be used with caution as a basis for evaluation when Chain-of-Thought prompting is applied.
  • The share of high-confidence errors rises under reasoning prompts.
  • Standard calibration metrics such as Expected Calibration Error and Brier score register worse performance once reasoning is inserted before the answer choice.
  • Metacognitive mechanisms that rely on these probabilities become less reliable in multiple-choice settings that use Chain-of-Thought.
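
For readers who want to see what the two calibration metrics measure, here is a minimal sketch in plain Python. It assumes equal-width bins and the top-label form of the Brier score (squared gap between the chosen-option probability and a 0/1 correctness flag); the paper's exact formulation may differ. The toy numbers are invented for illustration and are not results from the paper.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE = sum over bins of (bin size / n) * |bin accuracy - bin mean confidence|."""
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, is_correct in zip(confidences, correct):
        # Equal-width bins over [0, 1]; a confidence of exactly 1.0 goes in the last bin.
        index = min(int(conf * n_bins), n_bins - 1)
        bins[index].append((conf, is_correct))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(y for _, y in members) / len(members)
        ece += (len(members) / n) * abs(accuracy - mean_conf)
    return ece


def brier_score(confidences, correct):
    """Mean squared gap between confidence and the 0/1 correctness outcome."""
    return sum((c - y) ** 2 for c, y in zip(confidences, correct)) / len(confidences)


# Invented toy data: same correctness pattern, but the second run is more
# confident everywhere and much more confident on the two wrong answers.
correct = [1, 1, 1, 0, 0]
direct_conf = [0.85, 0.75, 0.65, 0.55, 0.45]
cot_conf = [0.95, 0.85, 0.75, 0.95, 0.85]

ece_direct = expected_calibration_error(direct_conf, correct)
ece_cot = expected_calibration_error(cot_conf, correct)
brier_direct = brier_score(direct_conf, correct)
brier_cot = brier_score(cot_conf, correct)
print(ece_direct, ece_cot, brier_direct, brier_cot)
```

On this toy input both scores get worse for the second run, and almost all of the Brier increase comes from the two high-confidence wrong answers.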

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Evaluations that mix direct-answer and reasoning-answer runs may need separate calibration baselines for each style.
  • Prompt designs that aim to improve self-assessment could test whether shortening or constraining the reasoning step reduces the extra inflation on wrong answers.
  • The same joint-dependence mechanism might appear in other generation tasks where an intermediate text is produced before a final numeric or categorical output.

Load-bearing premise

The probability assigned to the chosen answer token remains a comparable measure of confidence across direct-answer and reasoning-before-answer conditions, even though the input context to the final prediction differs.
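
To make the premise concrete, the sketch below shows one common way to turn answer-token logits into a confidence value: a softmax restricted to the option labels. Whether the paper normalizes over the option tokens or over the full vocabulary is not stated on this page, so that choice is an assumption, and the logits are hypothetical.

```python
import math


def option_probabilities(option_logits):
    """Softmax over the answer-option tokens only (an assumed normalization)."""
    peak = max(option_logits.values())  # subtract the max for numerical stability
    exps = {label: math.exp(value - peak) for label, value in option_logits.items()}
    total = sum(exps.values())
    return {label: e / total for label, e in exps.items()}


def chosen_confidence(option_logits):
    """Return the selected option and the probability assigned to it."""
    probs = option_probabilities(option_logits)
    chosen = max(probs, key=probs.get)
    return chosen, probs[chosen]


# Hypothetical logits for one question; not taken from the paper.
direct_logits = {"A": 1.2, "B": 0.9, "C": 0.1, "D": -0.3}  # context: question only
cot_logits = {"A": 4.0, "B": 0.5, "C": -1.0, "D": -1.5}    # context: question + reasoning

direct_choice, direct_conf = chosen_confidence(direct_logits)
cot_choice, cot_conf = chosen_confidence(cot_logits)
print(direct_choice, round(direct_conf, 3), cot_choice, round(cot_conf, 3))
```

The same function is applied in both conditions, but the logits it receives were produced from different contexts. That is exactly the comparability question the premise raises.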

What would settle it

A controlled comparison in which the increase in chosen-token probability after reasoning is shown to be no larger for incorrect answers than for correct answers, or in which Expected Calibration Error and Brier score remain unchanged or improve under Chain-of-Thought prompting.
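
The first half of that test reduces to comparing two differences of means. A minimal sketch follows, assuming each run is a list of (chosen-option probability, was-correct) pairs and that answers are grouped by correctness within each condition; the function names and the toy values are ours.

```python
from statistics import mean


def mean_confidence(runs, want_correct):
    """Mean selected-option probability over answers with the given correctness."""
    return mean(conf for conf, is_correct in runs if is_correct == want_correct)


def confidence_rise(direct_runs, cot_runs):
    """Return (rise on correct answers, rise on incorrect answers), CoT minus direct."""
    rise_correct = mean_confidence(cot_runs, True) - mean_confidence(direct_runs, True)
    rise_incorrect = mean_confidence(cot_runs, False) - mean_confidence(direct_runs, False)
    return rise_correct, rise_incorrect


# Invented toy data, not results from the paper.
direct_runs = [(0.80, True), (0.70, True), (0.50, False), (0.40, False)]
cot_runs = [(0.90, True), (0.85, True), (0.90, True), (0.80, False)]

rise_correct, rise_incorrect = confidence_rise(direct_runs, cot_runs)
print(rise_correct, rise_incorrect)
```

The paper's claim corresponds to the second value exceeding the first. A result where it does not, on real model outputs, would count against the claim.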

Figures

Figures reproduced from arXiv: 2501.09775 by Gonzalo Martínez, Javier Conde, María Grandury, Pedro Reviriego, Tairan Fu.

Figure 1: Accuracy Comparison Across Models on MMLU Categories with Direct and CoT Prompts
Figure 2: Average Probabilities of Selected Option Across Models on MMLU with Direct and CoT Prompts
Figure 3: Average Probabilities of Correctly Selected Option Across Models on MMLU with Direct and CoT Prompts
Figure 4: Average Probabilities of Incorrectly Selected Option Across Models on MMLU with Direct and CoT Prompts
Figure 5: Probability Distribution of Correctly Selected Option Across Models in MMLU
Figure 6: Probability Distribution of Incorrectly Selected Option Across Models in MMLU
Figure 7: Increments in accuracy, in the probability of the selected option, in the probability of the selected option for
Figure 8: Average increase in the probability of the selected option when options for both prompts are incorrect and
Original abstract

Multiple Choice Question (MCQ) tests are among the most used methods for evaluating large language models (LLMs). Besides checking the correctness of the selected answer, evaluations often consider the model's confidence through the probability assigned to its response. In this work, we investigate how LLM confidence is influenced by the answering approach when the model answers directly or reasons before responding. Experiments on a general knowledge benchmark, covering 57 subjects and seven LLMs, show that models are systematically more confident when providing reasoning before answering, and that this confidence increase is larger when the selected answer is incorrect than when it is correct. We hypothesize that the reasoning process alters token probabilities, as the final answer prediction depends jointly on the question and the model's self-generated reasoning, leading to inflated confidence estimates. Using standard calibration metrics such as Expected Calibration Error and Brier score, we further show that Chain-of-Thought (CoT) prompting degrades calibration by increasing the proportion of high-confidence wrong answers. These findings indicate that, in MCQ evaluation settings with CoT prompting, LLM-estimated probabilities should be used with caution as a basis for evaluation and metacognitive mechanisms.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that in multiple-choice question evaluations, LLMs assign systematically higher token probabilities to their selected answers when they first generate reasoning (CoT) than when answering directly; this increase is larger for incorrect answers than correct ones. Across 57 subjects and seven models, CoT is shown to worsen calibration (via ECE and Brier score) by raising the share of high-confidence errors. The authors hypothesize that self-generated reasoning alters the conditional distribution used for the final answer token.

Significance. If the central measurement is shown to be robust, the result would caution against treating token probabilities as stable confidence estimates under CoT prompting in MCQ settings and would affect how calibration and metacognition are assessed in LLM evaluations. The breadth of the empirical study (57 subjects, seven models) is a clear strength; the work supplies falsifiable, replicable patterns rather than parameter-fitted derivations.

major comments (2)
  1. [Methods / Experimental Setup] The central claim requires that the probability of the chosen answer token remains a comparable confidence proxy in the direct condition P(token | question) versus the CoT condition P(token | question + model-generated reasoning). No controls (fixed reasoning, length-matched contexts, or alternative verbalized-certainty measures) are described to isolate epistemic change from the shift in conditioning context; this assumption is load-bearing for interpreting the reported increase as genuine overconfidence rather than an artifact.
  2. [Results] The reported degradation in Expected Calibration Error and Brier score under CoT is attributed to an increased proportion of high-confidence errors, yet the manuscript provides no detail on binning procedure, whether the same probability thresholds are applied across conditions, or statistical tests confirming the effect size difference between correct and incorrect answers; without these, the calibration claim cannot be fully evaluated.
minor comments (2)
  1. [Abstract] The phrase 'alters token probabilities' is used without a short parenthetical clarifying that the final prediction is conditioned on the model's own preceding tokens.
  2. [Introduction] The title uses 'Self-Confident'; a brief footnote or parenthetical in the introduction would help readers distinguish token-probability confidence from verbalized self-assessment.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which highlight important aspects of our experimental design and reporting. We address each point below and will revise the manuscript accordingly to improve clarity and robustness.

Point-by-point responses
  1. Referee: [Methods / Experimental Setup] The central claim requires that the probability of the chosen answer token remains a comparable confidence proxy in the direct condition P(token | question) versus the CoT condition P(token | question + model-generated reasoning). No controls (fixed reasoning, length-matched contexts, or alternative verbalized-certainty measures) are described to isolate epistemic change from the shift in conditioning context; this assumption is load-bearing for interpreting the reported increase as genuine overconfidence rather than an artifact.

    Authors: We agree that the lack of explicit controls makes it harder to fully isolate the effect of self-generated reasoning from changes in conditioning context. In the revision we will add two new control experiments: (1) length-matched direct prompts that append neutral filler text of similar length to the CoT output, and (2) fixed-reasoning conditions where the same model-generated reasoning chain is prepended to both correct and incorrect answer tokens. We will also expand the discussion to acknowledge that the current results reflect the joint effect of reasoning content and context shift, consistent with our stated hypothesis that the final answer depends on the question plus self-generated reasoning. revision: yes

  2. Referee: [Results] The reported degradation in Expected Calibration Error and Brier score under CoT is attributed to an increased proportion of high-confidence errors, yet the manuscript provides no detail on binning procedure, whether the same probability thresholds are applied across conditions, or statistical tests confirming the effect size difference between correct and incorrect answers; without these, the calibration claim cannot be fully evaluated.

    Authors: We appreciate this observation. The revised manuscript will include: (i) an explicit description of the ECE binning procedure (equal-width bins over [0,1] with M=10 bins, as is standard), (ii) confirmation that identical probability thresholds are used for both direct and CoT conditions, and (iii) statistical tests (bootstrap confidence intervals and paired Wilcoxon tests) on the difference in ECE, Brier score, and the proportion of high-confidence errors between correct and incorrect answers. These additions will allow readers to fully evaluate the calibration results. revision: yes
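
As an illustration of the bootstrap part of item (iii), here is a minimal sketch of a paired percentile bootstrap on the Brier score difference (CoT minus direct). It is not the authors' code. It assumes the two runs are aligned question by question, resamples questions with replacement, and uses only the standard library. The Wilcoxon test is not covered, and the toy records are invented.

```python
import random


def brier(records):
    """Mean squared gap between confidence and 0/1 correctness."""
    return sum((conf - is_correct) ** 2 for conf, is_correct in records) / len(records)


def paired_bootstrap_ci(direct, cot, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile CI for brier(cot) - brier(direct), resampling questions jointly."""
    assert len(direct) == len(cot), "runs must cover the same questions"
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(direct)
    diffs = []
    for _ in range(n_resamples):
        # Draw question indices once and apply them to both runs (paired design).
        idx = [rng.randrange(n) for _ in range(n)]
        diffs.append(brier([cot[i] for i in idx]) - brier([direct[i] for i in idx]))
    diffs.sort()
    lower = diffs[int(n_resamples * alpha / 2)]
    upper = diffs[int(n_resamples * (1 - alpha / 2)) - 1]
    return lower, upper


# Invented toy records (confidence, correct) for ten questions.
direct = [(0.80, 1)] * 6 + [(0.50, 0)] * 4
cot = [(0.85, 1)] * 6 + [(0.90, 0)] * 4

point_estimate = brier(cot) - brier(direct)
lower, upper = paired_bootstrap_ci(direct, cot)
print(point_estimate, lower, upper)
```

An interval that excludes zero would indicate that the Brier degradation is unlikely to be a resampling artifact. The same loop could be reused for ECE or for the share of high-confidence errors by swapping the statistic.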

Circularity Check

0 steps flagged

No circularity: purely empirical measurements with no derivations or fitted predictions

Full rationale

The paper reports direct experimental results comparing token probabilities assigned to answers under direct vs. reasoning-before-answer conditions across 7 LLMs and 57 subjects. No equations, parameter fittings, uniqueness theorems, or ansatzes are invoked. The central claim (higher confidence with CoT, larger when wrong) and the hypothesis about altered conditional distributions are presented as empirical observations and interpretation, not as outputs derived from or equivalent to inputs by construction. Standard calibration metrics (ECE, Brier) are applied without self-referential fitting. This matches the default case of a self-contained empirical study.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The work is purely empirical; no free parameters, mathematical axioms, or new postulated entities are introduced.

pith-pipeline@v0.9.0 · 5751 in / 1011 out tokens · 56202 ms · 2026-05-23T05:35:34.030517+00:00 · methodology

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Does Self-Consistency Improve the Recall of Encyclopedic Knowledge?

    cs.CL · 2026-04 · unverdicted · novelty 6.0

    Self-consistency improves LLM performance on encyclopedic knowledge recall as well as symbolic reasoning, setting a new 89% accuracy record on MMLU with GPT-4o.
