arxiv: 2507.15707 · v2 · submitted 2025-07-21 · 💻 cs.CL · cs.AI

Is Large Language Model Performance on Reasoning Tasks Impacted by Different Ways Questions Are Asked?

Pith reviewed 2026-05-19 03:49 UTC · model grok-4.3

classification 💻 cs.CL cs.AI
keywords large language models, reasoning tasks, question formats, multiple choice, performance evaluation, deductive reasoning, quantitative reasoning

The pith

The way questions are asked changes how accurately large language models reason and select final answers.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests five large language models on quantitative and deductive reasoning tasks presented in three question formats: multiple-choice, true/false, and open-ended. It measures accuracy both in the intermediate reasoning steps and in the choice of the final answer. The results show clear performance gaps across formats, with the number of options and specific word choices playing a role. Reasoning-step accuracy does not always predict whether the model picks the right final answer. This matters because common evaluation methods may give a misleading picture of model capability depending on how the test questions are written.

Core claim

Large language models perform significantly differently on the same reasoning tasks when the questions are posed in multiple-choice, true/false, or short-answer formats. Accuracy at following correct reasoning steps does not necessarily match accuracy at selecting the final answer. Both the number of answer options and the specific wording used in the questions affect the outcomes.
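
To make the format manipulation concrete, here is a minimal sketch of how one underlying problem could be rendered in the three formats. It is an illustration only, not the paper's generation procedure: the `Problem` class, the function names, the prompt wording, the alphabetical option ordering, and the toy problem are all assumptions.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Problem:
    """One underlying reasoning problem (hypothetical structure)."""
    text: str                      # problem statement shared by all formats
    answer: str                    # gold answer
    distractors: Tuple[str, ...]   # wrong answers used to build other formats


def as_open_ended(p: Problem) -> str:
    """Open-ended format: the model must produce the answer itself."""
    return f"{p.text}\nAnswer:"


def as_true_false(p: Problem, candidate: str) -> str:
    """True/false format: the model judges one proposed answer."""
    return (
        f"{p.text}\nProposed answer: {candidate}\n"
        "Is the proposed answer correct? Answer True or False."
    )


def as_multiple_choice(p: Problem, n_options: int) -> Tuple[str, str]:
    """Multiple-choice format with a configurable number of options.

    Returns the prompt and the gold option letter. Options are sorted as
    strings so the output is deterministic (an assumption of this sketch).
    """
    if not 2 <= n_options <= len(p.distractors) + 1:
        raise ValueError("n_options must fit the available distractors")
    options = sorted((p.answer, *p.distractors[: n_options - 1]))
    labels = "ABCDEFGH"
    lines = [f"{labels[i]}. {opt}" for i, opt in enumerate(options)]
    gold = labels[options.index(p.answer)]
    prompt = p.text + "\n" + "\n".join(lines) + "\nAnswer with a single letter."
    return prompt, gold


# Toy problem, invented for illustration (not taken from the paper's data).
toy = Problem(
    text="Tom has 5 apples and buys 7 more. How many apples does he have?",
    answer="12",
    distractors=("9", "15", "20"),
)
print(as_open_ended(toy))
print(as_true_false(toy, "15"))
print(as_multiple_choice(toy, 4)[0])
```

Holding `text` fixed while varying only the wrapper and `n_options` is one way to vary format and option count without changing the underlying problem.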

What carries the argument

A side-by-side comparison of three question formats on quantitative and deductive reasoning tasks, tracking separate accuracies for reasoning steps and final-answer selection.
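
The two-metric bookkeeping can be sketched as follows. This is a hedged illustration that assumes each model output has already been graded for reasoning correctness and final-answer correctness; the `GradedOutput` class, the `summarize` function, and the `divergence_rate` metric are hypothetical names, and the paper's own grading procedure is not reproduced here.

```python
from dataclasses import dataclass
from typing import Dict, Sequence


@dataclass(frozen=True)
class GradedOutput:
    """One graded model response (hypothetical structure)."""
    reasoning_correct: bool  # were the intermediate steps judged correct?
    final_correct: bool      # was the selected final answer correct?


def summarize(outputs: Sequence[GradedOutput]) -> Dict[str, float]:
    """Report reasoning accuracy, final-answer accuracy, and how often they disagree."""
    n = len(outputs)
    if n == 0:
        raise ValueError("need at least one graded output")
    reasoning_acc = sum(o.reasoning_correct for o in outputs) / n
    final_acc = sum(o.final_correct for o in outputs) / n
    # Share of items where the two judgments differ, in either direction.
    divergence = sum(o.reasoning_correct != o.final_correct for o in outputs) / n
    return {
        "reasoning_acc": reasoning_acc,
        "final_acc": final_acc,
        "divergence_rate": divergence,
    }


# Toy grades, invented for illustration.
graded = [
    GradedOutput(True, True),
    GradedOutput(True, False),   # correct steps, wrong final selection
    GradedOutput(False, True),   # flawed steps, right final selection
    GradedOutput(False, False),
]
print(summarize(graded))
```

In the toy data the two accuracies are equal even though half of the items disagree, which is why a per-item divergence count is reported alongside the two averages.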

If this is right

  • LLM evaluations should test models with multiple question formats instead of a single style.
  • The number of options in a question can shift model accuracy on reasoning problems.
  • Word choice in questions influences both reasoning accuracy and final selection accuracy.
  • Separate metrics for reasoning steps and final answers are needed because they can diverge.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Benchmarks that use only one question format may overestimate or underestimate true reasoning ability.
  • Training or prompting methods could be developed to make models less sensitive to format changes.
  • Practical applications should verify performance using the exact question styles the system will face.

Load-bearing premise

The chosen reasoning tasks and the particular ways the three question types were written are representative enough to show the general effect of question format.

What would settle it

Repeating the tests with identical wording and option counts across formats and finding no meaningful performance differences would undermine the claim.

Figures

Figures reproduced from arXiv: 2507.15707 by Mohna Chakraborty, Qi Li, Seok Hwan Song, Wallapak Tavanapong.

Figure 1: Examples of different types of questions generated from the original problem. Variables in italics are ...
Figure 2: Examples of different patterns of incorrect outputs by LLMs for MCQ questions.
Figure 3: Percent of incorrect pattern outputs by LLMs.

Original abstract

Large Language Models (LLMs) have been evaluated using diverse question types, e.g., multiple-choice, true/false, and short/long answers. This study answers an unexplored question about the impact of different question types on LLM accuracy on reasoning tasks. We investigate the performance of five LLMs on three different types of questions using quantitative and deductive reasoning tasks. The performance metrics include accuracy in the reasoning steps and choosing the final answer. Key Findings: (1) Significant differences exist in LLM performance across different question types. (2) Reasoning accuracy does not necessarily correlate with the final selection accuracy. (3) The number of options and the choice of words influence LLM performance.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 3 minor

Summary. The manuscript presents an empirical study examining whether different question formats affect the performance of five LLMs on quantitative and deductive reasoning tasks. It compares three question types, separately scoring accuracy on reasoning steps versus final answer selection, and reports significant performance differences across formats, a decoupling between reasoning accuracy and final-answer accuracy, and influences from option count and wording.

Significance. If the reported differences prove robust, the work usefully demonstrates prompt-format sensitivity in LLM reasoning evaluations and the value of scoring intermediate reasoning traces independently of the final choice. The multi-model, multi-task design is a strength that supports the central empirical claims.

major comments (2)
  1. Results section: the headline claim of 'significant differences' across question types requires explicit reporting of the statistical test, degrees of freedom, and exact p-values (or correction method) for each pairwise comparison; without these the robustness of the first key finding cannot be verified from the presented numbers alone.
  2. §3.2 Experimental Setup: the three question formats are described at a high level, but concrete prompt templates and the exact number of instances per task type are not supplied, which is load-bearing for assessing whether format effects are isolated from prompt wording or dataset size confounds.
minor comments (3)
  1. Abstract: the sources and sizes of the quantitative and deductive reasoning datasets should be stated so readers can gauge the scale and representativeness of the evaluation.
  2. Figure captions: error bars or per-model variance should be added to performance plots to allow visual assessment of consistency across the five LLMs.
  3. Related Work: a brief citation to prior studies on prompt sensitivity (e.g., on multiple-choice vs. open-ended formats) would better situate the novelty of the current comparison.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the positive assessment and constructive suggestions. Both major comments identify areas where additional detail will improve verifiability and reproducibility; we have revised the manuscript accordingly.

Point-by-point responses
  1. Referee: Results section: the headline claim of 'significant differences' across question types requires explicit reporting of the statistical test, degrees of freedom, and exact p-values (or correction method) for each pairwise comparison; without these the robustness of the first key finding cannot be verified from the presented numbers alone.

    Authors: We agree that formal statistical support is necessary to substantiate the claim. The revised manuscript now includes a dedicated paragraph in the Results section reporting the statistical procedure (paired McNemar tests for accuracy comparisons, with Bonferroni correction for multiple pairwise tests across question types and models). We report the test statistic, degrees of freedom, exact p-values, and effect sizes for all relevant comparisons. These additions directly address the concern and allow readers to assess the robustness of the reported differences. revision: yes

  2. Referee: §3.2 Experimental Setup: the three question formats are described at a high level, but concrete prompt templates and the exact number of instances per task type are not supplied, which is load-bearing for assessing whether format effects are isolated from prompt wording or dataset size confounds.

    Authors: We accept this point. The revised §3.2 now contains the full verbatim prompt templates for each of the three question formats (multiple-choice, true/false, and open-ended) in a new appendix, together with the exact instance counts per task (quantitative and deductive) and per format. These details confirm that the same underlying problems were used across formats and that dataset sizes were balanced, thereby isolating the format variable from wording or size confounds. revision: yes
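
The simulated rebuttal names paired McNemar tests with Bonferroni correction. As a hedged illustration of what such a procedure could look like (not the authors' code, and not confirmed as the paper's actual analysis), the sketch below computes an exact two-sided McNemar p-value from per-item correctness on the same problems under two formats, then applies a Bonferroni adjustment. The function names, the choice of the exact binomial variant, and the toy data are assumptions. The exact variant has no degrees of freedom to report; the chi-square approximation of McNemar's test has one.

```python
from math import comb
from typing import List, Sequence


def mcnemar_exact(a_correct: Sequence[bool], b_correct: Sequence[bool]) -> float:
    """Exact two-sided McNemar p-value for paired correct/incorrect outcomes.

    a_correct[i] and b_correct[i] must refer to the same underlying problem
    answered under two different question formats.
    """
    if len(a_correct) != len(b_correct):
        raise ValueError("paired inputs must have equal length")
    # Discordant pairs: correct under one format but not the other.
    only_a = sum(x and not y for x, y in zip(a_correct, b_correct))
    only_b = sum(y and not x for x, y in zip(a_correct, b_correct))
    n = only_a + only_b
    if n == 0:
        return 1.0  # no discordant pairs, so no evidence of a difference
    k = min(only_a, only_b)
    # Under H0, discordant pairs split as Binomial(n, 0.5); double the lower tail.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


def bonferroni(p_values: Sequence[float]) -> List[float]:
    """Bonferroni adjustment: multiply each p-value by the number of tests, cap at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


# Toy data, invented for illustration: 8 items correct only under format A.
fmt_a = [True] * 8 + [True] * 4
fmt_b = [False] * 8 + [True] * 4
p_raw = mcnemar_exact(fmt_a, fmt_b)
# Pretend this is one of three pairwise format comparisons.
print(p_raw, bonferroni([p_raw, 0.5, 0.25]))
```

Because the test uses only discordant pairs, it requires that the same problems be posed in both formats, which is the pairing the rebuttal's reply to the second comment describes.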

Circularity Check

0 steps flagged

Empirical comparison study with no derivation chain

full rationale

This is a straightforward empirical study comparing LLM accuracy across three question formats on quantitative and deductive reasoning tasks. No equations, fitted parameters, or mathematical derivations are present. Central claims rest on direct experimental measurements (accuracy on reasoning steps vs. final answers) across five models, with no self-citation load-bearing steps or reductions of results to inputs by construction. The design isolates format effects through controlled task implementations, making the reported differences falsifiable against the data shown.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The study is empirical and relies on standard assumptions in LLM evaluation; no free parameters, invented entities, or non-standard axioms are described in the abstract.

axioms (1)
  • Domain assumption: accuracy on reasoning steps and final answer selection can be measured independently and meaningfully for the chosen tasks.
    Implicit in the separation of the two performance metrics described in the abstract.
