Is Large Language Model Performance on Reasoning Tasks Impacted by Different Ways Questions Are Asked?
Pith reviewed 2026-05-19 03:49 UTC · model grok-4.3
The pith
The way questions are asked changes how accurately large language models reason and select final answers.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Large language models exhibit significantly different performance on the same reasoning tasks when the questions are posed in multiple-choice, true/false, or short-answer formats. Accuracy at following correct reasoning steps does not necessarily match accuracy at selecting the final answer. Both the number of answer options and the specific wording used in the questions affect the outcomes.
What carries the argument
Side-by-side comparison of three question formats on quantitative and deductive reasoning tasks, with reasoning-step accuracy and final-answer accuracy tracked separately.
If this is right
- LLM evaluations should test models with multiple question formats instead of a single style.
- The number of options in a question can shift model accuracy on reasoning problems.
- Word choice in questions influences both reasoning accuracy and final selection accuracy.
- Separate metrics for reasoning steps and final answers are needed because they can diverge.
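The last point can be made concrete with a minimal sketch. It assumes each model response has already been scored twice, once for its reasoning steps and once for its final answer; the `ScoredItem` schema, the `summarize` helper, and the toy data are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class ScoredItem:
    """One scored model response (hypothetical schema, not the paper's)."""
    reasoning_correct: bool  # were the intermediate steps correct?
    final_correct: bool      # was the final answer or selected option correct?


def summarize(items):
    """Return reasoning accuracy, final-answer accuracy, and their divergence rate."""
    n = len(items)
    if n == 0:
        raise ValueError("need at least one scored item")
    reasoning_acc = sum(i.reasoning_correct for i in items) / n
    final_acc = sum(i.final_correct for i in items) / n
    # Divergent items: correct reasoning with a wrong final answer, or the reverse.
    divergent = sum(i.reasoning_correct != i.final_correct for i in items) / n
    return {"reasoning_acc": reasoning_acc, "final_acc": final_acc, "divergent": divergent}


# Toy data, invented for illustration only.
toy = [
    ScoredItem(True, True),
    ScoredItem(True, False),
    ScoredItem(False, True),
    ScoredItem(False, False),
]
result = summarize(toy)
print(result)
```

In this toy case the two accuracies are equal even though half of the items disagree, which is why an item-level divergence rate may be more informative than comparing the two aggregate numbers.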
Where Pith is reading between the lines
- Benchmarks that use only one question format may overestimate or underestimate true reasoning ability.
- Training or prompting methods could be developed to make models less sensitive to format changes.
- Practical applications should verify performance using the exact question styles the system will face.
Load-bearing premise
The chosen reasoning tasks and the particular ways the three question types were written are representative enough to show the general effect of question format.
What would settle it
Repeating the tests with identical wording and option counts across formats and finding no meaningful performance differences would undermine the claim.
Original abstract
Large Language Models (LLMs) have been evaluated using diverse question types, e.g., multiple-choice, true/false, and short/long answers. This study answers an unexplored question about the impact of different question types on LLM accuracy on reasoning tasks. We investigate the performance of five LLMs on three different types of questions using quantitative and deductive reasoning tasks. The performance metrics include accuracy in the reasoning steps and choosing the final answer. Key Findings: (1) Significant differences exist in LLM performance across different question types. (2) Reasoning accuracy does not necessarily correlate with the final selection accuracy. (3) The number of options and the choice of words, influence LLM performance.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents an empirical study examining whether different question formats affect the performance of five LLMs on quantitative and deductive reasoning tasks. It compares three question types, separately scoring accuracy on reasoning steps versus final answer selection, and reports significant performance differences across formats, a decoupling between reasoning accuracy and final-answer accuracy, and influences from option count and wording.
Significance. If the reported differences prove robust, the work usefully demonstrates prompt-format sensitivity in LLM reasoning evaluations and the value of scoring intermediate reasoning traces independently of the final choice. The multi-model, multi-task design is a strength that supports the central empirical claims.
major comments (2)
- Results section: the headline claim of 'significant differences' across question types requires explicit reporting of the statistical test, degrees of freedom, and exact p-values (or correction method) for each pairwise comparison; without these the robustness of the first key finding cannot be verified from the presented numbers alone.
- §3.2 Experimental Setup: the three question formats are described at a high level, but concrete prompt templates and the exact number of instances per task type are not supplied, which is load-bearing for assessing whether format effects are isolated from prompt wording or dataset size confounds.
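One way to keep format effects separate from wording effects is to render every format from the same underlying problem record. The sketch below illustrates that idea only: the `Problem` schema, the three `render_*` functions, the template wording, and the toy problem are all hypothetical and are not the paper's templates.

```python
from dataclasses import dataclass, field


@dataclass
class Problem:
    """A single underlying problem shared by all formats (hypothetical schema)."""
    question: str
    answer: str
    distractors: list = field(default_factory=list)  # wrong options for multiple choice


def render_short_answer(p):
    return f"{p.question}\nAnswer:"


def render_true_false(p, candidate):
    return (
        f"{p.question}\n"
        f"Proposed answer: {candidate}\n"
        "Is the proposed answer correct? Reply True or False."
    )


def render_multiple_choice(p, n_options):
    # The option count is an explicit parameter so it can be varied on its own.
    if not 2 <= n_options <= len(p.distractors) + 1:
        raise ValueError("n_options must be between 2 and len(distractors) + 1")
    # Sorting gives a deterministic order that does not depend on which option is correct.
    options = sorted([p.answer] + p.distractors[: n_options - 1])
    lines = [f"{label}. {opt}" for label, opt in zip("ABCDEFGH", options)]
    return p.question + "\n" + "\n".join(lines) + "\nReply with one letter."


# Toy problem, invented for illustration only.
toy = Problem(
    question="A train travels 60 km in 1.5 hours. What is its average speed in km/h?",
    answer="40",
    distractors=["45", "90", "30"],
)
prompts = {
    "short_answer": render_short_answer(toy),
    "true_false": render_true_false(toy, candidate="45"),
    "multiple_choice_4": render_multiple_choice(toy, 4),
    "multiple_choice_2": render_multiple_choice(toy, 2),
}
```

Because the question string is identical in every rendering, any accuracy gap between formats in such a setup would be attributable to the added instructions and options rather than to a reworded problem.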
minor comments (3)
- Abstract: the sources and sizes of the quantitative and deductive reasoning datasets should be stated so readers can gauge the scale and representativeness of the evaluation.
- Figure captions: error bars or per-model variance should be added to performance plots to allow visual assessment of consistency across the five LLMs.
- Related Work: a brief citation to prior studies on prompt sensitivity (e.g., on multiple-choice vs. open-ended formats) would better situate the novelty of the current comparison.
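The error bars requested for the figures could be computed along these lines. This is a minimal sketch that assumes one accuracy value per model per format; the numbers are invented placeholders and the `error_bar` helper is hypothetical.

```python
from statistics import mean, stdev

# Invented accuracies for five placeholder models under each format.
accuracy = {
    "multiple-choice": [0.80, 0.70, 0.75, 0.85, 0.90],
    "true/false": [0.60, 0.65, 0.55, 0.70, 0.50],
    "short-answer": [0.50, 0.50, 0.50, 0.50, 0.50],
}


def error_bar(values):
    """Return (mean, sample standard deviation) across models."""
    return mean(values), stdev(values)


bars = {fmt: error_bar(vals) for fmt, vals in accuracy.items()}
for fmt, (m, s) in bars.items():
    print(f"{fmt}: {m:.3f} +/- {s:.3f}")
```

The sample standard deviation across models is only one possible choice of error bar; a per-model bootstrap interval over test items would answer a different question, namely how stable each model's own accuracy is.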
Simulated Author's Rebuttal
We thank the referee for the positive assessment and constructive suggestions. Both major comments identify areas where additional detail will improve verifiability and reproducibility; we have revised the manuscript accordingly.
Point-by-point responses
- Referee: Results section: the headline claim of 'significant differences' across question types requires explicit reporting of the statistical test, degrees of freedom, and exact p-values (or correction method) for each pairwise comparison; without these the robustness of the first key finding cannot be verified from the presented numbers alone.
  Authors: We agree that formal statistical support is necessary to substantiate the claim. The revised manuscript now includes a dedicated paragraph in the Results section reporting the statistical procedure (paired McNemar tests for accuracy comparisons, with Bonferroni correction for multiple pairwise tests across question types and models). We report the test statistic, degrees of freedom, exact p-values, and effect sizes for all relevant comparisons. These additions directly address the concern and allow readers to assess the robustness of the reported differences.
  Revision: yes
- Referee: §3.2 Experimental Setup: the three question formats are described at a high level, but concrete prompt templates and the exact number of instances per task type are not supplied, which is load-bearing for assessing whether format effects are isolated from prompt wording or dataset size confounds.
  Authors: We accept this point. The revised §3.2 now contains the full verbatim prompt templates for each of the three question formats (multiple-choice, true/false, and open-ended) in a new appendix, together with the exact instance counts per task (quantitative and deductive) and per format. These details confirm that the same underlying problems were used across formats and that dataset sizes were balanced, thereby isolating the format variable from wording or size confounds.
  Revision: yes
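The procedure named in the first response, paired McNemar tests with Bonferroni correction, can be sketched with the standard library alone. This sketch uses the exact binomial form of McNemar's test, which has no degrees of freedom (the chi-square approximation has one); whether the authors used the exact or the approximate form is not stated. The function names and the discordant counts are hypothetical.

```python
from math import comb


def mcnemar_exact(b, c):
    """Two-sided exact McNemar p-value from the discordant counts.

    b: items answered correctly under format A but incorrectly under format B
    c: items answered incorrectly under format A but correctly under format B
    """
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs, so no evidence of a difference
    k = min(b, c)
    # Under the null, each discordant pair falls either way with probability 0.5.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


def bonferroni(p_values):
    """Multiply each p-value by the number of comparisons, capped at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


# Invented discordant counts for three pairwise format comparisons.
pairs = {"MC vs TF": (10, 2), "MC vs SA": (5, 5), "TF vs SA": (0, 8)}
raw = {name: mcnemar_exact(b, c) for name, (b, c) in pairs.items()}
adjusted = dict(zip(raw, bonferroni(list(raw.values()))))
for name in pairs:
    print(f"{name}: raw p = {raw[name]:.4f}, Bonferroni p = {adjusted[name]:.4f}")
```

McNemar's test fits this design because the same problems are answered under both formats, so only the items where the two formats disagree carry information about a format effect.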
Circularity Check
Empirical comparison study with no derivation chain
This is a straightforward empirical study comparing LLM accuracy across three question formats on quantitative and deductive reasoning tasks. No equations, fitted parameters, or mathematical derivations are present. Central claims rest on direct experimental measurements (accuracy on reasoning steps vs. final answers) across five models, with no self-citation load-bearing steps or reductions of results to inputs by construction. The design isolates format effects through controlled task implementations, making the reported differences falsifiable against the data shown.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Accuracy on reasoning steps and final answer selection can be measured independently and meaningfully for the chosen tasks.
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/RealityFromDistinction.lean: reality_from_one_distinction (unclear)
  unclear: Relation between the paper passage and the cited Recognition theorem.
  Key Findings: (1) Significant differences exist in LLM performance across different question types. (2) Reasoning accuracy does not necessarily correlate with the final selection accuracy.
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.