Is Large Language Model Performance on Reasoning Tasks Impacted by Different Ways Questions Are Asked?

Mohna Chakraborty; Qi Li; Seok Hwan Song; Wallapak Tavanapong

arXiv: 2507.15707 · v2 · submitted 2025-07-21 · cs.CL · cs.AI

Pith reviewed 2026-05-19 03:49 UTC · model grok-4.3

keywords: large language models, reasoning tasks, question formats, multiple choice, performance evaluation, deductive reasoning, quantitative reasoning

The pith

The way questions are asked changes how accurately large language models reason and select final answers.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests five large language models on quantitative and deductive reasoning tasks presented in three question formats: multiple-choice, true/false, and open-ended. It measures accuracy both in the intermediate reasoning steps and in the choice of the final answer. The results show clear performance gaps across formats, with the number of options and specific word choices playing a role. Reasoning-step accuracy does not always predict whether the model picks the right final answer. This matters because common evaluation methods may give a misleading picture of model capability, depending on how the test questions are written.

Core claim

Large language models exhibit significantly different performance on the same reasoning tasks when the questions are posed in multiple-choice, true/false, or short-answer formats. Accuracy in the reasoning steps does not necessarily match accuracy in selecting the final answer. Both the number of answer options and the specific wording used in the questions affect the outcomes.

What carries the argument

Side-by-side comparison of three question formats on quantitative and deductive reasoning tasks, tracking separate accuracies for reasoning steps and final-answer selection.

If this is right

LLM evaluations should test models with multiple question formats instead of a single style.
The number of options in a question can shift model accuracy on reasoning problems.
Word choice in questions influences both reasoning accuracy and final selection accuracy.
Separate metrics for reasoning steps and final answers are needed because they can diverge.
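
The last point can be made concrete with a small scoring sketch. It is not the paper's evaluation code: the record layout (`format`, `reasoning_correct`, `final_correct`) and the toy data are assumptions chosen for illustration. It reports, per question format, the reasoning-step accuracy, the final-answer accuracy, and the share of items where the two disagree.

```python
from collections import defaultdict

def accuracy_by_format(records):
    """Return {format: (reasoning_acc, final_acc, decoupled_rate)}.

    Each record is a dict with hypothetical keys:
    'format', 'reasoning_correct', 'final_correct'.
    """
    buckets = defaultdict(list)
    for r in records:
        buckets[r["format"]].append(r)

    out = {}
    for fmt, rows in buckets.items():
        n = len(rows)
        reasoning = sum(r["reasoning_correct"] for r in rows) / n
        final = sum(r["final_correct"] for r in rows) / n
        # Items where the reasoning trace and the final choice disagree.
        decoupled = sum(
            r["reasoning_correct"] != r["final_correct"] for r in rows
        ) / n
        out[fmt] = (reasoning, final, decoupled)
    return out

# Toy data, invented for illustration only.
records = [
    {"format": "mcq", "reasoning_correct": True, "final_correct": False},
    {"format": "mcq", "reasoning_correct": True, "final_correct": True},
    {"format": "tf", "reasoning_correct": False, "final_correct": True},
    {"format": "tf", "reasoning_correct": True, "final_correct": True},
    {"format": "open", "reasoning_correct": True, "final_correct": True},
    {"format": "open", "reasoning_correct": True, "final_correct": True},
]
summary = accuracy_by_format(records)
for fmt, (reasoning, final, decoupled) in summary.items():
    print(f"{fmt}: reasoning={reasoning:.2f} final={final:.2f} decoupled={decoupled:.2f}")
```

Reporting the disagreement rate alongside the two accuracies keeps a high final-answer score from hiding faulty reasoning, and the reverse.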

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Benchmarks that use only one question format may overestimate or underestimate true reasoning ability.
Training or prompting methods could be developed to make models less sensitive to format changes.
Practical applications should verify performance using the exact question styles the system will face.

Load-bearing premise

The chosen reasoning tasks and the particular ways the three question types were written are representative enough to show the general effect of question format.

What would settle it

Repeating the tests with identical wording and option counts across formats and finding no meaningful performance differences would undermine the claim.
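
One way to hold wording constant across formats is to render every variant from the same problem stem. The sketch below is an illustration under stated assumptions: the function name `render_formats`, the template strings, and the sample problem are all hypothetical and are not the prompts used in the paper.

```python
def render_formats(stem, answer, distractors):
    """Render one problem as MCQ, true/false, and open-ended prompts.

    The stem is reused verbatim in all three variants so that only the
    answer format changes. Templates here are hypothetical.
    """
    options = sorted([answer, *distractors])  # deterministic option order
    labeled = [f"{chr(ord('A') + i)}. {opt}" for i, opt in enumerate(options)]
    return {
        "mcq": f"{stem}\n" + "\n".join(labeled) + "\nAnswer with one letter.",
        "tf": (
            f"{stem}\nProposed answer: {answer}\n"
            "Is the proposed answer correct? Answer True or False."
        ),
        "open": f"{stem}\nGive the final answer.",
    }

# Invented sample problem, for illustration only.
stem = "Tom has 3 apples and buys 4 more. How many apples does he have?"
prompts = render_formats(stem, 7, [6, 8, 12])
print(prompts["mcq"])
```

Varying the length of `distractors` while keeping the stem fixed would give a matching check on the option-count effect.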

Figures

Figures reproduced from arXiv: 2507.15707 by Mohna Chakraborty, Qi Li, Seok Hwan Song, Wallapak Tavanapong.

**Figure 1.** Examples of different types of questions generated from the original problem. Variables in italics are ...

**Figure 2.** Examples of different patterns of incorrect outputs by LLMs for MCQ questions.

**Figure 3.** Percent of incorrect pattern outputs by LLMs.

Original abstract

Large Language Models (LLMs) have been evaluated using diverse question types, e.g., multiple-choice, true/false, and short/long answers. This study answers an unexplored question about the impact of different question types on LLM accuracy on reasoning tasks. We investigate the performance of five LLMs on three different types of questions using quantitative and deductive reasoning tasks. The performance metrics include accuracy in the reasoning steps and in choosing the final answer. Key Findings: (1) Significant differences exist in LLM performance across different question types. (2) Reasoning accuracy does not necessarily correlate with the final selection accuracy. (3) The number of options and the choice of words influence LLM performance.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

In a straightforward empirical test, question format changes LLM accuracy on reasoning tasks, and step-by-step accuracy can decouple from final-answer accuracy.

The main takeaway is that how you ask the question affects LLM performance on these reasoning tasks, and that getting the reasoning steps right does not always mean the model picks the right final answer. They ran five models on quantitative and deductive problems, scored the reasoning trace separately from the final choice, and found consistent differences across three question formats. The decoupling observation is the piece that stands out as new; earlier prompt-sensitivity work exists, but measuring the two accuracies apart gives a clearer picture of where the models are actually failing or succeeding. The experiment stays simple, covers multiple models and task types, and the reported patterns match what they show in the data. No hidden contradictions or uncontrolled variables jump out that would break the central claim.

The soft spots are mostly about scope and detail. The tasks stay narrow, so it is not yet clear how far the pattern travels to other reasoning domains or model scales. More on exact prompt wording, run-to-run variance, and statistical checks would make the differences easier to trust at face value, though the stress-test indicates the differences are present in the results they have.

This is useful reading for anyone who designs or interprets reasoning benchmarks. It is not a theoretical leap, but it is a practical reminder that evaluation numbers are format-dependent. I would send it to peer review so the methods can be checked and the decoupling result can be tested for replication.

Referee Report

2 major / 3 minor

Summary. The manuscript presents an empirical study examining whether different question formats affect the performance of five LLMs on quantitative and deductive reasoning tasks. It compares three question types, separately scoring accuracy on reasoning steps versus final answer selection, and reports significant performance differences across formats, a decoupling between reasoning accuracy and final-answer accuracy, and influences from option count and wording.

Significance. If the reported differences prove robust, the work usefully demonstrates prompt-format sensitivity in LLM reasoning evaluations and the value of scoring intermediate reasoning traces independently of the final choice. The multi-model, multi-task design is a strength that supports the central empirical claims.

major comments (2)

Results section: the headline claim of 'significant differences' across question types requires explicit reporting of the statistical test, degrees of freedom, and exact p-values (or correction method) for each pairwise comparison; without these the robustness of the first key finding cannot be verified from the presented numbers alone.
§3.2 Experimental Setup: the three question formats are described at a high level, but concrete prompt templates and the exact number of instances per task type are not supplied, which is load-bearing for assessing whether format effects are isolated from prompt wording or dataset size confounds.

minor comments (3)

Abstract: the sources and sizes of the quantitative and deductive reasoning datasets should be stated so readers can gauge the scale and representativeness of the evaluation.
Figure captions: error bars or per-model variance should be added to performance plots to allow visual assessment of consistency across the five LLMs.
Related Work: a brief citation to prior studies on prompt sensitivity (e.g., on multiple-choice vs. open-ended formats) would better situate the novelty of the current comparison.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the positive assessment and constructive suggestions. Both major comments identify areas where additional detail will improve verifiability and reproducibility; we have revised the manuscript accordingly.

Referee: Results section: the headline claim of 'significant differences' across question types requires explicit reporting of the statistical test, degrees of freedom, and exact p-values (or correction method) for each pairwise comparison; without these the robustness of the first key finding cannot be verified from the presented numbers alone.

Authors: We agree that formal statistical support is necessary to substantiate the claim. The revised manuscript now includes a dedicated paragraph in the Results section reporting the statistical procedure (paired McNemar tests for accuracy comparisons, with Bonferroni correction for multiple pairwise tests across question types and models). We report the test statistic, degrees of freedom, exact p-values, and effect sizes for all relevant comparisons. These additions directly address the concern and allow readers to assess the robustness of the reported differences. revision: yes
Referee: §3.2 Experimental Setup: the three question formats are described at a high level, but concrete prompt templates and the exact number of instances per task type are not supplied, which is load-bearing for assessing whether format effects are isolated from prompt wording or dataset size confounds.

Authors: We accept this point. The revised §3.2 now contains the full verbatim prompt templates for each of the three question formats (multiple-choice, true/false, and open-ended) in a new appendix, together with the exact instance counts per task (quantitative and deductive) and per format. These details confirm that the same underlying problems were used across formats and that dataset sizes were balanced, thereby isolating the format variable from wording or size confounds. revision: yes
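
The statistical procedure named in the first simulated response (paired McNemar tests with Bonferroni correction) can be sketched with the standard library alone. This is a minimal illustration, not the authors' analysis code: the function names and toy inputs are assumptions, and the test shown is the exact two-sided binomial form of McNemar's test on per-item correctness for the same problems under two formats.

```python
from math import comb

def mcnemar_exact(correct_a, correct_b):
    """Exact two-sided McNemar p-value for paired correctness lists."""
    # Discordant pairs: right under A only, and right under B only.
    b = sum(a and not bb for a, bb in zip(correct_a, correct_b))
    c = sum(bb and not a for a, bb in zip(correct_a, correct_b))
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs, no evidence of a difference
    k = min(b, c)
    # Binomial(n, 0.5) lower tail, doubled for a two-sided test.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)

def bonferroni(p_values):
    """Multiply each p-value by the number of comparisons, capped at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]

# Toy paired outcomes for one model on the same six problems (invented).
mcq_correct = [True] * 6
open_correct = [False] * 6
p = mcnemar_exact(mcq_correct, open_correct)
print(p)                             # 0.03125
print(bonferroni([p, 0.5, 0.25]))    # [0.09375, 1.0, 0.75]
```

With three formats there are three pairwise comparisons per model, so each raw p-value is multiplied by at least three before it is compared with the significance threshold.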

Circularity Check

0 steps flagged

Empirical comparison study with no derivation chain

This is a straightforward empirical study comparing LLM accuracy across three question formats on quantitative and deductive reasoning tasks. No equations, fitted parameters, or mathematical derivations are present. Central claims rest on direct experimental measurements (accuracy on reasoning steps vs. final answers) across five models, with no self-citation load-bearing steps or reductions of results to inputs by construction. The design isolates format effects through controlled task implementations, making the reported differences falsifiable against the data shown.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The study is empirical and relies on standard assumptions in LLM evaluation; no free parameters, invented entities, or non-standard axioms are described in the abstract.

axioms (1)

domain assumption: Accuracy on reasoning steps and final answer selection can be measured independently and meaningfully for the chosen tasks.
Implicit in the separation of the two performance metrics described in the abstract.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction unclear
Relation between the paper passage and the cited Recognition theorem.

Key Findings: (1) Significant differences exist in LLM performance across different question types. (2) Reasoning accuracy does not necessarily correlate with the final selection accuracy.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.


[1] [1]

Nishant Balepur, Abhilasha Ravichander, and Rachel Rudinger. 2024. https://doi.org/10.18653/v1/2024.acl-long.555 Artifacts or abduction: How do LLM s answer multiple-choice questions without the question? In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 10308--10330, Bangkok, Thailan...

work page doi:10.18653/v1/2024.acl-long.555 2024

[2] [2]

Sumithra Bhakthavatsalam, Daniel Khashabi, Tushar Khot, Bhavana Dalvi Mishra, Kyle Richardson, Ashish Sabharwal, Carissa Schoenick, Oyvind Tafjord, and Peter Clark. 2021. https://arxiv.org/abs/2102.03315 Think you have solved direct-answer question answering? try arc-da, the direct-answer AI2 reasoning challenge . CoRR, abs/2102.03315

work page arXiv 2021

[3] [3]

Yonatan Bisk, Rowan Zellers, Ronan Le Bras, Jianfeng Gao, and Yejin Choi. 2019. https://api.semanticscholar.org/CorpusID:208290939 Piqa: Reasoning about physical commonsense in natural language . In AAAI Conference on Artificial Intelligence

work page 2019

[4] [4]

Christopher Clark, Kenton Lee, Ming-Wei Chang, Tom Kwiatkowski, Michael Collins, and Kristina Toutanova. 2019. https://doi.org/10.18653/v1/N19-1300 B ool Q : Exploring the surprising difficulty of natural yes/no questions . In Proceedings of the 2019 Conference of the North A merican Chapter of the Association for Computational Linguistics: Human Language...

work page doi:10.18653/v1/n19-1300 2019

[5] [5]

Peter Clark, Oyvind Tafjord, and Kyle Richardson. 2020. Transformers as soft reasoners over language. arXiv preprint arXiv:2002.05867

work page arXiv 2020

[6] [6]

Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. 2021. https://arxiv.org/pdf/2110.14168.pdf Training verifiers to solve math word problems . arXiv preprint arXiv:2110.14168

work page internal anchor Pith review Pith/arXiv arXiv 2021

[7] [7]

Dheeru Dua, Yizhong Wang, Pradeep Dasigi, Gabriel Stanovsky, Sameer Singh, and Matt Gardner. 2019. https://doi.org/10.18653/v1/N19-1246 DROP : A reading comprehension benchmark requiring discrete reasoning over paragraphs . In Proceedings of the 2019 Conference of the North A merican Chapter of the Association for Computational Linguistics: Human Language...

work page doi:10.18653/v1/n19-1246 2019

[8] [8]

Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, et al. 2024. The llama 3 herd of models. arXiv preprint arXiv:2407.21783

work page internal anchor Pith review Pith/arXiv arXiv 2024

[9] [9]

Ali Emami, Paul Trichelair, Adam Trischler, Kaheer Suleman, Hannes Schulz, and Jackie Chi Kit Cheung. 2019. https://doi.org/10.18653/v1/P19-1386 The K now R ef coreference corpus: Removing gender and number cues for difficult pronominal anaphora resolution . In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages ...

work page doi:10.18653/v1/p19-1386 2019

[10] [10]

J \"o rg Frohberg and Frank Binder. 2022. https://aclanthology.org/2022.lrec-1.229 CRASS : A novel data set and benchmark to test counterfactual reasoning of large language models . In Proceedings of the Thirteenth Language Resources and Evaluation Conference, pages 2126--2140, Marseille, France. European Language Resources Association

work page 2022

[11] [11]

Gordon, Zornitsa Kozareva, and Melissa Roemmele

Andrew S. Gordon, Zornitsa Kozareva, and Melissa Roemmele. 2011. https://api.semanticscholar.org/CorpusID:434646 Choice of plausible alternatives: An evaluation of commonsense causal reasoning . In AAAI Spring Symposium: Logical Formalizations of Commonsense Reasoning

work page 2011

[12] [12]

Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. 2021. Measuring massive multitask language understanding. Proceedings of the International Conference on Learning Representations (ICLR)

work page 2021

[13]

Jie Huang and Kevin Chen-Chuan Chang. 2023. https://doi.org/10.18653/v1/2023.findings-acl.67 Towards reasoning in large language models: A survey. In Findings of the Association for Computational Linguistics: ACL 2023, pages 1049--1065, Toronto, Canada. Association for Computational Linguistics

[14]

Mandar Joshi, Eunsol Choi, Daniel Weld, and Luke Zettlemoyer. 2017. https://doi.org/10.18653/v1/P17-1147 TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1601--1611, Vancouver, Canada. Assoc...

[15]

Rik Koncel-Kedziorski, Subhro Roy, Aida Amini, Nate Kushman, and Hannaneh Hajishirzi. 2016. MAWPS: A math word problem repository. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1152--1157

[16]

Tom Kwiatkowski, Jennimaria Palomaki, Olivia Redfield, Michael Collins, Ankur Parikh, Chris Alberti, Danielle Epstein, Illia Polosukhin, Jacob Devlin, Kenton Lee, Kristina Toutanova, Llion Jones, Matthew Kelcey, Ming-Wei Chang, Andrew M. Dai, Jakob Uszkoreit, Quoc Le, and Slav Petrov. 2019. https://doi.org/10.1162/tacl_a_00276 Natural questions: A benchma...

[17]

Wangyue Li, Liangzhi Li, Tong Xiang, Xiao Liu, Wei Deng, and Noa Garcia. 2024. Can multiple-choice questions really be useful in detecting the abilities of LLMs? arXiv preprint arXiv:2403.17752

[18]

Bill Yuchen Lin, Seyeon Lee, Rahul Khanna, and Xiang Ren. 2020. https://doi.org/10.18653/v1/2020.emnlp-main.557 Birds have four legs?! NumerSense: Probing Numerical Commonsense Knowledge of Pre-Trained Language Models. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 6862--6868, Online....

[19]

Leora Morgenstern, Ernest Davis, and Charles L. Ortiz. 2016. https://doi.org/10.1609/aimag.v37i1.2639 Planning, executing, and evaluating the Winograd schema challenge. AI Magazine, 37(1):50--54

[20]

Tri Nguyen, Mir Rosenberg, Xia Song, Jianfeng Gao, Saurabh Tiwary, Rangan Majumder, and Li Deng. 2016. https://www.microsoft.com/en-us/research/publication/ms-marco-human-generated-machine-reading-comprehension-dataset/ MS MARCO: A human generated machine reading comprehension dataset

[21]

OpenAI. 2023. https://arxiv.org/abs/2303.08774 GPT-4 technical report. Preprint, arXiv:2303.08774

[22]

Denis Paperno, Germán Kruszewski, Angeliki Lazaridou, Ngoc Quan Pham, Raffaella Bernardi, Sandro Pezzelle, Marco Baroni, Gemma Boleda, and Raquel Fernández. 2016. https://doi.org/10.18653/v1/P16-1144 The LAMBADA dataset: Word prediction requiring a broad discourse context. In Proceedings of the 54th Annual Meeting of the Association for Computati...

[23]

Arkil Patel, Satwik Bhattamishra, and Navin Goyal. 2021. Are NLP models really able to solve simple math word problems? In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 2080--2094

[24]

Shuofei Qiao, Yixin Ou, Ningyu Zhang, Xiang Chen, Yunzhi Yao, Shumin Deng, Chuanqi Tan, Fei Huang, and Huajun Chen. 2023. https://doi.org/10.18653/v1/2023.acl-long.294 Reasoning with language model prompting: A survey. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 5368--5393, Tor...

[25]

Altaf Rahman and Vincent Ng. 2012. https://aclanthology.org/D12-1071 Resolving complex cases of definite pronouns: The Winograd schema challenge. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pages 777--789, Jeju Island, Korea. Association for Computational Li...

[26]

Rachel Rudinger, Jason Naradowsky, Brian Leonard, and Benjamin Van Durme. 2018. Gender bias in coreference resolution. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, New Orleans, Louisiana. Association for Computational Linguistics

[27]

Keisuke Sakaguchi, Ronan Le Bras, Chandra Bhagavatula, and Yejin Choi. 2021. https://doi.org/10.1145/3474381 WinoGrande: An adversarial Winograd schema challenge at scale. Communications of the ACM, 64(9):99--106

[28]

Soumya Sanyal, Zeyi Liao, and Xiang Ren. 2022. RobustLR: A diagnostic benchmark for evaluating logical robustness of deductive reasoners. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 9614--9631

[29]

Freda Shi, Xinyun Chen, Kanishka Misra, Nathan Scales, David Dohan, Ed H Chi, Nathanael Schärli, and Denny Zhou. 2023. Large language models can be easily distracted by irrelevant context. In International Conference on Machine Learning, pages 31210--31227. PMLR

[30]

Seok Hwan Song and Wallapak Tavanapong. 2024. How much do prompting methods help LLMs on quantitative reasoning with irrelevant information? In Proceedings of the 33rd ACM International Conference on Information and Knowledge Management, pages 2128--2137

[31]

Oyvind Tafjord, Bhavana Dalvi Mishra, and Peter Clark. 2020. ProofWriter: Generating implications, proofs, and abductive statements over natural language. arXiv preprint arXiv:2012.13048

[32]

Gemma Team, Thomas Mesnard, Cassidy Hardin, Robert Dadashi, Surya Bhupatiraju, Shreya Pathak, Laurent Sifre, Morgane Rivière, Mihir Sanjay Kale, Juliette Love, et al. 2024. Gemma: Open models based on Gemini research and technology. arXiv preprint arXiv:2403.08295

[33]

Alex Wang, Yada Pruksachatkun, Nikita Nangia, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. 2019. SuperGLUE: A stickier benchmark for general-purpose language understanding systems. Curran Associates Inc., Red Hook, NY, USA

[34]

Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel Bowman. 2018. https://doi.org/10.18653/v1/W18-5446 GLUE: A multi-task benchmark and analysis platform for natural language understanding. In Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pages 353--355, Brussels, Be...

[35]

Xinpeng Wang, Bolei Ma, Chengzhi Hu, Leon Weber-Genzel, Paul Röttger, Frauke Kreuter, Dirk Hovy, and Barbara Plank. 2024. https://doi.org/10.18653/v1/2024.findings-acl.441 "My answer is C": First-token probabilities do not match text answers in instruction-tuned language models. In Findings of the Association for Computational Linguistics ACL 20...

[36]

Johannes Welbl, Nelson F. Liu, and Matt Gardner. 2017. https://doi.org/10.18653/v1/W17-4413 Crowdsourcing multiple choice science questions. In Proceedings of the 3rd Workshop on Noisy User-generated Text, pages 94--106, Copenhagen, Denmark. Association for Computational Linguistics

[37]

Jason Weston, Antoine Bordes, Sumit Chopra, Alexander M. Rush, Bart Van Merriënboer, Armand Joulin, and Tomas Mikolov. 2016. https://www.scopus.com/inward/record.uri?eid=2-s2.0-85083951707&partnerID=40&md5=ca2789841ffaaa76da95cccab2acc690 Towards AI-complete question answering: A set of prerequisite toy tasks

[38]

Frank Wilcoxon. 1992. https://doi.org/10.1007/978-1-4612-4380-9_16 Individual Comparisons by Ranking Methods, pages 196--202. Springer New York, New York, NY

[39]

Adina Williams, Nikita Nangia, and Samuel Bowman. 2018. https://doi.org/10.18653/v1/N18-1101 A broad-coverage challenge corpus for sentence understanding through inference. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 1112...

[40]

Fei Yu, Hongbo Zhang, Prayag Tiwari, and Benyou Wang. 2024. Natural language reasoning, a survey. ACM Computing Surveys, 56(12):1--39

[41]

Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. 2019. https://doi.org/10.18653/v1/P19-1472 HellaSwag: Can a machine really finish your sentence? In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4791--4800, Florence, Italy. Association for Computational Linguistics

[42]

Chujie Zheng, Hao Zhou, Fandong Meng, Jie Zhou, and Minlie Huang. 2023. Large language models are not robust multiple choice selectors. In The Twelfth International Conference on Learning Representations
