arxiv: 2604.04944 · v1 · submitted 2026-03-15 · 💻 cs.CL · cs.AI


Inclusion-of-Thoughts: Mitigating Preference Instability via Purifying the Decision Space

Mohammad Reza Ghasemi Madani, Soyeon Caren Han, Shuo Yang, Jey Han Lau


Pith reviewed 2026-05-15 11:27 UTC · model grok-4.3

classification 💻 cs.CL cs.AI

keywords Inclusion-of-Thoughts · preference instability · distractors · multiple-choice questions · chain-of-thought · LLM reasoning · self-filtering · decision stability


The pith

Inclusion-of-Thoughts purifies multiple-choice questions by removing implausible distractors to stabilize LLM reasoning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Large language models often waver between correct and wrong answers in multiple-choice questions because plausible but incorrect options divert their attention. The Inclusion-of-Thoughts method counters this by having the model progressively identify and remove implausible choices, then reconstruct the question using only the remaining plausible options. This creates a cleaner setting for the model to compare answers and maintain consistent internal reasoning. The approach also records each filtering step, making the decision process more transparent. Extensive tests show it improves chain-of-thought results on arithmetic, commonsense, and educational tasks while adding little computational cost.

Core claim

By reconstructing multiple-choice questions to include only plausible option choices through a progressive self-filtering process, the model mitigates instability of preferences under distractors, enabling more effective focus on comparative judgements and thereby boosting chain-of-thought performance across arithmetic, commonsense reasoning, and educational benchmarks.

What carries the argument

Inclusion-of-Thoughts (IoT), a progressive self-filtering strategy that reconstructs the MCQ by retaining only plausible options and documents the steps to reduce cognitive load from distractors.
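The stages described in the Figure 2 caption and in the quoted passages further down this page can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the `ask` callable is a hypothetical stand-in for an LLM query, and the rule that picking the placeholder ends the process after stage 2 is an assumption inferred from the figure captions noting that some cases skip stage 3.

```python
from typing import Callable, Dict, List, Sequence

NONE_OPTION = "none of the options"

# Hypothetical model interface (an assumption of this sketch): given a question
# and its option strings, return the text of the chosen option.
AskFn = Callable[[str, Sequence[str]], str]


def _checked(choice: str, offered: Sequence[str]) -> str:
    """Reject answers that are not among the offered options."""
    if choice not in offered:
        raise ValueError(f"model returned an option that was not offered: {choice!r}")
    return choice


def inclusion_of_thoughts(question: str, options: Sequence[str], ask: AskFn) -> Dict[str, object]:
    """Run the three stages and return the answer, the call count, and a log."""
    log: List[dict] = []

    # Stage 1: initial preference over the full option set O.
    first = _checked(ask(question, options), options)
    log.append({"stage": 1, "options": list(options), "choice": first})

    # Stage 2: O' = (O \ {first}) ∪ {none of the options}.
    perturbed = [o for o in options if o != first] + [NONE_OPTION]
    second = _checked(ask(question, perturbed), perturbed)
    log.append({"stage": 2, "options": perturbed, "choice": second})

    # Assumption: choosing the placeholder means the model stands by its
    # first choice, so stage 3 is skipped.
    if second == NONE_OPTION:
        return {"answer": first, "calls": 2, "log": log}

    # Stage 3: reduced MCQ holding only the two model-selected candidates.
    reduced = [first, second]
    final = _checked(ask(question, reduced), reduced)
    log.append({"stage": 3, "options": reduced, "choice": final})
    return {"answer": final, "calls": 3, "log": log}


def stub_ask(question: str, options: Sequence[str]) -> str:
    """Toy stand-in for an LLM: distracted by "5" unless only two options remain."""
    if len(options) == 2 and "4" in options:
        return "4"
    if "5" in options:
        return "5"
    if "4" in options:
        return "4"
    return options[-1]


if __name__ == "__main__":
    result = inclusion_of_thoughts("What is 2 + 2?", ["3", "4", "5", "6"], stub_ask)
    for entry in result["log"]:
        print(entry)
    print("answer:", result["answer"], "model calls:", result["calls"])
```

The returned log mirrors the transparency claim: every stage records which options were shown and which one was picked. The call count is at most three per question under these assumptions, which is the sense in which the overhead is bounded.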

If this is right

Chain-of-thought accuracy rises on arithmetic benchmarks.
Performance improves on commonsense reasoning tasks.
Results strengthen on educational multiple-choice evaluations.
Decision transparency increases through explicit logging of each removed option.
Computational overhead stays minimal while stability under option perturbation grows.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Applying the same purification step to generated distractors in open-ended questions could extend stability benefits beyond fixed MCQs.
Analyzing which options get filtered might expose specific reasoning shortcuts the model uses.
Embedding this filtering into repeated self-correction loops could compound gains on harder reasoning problems.

Load-bearing premise

The model can reliably identify and remove only implausible distractors in a progressive self-filtering process without discarding useful information or introducing new biases in the reconstructed question.

What would settle it

An experiment on a benchmark where the model is forced to classify a correct answer as implausible during filtering, resulting in either no accuracy gain or a performance drop compared to standard chain-of-thought prompting.

Figures

Figures reproduced from arXiv: 2604.04944 by Jey Han Lau, Mohammad Reza Ghasemi Madani, Shuo Yang, Soyeon Caren Han.

**Figure 1.** Inclusion-of-Thoughts Framework is a self-filtering strategy that reconstructs multiple-choice questions using only plausible options to mitigate model instability and enhance performance. … that mimic the reasoning process a person might employ in solving a task. It has been observed that CoT prompting significantly improves model performance across a variety of multi-step reasoning tasks (Wei et al., 202…

**Figure 2.** Inclusion-of-Thoughts Framework. The pipeline allows the model to choose up to two options. Then the model looks at its preferences and decides the final answer in isolation. We remove the initial (stage 1) selection $o^*_1$ from $O$ and replace it with a neutral placeholder, “none of the options”, yielding the modified option set $O' = (O \setminus \{o^*_1\}) \cup \{\text{none of the options}\}$. The model is then queried again on …

**Figure 3.** The figure illustrates the effect of options …

**Figure 4.** Results of Olmo-2-7B categorized by benchmark type. While the margin in Commonsense or Education is more significant, IoT achieves performance comparable to self-consistency (SC) with much lower computational cost. 3.3 Transition Analysis: Beyond aggregate accuracy, IoT enables a fine-grained analysis of how model predictions evolve under option perturbation, revealing distinct transition patterns that …

**Figure 5.** Answer transition process of Olmo-2-7B over the OBQA dataset. The final improvement is the difference between FTT and TFF. The TFF and FTF cases are of critical importance for our analysis. These instances are characterized by the model identifying the correct answer within its intermediate steps but selecting an incorrect final prediction. Specifically, we hypothesize that when the model ranks the correct…

**Figure 6.** The portion of noisy samples agreed across …

**Figure 7.** Cost-Benefit. Vertical axes illustrate the accuracy of each method, while the horizontal axes show the …

**Figure 8.** Example of TFT case from IoT (OBQA, Olmo-2-7b).

**Figure 9.** Example of TTT case from IoT (CSQA, Olmo-2-7b). Here, the IoT stops the process and skips stage 3

**Figure 10.** Example of FTT case from IoT (CSQA, Olmo-2-7b).

**Figure 11.** Example of FFF case from IoT (AQUA, Olmo-2-7b). Here, the IoT stops the process and skips stage 3

**Figure 12.** Example of FTT case from IoT (GSM8K-MC, Olmo-2-7b).
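The three-letter labels in the captions above (TFT, TTT, FTT, FFF, TFF, FTF) can be tallied with a short script. The sketch below assumes that each letter marks whether the stage 1 pick, the stage 2 pick, and the final answer were correct, in that order; the captions are consistent with that reading but do not define it outright. The record format and the toy data are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple

# One record per question: (stage1_correct, stage2_correct, final_correct).
# This record layout is an assumption made for illustration.
Record = Tuple[bool, bool, bool]


def transition_label(record: Record) -> str:
    """Map three correctness flags to a label such as 'FTT' or 'TFF'."""
    return "".join("T" if ok else "F" for ok in record)


def summarize_transitions(records: Iterable[Record]) -> Dict[str, object]:
    """Count transition patterns and derive the net change in accuracy."""
    counts = Counter(transition_label(r) for r in records)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no records to summarize")

    # Questions that start wrong and end right are gains;
    # questions that start right and end wrong are losses.
    gained = sum(n for label, n in counts.items() if label[0] == "F" and label[2] == "T")
    lost = sum(n for label, n in counts.items() if label[0] == "T" and label[2] == "F")
    stage1_correct = sum(n for label, n in counts.items() if label[0] == "T")
    final_correct = sum(n for label, n in counts.items() if label[2] == "T")

    return {
        "counts": dict(counts),
        "stage1_accuracy": stage1_correct / total,
        "final_accuracy": final_correct / total,
        "net_improvement": (gained - lost) / total,
    }


if __name__ == "__main__":
    # Toy data, not results from the paper.
    toy = (
        [(True, True, True)] * 5
        + [(False, True, True)] * 3
        + [(True, False, False)] * 1
        + [(False, False, False)] * 1
    )
    print(summarize_transitions(toy))
```

By construction, `net_improvement` equals `final_accuracy - stage1_accuracy`, which matches the Figure 5 statement that the final improvement is the difference between FTT and TFF whenever those are the only patterns in which the first and final answers disagree in correctness.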

Original abstract

Multiple-choice questions (MCQs) are widely used to evaluate large language models (LLMs). However, LLMs remain vulnerable to the presence of plausible distractors. This often diverts attention toward irrelevant choices, resulting in unstable oscillation between correct and incorrect answers. In this paper, we propose Inclusion-of-Thoughts (IoT), a progressive self-filtering strategy that is designed to mitigate this cognitive load (i.e., instability of model preferences under the presence of distractors) and enable the model to focus more effectively on plausible answers. Our method operates to reconstruct the MCQ using only plausible option choices, providing a controlled setting for examining comparative judgements and therefore the stability of the model's internal reasoning under perturbation. By explicitly documenting this filtering process, IoT also enhances the transparency and interpretability of the model's decision-making. Extensive empirical evaluation demonstrates that IoT substantially boosts chain-of-thought performance across a range of arithmetic, commonsense reasoning, and educational benchmarks with minimal computational overhead.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

IoT adds a progressive self-filtering step to MCQ prompts, but the abstract gives no numbers, and the filtering step risks dropping the correct answer with no backup check.


The new piece here is Inclusion-of-Thoughts, a prompt method that has the model step through the options in an MCQ and drop the ones it judges implausible before running chain-of-thought on the cleaned-up version. The goal is to cut the instability that comes from plausible-looking distractors pulling the model in different directions. They also note that logging the filter decisions makes the process more transparent than a plain prompt. That framing is straightforward and targets a real pain point in LLM evaluation on arithmetic, commonsense, and education tasks. If the full experiments show consistent gains with low overhead, the idea could be a practical addition to the prompt-engineering toolkit.

The abstract, however, states substantial boosts without any numbers, baselines, or statistical details, so the performance claim sits unsupported on the page we have. The bigger issue is that the same model does the filtering. If it mislabels the correct answer as implausible in an early round, that answer is gone and there is no described mechanism to recover it or cross-check against ground truth. This assumption is load-bearing and untested in the summary.

The method reads as an incremental variant on self-consistency and option pruning rather than a sharp departure from prior work. For readers who run a lot of MCQ benchmarks and want another prompt lever to try, the paper is worth pulling if the full version supplies the missing results and ablation checks. I would send it to referees so the empirical side can be examined properly rather than desk-rejecting on the abstract alone.

Referee Report

3 major / 2 minor

Summary. The paper proposes Inclusion-of-Thoughts (IoT), a progressive self-filtering strategy that reconstructs multiple-choice questions (MCQs) by retaining only options the LLM deems plausible. This is intended to reduce preference instability caused by distractors, improve chain-of-thought reasoning stability, and increase decision transparency. The central claim is that IoT yields substantial performance gains on arithmetic, commonsense, and educational benchmarks with negligible added cost.

Significance. If the empirical results and filtering reliability can be substantiated, IoT would supply a lightweight, model-internal procedure for stabilizing LLM judgments on MCQs without external oracles or retraining. The explicit logging of filtering steps could also aid interpretability studies. The approach is procedural rather than derived from first principles, so its value hinges entirely on whether the self-filtering step demonstrably preserves the correct answer while removing only distractors.

major comments (3)

[Abstract] Abstract: the assertion of 'substantial boosts' and 'extensive empirical evaluation' across benchmarks is unsupported by any quantitative results, baselines, accuracy deltas, statistical tests, or implementation details on how plausibility filtering is performed or validated.
[Method] Method (self-filtering procedure): the progressive filtering is performed by the same LLM whose preference instability is the problem being solved; no external ground-truth check, oracle, or error-correction mechanism is described, so an early misclassification of the correct answer as implausible would permanently exclude it from the reconstructed MCQ.
[Experiments] Experiments section: the weakest assumption (that the model reliably retains the correct answer while discarding only distractors) is not tested via ablation on filtering accuracy, false-negative rates on correct options, or comparison against an oracle-filtered baseline.

minor comments (2)

[Method] Notation for the reconstructed MCQ and the plausibility threshold is introduced without a formal definition or pseudocode, making the exact reconstruction step difficult to reproduce.
[Title/Abstract] The title uses 'Inclusion-of-Thoughts' while the abstract uses 'Inclusion-of-Thoughts (IoT)'; consistent capitalization and acronym placement would improve readability.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback on our manuscript. We address each major comment point by point below and outline the revisions we will make to strengthen the presentation and empirical support for Inclusion-of-Thoughts.


Referee: [Abstract] Abstract: the assertion of 'substantial boosts' and 'extensive empirical evaluation' across benchmarks is unsupported by any quantitative results, baselines, accuracy deltas, statistical tests, or implementation details on how plausibility filtering is performed or validated.

Authors: We agree that the abstract would be strengthened by including concrete quantitative details. Although the full manuscript reports accuracy deltas, baselines, and statistical tests in Section 4, the abstract currently summarizes these at a high level. In the revised version, we will update the abstract to include specific examples of performance gains (e.g., accuracy improvements on arithmetic and commonsense benchmarks) along with a brief description of the filtering validation process.

revision: yes

Referee: [Method] Method (self-filtering procedure): the progressive filtering is performed by the same LLM whose preference instability is the problem being solved; no external ground-truth check, oracle, or error-correction mechanism is described, so an early misclassification of the correct answer as implausible would permanently exclude it from the reconstructed MCQ.

Authors: The self-contained design is intentional to provide a lightweight, model-internal solution without external dependencies. The progressive filtering aims to reduce the risk of early misclassification by evaluating options iteratively. Nevertheless, we acknowledge the potential for error propagation if the correct answer is filtered out. We will revise the Method section to add an explicit discussion of this limitation, including analysis of failure modes and mitigation strategies such as multiple sampling rounds.

revision: partial

Referee: [Experiments] Experiments section: the weakest assumption (that the model reliably retains the correct answer while discarding only distractors) is not tested via ablation on filtering accuracy, false-negative rates on correct options, or comparison against an oracle-filtered baseline.

Authors: We concur that direct validation of the filtering reliability is important. While overall benchmark gains provide indirect evidence that the correct answer is typically retained, we did not include dedicated ablations on filtering accuracy. In the revised manuscript, we will add new experiments reporting the false-negative rate for correct options across datasets and, where ground-truth labels allow, a comparison to an oracle-filtered baseline to quantify how closely IoT approximates ideal filtering.

revision: yes
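The false-negative rate that the referee asks for and the authors promise could be measured along these lines. This is a rough sketch under stated assumptions: the input format (a gold answer paired with the options that survive filtering) is hypothetical, and the toy items are invented for illustration rather than taken from any benchmark.

```python
from typing import Dict, Iterable, Sequence, Tuple

# One item per question: (gold_answer, options_retained_after_filtering).
# This layout is an assumption made for illustration.
Item = Tuple[str, Sequence[str]]


def filtering_report(items: Iterable[Item]) -> Dict[str, float]:
    """Report how often filtering discards the gold answer."""
    total = 0
    dropped = 0
    for gold, retained in items:
        total += 1
        if gold not in retained:
            dropped += 1
    if total == 0:
        raise ValueError("no items to evaluate")

    false_negative_rate = dropped / total
    return {
        "false_negative_rate": false_negative_rate,
        # If the gold answer is gone, no later stage can recover it, so this
        # is a ceiling on final accuracy for a filter-then-answer pipeline.
        "accuracy_upper_bound": 1.0 - false_negative_rate,
    }


if __name__ == "__main__":
    # Toy items, not data from the paper.
    toy_items = [
        ("B", ["B", "C"]),
        ("A", ["A", "D"]),
        ("C", ["A", "B"]),  # gold answer filtered out
        ("D", ["D", "A"]),
    ]
    print(filtering_report(toy_items))
```

Comparing `accuracy_upper_bound` with the accuracy the method actually reaches would separate errors caused by filtering from errors made in the final comparison. An oracle-filtered baseline would correspond to a run in which the false-negative rate is zero by construction.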

Circularity Check

0 steps flagged

No circularity: procedural filtering strategy with empirical evaluation


The paper presents Inclusion-of-Thoughts as a progressive self-filtering procedure for reconstructing MCQs by removing implausible distractors, followed by chain-of-thought reasoning on the purified set. No equations, fitted parameters, or derivations are described that reduce to their own inputs by construction. The central claim rests on empirical performance gains across benchmarks rather than any self-referential mathematical structure or load-bearing self-citation chain. The method is self-contained as an algorithmic heuristic whose correctness is evaluated externally via standard benchmarks, with no renaming of known results or ansatz smuggling via prior work.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that LLMs can accurately self-identify plausible options and that removing distractors improves stability without side effects.

axioms (1)

domain assumption: LLMs exhibit preference instability due to plausible distractors in MCQs.
Stated directly in the abstract as the core problem being solved.

pith-pipeline@v0.9.0 · 5483 in / 1105 out tokens · 27862 ms · 2026-05-15T11:27:59.092589+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Relation between the paper passage and the cited Recognition theorem: unclear.
Passage: IoT operates in three main stages: Stage 1: Initial Preference Elicitation... Stage 2: Second Plausibility Assessment... Stage 3: Confined Final Inference... reconstructs a reduced MCQ consisting solely of the two most plausible model-selected candidates.

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
Relation between the paper passage and the cited Recognition theorem: unclear.
Passage: IoT achieves this by strategically perturbing the input to elicit and then isolate the model's top two preferences, followed by a final, unconstrained comparative judgement.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Reason in Chains, Learn in Trees: Self-Rectification and Grafting for Multi-turn Agent Policy Optimization
cs.AI · 2026-04 · unverdicted · novelty 6.0

T-STAR consolidates multi-turn trajectories into a Cognitive Tree for variance-reduced step-level advantages and surgical policy optimization via thought grafting at critical points.

Reference graph

Works this paper leans on

41 extracted references · 41 canonical work pages · cited by 1 Pith paper · 16 internal anchors

[3] AI@Meta. 2024. Llama 3 model card. https://github.com/meta-llama/llama3/blob/main/MODEL_CARD.md
[4] Daniel Andor, Luheng He, Kenton Lee, and Emily Pitler. 2019. Giving BERT a calculator: Finding operations and arguments with reading comprehension. Preprint, arXiv:1909.00109. https://arxiv.org/abs/1909.00109
[5] Nishant Balepur, Shramay Palta, and Rachel Rudinger. 2024. It’s not easy being wrong: Large language models struggle with process of elimination reasoning. In Findings of the Association for Computational Linguistics: ACL 2024, pages 10143–10166.
[6] BIG bench authors. 2023. Beyond the imitation game: Quantifying and extrapolating the capabilities of language models. Transactions on Machine Learning Research. https://openreview.net/forum?id=uyTL5Bvosj
[7] Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. 2018. Think you have solved question answering? Try ARC, the AI2 reasoning challenge. Preprint, arXiv:1803.05457. https://arxiv.org/abs/1803.05457
[8] Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. 2021. Training verifiers to solve math word problems. Preprint, arXiv:2110.14168. https://arxiv.org/abs/2110.14168
[9] Qihang Fu, Yongbin Qin, Ruizhang Huang, Yanping Chen, Yulin Zhou, and Lintao Long. 2025. https://doi.org/10.18653/v1/2025.acl-long.1051 Exclusion of thought: Mitigating cognitive load in large language models for enhanced reasoning in multiple-choice tasks. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume...
[10] Mor Geva, Ankit Gupta, and Jonathan Berant. 2020. Injecting numerical reasoning skills into language models. Preprint, arXiv:2004.04487. https://arxiv.org/abs/2004.04487
[11] Shibo Hao, Yi Gu, Haodi Ma, Joshua Jiahua Hong, Zhen Wang, Daisy Zhe Wang, and Zhiting Hu. 2023. Reasoning with language model is planning with world model. Preprint, arXiv:2305.14992. https://arxiv.org/abs/2305.14992
[12] Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. 2021. Measuring massive multitask language understanding. Preprint, arXiv:2009.03300. https://arxiv.org/abs/2009.03300
[13] Weisen Jiang, Han Shi, Longhui Yu, Zhengying Liu, Yu Zhang, Zhenguo Li, and James Kwok. 2024. Forward-backward reasoning in large language models for mathematical verification. In Findings of the Association for Computational Linguistics: ACL 2024, pages 6647–6661.
[14] Jaehun Jung, Lianhui Qin, Sean Welleck, Faeze Brahman, Chandra Bhagavatula, Ronan Le Bras, and Yejin Choi. 2022. Maieutic prompting: Logically consistent reasoning with recursive explanations. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 1266–1279.
[15] Tamera Lanham, Anna Chen, Ansh Radhakrishnan, Benoit Steiner, Carson Denison, Danny Hernandez, Dustin Li, Esin Durmus, Evan Hubinger, Jackson Kernion, and 1 others. 2023. Measuring faithfulness in chain-of-thought reasoning. arXiv preprint arXiv:2307.13702.
[16] Wang Ling, Dani Yogatama, Chris Dyer, and Phil Blunsom. 2017. Program induction by rationale generation: Learning to solve and explain algebraic word problems. Preprint, arXiv:1705.04146. https://arxiv.org/abs/1705.04146
[17] Qing Lyu, Shreya Havaldar, Adam Stein, Li Zhang, Delip Rao, Eric Wong, Marianna Apidianaki, and Chris Callison-Burch. 2023. Faithful chain-of-thought reasoning. In The 13th International Joint Conference on Natural Language Processing and the 3rd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics (IJCNLP-AACL 2023).
[18] Chenkai Ma and Xinya Du. 2023. POE: Process of elimination for multiple choice reasoning. Preprint, arXiv:2310.15575. https://arxiv.org/abs/2310.15575
[19] Aman Madaan, Niket Tandon, Prakhar Gupta, Skyler Hallinan, Luyu Gao, Sarah Wiegreffe, Uri Alon, Nouha Dziri, Shrimai Prabhumoye, Yiming Yang, Shashank Gupta, Bodhisattwa Prasad Majumder, Katherine Hermann, Sean Welleck, Amir Yazdanbakhsh, and Peter Clark. 2023. https://arxiv.org/abs/2303.17651 Self-refine: Iterative refinement with self-feedback. Preprin...
[20] Julian Michael, Salsabila Mahdi, David Rein, Jackson Petty, Julien Dirani, Vishakh Padmakumar, and Samuel R Bowman. 2023. Debate helps supervise unreliable experts. arXiv preprint arXiv:2311.08702.
[21] Todor Mihaylov, Peter Clark, Tushar Khot, and Ashish Sabharwal. 2018. Can a suit of armor conduct electricity? A new dataset for open book question answering. Preprint, arXiv:1809.02789. https://arxiv.org/abs/1809.02789
[22] Team OLMo, Pete Walsh, Luca Soldaini, Dirk Groeneveld, Kyle Lo, Shane Arora, Akshita Bhagia, Yuling Gu, Shengyi Huang, Matt Jordan, Nathan Lambert, Dustin Schwenk, Oyvind Tafjord, Taira Anderson, David Atkinson, Faeze Brahman, Christopher Clark, Pradeep Dasigi, Nouha Dziri, and 21 others. 2024. 2 OLMo 2 Furious. https://arxiv.org/abs/2501.00656
[23] OpenAI, Aaron Hurst, Adam Lerer, Adam P. Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, Aleksander Mądry, Alex Baker-Whitcomb, Alex Beutel, Alex Borzunov, Alex Carney, Alex Chow, Alex Kirillov, and 401 others. 2024. GPT-4o system card. Preprint, arXiv:2410.21276. https://arxiv.org/abs/2410.21276
[24] Pouya Pezeshkpour and Estevam Hruschka. 2023. Large language models sensitivity to the order of options in multiple-choice questions. Preprint, arXiv:2308.11483. https://arxiv.org/abs/2308.11483
[25] Pouya Pezeshkpour and Estevam Hruschka. 2024. Large language models sensitivity to the order of options in multiple-choice questions. In Findings of the Association for Computational Linguistics: NAACL 2024, pages 2006–2017.
[26] Piotr Piękos, Henryk Michalewski, and Mateusz Malinowski. 2021. Measuring and improving BERT's mathematical abilities by predicting the order of reasoning. Preprint, arXiv:2106.03921. https://arxiv.org/abs/2106.03921
[27] Jack W. Rae, Sebastian Borgeaud, Trevor Cai, Katie Millican, Jordan Hoffmann, Francis Song, John Aslanides, Sarah Henderson, Roman Ring, Susannah Young, Eliza Rutherford, Tom Hennigan, Jacob Menick, Albin Cassirer, Richard Powell, George van den Driessche, Lisa Anne Hendricks, Maribeth Rauh, Po-Sen Huang, and 61 others. 2022. https://arxiv.org/abs/2112.11...
[28] Qiu Ran, Yankai Lin, Peng Li, Jie Zhou, and Zhiyuan Liu. 2019. NumNet: Machine reading comprehension with numerical reasoning. Preprint, arXiv:1910.06701. https://arxiv.org/abs/1910.06701
[29] Maarten Sap, Hannah Rashkin, Derek Chen, Ronan LeBras, and Yejin Choi. 2019. SocialIQA: Commonsense reasoning about social interactions. Preprint, arXiv:1904.09728. https://arxiv.org/abs/1904.09728
[30] Abulhair Saparov and He He. 2022. Language models are greedy reasoners: A systematic formal analysis of chain-of-thought. arXiv preprint arXiv:2210.01240.
[31] Noah Shinn, Federico Cassano, Edward Berman, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. 2023a. Reflexion: Language agents with verbal reinforcement learning. Preprint, arXiv:2303.11366. https://arxiv.org/abs/2303.11366
[32] Noah Shinn, Federico Cassano, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. 2023b. Reflexion: Language agents with verbal reinforcement learning. Advances in Neural Information Processing Systems, 36:8634–8652.
[33] Alon Talmor, Jonathan Herzig, Nicholas Lourie, and Jonathan Berant. 2019. CommonsenseQA: A question answering challenge targeting commonsense knowledge. Preprint, arXiv:1811.00937. https://arxiv.org/abs/1811.00937
[34] Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. 2023. Self-consistency improves chain of thought reasoning in language models. Preprint, arXiv:2203.11171. https://arxiv.org/abs/2203.11171
[35] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed Chi, Quoc Le, and Denny Zhou. 2023. Chain-of-thought prompting elicits reasoning in large language models. Preprint, arXiv:2201.11903. https://arxiv.org/abs/2201.11903
[36] Sean Welleck, Amanda Bertsch, Matthew Finlayson, Hailey Schoelkopf, Alex Xie, Graham Neubig, Ilia Kulikov, and Zaid Harchaoui. 2024. From decoding to meta-generation: Inference-time algorithms for large language models. Preprint, arXiv:2406.16838. https://arxiv.org/abs/2406.16838
[37] Yixuan Weng, Minjun Zhu, Fei Xia, Bin Li, Shizhu He, Shengping Liu, Bin Sun, Kang Liu, and Jun Zhao. 2023. Large language models are better reasoners with self-verification. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 2550–2575.
[38] Tianci Xue, Ziqi Wang, Zhenhailong Wang, Chi Han, Pengfei Yu, and Heng Ji. 2023. RCoT: Detecting and rectifying factual inconsistency in reasoning by reversing chain-of-thought. arXiv preprint arXiv:2305.11499.
[39] Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Thomas L. Griffiths, Yuan Cao, and Karthik Narasimhan. 2023a. Tree of thoughts: Deliberate problem solving with large language models. Preprint, arXiv:2305.10601. https://arxiv.org/abs/2305.10601
[40] Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Tom Griffiths, Yuan Cao, and Karthik Narasimhan. 2023b. Tree of thoughts: Deliberate problem solving with large language models. Advances in Neural Information Processing Systems, 36:11809–11822.
[41] Eric Zhao, Pranjal Awasthi, and Sreenivas Gollapudi. 2025. Sample, scrutinize and scale: Effective inference-time search by scaling verification. Preprint, arXiv:2502.01839. https://arxiv.org/abs/2502.01839