pith. machine review for the scientific record.

arxiv: 2604.04944 · v1 · submitted 2026-03-15 · 💻 cs.CL · cs.AI

Recognition: 2 theorem links · Lean Theorem

Inclusion-of-Thoughts: Mitigating Preference Instability via Purifying the Decision Space

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 11:27 UTC · model grok-4.3

classification 💻 cs.CL · cs.AI
keywords Inclusion-of-Thoughts · preference instability · distractors · multiple-choice questions · chain-of-thought · LLM reasoning · self-filtering · decision stability

The pith

Inclusion-of-Thoughts purifies multiple-choice questions by removing implausible distractors to stabilize LLM reasoning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Large language models often waver between correct and wrong answers in multiple-choice questions because plausible but incorrect options divert their attention. The Inclusion-of-Thoughts method counters this by having the model progressively identify and remove implausible choices, then reconstruct the question using only the remaining plausible options. This creates a cleaner setting for the model to compare answers and maintain consistent internal reasoning. The approach also records each filtering step, making the decision process more transparent. Extensive tests show it improves chain-of-thought results on arithmetic, commonsense, and educational tasks while adding little computational cost.

Core claim

By reconstructing multiple-choice questions to include only plausible options, selected through a progressive self-filtering process, the model mitigates preference instability under distractors, focuses more effectively on comparative judgements, and thereby boosts chain-of-thought performance across arithmetic, commonsense reasoning, and educational benchmarks.

What carries the argument

Inclusion-of-Thoughts (IoT), a progressive self-filtering strategy that reconstructs the MCQ by retaining only plausible options, documenting each filtering step, to reduce the cognitive load imposed by distractors.
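Read as an algorithm, the strategy is a loop: ask the model to name an implausible option, drop it, repeat until only plausible options remain, then answer the reconstructed question. A minimal sketch in Python, assuming a hypothetical `ask_model` completion function; the paper's actual prompts, plausibility criterion, and number of stages are not reproduced here and may differ.

```python
# Minimal sketch of a progressive self-filtering loop in the spirit of IoT.
# `ask_model` is a hypothetical LLM call; the paper's actual prompts,
# plausibility criterion, and stage structure may differ.
from typing import Callable, List

def iot_answer(question: str, options: List[str],
               ask_model: Callable[[str], str],
               max_rounds: int = 3) -> tuple[str, list]:
    """Progressively drop options the model deems implausible, then
    answer the reconstructed question. Returns the answer and a log
    of each filtering step (the transparency claim above)."""
    remaining = list(options)
    log = []
    for round_no in range(max_rounds):
        if len(remaining) <= 2:          # nothing left worth filtering
            break
        prompt = (f"{question}\nOptions: {remaining}\n"
                  "Name one option that is clearly implausible, "
                  "or say NONE.")
        verdict = ask_model(prompt).strip()
        if verdict == "NONE" or verdict not in remaining:
            break
        remaining.remove(verdict)
        log.append({"round": round_no, "removed": verdict})
    # Reconstruct the MCQ with only the surviving options and answer it.
    final_prompt = (f"{question}\nOptions: {remaining}\n"
                    "Think step by step, then give the best option.")
    return ask_model(final_prompt), log
```

The returned log is what the transparency claim rests on: every removed option is recorded alongside the round in which it was discarded.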

If this is right

  • Chain-of-thought accuracy rises on arithmetic benchmarks.
  • Performance improves on commonsense reasoning tasks.
  • Results strengthen on educational multiple-choice evaluations.
  • Decision transparency increases through explicit logging of each removed option.
  • Computational overhead stays minimal while stability under option perturbation grows.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Applying the same purification step to generated distractors in open-ended questions could extend stability benefits beyond fixed MCQs.
  • Analyzing which options get filtered might expose specific reasoning shortcuts the model uses.
  • Embedding this filtering into repeated self-correction loops could compound gains on harder reasoning problems.

Load-bearing premise

The model can reliably identify and remove only implausible distractors in a progressive self-filtering process without discarding useful information or introducing new biases in the reconstructed question.

What would settle it

An experiment in which the model is forced to classify the correct answer as implausible during filtering: if this yields no accuracy gain, or an outright drop, relative to standard chain-of-thought prompting, the method's gains demonstrably hinge on the filter preserving the correct answer.
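A sketch of what that experiment could look like, reusing the hypothetical `iot_answer` helper from the sketch above; `cot_answer` and the record fields `q`, `options`, `gold` are assumptions, not the paper's evaluation harness.

```python
# Sketch of the stress test described above: force the correct option
# out of the set before filtering and see whether IoT's advantage over
# plain chain-of-thought prompting survives.
def forced_filter_accuracy(dataset, ask_model, cot_answer):
    iot_hits = cot_hits = 0
    for ex in dataset:                       # ex: {"q", "options", "gold"}
        corrupted = [o for o in ex["options"] if o != ex["gold"]]
        pred, _ = iot_answer(ex["q"], corrupted, ask_model)
        iot_hits += pred == ex["gold"]       # gold is gone, so near-zero
        cot_hits += cot_answer(ex["q"], ex["options"]) == ex["gold"]
    n = len(dataset)
    return iot_hits / n, cot_hits / n
```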

Figures

Figures reproduced from arXiv: 2604.04944 by Jey Han Lau, Mohammad Reza Ghasemi Madani, Shuo Yang, Soyeon Caren Han.

Figure 1
Figure 1. Inclusion-of-Thoughts Framework is a self-filtering strategy that reconstructs multiple-choice questions using only plausible options to mitigate model instability and enhance performance. view at source ↗
Figure 2
Figure 2. Inclusion-of-Thoughts Framework. The pipeline allows the model to choose up to two options. Then the model looks at its preferences and decides the final answer in isolation. We remove the initial (stage 1) selection o₁* from O and replace it with a neutral placeholder, “none of the options”, yielding the modified option set O′ = (O \ {o₁*}) ∪ {none of the options}. The model is then queried again on … view at source ↗
Figure 3
Figure 3. The figure illustrates the effect of options … view at source ↗
Figure 4
Figure 4. Results of Olmo-2-7B categorized by benchmark type. While the margin in Commonsense or Education is more significant, IoT achieves performance comparable to self-consistency (SC) with much lower computational cost. view at source ↗
Figure 5
Figure 5. Answer transition process of Olmo-2-7B over the OBQA dataset. The final improvement is the difference between FTT and TFF. view at source ↗
Figure 6
Figure 6. The portion of noisy samples agreed across … view at source ↗
Figure 7
Figure 7. Cost-Benefit. Vertical axes illustrate the accuracy of each method, while the horizontal axes show the … view at source ↗
Figure 8
Figure 8. Example of TFT case from IoT (OBQA, Olmo-2-7B). view at source ↗
Figure 9
Figure 9. Example of TTT case from IoT (CSQA, Olmo-2-7B); here IoT stops the process and skips stage 3. view at source ↗
Figure 10
Figure 10. Example of FTT case from IoT (CSQA, Olmo-2-7B). view at source ↗
Figure 11
Figure 11. Example of FFF case from IoT (AQUA, Olmo-2-7B); here IoT stops the process and skips stage 3. view at source ↗
Figure 12
Figure 12. Example of FTT case from IoT (GSM8K-MC, Olmo-2-7B). view at source ↗
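The stage-2 reconstruction described in Figure 2's caption reduces to a single set operation. A sketch under the stated notation, where `options` plays the role of O and `first_pick` the stage-1 selection o₁*; the paper's placeholder wording and option ordering may differ.

```python
# The stage-2 step from Figure 2's caption, as a set operation:
# O' = (O \ {o1*}) ∪ {"none of the options"}. A sketch only.
def stage2_option_set(options: set[str], first_pick: str) -> set[str]:
    """Drop the stage-1 selection and add a neutral placeholder,
    so the model must re-justify its preference without its anchor."""
    return (options - {first_pick}) | {"none of the options"}

assert stage2_option_set({"a", "b", "c"}, "b") == {"a", "c", "none of the options"}
```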
read the original abstract

Multiple-choice questions (MCQs) are widely used to evaluate large language models (LLMs). However, LLMs remain vulnerable to the presence of plausible distractors. This often diverts attention toward irrelevant choices, resulting in unstable oscillation between correct and incorrect answers. In this paper, we propose Inclusion-of-Thoughts (IoT), a progressive self-filtering strategy that is designed to mitigate this cognitive load (i.e., instability of model preferences under the presence of distractors) and enable the model to focus more effectively on plausible answers. Our method operates to reconstruct the MCQ using only plausible option choices, providing a controlled setting for examining comparative judgements and therefore the stability of the model's internal reasoning under perturbation. By explicitly documenting this filtering process, IoT also enhances the transparency and interpretability of the model's decision-making. Extensive empirical evaluation demonstrates that IoT substantially boosts chain-of-thought performance across a range of arithmetic, commonsense reasoning, and educational benchmarks with minimal computational overhead.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper proposes Inclusion-of-Thoughts (IoT), a progressive self-filtering strategy that reconstructs multiple-choice questions (MCQs) by retaining only options the LLM deems plausible. This is intended to reduce preference instability caused by distractors, improve chain-of-thought reasoning stability, and increase decision transparency. The central claim is that IoT yields substantial performance gains on arithmetic, commonsense, and educational benchmarks with negligible added cost.

Significance. If the empirical results and filtering reliability can be substantiated, IoT would supply a lightweight, model-internal procedure for stabilizing LLM judgments on MCQs without external oracles or retraining. The explicit logging of filtering steps could also aid interpretability studies. The approach is procedural rather than derived from first principles, so its value hinges entirely on whether the self-filtering step demonstrably preserves the correct answer while removing only distractors.

major comments (3)
  1. [Abstract] The assertion of 'substantial boosts' and 'extensive empirical evaluation' across benchmarks is unsupported by any quantitative results, baselines, accuracy deltas, statistical tests, or implementation details on how plausibility filtering is performed or validated.
  2. [Method] The progressive self-filtering is performed by the same LLM whose preference instability is the problem being solved; no external ground-truth check, oracle, or error-correction mechanism is described, so an early misclassification of the correct answer as implausible would permanently exclude it from the reconstructed MCQ.
  3. [Experiments] The weakest assumption, that the model reliably retains the correct answer while discarding only distractors, is not tested via ablation on filtering accuracy, false-negative rates on correct options, or comparison against an oracle-filtered baseline.
minor comments (2)
  1. [Method] Notation for the reconstructed MCQ and the plausibility threshold is introduced without a formal definition or pseudocode, making the exact reconstruction step difficult to reproduce.
  2. [Title/Abstract] The title uses 'Inclusion-of-Thoughts' while the abstract uses 'Inclusion-of-Thoughts (IoT)'; consistent capitalization and acronym placement would improve readability.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback on our manuscript. We address each major comment point by point below and outline the revisions we will make to strengthen the presentation and empirical support for Inclusion-of-Thoughts.

read point-by-point responses
  1. Referee: [Abstract] The assertion of 'substantial boosts' and 'extensive empirical evaluation' across benchmarks is unsupported by any quantitative results, baselines, accuracy deltas, statistical tests, or implementation details on how plausibility filtering is performed or validated.

    Authors: We agree that the abstract would be strengthened by including concrete quantitative details. Although the full manuscript reports accuracy deltas, baselines, and statistical tests in Section 4, the abstract currently summarizes these at a high level. In the revised version, we will update the abstract to include specific examples of performance gains (e.g., accuracy improvements on arithmetic and commonsense benchmarks) along with a brief description of the filtering validation process. revision: yes

  2. Referee: [Method] The progressive self-filtering is performed by the same LLM whose preference instability is the problem being solved; no external ground-truth check, oracle, or error-correction mechanism is described, so an early misclassification of the correct answer as implausible would permanently exclude it from the reconstructed MCQ.

    Authors: The self-contained design is intentional to provide a lightweight, model-internal solution without external dependencies. The progressive filtering aims to reduce the risk of early misclassification by evaluating options iteratively. Nevertheless, we acknowledge the potential for error propagation if the correct answer is filtered out. We will revise the Method section to add an explicit discussion of this limitation, including analysis of failure modes and mitigation strategies such as multiple sampling rounds. revision: partial

  3. Referee: [Experiments] The weakest assumption, that the model reliably retains the correct answer while discarding only distractors, is not tested via ablation on filtering accuracy, false-negative rates on correct options, or comparison against an oracle-filtered baseline.

    Authors: We concur that direct validation of the filtering reliability is important. While overall benchmark gains provide indirect evidence that the correct answer is typically retained, we did not include dedicated ablations on filtering accuracy. In the revised manuscript, we will add new experiments reporting the false-negative rate for correct options across datasets and, where ground-truth labels allow, a comparison to an oracle-filtered baseline to quantify how closely IoT approximates ideal filtering. revision: yes
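The ablation promised in response 3 amounts to one number: how often the gold answer is discarded at any filtering stage. A sketch of that measurement, assuming logs in the format of the hypothetical `iot_answer` helper sketched earlier on this page; the field names are assumptions.

```python
# Sketch of the filtering-reliability ablation from response 3: the
# false-negative rate is the fraction of questions whose gold answer
# is removed at any filtering stage. `logs[i]` is the list of
# {"round", "removed"} records produced while answering question i.
def filtering_false_negative_rate(dataset, logs) -> float:
    """dataset[i]['gold'] is the correct option for question i."""
    discarded = sum(
        any(step["removed"] == ex["gold"] for step in log)
        for ex, log in zip(dataset, logs)
    )
    return discarded / len(dataset)
```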

Circularity Check

0 steps flagged

No circularity: procedural filtering strategy with empirical evaluation

full rationale

The paper presents Inclusion-of-Thoughts as a progressive self-filtering procedure for reconstructing MCQs by removing implausible distractors, followed by chain-of-thought reasoning on the purified set. No equations, fitted parameters, or derivations are described that reduce to their own inputs by construction. The central claim rests on empirical performance gains across benchmarks rather than any self-referential mathematical structure or load-bearing self-citation chain. The method is self-contained as an algorithmic heuristic whose correctness is evaluated externally via standard benchmarks, with no renaming of known results or ansatz smuggling via prior work.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that LLMs can accurately self-identify plausible options and that removing distractors improves stability without side effects.

axioms (1)
  • domain assumption: LLMs exhibit preference instability due to plausible distractors in MCQs
    Stated directly in the abstract as the core problem being solved.

pith-pipeline@v0.9.0 · 5483 in / 1105 out tokens · 27862 ms · 2026-05-15T11:27:59.092589+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Reason in Chains, Learn in Trees: Self-Rectification and Grafting for Multi-turn Agent Policy Optimization

    cs.AI · 2026-04 · unverdicted · novelty 6.0

    T-STAR consolidates multi-turn trajectories into a Cognitive Tree for variance-reduced step-level advantages and surgical policy optimization via thought grafting at critical points.

Reference graph

Works this paper leans on

39 extracted references · 39 canonical work pages · cited by 1 Pith paper · 16 internal anchors


  3. [3]

    AI@Meta. 2024. https://github.com/meta-llama/llama3/blob/main/MODEL_CARD.md Llama 3 model card

  4. [4]

    Daniel Andor, Luheng He, Kenton Lee, and Emily Pitler. 2019. https://arxiv.org/abs/1909.00109 Giving bert a calculator: Finding operations and arguments with reading comprehension . Preprint, arXiv:1909.00109

  5. [5]

    Nishant Balepur, Shramay Palta, and Rachel Rudinger. 2024. It’s not easy being wrong: Large language models struggle with process of elimination reasoning. In Findings of the Association for Computational Linguistics: ACL 2024, pages 10143--10166

  6. [6]

    BIG bench authors. 2023. https://openreview.net/forum?id=uyTL5Bvosj Beyond the imitation game: Quantifying and extrapolating the capabilities of language models . Transactions on Machine Learning Research

  7. [7]

    Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. 2018. https://arxiv.org/abs/1803.05457 Think you have solved question answering? try arc, the ai2 reasoning challenge . Preprint, arXiv:1803.05457

  8. [8]

    Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. 2021. https://arxiv.org/abs/2110.14168 Training verifiers to solve math word problems . Preprint, arXiv:2110.14168

  9. [9]

    Qihang Fu, Yongbin Qin, Ruizhang Huang, Yanping Chen, Yulin Zhou, and Lintao Long. 2025. https://doi.org/10.18653/v1/2025.acl-long.1051 Exclusion of thought: Mitigating cognitive load in large language models for enhanced reasoning in multiple-choice tasks . In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume...

  10. [10]

    Mor Geva, Ankit Gupta, and Jonathan Berant. 2020. https://arxiv.org/abs/2004.04487 Injecting numerical reasoning skills into language models . Preprint, arXiv:2004.04487

  11. [11]

    Shibo Hao, Yi Gu, Haodi Ma, Joshua Jiahua Hong, Zhen Wang, Daisy Zhe Wang, and Zhiting Hu. 2023. https://arxiv.org/abs/2305.14992 Reasoning with language model is planning with world model . Preprint, arXiv:2305.14992

  12. [12]

    Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. 2021. https://arxiv.org/abs/2009.03300 Measuring massive multitask language understanding . Preprint, arXiv:2009.03300

  13. [13]

    Weisen Jiang, Han Shi, Longhui Yu, Zhengying Liu, Yu Zhang, Zhenguo Li, and James Kwok. 2024. Forward-backward reasoning in large language models for mathematical verification. In Findings of the Association for Computational Linguistics: ACL 2024, pages 6647--6661

  14. [14]

    Jaehun Jung, Lianhui Qin, Sean Welleck, Faeze Brahman, Chandra Bhagavatula, Ronan Le Bras, and Yejin Choi. 2022. Maieutic prompting: Logically consistent reasoning with recursive explanations. In Proceedings of the 2022 conference on empirical methods in natural language processing, pages 1266--1279

  15. [15]

    Tamera Lanham, Anna Chen, Ansh Radhakrishnan, Benoit Steiner, Carson Denison, Danny Hernandez, Dustin Li, Esin Durmus, Evan Hubinger, Jackson Kernion, and 1 others. 2023. Measuring faithfulness in chain-of-thought reasoning. arXiv preprint arXiv:2307.13702

  16. [16]

    Wang Ling, Dani Yogatama, Chris Dyer, and Phil Blunsom. 2017. https://arxiv.org/abs/1705.04146 Program induction by rationale generation : Learning to solve and explain algebraic word problems . Preprint, arXiv:1705.04146

  17. [17]

    Qing Lyu, Shreya Havaldar, Adam Stein, Li Zhang, Delip Rao, Eric Wong, Marianna Apidianaki, and Chris Callison-Burch. 2023. Faithful chain-of-thought reasoning. In The 13th International Joint Conference on Natural Language Processing and the 3rd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics (IJCNLP-AACL 2023)

  18. [18]

    Chenkai Ma and Xinya Du. 2023. https://arxiv.org/abs/2310.15575 Poe: Process of elimination for multiple choice reasoning . Preprint, arXiv:2310.15575

  19. [19]

    Aman Madaan, Niket Tandon, Prakhar Gupta, Skyler Hallinan, Luyu Gao, Sarah Wiegreffe, Uri Alon, Nouha Dziri, Shrimai Prabhumoye, Yiming Yang, Shashank Gupta, Bodhisattwa Prasad Majumder, Katherine Hermann, Sean Welleck, Amir Yazdanbakhsh, and Peter Clark. 2023. https://arxiv.org/abs/2303.17651 Self-refine: Iterative refinement with self-feedback . Preprint, arXiv:2303.17651

  20. [20]

    Julian Michael, Salsabila Mahdi, David Rein, Jackson Petty, Julien Dirani, Vishakh Padmakumar, and Samuel R Bowman. 2023. Debate helps supervise unreliable experts. arXiv preprint arXiv:2311.08702

  21. [21]

    Todor Mihaylov, Peter Clark, Tushar Khot, and Ashish Sabharwal. 2018. https://arxiv.org/abs/1809.02789 Can a suit of armor conduct electricity? a new dataset for open book question answering . Preprint, arXiv:1809.02789

  22. [22]

    Team OLMo, Pete Walsh, Luca Soldaini, Dirk Groeneveld, Kyle Lo, Shane Arora, Akshita Bhagia, Yuling Gu, Shengyi Huang, Matt Jordan, Nathan Lambert, Dustin Schwenk, Oyvind Tafjord, Taira Anderson, David Atkinson, Faeze Brahman, Christopher Clark, Pradeep Dasigi, Nouha Dziri, and 21 others. 2024. https://arxiv.org/abs/2501.00656 2 olmo 2 furious

  23. [23]

    OpenAI: Aaron Hurst, Adam Lerer, Adam P. Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, Aleksander Mądry, Alex Baker-Whitcomb, Alex Beutel, Alex Borzunov, Alex Carney, Alex Chow, Alex Kirillov, and 401 others. 2024. https://arxiv.org/abs/2410.21276 Gpt-4o system card . Preprint, arXiv:2410.21276

  24. [24]

    Pouya Pezeshkpour and Estevam Hruschka. 2023. https://arxiv.org/abs/2308.11483 Large language models sensitivity to the order of options in multiple-choice questions . Preprint, arXiv:2308.11483

  25. [25]

    Pouya Pezeshkpour and Estevam Hruschka. 2024. Large language models sensitivity to the order of options in multiple-choice questions. In Findings of the Association for Computational Linguistics: NAACL 2024, pages 2006--2017

  26. [26]

    Piotr Piękos, Henryk Michalewski, and Mateusz Malinowski. 2021. https://arxiv.org/abs/2106.03921 Measuring and improving bert's mathematical abilities by predicting the order of reasoning . Preprint, arXiv:2106.03921

  27. [27]

    Jack W. Rae, Sebastian Borgeaud, Trevor Cai, Katie Millican, Jordan Hoffmann, Francis Song, John Aslanides, Sarah Henderson, Roman Ring, Susannah Young, Eliza Rutherford, Tom Hennigan, Jacob Menick, Albin Cassirer, Richard Powell, George van den Driessche, Lisa Anne Hendricks, Maribeth Rauh, Po-Sen Huang, and 61 others. 2022. https://arxiv.org/abs/2112.11...

  28. [28]

    Qiu Ran, Yankai Lin, Peng Li, Jie Zhou, and Zhiyuan Liu. 2019. https://arxiv.org/abs/1910.06701 Numnet: Machine reading comprehension with numerical reasoning . Preprint, arXiv:1910.06701

  29. [29]

    Maarten Sap, Hannah Rashkin, Derek Chen, Ronan LeBras, and Yejin Choi. 2019. https://arxiv.org/abs/1904.09728 Socialiqa: Commonsense reasoning about social interactions . Preprint, arXiv:1904.09728

  30. [30]

    Abulhair Saparov and He He. 2022. Language models are greedy reasoners: A systematic formal analysis of chain-of-thought. arXiv preprint arXiv:2210.01240

  31. [31]

    Noah Shinn, Federico Cassano, Edward Berman, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. 2023a. https://arxiv.org/abs/2303.11366 Reflexion: Language agents with verbal reinforcement learning . Preprint, arXiv:2303.11366

  32. [32]

    Noah Shinn, Federico Cassano, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. 2023b. Reflexion: Language agents with verbal reinforcement learning. Advances in Neural Information Processing Systems, 36:8634--8652

  33. [33]

    Alon Talmor, Jonathan Herzig, Nicholas Lourie, and Jonathan Berant. 2019. https://arxiv.org/abs/1811.00937 Commonsenseqa: A question answering challenge targeting commonsense knowledge . Preprint, arXiv:1811.00937

  34. [34]

    Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. 2023. https://arxiv.org/abs/2203.11171 Self-consistency improves chain of thought reasoning in language models . Preprint, arXiv:2203.11171

  35. [35]

    Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed Chi, Quoc Le, and Denny Zhou. 2023. https://arxiv.org/abs/2201.11903 Chain-of-thought prompting elicits reasoning in large language models . Preprint, arXiv:2201.11903

  36. [36]

    Sean Welleck, Amanda Bertsch, Matthew Finlayson, Hailey Schoelkopf, Alex Xie, Graham Neubig, Ilia Kulikov, and Zaid Harchaoui. 2024. https://arxiv.org/abs/2406.16838 From decoding to meta-generation: Inference-time algorithms for large language models . Preprint, arXiv:2406.16838

  37. [37]

    Yixuan Weng, Minjun Zhu, Fei Xia, Bin Li, Shizhu He, Shengping Liu, Bin Sun, Kang Liu, and Jun Zhao. 2023. Large language models are better reasoners with self-verification. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 2550--2575

  38. [38]

    Tianci Xue, Ziqi Wang, Zhenhailong Wang, Chi Han, Pengfei Yu, and Heng Ji. 2023. Rcot: Detecting and rectifying factual inconsistency in reasoning by reversing chain-of-thought. arXiv preprint arXiv:2305.11499

  39. [39]

    Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Thomas L. Griffiths, Yuan Cao, and Karthik Narasimhan. 2023a. https://arxiv.org/abs/2305.10601 Tree of thoughts: Deliberate problem solving with large language models . Preprint, arXiv:2305.10601

  40. [40]

    Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Tom Griffiths, Yuan Cao, and Karthik Narasimhan. 2023b. Tree of thoughts: Deliberate problem solving with large language models. Advances in Neural Information Processing Systems, 36:11809--11822

  41. [41]

    Eric Zhao, Pranjal Awasthi, and Sreenivas Gollapudi. 2025. https://arxiv.org/abs/2502.01839 Sample, scrutinize and scale: Effective inference-time search by scaling verification . Preprint, arXiv:2502.01839