pith. machine review for the scientific record.

arxiv: 2502.17419 · v6 · submitted 2025-02-24 · 💻 cs.AI

Recognition: 3 theorem links · Lean Theorem

From System 1 to System 2: A Survey of Reasoning Large Language Models

Authors on Pith: no claims yet

Pith reviewed 2026-05-13 01:32 UTC · model grok-4.3

classification 💻 cs.AI
keywords reasoning LLMs · System 1 · System 2 · large language models · mathematical reasoning · code generation · cognitive abilities · benchmark evaluation

The pith

Reasoning large language models shift from fast intuitive decisions to deliberate step-by-step analysis.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper surveys how foundational large language models, strong at quick heuristic choices, combine with early System 2 techniques to produce models capable of logical, multi-step reasoning. It reviews construction approaches, the methods that enable this shift, and direct comparisons of model performance across reasoning benchmarks. The survey also identifies open challenges and directions that could extend these capabilities toward more robust human-like judgment in complex domains.

Core claim

The survey establishes that the integration of foundational large language models with System 2 technologies has produced models that perform expert-level analysis in mathematics and coding through explicit step-by-step logical processes rather than pure pattern matching.

What carries the argument

The core methods for constructing reasoning large language models that promote explicit step-by-step logical analysis over fast heuristic responses.

If this is right

  • These models deliver more accurate judgments and fewer biases on tasks that require extended analysis.
  • The field gains concrete techniques for scaling deliberate reasoning beyond initial domains like mathematics.
  • Future work can prioritize refinements that maintain performance while expanding to additional problem types.
  • The overview of benchmarks supplies a baseline for measuring further gains in reasoning depth.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the surveyed methods generalize, hybrid systems could combine multiple reasoning models to tackle problems requiring both speed and depth.
  • The emphasis on benchmark comparisons suggests that progress will depend on creating harder tests that separate true reasoning from data familiarity.
  • Tracking rapid changes in this area may require living resources that update as new models and techniques appear.
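One concrete form of the second point above: a coarse contamination screen that flags benchmark problems whose long word n-grams already appear in a candidate training corpus. The sketch below is illustrative only, not a claim about how any surveyed benchmark was audited; the n-gram length of 8 is a conventional but arbitrary choice.

```python
def ngrams(text, n=8):
    """Set of word-level n-grams, used as a coarse contamination fingerprint."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def overlap_fraction(problem, corpus, n=8):
    """Fraction of the problem's n-grams that also occur in the corpus.

    Values near 1.0 suggest the problem text was likely seen during
    training; values near 0.0 suggest surface novelty (which still does
    not prove the underlying reasoning pattern is novel).
    """
    p = ngrams(problem, n)
    if not p:  # problem shorter than n words: nothing to compare
        return 0.0
    return len(p & ngrams(corpus, n)) / len(p)
```

A benchmark builder could drop any problem whose overlap fraction exceeds some threshold; the threshold itself would need tuning against known-contaminated examples.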

Load-bearing premise

Strong results on existing math and coding benchmarks demonstrate genuine step-by-step logical reasoning rather than advanced pattern matching learned from training data.

What would settle it

A new benchmark consisting of problems outside known training distributions where the models show no consistent advantage in producing verifiable logical chains over standard large language models.
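Such a test reduces to a paired comparison: score both model families on problems believed to lie outside known training distributions, and check whether the reasoning models retain an advantage in producing verifiable logical chains. A minimal scoring sketch, with entirely hypothetical per-problem results:

```python
def advantage(reasoning_correct, baseline_correct):
    """Mean accuracy gap of the reasoning model over the baseline.

    Each argument is a list of 0/1 flags, one per out-of-distribution
    problem, where 1 means the model produced a verifiably correct
    logical chain.
    """
    assert len(reasoning_correct) == len(baseline_correct)
    n = len(reasoning_correct)
    return (sum(reasoning_correct) - sum(baseline_correct)) / n

# Hypothetical results on 8 novel problems.
reasoning = [1, 1, 0, 1, 1, 0, 1, 1]   # reasoning LLM
baseline  = [1, 0, 0, 1, 0, 0, 1, 0]   # standard LLM

gap = advantage(reasoning, baseline)
print(f"accuracy gap: {gap:+.3f}")  # prints: accuracy gap: +0.375
```

A gap near zero on genuinely novel problems would support the pattern-matching reading of benchmark gains; a consistent positive gap, with a paired significance test over enough problems, would not.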

read the original abstract

Achieving human-level intelligence requires refining the transition from the fast, intuitive System 1 to the slower, more deliberate System 2 reasoning. While System 1 excels in quick, heuristic decisions, System 2 relies on logical reasoning for more accurate judgments and reduced biases. Foundational Large Language Models (LLMs) excel at fast decision-making but lack the depth for complex reasoning, as they have not yet fully embraced the step-by-step analysis characteristic of true System 2 thinking. Recently, reasoning LLMs like OpenAI's o1/o3 and DeepSeek's R1 have demonstrated expert-level performance in fields such as mathematics and coding, closely mimicking the deliberate reasoning of System 2 and showcasing human-like cognitive abilities. This survey begins with a brief overview of the progress in foundational LLMs and the early development of System 2 technologies, exploring how their combination has paved the way for reasoning LLMs. Next, we discuss how to construct reasoning LLMs, analyzing their features, the core methods enabling advanced reasoning, and the evolution of various reasoning LLMs. Additionally, we provide an overview of reasoning benchmarks, offering an in-depth comparison of the performance of representative reasoning LLMs. Finally, we explore promising directions for advancing reasoning LLMs and maintain a real-time GitHub Repository (https://github.com/zzli2022/Awesome-Slow-Reason-System) to track the latest developments. We hope this survey will serve as a valuable resource to inspire innovation and drive progress in this rapidly evolving field.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 3 minor

Summary. The manuscript is a survey tracing the development of large language models from fast, heuristic System 1 processing to deliberate, step-by-step System 2 reasoning. It reviews foundational LLMs, early System 2 techniques, methods for building reasoning models (with emphasis on OpenAI o1/o3 and DeepSeek R1), core enabling approaches, an overview of reasoning benchmarks with performance comparisons, and future research directions, while linking to a live GitHub repository for ongoing updates.

Significance. As a timely synthesis of an active research area, the survey usefully organizes recent advances in reasoning LLMs and collates benchmark results across representative models. The maintained repository adds practical value for readers tracking the field. The interpretive framing that benchmark gains demonstrate 'human-like cognitive abilities' and close mimicry of System 2 is presented as a central motivation but rests on the accuracy of the cited literature rather than new analysis.

major comments (1)
  1. [Abstract] The central claim that o1/o3 and R1 are 'closely mimicking the deliberate reasoning of System 2' is grounded solely in reported expert-level math and coding benchmark scores. The survey does not supply or cite ablations, contamination audits, or out-of-distribution tests that would distinguish genuine step-by-step deliberation from scaled pattern completion or leakage; this interpretive step therefore remains an unexamined assumption rather than a substantiated conclusion.
minor comments (3)
  1. [Abstract / Introduction] The abstract and introduction would benefit from a brief explicit statement of the survey's scope (e.g., which models and benchmarks are covered through what cutoff date) to help readers assess completeness.
  2. [Abstract] The GitHub repository is described as 'real-time'; adding a sentence noting the date of the most recent update or commit would increase transparency.
  3. [Benchmark overview] In the benchmark comparison section, ensure that performance tables or figures include error bars or variance measures where available from the original papers, to avoid over-interpreting single-run scores.
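The third minor comment amounts to simple aggregation: where source papers report multiple runs, a comparison table should show a mean and a spread rather than a single score. A sketch with hypothetical scores:

```python
import statistics

# Hypothetical accuracy (%) over three seeds per model; in practice these
# would come from the original papers' reported multi-run results.
runs = {
    "model-A": [82.1, 80.4, 83.0],
    "model-B": [79.8, 81.2, 80.5],
}

for name, scores in runs.items():
    mean = statistics.mean(scores)
    sd = statistics.stdev(scores)  # sample standard deviation
    print(f"{name}: {mean:.1f} ± {sd:.1f}")
```

With only single-run scores available, even this is impossible, which is exactly the over-interpretation risk the referee flags.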

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the positive assessment of the survey's timeliness and the value of the accompanying repository. We address the single major comment below and have revised the manuscript to clarify the scope and grounding of our interpretive framing.

read point-by-point responses
  1. Referee: The central claim that o1/o3 and R1 are 'closely mimicking the deliberate reasoning of System 2' is grounded solely in reported expert-level math and coding benchmark scores. The survey does not supply or cite ablations, contamination audits, or out-of-distribution tests that would distinguish genuine step-by-step deliberation from scaled pattern completion or leakage; this interpretive step therefore remains an unexamined assumption rather than a substantiated conclusion.

    Authors: We agree that a survey cannot itself supply new ablations, contamination audits, or OOD tests; those must come from primary research. The abstract phrasing summarizes claims made in the cited source papers (OpenAI o1/o3 technical reports and the DeepSeek-R1 paper), which present the models as performing step-by-step reasoning on the reported benchmarks. To avoid presenting this as our own substantiated conclusion, we have revised the abstract to read “as described in the original works, these models achieve expert-level performance on benchmarks that require multi-step reasoning” and added a new paragraph in the introduction that explicitly notes the active debate in the literature. We now cite recent critical analyses that examine potential data contamination, alternative explanations for benchmark gains, and the limits of current evaluation protocols. These additions preserve the survey’s role as a synthesis while making the evidential basis and interpretive caveats transparent.

    Revision: yes

Circularity Check

0 steps flagged

No circularity: survey without derivations, predictions, or self-referential reductions

full rationale

The paper is a literature survey reviewing foundational LLMs, reasoning LLMs (o1/o3, R1), construction methods, benchmarks, and future directions. It contains no equations, quantitative derivations, fitted parameters, or predictions that could reduce to inputs by construction. All claims reference external published models and benchmarks rather than self-citations or internal definitions. The interpretive analogy between benchmark scores and System 2 reasoning is not a derivation step and does not meet any of the enumerated circularity patterns. The paper is self-contained as a review against external sources.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is a survey paper with no new technical claims, derivations, or postulates; it draws on standard concepts from cognitive science and AI literature without introducing free parameters, axioms, or invented entities.

pith-pipeline@v0.9.0 · 5658 in / 1134 out tokens · 53347 ms · 2026-05-13T01:32:04.418171+00:00 · methodology

discussion (0)


Forward citations

Cited by 37 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Extracting Search Trees from LLM Reasoning Traces Reveals Myopic Planning

    cs.AI 2026-05 conditional novelty 8.0

    LLMs exhibit myopic planning in four-in-a-row: move choices are best explained by shallow nodes in reasoning traces, not the deep lookahead they generate, unlike humans where depth drives performance.

  2. Unsupervised Process Reward Models

    cs.LG 2026-05 unverdicted novelty 7.0

    Unsupervised PRMs derived from LLM probabilities achieve up to 15% better error detection than LLM judges and match supervised PRMs in verification and RL tasks.

  3. Extracting Search Trees from LLM Reasoning Traces Reveals Myopic Planning

    cs.AI 2026-05 unverdicted novelty 7.0

    LLMs exhibit myopic planning in games, with move choices driven by shallow nodes despite deep reasoning traces, in contrast to human deep-search reliance.

  4. Extracting Search Trees from LLM Reasoning Traces Reveals Myopic Planning

    cs.AI 2026-05 unverdicted novelty 7.0

    LLM move selection in four-in-a-row is best explained by myopic models that ignore deep nodes in their own reasoning traces, while performance correlates with search breadth rather than depth.

  5. Extracting Search Trees from LLM Reasoning Traces Reveals Myopic Planning

    cs.AI 2026-05 unverdicted novelty 7.0

    LLMs display myopic planning in games: move selection is driven by shallow nodes in reasoning traces despite generating deep lookahead, with performance tied to search breadth rather than depth.

  6. Post Reasoning: Improving the Performance of Non-Thinking Models at No Cost

    cs.AI 2026-05 conditional novelty 7.0

    Post-Reasoning boosts LLM accuracy by reversing the usual answer-after-reasoning order, delivering mean relative gains of 17.37% across 117 model-benchmark pairs with zero extra cost.

  7. ReflectMT: Internalizing Reflection for Efficient and High-Quality Machine Translation

    cs.CL 2026-04 unverdicted novelty 7.0

    ReflectMT internalizes reflection via two-stage RL to enable direct high-quality machine translation that outperforms explicit reasoning models like DeepSeek-R1 on WMT24 while using 94% fewer tokens.

  8. Improving Reasoning Capabilities in Small Models through Mixture-of-Layers Distillation with Stepwise Attention on Key Information

    cs.CL 2026-04 unverdicted novelty 7.0

    A CoT distillation framework transfers stepwise teacher attention on key information via a Mixture-of-Layers module to improve reasoning in small language models.

  9. AdverMCTS: Combating Pseudo-Correctness in Code Generation via Adversarial Monte Carlo Tree Search

    cs.SE 2026-04 unverdicted novelty 7.0

    AdverMCTS frames code generation as a minimax game where an attacker evolves tests to expose flaws in solver-generated code, yielding more robust outputs than static-test baselines.

  10. Video-R1: Reinforcing Video Reasoning in MLLMs

    cs.CV 2025-03 conditional novelty 7.0

    Video-R1 uses temporal-aware RL and mixed datasets to boost video reasoning in MLLMs, with a 7B model reaching 37.1% on VSI-Bench and surpassing GPT-4o.

  11. Model-Adaptive Tool Necessity Reveals the Knowing-Doing Gap in LLM Tool Use

    cs.AI 2026-05 unverdicted novelty 6.0

    LLMs show a knowing-doing gap in tool use: they often recognize when tools are needed via internal states but fail to translate that into actual tool calls, with mismatches of 26-54% on arithmetic and factual tasks.

  12. Explaining and Breaking the Safety-Helpfulness Ceiling via Preference Dimensional Expansion

    cs.AI 2026-05 unverdicted novelty 6.0

    MORA breaks the safety-helpfulness trade-off in LLM alignment by pre-sampling single-reward prompts and rewriting them to expand multi-dimensional reward diversity, yielding 5-12.4% single-preference gains in sequenti...

  13. Explaining and Breaking the Safety-Helpfulness Ceiling via Preference Dimensional Expansion

    cs.AI 2026-05 unverdicted novelty 6.0

    MORA breaks the safety-helpfulness ceiling in LLMs by pre-sampling single-reward prompts and rewriting them to incorporate multi-dimensional intents, delivering 5-12.4% gains in sequential alignment and 4.6% overall i...

  14. Seirênes: Adversarial Self-Play with Evolving Distractions for LLM Reasoning

    cs.AI 2026-05 unverdicted novelty 6.0

    Seirênes trains LLMs via adversarial self-play to generate and overcome evolving distractions, producing gains of 7-10 points on math reasoning benchmarks and exposing blind spots in larger models.

  15. CLR-voyance: Reinforcing Open-Ended Reasoning for Inpatient Clinical Decision Support with Outcome-Aware Rubrics

    cs.CL 2026-05 unverdicted novelty 6.0

    CLR-voyance reformulates inpatient reasoning as POMDP with clinician-validated outcome rubrics, yielding an 8B model that outperforms larger frontier models on the authors' new benchmark.

  16. Confidence-Aware Alignment Makes Reasoning LLMs More Reliable

    cs.AI 2026-05 unverdicted novelty 6.0

    CASPO trains LLMs via iterative direct preference optimization so that token-level confidence tracks step-wise correctness, then applies Confidence-aware Thought pruning at inference to improve both reliability and sp...

  17. HypEHR: Hyperbolic Modeling of Electronic Health Records for Efficient Question Answering

    cs.AI 2026-04 unverdicted novelty 6.0

    HypEHR is a hyperbolic embedding model for EHR data that uses Lorentzian geometry and hierarchy-aware pretraining to answer clinical questions nearly as well as large language models but with much smaller size.

  18. OMIBench: Benchmarking Olympiad-Level Multi-Image Reasoning in Large Vision-Language Model

    cs.CV 2026-04 unverdicted novelty 6.0

    OMIBench benchmark reveals that current LVLMs achieve at most 50% on Olympiad problems requiring reasoning across multiple images.

  19. Contrastive Attribution in the Wild: An Interpretability Analysis of LLM Failures on Realistic Benchmarks

    cs.AI 2026-04 conditional novelty 6.0

    Token-level contrastive attribution yields informative signals for some LLM benchmark failures but is not universally applicable across datasets and models.

  20. LACE: Lattice Attention for Cross-thread Exploration

    cs.AI 2026-04 unverdicted novelty 6.0

    LACE enables parallel reasoning paths in LLMs to communicate via lattice attention and error-correct using synthetic training data, improving accuracy by over 7 points over standard parallel search.

  21. The role of System 1 and System 2 semantic memory structure in human and LLM biases

    cs.CL 2026-04 unverdicted novelty 6.0

    Human semantic memory networks for System 1 and System 2 are structurally distinct and consistently relate to implicit gender bias levels, but LLM networks do not exhibit these properties.

  22. Leveraging Mathematical Reasoning of LLMs for Efficient GPU Thread Mapping

    cs.DC 2026-04 unverdicted novelty 6.0

    Large language models derive exact analytical GPU thread mappings for complex 2D/3D domains and fractals via in-context learning, outperforming symbolic regression and enabling up to thousands-fold speedups and energy...

  23. TimelineReasoner: Advancing Timeline Summarization with Large Reasoning Models

    cs.CL 2026-04 unverdicted novelty 6.0

    TimelineReasoner applies large reasoning models in a Global Cognition plus Detail Exploration loop to produce more accurate, complete, and coherent timelines from news than prior LLM-based methods.

  24. KG-Hopper: Empowering Compact Open LLMs with Knowledge Graph Reasoning via Reinforcement Learning

    cs.CL 2026-03 unverdicted novelty 6.0

    KG-Hopper uses RL to embed full multi-hop KG traversal and backtracking into a single LLM inference round, enabling a 7B model to outperform larger multi-step systems and compete with GPT-3.5/GPT-4o-mini on eight benchmarks.

  25. CODA: Difficulty-Aware Compute Allocation for Adaptive Reasoning

    cs.CL 2026-03 unverdicted novelty 6.0

    CODA uses rollout-based difficulty signals to drive two gates that penalize verbosity on easy instances and promote deliberation on hard ones, cutting token use over 60% on simple tasks while maintaining accuracy.

  26. Beyond Individual Intelligence: Surveying Collaboration, Failure Attribution, and Self-Evolution in LLM-based Multi-Agent Systems

    cs.AI 2026-05 conditional novelty 5.0

    The survey proposes the LIFE framework to unify fragmented research on collaboration, failure attribution, and self-evolution in LLM multi-agent systems into a progression toward self-organizing intelligence.

  27. Do LLMs have core beliefs?

    cs.LG 2026-05 unverdicted novelty 5.0

    LLMs generally fail to maintain stable worldviews under adversarial conversational pressure, indicating they lack core beliefs akin to those in human cognition.

  28. UpstreamQA: A Modular Framework for Explicit Reasoning on Video Question Answering Tasks

    cs.CV 2026-04 unverdicted novelty 5.0

    UpstreamQA disentangles video reasoning by using LRMs for explicit upstream object identification and scene context before downstream LMM VideoQA, improving performance and interpretability on OpenEQA and NExTQA in so...

  29. The Cognitive Penalty: Ablating System 1 and System 2 Reasoning in Edge-Native SLMs for Decentralized Consensus

    cs.AI 2026-04 unverdicted novelty 5.0

    System 1 intuition in edge SLMs delivers 100% adversarial robustness and low latency for DAO consensus while System 2 reasoning causes 26.7% cognitive collapse and 17x slowdown.

  30. Towards Robust Endogenous Reasoning: Unifying Drift Adaptation in Non-Stationary Tuning

    cs.LG 2026-04 unverdicted novelty 5.0

    CPO++ adapts reinforcement fine-tuning of MLLMs to endogenous multi-modal concept drift through counterfactual reasoning and preference optimization, yielding better coherence and cross-domain robustness in safety-cri...

  31. LACE: Lattice Attention for Cross-thread Exploration

    cs.AI 2026-04 unverdicted novelty 5.0

    LACE adds lattice attention to let parallel LLM reasoning threads interact and correct errors, raising accuracy over 7 points versus standard independent sampling.

  32. LACE: Lattice Attention for Cross-thread Exploration

    cs.AI 2026-04 unverdicted novelty 5.0

    LACE enables concurrent reasoning paths in LLMs to interact via lattice attention and a synthetic training pipeline, raising accuracy more than 7 points over independent parallel search.

  33. H-Probes: Extracting Hierarchical Structures From Latent Representations of Language Models

    cs.CL 2026-04 unverdicted novelty 5.0

    H-probes locate low-dimensional subspaces encoding hierarchy in LLM activations for synthetic tree tasks, show causal importance and generalization, and detect weaker signals in mathematical reasoning traces.

  34. KG-Reasoner: A Reinforced Model for End-to-End Multi-Hop Knowledge Graph Reasoning

    cs.CL 2026-04 unverdicted novelty 5.0

    KG-Reasoner uses reinforcement learning to train LLMs for end-to-end multi-hop knowledge graph reasoning, achieving competitive or better results on eight benchmarks.

  35. Stop Overthinking: A Survey on Efficient Reasoning for Large Language Models

    cs.CL 2025-03 accept novelty 5.0

    A survey organizing techniques to achieve efficient reasoning in LLMs by shortening chain-of-thought outputs.

  36. A Brief Overview: Agentic Reinforcement Learning In Large Language Models

    cs.AI 2026-04 unverdicted novelty 2.0

    This review synthesizes conceptual foundations, methods, challenges, and future directions for agentic reinforcement learning in large language models.

  37. A Brief Overview: Agentic Reinforcement Learning In Large Language Models

    cs.AI 2026-04 unverdicted novelty 2.0

    The paper surveys the conceptual foundations, methodological innovations, challenges, and future directions of agentic reinforcement learning frameworks that embed cognitive capabilities like meta-reasoning and self-r...

Reference graph

Works this paper leans on

295 extracted references · 295 canonical work pages · cited by 30 Pith papers · 21 internal anchors

  1. [1]

    System 1 + system 2 = better world: Neural-symbolic chain of logic reasoning,

    W. Hua and Y. Zhang, “System 1 + system 2 = better world: Neural-symbolic chain of logic reasoning,” in Findings of the Association for Computational Linguistics: EMNLP 2022, 2022, pp. 601–612

  2. [2]

    Chain-of-thought prompting elicits reasoning in large language models,

    J. Wei, X. Wang, D. Schuurmans, M. Bosma, F. Xia, E. Chi, Q. V. Le, D. Zhou et al., “Chain-of-thought prompting elicits reasoning in large language models,” Advances in neural information processing systems, vol. 35, pp. 24824–24837, 2022

  3. [3]

    Self-Consistency Improves Chain of Thought Reasoning in Language Models,

    X. Wang, J. Wei, D. Schuurmans, Q. V. Le, E. H. Chi, S. Narang, A. Chowdhery, and D. Zhou, “Self-Consistency Improves Chain of Thought Reasoning in Language Models,” in The Eleventh International Conference on Learning Representations, 2023

  4. [4]

    Least-to-Most Prompting Enables Complex Reasoning in Large Language Models,

    D. Zhou, N. Schärli, L. Hou, J. Wei, N. Scales, X. Wang, D. Schuurmans, C. Cui, O. Bousquet, Q. V. Le et al., “Least-to-Most Prompting Enables Complex Reasoning in Large Language Models,” in The Eleventh International Conference on Learning Representations, 2023

  5. [5]

    STaR: Self-taught reasoner bootstrapping reasoning with reasoning,

    E. Zelikman, Y. Wu, J. Mu, and N. D. Goodman, “STaR: Self-taught reasoner bootstrapping reasoning with reasoning,” in Proc. the 36th International Conference on Neural Information Processing Systems, vol. 1126, 2024

  6. [6]

    Heuristic and analytic processes in reasoning,

    J. S. B. Evans, “Heuristic and analytic processes in reasoning,” British Journal of Psychology, vol. 75, no. 4, pp. 451–468, 1984

  7. [7]

    Maps of bounded rationality: Psychology for behavioral economics,

    D. Kahneman, “Maps of bounded rationality: Psychology for behavioral economics,” American economic review , vol. 93, no. 5, pp. 1449–1475, 2003

  8. [8]

    Towards Reasoning in Large Language Models: A Survey,

    J. Huang and K. C.-C. Chang, “Towards Reasoning in Large Language Models: A Survey,” in Findings of the Association for Computational Linguistics: ACL 2023, 2023, pp. 1049–1065

  9. [9]

    Reasoning with Language Model Prompting: A Survey,

    S. Qiao, Y. Ou, N. Zhang, X. Chen, Y. Yao, S. Deng, C. Tan, F. Huang, and H. Chen, “Reasoning with Language Model Prompting: A Survey,” in Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2023, pp. 5368–5393

  10. [10]

    Towards Understanding Chain-of-Thought Prompting: An Empirical Study of What Matters,

    B. Wang, S. Min, X. Deng, J. Shen, Y. Wu, L. Zettlemoyer, and H. Sun, “Towards Understanding Chain-of-Thought Prompting: An Empirical Study of What Matters,” in Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2023, pp. 2717–2739

  11. [11]

    On Second Thought, Let’s Not Think Step by Step! Bias and Toxicity in Zero-Shot Reasoning,

    O. Shaikh, H. Zhang, W. Held, M. Bernstein, and D. Yang, “On Second Thought, Let’s Not Think Step by Step! Bias and Toxicity in Zero-Shot Reasoning,” in Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2023, pp. 4454–4470

  12. [12]

    Visual cot: Advancing multi-modal language models with a comprehensive dataset and benchmark for chain-of-thought reasoning,

    H. Shao, S. Qian, H. Xiao, G. Song, Z. Zong, L. Wang, Y. Liu, and H. Li, “Visual cot: Advancing multi-modal language models with a comprehensive dataset and benchmark for chain-of-thought reasoning,” in The Thirty-eight Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2024

  13. [13]

    Automatic Chain of Thought Prompting in Large Language Models,

    Z. Zhang, A. Zhang, M. Li, and A. Smola, “Automatic Chain of Thought Prompting in Large Language Models,” in The Eleventh International Conference on Learning Representations, 2023

  14. [14]

    Reasoning with Language Model is Planning with World Model,

    S. Hao, Y. Gu, H. Ma, J. Hong, Z. Wang, D. Wang, and Z. Hu, “Reasoning with Language Model is Planning with World Model,” in Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, 2023, pp. 8154–8173

  15. [15]

    Meta prompting for agi systems,

    Y. Zhang, “Meta prompting for agi systems,” arXiv preprint arXiv:2311.11482, 2023

  16. [16]

    Hello GPT-4o,

    OpenAI, “Hello GPT-4o,” May 2024. [Online]. Available: https://openai.com/index/hello-gpt-4o/

  17. [17]

    DeepSeek-V3 Technical Report

    A. Liu, B. Feng, B. Xue, B. Wang, B. Wu, C. Lu, C. Zhao, C. Deng, C. Zhang, C. Ruan et al., “Deepseek-v3 technical report,” arXiv preprint arXiv:2412.19437, 2024

  18. [18]

    Attention is all you need,

    A. Vaswani, “Attention is all you need,” Advances in Neural Information Processing Systems, 2017

  19. [19]

    BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding,

    J. Devlin, M. Chang, K. Lee, and K. Toutanova, “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding,” in Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Minneapolis, MN, USA, June 2-7, 2019, Volume 1 (Long and S...

  20. [20]

    RoBERTa: A Robustly Optimized BERT Pretraining Approach

    Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, and V. Stoyanov, “RoBERTa: A Robustly Optimized BERT Pretraining Approach,” CoRR, vol. abs/1907.11692, 2019

  21. [21]

    Improving language understanding by generative pre-training,

    A. Radford, “Improving language understanding by generative pre-training,” 2018

  22. [22]

    Language models are unsupervised multitask learners,

    A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, I. Sutskever et al., “Language models are unsupervised multitask learners,” OpenAI blog, vol. 1, no. 8, p. 9, 2019

  23. [23]

    Language models are few-shot learners,

    T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell et al., “Language models are few-shot learners,” Advances in neural information processing systems, vol. 33, pp. 1877–1901, 2020

  24. [24]

    Training language models to follow instructions with human feedback,

    L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. Wainwright, P. Mishkin, C. Zhang, S. Agarwal, K. Slama, A. Ray et al., “Training language models to follow instructions with human feedback,” Advances in neural information processing systems, vol. 35, pp. 27730–27744, 2022

  25. [25]

    LLaMA: Open and Efficient Foundation Language Models

    H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux, T. Lacroix, B. Rozière, N. Goyal, E. Hambro, F. Azhar et al., “Llama: Open and efficient foundation language models,” arXiv preprint arXiv:2302.13971, 2023

  26. [26]

    A Survey of Large Language Models

    W. X. Zhao, K. Zhou, J. Li, T. Tang, X. Wang, Y. Hou, Y. Min, B. Zhang, J. Zhang, Z. Dong et al., “A survey of large language models,” arXiv preprint arXiv:2303.18223, 2023

  27. [27]

    Visual Instruction Tuning,

    H. Liu, C. Li, Q. Wu, and Y. J. Lee, “Visual Instruction Tuning,” in Thirty-seventh Conference on Neural Information Processing Systems, 2023

  28. [28]

    MM-LLMs: Recent Advances in MultiModal Large Language Models,

    D. Zhang, Y. Yu, J. Dong, C. Li, D. Su, C. Chu, and D. Yu, “MM-LLMs: Recent Advances in MultiModal Large Language Models,” in Findings of the Association for Computational Linguistics, ACL 2024, Bangkok, Thailand and virtual meeting, August 11-16, 2024. Association for Computational Linguistics, 2024, pp. 12401–12430

  29. [29]

    Learning to reason with LLMs,

    OpenAI, “Learning to reason with LLMs,” September 2024. [Online]. Available: https://openai.com/index/learning-to-reason-with-llms/

  30. [30]

    OpenAI o3-mini,

    ——, “OpenAI o3-mini,” January 2025. [Online]. Available: https://openai.com/index/openai-o3-mini/

  31. [31]

    DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

    D. Guo, D. Yang, H. Zhang, J. Song, R. Zhang, R. Xu, Q. Zhu, S. Ma, P. Wang, X. Bi et al., “DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning,” arXiv preprint arXiv:2501.12948, 2025

  32. [32]

    Training Verifiers to Solve Math Word Problems

    K. Cobbe, V. Kosaraju, M. Bavarian, M. Chen, H. Jun, L. Kaiser, M. Plappert, J. Tworek, J. Hilton, R. Nakano et al., “Training verifiers to solve math word problems,” arXiv preprint arXiv:2110.14168, 2021

  33. [33]

    Large language models are zero-shot reasoners,

    T. Kojima, S. S. Gu, M. Reid, Y. Matsuo, and Y. Iwasawa, “Large language models are zero-shot reasoners,” Advances in neural information processing systems, vol. 35, pp. 22199–22213, 2022

  34. [34]

    Improving large language model fine-tuning for solving math problems,

    Y. Liu, A. Singh, C. D. Freeman, J. D. Co-Reyes, and P. J. Liu, “Improving large language model fine-tuning for solving math problems,” arXiv preprint arXiv:2310.10047, 2023

  35. [35]

    Solving Math Word Problems via Cooperative Reasoning induced Language Models,

    X. Zhu, J. Wang, L. Zhang, Y. Zhang, Y. Huang, R. Gan, J. Zhang, and Y. Yang, “Solving Math Word Problems via Cooperative Reasoning induced Language Models,” in Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2023, pp. 4471–4485

  36. [36]

    Dynamic Prompt Learning via Policy Gradient for Semi-structured Mathematical Reasoning,

    P. Lu, L. Qiu, K.-W. Chang, Y. N. Wu, S.-C. Zhu, T. Rajpurohit, P. Clark, and A. Kalyan, “Dynamic Prompt Learning via Policy Gradient for Semi-structured Mathematical Reasoning,” in The Eleventh International Conference on Learning Representations, 2023

  37. [37]

    Let’s Verify Step by Step,

    H. Lightman, V. Kosaraju, Y. Burda, H. Edwards, B. Baker, T. Lee, J. Leike, J. Schulman, I. Sutskever, and K. Cobbe, “Let’s Verify Step by Step,” in The Twelfth International Conference on Learning Representations, 2024

  38. [38]

    Thinking like an expert: Multimodal hypergraph-of-thought (hot) reasoning to boost foundation modals,

    F. Yao, C. Tian, J. Liu, Z. Zhang, Q. Liu, L. Jin, S. Li, X. Li, and X. Sun, “Thinking like an expert: Multimodal hypergraph-of-thought (hot) reasoning to boost foundation modals,” arXiv preprint arXiv:2308.06207, 2023

  39. [39]

    Beyond Chain-of-Thought, Effective Graph-of-Thought Reasoning in Language Models,

    Y. Yao, Z. Li, and H. Zhao, “Beyond Chain-of-Thought, Effective Graph-of-Thought Reasoning in Language Models,” arXiv preprint arXiv:2305.16582, 2023

  40. [40]

    Mindmap: Knowledge graph prompting sparks graph of thoughts in large language models,

    Y. Wen, Z. Wang, and J. Sun, “Mindmap: Knowledge graph prompting sparks graph of thoughts in large language models,” arXiv preprint arXiv:2308.09729, 2023

  41. [41]

    Boosting logical reasoning in large language models through a new framework: The graph of thought,

    B. Lei, C. Liao, C. Ding et al., “Boosting logical reasoning in large language models through a new framework: The graph of thought,” arXiv preprint arXiv:2308.08614, 2023

  42. [42]

    The impact of reasoning step length on large language models

    M. Jin, Q. Yu, D. Shu, H. Zhao, W. Hua, Y. Meng, Y. Zhang, and M. Du, “The impact of reasoning step length on large language models,” arXiv preprint arXiv:2401.04925, 2024

  43. [43]

    Graph of thoughts: Solving elaborate problems with large language models,

    M. Besta, N. Blach, A. Kubicek, R. Gerstenberger, M. Podstawski, L. Gianinazzi, J. Gajda, T. Lehmann, H. Niewiadomski, P. Nyczyk et al., “Graph of thoughts: Solving elaborate problems with large language models,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 38, no. 16, 2024, pp. 17682–17690

  44. [44]

    Self-playing Adversarial Language Game Enhances LLM Reasoning,

    P. Cheng, T. Hu, H. Xu, Z. Zhang, Y. Dai, L. Han, and N. Du, “Self-playing Adversarial Language Game Enhances LLM Reasoning,” arXiv preprint arXiv:2404.10642, 2024

  45. [45]

    IdealGPT: Iteratively Decomposing Vision and Language Reasoning via Large Language Models,

    H. You, R. Sun, Z. Wang, L. Chen, G. Wang, H. Ayyubi, K.-W. Chang, and S.-F. Chang, “IdealGPT: Iteratively Decomposing Vision and Language Reasoning via Large Language Models,” in Findings of the Association for Computational Linguistics: EMNLP 2023, 2023, pp. 11289–11303

  46. [46]

    V*: Guided Visual Search as a Core Mechanism in Multimodal LLMs,

    P. Wu and S. Xie, “V*: Guided Visual Search as a Core Mechanism in Multimodal LLMs,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 13084–13094

  47. [47]

    GENOME: Generative Neuro-Symbolic Visual Reasoning by Growing and Reusing Modules,

    Z. Chen, R. Sun, W. Liu, Y. Hong, and C. Gan, “GENOME: Generative Neuro-Symbolic Visual Reasoning by Growing and Reusing Modules,” in International Conference on Learning Representations, 2024

  48. [48]

    A comparative study on reasoning patterns of openai’s o1 model

    S. Wu, Z. Peng, X. Du, T. Zheng, M. Liu, J. Wu, J. Ma, Y. Li, J. Yang, W. Zhou et al., “A Comparative Study on Reasoning Patterns of OpenAI’s o1 Model,” arXiv preprint arXiv:2410.13639, 2024

  49. [49]

    Towards System 2 Reasoning in LLMs: Learning How to Think With Meta Chain-of-Thought,

    V. Xiang, C. Snell, K. Gandhi, A. Albalak, A. Singh, C. Blagden, D. Phung, R. Rafailov, N. Lile, D. Mahan et al., “Towards System 2 Reasoning in LLMs: Learning How to Think With Meta Chain-of-Thought,” arXiv preprint arXiv:2501.04682, 2025

  50. [50]

    O1 replication journey: A strategic progress report–part 1

    Y. Qin, X. Li, H. Zou, Y. Liu, S. Xia, Z. Huang, Y. Ye, W. Yuan, H. Liu, Y. Li et al., “O1 Replication Journey: A Strategic Progress Report–Part 1,” arXiv preprint arXiv:2410.18982, 2024

  51. [51]

    O1 Replication Journey–Part 2: Surpassing O1-preview through Simple Distillation, Big Progress or Bitter Lesson?,

    Z. Huang, H. Zou, X. Li, Y. Liu, Y. Zheng, E. Chern, S. Xia, Y. Qin, W. Yuan, and P. Liu, “O1 Replication Journey–Part 2: Surpassing O1-preview through Simple Distillation, Big Progress or Bitter Lesson?” arXiv preprint arXiv:2411.16489, 2024

  52. [52]

    O1 Replication Journey–Part 3: Inference-time Scaling for Medical Reasoning,

    Z. Huang, G. Geng, S. Hua, Z. Huang, H. Zou, S. Zhang, P. Liu, and X. Zhang, “O1 Replication Journey–Part 3: Inference-time Scaling for Medical Reasoning,” arXiv preprint arXiv:2501.06458, 2025

  53. [53]

    Imitate, Explore, and Self-Improve: A Reproduction Report on Slow-thinking Reasoning Systems,

    Y. Min, Z. Chen, J. Jiang, J. Chen, J. Deng, Y. Hu, Y. Tang, J. Wang, X. Cheng, H. Song, W. X. Zhao, Z. Liu, Z. Wang, and J.-R. Wen, “Imitate, Explore, and Self-Improve: A Reproduction Report on Slow-thinking Reasoning Systems,” arXiv preprint arXiv:2412.09413, 2024

  54. [54]

    RedStar: Does Scaling Long-CoT Data Unlock Better Slow-Reasoning Systems?

    H. Xu, X. Wu, W. Wang, Z. Li, D. Zheng, B. Chen, Y. Hu, S. Kang, J. Ji, Y. Zhang et al., “RedStar: Does Scaling Long-CoT Data Unlock Better Slow-Reasoning Systems?” arXiv preprint arXiv:2501.11284, 2025

  55. [55]

    Scaling of search and learning: A roadmap to reproduce o1 from reinforcement learning perspective

    Z. Zeng, Q. Cheng, Z. Yin, B. Wang, S. Li, Y. Zhou, Q. Guo, X. Huang, and X. Qiu, “Scaling of Search and Learning: A Roadmap to Reproduce o1 from Reinforcement Learning Perspective,” arXiv preprint arXiv:2412.14135, 2024

  56. [56]

    Test-time Computing: from System-1 Thinking to System-2 Thinking,

    Y. Ji, J. Li, H. Ye, K. Wu, J. Xu, L. Mo, and M. Zhang, “Test-time Computing: from System-1 Thinking to System-2 Thinking,” arXiv preprint arXiv:2501.02497, 2025

  57. [57]

    Reasoning Language Models: A Blueprint,

    M. Besta, J. Barth, E. Schreiber, A. Kubicek, A. Catarino, R. Gerstenberger, P. Nyczyk, P. Iff, Y. Li, S. Houliston et al., “Reasoning Language Models: A Blueprint,” arXiv preprint arXiv:2501.11223, 2025

  58. [58]

    LLM as a Mastermind: A Survey of Strategic Reasoning with Large Language Models,

    Y. Zhang, S. Mao, T. Ge, X. Wang, A. de Wynter, Y. Xia, W. Wu, T. Song, M. Lan, and F. Wei, “LLM as a Mastermind: A Survey of Strategic Reasoning with Large Language Models,” arXiv preprint arXiv:2404.01230, 2024

  59. [59]

    Towards large reasoning models: A survey of reinforced reasoning with large language models

    F. Xu, Q. Hao, Z. Zong, J. Wang, Y. Zhang, J. Wang, X. Lan, J. Gong, T. Ouyang, F. Meng et al., “Towards Large Reasoning Models: A Survey of Reinforced Reasoning with Large Language Models,” arXiv preprint arXiv:2501.09686, 2025

  60. [60]

    Learning transferable visual models from natural language supervision,

    A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark et al., “Learning transferable visual models from natural language supervision,” in International conference on machine learning. PMLR, 2021, pp. 8748–8763

  61. [61]

    Zero-shot text-to-image generation,

    A. Ramesh, M. Pavlov, G. Goh, S. Gray, C. Voss, A. Radford, M. Chen, and I. Sutskever, “Zero-shot text-to-image generation,” in International conference on machine learning. PMLR, 2021, pp. 8821–8831

  62. [62]

    GPT-4 Technical Report,

    OpenAI, “GPT-4 Technical Report,” 2023

  63. [63]

    Flamingo: a visual language model for few-shot learning,

    J.-B. Alayrac, J. Donahue, P. Luc, A. Miech, I. Barr, Y. Hasson, K. Lenc, A. Mensch, K. Millican, M. Reynolds et al., “Flamingo: a visual language model for few-shot learning,” Advances in Neural Information Processing Systems, vol. 35, pp. 23716–23736, 2022

  64. [64]

    BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models,

    J. Li, D. Li, S. Savarese, and S. C. H. Hoi, “BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models,” in International Conference on Machine Learning, ICML 2023, 23-29 July 2023, Honolulu, Hawaii, USA, 2023, pp. 19730–19742

  65. [65]

    InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning,

    W. Dai, J. Li, D. Li, A. M. H. Tiong, J. Zhao, W. Wang, B. Li, P. Fung, and S. C. H. Hoi, “InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning,” in Thirty-seventh Conference on Neural Information Processing Systems, 2023

  66. [66]

    FastMoE: A fast mixture-of-expert training system,

    J. He, J. Qiu, A. Zeng, Z. Yang, J. Zhai, and J. Tang, “FastMoE: A fast mixture-of-expert training system,” arXiv preprint arXiv:2103.13262, 2021

  67. [67]

    Glam: Efficient scaling of language models with mixture-of-experts,

    N. Du, Y. Huang, A. M. Dai, S. Tong, D. Lepikhin, Y. Xu, M. Krikun, Y. Zhou, A. W. Yu, O. Firat et al., “Glam: Efficient scaling of language models with mixture-of-experts,” in International conference on machine learning. PMLR, 2022, pp. 5547–5569

  68. [68]

    DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models,

    D. Dai, C. Deng, C. Zhao, R. Xu, H. Gao, D. Chen, J. Li, W. Zeng, X. Yu, Y. Wu et al., “DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models,” in Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2024, pp. 1280–1297

  69. [69]

    Learning representations by back-propagating errors,

    D. E. Rumelhart, G. E. Hinton, and R. J. Williams, “Learning representations by back-propagating errors,” nature, vol. 323, no. 6088, pp. 533–536, 1986

  70. [70]

    Convolutional networks for images, speech, and time series,

    Y. LeCun, Y. Bengio et al., “Convolutional networks for images, speech, and time series,” The handbook of brain theory and neural networks, vol. 3361, no. 10, p. 1995, 1995

  71. [71]

    Long Short-term Memory,

    S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997

  72. [72]

    A fast learning algorithm for deep belief nets,

    G. E. Hinton, S. Osindero, and Y.-W. Teh, “A fast learning algorithm for deep belief nets,” Neural computation, vol. 18, no. 7, pp. 1527–1554, 2006

  73. [73]

    Reducing the dimensionality of data with neural networks,

    G. E. Hinton and R. R. Salakhutdinov, “Reducing the dimensionality of data with neural networks,” science, vol. 313, no. 5786, pp. 504–507, 2006

  74. [74]

    Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups,

    G. Hinton, L. Deng, D. Yu, G. E. Dahl, A.-r. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, T. N. Sainath et al., “Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups,” IEEE Signal processing magazine, vol. 29, no. 6, pp. 82–97, 2012

  75. [75]

    Imagenet classification with deep convolutional neural networks,

    A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” Advances in neural information processing systems, vol. 25, 2012

  76. [76]

    Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation,

    K. Cho, B. van Merrienboer, Ç. Gülçehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio, “Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation,” in Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, EMNLP 2014, October 25–29, 2014, Doha, Qatar, A meeting of SIGDAT, a Spe...

  77. [77]

    Sequence to Sequence Learning with Neural Networks

    I. Sutskever, O. Vinyals, and Q. V. Le, “Sequence to Sequence Learning with Neural Networks,” arXiv preprint arXiv:1409.3215, 2014

  78. [78]

    Dropout: a simple way to prevent neural networks from overfitting,

    N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, “Dropout: a simple way to prevent neural networks from overfitting,” The journal of machine learning research, vol. 15, no. 1, pp. 1929–1958, 2014

  79. [79]

    Adam: A Method for Stochastic Optimization

    D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014

  80. [80]

    Deep learning,

    Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,” nature, vol. 521, no. 7553, pp. 436–444, 2015

Showing first 80 references.