From System 1 to System 2: A Survey of Reasoning Large Language Models
Pith reviewed 2026-05-13 01:32 UTC · model grok-4.3
The pith
Reasoning large language models shift from fast intuitive decisions to deliberate step-by-step analysis.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The survey establishes that the integration of foundational large language models with System 2 technologies has produced models that perform expert-level analysis in mathematics and coding through explicit step-by-step logical processes rather than pure pattern matching.
What carries the argument
The core methods for constructing reasoning large language models that promote explicit step-by-step logical analysis over fast heuristic responses.
If this is right
- These models deliver more accurate judgments and fewer biases on tasks that require extended analysis.
- The field gains concrete techniques for scaling deliberate reasoning beyond initial domains like mathematics.
- Future work can prioritize refinements that maintain performance while expanding to additional problem types.
- The overview of benchmarks supplies a baseline for measuring further gains in reasoning depth.
Where Pith is reading between the lines
- If the surveyed methods generalize, hybrid systems could combine multiple reasoning models to tackle problems requiring both speed and depth.
- The emphasis on benchmark comparisons suggests that progress will depend on creating harder tests that separate true reasoning from data familiarity.
- Tracking rapid changes in this area may require living resources that update as new models and techniques appear.
Load-bearing premise
Strong results on existing math and coding benchmarks demonstrate genuine step-by-step logical reasoning rather than advanced pattern matching learned from training data.
What would settle it
A new benchmark consisting of problems outside known training distributions where the models show no consistent advantage in producing verifiable logical chains over standard large language models.
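One way to make "no consistent advantage" concrete is a paired test over the same problems. The following is a minimal sketch, assuming each model's output on each out-of-distribution problem has already been marked as a verified logical chain or not. The function name `paired_sign_test` and the boolean input format are hypothetical and do not come from the survey.

```python
from math import comb


def paired_sign_test(reasoning_correct, baseline_correct):
    """Exact two-sided sign test on paired per-problem outcomes.

    Both arguments are lists of booleans (assumed format): True means the
    model produced a verified logical chain for that problem.
    Returns (wins, losses, p_value), where a win is a problem solved only
    by the reasoning model and a loss is one solved only by the baseline.
    """
    if len(reasoning_correct) != len(baseline_correct):
        raise ValueError("both result lists must cover the same problems")
    pairs = list(zip(reasoning_correct, baseline_correct))
    wins = sum(1 for r, b in pairs if r and not b)
    losses = sum(1 for r, b in pairs if b and not r)
    discordant = wins + losses
    if discordant == 0:
        # The models never disagree, so there is no evidence either way.
        return wins, losses, 1.0
    smaller = min(wins, losses)
    # Probability of a split at least this lopsided under a fair coin.
    tail = sum(comb(discordant, i) for i in range(smaller + 1)) / 2 ** discordant
    return wins, losses, min(1.0, 2 * tail)
```

A large p-value on such a benchmark would be consistent with the settling outcome described above, while a small one with many more wins than losses would count against it. The test ignores problems where both models succeed or both fail, which is a design choice of the sign test rather than anything prescribed by the survey.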
Original abstract
Achieving human-level intelligence requires refining the transition from the fast, intuitive System 1 to the slower, more deliberate System 2 reasoning. While System 1 excels in quick, heuristic decisions, System 2 relies on logical reasoning for more accurate judgments and reduced biases. Foundational Large Language Models (LLMs) excel at fast decision-making but lack the depth for complex reasoning, as they have not yet fully embraced the step-by-step analysis characteristic of true System 2 thinking. Recently, reasoning LLMs like OpenAI's o1/o3 and DeepSeek's R1 have demonstrated expert-level performance in fields such as mathematics and coding, closely mimicking the deliberate reasoning of System 2 and showcasing human-like cognitive abilities. This survey begins with a brief overview of the progress in foundational LLMs and the early development of System 2 technologies, exploring how their combination has paved the way for reasoning LLMs. Next, we discuss how to construct reasoning LLMs, analyzing their features, the core methods enabling advanced reasoning, and the evolution of various reasoning LLMs. Additionally, we provide an overview of reasoning benchmarks, offering an in-depth comparison of the performance of representative reasoning LLMs. Finally, we explore promising directions for advancing reasoning LLMs and maintain a real-time GitHub Repository (https://github.com/zzli2022/Awesome-Slow-Reason-System) to track the latest developments. We hope this survey will serve as a valuable resource to inspire innovation and drive progress in this rapidly evolving field.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript is a survey tracing the development of large language models from fast, heuristic System 1 processing to deliberate, step-by-step System 2 reasoning. It reviews foundational LLMs, early System 2 techniques, methods for building reasoning models (with emphasis on OpenAI o1/o3 and DeepSeek R1), core enabling approaches, an overview of reasoning benchmarks with performance comparisons, and future research directions, while linking to a live GitHub repository for ongoing updates.
Significance. As a timely synthesis of an active research area, the survey usefully organizes recent advances in reasoning LLMs and collates benchmark results across representative models. The maintained repository adds practical value for readers tracking the field. The interpretive framing that benchmark gains demonstrate 'human-like cognitive abilities' and close mimicry of System 2 is presented as a central motivation but rests on the accuracy of the cited literature rather than new analysis.
major comments (1)
- [Abstract] The central claim that o1/o3 and R1 are 'closely mimicking the deliberate reasoning of System 2' is grounded solely in reported expert-level math and coding benchmark scores. The survey does not supply or cite ablations, contamination audits, or out-of-distribution tests that would distinguish genuine step-by-step deliberation from scaled pattern completion or leakage; this interpretive step therefore remains an unexamined assumption rather than a substantiated conclusion.
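For readers unfamiliar with what a contamination audit involves, the following is a minimal sketch of one common style of check: measuring how many word n-grams of a benchmark item also appear in a reference corpus. It is an illustration under stated assumptions, not a method taken from the survey or from the cited model reports. The helper names `ngrams` and `overlap_ratio`, whitespace tokenization, and the default window of 8 tokens are all hypothetical choices.

```python
def ngrams(text, n=8):
    """Return the set of lowercase, whitespace-tokenized word n-grams in text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_ratio(benchmark_item, corpus_docs, n=8):
    """Fraction of the item's n-grams that also occur in any corpus document.

    A ratio near 1.0 suggests the item (or a close copy) is in the corpus;
    a ratio near 0.0 does not prove the item is unseen, since paraphrases
    and translations are not detected by exact n-gram matching.
    """
    item_grams = ngrams(benchmark_item, n)
    if not item_grams:
        # Item is shorter than n tokens, so nothing can be compared.
        return 0.0
    corpus_grams = set()
    for doc in corpus_docs:
        corpus_grams |= ngrams(doc, n)
    return len(item_grams & corpus_grams) / len(item_grams)
```

Exact matching of this kind only catches verbatim leakage, which is one reason the referee also asks for ablations and out-of-distribution tests.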
minor comments (3)
- [Abstract / Introduction] The abstract and introduction would benefit from a brief explicit statement of the survey's scope (e.g., which models and benchmarks are covered through what cutoff date) to help readers assess completeness.
- [Abstract] The GitHub repository is described as 'real-time'; adding a sentence noting the date of the most recent update or commit would increase transparency.
- [Benchmark overview] In the benchmark comparison section, ensure that performance tables or figures include error bars or variance measures where available from the original papers, to avoid over-interpreting single-run scores.
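The last minor comment can be illustrated with a short sketch of two standard ways to attach uncertainty to a benchmark score. This assumes only per-benchmark counts of correct answers, or a handful of repeated-run accuracies, are available. The function names are hypothetical, and the 1.96 multiplier corresponds to an approximate 95% interval.

```python
from math import sqrt
from statistics import mean, stdev


def wilson_interval(correct, total, z=1.96):
    """Wilson score interval for a single-run accuracy of correct/total."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = correct / total
    denom = 1 + z ** 2 / total
    center = (p + z ** 2 / (2 * total)) / denom
    half = z * sqrt(p * (1 - p) / total + z ** 2 / (4 * total ** 2)) / denom
    return center - half, center + half


def mean_and_sem(scores):
    """Mean and standard error of the mean across repeated runs."""
    if len(scores) < 2:
        raise ValueError("need at least two runs to estimate variance")
    return mean(scores), stdev(scores) / sqrt(len(scores))
```

For example, `wilson_interval(50, 100)` gives roughly (0.404, 0.596), which shows how wide the uncertainty on a 100-problem benchmark can be even before run-to-run variance is considered.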
Simulated Author's Rebuttal
We thank the referee for the positive assessment of the survey's timeliness and the value of the accompanying repository. We address the single major comment below and have revised the manuscript to clarify the scope and grounding of our interpretive framing.
Point-by-point responses
-
Referee: The central claim that o1/o3 and R1 are 'closely mimicking the deliberate reasoning of System 2' is grounded solely in reported expert-level math and coding benchmark scores. The survey does not supply or cite ablations, contamination audits, or out-of-distribution tests that would distinguish genuine step-by-step deliberation from scaled pattern completion or leakage; this interpretive step therefore remains an unexamined assumption rather than a substantiated conclusion.
Authors: We agree that a survey cannot itself supply new ablations, contamination audits, or OOD tests; those must come from primary research. The abstract phrasing summarizes claims made in the cited source papers (OpenAI o1/o3 technical reports and the DeepSeek-R1 paper), which present the models as performing step-by-step reasoning on the reported benchmarks. To avoid presenting this as our own substantiated conclusion, we have revised the abstract to read “as described in the original works, these models achieve expert-level performance on benchmarks that require multi-step reasoning” and added a new paragraph in the introduction that explicitly notes the active debate in the literature. We now cite recent critical analyses that examine potential data contamination, alternative explanations for benchmark gains, and the limits of current evaluation protocols. These additions preserve the survey’s role as a synthesis while making the evidential basis and interpretive caveats transparent.
Revision: yes
Circularity Check
No circularity: survey without derivations, predictions, or self-referential reductions
Full rationale
The paper is a literature survey reviewing foundational LLMs, reasoning LLMs (o1/o3, R1), construction methods, benchmarks, and future directions. It contains no equations, quantitative derivations, fitted parameters, or predictions that could reduce to inputs by construction. All claims reference external published models and benchmarks rather than self-citations or internal definitions. The interpretive analogy between benchmark scores and System 2 reasoning is not a derivation step and does not meet any of the enumerated circularity patterns. The paper is self-contained as a review against external sources.
Axiom & Free-Parameter Ledger
Forward citations
Cited by 36 Pith papers
-
Extracting Search Trees from LLM Reasoning Traces Reveals Myopic Planning
LLMs exhibit myopic planning in four-in-a-row: move choices are best explained by shallow nodes in reasoning traces, not the deep lookahead they generate, unlike humans where depth drives performance.
-
Unsupervised Process Reward Models
Unsupervised PRMs derived from LLM probabilities achieve up to 15% better error detection than LLM judges and match supervised PRMs in verification and RL tasks.
-
Extracting Search Trees from LLM Reasoning Traces Reveals Myopic Planning
LLMs exhibit myopic planning in games, with move choices driven by shallow nodes despite deep reasoning traces, in contrast to human deep-search reliance.
-
Extracting Search Trees from LLM Reasoning Traces Reveals Myopic Planning
LLM move selection in four-in-a-row is best explained by myopic models that ignore deep nodes in their own reasoning traces, while performance correlates with search breadth rather than depth.
-
Extracting Search Trees from LLM Reasoning Traces Reveals Myopic Planning
LLMs display myopic planning in games: move selection is driven by shallow nodes in reasoning traces despite generating deep lookahead, with performance tied to search breadth rather than depth.
-
Post Reasoning: Improving the Performance of Non-Thinking Models at No Cost
Post-Reasoning boosts LLM accuracy by reversing the usual answer-after-reasoning order, delivering mean relative gains of 17.37% across 117 model-benchmark pairs with zero extra cost.
-
ReflectMT: Internalizing Reflection for Efficient and High-Quality Machine Translation
ReflectMT internalizes reflection via two-stage RL to enable direct high-quality machine translation that outperforms explicit reasoning models like DeepSeek-R1 on WMT24 while using 94% fewer tokens.
-
Improving Reasoning Capabilities in Small Models through Mixture-of-Layers Distillation with Stepwise Attention on Key Information
A CoT distillation framework transfers stepwise teacher attention on key information via a Mixture-of-Layers module to improve reasoning in small language models.
-
AdverMCTS: Combating Pseudo-Correctness in Code Generation via Adversarial Monte Carlo Tree Search
AdverMCTS frames code generation as a minimax game where an attacker evolves tests to expose flaws in solver-generated code, yielding more robust outputs than static-test baselines.
-
Video-R1: Reinforcing Video Reasoning in MLLMs
Video-R1 uses temporal-aware RL and mixed datasets to boost video reasoning in MLLMs, with a 7B model reaching 37.1% on VSI-Bench and surpassing GPT-4o.
-
Model-Adaptive Tool Necessity Reveals the Knowing-Doing Gap in LLM Tool Use
LLMs show a knowing-doing gap in tool use: they often recognize when tools are needed via internal states but fail to translate that into actual tool calls, with mismatches of 26-54% on arithmetic and factual tasks.
-
Explaining and Breaking the Safety-Helpfulness Ceiling via Preference Dimensional Expansion
MORA breaks the safety-helpfulness trade-off in LLM alignment by pre-sampling single-reward prompts and rewriting them to expand multi-dimensional reward diversity, yielding 5-12.4% single-preference gains in sequenti...
-
Explaining and Breaking the Safety-Helpfulness Ceiling via Preference Dimensional Expansion
MORA breaks the safety-helpfulness ceiling in LLMs by pre-sampling single-reward prompts and rewriting them to incorporate multi-dimensional intents, delivering 5-12.4% gains in sequential alignment and 4.6% overall i...
-
Seirênes: Adversarial Self-Play with Evolving Distractions for LLM Reasoning
Seirênes trains LLMs via adversarial self-play to generate and overcome evolving distractions, producing gains of 7-10 points on math reasoning benchmarks and exposing blind spots in larger models.
-
CLR-voyance: Reinforcing Open-Ended Reasoning for Inpatient Clinical Decision Support with Outcome-Aware Rubrics
CLR-voyance reformulates inpatient reasoning as POMDP with clinician-validated outcome rubrics, yielding an 8B model that outperforms larger frontier models on the authors' new benchmark.
-
Confidence-Aware Alignment Makes Reasoning LLMs More Reliable
CASPO trains LLMs via iterative direct preference optimization so that token-level confidence tracks step-wise correctness, then applies Confidence-aware Thought pruning at inference to improve both reliability and sp...
-
HypEHR: Hyperbolic Modeling of Electronic Health Records for Efficient Question Answering
HypEHR is a hyperbolic embedding model for EHR data that uses Lorentzian geometry and hierarchy-aware pretraining to answer clinical questions nearly as well as large language models but with much smaller size.
-
OMIBench: Benchmarking Olympiad-Level Multi-Image Reasoning in Large Vision-Language Model
OMIBench benchmark reveals that current LVLMs achieve at most 50% on Olympiad problems requiring reasoning across multiple images.
-
Contrastive Attribution in the Wild: An Interpretability Analysis of LLM Failures on Realistic Benchmarks
Token-level contrastive attribution yields informative signals for some LLM benchmark failures but is not universally applicable across datasets and models.
-
LACE: Lattice Attention for Cross-thread Exploration
LACE enables parallel reasoning paths in LLMs to communicate via lattice attention and error-correct using synthetic training data, improving accuracy by over 7 points over standard parallel search.
-
The role of System 1 and System 2 semantic memory structure in human and LLM biases
Human semantic memory networks for System 1 and System 2 are structurally distinct and consistently relate to implicit gender bias levels, but LLM networks do not exhibit these properties.
-
Leveraging Mathematical Reasoning of LLMs for Efficient GPU Thread Mapping
Large language models derive exact analytical GPU thread mappings for complex 2D/3D domains and fractals via in-context learning, outperforming symbolic regression and enabling up to thousands-fold speedups and energy...
-
TimelineReasoner: Advancing Timeline Summarization with Large Reasoning Models
TimelineReasoner applies large reasoning models in a Global Cognition plus Detail Exploration loop to produce more accurate, complete, and coherent timelines from news than prior LLM-based methods.
-
KG-Hopper: Empowering Compact Open LLMs with Knowledge Graph Reasoning via Reinforcement Learning
KG-Hopper uses RL to embed full multi-hop KG traversal and backtracking into a single LLM inference round, enabling a 7B model to outperform larger multi-step systems and compete with GPT-3.5/GPT-4o-mini on eight benchmarks.
-
Beyond Individual Intelligence: Surveying Collaboration, Failure Attribution, and Self-Evolution in LLM-based Multi-Agent Systems
The survey proposes the LIFE framework to unify fragmented research on collaboration, failure attribution, and self-evolution in LLM multi-agent systems into a progression toward self-organizing intelligence.
-
Do LLMs have core beliefs?
LLMs generally fail to maintain stable worldviews under adversarial conversational pressure, indicating they lack core beliefs akin to those in human cognition.
-
UpstreamQA: A Modular Framework for Explicit Reasoning on Video Question Answering Tasks
UpstreamQA disentangles video reasoning by using LRMs for explicit upstream object identification and scene context before downstream LMM VideoQA, improving performance and interpretability on OpenEQA and NExTQA in so...
-
The Cognitive Penalty: Ablating System 1 and System 2 Reasoning in Edge-Native SLMs for Decentralized Consensus
System 1 intuition in edge SLMs delivers 100% adversarial robustness and low latency for DAO consensus while System 2 reasoning causes 26.7% cognitive collapse and 17x slowdown.
-
Towards Robust Endogenous Reasoning: Unifying Drift Adaptation in Non-Stationary Tuning
CPO++ adapts reinforcement fine-tuning of MLLMs to endogenous multi-modal concept drift through counterfactual reasoning and preference optimization, yielding better coherence and cross-domain robustness in safety-cri...
-
LACE: Lattice Attention for Cross-thread Exploration
LACE adds lattice attention to let parallel LLM reasoning threads interact and correct errors, raising accuracy over 7 points versus standard independent sampling.
-
LACE: Lattice Attention for Cross-thread Exploration
LACE enables concurrent reasoning paths in LLMs to interact via lattice attention and a synthetic training pipeline, raising accuracy more than 7 points over independent parallel search.
-
H-Probes: Extracting Hierarchical Structures From Latent Representations of Language Models
H-probes locate low-dimensional subspaces encoding hierarchy in LLM activations for synthetic tree tasks, show causal importance and generalization, and detect weaker signals in mathematical reasoning traces.
-
KG-Reasoner: A Reinforced Model for End-to-End Multi-Hop Knowledge Graph Reasoning
KG-Reasoner uses reinforcement learning to train LLMs for end-to-end multi-hop knowledge graph reasoning, achieving competitive or better results on eight benchmarks.
-
Stop Overthinking: A Survey on Efficient Reasoning for Large Language Models
A survey organizing techniques to achieve efficient reasoning in LLMs by shortening chain-of-thought outputs.
-
A Brief Overview: Agentic Reinforcement Learning In Large Language Models
The paper surveys the conceptual foundations, methodological innovations, challenges, and future directions of agentic reinforcement learning frameworks that embed cognitive capabilities like meta-reasoning and self-r...
-
A Brief Overview: Agentic Reinforcement Learning In Large Language Models
This review synthesizes conceptual foundations, methods, challenges, and future directions for agentic reinforcement learning in large language models.
Reference graph
Works this paper leans on
-
[1]
System 1+ system 2= better world: Neural-symbolic chain of logic reasoning,
W. Hua and Y. Zhang, “System 1+ system 2= better world: Neural-symbolic chain of logic reasoning,” in Findings of the Association for Computational Linguistics: EMNLP 2022 , 2022, pp. 601–612
work page 2022
-
[2]
Chain-of-thought prompting elicits reasoning in large language models,
J. Wei, X. Wang, D. Schuurmans, M. Bosma, F. Xia, E. Chi, Q. V . Le, D. Zhou et al., “Chain-of-thought prompting elicits reasoning in large language models,” Advances in neural information processing systems, vol. 35, pp. 24 824–24 837, 2022
work page 2022
-
[3]
Self-Consistency Improves Chain of Thought Reasoning in Language Models,
X. Wang, J. Wei, D. Schuurmans, Q. V . Le, E. H. Chi, S. Narang, A. Chowdhery, and D. Zhou, “Self-Consistency Improves Chain of Thought Reasoning in Language Models,” in The Eleventh International Conference on Learning Representations, 2023
work page 2023
-
[4]
Least-to-Most Prompt- ing Enables Complex Reasoning in Large Language Models,
D. Zhou, N. Schärli, L. Hou, J. Wei, N. Scales, X. Wang, D. Schuur- mans, C. Cui, O. Bousquet, Q. V . Leet al., “Least-to-Most Prompt- ing Enables Complex Reasoning in Large Language Models,” in The Eleventh International Conference on Learning Representations , 2023
work page 2023
-
[5]
STaR: Self- taught reasoner bootstrapping reasoning with reasoning,
E. Zelikman, Y. Wu, J. Mu, and N. D. Goodman, “STaR: Self- taught reasoner bootstrapping reasoning with reasoning,” in Proc. the 36th International Conference on Neural Information Pro- cessing Systems, vol. 1126, 2024
work page 2024
-
[6]
Heuristic and analytic processes in reasoning,
J. S. B. Evans, “Heuristic and analytic processes in reasoning,” British Journal of Psychology, vol. 75, no. 4, pp. 451–468, 1984
work page 1984
-
[7]
Maps of bounded rationality: Psychology for behavioral economics,
D. Kahneman, “Maps of bounded rationality: Psychology for behavioral economics,” American economic review , vol. 93, no. 5, pp. 1449–1475, 2003
work page 2003
-
[8]
Towards Reasoning in Large Language Models: A Survey,
J. Huang and K. C.-C. Chang, “Towards Reasoning in Large Language Models: A Survey,” in Findings of the Association for Computational Linguistics: ACL 2023, 2023, pp. 1049–1065
work page 2023
-
[9]
Reasoning with Language Model Prompting: A Survey,
S. Qiao, Y. Ou, N. Zhang, X. Chen, Y. Yao, S. Deng, C. Tan, F. Huang, and H. Chen, “Reasoning with Language Model Prompting: A Survey,” in Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2023, pp. 5368–5393
work page 2023
-
[10]
Towards Understanding Chain-of-Thought Prompting: An Empirical Study of What Matters,
B. Wang, S. Min, X. Deng, J. Shen, Y. Wu, L. Zettlemoyer, and H. Sun, “Towards Understanding Chain-of-Thought Prompting: An Empirical Study of What Matters,” in Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2023, pp. 2717–2739
work page 2023
-
[11]
On Second Thought, Let’s Not Think Step by Step! Bias and Toxicity in Zero-Shot Reasoning,
O. Shaikh, H. Zhang, W. Held, M. Bernstein, and D. Yang, “On Second Thought, Let’s Not Think Step by Step! Bias and Toxicity in Zero-Shot Reasoning,” in Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2023, pp. 4454–4470
work page 2023
-
[12]
H. Shao, S. Qian, H. Xiao, G. Song, Z. Zong, L. Wang, Y. Liu, and H. Li, “Visual cot: Advancing multi-modal language models with a comprehensive dataset and benchmark for chain-of-thought reasoning,” in The Thirty-eight Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2024
work page 2024
-
[13]
Automatic Chain of Thought Prompting in Large Language Models,
Z. Zhang, A. Zhang, M. Li, and A. Smola, “Automatic Chain of Thought Prompting in Large Language Models,” in The Eleventh International Conference on Learning Representations, 2023
work page 2023
-
[14]
Reasoning with Language Model is Planning with World Model,
S. Hao, Y. Gu, H. Ma, J. Hong, Z. Wang, D. Wang, and Z. Hu, “Reasoning with Language Model is Planning with World Model,” in Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, 2023, pp. 8154–8173
work page 2023
-
[15]
Meta prompting for agi systems,
Y. Zhang, “Meta prompting for agi systems,” arXiv preprint arXiv:2311.11482, 2023
-
[16]
OpenAI, “Hello GPT-4o,” May 2024. [Online]. Available: https://openai.com/index/hello-gpt-4o/
work page 2024
-
[17]
A. Liu, B. Feng, B. Xue, B. Wang, B. Wu, C. Lu, C. Zhao, C. Deng, C. Zhang, C. Ruan et al. , “Deepseek-v3 technical report,” arXiv preprint arXiv:2412.19437, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[18]
A. Vaswani, “Attention is all you need,” Advances in Neural Information Processing Systems, 2017
work page 2017
-
[19]
BERT: Pre- training of Deep Bidirectional Transformers for Language Under- standing,
J. Devlin, M. Chang, K. Lee, and K. Toutanova, “BERT: Pre- training of Deep Bidirectional Transformers for Language Under- standing,” in Proceedings of the 2019 Conference of the North Ameri- can Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Minneapolis, MN, USA, June 2-7, 2019, Volume 1 (Long and S...
work page 2019
-
[20]
RoBERTa: A Robustly Optimized BERT Pretraining Approach
Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, and V . Stoyanov, “RoBERTa: A Robustly Optimized BERT Pretraining Approach,” CoRR, vol. abs/1907.11692, 2019. JOURNAL OF LATEX CLASS FILES, JANUARY 2025 23
work page internal anchor Pith review Pith/arXiv arXiv 1907
-
[21]
Improving language understanding by generative pre-training,
A. Radford, “Improving language understanding by generative pre-training,” 2018
work page 2018
-
[22]
Language models are unsupervised multitask learners,
A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, I. Sutskever et al. , “Language models are unsupervised multitask learners,” OpenAI blog, vol. 1, no. 8, p. 9, 2019
work page 2019
-
[23]
Language models are few-shot learners,
T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P . Dhari- wal, A. Neelakantan, P . Shyam, G. Sastry, A. Askell et al. , “Language models are few-shot learners,” Advances in neural information processing systems, vol. 33, pp. 1877–1901, 2020
work page 1901
-
[24]
Train- ing language models to follow instructions with human feed- back,
L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. Wainwright, P . Mishkin, C. Zhang, S. Agarwal, K. Slama, A. Ray et al., “Train- ing language models to follow instructions with human feed- back,” Advances in neural information processing systems , vol. 35, pp. 27 730–27 744, 2022
work page 2022
-
[25]
LLaMA: Open and Efficient Foundation Language Models
H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux, T. Lacroix, B. Rozière, N. Goyal, E. Hambro, F. Azhar et al. , “Llama: Open and efficient foundation language models,” arXiv preprint arXiv:2302.13971, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[26]
A Survey of Large Language Models
W. X. Zhao, K. Zhou, J. Li, T. Tang, X. Wang, Y. Hou, Y. Min, B. Zhang, J. Zhang, Z. Dong et al., “A survey of large language models,” arXiv preprint arXiv:2303.18223, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[27]
H. Liu, C. Li, Q. Wu, and Y. J. Lee, “Visual Instruction Tuning,” in Thirty-seventh Conference on Neural Information Processing Systems , 2023
work page 2023
-
[28]
MM-LLMs: Recent Advances in MultiModal Large Language Models,
D. Zhang, Y. Yu, J. Dong, C. Li, D. Su, C. Chu, and D. Yu, “MM-LLMs: Recent Advances in MultiModal Large Language Models,” in Findings of the Association for Computational Linguis- tics, ACL 2024, Bangkok, Thailand and virtual meeting, August 11- 16, 2024. Association for Computational Linguistics, 2024, pp. 12 401–12 430
work page 2024
-
[29]
OpenAI, “Learning to reason with LLMs,” Septem- ber 2024. [Online]. Available: https://openai.com/index/ learning-to-reason-with-llms/
work page 2024
-
[30]
——, “OpenAI o3-mini,” January 2025. [Online]. Available: https://openai.com/index/openai-o3-mini/
work page 2025
-
[31]
DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning
D. Guo, D. Yang, H. Zhang, J. Song, R. Zhang, R. Xu, Q. Zhu, S. Ma, P . Wang, X. Bi et al. , “DeepSeek-R1: Incentivizing Rea- soning Capability in LLMs via Reinforcement Learning,” arXiv preprint arXiv:2501.12948, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[32]
Training Verifiers to Solve Math Word Problems
K. Cobbe, V . Kosaraju, M. Bavarian, M. Chen, H. Jun, L. Kaiser, M. Plappert, J. Tworek, J. Hilton, R. Nakano et al. , “Train- ing verifiers to solve math word problems,” arXiv preprint arXiv:2110.14168, 2021
work page internal anchor Pith review Pith/arXiv arXiv 2021
-
[33]
Large language models are zero-shot reasoners,
T. Kojima, S. S. Gu, M. Reid, Y. Matsuo, and Y. Iwasawa, “Large language models are zero-shot reasoners,” Advances in neural information processing systems, vol. 35, pp. 22 199–22 213, 2022
work page 2022
-
[34]
Improving large language model fine-tuning for solving math problems,
Y. Liu, A. Singh, C. D. Freeman, J. D. Co-Reyes, and P . J. Liu, “Improving large language model fine-tuning for solving math problems,” arXiv preprint arXiv:2310.10047, 2023
-
[35]
Solving Math Word Problems via Cooperative Reasoning induced Language Models,
X. Zhu, J. Wang, L. Zhang, Y. Zhang, Y. Huang, R. Gan, J. Zhang, and Y. Yang, “Solving Math Word Problems via Cooperative Reasoning induced Language Models,” in Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2023, pp. 4471–4485
work page 2023
-
[36]
Dynamic Prompt Learning via Policy Gradient for Semi-structured Mathematical Reasoning,
P . Lu, L. Qiu, K.-W. Chang, Y. N. Wu, S.-C. Zhu, T. Rajpurohit, P . Clark, and A. Kalyan, “Dynamic Prompt Learning via Policy Gradient for Semi-structured Mathematical Reasoning,” in The Eleventh International Conference on Learning Representations, 2023
work page 2023
-
[37]
H. Lightman, V . Kosaraju, Y. Burda, H. Edwards, B. Baker, T. Lee, J. Leike, J. Schulman, I. Sutskever, and K. Cobbe, “Let’s Verify Step by Step,” in The Twelfth International Conference on Learning Representations, 2024
work page 2024
-
[38]
F. Yao, C. Tian, J. Liu, Z. Zhang, Q. Liu, L. Jin, S. Li, X. Li, and X. Sun, “Thinking like an expert: Multimodal hypergraph- of-thought (hot) reasoning to boost foundation modals,” arXiv preprint arXiv:2308.06207, 2023
-
[39]
Beyond Chain-of-Thought, Effec- tive Graph-of-Thought Reasoning in Language Models,
Y. Yao, Z. Li, and H. Zhao, “Beyond Chain-of-Thought, Effec- tive Graph-of-Thought Reasoning in Language Models,” arXiv preprint arXiv:2305.16582, 2023
-
[40]
Mindmap: Knowledge graph prompting sparks graph of thoughts in large language models,
Y. Wen, Z. Wang, and J. Sun, “Mindmap: Knowledge graph prompting sparks graph of thoughts in large language models,” arXiv preprint arXiv:2308.09729, 2023
-
[41]
Boosting logical reasoning in large language models through a new framework: The graph of thought,
B. Lei, C. Liao, C. Ding et al. , “Boosting logical reasoning in large language models through a new framework: The graph of thought,” arXiv preprint arXiv:2308.08614, 2023
-
[42]
The impact of reasoning step length on large language models
M. Jin, Q. Yu, D. Shu, H. Zhao, W. Hua, Y. Meng, Y. Zhang, and M. Du, “The impact of reasoning step length on large language models,” arXiv preprint arXiv:2401.04925, 2024
-
[43]
Graph of thoughts: Solving elaborate problems with large language models,
M. Besta, N. Blach, A. Kubicek, R. Gerstenberger, M. Podstawski, L. Gianinazzi, J. Gajda, T. Lehmann, H. Niewiadomski, P . Nyczyk et al. , “Graph of thoughts: Solving elaborate problems with large language models,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 38, no. 16, 2024, pp. 17 682–17 690
work page 2024
-
[44]
Self- playing Adversarial Language Game Enhances LLM Reasoning,
P . Cheng, T. Hu, H. Xu, Z. Zhang, Y. Dai, L. Han, and N. Du, “Self- playing Adversarial Language Game Enhances LLM Reasoning,” arXiv preprint arXiv:2404.10642, 2024
-
[45]
IdealGPT: Iteratively Decomposing Vision and Language Reasoning via Large Language Models,
H. You, R. Sun, Z. Wang, L. Chen, G. Wang, H. Ayyubi, K.- W. Chang, and S.-F. Chang, “IdealGPT: Iteratively Decomposing Vision and Language Reasoning via Large Language Models,” in Findings of the Association for Computational Linguistics: EMNLP 2023, 2023, pp. 11 289–11 303
work page 2023
-
[46]
V?: Guided Visual Search as a Core Mechanism in Multimodal LLMs,
P . Wu and S. Xie, “V?: Guided Visual Search as a Core Mechanism in Multimodal LLMs,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , 2024, pp. 13 084– 13 094
work page 2024
-
[47]
GENOME: Gener- ative Neuro-Symbolic Visual Reasoning by Growing and Reusing Modules,
Z. Chen, R. Sun, W. Liu, Y. Hong, and C. Gan, “GENOME: Gener- ative Neuro-Symbolic Visual Reasoning by Growing and Reusing Modules,” in International Conference on Learning Representations , 2024
work page 2024
-
[48]
A comparative study on reasoning patterns of openai’s o1 model
S. Wu, Z. Peng, X. Du, T. Zheng, M. Liu, J. Wu, J. Ma, Y. Li, J. Yang, W. Zhou et al., “A Comparative Study on Reasoning Patterns of OpenAI’s o1 Model,”arXiv preprint arXiv:2410.13639, 2024
-
[49]
Towards system 2 reasoning in llms: Learning how to think with meta chain-of-thought,
V. Xiang, C. Snell, K. Gandhi, A. Albalak, A. Singh, C. Blagden, D. Phung, R. Rafailov, N. Lile, D. Mahan et al., “Towards System 2 Reasoning in LLMs: Learning How to Think With Meta Chain-of-Thought,” arXiv preprint arXiv:2501.04682, 2025
-
[50]
Y. Qin, X. Li, H. Zou, Y. Liu, S. Xia, Z. Huang, Y. Ye, W. Yuan, H. Liu, Y. Li et al., “O1 Replication Journey: A Strategic Progress Report–Part 1,” arXiv preprint arXiv:2410.18982, 2024
-
[51]
Z. Huang, H. Zou, X. Li, Y. Liu, Y. Zheng, E. Chern, S. Xia, Y. Qin, W. Yuan, and P. Liu, “O1 Replication Journey–Part 2: Surpassing O1-preview through Simple Distillation, Big Progress or Bitter Lesson?” arXiv preprint arXiv:2411.16489, 2024
-
[52]
Z. Huang, G. Geng, S. Hua, Z. Huang, H. Zou, S. Zhang, P. Liu, and X. Zhang, “O1 Replication Journey–Part 3: Inference-time Scaling for Medical Reasoning,” arXiv preprint arXiv:2501.06458, 2025
-
[53]
Y. Min, Z. Chen, J. Jiang, J. Chen, J. Deng, Y. Hu, Y. Tang, J. Wang, X. Cheng, H. Song, W. X. Zhao, Z. Liu, Z. Wang, and J.-R. Wen, “Imitate, Explore, and Self-Improve: A Reproduction Report on Slow-thinking Reasoning Systems,” arXiv preprint arXiv:2412.09413, 2024
-
[54]
RedStar: Does Scaling Long-CoT Data Unlock Better Slow-Reasoning Systems?
H. Xu, X. Wu, W. Wang, Z. Li, D. Zheng, B. Chen, Y. Hu, S. Kang, J. Ji, Y. Zhang et al., “RedStar: Does Scaling Long-CoT Data Unlock Better Slow-Reasoning Systems?” arXiv preprint arXiv:2501.11284, 2025
-
[55]
Scaling of search and learning: A roadmap to reproduce o1 from reinforcement learning perspective
Z. Zeng, Q. Cheng, Z. Yin, B. Wang, S. Li, Y. Zhou, Q. Guo, X. Huang, and X. Qiu, “Scaling of Search and Learning: A Roadmap to Reproduce o1 from Reinforcement Learning Perspective,” arXiv preprint arXiv:2412.14135, 2024
-
[56]
Test-time Computing: from System-1 Thinking to System-2 Thinking,
Y. Ji, J. Li, H. Ye, K. Wu, J. Xu, L. Mo, and M. Zhang, “Test-time Computing: from System-1 Thinking to System-2 Thinking,” arXiv preprint arXiv:2501.02497, 2025
-
[57]
Reasoning Language Models: A Blueprint,
M. Besta, J. Barth, E. Schreiber, A. Kubicek, A. Catarino, R. Gerstenberger, P. Nyczyk, P. Iff, Y. Li, S. Houliston et al., “Reasoning Language Models: A Blueprint,” arXiv preprint arXiv:2501.11223, 2025
-
[58]
Y. Zhang, S. Mao, T. Ge, X. Wang, A. de Wynter, Y. Xia, W. Wu, T. Song, M. Lan, and F. Wei, “LLM as a Mastermind: A Survey of Strategic Reasoning with Large Language Models,” arXiv preprint arXiv:2404.01230, 2024
-
[59]
Towards large reasoning models: A survey of reinforced reasoning with large language models
F. Xu, Q. Hao, Z. Zong, J. Wang, Y. Zhang, J. Wang, X. Lan, J. Gong, T. Ouyang, F. Meng et al., “Towards Large Reasoning Models: A Survey of Reinforced Reasoning with Large Language Models,” arXiv preprint arXiv:2501.09686, 2025
-
[60]
Learning transferable visual models from natural language supervision,
A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark et al., “Learning transferable visual models from natural language supervision,” in International conference on machine learning. PMLR, 2021, pp. 8748–8763
-
[61]
Zero-shot text-to-image generation,
A. Ramesh, M. Pavlov, G. Goh, S. Gray, C. Voss, A. Radford, M. Chen, and I. Sutskever, “Zero-shot text-to-image generation,” in International conference on machine learning. PMLR, 2021, pp. 8821–8831
- [62]
-
[63]
Flamingo: a visual language model for few-shot learning,
J.-B. Alayrac, J. Donahue, P. Luc, A. Miech, I. Barr, Y. Hasson, K. Lenc, A. Mensch, K. Millican, M. Reynolds et al., “Flamingo: a visual language model for few-shot learning,” Advances in Neural Information Processing Systems, vol. 35, pp. 23716–23736, 2022
-
[64]
J. Li, D. Li, S. Savarese, and S. C. H. Hoi, “BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models,” in International Conference on Machine Learning, ICML 2023, 23-29 July 2023, Honolulu, Hawaii, USA, 2023, pp. 19730–19742
-
[65]
InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning,
W. Dai, J. Li, D. Li, A. M. H. Tiong, J. Zhao, W. Wang, B. Li, P. Fung, and S. C. H. Hoi, “InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning,” in Thirty-seventh Conference on Neural Information Processing Systems, 2023
-
[66]
FastMoE: A fast mixture-of-expert training system,
J. He, J. Qiu, A. Zeng, Z. Yang, J. Zhai, and J. Tang, “FastMoE: A fast mixture-of-expert training system,” arXiv preprint arXiv:2103.13262, 2021
-
[67]
Glam: Efficient scaling of language models with mixture-of-experts,
N. Du, Y. Huang, A. M. Dai, S. Tong, D. Lepikhin, Y. Xu, M. Krikun, Y. Zhou, A. W. Yu, O. Firat et al., “Glam: Efficient scaling of language models with mixture-of-experts,” in International conference on machine learning. PMLR, 2022, pp. 5547–5569
-
[68]
DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models,
D. Dai, C. Deng, C. Zhao, R. Xu, H. Gao, D. Chen, J. Li, W. Zeng, X. Yu, Y. Wu et al., “DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models,” in Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2024, pp. 1280–1297
-
[69]
Learning representations by back-propagating errors,
D. E. Rumelhart, G. E. Hinton, and R. J. Williams, “Learning representations by back-propagating errors,” Nature, vol. 323, no. 6088, pp. 533–536, 1986
-
[70]
Convolutional networks for images, speech, and time series,
Y. LeCun, Y. Bengio et al., “Convolutional networks for images, speech, and time series,” The handbook of brain theory and neural networks, vol. 3361, no. 10, p. 1995, 1995
-
[71]
S. Hochreiter, “Long Short-term Memory,” Neural Computation MIT-Press, 1997
-
[72]
A fast learning algorithm for deep belief nets,
G. E. Hinton, S. Osindero, and Y.-W. Teh, “A fast learning algorithm for deep belief nets,” Neural computation, vol. 18, no. 7, pp. 1527–1554, 2006
-
[73]
Reducing the dimensionality of data with neural networks,
G. E. Hinton and R. R. Salakhutdinov, “Reducing the dimensionality of data with neural networks,” Science, vol. 313, no. 5786, pp. 504–507, 2006
-
[74]
G. Hinton, L. Deng, D. Yu, G. E. Dahl, A.-r. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, T. N. Sainath et al., “Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups,” IEEE Signal processing magazine, vol. 29, no. 6, pp. 82–97, 2012
-
[75]
Imagenet classification with deep convolutional neural networks,
A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” Advances in neural information processing systems, vol. 25, 2012
-
[76]
Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation,
K. Cho, B. van Merrienboer, Ç. Gülçehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio, “Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation,” in Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, EMNLP 2014, October 25-29, 2014, Doha, Qatar, A meeting of SIGDAT, a Spe...
-
[77]
Sequence to Sequence Learning with Neural Networks
I. Sutskever, “Sequence to Sequence Learning with Neural Networks,” arXiv preprint arXiv:1409.3215, 2014
-
[78]
Dropout: a simple way to prevent neural networks from overfitting,
N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, “Dropout: a simple way to prevent neural networks from overfitting,” The journal of machine learning research, vol. 15, no. 1, pp. 1929–1958, 2014
-
[79]
Adam: A Method for Stochastic Optimization
D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014
-
[80]
Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,” Nature, vol. 521, no. 7553, pp. 436–444, 2015