arxiv: 2502.17419 · v6 · submitted 2025-02-24 · 💻 cs.AI

From System 1 to System 2: A Survey of Reasoning Large Language Models

Zhong-Zhi Li, Duzhen Zhang, Ming-Liang Zhang, Jiaxin Zhang, Zengyan Liu, Yuxuan Yao, Haotian Xu, Junhao Zheng, Pei-Jie Wang, Xiuyi Chen, Yingying Zhang, Fei Yin, Jiahua Dong, Zhiwei Li, Bao-Long Bi, Ling-Rui Mei, Junfeng Fang, Xiao Liang, Zhijiang Guo, Le Song, Cheng-Lin Liu

Pith reviewed 2026-05-13 01:32 UTC · model grok-4.3

keywords: reasoning LLMs · System 1 · System 2 · large language models · mathematical reasoning · code generation · cognitive abilities · benchmark evaluation

The pith

Reasoning large language models shift from fast intuitive decisions to deliberate step-by-step analysis.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper surveys how foundational large language models, strong at quick heuristic choices, combine with early System 2 techniques to produce models capable of logical, multi-step reasoning. It reviews construction approaches, the methods that enable this shift, and direct comparisons of model performance across reasoning benchmarks. The survey also identifies open challenges and directions that could extend these capabilities toward more robust human-like judgment in complex domains.

Core claim

The survey establishes that the integration of foundational large language models with System 2 technologies has produced models that perform expert-level analysis in mathematics and coding through explicit step-by-step logical processes rather than pure pattern matching.

What carries the argument

The core methods for constructing reasoning large language models that promote explicit step-by-step logical analysis over fast heuristic responses.

If this is right

These models deliver more accurate judgments and fewer biases on tasks that require extended analysis.
The field gains concrete techniques for scaling deliberate reasoning beyond initial domains like mathematics.
Future work can prioritize refinements that maintain performance while expanding to additional problem types.
The overview of benchmarks supplies a baseline for measuring further gains in reasoning depth.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If the surveyed methods generalize, hybrid systems could combine multiple reasoning models to tackle problems requiring both speed and depth.
The emphasis on benchmark comparisons suggests that progress will depend on creating harder tests that separate true reasoning from data familiarity.
Tracking rapid changes in this area may require living resources that update as new models and techniques appear.

Load-bearing premise

Strong results on existing math and coding benchmarks demonstrate genuine step-by-step logical reasoning rather than advanced pattern matching learned from training data.

What would settle it

A new benchmark consisting of problems outside known training distributions where the models show no consistent advantage in producing verifiable logical chains over standard large language models.

Original abstract

Achieving human-level intelligence requires refining the transition from the fast, intuitive System 1 to the slower, more deliberate System 2 reasoning. While System 1 excels in quick, heuristic decisions, System 2 relies on logical reasoning for more accurate judgments and reduced biases. Foundational Large Language Models (LLMs) excel at fast decision-making but lack the depth for complex reasoning, as they have not yet fully embraced the step-by-step analysis characteristic of true System 2 thinking. Recently, reasoning LLMs like OpenAI's o1/o3 and DeepSeek's R1 have demonstrated expert-level performance in fields such as mathematics and coding, closely mimicking the deliberate reasoning of System 2 and showcasing human-like cognitive abilities. This survey begins with a brief overview of the progress in foundational LLMs and the early development of System 2 technologies, exploring how their combination has paved the way for reasoning LLMs. Next, we discuss how to construct reasoning LLMs, analyzing their features, the core methods enabling advanced reasoning, and the evolution of various reasoning LLMs. Additionally, we provide an overview of reasoning benchmarks, offering an in-depth comparison of the performance of representative reasoning LLMs. Finally, we explore promising directions for advancing reasoning LLMs and maintain a real-time GitHub Repository (https://github.com/zzli2022/Awesome-Slow-Reason-System) to track the latest developments. We hope this survey will serve as a valuable resource to inspire innovation and drive progress in this rapidly evolving field.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

A practical survey that organizes reasoning LLM work and benchmarks but accepts the System 2 mimicry claim without examining the supporting evidence.

This survey compiles recent work on reasoning LLMs, framing models like o1/o3 and R1 as a shift toward deliberate System 2 thinking. It covers background on standard LLMs, methods for adding reasoning such as chain-of-thought and search, benchmark comparisons on math and coding tasks, and some forward directions, plus a GitHub repo for updates. The structure is straightforward and the benchmark tables give a single place to see how different models perform. That synthesis is the main practical value for anyone trying to track the area without reading every new paper. The citations are broad and the descriptions of existing techniques read as accurate.

The soft spot is the handling of the central claim. The abstract states that these models closely mimic System 2 reasoning on the basis of expert-level scores, yet the text does not discuss whether those scores reflect step-by-step deliberation or other factors such as training patterns and possible leakage. No ablations or harder tests are added to address the alternative, which leaves the "human-like cognitive abilities" claim resting on the same evidence already in the literature.

This is for readers who want a consolidated reference rather than new analysis or critique. The paper is coherent and shows honest engagement with the published work. I would bring it to a reading group to discuss the benchmark numbers and what they actually measure. I would cite it in the next year as a pointer to the literature. It deserves peer review because a timely, well-organized survey on an active topic can save others time even without original results.

Referee Report

1 major / 3 minor

Summary. The manuscript is a survey tracing the development of large language models from fast, heuristic System 1 processing to deliberate, step-by-step System 2 reasoning. It reviews foundational LLMs, early System 2 techniques, methods for building reasoning models (with emphasis on OpenAI o1/o3 and DeepSeek R1), core enabling approaches, an overview of reasoning benchmarks with performance comparisons, and future research directions, while linking to a live GitHub repository for ongoing updates.

Significance. As a timely synthesis of an active research area, the survey usefully organizes recent advances in reasoning LLMs and collates benchmark results across representative models. The maintained repository adds practical value for readers tracking the field. The interpretive framing that benchmark gains demonstrate 'human-like cognitive abilities' and close mimicry of System 2 is presented as a central motivation but rests on the accuracy of the cited literature rather than new analysis.

major comments (1)

[Abstract] The central claim that o1/o3 and R1 are 'closely mimicking the deliberate reasoning of System 2' is grounded solely in reported expert-level math and coding benchmark scores. The survey does not supply or cite ablations, contamination audits, or out-of-distribution tests that would distinguish genuine step-by-step deliberation from scaled pattern completion or leakage; this interpretive step therefore remains an unexamined assumption rather than a substantiated conclusion.
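To make the term concrete, a contamination audit of the kind mentioned above is often approximated by measuring n-gram overlap between benchmark items and a sample of training text. The sketch below is illustrative only: the function names `ngrams` and `overlap_ratio`, the whitespace tokenization, and the default window of 8 tokens are assumptions made here, not procedures taken from the survey or from any cited paper.

```python
def ngrams(text, n):
    """Return the set of lowercase word n-grams in text (whitespace tokenization)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_ratio(benchmark_item, corpus_docs, n=8):
    """Fraction of the item's n-grams that also occur somewhere in corpus_docs.

    Hypothetical helper: a high ratio hints that the item may have been seen
    during training; a low ratio does not prove that it was not.
    """
    item_grams = ngrams(benchmark_item, n)
    if not item_grams:
        # Item shorter than n tokens: nothing to compare.
        return 0.0
    corpus_grams = set()
    for doc in corpus_docs:
        corpus_grams |= ngrams(doc, n)
    return len(item_grams & corpus_grams) / len(item_grams)
```

An overlap score of this kind only flags near-verbatim leakage; paraphrased or translated benchmark items would pass it unnoticed, which is one reason out-of-distribution tests are requested alongside it.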

minor comments (3)

[Abstract / Introduction] The abstract and introduction would benefit from a brief explicit statement of the survey's scope (e.g., which models and benchmarks are covered through what cutoff date) to help readers assess completeness.
[Abstract] The GitHub repository is described as 'real-time'; adding a sentence noting the date of the most recent update or commit would increase transparency.
[Benchmark overview] In the benchmark comparison section, ensure that performance tables or figures include error bars or variance measures where available from the original papers, to avoid over-interpreting single-run scores.
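As a small illustration of the last minor comment, the sketch below reports a benchmark score as a mean with a sample standard deviation when several runs are available, and as a bare number otherwise. The helper names `summarize_runs` and `format_score` and the one-decimal formatting are assumptions made for this example; the survey does not prescribe a reporting format.

```python
from statistics import mean, stdev


def summarize_runs(scores):
    """Return (mean, sample standard deviation) for a list of per-run scores.

    The standard deviation is None when only one run is available, since a
    single score carries no variance information.
    """
    if not scores:
        raise ValueError("at least one score is required")
    spread = stdev(scores) if len(scores) > 1 else None
    return mean(scores), spread


def format_score(scores):
    """Format scores as 'mean +/- std', or just 'mean' for a single run."""
    avg, spread = summarize_runs(scores)
    if spread is None:
        return f"{avg:.1f}"
    return f"{avg:.1f} +/- {spread:.1f}"
```

Under these assumptions, `format_score([70.0, 72.0, 74.0])` gives `72.0 +/- 2.0`, while a single-run entry such as `format_score([80.0])` stays `80.0`, which makes single-run scores visibly distinct in a table.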

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the positive assessment of the survey's timeliness and the value of the accompanying repository. We address the single major comment below and have revised the manuscript to clarify the scope and grounding of our interpretive framing.

Referee: The central claim that o1/o3 and R1 are 'closely mimicking the deliberate reasoning of System 2' is grounded solely in reported expert-level math and coding benchmark scores. The survey does not supply or cite ablations, contamination audits, or out-of-distribution tests that would distinguish genuine step-by-step deliberation from scaled pattern completion or leakage; this interpretive step therefore remains an unexamined assumption rather than a substantiated conclusion.

Authors: We agree that a survey cannot itself supply new ablations, contamination audits, or OOD tests; those must come from primary research. The abstract phrasing summarizes claims made in the cited source papers (OpenAI o1/o3 technical reports and the DeepSeek-R1 paper), which present the models as performing step-by-step reasoning on the reported benchmarks. To avoid presenting this as our own substantiated conclusion, we have revised the abstract to read “as described in the original works, these models achieve expert-level performance on benchmarks that require multi-step reasoning” and added a new paragraph in the introduction that explicitly notes the active debate in the literature. We now cite recent critical analyses that examine potential data contamination, alternative explanations for benchmark gains, and the limits of current evaluation protocols. These additions preserve the survey’s role as a synthesis while making the evidential basis and interpretive caveats transparent.

revision: yes

Circularity Check

0 steps flagged

No circularity: survey without derivations, predictions, or self-referential reductions

The paper is a literature survey reviewing foundational LLMs, reasoning LLMs (o1/o3, R1), construction methods, benchmarks, and future directions. It contains no equations, quantitative derivations, fitted parameters, or predictions that could reduce to inputs by construction. All claims reference external published models and benchmarks rather than self-citations or internal definitions. The interpretive analogy between benchmark scores and System 2 reasoning is not a derivation step and does not meet any of the enumerated circularity patterns. The paper is self-contained as a review against external sources.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is a survey paper with no new technical claims, derivations, or postulates; it draws on standard concepts from cognitive science and AI literature without introducing free parameters, axioms, or invented entities.

pith-pipeline@v0.9.0 · 5658 in / 1134 out tokens · 53347 ms · 2026-05-13T01:32:04.418171+00:00 · methodology

Forward citations

Cited by 36 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Extracting Search Trees from LLM Reasoning Traces Reveals Myopic Planning
cs.AI 2026-05 conditional novelty 8.0

LLMs exhibit myopic planning in four-in-a-row: move choices are best explained by shallow nodes in reasoning traces, not the deep lookahead they generate, unlike humans where depth drives performance.
Unsupervised Process Reward Models
cs.LG 2026-05 unverdicted novelty 7.0

Unsupervised PRMs derived from LLM probabilities achieve up to 15% better error detection than LLM judges and match supervised PRMs in verification and RL tasks.
Extracting Search Trees from LLM Reasoning Traces Reveals Myopic Planning
cs.AI 2026-05 unverdicted novelty 7.0

LLMs exhibit myopic planning in games, with move choices driven by shallow nodes despite deep reasoning traces, in contrast to human deep-search reliance.
Extracting Search Trees from LLM Reasoning Traces Reveals Myopic Planning
cs.AI 2026-05 unverdicted novelty 7.0

LLM move selection in four-in-a-row is best explained by myopic models that ignore deep nodes in their own reasoning traces, while performance correlates with search breadth rather than depth.
Extracting Search Trees from LLM Reasoning Traces Reveals Myopic Planning
cs.AI 2026-05 unverdicted novelty 7.0

LLMs display myopic planning in games: move selection is driven by shallow nodes in reasoning traces despite generating deep lookahead, with performance tied to search breadth rather than depth.
Post Reasoning: Improving the Performance of Non-Thinking Models at No Cost
cs.AI 2026-05 conditional novelty 7.0

Post-Reasoning boosts LLM accuracy by reversing the usual answer-after-reasoning order, delivering mean relative gains of 17.37% across 117 model-benchmark pairs with zero extra cost.
ReflectMT: Internalizing Reflection for Efficient and High-Quality Machine Translation
cs.CL 2026-04 unverdicted novelty 7.0

ReflectMT internalizes reflection via two-stage RL to enable direct high-quality machine translation that outperforms explicit reasoning models like DeepSeek-R1 on WMT24 while using 94% fewer tokens.
Improving Reasoning Capabilities in Small Models through Mixture-of-Layers Distillation with Stepwise Attention on Key Information
cs.CL 2026-04 unverdicted novelty 7.0

A CoT distillation framework transfers stepwise teacher attention on key information via a Mixture-of-Layers module to improve reasoning in small language models.
AdverMCTS: Combating Pseudo-Correctness in Code Generation via Adversarial Monte Carlo Tree Search
cs.SE 2026-04 unverdicted novelty 7.0

AdverMCTS frames code generation as a minimax game where an attacker evolves tests to expose flaws in solver-generated code, yielding more robust outputs than static-test baselines.
Video-R1: Reinforcing Video Reasoning in MLLMs
cs.CV 2025-03 conditional novelty 7.0

Video-R1 uses temporal-aware RL and mixed datasets to boost video reasoning in MLLMs, with a 7B model reaching 37.1% on VSI-Bench and surpassing GPT-4o.
Model-Adaptive Tool Necessity Reveals the Knowing-Doing Gap in LLM Tool Use
cs.AI 2026-05 unverdicted novelty 6.0

LLMs show a knowing-doing gap in tool use: they often recognize when tools are needed via internal states but fail to translate that into actual tool calls, with mismatches of 26-54% on arithmetic and factual tasks.
Explaining and Breaking the Safety-Helpfulness Ceiling via Preference Dimensional Expansion
cs.AI 2026-05 unverdicted novelty 6.0

MORA breaks the safety-helpfulness trade-off in LLM alignment by pre-sampling single-reward prompts and rewriting them to expand multi-dimensional reward diversity, yielding 5-12.4% single-preference gains in sequenti...
Explaining and Breaking the Safety-Helpfulness Ceiling via Preference Dimensional Expansion
cs.AI 2026-05 unverdicted novelty 6.0

MORA breaks the safety-helpfulness ceiling in LLMs by pre-sampling single-reward prompts and rewriting them to incorporate multi-dimensional intents, delivering 5-12.4% gains in sequential alignment and 4.6% overall i...
Seirênes: Adversarial Self-Play with Evolving Distractions for LLM Reasoning
cs.AI 2026-05 unverdicted novelty 6.0

Seirênes trains LLMs via adversarial self-play to generate and overcome evolving distractions, producing gains of 7-10 points on math reasoning benchmarks and exposing blind spots in larger models.
CLR-voyance: Reinforcing Open-Ended Reasoning for Inpatient Clinical Decision Support with Outcome-Aware Rubrics
cs.CL 2026-05 unverdicted novelty 6.0

CLR-voyance reformulates inpatient reasoning as POMDP with clinician-validated outcome rubrics, yielding an 8B model that outperforms larger frontier models on the authors' new benchmark.
Confidence-Aware Alignment Makes Reasoning LLMs More Reliable
cs.AI 2026-05 unverdicted novelty 6.0

CASPO trains LLMs via iterative direct preference optimization so that token-level confidence tracks step-wise correctness, then applies Confidence-aware Thought pruning at inference to improve both reliability and sp...
HypEHR: Hyperbolic Modeling of Electronic Health Records for Efficient Question Answering
cs.AI 2026-04 unverdicted novelty 6.0

HypEHR is a hyperbolic embedding model for EHR data that uses Lorentzian geometry and hierarchy-aware pretraining to answer clinical questions nearly as well as large language models but with much smaller size.
OMIBench: Benchmarking Olympiad-Level Multi-Image Reasoning in Large Vision-Language Model
cs.CV 2026-04 unverdicted novelty 6.0

OMIBench benchmark reveals that current LVLMs achieve at most 50% on Olympiad problems requiring reasoning across multiple images.
Contrastive Attribution in the Wild: An Interpretability Analysis of LLM Failures on Realistic Benchmarks
cs.AI 2026-04 conditional novelty 6.0

Token-level contrastive attribution yields informative signals for some LLM benchmark failures but is not universally applicable across datasets and models.
LACE: Lattice Attention for Cross-thread Exploration
cs.AI 2026-04 unverdicted novelty 6.0

LACE enables parallel reasoning paths in LLMs to communicate via lattice attention and error-correct using synthetic training data, improving accuracy by over 7 points over standard parallel search.
The role of System 1 and System 2 semantic memory structure in human and LLM biases
cs.CL 2026-04 unverdicted novelty 6.0

Human semantic memory networks for System 1 and System 2 are structurally distinct and consistently relate to implicit gender bias levels, but LLM networks do not exhibit these properties.
Leveraging Mathematical Reasoning of LLMs for Efficient GPU Thread Mapping
cs.DC 2026-04 unverdicted novelty 6.0

Large language models derive exact analytical GPU thread mappings for complex 2D/3D domains and fractals via in-context learning, outperforming symbolic regression and enabling up to thousands-fold speedups and energy...
TimelineReasoner: Advancing Timeline Summarization with Large Reasoning Models
cs.CL 2026-04 unverdicted novelty 6.0

TimelineReasoner applies large reasoning models in a Global Cognition plus Detail Exploration loop to produce more accurate, complete, and coherent timelines from news than prior LLM-based methods.
KG-Hopper: Empowering Compact Open LLMs with Knowledge Graph Reasoning via Reinforcement Learning
cs.CL 2026-03 unverdicted novelty 6.0

KG-Hopper uses RL to embed full multi-hop KG traversal and backtracking into a single LLM inference round, enabling a 7B model to outperform larger multi-step systems and compete with GPT-3.5/GPT-4o-mini on eight benchmarks.
Beyond Individual Intelligence: Surveying Collaboration, Failure Attribution, and Self-Evolution in LLM-based Multi-Agent Systems
cs.AI 2026-05 conditional novelty 5.0

The survey proposes the LIFE framework to unify fragmented research on collaboration, failure attribution, and self-evolution in LLM multi-agent systems into a progression toward self-organizing intelligence.
Do LLMs have core beliefs?
cs.LG 2026-05 unverdicted novelty 5.0

LLMs generally fail to maintain stable worldviews under adversarial conversational pressure, indicating they lack core beliefs akin to those in human cognition.
UpstreamQA: A Modular Framework for Explicit Reasoning on Video Question Answering Tasks
cs.CV 2026-04 unverdicted novelty 5.0

UpstreamQA disentangles video reasoning by using LRMs for explicit upstream object identification and scene context before downstream LMM VideoQA, improving performance and interpretability on OpenEQA and NExTQA in so...
The Cognitive Penalty: Ablating System 1 and System 2 Reasoning in Edge-Native SLMs for Decentralized Consensus
cs.AI 2026-04 unverdicted novelty 5.0

System 1 intuition in edge SLMs delivers 100% adversarial robustness and low latency for DAO consensus while System 2 reasoning causes 26.7% cognitive collapse and 17x slowdown.
Towards Robust Endogenous Reasoning: Unifying Drift Adaptation in Non-Stationary Tuning
cs.LG 2026-04 unverdicted novelty 5.0

CPO++ adapts reinforcement fine-tuning of MLLMs to endogenous multi-modal concept drift through counterfactual reasoning and preference optimization, yielding better coherence and cross-domain robustness in safety-cri...
LACE: Lattice Attention for Cross-thread Exploration
cs.AI 2026-04 unverdicted novelty 5.0

LACE adds lattice attention to let parallel LLM reasoning threads interact and correct errors, raising accuracy over 7 points versus standard independent sampling.
LACE: Lattice Attention for Cross-thread Exploration
cs.AI 2026-04 unverdicted novelty 5.0

LACE enables concurrent reasoning paths in LLMs to interact via lattice attention and a synthetic training pipeline, raising accuracy more than 7 points over independent parallel search.
H-Probes: Extracting Hierarchical Structures From Latent Representations of Language Models
cs.CL 2026-04 unverdicted novelty 5.0

H-probes locate low-dimensional subspaces encoding hierarchy in LLM activations for synthetic tree tasks, show causal importance and generalization, and detect weaker signals in mathematical reasoning traces.
KG-Reasoner: A Reinforced Model for End-to-End Multi-Hop Knowledge Graph Reasoning
cs.CL 2026-04 unverdicted novelty 5.0

KG-Reasoner uses reinforcement learning to train LLMs for end-to-end multi-hop knowledge graph reasoning, achieving competitive or better results on eight benchmarks.
Stop Overthinking: A Survey on Efficient Reasoning for Large Language Models
cs.CL 2025-03 accept novelty 5.0

A survey organizing techniques to achieve efficient reasoning in LLMs by shortening chain-of-thought outputs.
A Brief Overview: Agentic Reinforcement Learning In Large Language Models
cs.AI 2026-04 unverdicted novelty 2.0

The paper surveys the conceptual foundations, methodological innovations, challenges, and future directions of agentic reinforcement learning frameworks that embed cognitive capabilities like meta-reasoning and self-r...
A Brief Overview: Agentic Reinforcement Learning In Large Language Models
cs.AI 2026-04 unverdicted novelty 2.0

This review synthesizes conceptual foundations, methods, challenges, and future directions for agentic reinforcement learning in large language models.

Reference graph

Works this paper leans on

295 extracted references · 295 canonical work pages · cited by 29 Pith papers · 20 internal anchors

[1]

System 1+ system 2= better world: Neural-symbolic chain of logic reasoning,

W. Hua and Y. Zhang, “System 1+ system 2= better world: Neural-symbolic chain of logic reasoning,” in Findings of the Association for Computational Linguistics: EMNLP 2022, 2022, pp. 601–612

work page 2022
[2]

Chain-of-thought prompting elicits reasoning in large language models,

J. Wei, X. Wang, D. Schuurmans, M. Bosma, F. Xia, E. Chi, Q. V. Le, D. Zhou et al., “Chain-of-thought prompting elicits reasoning in large language models,” Advances in neural information processing systems, vol. 35, pp. 24824–24837, 2022

work page 2022
[3]

Self-Consistency Improves Chain of Thought Reasoning in Language Models,

X. Wang, J. Wei, D. Schuurmans, Q. V. Le, E. H. Chi, S. Narang, A. Chowdhery, and D. Zhou, “Self-Consistency Improves Chain of Thought Reasoning in Language Models,” in The Eleventh International Conference on Learning Representations, 2023

work page 2023
[4]

Least-to-Most Prompting Enables Complex Reasoning in Large Language Models,

D. Zhou, N. Schärli, L. Hou, J. Wei, N. Scales, X. Wang, D. Schuurmans, C. Cui, O. Bousquet, Q. V. Le et al., “Least-to-Most Prompting Enables Complex Reasoning in Large Language Models,” in The Eleventh International Conference on Learning Representations, 2023

work page 2023
[5]

STaR: Self-taught reasoner bootstrapping reasoning with reasoning,

E. Zelikman, Y. Wu, J. Mu, and N. D. Goodman, “STaR: Self-taught reasoner bootstrapping reasoning with reasoning,” in Proc. the 36th International Conference on Neural Information Processing Systems, vol. 1126, 2024

work page 2024
[6]

Heuristic and analytic processes in reasoning,

J. S. B. Evans, “Heuristic and analytic processes in reasoning,” British Journal of Psychology, vol. 75, no. 4, pp. 451–468, 1984

work page 1984
[7]

Maps of bounded rationality: Psychology for behavioral economics,

D. Kahneman, “Maps of bounded rationality: Psychology for behavioral economics,” American economic review, vol. 93, no. 5, pp. 1449–1475, 2003

work page 2003
[8]

Towards Reasoning in Large Language Models: A Survey,

J. Huang and K. C.-C. Chang, “Towards Reasoning in Large Language Models: A Survey,” in Findings of the Association for Computational Linguistics: ACL 2023, 2023, pp. 1049–1065

work page 2023
[9]

Reasoning with Language Model Prompting: A Survey,

S. Qiao, Y. Ou, N. Zhang, X. Chen, Y. Yao, S. Deng, C. Tan, F. Huang, and H. Chen, “Reasoning with Language Model Prompting: A Survey,” in Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2023, pp. 5368–5393

work page 2023
[10]

Towards Understanding Chain-of-Thought Prompting: An Empirical Study of What Matters,

B. Wang, S. Min, X. Deng, J. Shen, Y. Wu, L. Zettlemoyer, and H. Sun, “Towards Understanding Chain-of-Thought Prompting: An Empirical Study of What Matters,” in Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2023, pp. 2717–2739

work page 2023
[11]

On Second Thought, Let’s Not Think Step by Step! Bias and Toxicity in Zero-Shot Reasoning,

O. Shaikh, H. Zhang, W. Held, M. Bernstein, and D. Yang, “On Second Thought, Let’s Not Think Step by Step! Bias and Toxicity in Zero-Shot Reasoning,” in Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2023, pp. 4454–4470

work page 2023
[12]

Visual cot: Advancing multi-modal language models with a comprehensive dataset and benchmark for chain-of-thought reasoning,

H. Shao, S. Qian, H. Xiao, G. Song, Z. Zong, L. Wang, Y. Liu, and H. Li, “Visual cot: Advancing multi-modal language models with a comprehensive dataset and benchmark for chain-of-thought reasoning,” in The Thirty-eight Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2024

work page 2024
[13]

Automatic Chain of Thought Prompting in Large Language Models,

Z. Zhang, A. Zhang, M. Li, and A. Smola, “Automatic Chain of Thought Prompting in Large Language Models,” in The Eleventh International Conference on Learning Representations, 2023

work page 2023
[14]

Reasoning with Language Model is Planning with World Model,

S. Hao, Y. Gu, H. Ma, J. Hong, Z. Wang, D. Wang, and Z. Hu, “Reasoning with Language Model is Planning with World Model,” in Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, 2023, pp. 8154–8173

work page 2023
[15]

Meta prompting for agi systems,

Y. Zhang, “Meta prompting for agi systems,” arXiv preprint arXiv:2311.11482, 2023

work page arXiv 2023
[16]

Hello GPT-4o,

OpenAI, “Hello GPT-4o,” May 2024. [Online]. Available: https://openai.com/index/hello-gpt-4o/

work page 2024
[17]

DeepSeek-V3 Technical Report

A. Liu, B. Feng, B. Xue, B. Wang, B. Wu, C. Lu, C. Zhao, C. Deng, C. Zhang, C. Ruan et al., “Deepseek-v3 technical report,” arXiv preprint arXiv:2412.19437, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[18]

Attention is all you need,

A. Vaswani, “Attention is all you need,” Advances in Neural Information Processing Systems, 2017

work page 2017
[19]

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding,

J. Devlin, M. Chang, K. Lee, and K. Toutanova, “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding,” in Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Minneapolis, MN, USA, June 2-7, 2019, Volume 1 (Long and S...

work page 2019
[20]

RoBERTa: A Robustly Optimized BERT Pretraining Approach

Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, and V. Stoyanov, “RoBERTa: A Robustly Optimized BERT Pretraining Approach,” CoRR, vol. abs/1907.11692, 2019.

work page internal anchor Pith review Pith/arXiv arXiv 2019
[21]

Improving language understanding by generative pre-training,

A. Radford, “Improving language understanding by generative pre-training,” 2018

work page 2018
[22]

Language models are unsupervised multitask learners,

A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, I. Sutskever et al., “Language models are unsupervised multitask learners,” OpenAI blog, vol. 1, no. 8, p. 9, 2019

work page 2019
[23]

Language models are few-shot learners,

T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell et al., “Language models are few-shot learners,” Advances in neural information processing systems, vol. 33, pp. 1877–1901, 2020

work page 2020
[24]

Training language models to follow instructions with human feedback,

L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. Wainwright, P. Mishkin, C. Zhang, S. Agarwal, K. Slama, A. Ray et al., “Training language models to follow instructions with human feedback,” Advances in neural information processing systems, vol. 35, pp. 27730–27744, 2022

work page 2022
[25]

LLaMA: Open and Efficient Foundation Language Models

H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux, T. Lacroix, B. Rozière, N. Goyal, E. Hambro, F. Azhar et al., “Llama: Open and efficient foundation language models,” arXiv preprint arXiv:2302.13971, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[26]

A Survey of Large Language Models

W. X. Zhao, K. Zhou, J. Li, T. Tang, X. Wang, Y. Hou, Y. Min, B. Zhang, J. Zhang, Z. Dong et al., “A survey of large language models,” arXiv preprint arXiv:2303.18223, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[27]

Visual Instruction Tuning,

H. Liu, C. Li, Q. Wu, and Y. J. Lee, “Visual Instruction Tuning,” in Thirty-seventh Conference on Neural Information Processing Systems, 2023

work page 2023
[28]

MM-LLMs: Recent Advances in MultiModal Large Language Models,

D. Zhang, Y. Yu, J. Dong, C. Li, D. Su, C. Chu, and D. Yu, “MM-LLMs: Recent Advances in MultiModal Large Language Models,” in Findings of the Association for Computational Linguistics, ACL 2024, Bangkok, Thailand and virtual meeting, August 11-16, 2024. Association for Computational Linguistics, 2024, pp. 12401–12430

work page 2024
[29]

Learning to reason with LLMs,

OpenAI, “Learning to reason with LLMs,” September 2024. [Online]. Available: https://openai.com/index/learning-to-reason-with-llms/

work page 2024
[30]

OpenAI o3-mini,

——, “OpenAI o3-mini,” January 2025. [Online]. Available: https://openai.com/index/openai-o3-mini/

work page 2025
[31]

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

D. Guo, D. Yang, H. Zhang, J. Song, R. Zhang, R. Xu, Q. Zhu, S. Ma, P. Wang, X. Bi et al., “DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning,” arXiv preprint arXiv:2501.12948, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[32]

Training Verifiers to Solve Math Word Problems

K. Cobbe, V. Kosaraju, M. Bavarian, M. Chen, H. Jun, L. Kaiser, M. Plappert, J. Tworek, J. Hilton, R. Nakano et al., “Training verifiers to solve math word problems,” arXiv preprint arXiv:2110.14168, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021
[33]

Large language models are zero-shot reasoners,

T. Kojima, S. S. Gu, M. Reid, Y. Matsuo, and Y. Iwasawa, “Large language models are zero-shot reasoners,” Advances in neural information processing systems, vol. 35, pp. 22 199–22 213, 2022

work page 2022
[34]

Improving large language model fine-tuning for solving math problems,

Y. Liu, A. Singh, C. D. Freeman, J. D. Co-Reyes, and P. J. Liu, “Improving large language model fine-tuning for solving math problems,” arXiv preprint arXiv:2310.10047, 2023

work page arXiv 2023
[35]

Solving Math Word Problems via Cooperative Reasoning induced Language Models,

X. Zhu, J. Wang, L. Zhang, Y. Zhang, Y. Huang, R. Gan, J. Zhang, and Y. Yang, “Solving Math Word Problems via Cooperative Reasoning induced Language Models,” in Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2023, pp. 4471–4485

work page 2023
[36]

Dynamic Prompt Learning via Policy Gradient for Semi-structured Mathematical Reasoning,

P. Lu, L. Qiu, K.-W. Chang, Y. N. Wu, S.-C. Zhu, T. Rajpurohit, P. Clark, and A. Kalyan, “Dynamic Prompt Learning via Policy Gradient for Semi-structured Mathematical Reasoning,” in The Eleventh International Conference on Learning Representations, 2023

work page 2023
[37]

Let’s Verify Step by Step,

H. Lightman, V. Kosaraju, Y. Burda, H. Edwards, B. Baker, T. Lee, J. Leike, J. Schulman, I. Sutskever, and K. Cobbe, “Let’s Verify Step by Step,” in The Twelfth International Conference on Learning Representations, 2024

work page 2024
[38]

Thinking like an expert: Multimodal hypergraph-of-thought (hot) reasoning to boost foundation modals,

F. Yao, C. Tian, J. Liu, Z. Zhang, Q. Liu, L. Jin, S. Li, X. Li, and X. Sun, “Thinking like an expert: Multimodal hypergraph-of-thought (hot) reasoning to boost foundation modals,” arXiv preprint arXiv:2308.06207, 2023

work page arXiv 2023
[39]

Beyond Chain-of-Thought, Effective Graph-of-Thought Reasoning in Language Models,

Y. Yao, Z. Li, and H. Zhao, “Beyond Chain-of-Thought, Effective Graph-of-Thought Reasoning in Language Models,” arXiv preprint arXiv:2305.16582, 2023

work page arXiv 2023
[40]

Mindmap: Knowledge graph prompting sparks graph of thoughts in large language models,

Y. Wen, Z. Wang, and J. Sun, “Mindmap: Knowledge graph prompting sparks graph of thoughts in large language models,” arXiv preprint arXiv:2308.09729, 2023

work page arXiv 2023
[41]

Boosting logical reasoning in large language models through a new framework: The graph of thought,

B. Lei, C. Liao, C. Ding et al., “Boosting logical reasoning in large language models through a new framework: The graph of thought,” arXiv preprint arXiv:2308.08614, 2023

work page arXiv 2023
[42]

The impact of reasoning step length on large language models

M. Jin, Q. Yu, D. Shu, H. Zhao, W. Hua, Y. Meng, Y. Zhang, and M. Du, “The impact of reasoning step length on large language models,” arXiv preprint arXiv:2401.04925, 2024

work page arXiv 2024
[43]

Graph of thoughts: Solving elaborate problems with large language models,

M. Besta, N. Blach, A. Kubicek, R. Gerstenberger, M. Podstawski, L. Gianinazzi, J. Gajda, T. Lehmann, H. Niewiadomski, P. Nyczyk et al., “Graph of thoughts: Solving elaborate problems with large language models,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 38, no. 16, 2024, pp. 17 682–17 690

work page 2024
[44]

Self-playing Adversarial Language Game Enhances LLM Reasoning,

P. Cheng, T. Hu, H. Xu, Z. Zhang, Y. Dai, L. Han, and N. Du, “Self-playing Adversarial Language Game Enhances LLM Reasoning,” arXiv preprint arXiv:2404.10642, 2024

work page arXiv 2024
[45]

IdealGPT: Iteratively Decomposing Vision and Language Reasoning via Large Language Models,

H. You, R. Sun, Z. Wang, L. Chen, G. Wang, H. Ayyubi, K.-W. Chang, and S.-F. Chang, “IdealGPT: Iteratively Decomposing Vision and Language Reasoning via Large Language Models,” in Findings of the Association for Computational Linguistics: EMNLP 2023, 2023, pp. 11 289–11 303

work page 2023
[46]

V*: Guided Visual Search as a Core Mechanism in Multimodal LLMs,

P. Wu and S. Xie, “V*: Guided Visual Search as a Core Mechanism in Multimodal LLMs,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 13 084–13 094

work page 2024
[47]

GENOME: Generative Neuro-Symbolic Visual Reasoning by Growing and Reusing Modules,

Z. Chen, R. Sun, W. Liu, Y. Hong, and C. Gan, “GENOME: Generative Neuro-Symbolic Visual Reasoning by Growing and Reusing Modules,” in International Conference on Learning Representations, 2024

work page 2024
[48]

A comparative study on reasoning patterns of openai’s o1 model

S. Wu, Z. Peng, X. Du, T. Zheng, M. Liu, J. Wu, J. Ma, Y. Li, J. Yang, W. Zhou et al., “A Comparative Study on Reasoning Patterns of OpenAI’s o1 Model,” arXiv preprint arXiv:2410.13639, 2024

work page arXiv 2024
[49]

Towards system 2 reasoning in llms: Learning how to think with meta chain-of-thought,

V. Xiang, C. Snell, K. Gandhi, A. Albalak, A. Singh, C. Blagden, D. Phung, R. Rafailov, N. Lile, D. Mahan et al., “Towards System 2 Reasoning in LLMs: Learning How to Think With Meta Chain-of-Thought,” arXiv preprint arXiv:2501.04682, 2025

work page arXiv 2025
[50]

O1 Replication Journey: A Strategic Progress Report–Part 1,

Y. Qin, X. Li, H. Zou, Y. Liu, S. Xia, Z. Huang, Y. Ye, W. Yuan, H. Liu, Y. Li et al., “O1 Replication Journey: A Strategic Progress Report–Part 1,” arXiv preprint arXiv:2410.18982, 2024

work page arXiv 2024
[51]

O1 Replication Journey–Part 2: Surpassing O1-preview through Simple Distillation, Big Progress or Bitter Lesson?

Z. Huang, H. Zou, X. Li, Y. Liu, Y. Zheng, E. Chern, S. Xia, Y. Qin, W. Yuan, and P. Liu, “O1 Replication Journey–Part 2: Surpassing O1-preview through Simple Distillation, Big Progress or Bitter Lesson?” arXiv preprint arXiv:2411.16489, 2024

work page arXiv 2024
[52]

O1 Replication Journey–Part 3: Inference-time Scaling for Medical Reasoning,

Z. Huang, G. Geng, S. Hua, Z. Huang, H. Zou, S. Zhang, P. Liu, and X. Zhang, “O1 Replication Journey–Part 3: Inference-time Scaling for Medical Reasoning,” arXiv preprint arXiv:2501.06458, 2025

work page arXiv 2025
[53]

Imitate, explore, and self-improve: A reproduction report on slow-thinking reasoning systems,

Y. Min, Z. Chen, J. Jiang, J. Chen, J. Deng, Y. Hu, Y. Tang, J. Wang, X. Cheng, H. Song, W. X. Zhao, Z. Liu, Z. Wang, and J.-R. Wen, “Imitate, Explore, and Self-Improve: A Reproduction Report on Slow-thinking Reasoning Systems,” arXiv preprint arXiv:2412.09413, 2024

work page arXiv 2024
[54]

RedStar: Does Scaling Long-CoT Data Unlock Better Slow-Reasoning Systems?

H. Xu, X. Wu, W. Wang, Z. Li, D. Zheng, B. Chen, Y. Hu, S. Kang, J. Ji, Y. Zhang et al., “RedStar: Does Scaling Long-CoT Data Unlock Better Slow-Reasoning Systems?” arXiv preprint arXiv:2501.11284, 2025

work page arXiv 2025
[55]

Scaling of search and learning: A roadmap to reproduce o1 from reinforcement learning perspective

Z. Zeng, Q. Cheng, Z. Yin, B. Wang, S. Li, Y. Zhou, Q. Guo, X. Huang, and X. Qiu, “Scaling of Search and Learning: A Roadmap to Reproduce o1 from Reinforcement Learning Perspective,” arXiv preprint arXiv:2412.14135, 2024

work page arXiv 2024
[56]

Test-time Computing: from System-1 Thinking to System-2 Thinking,

Y. Ji, J. Li, H. Ye, K. Wu, J. Xu, L. Mo, and M. Zhang, “Test-time Computing: from System-1 Thinking to System-2 Thinking,” arXiv preprint arXiv:2501.02497, 2025

work page arXiv 2025
[57]

Reasoning Language Models: A Blueprint,

M. Besta, J. Barth, E. Schreiber, A. Kubicek, A. Catarino, R. Gerstenberger, P. Nyczyk, P. Iff, Y. Li, S. Houliston et al., “Reasoning Language Models: A Blueprint,” arXiv preprint arXiv:2501.11223, 2025

work page arXiv 2025
[58]

Llm as a mastermind: A survey of strategic reasoning with large language models,

Y. Zhang, S. Mao, T. Ge, X. Wang, A. de Wynter, Y. Xia, W. Wu, T. Song, M. Lan, and F. Wei, “LLM as a Mastermind: A Survey of Strategic Reasoning with Large Language Models,” arXiv preprint arXiv:2404.01230, 2024

work page arXiv 2024
[59]

Towards large reasoning models: A survey of reinforced reasoning with large language models

F. Xu, Q. Hao, Z. Zong, J. Wang, Y. Zhang, J. Wang, X. Lan, J. Gong, T. Ouyang, F. Meng et al., “Towards Large Reasoning Models: A Survey of Reinforced Reasoning with Large Language Models,” arXiv preprint arXiv:2501.09686, 2025

work page arXiv 2025
[60]

Learning transferable visual models from natural language supervision,

A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark et al., “Learning transferable visual models from natural language supervision,” in International conference on machine learning. PMLR, 2021, pp. 8748–8763

work page 2021
[61]

Zero-shot text-to-image generation,

A. Ramesh, M. Pavlov, G. Goh, S. Gray, C. Voss, A. Radford, M. Chen, and I. Sutskever, “Zero-shot text-to-image generation,” in International conference on machine learning. PMLR, 2021, pp. 8821–8831

work page 2021
[62]

GPT-4 Technical Report,

OpenAI, “GPT-4 Technical Report,” 2023

work page 2023
[63]

Flamingo: a visual language model for few-shot learning,

J.-B. Alayrac, J. Donahue, P. Luc, A. Miech, I. Barr, Y. Hasson, K. Lenc, A. Mensch, K. Millican, M. Reynolds et al., “Flamingo: a visual language model for few-shot learning,” Advances in Neural Information Processing Systems, vol. 35, pp. 23 716–23 736, 2022

work page 2022
[64]

BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models,

J. Li, D. Li, S. Savarese, and S. C. H. Hoi, “BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models,” in International Conference on Machine Learning, ICML 2023, 23-29 July 2023, Honolulu, Hawaii, USA, 2023, pp. 19 730–19 742

work page 2023
[65]

InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning,

W. Dai, J. Li, D. Li, A. M. H. Tiong, J. Zhao, W. Wang, B. Li, P. Fung, and S. C. H. Hoi, “InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning,” in Thirty-seventh Conference on Neural Information Processing Systems, 2023

work page 2023
[66]

FastMoE: A fast mixture-of-expert training system,

J. He, J. Qiu, A. Zeng, Z. Yang, J. Zhai, and J. Tang, “FastMoE: A fast mixture-of-expert training system,” arXiv preprint arXiv:2103.13262, 2021

work page arXiv 2021
[67]

Glam: Efficient scaling of language models with mixture-of-experts,

N. Du, Y. Huang, A. M. Dai, S. Tong, D. Lepikhin, Y. Xu, M. Krikun, Y. Zhou, A. W. Yu, O. Firat et al., “Glam: Efficient scaling of language models with mixture-of-experts,” in International conference on machine learning. PMLR, 2022, pp. 5547–5569

work page 2022
[68]

DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models,

D. Dai, C. Deng, C. Zhao, R. Xu, H. Gao, D. Chen, J. Li, W. Zeng, X. Yu, Y. Wu et al., “DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models,” in Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2024, pp. 1280–1297

work page 2024
[69]

Learning representations by back-propagating errors,

D. E. Rumelhart, G. E. Hinton, and R. J. Williams, “Learning representations by back-propagating errors,” nature, vol. 323, no. 6088, pp. 533–536, 1986

work page 1986
[70]

Convolutional networks for images, speech, and time series,

Y. LeCun, Y. Bengio et al., “Convolutional networks for images, speech, and time series,” The handbook of brain theory and neural networks, vol. 3361, no. 10, p. 1995, 1995

work page 1995
[71]

Long Short-term Memory,

S. Hochreiter, “Long Short-term Memory,” Neural Computation MIT-Press, 1997

work page 1997
[72]

A fast learning algorithm for deep belief nets,

G. E. Hinton, S. Osindero, and Y.-W. Teh, “A fast learning algorithm for deep belief nets,” Neural computation, vol. 18, no. 7, pp. 1527–1554, 2006

work page 2006
[73]

Reducing the dimensionality of data with neural networks,

G. E. Hinton and R. R. Salakhutdinov, “Reducing the dimensionality of data with neural networks,” science, vol. 313, no. 5786, pp. 504–507, 2006

work page 2006
[74]

Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups,

G. Hinton, L. Deng, D. Yu, G. E. Dahl, A.-r. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, T. N. Sainath et al., “Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups,” IEEE Signal processing magazine, vol. 29, no. 6, pp. 82–97, 2012

work page 2012
[75]

Imagenet classification with deep convolutional neural networks,

A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” Advances in neural information processing systems, vol. 25, 2012

work page 2012
[76]

Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation,

K. Cho, B. van Merrienboer, Ç. Gülçehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio, “Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation,” in Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, EMNLP 2014, October 25-29, 2014, Doha, Qatar, A meeting of SIGDAT, a Spe...

work page 2014
[77]

Sequence to Sequence Learning with Neural Networks

I. Sutskever, “Sequence to Sequence Learning with Neural Networks,” arXiv preprint arXiv:1409.3215, 2014

work page Pith review arXiv 2014
[78]

Dropout: a simple way to prevent neural networks from overfitting,

N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, “Dropout: a simple way to prevent neural networks from overfitting,” The journal of machine learning research, vol. 15, no. 1, pp. 1929–1958, 2014

work page 2014
[79]

Adam: A Method for Stochastic Optimization

D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014

work page internal anchor Pith review Pith/arXiv arXiv 2014
[80]

Deep learning,

Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,” nature, vol. 521, no. 7553, pp. 436–444, 2015

work page 2015

Showing first 80 references.