The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity
Recognition: 2 theorem links · Lean theorem
Pith reviewed 2026-05-15 16:04 UTC · model grok-4.3
The pith
Large Reasoning Models exhibit complete accuracy collapse beyond certain complexities and reduce reasoning effort despite available compute.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Using controllable puzzle environments, LRMs face a complete accuracy collapse beyond certain complexities. They exhibit a counterintuitive scaling limit where their reasoning effort increases with problem complexity up to a point, then declines despite having remaining token budget. Compared to standard LLMs under the same inference compute, three performance regimes emerge: standard models outperform at low complexity, LRMs at medium complexity, and both collapse at high complexity. LRMs fail to use explicit algorithms and reason inconsistently.
What carries the argument
Controllable puzzle environments that enable precise manipulation of complexity while keeping logical structures consistent, allowing detailed analysis of both final answers and internal reasoning traces.
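To make this concrete, here is a minimal sketch of the kind of verifier such an environment needs, written for the Tower of Hanoi puzzles the paper uses. The move encoding [disk, from_peg, to_peg] and the initial state [[n, ..., 2, 1], [], []] follow the paper's prompt format; the function name and structure are illustrative, not the authors' code.

```python
def check_hanoi(n, moves):
    """Verify a Tower of Hanoi move list against the puzzle rules."""
    pegs = [list(range(n, 0, -1)), [], []]  # initial state: all disks on peg 0
    for disk, src, dst in moves:
        if not pegs[src] or pegs[src][-1] != disk:
            return False  # the named disk is not on top of the source peg
        if pegs[dst] and pegs[dst][-1] < disk:
            return False  # a larger disk may not sit on a smaller one
        pegs[dst].append(pegs[src].pop())
    return pegs[2] == list(range(n, 0, -1))  # success: full stack on peg 2

# Optimal 7-move solution for 3 disks, in the paper's [disk, from, to] format.
print(check_hanoi(3, [[1, 0, 2], [2, 0, 1], [1, 2, 1],
                      [3, 0, 2], [1, 1, 0], [2, 1, 2], [1, 0, 2]]))  # True
```

Because such a checker grades every intermediate move rather than only the final answer, the same function can be run over moves extracted from a reasoning trace to locate the first error.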
If this is right
- Standard LLMs outperform LRMs on low-complexity tasks under the same compute.
- LRMs demonstrate advantage on medium-complexity tasks.
- Both models face complete collapse on high-complexity tasks.
- LRMs fail to use explicit algorithms for exact computation.
- Reasoning in LRMs is inconsistent across different scales.
Where Pith is reading between the lines
- This implies that increasing model size or tokens alone may not overcome the scaling limit in reasoning effort.
- Real-world applications like advanced math or coding may hit similar walls unless new mechanisms are introduced.
- Evaluation benchmarks should shift toward controllable complexity measures to avoid contamination issues.
Load-bearing premise
The chosen controllable puzzle environments provide an unbiased and generalizable measure of reasoning complexity without introducing artifacts absent in domains like math or coding.
What would settle it
Testing whether the accuracy collapse and the decline in reasoning effort recur when similar complexity increases are applied to standard math or coding problems.
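One hedged way to run that test is a contamination-free arithmetic task whose only complexity knob is chain depth, mirroring the puzzle sweeps. The generator and the model_answer stub below are illustrative assumptions, not anything from the paper.

```python
import random

def make_chain_task(depth, seed=0):
    """Generate an arithmetic chain of `depth` steps plus its ground truth."""
    rng = random.Random(seed)
    value = rng.randint(1, 9)
    prompt = f"Start with {value}."
    for _ in range(depth):
        delta = rng.randint(1, 9)
        if rng.random() < 0.5:
            value += delta
            prompt += f" Add {delta}."
        else:
            value -= delta
            prompt += f" Subtract {delta}."
    return prompt + " What is the final value?", value

# Sweep depth as the complexity axis and look for the same collapse shape;
# model_answer() is a hypothetical stub for the LRM under test.
# for depth in (5, 10, 20, 40, 80):
#     prompt, truth = make_chain_task(depth, seed=depth)
#     print(depth, model_answer(prompt) == truth)
```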
Original abstract
Recent generations of language models have introduced Large Reasoning Models (LRMs) that generate detailed thinking processes before providing answers. While these models demonstrate improved performance on reasoning benchmarks, their fundamental capabilities, scaling properties, and limitations remain insufficiently understood. Current evaluations primarily focus on established math and coding benchmarks, emphasizing final answer accuracy. However, this evaluation paradigm often suffers from contamination and does not provide insights into the reasoning traces. In this work, we systematically investigate these gaps with the help of controllable puzzle environments that allow precise manipulation of complexity while maintaining consistent logical structures. This setup enables the analysis of not only final answers but also the internal reasoning traces, offering insights into how LRMs think. Through extensive experiments, we show that LRMs face a complete accuracy collapse beyond certain complexities. Moreover, they exhibit a counterintuitive scaling limit: their reasoning effort increases with problem complexity up to a point, then declines despite having remaining token budget. By comparing LRMs with their standard LLM counterparts under the same inference compute, we identify three performance regimes: (1) low-complexity tasks where standard models outperform LRMs, (2) medium-complexity tasks where LRMs demonstrate an advantage, and (3) high-complexity tasks where both models face complete collapse. We found that LRMs have limitations in exact computation: they fail to use explicit algorithms and reason inconsistently across scales. We also investigate the reasoning traces in more depth, studying the patterns of explored solutions and analyzing the models' computational behavior, shedding light on their strengths, limitations, and raising questions about their reasoning capabilities.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper investigates Large Reasoning Models (LRMs) using controllable puzzle environments that enable precise manipulation of complexity while preserving logical structure. It claims that LRMs exhibit complete accuracy collapse beyond certain complexity thresholds, display non-monotonic reasoning effort (increasing then declining despite remaining token budget), and occupy three distinct regimes relative to standard LLMs: standard models outperform at low complexity, LRMs at medium complexity, and both collapse at high complexity. The work further analyzes reasoning traces to identify failures in exact computation, lack of explicit algorithm use, and inconsistent reasoning across scales.
Significance. If the results hold, the work provides a useful controlled lens on LRM scaling limits and internal reasoning behavior that avoids benchmark contamination issues common in math/coding evaluations. The three-regime characterization and non-monotonic effort finding could guide future research on reasoning model design, though the absence of cross-domain validation limits immediate applicability.
Major comments (2)
- [Experimental setup and results sections] The central claims of complete accuracy collapse and non-monotonic effort scaling rest on the assumption that the chosen puzzle environments induce a representative complexity axis. The manuscript provides no cross-validation experiments comparing collapse thresholds or effort curves against standard math or coding benchmarks, leaving open the possibility that the observed behaviors are artifacts of the puzzle domain rather than fundamental LRM limitations.
- [Results on reasoning effort] The reported decline in reasoning effort despite remaining token budget is load-bearing for the scaling-limit claim, yet the manuscript does not detail the precise operationalization of effort (e.g., token counts in thinking traces), statistical tests for the decline, or controls for model-specific generation stopping criteria.
Minor comments (1)
- [Abstract] The abstract would benefit from an early, explicit definition of 'reasoning effort' to orient readers before the regime analysis.
Simulated Author's Rebuttal
We thank the referee for their thoughtful and constructive feedback. We address each major comment below and describe the revisions we will incorporate to improve the manuscript.
Point-by-point responses
Referee: [Experimental setup and results sections] The central claims of complete accuracy collapse and non-monotonic effort scaling rest on the assumption that the chosen puzzle environments induce a representative complexity axis. The manuscript provides no cross-validation experiments comparing collapse thresholds or effort curves against standard math or coding benchmarks, leaving open the possibility that the observed behaviors are artifacts of the puzzle domain rather than fundamental LRM limitations.
Authors: We appreciate the referee's concern regarding generalizability. Our puzzle environments were deliberately selected to enable precise, contamination-free manipulation of complexity while preserving logical structure, which is a core strength highlighted in the referee summary. In the revised manuscript, we will add an expanded discussion subsection that explicitly maps puzzle complexity metrics (e.g., state-space size and required search depth) to equivalent demands in mathematical and coding tasks, providing qualitative bridges to standard benchmarks. We maintain that the controlled axis reveals fundamental scaling behaviors that are difficult to isolate in contaminated evaluations. However, performing full quantitative cross-validation experiments would require substantial new work beyond the current revision timeline.
Revision: partial
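As a sketch of the proposed qualitative bridge, the two metrics named above are cheap to tabulate for Tower of Hanoi, where n disks give a 3^n-configuration state space (each disk sits on one of three pegs) and a shortest solution of 2^n - 1 moves. The tabulation below is illustrative, not taken from the manuscript.

```python
# State-space size and optimal solution length for Tower of Hanoi with n disks:
# 3**n reachable configurations, and the classic recurrence gives 2**n - 1 moves.
for n in range(3, 11):
    print(f"n={n:2d}  states={3**n:>7,}  optimal moves={2**n - 1:>5,}")
```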
Referee: [Results on reasoning effort] The reported decline in reasoning effort despite remaining token budget is load-bearing for the scaling-limit claim, yet the manuscript does not detail the precise operationalization of effort (e.g., token counts in thinking traces), statistical tests for the decline, or controls for model-specific generation stopping criteria.
Authors: We agree that greater precision is required on this load-bearing result. In the revised version, we will add explicit definitions and methodology: reasoning effort will be operationalized as the token count within the thinking trace (prior to the final answer). We will include statistical support via quadratic regression on effort versus complexity, reporting coefficients and p-values for the negative quadratic term. We will also add controls by reporting average remaining token budgets at the observed decline points across models and confirming that declines occurred well before any model-specific length limits were approached. These details will be placed in the results section on reasoning effort.
Revision: yes
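A minimal sketch of the promised test, assuming effort is the thinking-trace token count: fit an ordinary least-squares quadratic and check that the quadratic coefficient is significantly negative. The data below are placeholders, and statsmodels is one standard way to obtain the coefficient p-values the response promises.

```python
import numpy as np
import statsmodels.api as sm

# Placeholder measurements: thinking-trace token counts per complexity level.
complexity = np.arange(1, 11, dtype=float)
effort_tokens = np.array([800, 1900, 3500, 5200, 6400,
                          7000, 6800, 5900, 4300, 2600], dtype=float)

# Fit effort ~ b0 + b1*c + b2*c^2; a significantly negative b2 supports the
# rise-then-decline (non-monotonic) effort claim.
X = sm.add_constant(np.column_stack([complexity, complexity**2]))
fit = sm.OLS(effort_tokens, X).fit()
b2, p2 = fit.params[2], fit.pvalues[2]
print(f"quadratic term: {b2:.1f} (p = {p2:.2g})")
```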
Circularity Check
Empirical measurements on puzzles; no derivational circularity
Full rationale
The paper conducts direct experimental measurements of LRM accuracy, reasoning effort, and solution patterns across controlled puzzle complexities. No equations, fitted parameters, or predictions are derived; all claims rest on observed outcomes from the puzzle environments. No self-citations are used to justify uniqueness theorems or ansatzes, and the central results (accuracy collapse, non-monotonic effort) are reported as empirical findings rather than reductions of inputs by construction. The analysis is therefore self-contained and does not lean on external benchmarks.
Axiom & Free-Parameter Ledger
Empty: the paper's claims are empirical measurements, with no axioms or fitted parameters to record (see the circularity rationale above).
Lean theorems connected to this paper
- Foundation.DimensionForcing.dimension_forced (tagged: unclear)
  Unclear: the relation between this paper passage and the cited Recognition theorem is ambiguous.
  Paper passage: "three performance regimes: (1) low-complexity tasks where standard models outperform LRMs, (2) medium-complexity tasks where LRMs demonstrate advantage, and (3) high-complexity tasks where both models face complete collapse".
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 19 Pith papers
- Formalize, Don't Optimize: The Heuristic Trap in LLM-Generated Combinatorial Solvers
  LLM-generated combinatorial solvers achieve highest correctness when the model formalizes problems for verified backends rather than attempting to optimize search, which often causes regressions.
- When Does Critique Improve AI-Assisted Theoretical Physics? SCALAR: Structured Critic-Actor Loop for Agentic Reasoning
  Structured critic-actor loops improve AI performance on theoretical physics reasoning tasks, with benefits strongest in asymmetric model pairings using constructive feedback.
- Problem Reductions at Scale: Agentic Integration of Computationally Hard Problems
  A harness for AI agents enabled construction of a Rust library with 100+ problem types and 200+ reduction rules for NP-hard problems in three months.
- DeonticBench: A Benchmark for Reasoning over Rules
  DeonticBench is a new benchmark of 6,232 deontic reasoning tasks from U.S. legal domains where frontier LLMs reach only ~45% accuracy and symbolic Prolog assistance plus RL training still fail to solve tasks reliably.
- WMF-AM: Probing LLM Working Memory via Depth-Parameterized Cumulative State Tracking
  WMF-AM is a depth-parameterized benchmark that measures LLMs' cumulative state tracking ability without scratchpads, validated on 28 models across arithmetic and non-arithmetic tasks with ablations confirming the construct.
- Robust Reasoning Benchmark
  Perturbations to math problem text cause up to 55% average accuracy drops in open-weight LLMs, and sequential solving reveals context pollution in attention mechanisms.
- LEAD: Breaking the No-Recovery Bottleneck in Long-Horizon Reasoning
  LEAD lets LLMs solve checkers jumping puzzles up to size 13 by using lookahead to recover from irreversible errors on hard steps that break extreme decomposition.
- Weighted Rules under the Stable Model Semantics
  Weighted rules extend stable model semantics to support probabilistic reasoning, model ranking, and statistical inference in answer set programs.
- Hypothesis generation and updating in large language models
  LLMs exhibit Bayesian-like hypothesis updating with strong-sampling bias and an evaluation-generation gap but generalize poorly outside observed data.
- When LLMs Stop Following Steps: A Diagnostic Study of Procedural Execution in Language Models
  LLM accuracy on controlled procedural arithmetic drops from 61% at 5 steps to 20% at 95 steps, with failures including skipped steps, premature answers, and hallucinated operations.
- Latent Phase-Shift Rollback: Inference-Time Error Correction via Residual Stream Monitoring and KV-Cache Steering
  LPSR raises 8B-model accuracy on MATH-500 from 28.8% to 44.0% by detecting error-indicating phase shifts in the residual stream and correcting via KV-cache rollback plus steering vectors, outperforming prompted self-c...
- HEALing Entropy Collapse: Enhancing Exploration in Few-Shot RLVR via Hybrid-Domain Entropy Dynamics Alignment
  HEAL mitigates entropy collapse in few-shot RLVR by selectively adding general-domain data and aligning trajectory-level entropy dynamics, matching full-shot performance with 32 target samples.
- Measuring and curing reasoning rigidity: from decorative chain-of-thought to genuine faithfulness
  SLRC quantifies genuine step necessity in LLM reasoning as a causal estimator, LC-CoSR training reduces rigidity with stability guarantees, and evaluations reveal a faithfulness-sycophancy paradox across frontier models.
- Distill: Uncovering the True Intent behind Human-Robot Communication
  Distill refines user task specifications for robots by pruning unnecessary steps, generalizing meanings, and relaxing order constraints, as demonstrated in a crowdsourcing study on a web interface.
- Absurd World: A Simple Yet Powerful Method to Absurdify the Real-world for Probing LLM Reasoning Capabilities
  Absurd World automatically converts real-world problems into absurd yet logically coherent scenarios to test whether LLMs can reason without depending on familiar patterns.
- How Well Do LLMs Perform on the Simplest Long-Chain Reasoning Tasks: An Empirical Study on the Equivalence Class Problem
  Non-reasoning LLMs fail the equivalence class problem while reasoning LLMs perform better but remain incomplete, with difficulty peaking at the phase transition for the former and at maximum diameter for the latter.
- Do LLMs have core beliefs?
  LLMs generally fail to maintain stable worldviews under adversarial conversational pressure, indicating they lack core beliefs akin to those in human cognition.
- Assistants, Not Architects: The Role of LLMs in Networked Systems Design
  LLMs fail at architectural reasoning for networked systems, but Kepler uses structured constraints and SMT-based optimization to synthesize feasible designs with explanations.
- From Understanding to Creation: A Prerequisite-Free AI Literacy Course with Technical Depth Across Majors
  A university course design enables non-technical students across majors to reach the Create level of Bloom's taxonomy by repeatedly applying a problem-data-model-evaluation-reflection pipeline with concurrent ethics t...
Reference graph
Works this paper leans on
- [1] Aaron Jaech, Adam Kalai, Adam Lerer, Adam Richardson, Ahmed El-Kishky, Aiden Low, Alec Helyar, Aleksander Madry, Alex Beutel, Alex Carney, et al. OpenAI o1 system card. arXiv preprint arXiv:2412.16720, 2024.
- [2]
- [3] Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025.
- [4]
- [5] Google. Gemini Flash Thinking. Google AI Blog, January 2025.
- [6] Seyed Iman Mirzadeh, Keivan Alizadeh, Hooman Shahrokhi, Oncel Tuzel, Samy Bengio, and Mehrdad Farajtabar. GSM-Symbolic: Understanding the limitations of mathematical reasoning in large language models. In The Thirteenth International Conference on Learning Representations, 2025.
- [7] Emily M. Bender, Timnit Gebru, Angelina McMillan-Major, and Shmargaret Shmitchell. On the dangers of stochastic parrots: Can language models be too big? In Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, pages 610–623, 2021.
- [8] Francois Chollet, Mike Knoop, Gregory Kamradt, Bryan Landers, and Henry Pinkard. ARC-AGI-2: A new challenge for frontier AI reasoning systems. arXiv preprint arXiv:2505.11831, 2025.
- [9] Marah I. Abdin, Sam Ade Jacobs, Ammar Ahmad Awan, Jyoti Aneja, Ahmed Awadallah, Hany Awadalla, Nguyen Bach, Amit Bahree, Arash Bakhtiari, Harkirat S. Behl, et al. Phi-3 technical report: A highly capable language model locally on your phone. CoRR, abs/2404.14219, 2024.
- [10] Albert Q. Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de Las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, Lélio Renard Lavaud, Marie-Anne Lachaux, Pierre Stock, Teven Le Scao, Thibaut Lavril, Thomas Wang, Timothée Lacroix, and William El Sayed. Mistral 7B. CoRR, abs/2310.06825, 2023.
- [11] Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, et al., 2024.
- [12] Nouha Dziri, Ximing Lu, Melanie Sclar, Xiang Lorraine Li, Liwei Jiang, Bill Yuchen Lin, Sean Welleck, Peter West, Chandra Bhagavatula, Ronan Le Bras, Jena D. Hwang, Soumya Sanyal, Xiang Ren, Allyson Ettinger, Zaïd Harchaoui, and Yejin Choi. Faith and fate: Limits of transformers on compositionality. In Advances in Neural Information Processing Systems, 2023.
- [13] R. Thomas McCoy, Shunyu Yao, Dan Friedman, Matthew Hardy, and Thomas L. Griffiths. Embers of autoregression: Understanding large language models through the problem they are trained to solve, 2023.
- [14] Marianna Nezhurina, Lucia Cipolina-Kun, Mehdi Cherti, and Jenia Jitsev. Alice in Wonderland: Simple tasks showing complete reasoning breakdown in state-of-the-art large language models. arXiv preprint arXiv:2406.02061, 2024.
- [15] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed H. Chi, Quoc V. Le, and Denny Zhou. Chain-of-thought prompting elicits reasoning in large language models. In Advances in Neural Information Processing Systems 35, 2022.
- [16] Mehran Kazemi, Najoung Kim, Deepti Bhatia, Xin Xu, and Deepak Ramachandran. LAMBADA: Backward chaining for automated reasoning in natural language. arXiv preprint arXiv:2212.13894, 2022.
- [17] Hattie Zhou, Azade Nova, Hugo Larochelle, Aaron Courville, Behnam Neyshabur, and Hanie Sedghi. Teaching algorithmic reasoning via in-context learning. arXiv preprint arXiv:2211.09066, 2022.
- [18] Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. Large language models are zero-shot reasoners. Advances in Neural Information Processing Systems, 35:22199–22213, 2022.
- [19] Yixuan Weng, Minjun Zhu, Fei Xia, Bin Li, Shizhu He, Shengping Liu, Bin Sun, Kang Liu, and Jun Zhao. Large language models are better reasoners with self-verification. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 2550–2575, Singapore, December 2023.
- [20] Yifei Li, Zeqi Lin, Shizhuo Zhang, Qiang Fu, Bei Chen, Jian-Guang Lou, and Weizhu Chen. Making language models better reasoners with step-aware verifier. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 5315–5333, 2023.
- [21] Eric Zhao, Pranjal Awasthi, and Sreenivas Gollapudi. Sample, scrutinize and scale: Effective inference-time search by scaling verification. arXiv preprint arXiv:2502.01839, 2025.
- [22] Eric Zelikman, Yuhuai Wu, Jesse Mu, and Noah Goodman. STaR: Bootstrapping reasoning with reasoning. In Advances in Neural Information Processing Systems, 2022.
- [23] Sachin Goyal, Ziwei Ji, Ankit Singh Rawat, Aditya Krishna Menon, Sanjiv Kumar, and Vaishnavh Nagarajan. Think before you speak: Training language models with pause tokens. In The Twelfth International Conference on Learning Representations, 2024.
- [24] David Herel and Tomas Mikolov. Thinking tokens for language modeling. arXiv, abs/2405.08644, 2024.
- [25] Zhihong Shao, Peiyi Wang, Runxin Xu, Qihao Zhu, Junxiao Song, Mingchuan Zhang, Y. K. Li, Y. Wu, and Daya Guo. DeepSeekMath: Pushing the limits of mathematical reasoning in open language models, 2024.
- [26] Amirhossein Kazemnejad, Milad Aghajohari, Eva Portelance, Alessandro Sordoni, Siva Reddy, Aaron Courville, and Nicolas Le Roux. VinePPO: Unlocking RL potential for LLM reasoning through refined credit assignment, 2024.
- [27] Nathan Lambert, Jacob Morrison, Valentina Pyatkin, Shengyi Huang, Hamish Ivison, Faeze Brahman, Lester James V. Miranda, Alisa Liu, Nouha Dziri, Shane Lyu, Yuling Gu, Saumya Malik, Victoria Graf, Jena D. Hwang, Jiangjiang Yang, Ronan Le Bras, Oyvind Tafjord, Chris Wilhelm, Luca Soldaini, Noah A. Smith, Yizhong Wang, Pradeep Dasigi, et al. Tulu 3: Pushing frontiers in open language model post-training, 2024.
- [28] Yanda Chen, Joe Benton, Ansh Radhakrishnan, Jonathan Uesato, Carson Denison, John Schulman, Arushi Somani, Peter Hase, Misha Wagner, Fabien Roger, et al. Reasoning models don't always say what they think. arXiv preprint arXiv:2505.05410, 2025.
- [29] Dacheng Li, Shiyi Cao, Tyler Griggs, Shu Liu, Xiangxi Mo, Eric Tang, Sumanth Hegde, Kourosh Hakhamaneshi, Shishir G. Patil, Matei Zaharia, et al. LLMs can easily learn to reason from demonstrations: Structure, not content, is what matters! arXiv preprint arXiv:2502.07374, 2025.
- [30] Xingyu Chen, Jiahao Xu, Tian Liang, Zhiwei He, Jianhui Pang, Dian Yu, Linfeng Song, Qiuzhi Liu, Mengfei Zhou, Zhuosheng Zhang, et al. Do not think that much for 2+3=? On the overthinking of o1-like LLMs. arXiv preprint arXiv:2412.21187, 2024.
- [31] Yang Sui, Yu-Neng Chuang, Guanchu Wang, Jiamu Zhang, Tianyi Zhang, Jiayi Yuan, Hongyi Liu, Andrew Wen, Hanjie Chen, Xia Hu, et al. Stop overthinking: A survey on efficient reasoning for large language models. arXiv preprint arXiv:2503.16419, 2025.
- [32] Sara Vera Marjanović, Arkil Patel, Vaibhav Adlakha, Milad Aghajohari, Parishad BehnamGhader, Mehar Bhatia, Aditi Khandelwal, Austin Kraft, Benno Krojer, Xing Han Lù, et al. DeepSeek-R1 Thoughtology: Let's <think> about LLM reasoning. arXiv preprint arXiv:2504.07128, 2025.
- [33] Yuxiao Qu, Matthew Y. R. Yang, Amrith Setlur, Lewis Tunstall, Edward Emanuel Beeching, Ruslan Salakhutdinov, and Aviral Kumar. Optimizing test-time compute via meta reinforcement fine-tuning. arXiv preprint arXiv:2503.07572, 2025.
- [34] Marthe Ballon, Andres Algaba, and Vincent Ginis. The relationship between reasoning and performance in large language models: o3 (mini) thinks harder, not longer. arXiv preprint arXiv:2502.15631, 2025.
- [35] Yang Yue, Zhiqi Chen, Rui Lu, Andrew Zhao, Zhaokai Wang, Shiji Song, and Gao Huang. Does reinforcement learning really incentivize reasoning capacity in LLMs beyond the base model? arXiv preprint arXiv:2504.13837, 2025.
- [36] Nikola Zubić, Federico Soldá, Aurelio Sulser, and Davide Scaramuzza. Limits of deep learning: Sequence modeling through the lens of complexity theory. arXiv preprint arXiv:2405.16674, 2024.
- [37] Benjamin Estermann, Luca A. Lanzendörfer, Yannick Niedermayr, and Roger Wattenhofer. PUZZLES: A benchmark for neural algorithmic reasoning, 2024.
- [38] Karthik Valmeekam, Alberto Olmo Hernandez, Sarath Sreedharan, and Subbarao Kambhampati. Large language models still can't plan (a benchmark for LLMs on planning and reasoning about change). CoRR, abs/2206.10498, 2022.
- [39] Anian Ruoss, Fabio Pardo, Harris Chan, Bonnie Li, Volodymyr Mnih, and Tim Genewein. LMAct: A benchmark for in-context imitation learning with long multimodal demonstrations. arXiv preprint arXiv:2412.01441, 2024.
- [40] Karthik Valmeekam, Kaya Stechly, and Subbarao Kambhampati. LLMs still can't plan; can LRMs? A preliminary evaluation of OpenAI's o1 on PlanBench, 2024.
- [41] Wenjie Ma, Jingxuan He, Charlie Snell, Tyler Griggs, Sewon Min, and Matei Zaharia. Reasoning models can be effective without thinking. arXiv preprint arXiv:2504.09858, 2025.
- [42] Hunter Lightman, Vineet Kosaraju, Yura Burda, Harri Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. Let's verify step by step. arXiv preprint arXiv:2305.20050, 2023.
- [43] Mathematical Association of America. American Invitational Mathematics Examination (AIME). https://maa.org/math-competitions/american-invitational-mathematics-examination-aime, 2025. Accessed: 2025-05-15.
- [44] Art of Problem Solving. AMC historical results – AIME I (February 1, 2024). https://artofproblemsolving.com/wiki/index.php/AMC_historical_results#AIME_I_.28February_1.2C_2024.29, 2024. Accessed: 2025-05-15.
- [45] Art of Problem Solving. AMC historical results – AIME I (February 6, 2025). https://artofproblemsolving.com/wiki/index.php/AMC_historical_results#AIME_I_.28February_6.2C_2025.29, 2025. Accessed: 2025-05-15.
- [46] Gary F. Marcus. The Algebraic Mind: Integrating Connectionism and Cognitive Science. MIT Press, 2003.
- [47] Saul Amarel. On representations of problems of reasoning about actions. In Readings in Artificial Intelligence, pages 2–22. Elsevier, 1981.
- [48] Günter Rote. Crossing the bridge at night. Bulletin of the EATCS, 78:241, 2002.
- [49] Xinran Zhao, Hanie Sedghi, Bernd Bohnet, Dale Schuurmans, and Azade Nova. Improving large language model planning with action sequence similarity. arXiv preprint arXiv:2505.01009, 2025.
- [50] Xiaomin Li, Zhou Yu, Zhiwei Zhang, Xupeng Chen, Ziji Zhang, Yingying Zhuang, Narayanan Sadagopan, and Anurag Beniwal. When thinking fails: The pitfalls of reasoning for instruction-following in LLMs. arXiv preprint arXiv:2505.11423, 2025.