The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity
Recognition: 2 theorem links · Lean theorem
Pith reviewed 2026-05-15 16:04 UTC · model grok-4.3
The pith
Large Reasoning Models exhibit complete accuracy collapse beyond certain complexities and reduce reasoning effort despite available compute.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Using controllable puzzle environments, LRMs face a complete accuracy collapse beyond certain complexities. They exhibit a counterintuitive scaling limit where their reasoning effort increases with problem complexity up to a point, then declines despite having remaining token budget. Compared to standard LLMs under the same inference compute, three performance regimes emerge: standard models outperform at low complexity, LRMs at medium complexity, and both collapse at high complexity. LRMs fail to use explicit algorithms and reason inconsistently.
What carries the argument
Controllable puzzle environments that enable precise manipulation of complexity while keeping logical structures consistent, allowing detailed analysis of both final answers and internal reasoning traces.
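To make this concrete, here is a minimal sketch of the kind of verifier such an environment needs, written for the Tower of Hanoi puzzles the paper uses. The move encoding [disk, from_peg, to_peg] and the initial state [[n, ..., 2, 1], [], []] follow the paper's prompt format; the function name and structure are illustrative, not the authors' code.

```python
def check_hanoi(n, moves):
    """Verify a Tower of Hanoi move list against the puzzle rules."""
    pegs = [list(range(n, 0, -1)), [], []]  # initial state: all disks on peg 0
    for disk, src, dst in moves:
        if not pegs[src] or pegs[src][-1] != disk:
            return False  # the named disk is not on top of the source peg
        if pegs[dst] and pegs[dst][-1] < disk:
            return False  # a larger disk may not sit on a smaller one
        pegs[dst].append(pegs[src].pop())
    return pegs[2] == list(range(n, 0, -1))  # success: full stack on peg 2

# Optimal 7-move solution for 3 disks, in the paper's [disk, from, to] format.
print(check_hanoi(3, [[1, 0, 2], [2, 0, 1], [1, 2, 1],
                      [3, 0, 2], [1, 1, 0], [2, 1, 2], [1, 0, 2]]))  # True
```

Because such a checker grades every intermediate move rather than only the final answer, the same function can be run over moves extracted from a reasoning trace to locate the first error.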
If this is right
- Standard LLMs outperform LRMs on low-complexity tasks under the same compute.
- LRMs demonstrate advantage on medium-complexity tasks.
- Both models face complete collapse on high-complexity tasks.
- LRMs fail to use explicit algorithms for exact computation.
- Reasoning in LRMs is inconsistent across different scales.
Where Pith is reading between the lines
- This implies that increasing model size or tokens alone may not overcome the scaling limit in reasoning effort.
- Real-world applications like advanced math or coding may hit similar walls unless new mechanisms are introduced.
- Evaluation benchmarks should shift toward controllable complexity measures to avoid contamination issues.
Load-bearing premise
The chosen controllable puzzle environments provide an unbiased and generalizable measure of reasoning complexity without introducing artifacts absent in domains like math or coding.
What would settle it
Testing whether the accuracy collapse and the decline in reasoning effort recur when similar complexity increases are applied to standard math or coding problems.
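One hedged way to run that test is a contamination-free arithmetic task whose only complexity knob is chain depth, mirroring the puzzle sweeps. The generator and the model_answer stub below are illustrative assumptions, not anything from the paper.

```python
import random

def make_chain_task(depth, seed=0):
    """Generate an arithmetic chain of `depth` steps plus its ground truth."""
    rng = random.Random(seed)
    value = rng.randint(1, 9)
    prompt = f"Start with {value}."
    for _ in range(depth):
        delta = rng.randint(1, 9)
        if rng.random() < 0.5:
            value += delta
            prompt += f" Add {delta}."
        else:
            value -= delta
            prompt += f" Subtract {delta}."
    return prompt + " What is the final value?", value

# Sweep depth as the complexity axis and look for the same collapse shape;
# model_answer() is a hypothetical stub for the LRM under test.
# for depth in (5, 10, 20, 40, 80):
#     prompt, truth = make_chain_task(depth, seed=depth)
#     print(depth, model_answer(prompt) == truth)
```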
Original abstract
Recent generations of language models have introduced Large Reasoning Models (LRMs) that generate detailed thinking processes before providing answers. While these models demonstrate improved performance on reasoning benchmarks, their fundamental capabilities, scaling properties, and limitations remain insufficiently understood. Current evaluations primarily focus on established math and coding benchmarks, emphasizing final answer accuracy. However, this evaluation paradigm often suffers from contamination and does not provide insights into the reasoning traces. In this work, we systematically investigate these gaps with the help of controllable puzzle environments that allow precise manipulation of complexity while maintaining consistent logical structures. This setup enables the analysis of not only final answers but also the internal reasoning traces, offering insights into how LRMs think. Through extensive experiments, we show that LRMs face a complete accuracy collapse beyond certain complexities. Moreover, they exhibit a counterintuitive scaling limit: their reasoning effort increases with problem complexity up to a point, then declines despite having remaining token budget. By comparing LRMs with their standard LLM counterparts under the same inference compute, we identify three performance regimes: (1) low-complexity tasks where standard models outperform LRMs, (2) medium-complexity tasks where LRMs demonstrate an advantage, and (3) high-complexity tasks where both models face complete collapse. We found that LRMs have limitations in exact computation: they fail to use explicit algorithms and reason inconsistently across scales. We also investigate the reasoning traces in more depth, studying the patterns of explored solutions and analyzing the models' computational behavior, shedding light on their strengths, limitations, and raising questions about their reasoning capabilities.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper investigates Large Reasoning Models (LRMs) using controllable puzzle environments that enable precise manipulation of complexity while preserving logical structure. It claims that LRMs exhibit complete accuracy collapse beyond certain complexity thresholds, display non-monotonic reasoning effort (increasing then declining despite remaining token budget), and occupy three distinct regimes relative to standard LLMs: standard models outperform at low complexity, LRMs at medium complexity, and both collapse at high complexity. The work further analyzes reasoning traces to identify failures in exact computation, lack of explicit algorithm use, and inconsistent reasoning across scales.
Significance. If the results hold, the work provides a useful controlled lens on LRM scaling limits and internal reasoning behavior that avoids benchmark contamination issues common in math/coding evaluations. The three-regime characterization and non-monotonic effort finding could guide future research on reasoning model design, though the absence of cross-domain validation limits immediate applicability.
Major comments (2)
- [Experimental setup and results sections] The central claims of complete accuracy collapse and non-monotonic effort scaling rest on the assumption that the chosen puzzle environments induce a representative complexity axis. The manuscript provides no cross-validation experiments comparing collapse thresholds or effort curves against standard math or coding benchmarks, leaving open the possibility that the observed behaviors are artifacts of the puzzle domain rather than fundamental LRM limitations.
- [Results on reasoning effort] The reported decline in reasoning effort despite remaining token budget is load-bearing for the scaling-limit claim, yet the manuscript does not detail the precise operationalization of effort (e.g., token counts in thinking traces), statistical tests for the decline, or controls for model-specific generation stopping criteria.
Minor comments (1)
- [Abstract] The abstract would benefit from an early, explicit definition of 'reasoning effort' to orient readers before the regime analysis.
Simulated Author's Rebuttal
We thank the referee for their thoughtful and constructive feedback. We address each major comment below and describe the revisions we will incorporate to improve the manuscript.
Point-by-point responses
Referee: [Experimental setup and results sections] The central claims of complete accuracy collapse and non-monotonic effort scaling rest on the assumption that the chosen puzzle environments induce a representative complexity axis. The manuscript provides no cross-validation experiments comparing collapse thresholds or effort curves against standard math or coding benchmarks, leaving open the possibility that the observed behaviors are artifacts of the puzzle domain rather than fundamental LRM limitations.
Authors: We appreciate the referee's concern regarding generalizability. Our puzzle environments were deliberately selected to enable precise, contamination-free manipulation of complexity while preserving logical structure, which is a core strength highlighted in the referee summary. In the revised manuscript, we will add an expanded discussion subsection that explicitly maps puzzle complexity metrics (e.g., state-space size and required search depth) to equivalent demands in mathematical and coding tasks, providing qualitative bridges to standard benchmarks. We maintain that the controlled axis reveals fundamental scaling behaviors that are difficult to isolate in contaminated evaluations. However, performing full quantitative cross-validation experiments would require substantial new work beyond the current revision timeline.
Revision: partial
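As a sketch of the proposed qualitative bridge, the two metrics named above are cheap to tabulate for Tower of Hanoi, where n disks give a 3^n-configuration state space (each disk sits on one of three pegs) and a shortest solution of 2^n - 1 moves. The tabulation below is illustrative, not taken from the manuscript.

```python
# State-space size and optimal solution length for Tower of Hanoi with n disks:
# 3**n reachable configurations, and the classic recurrence gives 2**n - 1 moves.
for n in range(3, 11):
    print(f"n={n:2d}  states={3**n:>7,}  optimal moves={2**n - 1:>5,}")
```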
Referee: [Results on reasoning effort] The reported decline in reasoning effort despite remaining token budget is load-bearing for the scaling-limit claim, yet the manuscript does not detail the precise operationalization of effort (e.g., token counts in thinking traces), statistical tests for the decline, or controls for model-specific generation stopping criteria.
Authors: We agree that greater precision is required on this load-bearing result. In the revised version, we will add explicit definitions and methodology: reasoning effort will be operationalized as the token count within the thinking trace (prior to the final answer). We will include statistical support via quadratic regression on effort versus complexity, reporting coefficients and p-values for the negative quadratic term. We will also add controls by reporting average remaining token budgets at the observed decline points across models and confirming that declines occurred well before any model-specific length limits were approached. These details will be placed in the results section on reasoning effort.
Revision: yes
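A minimal sketch of the promised test, assuming effort is the thinking-trace token count: fit an ordinary least-squares quadratic and check that the quadratic coefficient is significantly negative. The data below are placeholders, and statsmodels is one standard way to obtain the coefficient p-values the response promises.

```python
import numpy as np
import statsmodels.api as sm

# Placeholder measurements: thinking-trace token counts per complexity level.
complexity = np.arange(1, 11, dtype=float)
effort_tokens = np.array([800, 1900, 3500, 5200, 6400,
                          7000, 6800, 5900, 4300, 2600], dtype=float)

# Fit effort ~ b0 + b1*c + b2*c^2; a significantly negative b2 supports the
# rise-then-decline (non-monotonic) effort claim.
X = sm.add_constant(np.column_stack([complexity, complexity**2]))
fit = sm.OLS(effort_tokens, X).fit()
b2, p2 = fit.params[2], fit.pvalues[2]
print(f"quadratic term: {b2:.1f} (p = {p2:.2g})")
```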
Circularity Check
Empirical measurements on puzzles; no derivational circularity
Full rationale
The paper conducts direct experimental measurements of LRM accuracy, reasoning effort, and solution patterns across controlled puzzle complexities. No equations, fitted parameters, or predictions are derived; all claims rest on observed outcomes from the puzzle environments. No self-citations are used to justify uniqueness theorems or ansatzes, and the central results (accuracy collapse, non-monotonic effort) are reported as empirical findings rather than reductions of inputs by construction. The analysis is therefore self-contained and does not lean on external benchmarks.
Axiom & Free-Parameter Ledger
Empty: the paper's claims are empirical measurements, with no axioms or fitted parameters to record (see the circularity rationale above).
Lean theorems connected to this paper
- Foundation.DimensionForcing.dimension_forced (tagged: unclear)
  Unclear: the relation between this paper passage and the cited Recognition theorem is ambiguous.
  Paper passage: "three performance regimes: (1) low-complexity tasks where standard models outperform LRMs, (2) medium-complexity tasks where LRMs demonstrate advantage, and (3) high-complexity tasks where both models face complete collapse".
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 19 Pith papers
- Formalize, Don't Optimize: The Heuristic Trap in LLM-Generated Combinatorial Solvers
  LLM-generated combinatorial solvers achieve highest correctness when the model formalizes problems for verified backends rather than attempting to optimize search, which often causes regressions.
- When Does Critique Improve AI-Assisted Theoretical Physics? SCALAR: Structured Critic-Actor Loop for Agentic Reasoning
  Structured critic-actor loops improve AI performance on theoretical physics reasoning tasks, with benefits strongest in asymmetric model pairings using constructive feedback.
- Problem Reductions at Scale: Agentic Integration of Computationally Hard Problems
  A harness for AI agents enabled construction of a Rust library with 100+ problem types and 200+ reduction rules for NP-hard problems in three months.
- DeonticBench: A Benchmark for Reasoning over Rules
  DeonticBench is a new benchmark of 6,232 deontic reasoning tasks from U.S. legal domains where frontier LLMs reach only ~45% accuracy and symbolic Prolog assistance plus RL training still fail to solve tasks reliably.
- WMF-AM: Probing LLM Working Memory via Depth-Parameterized Cumulative State Tracking
  WMF-AM is a depth-parameterized benchmark that measures LLMs' cumulative state tracking ability without scratchpads, validated on 28 models across arithmetic and non-arithmetic tasks with ablations confirming the construct.
- Robust Reasoning Benchmark
  Perturbations to math problem text cause up to 55% average accuracy drops in open-weight LLMs, and sequential solving reveals context pollution in attention mechanisms.
- LEAD: Breaking the No-Recovery Bottleneck in Long-Horizon Reasoning
  LEAD lets LLMs solve checkers jumping puzzles up to size 13 by using lookahead to recover from irreversible errors on hard steps that break extreme decomposition.
- Weighted Rules under the Stable Model Semantics
  Weighted rules extend stable model semantics to support probabilistic reasoning, model ranking, and statistical inference in answer set programs.
- Hypothesis generation and updating in large language models
  LLMs exhibit Bayesian-like hypothesis updating with strong-sampling bias and an evaluation-generation gap but generalize poorly outside observed data.
- When LLMs Stop Following Steps: A Diagnostic Study of Procedural Execution in Language Models
  LLM accuracy on controlled procedural arithmetic drops from 61% at 5 steps to 20% at 95 steps, with failures including skipped steps, premature answers, and hallucinated operations.
- Latent Phase-Shift Rollback: Inference-Time Error Correction via Residual Stream Monitoring and KV-Cache Steering
  LPSR raises 8B-model accuracy on MATH-500 from 28.8% to 44.0% by detecting error-indicating phase shifts in the residual stream and correcting via KV-cache rollback plus steering vectors, outperforming prompted self-c...
- HEALing Entropy Collapse: Enhancing Exploration in Few-Shot RLVR via Hybrid-Domain Entropy Dynamics Alignment
  HEAL mitigates entropy collapse in few-shot RLVR by selectively adding general-domain data and aligning trajectory-level entropy dynamics, matching full-shot performance with 32 target samples.
- Measuring and curing reasoning rigidity: from decorative chain-of-thought to genuine faithfulness
  SLRC quantifies genuine step necessity in LLM reasoning as a causal estimator, LC-CoSR training reduces rigidity with stability guarantees, and evaluations reveal a faithfulness-sycophancy paradox across frontier models.
- Distill: Uncovering the True Intent behind Human-Robot Communication
  Distill refines user task specifications for robots by pruning unnecessary steps, generalizing meanings, and relaxing order constraints, as demonstrated in a crowdsourcing study on a web interface.
- Absurd World: A Simple Yet Powerful Method to Absurdify the Real-world for Probing LLM Reasoning Capabilities
  Absurd World automatically converts real-world problems into absurd yet logically coherent scenarios to test whether LLMs can reason without depending on familiar patterns.
- How Well Do LLMs Perform on the Simplest Long-Chain Reasoning Tasks: An Empirical Study on the Equivalence Class Problem
  Non-reasoning LLMs fail the equivalence class problem while reasoning LLMs perform better but remain incomplete, with difficulty peaking at the phase transition for the former and at maximum diameter for the latter.
- Do LLMs have core beliefs?
  LLMs generally fail to maintain stable worldviews under adversarial conversational pressure, indicating they lack core beliefs akin to those in human cognition.
- Assistants, Not Architects: The Role of LLMs in Networked Systems Design
  LLMs fail at architectural reasoning for networked systems, but Kepler uses structured constraints and SMT-based optimization to synthesize feasible designs with explanations.
- From Understanding to Creation: A Prerequisite-Free AI Literacy Course with Technical Depth Across Majors
  A university course design enables non-technical students across majors to reach the Create level of Bloom's taxonomy by repeatedly applying a problem-data-model-evaluation-reflection pipeline with concurrent ethics t...
Reference graph
Works this paper leans on
- [1] Aaron Jaech, Adam Kalai, Adam Lerer, Adam Richardson, Ahmed El-Kishky, Aiden Low, Alec Helyar, Aleksander Madry, Alex Beutel, Alex Carney, et al. OpenAI o1 system card. arXiv preprint arXiv:2412.16720, 2024.
- [2]
- [3] Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025.
- [4]
- [5] Google. Gemini Flash Thinking. Google AI Blog, January 2025.
- [6] Seyed Iman Mirzadeh, Keivan Alizadeh, Hooman Shahrokhi, Oncel Tuzel, Samy Bengio, and Mehrdad Farajtabar. GSM-Symbolic: Understanding the limitations of mathematical reasoning in large language models. In The Thirteenth International Conference on Learning Representations, 2025.
- [7] Emily M. Bender, Timnit Gebru, Angelina McMillan-Major, and Shmargaret Shmitchell. On the dangers of stochastic parrots: Can language models be too big? In Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, pages 610–623, 2021.
- [8] Francois Chollet, Mike Knoop, Gregory Kamradt, Bryan Landers, and Henry Pinkard. ARC-AGI-2: A new challenge for frontier AI reasoning systems. arXiv preprint arXiv:2505.11831, 2025.
- [9] Marah I. Abdin, Sam Ade Jacobs, Ammar Ahmad Awan, Jyoti Aneja, Ahmed Awadallah, Hany Awadalla, Nguyen Bach, Amit Bahree, Arash Bakhtiari, Harkirat S. Behl, et al. Phi-3 technical report: A highly capable language model locally on your phone. CoRR, abs/2404.14219, 2024.
- [10] Albert Q. Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de Las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, Lélio Renard Lavaud, Marie-Anne Lachaux, Pierre Stock, Teven Le Scao, Thibaut Lavril, Thomas Wang, Timothée Lacroix, and William El Sayed. Mistral 7B. CoRR, abs/2310.06825, 2023.
- [11] Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, et al., 2024.
- [12] Nouha Dziri, Ximing Lu, Melanie Sclar, Xiang Lorraine Li, Liwei Jiang, Bill Yuchen Lin, Sean Welleck, Peter West, Chandra Bhagavatula, Ronan Le Bras, Jena D. Hwang, Soumya Sanyal, Xiang Ren, Allyson Ettinger, Zaïd Harchaoui, and Yejin Choi. Faith and fate: Limits of transformers on compositionality. In Advances in Neural Information Processing Systems, 2023.
- [13] R. Thomas McCoy, Shunyu Yao, Dan Friedman, Matthew Hardy, and Thomas L. Griffiths. Embers of autoregression: Understanding large language models through the problem they are trained to solve, 2023.
- [14] Marianna Nezhurina, Lucia Cipolina-Kun, Mehdi Cherti, and Jenia Jitsev. Alice in Wonderland: Simple tasks showing complete reasoning breakdown in state-of-the-art large language models. arXiv preprint arXiv:2406.02061, 2024.
- [15] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed H. Chi, Quoc V. Le, and Denny Zhou. Chain-of-thought prompting elicits reasoning in large language models. In Advances in Neural Information Processing Systems 35, 2022.
- [16] Mehran Kazemi, Najoung Kim, Deepti Bhatia, Xin Xu, and Deepak Ramachandran. LAMBADA: Backward chaining for automated reasoning in natural language. arXiv preprint arXiv:2212.13894, 2022.
- [17] Hattie Zhou, Azade Nova, Hugo Larochelle, Aaron Courville, Behnam Neyshabur, and Hanie Sedghi. Teaching algorithmic reasoning via in-context learning. arXiv preprint arXiv:2211.09066, 2022.
- [18] Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. Large language models are zero-shot reasoners. Advances in Neural Information Processing Systems, 35:22199–22213, 2022.
- [19] Yixuan Weng, Minjun Zhu, Fei Xia, Bin Li, Shizhu He, Shengping Liu, Bin Sun, Kang Liu, and Jun Zhao. Large language models are better reasoners with self-verification. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 2550–2575, Singapore, December 2023.
- [20] Yifei Li, Zeqi Lin, Shizhuo Zhang, Qiang Fu, Bei Chen, Jian-Guang Lou, and Weizhu Chen. Making language models better reasoners with step-aware verifier. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 5315–5333, 2023.
- [21] Eric Zhao, Pranjal Awasthi, and Sreenivas Gollapudi. Sample, scrutinize and scale: Effective inference-time search by scaling verification. arXiv preprint arXiv:2502.01839, 2025.
- [22] Eric Zelikman, Yuhuai Wu, Jesse Mu, and Noah Goodman. STaR: Bootstrapping reasoning with reasoning. In Advances in Neural Information Processing Systems, 2022.
- [23] Sachin Goyal, Ziwei Ji, Ankit Singh Rawat, Aditya Krishna Menon, Sanjiv Kumar, and Vaishnavh Nagarajan. Think before you speak: Training language models with pause tokens. In The Twelfth International Conference on Learning Representations, 2024.
- [24] David Herel and Tomas Mikolov. Thinking tokens for language modeling. arXiv, abs/2405.08644, 2024.
- [25] Zhihong Shao, Peiyi Wang, Runxin Xu, Qihao Zhu, Junxiao Song, Mingchuan Zhang, Y. K. Li, Y. Wu, and Daya Guo. DeepSeekMath: Pushing the limits of mathematical reasoning in open language models, 2024.
- [26] Amirhossein Kazemnejad, Milad Aghajohari, Eva Portelance, Alessandro Sordoni, Siva Reddy, Aaron Courville, and Nicolas Le Roux. VinePPO: Unlocking RL potential for LLM reasoning through refined credit assignment, 2024.
- [27] Nathan Lambert, Jacob Morrison, Valentina Pyatkin, Shengyi Huang, Hamish Ivison, Faeze Brahman, Lester James V. Miranda, Alisa Liu, Nouha Dziri, Shane Lyu, Yuling Gu, Saumya Malik, Victoria Graf, Jena D. Hwang, Jiangjiang Yang, Ronan Le Bras, Oyvind Tafjord, Chris Wilhelm, Luca Soldaini, Noah A. Smith, Yizhong Wang, Pradeep Dasigi, et al. Tulu 3: Pushing frontiers in open language model post-training, 2024.
- [28] Yanda Chen, Joe Benton, Ansh Radhakrishnan, Jonathan Uesato, Carson Denison, John Schulman, Arushi Somani, Peter Hase, Misha Wagner, Fabien Roger, et al. Reasoning models don't always say what they think. arXiv preprint arXiv:2505.05410, 2025.
- [29] Dacheng Li, Shiyi Cao, Tyler Griggs, Shu Liu, Xiangxi Mo, Eric Tang, Sumanth Hegde, Kourosh Hakhamaneshi, Shishir G. Patil, Matei Zaharia, et al. LLMs can easily learn to reason from demonstrations: Structure, not content, is what matters! arXiv preprint arXiv:2502.07374, 2025.
- [30] Xingyu Chen, Jiahao Xu, Tian Liang, Zhiwei He, Jianhui Pang, Dian Yu, Linfeng Song, Qiuzhi Liu, Mengfei Zhou, Zhuosheng Zhang, et al. Do not think that much for 2+3=? On the overthinking of o1-like LLMs. arXiv preprint arXiv:2412.21187, 2024.
- [31] Yang Sui, Yu-Neng Chuang, Guanchu Wang, Jiamu Zhang, Tianyi Zhang, Jiayi Yuan, Hongyi Liu, Andrew Wen, Hanjie Chen, Xia Hu, et al. Stop overthinking: A survey on efficient reasoning for large language models. arXiv preprint arXiv:2503.16419, 2025.
- [32] Sara Vera Marjanović, Arkil Patel, Vaibhav Adlakha, Milad Aghajohari, Parishad BehnamGhader, Mehar Bhatia, Aditi Khandelwal, Austin Kraft, Benno Krojer, Xing Han Lù, et al. DeepSeek-R1 Thoughtology: Let's <think> about LLM reasoning. arXiv preprint arXiv:2504.07128, 2025.
- [33] Yuxiao Qu, Matthew Y. R. Yang, Amrith Setlur, Lewis Tunstall, Edward Emanuel Beeching, Ruslan Salakhutdinov, and Aviral Kumar. Optimizing test-time compute via meta reinforcement fine-tuning. arXiv preprint arXiv:2503.07572, 2025.
- [34] Marthe Ballon, Andres Algaba, and Vincent Ginis. The relationship between reasoning and performance in large language models: o3 (mini) thinks harder, not longer. arXiv preprint arXiv:2502.15631, 2025.
- [35] Yang Yue, Zhiqi Chen, Rui Lu, Andrew Zhao, Zhaokai Wang, Shiji Song, and Gao Huang. Does reinforcement learning really incentivize reasoning capacity in LLMs beyond the base model? arXiv preprint arXiv:2504.13837, 2025.
- [36] Nikola Zubić, Federico Soldá, Aurelio Sulser, and Davide Scaramuzza. Limits of deep learning: Sequence modeling through the lens of complexity theory. arXiv preprint arXiv:2405.16674, 2024.
- [37] Benjamin Estermann, Luca A. Lanzendörfer, Yannick Niedermayr, and Roger Wattenhofer. PUZZLES: A benchmark for neural algorithmic reasoning, 2024.
- [38] Karthik Valmeekam, Alberto Olmo Hernandez, Sarath Sreedharan, and Subbarao Kambhampati. Large language models still can't plan (a benchmark for LLMs on planning and reasoning about change). CoRR, abs/2206.10498, 2022.
- [39] Anian Ruoss, Fabio Pardo, Harris Chan, Bonnie Li, Volodymyr Mnih, and Tim Genewein. LMAct: A benchmark for in-context imitation learning with long multimodal demonstrations. arXiv preprint arXiv:2412.01441, 2024.
- [40] Karthik Valmeekam, Kaya Stechly, and Subbarao Kambhampati. LLMs still can't plan; can LRMs? A preliminary evaluation of OpenAI's o1 on PlanBench, 2024.
- [41] Wenjie Ma, Jingxuan He, Charlie Snell, Tyler Griggs, Sewon Min, and Matei Zaharia. Reasoning models can be effective without thinking. arXiv preprint arXiv:2504.09858, 2025.
- [42] Hunter Lightman, Vineet Kosaraju, Yura Burda, Harri Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. Let's verify step by step. arXiv preprint arXiv:2305.20050, 2023.
- [43] Mathematical Association of America. American Invitational Mathematics Examination (AIME). https://maa.org/math-competitions/american-invitational-mathematics-examination-aime, 2025. Accessed: 2025-05-15.
- [44] Art of Problem Solving. AMC historical results – AIME I (February 1, 2024). https://artofproblemsolving.com/wiki/index.php/AMC_historical_results#AIME_I_.28February_1.2C_2024.29, 2024. Accessed: 2025-05-15.
- [45] Art of Problem Solving. AMC historical results – AIME I (February 6, 2025). https://artofproblemsolving.com/wiki/index.php/AMC_historical_results#AIME_I_.28February_6.2C_2025.29, 2025. Accessed: 2025-05-15.
- [46] Gary F. Marcus. The Algebraic Mind: Integrating Connectionism and Cognitive Science. MIT Press, 2003.
- [47] Saul Amarel. On representations of problems of reasoning about actions. In Readings in Artificial Intelligence, pages 2–22. Elsevier, 1981.
- [48] Günter Rote. Crossing the bridge at night. Bulletin of the EATCS, 78:241, 2002.
- [49] Xinran Zhao, Hanie Sedghi, Bernd Bohnet, Dale Schuurmans, and Azade Nova. Improving large language model planning with action sequence similarity. arXiv preprint arXiv:2505.01009, 2025.
- [50] Xiaomin Li, Zhou Yu, Zhiwei Zhang, Xupeng Chen, Ziji Zhang, Yingying Zhuang, Narayanan Sadagopan, and Anurag Beniwal. When thinking fails: The pitfalls of reasoning for instruction-following in LLMs. arXiv preprint arXiv:2505.11423, 2025.