Recognition: 2 theorem links
Less is More: Recursive Reasoning with Tiny Networks
Pith reviewed 2026-05-15 04:49 UTC · model grok-4.3
The pith
A two-layer recursive network with 7 million parameters reaches 45 percent accuracy on ARC-AGI-1, surpassing most large language models.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
TRM consists of a single tiny neural network with only two layers that recurses on the current state of a puzzle, producing successive refinements until a solution emerges. Trained on approximately one thousand examples, this model reaches 45 percent accuracy on the ARC-AGI-1 test set and 8 percent on ARC-AGI-2, outperforming the earlier Hierarchical Reasoning Model and most of the cited large language models, which contain more than ten thousand times as many parameters.
What carries the argument
The Tiny Recursive Model (TRM), a single two-layer network that is applied repeatedly, refining a latent state and the current answer at each step.
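A minimal sketch of that recursive refinement pattern, not the authors' implementation: one tiny network is applied again and again, first to update a latent scratch state and then to update the current answer. The residual MLP, the way input, answer, and latent are combined, and the tensor shapes are illustrative assumptions; the loop counts (n = 6 inner updates, T = 3 outer cycles) follow the hyper-parameter snippet quoted in the reference graph below.

```python
import torch
import torch.nn as nn

class TinyNet(nn.Module):
    """Stand-in for the paper's tiny 2-layer network (architecture assumed)."""
    def __init__(self, dim: int):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        return h + self.mlp(h)  # residual update

def trm_refine(f: nn.Module, x: torch.Tensor, n: int = 6, T: int = 3) -> torch.Tensor:
    """Refine an answer embedding y by recursing a single network f on (x, y, z)."""
    y = torch.zeros_like(x)  # current answer embedding
    z = torch.zeros_like(x)  # latent reasoning state
    for _ in range(T):        # outer refinement cycles
        for _ in range(n):    # inner latent updates (higher frequency)
            z = f(x + y + z)
        y = f(y + z)          # update the answer from the latent (lower frequency)
    return y

if __name__ == "__main__":
    dim, cells = 64, 30 * 30          # e.g. one embedding per cell of a 30x30 ARC grid
    f = TinyNet(dim)
    x = torch.randn(1, cells, dim)
    print(trm_refine(f, x).shape)     # torch.Size([1, 900, 64])
```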
If this is right
- Recursive iteration on a fixed small network produces measurable gains on visual reasoning benchmarks without added parameters.
- Training sets of roughly one thousand examples suffice for nontrivial generalization on ARC-style tasks when recursion is used.
- A single-network recursive design can exceed the performance of an earlier two-network hierarchical design on the same puzzles.
- High accuracy on Sudoku, Maze, and ARC-AGI is achievable with total model size under 10 million parameters.
Where Pith is reading between the lines
- The same recursive loop could be tested on other step-by-step tasks such as theorem proving or program synthesis.
- If recursion depth can be learned or scheduled, further reductions in required parameters may be possible; a minimal stopping-rule sketch follows this list.
- The approach invites direct comparisons of iteration count versus parameter count across a wider range of benchmarks.
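One way the "learned or scheduled recursion depth" speculation above could be made concrete, offered purely as an assumption and not something the paper implements: keep recursing until successive answers stop changing by more than a tolerance, and report how many cycles were used. The update rule mirrors the sketch under "What carries the argument"; the convergence criterion and tolerance are illustrative choices.

```python
import torch

def refine_until_stable(f, x, n: int = 6, max_cycles: int = 16, tol: float = 1e-3):
    """Run the recursive refinement loop until the answer stops changing.

    f is any module mapping an embedding to an embedding (e.g. the TinyNet sketch above).
    Returns the final answer embedding and the number of cycles actually used.
    """
    y = torch.zeros_like(x)
    z = torch.zeros_like(x)
    for cycle in range(1, max_cycles + 1):
        for _ in range(n):
            z = f(x + y + z)
        y_new = f(y + z)
        rel_change = torch.norm(y_new - y) / (torch.norm(y) + 1e-8)
        y = y_new
        if rel_change < tol:
            return y, cycle          # stop early once refinements stabilize
    return y, max_cycles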
Load-bearing premise
The accuracy numbers obtained by TRM can be compared directly to the accuracies reported for the much larger language models despite differences in training data and evaluation protocols.
What would settle it
Re-evaluate the trained TRM on a fresh ARC-AGI test split whose puzzle distributions differ markedly from the original training set and check whether accuracy on ARC-AGI-1 falls below 25 percent.
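A minimal sketch of how that check could be scored, assuming a trained model exposed as a `predict(grid) -> grid` callable and a fresh split loaded as (input, target) grid pairs, e.g. lists of lists; the names and loader are hypothetical, and only the 25 percent threshold comes from the sentence above.

```python
def exact_match_accuracy(predict, tasks):
    """ARC-style scoring: a task counts only if the predicted grid matches exactly."""
    correct = sum(1 for x, target in tasks if predict(x) == target)
    return correct / max(len(tasks), 1)

# Hypothetical usage:
# fresh_split = load_fresh_arc_split()            # puzzles with a shifted distribution
# acc = exact_match_accuracy(trained_trm.predict, fresh_split)
# print(f"ARC-AGI-1 accuracy on fresh split: {acc:.1%}; below 25%? {acc < 0.25}")
```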
Original abstract
Hierarchical Reasoning Model (HRM) is a novel approach using two small neural networks recursing at different frequencies. This biologically inspired method beats Large Language models (LLMs) on hard puzzle tasks such as Sudoku, Maze, and ARC-AGI while trained with small models (27M parameters) on small data (around 1000 examples). HRM holds great promise for solving hard problems with small networks, but it is not yet well understood and may be suboptimal. We propose Tiny Recursive Model (TRM), a much simpler recursive reasoning approach that achieves significantly higher generalization than HRM, while using a single tiny network with only 2 layers. With only 7M parameters, TRM obtains 45% test-accuracy on ARC-AGI-1 and 8% on ARC-AGI-2, higher than most LLMs (e.g., Deepseek R1, o3-mini, Gemini 2.5 Pro) with less than 0.01% of the parameters.
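For contrast with the single-network loop sketched earlier, here is one minimal reading of the HRM recursion described in the abstract: a low-level network f_L updates its state several times per cycle while a high-level network f_H updates once per cycle. The step counts follow the "n = 2, T = 2" snippet quoted in the reference graph below; the state handling and the way the states are combined are assumptions for illustration only.

```python
import torch

def hrm_recursion(f_L, f_H, x, cycles: int = 2, low_steps: int = 2):
    """Two networks recursing at different frequencies, per the abstract's description.

    f_L and f_H are any modules mapping an embedding to an embedding
    (e.g. two instances of the TinyNet sketch above).
    """
    z_L = torch.zeros_like(x)   # fast, low-level state
    z_H = torch.zeros_like(x)   # slow, high-level state
    for _ in range(cycles):
        for _ in range(low_steps):
            z_L = f_L(x + z_L + z_H)   # high-frequency, low-level updates
        z_H = f_H(z_H + z_L)           # low-frequency, high-level update
    return z_H
```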
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces the Tiny Recursive Model (TRM), a simplified single-network recursive reasoning architecture with only 2 layers and 7M parameters. It claims that TRM, trained on around 1000 examples, reaches 45% test accuracy on ARC-AGI-1 and 8% on ARC-AGI-2, outperforming most LLMs (Deepseek R1, o3-mini, Gemini 2.5 Pro) while using <0.01% of their parameters, and that it improves upon the prior Hierarchical Reasoning Model (HRM) on Sudoku, Maze, and ARC-AGI tasks.
Significance. If the empirical claims are verified with rigorous controls, the result would be significant: it would show that biologically inspired recursive reasoning in tiny networks can deliver strong generalization on hard puzzle benchmarks using orders-of-magnitude less data and compute than LLMs, offering a concrete counter-example to pure scaling and opening avenues for efficient, interpretable reasoning systems.
major comments (2)
- Abstract: the central performance claims (45% ARC-AGI-1, 8% ARC-AGI-2) are stated without any description of the train/eval/test partitioning, whether the ~1000 training examples are strictly disjoint from the reported test sets, or confirmation of no leakage from ARC training tasks; this directly undermines the generalization interpretation relative to zero-shot LLM baselines.
- Abstract and §1: the comparison to LLMs (o3-mini, Gemini 2.5 Pro, etc.) is presented as direct superiority, yet TRM receives task-specific gradient updates on ~1000 examples while the cited LLMs are evaluated zero- or few-shot; no section clarifies that the evaluation regimes are equivalent, making the parameter-efficiency claim load-bearing but currently unsupported.
minor comments (1)
- Abstract: the statement 'higher than most LLMs' should be qualified with the exact subset of models and conditions under which the comparison holds.
Simulated Author's Rebuttal
We thank the referee for the constructive comments on evaluation clarity. We have revised the manuscript to explicitly describe the data partitioning, confirm disjoint splits, and distinguish the training regimes from LLM baselines.
Point-by-point responses
-
Referee: Abstract: the central performance claims (45% ARC-AGI-1, 8% ARC-AGI-2) are stated without any description of the train/eval/test partitioning, whether the ~1000 training examples are strictly disjoint from the reported test sets, or confirmation of no leakage from ARC training tasks; this directly undermines the generalization interpretation relative to zero-shot LLM baselines.
Authors: We agree that the original abstract omitted these details. In the revised version we have updated the abstract and added a dedicated paragraph in Section 3 to state: TRM is trained on approximately 1000 examples drawn from the public ARC-AGI training tasks and evaluated on the official test set, which consists of entirely disjoint tasks never seen during training. We explicitly confirm no leakage occurs because the test tasks are held out and the model never accesses ARC test data or private splits during any stage of training or validation. Revision: yes.
-
Referee: Abstract and §1: the comparison to LLMs (o3-mini, Gemini 2.5 Pro, etc.) is presented as direct superiority, yet TRM receives task-specific gradient updates on ~1000 examples while the cited LLMs are evaluated zero- or few-shot; no section clarifies that the evaluation regimes are equivalent, making the parameter-efficiency claim load-bearing but currently unsupported.
Authors: We thank the referee for noting the regime difference. Our comparison is deliberately between a task-specifically trained tiny model and zero/few-shot LLMs to illustrate parameter and data efficiency. We have revised the abstract and Section 1 to explicitly state that TRM receives gradient updates on ~1000 task-specific examples while the cited LLMs are evaluated without any ARC-AGI fine-tuning. This clarification makes the efficiency claim precise rather than claiming identical protocols; the result still shows that a 7M-parameter model trained on limited data can exceed the performance of much larger models used in their standard inference setting. Revision: yes.
Circularity Check
No circularity: purely empirical performance claims
Full rationale
The paper reports test accuracies for the proposed TRM architecture on ARC-AGI benchmarks. No derivation chain, first-principles predictions, fitted parameters renamed as outputs, or self-referential equations are present. Central results are direct measurements of generalization on held-out tasks rather than reductions to inputs by construction. Self-citations (if any) to prior HRM work are not load-bearing for any claimed derivation.
Axiom & Free-Parameter Ledger
Lean theorems connected to this paper
-
IndisputableMonolith.Foundation.HierarchyEmergence.hierarchy_emergence_forces_phi (tag: unclear)
The relation between this paper passage and the cited Recognition theorem is unclear.
Paper passage: "With only 7M parameters, TRM obtains 45% test-accuracy on ARC-AGI-1 and 8% on ARC-AGI-2, higher than most LLMs... using a single tiny network with only 2 layers"
-
IndisputableMonolith.Foundation.EightTick.eight_tick_forces_D3 (tag: unclear)
The relation between this paper passage and the cited Recognition theorem is unclear.
Paper passage: "recursive hierarchical reasoning consists of recursing multiple times through two small networks (fL at high frequency and fH at low frequency)"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 21 Pith papers
-
Stability and Generalization in Looped Transformers
Looped transformers with recall and outer normalization produce reachable, input-dependent fixed points with stable gradients, enabling generalization, while those without recall cannot; a new internal recall variant ...
-
LoopUS: Recasting Pretrained LLMs into Looped Latent Refinement Models
LoopUS converts pretrained LLMs into looped latent refinement models via block decomposition, selective gating, random deep supervision, and confidence-based early exiting to improve reasoning performance.
-
A Mechanistic Analysis of Looped Reasoning Language Models
Looped LLMs converge to distinct cyclic fixed points per layer, repeating feedforward-style inference stages across recurrences.
-
Memory-Efficient Looped Transformer: Decoupling Compute from Memory in Looped Language Models
MELT decouples reasoning depth from memory in looped LLMs by sharing a single gated KV cache per layer and using two-phase chunk-wise distillation from Ouro, delivering constant memory use while matching or beating st...
-
The Thinking Pixel: Recursive Sparse Reasoning in Multimodal Diffusion Latents
A recursive sparse MoE framework integrated into diffusion models iteratively refines visual tokens via gated module selection to improve structured reasoning and image generation performance.
-
Universal Transformers Need Memory: Depth-State Trade-offs in Adaptive Recursive Reasoning
Memory tokens are required for non-trivial performance in adaptive Universal Transformers on Sudoku-Extreme, with 8-32 tokens yielding stable 57% exact-match accuracy while trading off against ponder depth.
-
One Step Forward and K Steps Back: Better Reasoning with Denoising Recursion Models
Denoising Recursion Models train multi-step noise reversal in looped transformers and outperform the prior Tiny Recursion Model on ARC-AGI.
-
LASER: Low-Rank Activation SVD for Efficient Recursion
LASER tracks low-rank activation subspaces in recursive models via matrix-free SVD updates and fidelity resets to save 60% memory without accuracy loss.
-
Parcae: Scaling Laws For Stable Looped Language Models
Parcae stabilizes looped LLMs via spectral norm constraints on injection parameters, enabling power-law scaling for training FLOPs and saturating exponential scaling at test time that improves quality over fixed-depth...
-
Querying Structured Data Through Natural Language Using Language Models
Fine-tuning an 8B LLM with synthetic data enables accurate natural language querying of structured datasets like accessibility services in Spain, generalizing to new locations.
-
Thinking While Listening: Fast-Slow Recurrence for Long-Horizon Sequential Modeling
Fast-slow recurrence interleaves quick latent updates with slow observation processing to maintain coherent clustered representations over long horizons, improving out-of-distribution generalization versus LSTM, state...
-
bViT: Investigating Single-Block Recurrence in Vision Transformers for Image Recognition
A 12-step single-block recurrent ViT-B reaches accuracy comparable to a standard ViT-B on ImageNet-1K while using an order of magnitude fewer parameters.
-
Mela: Test-Time Memory Consolidation based on Transformation Hypothesis
Mela is a Transformer variant with a dual-frequency Hierarchical Memory Module and MemStack that performs test-time memory consolidation, outperforming baselines on long contexts.
-
State Representation and Termination for Recursive Reasoning Systems
Recursive reasoning systems can represent their state via an epistemic state graph and terminate when the linearized order-gap is non-degenerate near the fixed point, providing a local condition for when the stopping ...
-
Kuramoto Oscillatory Phase Encoding: Neuro-inspired Synchronization for Improved Learning Efficiency
KoPE adds Kuramoto-based oscillatory phase states and synchronization to Vision Transformers, improving training, parameter, and data efficiency on structured vision tasks.
-
Consolidation-Expansion Operator Mechanics:A Unified Framework for Adaptive Learning
OpMech defines the order-gap between consolidation and expansion operators as a real-time, trajectory-based signal for convergence and principled stopping in adaptive learning.
-
Consolidation-Expansion Operator Mechanics:A Unified Framework for Adaptive Learning
OpMech defines the order-gap as a computable non-commutativity measure between consolidation and expansion operators to provide real-time convergence signals and stopping rules in adaptive learning.
-
Measuring AI Reasoning: A Guide for Researchers
Reasoning in language models should be measured by the faithfulness and validity of their multi-step search processes and intermediate traces, not final-answer accuracy.
-
Dual-Track CoT: Budget-Aware Stepwise Guidance for Small LMs
Dual-Track CoT lets small language models perform reliable multi-step reasoning with the same or fewer tokens via budget tracking and rejection of redundant steps.
-
LIFE -- an energy efficient advanced continual learning agentic AI framework for frontier systems
LIFE is a proposed agentic framework that combines four components to enable incremental, flexible, and energy-efficient continual learning for HPC operations such as latency spike mitigation.
-
S-AI-Recursive: A Bio-Inspired and Temporal Sparse AI Architecture for Iterative, Introspective, and Energy-Frugal Reasoning
S-AI-Recursive operationalizes reasoning as a closed-loop hormonal iteration with Clarifine and Confusionin to reach stable equilibrium, achieving competitive benchmark performance with under 10 million parameters via...
Reference graph
Works this paper leans on
-
[1]
The Hidden Drivers of HRM’s Performance on ARC-AGI
ARC Prize Foundation. The Hidden Drivers of HRM's Performance on ARC-AGI. https://arcprize.org/blog/hrm-analysis, 2025a. [Online; accessed 2025-09-15]. ARC Prize Foundation. ARC-AGI Leaderboard. https://arcprize.org/leaderboard, 2025b. [Online; accessed 2025-09-24]. Bai, S., Kolter, J. Z., and Koltun, V. Deep equilibrium models. Advances in neural info...
-
[2]
Large Scale GAN Training for High Fidelity Natural Image Synthesis
Brock, A., Donahue, J., and Simonyan, K. Large scale GAN training for high fidelity natural image synthesis. arXiv preprint arXiv:1809.11096.
-
[3]
On the Measure of Intelligence
Chollet, F. On the measure of intelligence. arXiv preprint arXiv:1911.01547.
-
[4]
ARC-AGI-2: A New Challenge for Frontier AI Reasoning Systems
Chollet, F., Knoop, M., Kamradt, G., Landers, B., and Pinkard, H. ARC-AGI-2: A new challenge for frontier AI reasoning systems. arXiv preprint arXiv:2505.11831.
-
[5]
Geng, Z. and Kolter, J. Z. TorchDEQ: A library for deep equilibrium models. arXiv preprint arXiv:2310.18605.
-
[6]
Gaussian Error Linear Units (GELUs)
Hendrycks, D. and Gimpel, K. Gaussian error linear units (GELUs). arXiv preprint arXiv:1606.08415.
-
[7]
Hierarchical graph generation with k2-trees
Jang, Y., Kim, D., and Ahn, S. Hierarchical graph generation with K2-trees. In ICML 2023 Workshop on Structured Probabilistic Inference & Generative Modeling, 2023.
-
[8]
Scaling Laws for Neural Language Models
Kaplan, J., McCandlish, S., Henighan, T., Brown, T. B., Chess, B., Child, R., Gray, S., Radford, A., Wu, J., and Amodei, D. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361.
-
[9]
Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
-
[10]
Lehnert, L., Sukhbaatar, S., Su, D., Zheng, Q., Mcvay, P., Rabbat, M., and Tian, Y. Beyond A*: Better planning with transformers via search dynamics bootstrapping. arXiv preprint arXiv:2402.14083.
-
[11]
Decoupled Weight Decay Regularization
Loshchilov, I. and Hutter, F. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101.
-
[12]
Moskvichev, A., Odouard, V. V., and Mitchell, M. The ConceptARC benchmark: Evaluating understanding and generalization in the ARC domain. arXiv preprint arXiv:2305.07141.
-
[13]
Prieto, L., Barsbey, M., Mediano, P. A., and Birdal, T. Grokking at the edge of numerical stability. arXiv preprint arXiv:2501.04697.
-
[14]
GLU Variants Improve Transformer
Shazeer, N. GLU variants improve transformer. arXiv preprint arXiv:2002.05202.
-
[15]
Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer
Shazeer, N., Mirhoseini, A., Maziarz, K., Davis, A., Le, Q., Hinton, G., and Dean, J. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. arXiv preprint arXiv:1701.06538.
-
[16]
Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters
Snell, C., Lee, J., Xu, K., and Kumar, A. Scaling LLM test-time compute optimally can be more effective than scaling model parameters. arXiv preprint arXiv:2408.03314.
-
[17]
Wang, G., Li, J., Sun, Y., Chen, X., Liu, C., Wu, Y., Lu, M., Song, S., and Yadkori, Y. A. Hierarchical reasoning model. arXiv preprint arXiv:2506.21734.
-
[18]
Hyper-parameters and setup: All models are trained with the AdamW optimizer (Loshchilov & Hutter, 2017; Kingma & Ba, ...
-
[19]
TRM uses an Exponential Moving Average (EMA) of 0.999
TRM uses an Exponential Moving Average (EMA) of 0.999 for improved stability. HRM uses n = 2, T = 2 with two 4-layer networks, while we use n = 6, T = 3 with one 2-layer network. For Sudoku-Extreme and Maze-Hard, we train for 60k epochs with learning rate 1e-4 and weight decay 1.0. For ARC-AGI, we train for 100K epochs with learning rate 1e-4 (with 1e-2 learnin...
-
[20]
This would provide a better justification for the 1-step gradient approximation
to replace the recursion steps by fixed-point iteration as done by Deep Equilibrium Models (Bai et al., 2019). This would provide a better justification for the 1-step gradient approximation. However, this slowed down training due to the fixed-point iteration and led to worse generalization. This highlights the fact that converging to a fixed-point is not...
discussion (0)