Recognition: 2 theorem links
Less is More: Recursive Reasoning with Tiny Networks
Pith reviewed 2026-05-15 04:49 UTC · model grok-4.3
The pith
A two-layer recursive network with 7 million parameters reaches 45 percent accuracy on ARC-AGI-1, surpassing most large language models.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
TRM consists of a single tiny neural network with only two layers that recurses on the current state of a puzzle, producing successive refinements until a solution emerges. Trained on approximately one thousand examples, this model reaches 45 percent accuracy on the ARC-AGI-1 test set and 8 percent on ARC-AGI-2, outperforming the earlier Hierarchical Reasoning Model and most of the cited large language models, which contain more than ten thousand times as many parameters.
What carries the argument
The Tiny Recursive Model (TRM), a single two-layer network that is applied repeatedly, refining a latent state and the current answer at each step.
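A minimal sketch of that recursive refinement pattern, not the authors' implementation: one tiny network is applied again and again, first to update a latent scratch state and then to update the current answer. The residual MLP, the way input, answer, and latent are combined, and the tensor shapes are illustrative assumptions; the loop counts (n = 6 inner updates, T = 3 outer cycles) follow the hyper-parameter snippet quoted in the reference graph below.

```python
import torch
import torch.nn as nn

class TinyNet(nn.Module):
    """Stand-in for the paper's tiny 2-layer network (architecture assumed)."""
    def __init__(self, dim: int):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        return h + self.mlp(h)  # residual update

def trm_refine(f: nn.Module, x: torch.Tensor, n: int = 6, T: int = 3) -> torch.Tensor:
    """Refine an answer embedding y by recursing a single network f on (x, y, z)."""
    y = torch.zeros_like(x)  # current answer embedding
    z = torch.zeros_like(x)  # latent reasoning state
    for _ in range(T):        # outer refinement cycles
        for _ in range(n):    # inner latent updates (higher frequency)
            z = f(x + y + z)
        y = f(y + z)          # update the answer from the latent (lower frequency)
    return y

if __name__ == "__main__":
    dim, cells = 64, 30 * 30          # e.g. one embedding per cell of a 30x30 ARC grid
    f = TinyNet(dim)
    x = torch.randn(1, cells, dim)
    print(trm_refine(f, x).shape)     # torch.Size([1, 900, 64])
```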
If this is right
- Recursive iteration on a fixed small network produces measurable gains on visual reasoning benchmarks without added parameters.
- Training sets of roughly one thousand examples suffice for nontrivial generalization on ARC-style tasks when recursion is used.
- A single-network recursive design can exceed the performance of an earlier two-network hierarchical design on the same puzzles.
- High accuracy on Sudoku, Maze, and ARC-AGI is achievable with total model size under 10 million parameters.
Where Pith is reading between the lines
- The same recursive loop could be tested on other step-by-step tasks such as theorem proving or program synthesis.
- If recursion depth can be learned or scheduled, further reductions in required parameters may be possible; a minimal stopping-rule sketch follows this list.
- The approach invites direct comparisons of iteration count versus parameter count across a wider range of benchmarks.
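One way the "learned or scheduled recursion depth" speculation above could be made concrete, offered purely as an assumption and not something the paper implements: keep recursing until successive answers stop changing by more than a tolerance, and report how many cycles were used. The update rule mirrors the sketch under "What carries the argument"; the convergence criterion and tolerance are illustrative choices.

```python
import torch

def refine_until_stable(f, x, n: int = 6, max_cycles: int = 16, tol: float = 1e-3):
    """Run the recursive refinement loop until the answer stops changing.

    f is any module mapping an embedding to an embedding (e.g. the TinyNet sketch above).
    Returns the final answer embedding and the number of cycles actually used.
    """
    y = torch.zeros_like(x)
    z = torch.zeros_like(x)
    for cycle in range(1, max_cycles + 1):
        for _ in range(n):
            z = f(x + y + z)
        y_new = f(y + z)
        rel_change = torch.norm(y_new - y) / (torch.norm(y) + 1e-8)
        y = y_new
        if rel_change < tol:
            return y, cycle          # stop early once refinements stabilize
    return y, max_cycles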
Load-bearing premise
The accuracy numbers obtained by TRM can be compared directly to the accuracies reported for the much larger language models despite differences in training data and evaluation protocols.
What would settle it
Re-evaluate the trained TRM on a fresh ARC-AGI test split whose puzzle distributions differ markedly from the original training set and check whether accuracy on ARC-AGI-1 falls below 25 percent.
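A minimal sketch of how that check could be scored, assuming a trained model exposed as a `predict(grid) -> grid` callable and a fresh split loaded as (input, target) grid pairs, e.g. lists of lists; the names and loader are hypothetical, and only the 25 percent threshold comes from the sentence above.

```python
def exact_match_accuracy(predict, tasks):
    """ARC-style scoring: a task counts only if the predicted grid matches exactly."""
    correct = sum(1 for x, target in tasks if predict(x) == target)
    return correct / max(len(tasks), 1)

# Hypothetical usage:
# fresh_split = load_fresh_arc_split()            # puzzles with a shifted distribution
# acc = exact_match_accuracy(trained_trm.predict, fresh_split)
# print(f"ARC-AGI-1 accuracy on fresh split: {acc:.1%}; below 25%? {acc < 0.25}")
```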
Original abstract
Hierarchical Reasoning Model (HRM) is a novel approach using two small neural networks recursing at different frequencies. This biologically inspired method beats Large Language models (LLMs) on hard puzzle tasks such as Sudoku, Maze, and ARC-AGI while trained with small models (27M parameters) on small data (around 1000 examples). HRM holds great promise for solving hard problems with small networks, but it is not yet well understood and may be suboptimal. We propose Tiny Recursive Model (TRM), a much simpler recursive reasoning approach that achieves significantly higher generalization than HRM, while using a single tiny network with only 2 layers. With only 7M parameters, TRM obtains 45% test-accuracy on ARC-AGI-1 and 8% on ARC-AGI-2, higher than most LLMs (e.g., Deepseek R1, o3-mini, Gemini 2.5 Pro) with less than 0.01% of the parameters.
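For contrast with the single-network loop sketched earlier, here is one minimal reading of the HRM recursion described in the abstract: a low-level network f_L updates its state several times per cycle while a high-level network f_H updates once per cycle. The step counts follow the "n = 2, T = 2" snippet quoted in the reference graph below; the state handling and the way the states are combined are assumptions for illustration only.

```python
import torch

def hrm_recursion(f_L, f_H, x, cycles: int = 2, low_steps: int = 2):
    """Two networks recursing at different frequencies, per the abstract's description.

    f_L and f_H are any modules mapping an embedding to an embedding
    (e.g. two instances of the TinyNet sketch above).
    """
    z_L = torch.zeros_like(x)   # fast, low-level state
    z_H = torch.zeros_like(x)   # slow, high-level state
    for _ in range(cycles):
        for _ in range(low_steps):
            z_L = f_L(x + z_L + z_H)   # high-frequency, low-level updates
        z_H = f_H(z_H + z_L)           # low-frequency, high-level update
    return z_H
```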
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces the Tiny Recursive Model (TRM), a simplified single-network recursive reasoning architecture with only 2 layers and 7M parameters. It claims that TRM, trained on around 1000 examples, reaches 45% test accuracy on ARC-AGI-1 and 8% on ARC-AGI-2, outperforming most LLMs (Deepseek R1, o3-mini, Gemini 2.5 Pro) while using <0.01% of their parameters, and that it improves upon the prior Hierarchical Reasoning Model (HRM) on Sudoku, Maze, and ARC-AGI tasks.
Significance. If the empirical claims are verified with rigorous controls, the result would be significant: it would show that biologically inspired recursive reasoning in tiny networks can deliver strong generalization on hard puzzle benchmarks using orders-of-magnitude less data and compute than LLMs, offering a concrete counter-example to pure scaling and opening avenues for efficient, interpretable reasoning systems.
major comments (2)
- Abstract: the central performance claims (45% ARC-AGI-1, 8% ARC-AGI-2) are stated without any description of the train/eval/test partitioning, whether the ~1000 training examples are strictly disjoint from the reported test sets, or confirmation of no leakage from ARC training tasks; this directly undermines the generalization interpretation relative to zero-shot LLM baselines.
- Abstract and §1: the comparison to LLMs (o3-mini, Gemini 2.5 Pro, etc.) is presented as direct superiority, yet TRM receives task-specific gradient updates on ~1000 examples while the cited LLMs are evaluated zero- or few-shot; no section clarifies that the evaluation regimes are equivalent, making the parameter-efficiency claim load-bearing but currently unsupported.
minor comments (1)
- Abstract: the statement 'higher than most LLMs' should be qualified with the exact subset of models and conditions under which the comparison holds.
Simulated Author's Rebuttal
We thank the referee for the constructive comments on evaluation clarity. We have revised the manuscript to explicitly describe the data partitioning, confirm disjoint splits, and distinguish the training regimes from LLM baselines.
Point-by-point responses
-
Referee: Abstract: the central performance claims (45% ARC-AGI-1, 8% ARC-AGI-2) are stated without any description of the train/eval/test partitioning, whether the ~1000 training examples are strictly disjoint from the reported test sets, or confirmation of no leakage from ARC training tasks; this directly undermines the generalization interpretation relative to zero-shot LLM baselines.
Authors: We agree that the original abstract omitted these details. In the revised version we have updated the abstract and added a dedicated paragraph in Section 3 to state: TRM is trained on approximately 1000 examples drawn from the public ARC-AGI training tasks and evaluated on the official test set, which consists of entirely disjoint tasks never seen during training. We explicitly confirm no leakage occurs because the test tasks are held out and the model never accesses ARC test data or private splits during any stage of training or validation. Revision: yes.
-
Referee: Abstract and §1: the comparison to LLMs (o3-mini, Gemini 2.5 Pro, etc.) is presented as direct superiority, yet TRM receives task-specific gradient updates on ~1000 examples while the cited LLMs are evaluated zero- or few-shot; no section clarifies that the evaluation regimes are equivalent, making the parameter-efficiency claim load-bearing but currently unsupported.
Authors: We thank the referee for noting the regime difference. Our comparison is deliberately between a task-specifically trained tiny model and zero/few-shot LLMs to illustrate parameter and data efficiency. We have revised the abstract and Section 1 to explicitly state that TRM receives gradient updates on ~1000 task-specific examples while the cited LLMs are evaluated without any ARC-AGI fine-tuning. This clarification makes the efficiency claim precise rather than claiming identical protocols; the result still shows that a 7M-parameter model trained on limited data can exceed the performance of much larger models used in their standard inference setting. Revision: yes.
Circularity Check
No circularity: purely empirical performance claims
Full rationale
The paper reports test accuracies for the proposed TRM architecture on ARC-AGI benchmarks. No derivation chain, first-principles predictions, fitted parameters renamed as outputs, or self-referential equations are present. Central results are direct measurements of generalization on held-out tasks rather than reductions to inputs by construction. Self-citations (if any) to prior HRM work are not load-bearing for any claimed derivation.
Axiom & Free-Parameter Ledger
Lean theorems connected to this paper
-
IndisputableMonolith.Foundation.HierarchyEmergence.hierarchy_emergence_forces_phi (tag: unclear)
The relation between this paper passage and the cited Recognition theorem is unclear.
Paper passage: "With only 7M parameters, TRM obtains 45% test-accuracy on ARC-AGI-1 and 8% on ARC-AGI-2, higher than most LLMs... using a single tiny network with only 2 layers"
-
IndisputableMonolith.Foundation.EightTick.eight_tick_forces_D3 (tag: unclear)
The relation between this paper passage and the cited Recognition theorem is unclear.
Paper passage: "recursive hierarchical reasoning consists of recursing multiple times through two small networks (fL at high frequency and fH at low frequency)"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 21 Pith papers
-
Stability and Generalization in Looped Transformers
Looped transformers with recall and outer normalization produce reachable, input-dependent fixed points with stable gradients, enabling generalization, while those without recall cannot; a new internal recall variant ...
-
LoopUS: Recasting Pretrained LLMs into Looped Latent Refinement Models
LoopUS converts pretrained LLMs into looped latent refinement models via block decomposition, selective gating, random deep supervision, and confidence-based early exiting to improve reasoning performance.
-
A Mechanistic Analysis of Looped Reasoning Language Models
Looped LLMs converge to distinct cyclic fixed points per layer, repeating feedforward-style inference stages across recurrences.
-
Memory-Efficient Looped Transformer: Decoupling Compute from Memory in Looped Language Models
MELT decouples reasoning depth from memory in looped LLMs by sharing a single gated KV cache per layer and using two-phase chunk-wise distillation from Ouro, delivering constant memory use while matching or beating st...
-
The Thinking Pixel: Recursive Sparse Reasoning in Multimodal Diffusion Latents
A recursive sparse MoE framework integrated into diffusion models iteratively refines visual tokens via gated module selection to improve structured reasoning and image generation performance.
-
Universal Transformers Need Memory: Depth-State Trade-offs in Adaptive Recursive Reasoning
Memory tokens are required for non-trivial performance in adaptive Universal Transformers on Sudoku-Extreme, with 8-32 tokens yielding stable 57% exact-match accuracy while trading off against ponder depth.
-
One Step Forward and K Steps Back: Better Reasoning with Denoising Recursion Models
Denoising Recursion Models train multi-step noise reversal in looped transformers and outperform the prior Tiny Recursion Model on ARC-AGI.
-
LASER: Low-Rank Activation SVD for Efficient Recursion
LASER tracks low-rank activation subspaces in recursive models via matrix-free SVD updates and fidelity resets to save 60% memory without accuracy loss.
-
Parcae: Scaling Laws For Stable Looped Language Models
Parcae stabilizes looped LLMs via spectral norm constraints on injection parameters, enabling power-law scaling for training FLOPs and saturating exponential scaling at test time that improves quality over fixed-depth...
-
Querying Structured Data Through Natural Language Using Language Models
Fine-tuning an 8B LLM with synthetic data enables accurate natural language querying of structured datasets like accessibility services in Spain, generalizing to new locations.
-
Thinking While Listening: Fast-Slow Recurrence for Long-Horizon Sequential Modeling
Fast-slow recurrence interleaves quick latent updates with slow observation processing to maintain coherent clustered representations over long horizons, improving out-of-distribution generalization versus LSTM, state...
-
bViT: Investigating Single-Block Recurrence in Vision Transformers for Image Recognition
A 12-step single-block recurrent ViT-B reaches accuracy comparable to a standard ViT-B on ImageNet-1K while using an order of magnitude fewer parameters.
-
Mela: Test-Time Memory Consolidation based on Transformation Hypothesis
Mela is a Transformer variant with a dual-frequency Hierarchical Memory Module and MemStack that performs test-time memory consolidation, outperforming baselines on long contexts.
-
State Representation and Termination for Recursive Reasoning Systems
Recursive reasoning systems can represent their state via an epistemic state graph and terminate when the linearized order-gap is non-degenerate near the fixed point, providing a local condition for when the stopping ...
-
Kuramoto Oscillatory Phase Encoding: Neuro-inspired Synchronization for Improved Learning Efficiency
KoPE adds Kuramoto-based oscillatory phase states and synchronization to Vision Transformers, improving training, parameter, and data efficiency on structured vision tasks.
-
Consolidation-Expansion Operator Mechanics:A Unified Framework for Adaptive Learning
OpMech defines the order-gap between consolidation and expansion operators as a real-time, trajectory-based signal for convergence and principled stopping in adaptive learning.
-
Consolidation-Expansion Operator Mechanics:A Unified Framework for Adaptive Learning
OpMech defines the order-gap as a computable non-commutativity measure between consolidation and expansion operators to provide real-time convergence signals and stopping rules in adaptive learning.
-
Measuring AI Reasoning: A Guide for Researchers
Reasoning in language models should be measured by the faithfulness and validity of their multi-step search processes and intermediate traces, not final-answer accuracy.
-
Dual-Track CoT: Budget-Aware Stepwise Guidance for Small LMs
Dual-Track CoT lets small language models perform reliable multi-step reasoning with the same or fewer tokens via budget tracking and rejection of redundant steps.
-
LIFE -- an energy efficient advanced continual learning agentic AI framework for frontier systems
LIFE is a proposed agentic framework that combines four components to enable incremental, flexible, and energy-efficient continual learning for HPC operations such as latency spike mitigation.
-
S-AI-Recursive: A Bio-Inspired and Temporal Sparse AI Architecture for Iterative, Introspective, and Energy-Frugal Reasoning
S-AI-Recursive operationalizes reasoning as a closed-loop hormonal iteration with Clarifine and Confusionin to reach stable equilibrium, achieving competitive benchmark performance with under 10 million parameters via...
Reference graph
Works this paper leans on
-
[1]
The Hidden Drivers of HRM’s Performance on ARC-AGI
ARC Prize Foundation. The Hidden Drivers of HRM's Performance on ARC-AGI. https://arcprize.org/blog/hrm-analysis, 2025a. [Online; accessed 2025-09-15]. ARC Prize Foundation. ARC-AGI Leaderboard. https://arcprize.org/leaderboard, 2025b. [Online; accessed 2025-09-24]. Bai, S., Kolter, J. Z., and Koltun, V. Deep equilibrium models. Advances in neural info...
-
[2]
Large Scale GAN Training for High Fidelity Natural Image Synthesis
Brock, A., Donahue, J., and Simonyan, K. Large scale GAN training for high fidelity natural image synthesis. arXiv preprint arXiv:1809.11096.
-
[3]
On the Measure of Intelligence
Chollet, F. On the measure of intelligence. arXiv preprint arXiv:1911.01547.
-
[4]
ARC-AGI-2: A New Challenge for Frontier AI Reasoning Systems
Chollet, F., Knoop, M., Kamradt, G., Landers, B., and Pinkard, H. ARC-AGI-2: A new challenge for frontier AI reasoning systems. arXiv preprint arXiv:2505.11831.
-
[5]
Geng, Z. and Kolter, J. Z. TorchDEQ: A library for deep equilibrium models. arXiv preprint arXiv:2310.18605.
-
[6]
Gaussian Error Linear Units (GELUs)
Hendrycks, D. and Gimpel, K. Gaussian error linear units (GELUs). arXiv preprint arXiv:1606.08415.
-
[7]
Hierarchical graph generation with k2-trees
Jang, Y., Kim, D., and Ahn, S. Hierarchical graph generation with K2-trees. In ICML 2023 Workshop on Structured Probabilistic Inference & Generative Modeling, 2023.
-
[8]
Scaling Laws for Neural Language Models
Kaplan, J., McCandlish, S., Henighan, T., Brown, T. B., Chess, B., Child, R., Gray, S., Radford, A., Wu, J., and Amodei, D. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361.
-
[9]
Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
-
[10]
Lehnert, L., Sukhbaatar, S., Su, D., Zheng, Q., Mcvay, P., Rabbat, M., and Tian, Y. Beyond A*: Better planning with transformers via search dynamics bootstrapping. arXiv preprint arXiv:2402.14083.
-
[11]
Decoupled Weight Decay Regularization
Loshchilov, I. and Hutter, F. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101.
-
[12]
Moskvichev, A., Odouard, V. V., and Mitchell, M. The ConceptARC benchmark: Evaluating understanding and generalization in the ARC domain. arXiv preprint arXiv:2305.07141.
-
[13]
Prieto, L., Barsbey, M., Mediano, P. A., and Birdal, T. Grokking at the edge of numerical stability. arXiv preprint arXiv:2501.04697.
-
[14]
GLU Variants Improve Transformer
Shazeer, N. GLU variants improve transformer. arXiv preprint arXiv:2002.05202.
-
[15]
Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer
Shazeer, N., Mirhoseini, A., Maziarz, K., Davis, A., Le, Q., Hinton, G., and Dean, J. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. arXiv preprint arXiv:1701.06538.
-
[16]
Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters
Snell, C., Lee, J., Xu, K., and Kumar, A. Scaling LLM test-time compute optimally can be more effective than scaling model parameters. arXiv preprint arXiv:2408.03314.
-
[17]
Wang, G., Li, J., Sun, Y., Chen, X., Liu, C., Wu, Y., Lu, M., Song, S., and Yadkori, Y. A. Hierarchical reasoning model. arXiv preprint arXiv:2506.21734.
-
[18]
Hyper-parameters and setup: All models are trained with the AdamW optimizer (Loshchilov & Hutter, 2017; Kingma & Ba, ...
-
[19]
TRM uses an Exponential Moving Average (EMA) of 0.999
TRM uses an Exponential Moving Average (EMA) of 0.999 for improved stability. HRM uses n = 2, T = 2 with two 4-layer networks, while we use n = 6, T = 3 with one 2-layer network. For Sudoku-Extreme and Maze-Hard, we train for 60k epochs with learning rate 1e-4 and weight decay 1.0. For ARC-AGI, we train for 100K epochs with learning rate 1e-4 (with 1e-2 learnin...
-
[20]
This would provide a better justification for the 1-step gradient approximation
to replace the recursion steps by fixed-point iteration as done by Deep Equilibrium Models (Bai et al., 2019). This would provide a better justification for the 1-step gradient approximation. However, this slowed down training due to the fixed-point iteration and led to worse generalization. This highlights the fact that converging to a fixed-point is not...
discussion (0)