Equilibrium Reasoners: Learning Attractors Enables Scalable Reasoning

Benhao Huang; Zhengyang Geng; Zico Kolter

arxiv: 2605.21488 · v1 · pith:TLPOOVCSnew · submitted 2026-05-20 · 💻 cs.LG

Pith reviewed 2026-05-21 04:53 UTC · model grok-4.3

classification 💻 cs.LG

keywords: equilibrium reasoners · attractors · latent dynamical systems · test-time scaling · iterative reasoning · fixed points · sudoku

The pith

Neural networks learn task-conditioned attractors in latent space so that iterative updates at test time converge to valid solutions and scale reasoning performance.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that generalizable reasoning emerges when models are trained to form latent dynamical systems whose stable fixed points align with correct task solutions. If this holds, then simply running more internal iterations at test time lets the network refine its output adaptively, using greater depth for harder cases without external verifiers or hand-crafted rules. Experiments on Sudoku-Extreme demonstrate the effect: feedforward baselines reach only 2.6 percent accuracy, yet the same architecture unrolled to the equivalent of 40,000 layers exceeds 99 percent when convergence to solution attractors is strong. A sympathetic reader would care because the account supplies a concrete mechanism for why test-time scaling works and shows that performance gains track directly with attractor alignment rather than with raw iteration count alone.

Core claim

We hypothesize that generalizable reasoning arises from learning task-conditioned attractors: latent dynamical systems whose stable fixed points correspond to valid solutions. We formalize this process through Equilibrium Reasoners (EqR), which enable test-time scaling without external verifiers or task-specific priors. EqR scales internal dynamics along two axes: depth, by running more iterations, and breadth, by aggregating stochastic trajectories from multiple initializations. Empirically, gains from test-time scaling are tightly coupled with stronger convergence toward solution-aligned attractors. By unrolling up to the equivalent of 40,000 layers, scalable latent reasoning boosts accuracy from 2.6% for feedforward models to over 99% on Sudoku-Extreme.

What carries the argument

Equilibrium Reasoners (EqR), models whose latent updates implement task-conditioned dynamical systems that converge to stable fixed points at valid solutions.
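As a minimal sketch of the fixed-point view, assuming a toy update map rather than the paper's network: the hypothetical `toy_update` below is a simple contraction whose fixed point depends on the input `x`, and `iterate_to_fixed_point` repeats z ← f(z; x) while tracking the residual ∥f(z; x) − z∥. The function names, the tolerance, and the iteration budget are illustrative assumptions, not the authors' implementation.

```python
import math

def toy_update(z, x, rate=0.5):
    """Hypothetical stand-in for f_theta(z; x): pulls each coordinate of z toward x."""
    return [zi + rate * (xi - zi) for zi, xi in zip(z, x)]

def residual(z_next, z):
    """Euclidean fixed-point residual ||f(z; x) - z||."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(z_next, z)))

def iterate_to_fixed_point(f, z0, x, tol=1e-6, max_iters=10_000):
    """Repeat z <- f(z; x) until the residual drops below tol or the budget runs out."""
    z = list(z0)
    r = float("inf")
    for step in range(1, max_iters + 1):
        z_next = f(z, x)
        r = residual(z_next, z)
        z = z_next
        if r < tol:
            return z, step, r
    return z, max_iters, r

# The toy map has the input itself as its unique fixed point.
z_star, steps, r = iterate_to_fixed_point(toy_update, [0.0, 0.0], [1.0, -2.0])
```

In this toy setting the residual is label-free: it is computed from successive latent states only, which is the property the attractor account relies on.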

If this is right

Accuracy on hard instances rises as latent trajectories converge more closely to the solution attractor.
Simpler problems converge in few steps while difficult ones require and benefit from thousands of additional iterations.
Compute can be allocated adaptively by monitoring convergence speed during test-time execution.
No external verifiers or task-specific priors are required for the scaling effect to appear.
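The adaptive-compute point in the list above can be illustrated with a toy scalar system. This is only a sketch under stated assumptions: the `rate` parameter is a hypothetical difficulty knob (a weaker pull toward the fixed point stands in for a harder instance), and halting on a residual threshold is one simple way to monitor convergence, not necessarily the halting rule used in the paper.

```python
def steps_to_converge(rate, x=1.0, z0=0.0, tol=1e-6, max_iters=100_000):
    """Count iterations of z <- z + rate * (x - z) until |f(z) - z| < tol.

    `rate` is a hypothetical difficulty knob: a smaller rate means a weaker
    pull toward the fixed point, so more iterations are needed.
    """
    z = z0
    for step in range(1, max_iters + 1):
        z_next = z + rate * (x - z)
        if abs(z_next - z) < tol:
            return step
        z = z_next
    return max_iters

easy = steps_to_converge(rate=0.9)    # strong contraction: halts after a handful of steps
hard = steps_to_converge(rate=0.001)  # weak contraction: needs thousands of steps
```

The same stopping rule spends very different amounts of compute on the two instances, without any external signal about which one is harder.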

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same attractor-training recipe could be applied to other structured reasoning domains such as theorem proving or planning to produce analogous test-time scaling.
The perspective offers a possible dynamical explanation for why chain-of-thought or scratchpad methods improve performance in language models.
One could deliberately perturb training to misalign attractors and test whether reasoning accuracy collapses while feedforward accuracy remains intact.

Load-bearing premise

Generalizable reasoning requires learning task-conditioned attractors whose stable fixed points are exactly the valid solutions rather than arising from some other internal process.

What would settle it

Finding high accuracy on Sudoku-Extreme while the iterated latent states fail to converge to the known solution configurations, or observing that accuracy improvements decouple from measured convergence strength.
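A label-free validity check makes this test concrete. The sketch below assumes decoded outputs are 9 × 9 grids of integers and that `0` marks an empty cell in the puzzle; both helper names are hypothetical. It scores rule satisfaction without ever consulting the ground-truth solution, so it can be reported separately from label match.

```python
def is_valid_sudoku_solution(grid):
    """Return True if a 9x9 grid uses each digit 1-9 once per row, column, and 3x3 block."""
    if len(grid) != 9 or any(len(row) != 9 for row in grid):
        return False
    digits = set(range(1, 10))
    rows = [set(row) for row in grid]
    cols = [set(col) for col in zip(*grid)]
    blocks = [
        {grid[r][c] for r in range(br, br + 3) for c in range(bc, bc + 3)}
        for br in range(0, 9, 3)
        for bc in range(0, 9, 3)
    ]
    return all(group == digits for group in rows + cols + blocks)

def respects_clues(grid, puzzle):
    """Check that the decoded grid keeps every given clue (0 marks an empty cell)."""
    return all(
        p == 0 or p == g
        for puzzle_row, grid_row in zip(puzzle, grid)
        for p, g in zip(puzzle_row, grid_row)
    )

# A valid grid built from a standard shifting pattern, used only as a demo input.
demo = [[(3 * (r % 3) + r // 3 + c) % 9 + 1 for c in range(9)] for r in range(9)]

# Swapping two cells in one row keeps the row valid but breaks two columns.
broken = [row[:] for row in demo]
broken[0][0], broken[0][1] = broken[0][1], broken[0][0]

# A hypothetical puzzle that reveals only the first row of the demo grid.
puzzle = [demo[0][:]] + [[0] * 9 for _ in range(8)]
```

Running both checks on converged outputs would show whether fixed points satisfy the task constraints independently of the accuracy metric.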

Figures

Figures reproduced from arXiv: 2605.21488 by Benhao Huang, Zhengyang Geng, Zico Kolter.

**Figure 1.** Scalable test-time compute emerges from convergence to neural attractors. We evaluate Equilibrium Reasoners (EqRs) with varying inference budgets and plot exact accuracy against the fixed-point residual ∥fθ(z; x) − z∥ (where z denotes latent states; lower indicates better convergence). Point color encodes the number of iterations on a log scale. Remarkably, while training is capped at 16 iterations, the le…

**Figure 2.** Landscape alignment enables scalable and generalizable reasoning. The lower surface depicts the task-metric landscape over solutions, while the upper surface depicts the model’s learned internal landscape over latent states. We conceptualize EqR as learning a task-conditioned fixed-point system z⋆ = fθ(z⋆; x), whose attractors form an internal landscape in latent space. Training shapes this internal …
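The quantities in the captions above can be written out explicitly. The block below is a sketch that uses only the symbols already present in the captions (latent state z, input x, map fθ); the stability condition on the Jacobian is the standard sufficient condition for a discrete-time fixed point to be locally attracting, added here as background rather than as a claim taken from the paper.

```latex
% Iteration, residual, and fixed point (symbols as in the figure captions)
z_{t+1} = f_\theta(z_t; x), \qquad
r_t = \lVert f_\theta(z_t; x) - z_t \rVert = \lVert z_{t+1} - z_t \rVert, \qquad
z^\star = f_\theta(z^\star; x) \iff \lVert f_\theta(z^\star; x) - z^\star \rVert = 0 .

% Standard sufficient condition for z^\star to be a locally stable attractor
\rho\!\left( \left. \frac{\partial f_\theta(z; x)}{\partial z} \right|_{z = z^\star} \right) < 1 ,
\quad \text{where } \rho(\cdot) \text{ denotes the spectral radius.}
```

Up to an index shift, the residual r_t is the same quantity as the rollout residual ∥zt − zt−1∥ used in the Figure 11 caption.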

**Figure 3.** Pareto heatmaps illustrating two-axis test-time scaling with iterations ("depth", D), and stochastic trajectories from multiple initializations ("breadth", B). Left: fixed-point residual r = ∥fθ(z; x) − z∥ (power-law normalized), where x stands for model input and z for latent states. Right: prediction error (lower is better, log-scaled). Breadth scaling becomes effective only beyond a minimum number of mo…

**Figure 4.** Controlled construction path. We build a controlled path from a feedforward model to a scalable iterative reasoner by progressively adding five ingredients: (1) Weight-tied parameters across iteration steps; (2) Gradient truncation with detached carry to stabilize optimization through long trajectories and reduce cost; (3) Segmented online training to shape intermediate solver states through interleaved pa…

**Figure 5.** Datasets and Evaluations. We evaluate our models and baselines on Sudoku-Extreme, a challenging 9 × 9 Sudoku benchmark for long-horizon constraint satisfaction, and on Maze-Unique, a uniquely solvable variant of Maze-hard-1k designed to remove ambiguity caused by multiple shortest paths. Additional details are provided in Appendix C.

**Figure 6.** Four attractor landscape modes visualized by residual and task error. We run 512 random initializations over 256 Sudoku-Extreme examples, project trajectories to 2D with PCA, and color by sequence-level error: (a) no correct attractor; (b) correct and spurious attractors; (c) a narrow correct basin; (d) a broad, mostly unique correct attractor.

**Figure 7.** Test-time depth-scaling efficiency on Sudoku-Lite. At a matched exact-accuracy target of 92.99%, EqR requires 3.76× fewer NFEs than the baseline, and EqR+ACT improves this to 11.34× fewer NFEs. The gain is therefore not explained by increased test-time compute alone: the proposed training interventions make the same accuracy reachable with substantially less inference compute.

**Figure 8.** Breadth-scaling efficiency of aggregation rules on Sudoku. Expected accuracy versus the number of restarts for majority vote and Top-1 Converged across TRM, TRM+RI, and EqR (TRM+RI+NI). After applying RI, especially RI+NI, Top-1 Converged becomes more compute-efficient than majority vote and reaches a higher accuracy ceiling. Convergence becomes a reliable selection signal after landscape shaping. Majorit…
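The two aggregation rules named in this caption can be sketched as follows. One plausible reading, which is an assumption here rather than a confirmed definition, is that "Top-1 Converged" selects the single restart with the lowest final fixed-point residual, while majority vote picks the most frequent decoded answer. The `(answer, residual)` pairs are made-up illustrative values.

```python
from collections import Counter

def majority_vote(candidates):
    """Pick the most frequent decoded answer across restarts (ties go to the first seen)."""
    answers = [answer for answer, _ in candidates]
    return Counter(answers).most_common(1)[0][0]

def top1_converged(candidates):
    """Pick the answer from the restart with the smallest final fixed-point residual."""
    return min(candidates, key=lambda pair: pair[1])[0]

# Hypothetical (answer, residual) pairs from five restarts; the values are made up.
restarts = [("A", 0.30), ("A", 0.25), ("B", 1e-5), ("A", 0.40), ("C", 0.10)]

voted = majority_vote(restarts)      # the most common answer, regardless of convergence
selected = top1_converged(restarts)  # the answer of the best-converged restart
```

The rules disagree whenever the best-converged trajectory is in the minority, which is the case where residual-based selection can help if convergence is a reliable signal of correctness.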

**Figure 9.** Iterative depth improves generalization under a matched layer-evaluation budget. On Sudoku, increasing the iteration steps changes both optimization and generalization: the shallow iterative model reaches higher training accuracy than the feedforward depth baseline at its best evaluation checkpoint, while deeper iterative models trade slightly lower training accuracy for substantially higher evaluation acc…

**Figure 10.** Gradient flow in segmented online training. Left: a segment loss supervises the current solver state while backpropagating only through the immediately preceding local computation graph, marked by the red dashed region. Right: over a long trajectory, SOT changes the optimizer time scale: each segment performs a local backward pass and an immediate parameter update, then the next segment continues from the…

**Figure 11.** Residual diagnostic for late anchors. At the terminal-loss detached-carry checkpoint, the rollout residual ∥zt − zt−1∥ drops sharply in the early iterations and then changes only slightly over the late trajectory. The shifted-log scale keeps this late plateau visible: the change from iterations 8 to 12 is small compared with the early 2 to 8 drop, supporting the use of late anchors that supervise states a…

**Figure 12.** Ablation of ACT halting mechanisms. We compare several ACT variants (including oracle halting based on ground-truth correctness) and report both task performance and the resulting average number of iteration steps at 50k training steps; oracle halting tends to overfit and collapses toward near-single-step behavior.

**Figure 13.** An example reasoning trajectory (length 32) produced by TRM on a Sudoku-Extreme puzzle. Correct predictions are highlighted in green. The figure suggests that the model’s reasoning is not strictly sequential in the way an algorithmic solver would be. Instead, it exhibits “erase then retry” behavior, with partial revisions that overwrite earlier choices. We mark one representative location with a red circl…

**Figure 14.** Maze benchmark examples. Maze-1k contains inputs with multiple valid shortest paths, while Maze-Unique restricts the benchmark to uniquely solvable instances. [Histogram residue from an adjacent figure: rating distributions (Rating vs. Count) for Train Ratings (mean=29.87, variance=997.12) and Lite-Eval Ratings (mean=20.92, variance=464.04) …]

read the original abstract

Scaling test-time compute by iteratively updating a latent state has emerged as a powerful paradigm for reasoning. Yet the internal mechanisms that enable these iterative models to generalize beyond memorized patterns remain unclear. We hypothesize that generalizable reasoning arises from learning task-conditioned attractors: latent dynamical systems whose stable fixed points correspond to valid solutions. We formalize this process through Equilibrium Reasoners (EqR), which enable test-time scaling without external verifiers or task-specific priors. EqR scales internal dynamics along two axes: depth, by running more iterations, and breadth, by aggregating stochastic trajectories from multiple initializations. Empirically, gains from test-time scaling are tightly coupled with stronger convergence toward solution-aligned attractors. This attractor perspective allows neural networks to adaptively allocate test-time compute based on task difficulty. While simple cases converge within 1 to 5 iteration steps, harder cases benefit from massive test-time scaling. By unrolling up to the equivalent of 40,000 layers, scalable latent reasoning boosts accuracy from 2.6% for feedforward models to over 99% on Sudoku-Extreme. These results suggest that learned attractor landscapes provide a useful mechanistic lens for understanding scalable reasoning in iterative latent models.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

EqR frames iterative latent reasoning as attractor dynamics and shows strong Sudoku scaling, but the convergence-accuracy link risks being circular without independent checks on rule satisfaction.

The main takeaway is that this paper gives a clean mechanistic story for why unrolling latent iterations helps on hard reasoning tasks: the model learns task-specific attractors whose fixed points land on valid solutions. They formalize Equilibrium Reasoners that scale along depth (more iterations) and breadth (multiple random starts), and they report that accuracy tracks how tightly the dynamics converge to those points. On Sudoku-Extreme the jump from 2.6% feedforward to over 99% with the equivalent of 40,000 layers is the clearest empirical result here. That scaling behavior and the two-axis view are the genuinely new pieces; prior iterative latent work has not framed the process this explicitly as attractor learning without external verifiers. The experiments look solid enough on the surface to explain why harder instances need far more iterations while easy ones converge in a handful of steps.

The soft spot is exactly the one in the stress-test note. If solution alignment is measured by proximity to the labeled answer, then the reported tight coupling between convergence and accuracy is unsurprising and does not yet demonstrate that the attractor dynamics themselves enforce the underlying constraints. A reader will want to see whether the fixed points satisfy Sudoku rules on their own, perhaps through an ablation that scores rule violations separately from label match. Without that separation the mechanistic claim stays partly interpretive.

This is the kind of paper that belongs in a reading group focused on test-time compute and latent dynamics; anyone working on iterative architectures or scaling laws for reasoning will find the empirical pattern useful even if they end up disagreeing with the attractor lens. It is coherent on its own terms and has enough concrete results to merit sending out for peer review rather than desk rejection, though the authors should expect questions on the circularity issue and will probably need to add controls in revision.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces Equilibrium Reasoners (EqR), a framework for scalable latent reasoning in which neural networks learn task-conditioned attractors whose stable fixed points are hypothesized to correspond to valid solutions. The approach scales test-time compute along depth (by increasing the number of iterations) and breadth (by aggregating multiple stochastic trajectories), with the claim that performance gains are tightly coupled to stronger convergence toward solution-aligned attractors. On Sudoku-Extreme, unrolling dynamics equivalent to 40,000 layers raises accuracy from 2.6% (feedforward baseline) to over 99%.

Significance. If the attractor mechanism can be shown to operate independently of label-based supervision, the work would supply a dynamical-systems account of why iterative latent models generalize and scale, together with a practical method for adaptive test-time compute allocation. The reported scaling results on a hard constraint-satisfaction task would be a notable empirical contribution to the literature on test-time reasoning.

major comments (2)

[Abstract / Hypothesis] Abstract and opening hypothesis paragraph: the claim that fixed points 'correspond to valid solutions' and that convergence is 'tightly coupled' to accuracy risks circularity if solution-alignment is measured by proximity to the same ground-truth labels used to compute the 99% accuracy figure. The manuscript must demonstrate that the learned fixed points satisfy the underlying task constraints (Sudoku rules, etc.) on held-out instances even when label information is withheld from the convergence metric.
[Empirical Results] Empirical section reporting the Sudoku-Extreme results: the abstract states large gains and a correlation between convergence and accuracy, yet supplies no dataset splits, error bars, ablation controls, or explicit separation between the attractor fixed-point property and the supervised accuracy signal. Without these, the support for the mechanistic attractor hypothesis cannot be evaluated.

minor comments (1)

[Methods / Scaling Description] Clarify the precise mapping from iteration count to 'equivalent of 40,000 layers' and whether this equivalence is architectural or merely computational.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. The comments highlight important points for clarifying the independence of the attractor mechanism from label supervision and for improving the empirical reporting. We address each major comment below and will incorporate revisions to strengthen the manuscript.

Referee: [Abstract / Hypothesis] Abstract and opening hypothesis paragraph: the claim that fixed points 'correspond to valid solutions' and that convergence is 'tightly coupled' to accuracy risks circularity if solution-alignment is measured by proximity to the same ground-truth labels used to compute the 99% accuracy figure. The manuscript must demonstrate that the learned fixed points satisfy the underlying task constraints (Sudoku rules, etc.) on held-out instances even when label information is withheld from the convergence metric.

Authors: We agree that explicit separation is necessary to substantiate the mechanistic claim. Convergence in EqR is defined via the fixed-point residual of the latent dynamics (∥fθ(z; x) − z∥ < ε), which does not reference ground-truth labels during iteration. In the revision we will add a dedicated analysis on held-out Sudoku-Extreme instances showing that trajectories reaching this label-free convergence criterion decode to states satisfying all Sudoku constraints (unique rows, columns, and blocks) at rates that track the reported accuracy gains. This will demonstrate that the attractor property operates independently of the supervised accuracy metric. revision: yes
Referee: [Empirical Results] Empirical section reporting the Sudoku-Extreme results: the abstract states large gains and a correlation between convergence and accuracy, yet supplies no dataset splits, error bars, ablation controls, or explicit separation between the attractor fixed-point property and the supervised accuracy signal. Without these, the support for the mechanistic attractor hypothesis cannot be evaluated.

Authors: We acknowledge that these details were insufficiently reported. The revised manuscript will specify the exact train/validation/test splits for Sudoku-Extreme, report standard error bars across multiple random seeds, and include targeted ablations that compare convergent EqR trajectories against non-convergent iterative baselines while holding supervision fixed. These additions will make the empirical support for the attractor hypothesis directly evaluable. revision: yes

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper hypothesizes that generalizable reasoning arises from task-conditioned attractors whose fixed points correspond to valid solutions, then reports that test-time scaling gains are empirically coupled to convergence toward solution-aligned attractors, with accuracy improvements demonstrated on Sudoku-Extreme via unrolling iterations. No equations, fitting procedures, or derivations are shown that reduce any claimed prediction or result to its inputs by construction. There are no self-citations, uniqueness theorems, or ansatzes invoked in the provided text that would create load-bearing circularity. The central claims remain independent of the accuracy metric and are presented as an empirical observation rather than a definitional equivalence, making the derivation self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

Review is limited to the abstract; the ledger therefore records only the core hypothesis and postulated mechanism explicitly stated there.

axioms (1)

domain assumption: Stable fixed points of the learned latent dynamics correspond to valid task solutions.
Invoked in the first paragraph as the basis for the entire approach.

invented entities (1)

task-conditioned attractors (no independent evidence)
purpose: Stable fixed points in latent space that encode correct solutions for a given task
Newly hypothesized mechanism introduced to explain generalization and scalable reasoning

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: "We hypothesize that generalizable reasoning arises from learning task-conditioned attractors: latent dynamical systems whose stable fixed points correspond to valid solutions."

IndisputableMonolith/Foundation/DimensionForcing.lean · alexander_duality_circle_linking · unclear
Paper passage: "By unrolling up to the equivalent of 40,000 layers, scalable latent reasoning boosts accuracy from 2.6% for feedforward models to over 99% on Sudoku-Extreme."

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.


work page 2025
[66]

2025 , eprint=

Investigating Recurrent Transformers with Dynamic Halt , author=. 2025 , eprint=

work page 2025
[67]

Advances in Neural Information Processing Systems , year=

Towards Thinking-Optimal Scaling of Test-Time Compute for LLM Reasoning , author=. Advances in Neural Information Processing Systems , year=

work page
[68]

International Conference on Learning Representations , year=

Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters , author=. International Conference on Learning Representations , year=

work page
[69]

Advances in Neural Information Processing Systems , year=

Does Thinking More always Help? Mirage of Test-Time Scaling in Reasoning Models , author=. Advances in Neural Information Processing Systems , year=

work page
[70]

International Conference on Learning Representations , year=

Deep Think with Confidence , author=. International Conference on Learning Representations , year=

work page
[71]

2025 , eprint=

MatryoshkaThinking: Recursive Test-Time Scaling Enables Efficient Reasoning , author=. 2025 , eprint=

work page 2025
