Simply Stabilizing the Loop via Fully Looped Transformer

Hechang Chen; Jiankun Zhang; Jing Ma; Rao Fu; Yi Chang; Yu Li; Zixuan Yang

arxiv: 2605.18797 · v1 · pith:W5DTY6P2new · submitted 2026-05-11 · 💻 cs.LG · cs.AI

Rao Fu, Zixuan Yang, Jiankun Zhang, Jing Ma, Hechang Chen, Yu Li, Yi Chang

Pith reviewed 2026-05-20 23:15 UTC · model grok-4.3

keywords looped transformer · training stability · gradient oscillation · residual explosion · attention injection · iterative computation · model scaling · test-time compute

The pith

Two parameter-free changes to the looped transformer fix gradient oscillation and residual explosion to allow stable training at up to 12 iterations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that looped transformers, which reuse the same blocks repeatedly to gain performance without adding parameters, become unstable as the number of iterations grows. The authors trace the instability to oscillating gradients and exploding residuals, then introduce a fully looped architecture that spreads inter-loop signals across every layer and an attention injection step that reuses the existing attention block to dampen oscillations. These changes keep training stable at 12 loop iterations where prior looped models collapse. In regimes where baselines remain stable, the modified model still raises average downstream accuracy by as much as 13.2 percent. The result supplies a practical way to trade extra test-time computation for better performance while keeping parameter count fixed.

Core claim

The Fully Looped Transformer distributes inter-loop signals across all layers to prevent residual explosion and reuses the attention block to suppress gradient oscillation. These two parameter-free modifications stabilize training dynamics up to 12 loop iterations, whereas baseline looped models collapse in the same regime, and they raise average downstream-task performance by up to 13.2 percent in milder settings.
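
To make the setup concrete, here is a toy sketch of weight-tied looping. It is not the paper's architecture: it assumes a single hypothetical `block` that scales its input by one fixed weight and adds the result back through an unnormalized residual connection, and `WEIGHT`, `block`, and `run_loops` are illustrative names. It only shows two things the claim relies on: the parameter count does not change with the loop count, and an unnormalized residual stream can keep growing as iterations increase.

```python
import math

# Hypothetical shared parameter: one scalar weight reused by every loop iteration.
WEIGHT = 0.1


def block(hidden: list[float]) -> list[float]:
    """Toy residual block: h + WEIGHT * h, with no normalization."""
    return [h + WEIGHT * h for h in hidden]


def l2_norm(hidden: list[float]) -> float:
    """Euclidean norm of the hidden state."""
    return math.sqrt(sum(h * h for h in hidden))


def run_loops(hidden: list[float], n_loops: int) -> list[float]:
    """Apply the same block n_loops times; return the residual norm after each pass."""
    norms = []
    for _ in range(n_loops):
        hidden = block(hidden)  # identical weights on every iteration
        norms.append(l2_norm(hidden))
    return norms


norms = run_loops([3.0, 4.0], n_loops=12)
print(f"norm after 1 loop:   {norms[0]:.3f}")
print(f"norm after 12 loops: {norms[-1]:.3f}")
```

In this toy the norm is multiplied by `1 + WEIGHT` on every pass, so more loops mean a larger residual even though no parameter was added. Real looped models use normalization and learned blocks, so the growth pattern there may differ.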

What carries the argument

The Fully Looped Architecture and Attention Injection: the first distributes inter-loop signals across all layers, and the second reuses the existing attention block, targeting the two identified sources of instability.

If this is right

Model capacity can be increased by raising loop count at inference instead of widening or deepening the network.
Test-time compute can be varied after training by choosing different numbers of iterations without retraining.
Training succeeds in regimes where earlier looped designs fail, expanding the usable range of iteration counts.
Downstream accuracy improves even when both models train successfully, showing the fixes also aid optimization.
Parameter count stays constant while effective depth grows, offering a different scaling axis from standard transformers.
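
The trade between loop count and compute in the list above can be sketched as a small helper. This is an illustration under stated assumptions, not something the paper specifies: it assumes the cost of a forward pass grows linearly with the number of block applications (layers times loops), and `cost_per_layer`, `budget`, and both function names are hypothetical. The candidate loop counts 3, 6, 9, and 12 are the ones named in the Figure 5 caption.

```python
from typing import Optional


def effective_depth(n_layers: int, n_loops: int) -> int:
    """Number of block applications in one forward pass."""
    return n_layers * n_loops


def max_loops_within_budget(
    n_layers: int,
    cost_per_layer: float,
    budget: float,
    candidates: tuple[int, ...] = (3, 6, 9, 12),
) -> Optional[int]:
    """Largest candidate loop count whose estimated cost fits the budget, else None."""
    affordable = [
        n for n in candidates
        if effective_depth(n_layers, n) * cost_per_layer <= budget
    ]
    return max(affordable) if affordable else None


# Hypothetical numbers: an 8-layer shared stack, unit cost per layer, budget of 80 units.
print(max_loops_within_budget(n_layers=8, cost_per_layer=1.0, budget=80.0))
```

The parameter count never enters the calculation, which is the point of the scaling axis described above: only the number of times the shared layers are applied changes.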

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same signal-distribution and attention-reuse ideas could be tested in other iterative architectures such as recurrent networks or state-space models.
If the fixes generalize, training budgets could shift from adding parameters toward adding loop iterations at inference.
The approach may reduce the need for very deep unrolled networks by letting a shallow block be reused more reliably.
Similar lightweight modifications might address instability in other training regimes that suffer from repeated residual paths.

Load-bearing premise

Instability in looped transformers comes only from gradient oscillation and residual explosion, and the two proposed fixes remove those sources without introducing new instabilities.

What would settle it

Train the Fully Looped Transformer and a standard Looped Transformer with 12 loop iterations on the same data, then compare whether the loss curves remain stable or diverge.
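
One way to make "stable or diverge" checkable is a simple rule applied to each logged loss curve. The sketch below is an assumption, not the paper's criterion: `has_diverged`, the `window` size, and the `blowup` threshold are all hypothetical choices, and the two curves are made-up placeholders rather than reported results.

```python
import math


def has_diverged(losses: list[float], window: int = 3, blowup: float = 2.0) -> bool:
    """Flag a run as collapsed if any loss is non-finite, or if the mean of the
    last `window` losses exceeds `blowup` times the lowest loss seen in the run."""
    if any(not math.isfinite(x) for x in losses):
        return True
    if len(losses) < window:
        return False
    recent_mean = sum(losses[-window:]) / window
    return recent_mean > blowup * min(losses)


# Placeholder curves for illustration only.
stable_run = [4.0, 3.2, 2.9, 2.7, 2.6, 2.5]
collapsed_run = [4.0, 3.2, 2.9, 6.5, 7.8, 8.1]

print("stable run diverged:   ", has_diverged(stable_run))
print("collapsed run diverged:", has_diverged(collapsed_run))
```

Applying the same rule to both models, ideally across several seeds, turns the comparison into a count of collapsed runs per model instead of a visual judgment.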

Figures

Figures reproduced from arXiv: 2605.18797 by Hechang Chen, Jiankun Zhang, Jing Ma, Rao Fu, Yi Chang, Yu Li, Zixuan Yang.

**Figure 2.** Training dynamics of LT and FLT during the first 2000 optimizer steps. [PITH_FULL_IMAGE:figures/full_fig_p004_2.png]

**Figure 3.** Training dynamics comparison of the FLT variants and LT variants. All models compared [PITH_FULL_IMAGE:figures/full_fig_p008_3.png]

**Figure 4.** The loss of different base size models at 12-loop setting. All models except FLT collapsed. Smoothed with factor 0.9 for readability. [PITH_FULL_IMAGE:figures/full_fig_p008_4.png]

| Metrics | GQA | MLA | SWA | FA |
| --- | --- | --- | --- | --- |
| Wiki2 ppl ↓ | 39.68 | 38.76 | 38.91 | 41.12 |
| Val bpb ↓ | 0.897 | 0.904 | 0.895 | 0.895 |
| Core ↑ | 16.24 | 15.64 | 15.58 | 15.66 |

**Figure 5.** Test-time adaptation evaluation results of FLT. Models trained with 3, 6, 9, or 12 loop [PITH_FULL_IMAGE:figures/full_fig_p009_5.png]

**Figure 6.** Left: the residual norm at the 6th loop iteration. Middle: the residual norm at the 9th loop iteration. Right: the residual norm at the 12th loop iteration. [PITH_FULL_IMAGE:figures/full_fig_p014_6.png]

**Figure 7.** Left: the gradient norm of the LM head block. Middle: the gradient norm of the FFN at the 5th layer. Right: the gradient norm of the attention block at the 5th layer. Figures 6 and 7 provide supplementary evidence for the diagnostic experiment in Section 3. [PITH_FULL_IMAGE:figures/full_fig_p014_7.png]

**Figure 8.** Trend chart of Core Metric changes for FLT with different attention variants throughout the [PITH_FULL_IMAGE:figures/full_fig_p015_8.png]

**Figure 9.** The training loss of FLT with different attention variants. Smoothed with factor 0.9. [PITH_FULL_IMAGE:figures/full_fig_p015_9.png]
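
Two of the captions above mention smoothing "with factor 0.9" without defining the smoother. A common reading, and only an assumption here, is an exponential moving average of the kind used by training dashboards. The sketch below shows that interpretation; `ema_smooth` is a hypothetical name and the seeding with the first value is one of several possible conventions.

```python
def ema_smooth(values: list[float], factor: float = 0.9) -> list[float]:
    """Exponential moving average: each output keeps `factor` of the previous
    smoothed value and takes `1 - factor` of the new raw value.
    Assumption: the running value is seeded with the first raw point."""
    smoothed = []
    last = values[0] if values else 0.0
    for v in values:
        last = factor * last + (1.0 - factor) * v
        smoothed.append(last)
    return smoothed


# Placeholder loss values with one spike, for illustration only.
raw = [3.0, 3.0, 9.0, 3.0, 3.0]
print([round(x, 3) for x in ema_smooth(raw)])
```

With a factor of 0.9 a single spike is strongly damped, which is worth keeping in mind when reading smoothed curves: short-lived instabilities can look milder than they were in the raw loss.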

Original abstract

Scaling model performance typically requires increasing model size. Looped Transformer offers a compelling alternative by iteratively reusing the same Transformer blocks, trading additional computation for improved performance without increasing parameter count or context length. Because the number of loop iterations can be adjusted at inference, it also provides a natural mechanism for balancing performance and test-time compute. However, Looped Transformer still suffers from training instability when the number of loop iterations increases. Our analysis reveals that this instability stems from two sources: gradient oscillation and residual explosion. To address these two problems, we propose the Fully Looped Transformer, which introduces two parameter-free modifications: (1) Fully Looped Architecture, which distributes inter-loop signals across all layers to mitigate residual explosion; (2) Attention Injection, which reuses the existing attention block to suppress gradient oscillation. These modifications stabilize training dynamics, enabling the Fully Looped Transformer to be trained stably up to 12 loop iterations, whereas other baseline looped models collapse in this regime. In milder settings where Looped Transformer does not collapse, Fully Looped Transformer still improves average downstream-task performance by up to 13.2%. Overall, our experiments demonstrate that Fully Looped Transformer improves training stability, enhances downstream performance, and provides preliminary adaptability under different test-time compute budgets by varying loop iterations at inference.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper stabilizes looped transformers up to 12 iterations with two parameter-free changes and reports some performance gains, but the instability diagnosis looks incomplete and experimental details are thin.

The main takeaway is that this work takes the looped transformer setup and adds a fully distributed signal path across layers plus attention reuse to keep gradients in check. That lets training hold together at higher loop counts where prior versions fall apart, and it gives a modest lift on downstream tasks when the baseline stays stable. The changes are parameter-free, which keeps the scaling story clean: more test-time iterations instead of more weights or context.

Referee Report

2 major / 2 minor

Summary. The paper introduces the Fully Looped Transformer to address training instability in Looped Transformers when increasing the number of loop iterations. It identifies gradient oscillation and residual explosion as the two primary sources of instability through analysis, then proposes two parameter-free modifications: (1) Fully Looped Architecture, which distributes inter-loop signals across all layers to mitigate residual explosion, and (2) Attention Injection, which reuses the existing attention block to suppress gradient oscillation. Experiments demonstrate stable training up to 12 loop iterations (where baselines collapse) and up to 13.2% average downstream-task performance gains in milder regimes, with preliminary evidence of adaptability by varying loop count at inference.

Significance. If the empirical stability and gains hold under rigorous controls, the work provides a practical, parameter-free route to deeper effective computation via looping without increasing model size or context length. This directly supports better performance-compute tradeoffs at test time. The explicit identification of instability sources and the parameter-free design are strengths that could influence efficient scaling research, though the completeness of the causal analysis remains a key open question for the central claims.

major comments (2)

§3 (Instability Analysis): The assertion that gradient oscillation and residual explosion constitute the complete and primary causes of collapse at high loop counts is load-bearing for the design of the two fixes and the stability claim up to 12 iterations. The manuscript does not provide exhaustive ablations or theoretical arguments ruling out other mechanisms (e.g., attention pattern drift across iterations or depth-dependent regularization effects), leaving the attribution of success to these specific modifications incomplete.
§4 and §5 (Experimental Setup and Results): The reported 13.2% average performance improvement and stability up to 12 iterations lack details on experimental controls, number of random seeds, statistical significance testing, or correction for multiple comparisons across tasks and loop counts. Without these, it is unclear whether the gains are robust or whether the collapse of baselines is consistently reproduced.

minor comments (2)

Abstract and §2: The description of Attention Injection as 'reusing the existing attention block' would benefit from a precise equation or pseudocode showing how the injection is implemented without altering parameter count or introducing new learnable weights.
Figure 2 or equivalent (Training Dynamics): The plots of gradient norms and residual magnitudes across loops would be clearer if they included error bars from multiple runs and direct comparison to the proposed fixes at each iteration count.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful and constructive feedback. We address each major comment point by point below, indicating where revisions will be made to strengthen the manuscript.

Referee: §3 (Instability Analysis): The assertion that gradient oscillation and residual explosion constitute the complete and primary causes of collapse at high loop counts is load-bearing for the design of the two fixes and the stability claim up to 12 iterations. The manuscript does not provide exhaustive ablations or theoretical arguments ruling out other mechanisms (e.g., attention pattern drift across iterations or depth-dependent regularization effects), leaving the attribution of success to these specific modifications incomplete.

Authors: We appreciate the referee's emphasis on the need for stronger causal attribution. Section 3 presents empirical gradient analysis and ablation results showing that gradient oscillation and residual explosion are the dominant instability sources in the high-iteration regime, with the proposed fixes directly mitigating them to enable stable training up to 12 iterations. We do not claim these are the sole possible mechanisms. In the revision we will expand the discussion to explicitly acknowledge alternative factors such as attention pattern drift and include additional targeted ablations (e.g., monitoring attention entropy across loops) to further support the primary role of the identified issues.
Revision: partial
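
The attention-entropy monitoring mentioned in the response above could look roughly like the following. This is a minimal sketch under assumptions: it presumes attention rows (already softmax-normalized) have been recorded at each loop iteration, and `attention_entropy` and `mean_entropy_per_loop` are hypothetical helper names. The sample rows are made up for illustration.

```python
import math


def attention_entropy(weights: list[float]) -> float:
    """Shannon entropy in nats of one attention row (weights should sum to 1).
    Zero weights contribute nothing, following the 0 * log(0) = 0 convention."""
    return -sum(w * math.log(w) for w in weights if w > 0.0)


def mean_entropy_per_loop(attn_by_loop: list[list[list[float]]]) -> list[float]:
    """attn_by_loop[t] holds the attention rows recorded at loop iteration t."""
    return [
        sum(attention_entropy(row) for row in rows) / len(rows)
        for rows in attn_by_loop
    ]


# Placeholder data: diffuse attention at loop 0, sharply peaked attention at loop 1.
recorded = [
    [[0.25, 0.25, 0.25, 0.25], [0.25, 0.25, 0.25, 0.25]],
    [[1.0, 0.0, 0.0, 0.0], [0.97, 0.01, 0.01, 0.01]],
]
print([round(e, 3) for e in mean_entropy_per_loop(recorded)])
```

A steady drift of this per-loop average toward zero or toward the uniform maximum would be one signal of attention pattern drift, separate from the gradient and residual diagnostics.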
Referee: §4 and §5 (Experimental Setup and Results): The reported 13.2% average performance improvement and stability up to 12 iterations lack details on experimental controls, number of random seeds, statistical significance testing, or correction for multiple comparisons across tasks and loop counts. Without these, it is unclear whether the gains are robust or whether the collapse of baselines is consistently reproduced.

Authors: We agree that greater experimental rigor is required. The revised manuscript will report the number of random seeds (3–5 per setting), include standard deviations or error bars for all metrics, add statistical significance tests (e.g., paired t-tests against baselines), and clarify the primary comparisons of interest to address multiple-testing concerns. These additions will substantiate the robustness of the 13.2% gains and the consistent reproduction of baseline collapse at high loop counts.
Revision: yes
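
The paired comparison across seeds named in the response above can be sketched with the standard library alone. The scores below are placeholders, not results from the paper, and `paired_t_statistic` is a hypothetical helper. It returns only the t statistic and degrees of freedom; a p-value would come from a t-distribution table or a library routine such as `scipy.stats.ttest_rel`.

```python
import math
import statistics


def paired_t_statistic(baseline: list[float], treatment: list[float]) -> tuple[float, int]:
    """Return (t, degrees_of_freedom) for paired samples, one pair per seed.
    Raises ZeroDivisionError if every per-seed difference is identical."""
    if len(baseline) != len(treatment) or len(baseline) < 2:
        raise ValueError("need two equal-length samples with at least 2 pairs")
    diffs = [t - b for b, t in zip(baseline, treatment)]
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample standard deviation (n - 1)
    t_stat = mean_diff / (sd_diff / math.sqrt(len(diffs)))
    return t_stat, len(diffs) - 1


# Placeholder scores for three seeds (illustrative values only).
baseline_scores = [15.0, 16.0, 17.0]
treatment_scores = [16.0, 18.0, 17.0]

t_stat, dof = paired_t_statistic(baseline_scores, treatment_scores)
print(f"t = {t_stat:.3f}, degrees of freedom = {dof}")
```

With only 3 to 5 seeds the degrees of freedom are small, so a large t statistic is needed before a difference counts as significant, and any correction for multiple comparisons raises that bar further.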

Circularity Check

0 steps flagged

No significant circularity; claims rest on empirical validation of parameter-free modifications

The paper presents an empirical analysis identifying gradient oscillation and residual explosion as instability sources, followed by two parameter-free architectural modifications (Fully Looped Architecture and Attention Injection) whose effects are demonstrated through training runs up to 12 iterations and downstream task improvements of up to 13.2%. No equations, predictions, or first-principles derivations are shown that reduce by construction to fitted inputs, self-definitions, or self-citation chains. The central stability and performance claims are supported by external experimental benchmarks rather than internal redefinitions or renamings, rendering the derivation self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on standard transformer training assumptions plus the untested premise that the observed instability sources are exhaustive and that the fixes are neutral with respect to other dynamics.

axioms (1)

Domain assumption: Standard assumptions of transformer training dynamics and gradient flow hold for looped variants. Invoked when attributing instability to gradient oscillation and residual explosion.


International conference on machine learning , pages=

On the difficulty of training recurrent neural networks , author=. International conference on machine learning , pages=. 2013 , organization=

work page 2013

[33] [33]

IEEE transactions on neural networks , volume=

Learning long-term dependencies with gradient descent is difficult , author=. IEEE transactions on neural networks , volume=. 1994 , publisher=

work page 1994

[34] [34]

Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP) , pages=

Learning phrase representations using RNN encoder--decoder for statistical machine translation , author=. Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP) , pages=

work page 2014

[35] [35]

Advances in Neural Information Processing Systems , volume=

Datacomp-lm: In search of the next generation of training sets for language models , author=. Advances in Neural Information Processing Systems , volume=

work page

[36] [36]

doi:10.57967/hf/2497 , publisher =

Lozhkov, Anton and Ben Allal, Loubna and von Werra, Leandro and Wolf, Thomas , title =. doi:10.57967/hf/2497 , publisher =

work page doi:10.57967/hf/2497

[37] [37]

Parcae: Scaling Laws For Stable Looped Language Models

Parcae: Scaling Laws For Stable Looped Language Models , author=. arXiv preprint arXiv:2604.12946 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[38] [38]

Advances in neural information processing systems , volume=

Sequence to sequence learning with neural networks , author=. Advances in neural information processing systems , volume=

work page

[39] [39]

2016 , eprint=

Pointer Sentinel Mixture Models , author=. 2016 , eprint=

work page 2016

[40] [40]

2025 , eprint=

Qwen3 Technical Report , author=. 2025 , eprint=

work page 2025

[41] [41]

2020 , eprint=

Longformer: The Long-Document Transformer , author=. 2020 , eprint=

work page 2020

[42] [42]

2024 , eprint=

DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model , author=. 2024 , eprint=

work page 2024

[43] [43]

2023 , eprint=

GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints , author=. 2023 , eprint=

work page 2023

[44] [44]

2023 , eprint=

Attention Is All You Need , author=. 2023 , eprint=

work page 2023

[45] [45]

2019 , eprint=

Root Mean Square Layer Normalization , author=. 2019 , eprint=

work page 2019

[46] [46]

2016 , eprint=

Residual Networks Behave Like Ensembles of Relatively Shallow Networks , author=. 2016 , eprint=

work page 2016

[47] [47]

2022 , eprint=

Tensor Programs V: Tuning Large Neural Networks via Zero-Shot Hyperparameter Transfer , author=. 2022 , eprint=

work page 2022

[48] [48]

2019 , eprint=

Decoupled Weight Decay Regularization , author=. 2019 , eprint=

work page 2019

[49] [49]

2024 , url =

Keller Jordan and Yuchen Jin and Vlado Boza and Jiacheng You and Franz Cesista and Laker Newhouse and Jeremy Bernstein , title =. 2024 , url =

work page 2024

[50] [50]

2025 , publisher =

Andrej Karpathy , title =. 2025 , publisher =

work page 2025

[51] [51]

2023 , eprint=

Chain-of-Thought Prompting Elicits Reasoning in Large Language Models , author=. 2023 , eprint=

work page 2023

[52] [52]

Proceedings of the IEEE , volume=

Backpropagation through time: what it does and how to do it , author=. Proceedings of the IEEE , volume=. 2002 , publisher=

work page 2002

[53] [53]

2023 , eprint=

RoFormer: Enhanced Transformer with Rotary Position Embedding , author=. 2023 , eprint=

work page 2023

[54] [54]

2024 , eprint=

ReLU ^2 Wins: Discovering Efficient Activation Functions for Sparse LLMs , author=. 2024 , eprint=

work page 2024

[55] [55]

2024 , eprint=

Gemma 2: Improving Open Language Models at a Practical Size , author=. 2024 , eprint=

work page 2024

[56] [56]

2024 , url =

Keller Jordan and Jeremy Bernstein and Brendan Rappazzo and @fernbear.bsky.social and Boza Vlado and You Jiacheng and Franz Cesista and Braden Koszarsky and @Grad62304977 , title =. 2024 , url =

work page 2024

[57] [57]

Language Models are Unsupervised Multitask Learners , author=

work page

[58] [58]

EMNLP , year=

Can a Suit of Armor Conduct Electricity? A New Dataset for Open Book Question Answering , author=. EMNLP , year=

work page

[59] [59]

Thirty-Fourth AAAI Conference on Artificial Intelligence , year =

Yonatan Bisk and Rowan Zellers and Ronan Le Bras and Jianfeng Gao and Yejin Choi , title =. Thirty-Fourth AAAI Conference on Artificial Intelligence , year =

work page

[60] [60]

Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics , year=

HellaSwag: Can a Machine Really Finish Your Sentence? , author=. Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics , year=

work page

[61] [61]

Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics , pages =

BoolQ: Exploring the Surprising Difficulty of Natural Yes/No Questions , author =. Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics , pages =

work page 2019

[62] [62]

Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge

Peter Clark and Isaac Cowhey and Oren Etzioni and Tushar Khot and Ashish Sabharwal and Carissa Schoenick and Oyvind Tafjord , title =. arXiv:1803.05457v1 , year =

work page internal anchor Pith review Pith/arXiv arXiv

[63] [63]

2016 , eprint=

The LAMBADA dataset: Word prediction requiring a broad discourse context , author=. 2016 , eprint=

work page 2016

[64] [64]

2020 , eprint=

ZeRO: Memory Optimizations Toward Training Trillion Parameter Models , author=. 2020 , eprint=

work page 2020

[65] [65]

2020 , eprint=

Query-Key Normalization for Transformers , author=. 2020 , eprint=

work page 2020