Simply Stabilizing the Loop via Fully Looped Transformer
Pith reviewed 2026-05-20 23:15 UTC · model grok-4.3
The pith
Two parameter-free changes to the looped transformer fix gradient oscillation and residual explosion to allow stable training at up to 12 iterations.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The Fully Looped Transformer distributes inter-loop signals across all layers to prevent residual explosion and reuses the attention block to suppress gradient oscillation. These two parameter-free modifications stabilize training dynamics up to 12 loop iterations, whereas baseline looped models collapse in the same regime. In milder settings where the baseline does not collapse, they still raise average downstream-task performance by up to 13.2 percent.
What carries the argument
Two components carry the argument: the Fully Looped Architecture, which distributes inter-loop signals across all layers to mitigate residual explosion, and Attention Injection, which reuses the existing attention block to suppress gradient oscillation.
If this is right
- Model capacity can be increased by raising loop count at inference instead of widening or deepening the network.
- Test-time compute can be varied after training by choosing different numbers of iterations without retraining.
- Training succeeds in regimes where earlier looped designs fail, expanding the usable range of iteration counts.
- Downstream accuracy improves even when both models train successfully, showing the fixes also aid optimization.
- Parameter count stays constant while effective depth grows, offering a different scaling axis from standard transformers.
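The loop-count and parameter-count points above can be illustrated with a toy sketch. It is not the paper's architecture: `toy_block`, `run_looped`, the scalar weights, and the update rule are hypothetical stand-ins that only show one set of weights being reused for a number of iterations chosen at call time.

```python
# Toy illustration only: a "block" with a fixed set of weights is applied
# repeatedly. Names and the update rule are hypothetical, not from the paper.

def toy_block(state, weights):
    """Apply one shared block: one damped residual-style update per weight."""
    for w in weights:
        state = [x + w * (1.0 - x) for x in state]
    return state

def run_looped(state, weights, n_loops):
    """Reuse the same weights n_loops times; n_loops is chosen at call time."""
    for _ in range(n_loops):
        state = toy_block(state, weights)
    return state

weights = [0.1, 0.2, 0.3]   # three shared "layers"
x0 = [0.0, 0.5]

for n_loops in (1, 4, 12):
    out = run_looped(x0, weights, n_loops)
    effective_depth = len(weights) * n_loops
    # The parameter count is len(weights) regardless of n_loops.
    print(n_loops, len(weights), effective_depth, [round(v, 4) for v in out])
```

The only quantity that changes between the three runs is `n_loops`, so effective depth grows while the weight list stays the same size.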
Where Pith is reading between the lines
- The same signal-distribution and attention-reuse ideas could be tested in other iterative architectures such as recurrent networks or state-space models.
- If the fixes generalize, compute budgets could shift from adding parameters during training toward adding loop iterations at inference.
- The approach may reduce the need for very deep unrolled networks by letting a shallow block be reused more reliably.
- Similar lightweight modifications might address instability in other training regimes that suffer from repeated residual paths.
Load-bearing premise
Instability in looped transformers comes only from gradient oscillation and residual explosion, and the two proposed fixes remove those sources without introducing new instabilities.
What would settle it
Train the Fully Looped Transformer and a standard Looped Transformer with 12 loop iterations on the same data, then compare whether each loss curve stays stable or diverges.
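One way to make "stable or diverge" operational is a simple check over logged losses. The sketch below is an assumption-laden example, not the paper's criterion: the function name `first_divergence_step`, the `window` and `factor` thresholds, and both loss curves are invented for illustration.

```python
import math

def first_divergence_step(losses, window=3, factor=2.0):
    """Return the first step whose loss is non-finite or exceeds `factor`
    times the mean of the previous `window` losses; None if no such step.
    The thresholds are illustrative assumptions, not values from the paper."""
    for i, loss in enumerate(losses):
        if not math.isfinite(loss):
            return i
        if i >= window:
            recent_mean = sum(losses[i - window:i]) / window
            if loss > factor * recent_mean:
                return i
    return None

# Made-up curves for demonstration only (not results from the paper).
stable_curve = [4.0, 3.5, 3.1, 2.9, 2.8, 2.7]
collapsed_curve = [4.0, 3.5, 3.1, 2.9, 7.5, float("nan")]

print(first_divergence_step(stable_curve))
print(first_divergence_step(collapsed_curve))
```

Running the same check on both models at each loop count, over several seeds, would turn the comparison into a table of first-divergence steps rather than a visual judgment.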
Original abstract
Scaling model performance typically requires increasing model size. Looped Transformer offers a compelling alternative by iteratively reusing the same Transformer blocks, trading additional computation for improved performance without increasing parameter count or context length. Because the number of loop iterations can be adjusted at inference, it also provides a natural mechanism for balancing performance and test-time compute. However, Looped Transformer still suffers from training instability when the number of loop iterations increases. Our analysis reveals that this instability stems from two sources: gradient oscillation and residual explosion. To address these two problems, we propose the Fully Looped Transformer, which introduces two parameter-free modifications: (1) Fully Looped Architecture, which distributes inter-loop signals across all layers to mitigate residual explosion; (2) Attention Injection, which reuses the existing attention block to suppress gradient oscillation. These modifications stabilize training dynamics, enabling the Fully Looped Transformer to be trained stably up to 12 loop iterations, whereas other baseline looped models collapse in this regime. In milder settings where Looped Transformer does not collapse, Fully Looped Transformer still improves average downstream-task performance by up to 13.2%. Overall, our experiments demonstrate that Fully Looped Transformer improves training stability, enhances downstream performance, and provides preliminary adaptability under different test-time compute budgets by varying loop iterations at inference.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces the Fully Looped Transformer to address training instability in Looped Transformers when increasing the number of loop iterations. It identifies gradient oscillation and residual explosion as the two primary sources of instability through analysis, then proposes two parameter-free modifications: (1) Fully Looped Architecture, which distributes inter-loop signals across all layers to mitigate residual explosion, and (2) Attention Injection, which reuses the existing attention block to suppress gradient oscillation. Experiments demonstrate stable training up to 12 loop iterations (where baselines collapse) and up to 13.2% average downstream-task performance gains in milder regimes, with preliminary evidence of adaptability by varying loop count at inference.
Significance. If the empirical stability and gains hold under rigorous controls, the work provides a practical, parameter-free route to deeper effective computation via looping without increasing model size or context length. This directly supports better performance-compute tradeoffs at test time. The explicit identification of instability sources and the parameter-free design are strengths that could influence efficient scaling research, though the completeness of the causal analysis remains a key open question for the central claims.
major comments (2)
- §3 (Instability Analysis): The assertion that gradient oscillation and residual explosion constitute the complete and primary causes of collapse at high loop counts is load-bearing for the design of the two fixes and the stability claim up to 12 iterations. The manuscript does not provide exhaustive ablations or theoretical arguments ruling out other mechanisms (e.g., attention pattern drift across iterations or depth-dependent regularization effects), leaving the attribution of success to these specific modifications incomplete.
- §4 and §5 (Experimental Setup and Results): The reported 13.2% average performance improvement and stability up to 12 iterations lack details on experimental controls, number of random seeds, statistical significance testing, or correction for multiple comparisons across tasks and loop counts. Without these, it is unclear whether the gains are robust or whether the collapse of baselines is consistently reproduced.
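The alternative mechanism named in the first comment, attention pattern drift across iterations, could be probed by tracking attention entropy per loop. The sketch below is a minimal example under stated assumptions: the helper names and the attention matrices are hypothetical, and a real probe would read attention weights from the trained model.

```python
import math

def row_entropy(probs):
    """Shannon entropy (in nats) of one attention row that sums to 1."""
    return sum(p * math.log(1.0 / p) for p in probs if p > 0.0)

def mean_attention_entropy(attn):
    """Average row entropy of one attention matrix (a list of rows)."""
    return sum(row_entropy(row) for row in attn) / len(attn)

def entropy_per_loop(attn_by_loop):
    """Map each loop iteration's attention matrix to its mean entropy."""
    return [mean_attention_entropy(attn) for attn in attn_by_loop]

# Hypothetical attention maps: 2 query positions over 4 keys, per loop.
attn_by_loop = [
    [[0.25, 0.25, 0.25, 0.25], [0.25, 0.25, 0.25, 0.25]],  # uniform
    [[0.70, 0.10, 0.10, 0.10], [0.40, 0.40, 0.10, 0.10]],  # sharper
    [[1.00, 0.00, 0.00, 0.00], [1.00, 0.00, 0.00, 0.00]],  # one-hot
]
print([round(h, 4) for h in entropy_per_loop(attn_by_loop)])
```

A steady fall or rise in this value across loop iterations would be one sign of drift, although entropy alone cannot show whether drift causes the collapse.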
minor comments (2)
- Abstract and §2: The description of Attention Injection as 'reusing the existing attention block' would benefit from a precise equation or pseudocode showing how the injection is implemented without altering parameter count or introducing new learnable weights.
- Figure 2 or equivalent (Training Dynamics): The plots of gradient norms and residual magnitudes across loops would be clearer if they included error bars from multiple runs and direct comparison to the proposed fixes at each iteration count.
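The error bars requested for the training-dynamics plots amount to a per-loop mean and spread across runs. The sketch below shows one way to compute them; `summarize_runs` is a hypothetical helper and the gradient norms are made up for illustration.

```python
from statistics import mean, stdev

def summarize_runs(runs):
    """Given runs[seed][loop_idx] -> metric, return (mean, std) per loop index.
    The sample standard deviation across seeds serves as the error bar."""
    per_loop = zip(*runs)  # transpose: group values by loop index
    return [(mean(vals), stdev(vals)) for vals in per_loop]

# Made-up gradient norms for 3 seeds x 4 loop counts (illustration only).
grad_norms = [
    [1.0, 1.2, 1.5, 2.1],
    [1.1, 1.3, 1.4, 2.4],
    [0.9, 1.1, 1.6, 1.8],
]
for loop_idx, (m, s) in enumerate(summarize_runs(grad_norms), start=1):
    print(f"loop {loop_idx}: {m:.3f} +/- {s:.3f}")
```

The same function applies unchanged to residual magnitudes, so baseline and proposed models can be summarized side by side at each iteration count.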
Simulated Author's Rebuttal
We thank the referee for the thoughtful and constructive feedback. We address each major comment point by point below, indicating where revisions will be made to strengthen the manuscript.
- Referee: §3 (Instability Analysis): The assertion that gradient oscillation and residual explosion constitute the complete and primary causes of collapse at high loop counts is load-bearing for the design of the two fixes and the stability claim up to 12 iterations. The manuscript does not provide exhaustive ablations or theoretical arguments ruling out other mechanisms (e.g., attention pattern drift across iterations or depth-dependent regularization effects), leaving the attribution of success to these specific modifications incomplete.
  Authors: We appreciate the referee's emphasis on the need for stronger causal attribution. Section 3 presents empirical gradient analysis and ablation results showing that gradient oscillation and residual explosion are the dominant instability sources in the high-iteration regime, with the proposed fixes directly mitigating them to enable stable training up to 12 iterations. We do not claim these are the sole possible mechanisms. In the revision we will expand the discussion to explicitly acknowledge alternative factors such as attention pattern drift and include additional targeted ablations (e.g., monitoring attention entropy across loops) to further support the primary role of the identified issues.
  Revision: partial
- Referee: §4 and §5 (Experimental Setup and Results): The reported 13.2% average performance improvement and stability up to 12 iterations lack details on experimental controls, number of random seeds, statistical significance testing, or correction for multiple comparisons across tasks and loop counts. Without these, it is unclear whether the gains are robust or whether the collapse of baselines is consistently reproduced.
  Authors: We agree that greater experimental rigor is required. The revised manuscript will report the number of random seeds (3–5 per setting), include standard deviations or error bars for all metrics, add statistical significance tests (e.g., paired t-tests against baselines), and clarify the primary comparisons of interest to address multiple-testing concerns. These additions will substantiate the robustness of the 13.2% gains and the consistent reproduction of baseline collapse at high loop counts.
  Revision: yes
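The paired t-test mentioned in the response can be sketched with the standard library alone. This is an illustrative example, not the authors' evaluation code: `paired_t_statistic` is a hypothetical helper, the per-seed accuracies are invented, and the sketch stops at the t statistic because the p-value lookup needs a Student t distribution that the standard library does not provide.

```python
import math
from statistics import mean, stdev

def paired_t_statistic(a, b):
    """Paired t statistic for matched samples a[i], b[i] (e.g. same seed).
    Returns (t, degrees_of_freedom). A p-value would come from the
    Student t distribution with that many degrees of freedom."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two equal-length samples with at least 2 pairs")
    diffs = [x - y for x, y in zip(a, b)]
    sd = stdev(diffs)
    if sd == 0.0:
        raise ValueError("differences have zero variance; t is undefined")
    t = mean(diffs) / (sd / math.sqrt(len(diffs)))
    return t, len(diffs) - 1

# Made-up per-seed accuracies (illustration only, not from the paper).
proposed = [0.52, 0.55, 0.53, 0.56, 0.54]
baseline = [0.48, 0.50, 0.50, 0.51, 0.49]
t, dof = paired_t_statistic(proposed, baseline)
print(f"t = {t:.3f}, dof = {dof}")
```

Pairing by seed matters here: with only 3 to 5 seeds per setting, an unpaired test would discard the shared initialization and data order that make the comparison sharper.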
Circularity Check
No significant circularity; claims rest on empirical validation of parameter-free modifications
The paper presents an empirical analysis identifying gradient oscillation and residual explosion as instability sources, followed by two parameter-free architectural modifications (Fully Looped Architecture and Attention Injection) whose effects are demonstrated through training runs up to 12 iterations and downstream task improvements of up to 13.2%. No equations, predictions, or first-principles derivations are shown that reduce by construction to fitted inputs, self-definitions, or self-citation chains. The central stability and performance claims are supported by external experimental benchmarks rather than internal redefinitions or renamings, rendering the derivation self-contained.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: standard assumptions about transformer training dynamics and gradient flow hold for looped variants.