
arxiv: 2605.18797 · v1 · pith:W5DTY6P2new · submitted 2026-05-11 · 💻 cs.LG · cs.AI

Simply Stabilizing the Loop via Fully Looped Transformer

Pith reviewed 2026-05-20 23:15 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords looped transformer · training stability · gradient oscillation · residual explosion · attention injection · iterative computation · model scaling · test-time compute

The pith

Two parameter-free changes to the looped transformer fix gradient oscillation and residual explosion to allow stable training at up to 12 iterations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that looped transformers, which reuse the same blocks repeatedly to gain performance without adding parameters, become unstable as the number of iterations grows. The authors trace the instability to oscillating gradients and exploding residuals, then introduce a fully looped architecture that spreads inter-loop signals across every layer and an attention injection step that reuses the existing attention block to dampen oscillations. These changes keep training stable at 12 loop iterations where prior looped models collapse. In regimes where baselines remain stable, the modified model still raises average downstream accuracy by as much as 13.2 percent. The result supplies a practical way to trade extra test-time computation for better performance while keeping parameter count fixed.

Core claim

The Fully Looped Transformer distributes inter-loop signals across all layers to prevent residual explosion and reuses the attention block to suppress gradient oscillation. These two parameter-free modifications stabilize training dynamics up to 12 loop iterations, whereas baseline looped models collapse in the same regime, and they raise average downstream-task performance by up to 13.2 percent in milder settings.

What carries the argument

Fully Looped Architecture combined with Attention Injection: the first distributes inter-loop signals across all layers, the second reuses the existing attention block, and together they target the two identified sources of instability.

If this is right

  • Model capacity can be increased by raising loop count at inference instead of widening or deepening the network.
  • Test-time compute can be varied after training by choosing different numbers of iterations without retraining.
  • Training succeeds in regimes where earlier looped designs fail, expanding the usable range of iteration counts.
  • Downstream accuracy improves even when both models train successfully, showing the fixes also aid optimization.
  • Parameter count stays constant while effective depth grows, offering a different scaling axis from standard transformers.
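
The scaling axis described in the bullets above can be illustrated with a toy sketch. It is not the paper's architecture: `ToyBlock`, `looped_forward`, and the tanh update are hypothetical stand-ins for a Transformer block, chosen only to show that the loop count changes effective depth while the parameter count stays fixed.

```python
import math
import random


class ToyBlock:
    """Hypothetical stand-in for one Transformer block (not the paper's layer)."""

    def __init__(self, dim, seed=0):
        rng = random.Random(seed)
        scale = 1.0 / math.sqrt(dim)
        # One dim x dim weight matrix, shared by every loop iteration.
        self.weight = [[rng.gauss(0.0, scale) for _ in range(dim)] for _ in range(dim)]

    def num_parameters(self):
        return sum(len(row) for row in self.weight)

    def __call__(self, x):
        # Residual update: x + tanh(W x).
        update = [math.tanh(sum(w * v for w, v in zip(row, x))) for row in self.weight]
        return [v + u for v, u in zip(x, update)]


def looped_forward(blocks, x, n_loops):
    """Apply the same stack of blocks n_loops times (weights are reused)."""
    for _ in range(n_loops):
        for block in blocks:
            x = block(x)
    return x


blocks = [ToyBlock(dim=4, seed=s) for s in range(2)]
total_params = sum(b.num_parameters() for b in blocks)
x0 = [1.0, -0.5, 0.25, 0.0]

out_3 = looped_forward(blocks, x0, n_loops=3)    # effective depth 2 * 3 = 6
out_12 = looped_forward(blocks, x0, n_loops=12)  # effective depth 2 * 12 = 24
print(total_params)  # 32, regardless of n_loops
```

Because `n_loops` is an argument of the forward pass rather than a property of the weights, the same toy model can be run with a different loop count at inference, which is the test-time compute knob the bullets refer to.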

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same signal-distribution and attention-reuse ideas could be tested in other iterative architectures such as recurrent networks or state-space models.
  • If the fixes generalize, training budgets could shift from adding parameters toward adding loop iterations at inference.
  • The approach may reduce the need for very deep unrolled networks by letting a shallow block be reused more reliably.
  • Similar lightweight modifications might address instability in other training regimes that suffer from repeated residual paths.

Load-bearing premise

Instability in looped transformers comes only from gradient oscillation and residual explosion, and the two proposed fixes remove those sources without introducing new instabilities.
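
To see why repeated residual paths can compound, consider a toy scalar recurrence. This is an illustration only, not the paper's analysis: if each loop applies x <- x + a * x with a shared gain a, the magnitude after n loops is (1 + a)^n. The function name `residual_norm_after` and the gain value are hypothetical.

```python
def residual_norm_after(n_loops, gain, x0=1.0):
    """Magnitude of x after n_loops of the toy update x <- x + gain * x."""
    x = x0
    for _ in range(n_loops):
        x = x + gain * x
    return abs(x)


# Growth is geometric in the loop count for any fixed positive gain.
for n in (3, 6, 9, 12):
    print(n, residual_norm_after(n, gain=0.5))
```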

What would settle it

Train the Fully Looped Transformer and a standard Looped Transformer with 12 loop iterations on the same data, then compare whether their loss curves remain stable or diverge.
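
One way to make "stable or diverge" operational is a simple check on the logged loss values. The sketch below is an assumption, not a criterion from the paper: `has_diverged`, the `blowup_factor` threshold, and the sample curves are hypothetical.

```python
import math


def has_diverged(losses, blowup_factor=2.0):
    """Return True if the loss becomes non-finite or exceeds
    blowup_factor times the best (lowest) loss seen so far."""
    best = math.inf
    for loss in losses:
        if not math.isfinite(loss):
            return True
        if loss > blowup_factor * best:
            return True
        best = min(best, loss)
    return False


# Made-up curves for illustration only (not values from the paper).
stable_curve = [5.0, 4.2, 3.9, 3.7, 3.6]
collapsed_curve = [5.0, 4.2, 3.9, 9.5, float("nan")]
print(has_diverged(stable_curve), has_diverged(collapsed_curve))  # False True
```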

Figures

Figures reproduced from arXiv: 2605.18797 by Hechang Chen, Jiankun Zhang, Jing Ma, Rao Fu, Yi Chang, Yu Li, Zixuan Yang.

Figure 1. The comparison of Looped Transformer (LT) and Fully Looped Transformer (FLT). FLT
Figure 2. Training dynamics of LT and FLT during the first 2000 optimizer steps.
Figure 3. Training dynamics comparison of the FLT variants and LT variants. All models compared
Figure 4. The loss of different base size models at 12-loop setting. All models except FLT collapsed. Smoothed with factor 0.9 for readability.

  Metrics       GQA     MLA     SWA     FA
  Wiki2 ppl ↓   39.68   38.76   38.91   41.12
  Val bpb ↓     0.897   0.904   0.895   0.895
  Core ↑        16.24   15.64   15.58   15.66

Figure 5. Test-time adaptation evaluation results of FLT. Models trained with 3, 6, 9, or 12 loop
Figure 6. Left: the residual norm at the 6th loop iteration. Middle: the residual norm at the 9th loop iteration. Right: the residual norm at the 12th loop iteration.
Figure 7. Left: the gradient norm of LM head block. Middle: the gradient norm of FFN at 5th layer. Right: the gradient norm of attention block at 5th layer. Figures 6 and 7 provide supplementary evidence for the diagnostic experiment in Section 3.
Figure 8. Trend chart of Core Metric changes for FLT with different attention variants throughout the
Figure 9. The training loss of FLT with different attention variants. Smoothed with factor 0.9.
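
The captions of Figures 4 and 9 mention smoothing with factor 0.9 but do not say which method is used. A common choice in training dashboards is an exponential moving average; the sketch below assumes that interpretation, and `ema_smooth` is a hypothetical helper.

```python
def ema_smooth(values, factor=0.9):
    """Exponential moving average: s_t = factor * s_{t-1} + (1 - factor) * v_t,
    initialised with the first value."""
    smoothed = []
    prev = None
    for v in values:
        prev = v if prev is None else factor * prev + (1.0 - factor) * v
        smoothed.append(prev)
    return smoothed


# Made-up noisy loss values for illustration only.
raw = [4.0, 6.0, 4.0, 6.0]
print(ema_smooth(raw))
```

A higher factor weights the history more heavily, so short spikes in the raw loss are damped; a reader comparing curves should keep in mind that heavy smoothing can also delay the visible onset of a collapse.
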
Original abstract

Scaling model performance typically requires increasing model size. Looped Transformer offers a compelling alternative by iteratively reusing the same Transformer blocks, trading additional computation for improved performance without increasing parameter count or context length. Because the number of loop iterations can be adjusted at inference, it also provides a natural mechanism for balancing performance and test-time compute. However, Looped Transformer still suffers from training instability when the number of loop iterations increases. Our analysis reveals that this instability stems from two sources: gradient oscillation and residual explosion. To address these two problems, we propose the Fully Looped Transformer, which introduces two parameter-free modifications: (1) Fully Looped Architecture, which distributes inter-loop signals across all layers to mitigate residual explosion; (2) Attention Injection, which reuses the existing attention block to suppress gradient oscillation. These modifications stabilize training dynamics, enabling the Fully Looped Transformer to be trained stably up to 12 loop iterations, whereas other baseline looped models collapse in this regime. In milder settings where Looped Transformer does not collapse, Fully Looped Transformer still improves average downstream-task performance by up to 13.2%. Overall, our experiments demonstrate that Fully Looped Transformer improves training stability, enhances downstream performance, and provides preliminary adaptability under different test-time compute budgets by varying loop iterations at inference.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces the Fully Looped Transformer to address training instability in Looped Transformers when increasing the number of loop iterations. It identifies gradient oscillation and residual explosion as the two primary sources of instability through analysis, then proposes two parameter-free modifications: (1) Fully Looped Architecture, which distributes inter-loop signals across all layers to mitigate residual explosion, and (2) Attention Injection, which reuses the existing attention block to suppress gradient oscillation. Experiments demonstrate stable training up to 12 loop iterations (where baselines collapse) and up to 13.2% average downstream-task performance gains in milder regimes, with preliminary evidence of adaptability by varying loop count at inference.

Significance. If the empirical stability and gains hold under rigorous controls, the work provides a practical, parameter-free route to deeper effective computation via looping without increasing model size or context length. This directly supports better performance-compute tradeoffs at test time. The explicit identification of instability sources and the parameter-free design are strengths that could influence efficient scaling research, though the completeness of the causal analysis remains a key open question for the central claims.

major comments (2)
  1. §3 (Instability Analysis): The assertion that gradient oscillation and residual explosion constitute the complete and primary causes of collapse at high loop counts is load-bearing for the design of the two fixes and the stability claim up to 12 iterations. The manuscript does not provide exhaustive ablations or theoretical arguments ruling out other mechanisms (e.g., attention pattern drift across iterations or depth-dependent regularization effects), leaving the attribution of success to these specific modifications incomplete.
  2. §4 and §5 (Experimental Setup and Results): The reported 13.2% average performance improvement and stability up to 12 iterations lack details on experimental controls, number of random seeds, statistical significance testing, or correction for multiple comparisons across tasks and loop counts. Without these, it is unclear whether the gains are robust or whether the collapse of baselines is consistently reproduced.
minor comments (2)
  1. Abstract and §2: The description of Attention Injection as 'reusing the existing attention block' would benefit from a precise equation or pseudocode showing how the injection is implemented without altering parameter count or introducing new learnable weights.
  2. Figure 2 or equivalent (Training Dynamics): The plots of gradient norms and residual magnitudes across loops would be clearer if they included error bars from multiple runs and direct comparison to the proposed fixes at each iteration count.
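
The controls requested in the comments above (multiple seeds, error bars, significance testing) could be sketched as follows. This is a minimal illustration using only the Python standard library; `paired_t_statistic` is a hypothetical helper, and the scores are placeholder values, not results from the paper.

```python
import math
import statistics


def paired_t_statistic(baseline, treatment):
    """t statistic for paired samples (same seeds): mean(d) / (sd(d) / sqrt(n))."""
    if len(baseline) != len(treatment) or len(baseline) < 2:
        raise ValueError("need two equal-length samples with at least 2 pairs")
    diffs = [t - b for b, t in zip(baseline, treatment)]
    sd = statistics.stdev(diffs)  # sample standard deviation (n - 1)
    if sd == 0.0:
        raise ValueError("differences have zero variance; t is undefined")
    return statistics.mean(diffs) / (sd / math.sqrt(len(diffs)))


# Placeholder scores for 4 seeds (illustrative values, NOT from the paper).
lt_scores = [10.0, 11.0, 12.0, 13.0]
flt_scores = [11.0, 13.0, 13.0, 15.0]

# Mean and sample standard deviation, usable as an error bar.
print(statistics.mean(flt_scores), statistics.stdev(flt_scores))
# Compare the statistic against a t table with n - 1 degrees of freedom.
print(paired_t_statistic(lt_scores, flt_scores))
```

Pairing by seed removes seed-to-seed variance that both models share, which matters when only a handful of runs per setting are affordable.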

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful and constructive feedback. We address each major comment point by point below, indicating where revisions will be made to strengthen the manuscript.

  1. Referee: §3 (Instability Analysis): The assertion that gradient oscillation and residual explosion constitute the complete and primary causes of collapse at high loop counts is load-bearing for the design of the two fixes and the stability claim up to 12 iterations. The manuscript does not provide exhaustive ablations or theoretical arguments ruling out other mechanisms (e.g., attention pattern drift across iterations or depth-dependent regularization effects), leaving the attribution of success to these specific modifications incomplete.

    Authors: We appreciate the referee's emphasis on the need for stronger causal attribution. Section 3 presents empirical gradient analysis and ablation results showing that gradient oscillation and residual explosion are the dominant instability sources in the high-iteration regime, with the proposed fixes directly mitigating them to enable stable training up to 12 iterations. We do not claim these are the sole possible mechanisms. In the revision we will expand the discussion to explicitly acknowledge alternative factors such as attention pattern drift and include additional targeted ablations (e.g., monitoring attention entropy across loops) to further support the primary role of the identified issues. revision: partial

  2. Referee: §4 and §5 (Experimental Setup and Results): The reported 13.2% average performance improvement and stability up to 12 iterations lack details on experimental controls, number of random seeds, statistical significance testing, or correction for multiple comparisons across tasks and loop counts. Without these, it is unclear whether the gains are robust or whether the collapse of baselines is consistently reproduced.

    Authors: We agree that greater experimental rigor is required. The revised manuscript will report the number of random seeds (3–5 per setting), include standard deviations or error bars for all metrics, add statistical significance tests (e.g., paired t-tests against baselines), and clarify the primary comparisons of interest to address multiple-testing concerns. These additions will substantiate the robustness of the 13.2% gains and the consistent reproduction of baseline collapse at high loop counts. revision: yes
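
The attention-entropy monitoring proposed in the first response could be sketched as follows. This is an assumption about what such a diagnostic might look like, not the authors' code: `attention_entropy` is a hypothetical helper that takes one row of attention weights (a probability distribution over keys) and returns its Shannon entropy in nats. The example rows are made up.

```python
import math


def attention_entropy(weights):
    """Shannon entropy (nats) of one attention row; weights must sum to 1."""
    if abs(sum(weights) - 1.0) > 1e-6:
        raise ValueError("attention weights must sum to 1")
    return -sum(w * math.log(w) for w in weights if w > 0.0)


# Hypothetical attention rows for the same query at two loop iterations.
rows_by_loop = {
    1: [0.25, 0.25, 0.25, 0.25],   # uniform: maximum entropy, log(4)
    12: [0.97, 0.01, 0.01, 0.01],  # sharply peaked: low entropy
}
for loop, row in rows_by_loop.items():
    print(loop, round(attention_entropy(row), 4))
```

Tracking this value per loop iteration would show whether attention patterns sharpen or flatten as iterations accumulate, which is the kind of drift the referee raises as an alternative mechanism.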

Circularity Check

0 steps flagged

No significant circularity; claims rest on empirical validation of parameter-free modifications


The paper presents an empirical analysis identifying gradient oscillation and residual explosion as instability sources, followed by two parameter-free architectural modifications (Fully Looped Architecture and Attention Injection) whose effects are demonstrated through training runs up to 12 iterations and downstream task improvements of up to 13.2%. No equations, predictions, or first-principles derivations are shown that reduce by construction to fitted inputs, self-definitions, or self-citation chains. The central stability and performance claims are supported by external experimental benchmarks rather than internal redefinitions or renamings, rendering the derivation self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on standard transformer training assumptions plus the untested premise that the observed instability sources are exhaustive and that the fixes are neutral with respect to other dynamics.

axioms (1)
  • domain assumption: Standard assumptions of transformer training dynamics and gradient flow hold for looped variants.
    Invoked when attributing instability to gradient oscillation and residual explosion.


