arxiv: 2606.18524 · v1 · pith:UXDRSHU7new · submitted 2026-06-16 · 💻 cs.LG

On the Residual Scaling of Looped Transformers: Stability and Transferability

Pith reviewed 2026-06-27 00:50 UTC · model grok-4.3

classification 💻 cs.LG
keywords looped transformers · residual scaling · weight sharing · training stability · hyperparameter transfer · transformer depth · recurrent transformers

The pith

Looped Transformers require residual scaling of 1/N to handle correlations from weight sharing, unlike the standard 1/sqrt(L) for unshared layers.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Looped Transformers reuse one residual block N times to build effective depth without adding parameters. Prior scaling rules that set the residual multiplier to 1/sqrt(L) fail here because weight sharing correlates the updates across the N iterations. The analysis shows that a stronger 1/N factor is needed to keep the updates stable. When the shared block itself contains L distinct layers, the combined rule becomes lambda / (N sqrt(L)). Because the two factors are separate, the best learning rate depends only on L and transfers directly when N increases.

Core claim

In looped residual networks the shared block f is applied N times via h ← h + epsilon f(h). Weight sharing induces correlations across iterations that the usual 1/sqrt(L) scaling does not cancel, so epsilon must be set to 1/N. For a block of L unique layers looped N times the parameterization factors as epsilon = lambda / (N sqrt(L)), with the 1/N term controlling loop-internal correlation and the 1/sqrt(L) term controlling cross-layer variance. The optimal learning rate therefore depends on L alone and can be transferred unchanged to any larger N.
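The update rule above can be sketched in a few lines of NumPy. This is an illustrative toy, not the paper's implementation: the layer f_l(h) = tanh(W_l h), the function names `residual_multiplier` and `looped_forward`, and the sizes d, L and N are all assumptions chosen for the example.

```python
import numpy as np


def residual_multiplier(lam: float, n_loops: int, n_layers: int) -> float:
    """Factored multiplier epsilon = lam / (N * sqrt(L))."""
    return lam / (n_loops * np.sqrt(n_layers))


def looped_forward(h, weights, n_loops, lam=1.0):
    """Apply a block of L unique layers N times, reusing the same weights.

    `weights` is a list of L square matrices. The toy layer
    f_l(h) = tanh(W_l @ h) is a stand-in (assumption), not the paper's block.
    """
    eps = residual_multiplier(lam, n_loops, len(weights))
    for _ in range(n_loops):      # N passes over the same block
        for w in weights:         # L unique layers inside the block
            h = h + eps * np.tanh(w @ h)
    return h


rng = np.random.default_rng(0)
d, n_layers, n_loops = 32, 4, 8   # hypothetical sizes
weights = [rng.normal(0.0, 1.0 / np.sqrt(d), size=(d, d)) for _ in range(n_layers)]
h0 = rng.normal(size=d)
h_out = looped_forward(h0, weights, n_loops)
print(residual_multiplier(1.0, n_loops, n_layers))  # 1 / (8 * sqrt(4)) = 0.0625
```

Keeping the multiplier in its own function makes the two factors explicit: changing N rescales only the 1/N term, and changing L rescales only the 1/sqrt(L) term.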

What carries the argument

The factored residual multiplier epsilon = lambda / (N sqrt(L)) that separates within-loop correlation from across-layer variance.

If this is right

  • 1/N scaling improves trainability and final loss compared with 1/sqrt(N) scaling for any loop count N.
  • The optimal learning rate can be chosen using only the number of unique layers L and reused for any larger N without retuning.
  • The two sources of growth (loop correlation and layer variance) can be controlled independently by the two factors in the scaling rule.
  • Hyperparameter search performed on small-N models transfers directly to large-N models of the same looped architecture.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same correlation argument could be tested on other weight-tied recurrent structures such as looped RNNs or state-space models.
  • If additional sources of correlation appear at extreme N, the scaling may need an extra term beyond the current factorization.
  • The result suggests that depth simulation by looping can be made as stable as explicit stacking once the scaling accounts for reuse.
  • Practitioners could first tune the learning rate at the target L with a small N and then scale N upward while keeping the learning rate fixed.

Load-bearing premise

The variance analysis treats weight-sharing correlations as the main source of instability and assumes no dominant confounding effects from the optimizer or other architecture details.

What would settle it

Train the same looped transformer at large N with epsilon set to 1/sqrt(N) versus 1/N and check whether the 1/sqrt(N) run shows exploding activations or higher final loss.
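A much cheaper stand-in for that experiment is a linear toy model, sketched below under loud assumptions: the block is a single random matrix W (no attention, no normalization, no training), and `norm_growth` is a hypothetical helper. It compares how far the hidden-state norm drifts after N steps for unshared weights with 1/sqrt(N), shared weights with 1/sqrt(N), and shared weights with 1/N.

```python
import numpy as np


def norm_growth(n_steps, eps, d=64, shared=True, seed=0):
    """Return ||h_N|| / ||h_0|| for the linear toy update h <- h + eps * W h."""
    rng = np.random.default_rng(seed)
    h = rng.normal(size=d)
    h0_norm = np.linalg.norm(h)
    w = rng.normal(0.0, 1.0 / np.sqrt(d), size=(d, d))
    for _ in range(n_steps):
        if not shared:  # fresh weights every step: the unshared baseline
            w = rng.normal(0.0, 1.0 / np.sqrt(d), size=(d, d))
        h = h + eps * (w @ h)
    return np.linalg.norm(h) / h0_norm


N = 256
unshared_sqrt = norm_growth(N, 1 / np.sqrt(N), shared=False)
shared_sqrt = norm_growth(N, 1 / np.sqrt(N), shared=True)
shared_inv_n = norm_growth(N, 1 / N, shared=True)
print(unshared_sqrt, shared_sqrt, shared_inv_n)
```

In this toy the shared-weight map is (I + eps W)^N, which stays bounded as N grows when eps = 1/N but not when eps = 1/sqrt(N). Whether the same separation appears in a trained Transformer is exactly what the proposed experiment would have to show.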

Original abstract

Looped (weight-tied) Transformers apply a shared residual block $N$ times ($h \leftarrow h + \varepsilon\,f(h)$, same $f$ at each step), increasing effective depth without adding parameters. Prior depth-scaling analyses prescribe $\varepsilon = 1/\!\sqrt{L}$ for depth-$L$ residual networks. We show that this is insufficient for looped architectures: weight sharing makes residual updates correlated across iterations, requiring the stronger scaling $\varepsilon = 1/N$. For multi-layer blocks ($L$ unique layers looped $N$ times), we derive a factored parameterization $\varepsilon = \lambda/(N\!\sqrt{L})$ that separates the two sources of growth: $1/N$ controls the within-layer loop correlation, and $1/\!\sqrt{L}$ controls the across-layer variance. A key consequence is that the optimal learning rate depends only on the number of unique layers $L$, not on the loop count $N$, enabling direct hyperparameter transfer from small to large $N$ without retuning. Experiments on looped Transformers confirm that $1/N$ scaling improves trainability and yields better loss than $1/\!\sqrt{N}$ scaling across loop counts.
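A heuristic way to see where the two exponents come from, under simplifying assumptions that are not taken from the paper: treat each update $u_t = f(h_t)$ as a scalar with variance $\sigma^2$ and a common pairwise correlation $\rho$, and in the multi-layer case treat repeats of the same layer as fully correlated and distinct layers as uncorrelated.

```latex
\begin{align*}
\operatorname{Var}\!\Big(\varepsilon \sum_{t=1}^{N} u_t\Big)
  &= \varepsilon^{2}\sigma^{2}\big[\,N + N(N-1)\rho\,\big], \\
\rho = 0 &:\quad \varepsilon^{2}\sigma^{2} N = O(1) \iff \varepsilon \propto 1/\sqrt{N}, \\
\rho = 1 &:\quad \varepsilon^{2}\sigma^{2} N^{2} = O(1) \iff \varepsilon \propto 1/N, \\
\operatorname{Var}\!\Big(\varepsilon \sum_{\ell=1}^{L}\sum_{n=1}^{N} u_{\ell,n}\Big)
  &= \varepsilon^{2}\sigma^{2} L N^{2}
  \;\overset{\varepsilon = \lambda/(N\sqrt{L})}{=}\; \lambda^{2}\sigma^{2}.
\end{align*}
```

Independent updates accumulate variance linearly in $N$, fully correlated updates accumulate it quadratically, and the factored multiplier cancels both the $N^2$ and the $L$ growth. Intermediate values of $\rho$ interpolate between the two regimes.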

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript claims that standard residual scaling ε=1/√L is insufficient for looped (weight-tied) Transformers because weight sharing induces correlations across iterations in the residual updates, requiring the stronger ε=1/N. For multi-layer blocks (L unique layers looped N times) it derives the factored form ε=λ/(N√L) that separates loop-induced correlation (1/N) from layer variance (1/√L). A key consequence is that optimal learning rates depend only on L, not N, enabling direct hyperparameter transfer across loop counts. Experiments on looped Transformers are reported to confirm that 1/N scaling improves trainability and final loss relative to 1/√N scaling.

Significance. If the derivation and supporting experiments hold, the work supplies a theoretically motivated scaling rule that stabilizes looped architectures and permits hyperparameter transfer from small to large N at fixed L. This factorization of the two sources of growth could facilitate efficient scaling of effective depth without parameter growth, with relevance to parameter-efficient deep models in language modeling and related domains.

major comments (2)
  1. [Derivation of ε=1/N and ε=λ/(N√L)] The variance analysis deriving the 1/N factor (abstract and derivation) rests on the assumption of stationary, full pairwise correlation between successive applications of the shared f. Because each step updates the hidden state (h ← h + ε f(h)), the inputs to f become non-stationary; any non-linearity can cause correlation between f(h_t) and f(h_{t+k}) to decay with k. The manuscript must either bound this decay or demonstrate that the quadratic variance accumulation still holds; otherwise the stronger 1/N (and the claim that optimal LR depends only on L) may not be required.
  2. [Experiments confirming 1/N scaling and transferability] The central claim that optimal learning rate depends only on L (not N) is load-bearing for the transferability result. The experiments must report explicit controls that vary L and N independently, together with quantitative metrics (loss values, training curves, or tables) showing that a learning rate tuned at one N transfers without retuning to other N at fixed L.
minor comments (2)
  1. [Abstract] The abstract states that prior 1/√L scaling is 'insufficient' and that experiments 'confirm' the new rule, but supplies no dataset names, model dimensions, or loop counts; a one-sentence addition would improve readability.
  2. [Derivation] Notation for the free parameter λ should be introduced explicitly when the factored form is first stated, and its range or selection procedure clarified.
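The correlation decay raised in major comment 1 can be probed directly. The sketch below is one possible diagnostic on a toy block, not the paper's experiment: the shared block f(h) = tanh(W h), the helper name `update_cosines`, and the sizes are assumptions. It records the cosine similarity between the first update f(h_0) and each later update f(h_t).

```python
import numpy as np


def update_cosines(n_loops, eps, d=64, seed=0):
    """Cosine similarity between f(h_0) and f(h_t) for t = 0..N-1."""
    rng = np.random.default_rng(seed)
    w = rng.normal(0.0, 1.0 / np.sqrt(d), size=(d, d))
    h = rng.normal(size=d)
    updates = []
    for _ in range(n_loops):
        u = np.tanh(w @ h)    # toy shared block f (assumption)
        updates.append(u)
        h = h + eps * u
    updates = np.stack(updates)
    unit = updates / np.linalg.norm(updates, axis=1, keepdims=True)
    return unit @ unit[0]


N = 64
cos = update_cosines(N, eps=1.0 / N)
print(cos[[0, 1, N // 2, N - 1]])
```

Cosines that stay close to 1 across the loop would be consistent with the near-full correlation the derivation assumes; a rapid fall toward 0 would point to the weaker accumulation the referee describes. Running the same measurement on the actual looped Transformer, before and after training, would be the relevant evidence.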

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major point below and indicate where revisions will be made to strengthen the manuscript.

Point-by-point responses
  1. Referee: [Derivation of ε=1/N and ε=λ/(N√L)] The variance analysis deriving the 1/N factor (abstract and derivation) rests on the assumption of stationary, full pairwise correlation between successive applications of the shared f. Because each step updates the hidden state (h ← h + ε f(h)), the inputs to f become non-stationary; any non-linearity can cause correlation between f(h_t) and f(h_{t+k}) to decay with k. The manuscript must either bound this decay or demonstrate that the quadratic variance accumulation still holds; otherwise the stronger 1/N (and the claim that optimal LR depends only on L) may not be required.

    Authors: The derivation of the 1/N factor is obtained by modeling the variance of the accumulated residual updates under the correlation structure induced by weight sharing. While non-stationarity can in principle cause correlation decay, the shared parameters ensure that the updates remain correlated across iterations in a manner that leads to quadratic (rather than linear) growth in variance with N, that is, linear rather than square-root growth in the standard deviation. To address the concern, we will revise the manuscript to include an empirical verification that the quadratic variance accumulation holds in the looped Transformer setting across the range of N considered, thereby supporting that the stronger scaling remains necessary for stability. revision: yes

  2. Referee: [Experiments confirming 1/N scaling and transferability] The central claim that optimal learning rate depends only on L (not N) is load-bearing for the transferability result. The experiments must report explicit controls that vary L and N independently, together with quantitative metrics (loss values, training curves, or tables) showing that a learning rate tuned at one N transfers without retuning to other N at fixed L.

    Authors: The current experiments vary N while holding L fixed and compare scaling rules, but do not include a full set of independent L/N controls with explicit transfer metrics. We will revise the experimental section to add tables and curves that vary L and N independently, reporting final losses and training dynamics to demonstrate that a learning rate selected at one N transfers directly to other values of N at the same L without retuning. revision: yes
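One way to organize the promised L/N controls is sketched below. Everything here is hypothetical scaffolding: the helper names `sweep_configs` and `transfer_gap`, the grid values, and the choice of "loss penalty from reusing the small-N learning rate" as the transfer metric are assumptions, not details from the manuscript.

```python
import itertools
import math


def sweep_configs(layer_counts, loop_counts, learning_rates, lam=1.0):
    """Enumerate every (L, N, lr) combination with its residual multiplier."""
    configs = []
    for n_layers, n_loops, lr in itertools.product(
        layer_counts, loop_counts, learning_rates
    ):
        configs.append({
            "L": n_layers,
            "N": n_loops,
            "lr": lr,
            "eps": lam / (n_loops * math.sqrt(n_layers)),
        })
    return configs


def transfer_gap(losses, n_layers, n_small, n_large):
    """Loss penalty at n_large from reusing the lr tuned at n_small (same L).

    `losses` maps (L, N, lr) -> final loss measured by the experimenter.
    """
    lrs = sorted({lr for (ll, n, lr) in losses if ll == n_layers and n == n_small})
    best_small = min(lrs, key=lambda lr: losses[(n_layers, n_small, lr)])
    best_large = min(losses[(n_layers, n_large, lr)] for lr in lrs)
    return losses[(n_layers, n_large, best_small)] - best_large


# Hypothetical grid: 3 layer counts x 3 loop counts x 3 learning rates.
configs = sweep_configs([1, 2, 4], [4, 16, 64], [1e-3, 3e-3, 1e-2])
print(len(configs))  # 27
```

A transfer gap near zero at every fixed L would support the claim that the optimal learning rate does not depend on N; a gap that grows with N would count against it.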

Circularity Check

0 steps flagged

Variance derivation of 1/N scaling is independent of fitted inputs or self-citations

full rationale

The paper presents ε = 1/N and the factored form ε = λ/(N√L) as results of a variance analysis on correlated residual updates induced by weight sharing. No quoted equations reduce the claimed scaling to a fitted parameter renamed as prediction, a self-citation chain, or a self-definitional loop. The derivation is framed as first-principles analysis of update correlations and is therefore self-contained; external benchmarks or code reproduction would be needed to assess correctness but are outside the circularity criteria.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

Ledger constructed from abstract only; full derivation unavailable.

free parameters (1)
  • λ
    Scaling prefactor appearing in the multi-layer looped formula; its status (derived constant or fitted) is not specified in the abstract.
axioms (1)
  • domain assumption: Weight sharing in looped transformers induces correlated residual updates across iterations that dominate stability requirements.
    This premise is invoked to conclude that prior 1/sqrt(L) scaling is insufficient and 1/N is required.


