On the Residual Scaling of Looped Transformers: Stability and Transferability
Pith reviewed 2026-06-27 00:50 UTC · model grok-4.3
The pith
Looped Transformers require residual scaling of 1/N to handle correlations from weight sharing, unlike the standard 1/sqrt(L) for unshared layers.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
In looped residual networks the shared block f is applied N times via h ← h + epsilon f(h). Weight sharing induces correlations across iterations that the usual 1/sqrt(L) scaling does not cancel, so epsilon must be set to 1/N. For a block of L unique layers looped N times the parameterization factors as epsilon = lambda / (N sqrt(L)), with the 1/N term controlling loop-internal correlation and the 1/sqrt(L) term controlling cross-layer variance. The optimal learning rate therefore depends on L alone and can be transferred unchanged to any larger N.
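To make the update rule concrete, here is a minimal NumPy sketch of a looped residual forward pass using the factored multiplier. The toy layer f(h) = tanh(h @ W), the function names, and the dimensions are assumptions for illustration only; the paper's block is a Transformer layer, which this does not reproduce.

```python
import numpy as np

def residual_multiplier(num_loops, num_unique_layers, lam=1.0):
    """epsilon = lambda / (N * sqrt(L)), the factored form stated in the core claim."""
    return lam / (num_loops * np.sqrt(num_unique_layers))

def looped_forward(h, weights, num_loops, lam=1.0):
    """Apply a block of L unique layers N times, reusing the same weights.

    Each layer is a toy stand-in f(h) = tanh(h @ W). This is an assumption
    for illustration, not the paper's Transformer block.
    """
    eps = residual_multiplier(num_loops, len(weights), lam)
    for _ in range(num_loops):      # N passes over the same block
        for W in weights:           # L unique layers per pass
            h = h + eps * np.tanh(h @ W)
    return h

rng = np.random.default_rng(0)
d, L, N = 16, 4, 8                  # hypothetical sizes
weights = [rng.normal(0.0, 1.0 / np.sqrt(d), size=(d, d)) for _ in range(L)]
h0 = rng.normal(size=(2, d))
h_out = looped_forward(h0, weights, N)
```

Only the loop count N and the number of unique layers L enter the multiplier, so changing N rescales the residual branch without touching the weights.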
What carries the argument
The factored residual multiplier epsilon = lambda / (N sqrt(L)) that separates within-loop correlation from across-layer variance.
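An idealized variance calculation shows why the two factors scale differently. The sketch below assumes that the N updates u_t = f(h_t) have zero mean, a common variance sigma^2, and a constant pairwise correlation rho; these are assumptions of this illustration and may be simpler than the setting the paper analyzes.

```latex
\operatorname{Var}\!\Big(\varepsilon \sum_{t=1}^{N} u_t\Big)
  = \varepsilon^{2}\sigma^{2}\,\big[\,N + N(N-1)\,\rho\,\big]
\quad\Longrightarrow\quad
\begin{cases}
\rho = 0: & \varepsilon^{2}\sigma^{2} N,
  \ \text{so } \varepsilon = 1/\sqrt{N} \text{ keeps the sum } O(1),\\[2pt]
\rho = 1: & \varepsilon^{2}\sigma^{2} N^{2},
  \ \text{so } \varepsilon = 1/N \text{ keeps the sum } O(1).
\end{cases}
```

The uncorrelated case corresponds to stacking independent layers, and the fully correlated case to the extreme of reusing one block whose updates all point the same way.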
If this is right
- 1/N scaling improves trainability and final loss compared with 1/sqrt(N) scaling for any loop count N.
- The optimal learning rate can be chosen using only the number of unique layers L and reused for any larger N without retuning.
- The two sources of growth (loop correlation and layer variance) can be controlled independently by the two factors in the scaling rule.
- Hyperparameter search performed on small-N models transfers directly to large-N models of the same looped architecture.
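The transfer claim in the list above can be sketched as a sweep that holds the learning rate fixed and recomputes only the residual multiplier for each loop count. The function name, the dictionary keys, and the value 3e-3 are hypothetical placeholders, not settings reported by the paper.

```python
import math

def transfer_configs(num_unique_layers, tuned_lr, loop_counts, lam=1.0):
    """Build run configs that reuse one learning rate across loop counts N."""
    configs = []
    for n in loop_counts:
        configs.append({
            "L": num_unique_layers,
            "N": n,
            "lr": tuned_lr,  # held fixed: claimed to depend on L only
            "epsilon": lam / (n * math.sqrt(num_unique_layers)),
        })
    return configs

# Hypothetical learning rate, assumed to have been tuned at small N.
runs = transfer_configs(num_unique_layers=4, tuned_lr=3e-3, loop_counts=[2, 8, 32])
```

If the claim holds, the best-performing learning rate at N = 2 should remain close to optimal at N = 32 for the same L.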
Where Pith is reading between the lines
- The same correlation argument could be tested on other weight-tied recurrent structures such as looped RNNs or state-space models.
- If additional sources of correlation appear at extreme N, the scaling may need an extra term beyond the current factorization.
- The result suggests that depth simulation by looping can be made as stable as explicit stacking once the scaling accounts for reuse.
- Practitioners could first tune L on a small model and then scale N upward while keeping the learning rate fixed.
Load-bearing premise
The variance analysis treats weight-sharing correlations as the main source of instability and assumes no dominant confounding effects from the optimizer or other architecture details.
What would settle it
Train the same looped transformer at large N with epsilon set to 1/sqrt(N) versus 1/N and check whether the 1/sqrt(N) run shows exploding activations or higher final loss.
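A toy version of this comparison can be run without any training. The sketch below measures activation growth for a shared linear map under the two scalings. The linear block, the sizes, and the absence of normalization are all assumptions of this illustration, so it shows only the qualitative effect and says nothing about final loss in a real looped Transformer.

```python
import numpy as np

def final_norm_ratio(num_loops, eps, W, h0):
    """Return ||h_N|| / ||h_0|| after N steps of h <- h + eps * (h @ W)."""
    h = h0.copy()
    for _ in range(num_loops):
        h = h + eps * (h @ W)
    return float(np.linalg.norm(h) / np.linalg.norm(h0))

rng = np.random.default_rng(0)
d = 32                                               # hypothetical width
W = rng.normal(0.0, 1.0 / np.sqrt(d), size=(d, d))   # shared toy block
h0 = rng.normal(size=(8, d))

for N in (4, 16, 64, 256):
    r_inv = final_norm_ratio(N, 1.0 / N, W, h0)
    r_sqrt = final_norm_ratio(N, 1.0 / np.sqrt(N), W, h0)
    print(f"N={N:4d}  1/N: {r_inv:8.3f}   1/sqrt(N): {r_sqrt:12.3e}")
```

In this linear setting the 1/N run approaches the bounded map exp(W) as N grows, while the 1/sqrt(N) run compounds the same direction repeatedly and its norm keeps growing with N.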
Original abstract
Looped (weight-tied) Transformers apply a shared residual block $N$ times ($h \leftarrow h + \varepsilon\,f(h)$, same $f$ at each step), increasing effective depth without adding parameters. Prior depth-scaling analyses prescribe $\varepsilon = 1/\!\sqrt{L}$ for depth-$L$ residual networks. We show that this is insufficient for looped architectures: weight sharing makes residual updates correlated across iterations, requiring the stronger scaling $\varepsilon = 1/N$. For multi-layer blocks ($L$ unique layers looped $N$ times), we derive a factored parameterization $\varepsilon = \lambda/(N\!\sqrt{L})$ that separates the two sources of growth: $1/N$ controls the within-layer loop correlation, and $1/\!\sqrt{L}$ controls the across-layer variance. A key consequence is that the optimal learning rate depends only on the number of unique layers $L$, not on the loop count $N$, enabling direct hyperparameter transfer from small to large $N$ without retuning. Experiments on looped Transformers confirm that $1/N$ scaling improves trainability and yields better loss than $1/\!\sqrt{N}$ scaling across loop counts.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript claims that standard residual scaling ε=1/√L is insufficient for looped (weight-tied) Transformers because weight sharing induces correlations across iterations in the residual updates, requiring the stronger ε=1/N. For multi-layer blocks (L unique layers looped N times) it derives the factored form ε=λ/(N√L) that separates loop-induced correlation (1/N) from layer variance (1/√L). A key consequence is that optimal learning rates depend only on L, not N, enabling direct hyperparameter transfer across loop counts. Experiments on looped Transformers are reported to confirm that 1/N scaling improves trainability and final loss relative to 1/√N scaling.
Significance. If the derivation and supporting experiments hold, the work supplies a theoretically motivated scaling rule that stabilizes looped architectures and permits hyperparameter transfer from small to large N at fixed L. This factorization of the two sources of growth could facilitate efficient scaling of effective depth without parameter growth, with relevance to parameter-efficient deep models in language modeling and related domains.
major comments (2)
- [Derivation of ε=1/N and ε=λ/(N√L)] The variance analysis deriving the 1/N factor (abstract and derivation) rests on the assumption of stationary, full pairwise correlation between successive applications of the shared f. Because each step updates the hidden state (h ← h + ε f(h)), the inputs to f become non-stationary; any non-linearity can cause correlation between f(h_t) and f(h_{t+k}) to decay with k. The manuscript must either bound this decay or demonstrate that the quadratic variance accumulation still holds; otherwise the stronger 1/N (and the claim that optimal LR depends only on L) may not be required.
- [Experiments confirming 1/N scaling and transferability] The central claim that optimal learning rate depends only on L (not N) is load-bearing for the transferability result. The experiments must report explicit controls that vary L and N independently, together with quantitative metrics (loss values, training curves, or tables) showing that a learning rate tuned at one N transfers without retuning to other N at fixed L.
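The concern in the first major comment can be made precise with a lag-dependent correlation. As a simplification for illustration, assume the updates u_t have common variance sigma^2 and that the correlation between u_t and u_{t+k} depends only on the lag k, written rho_k.

```latex
\operatorname{Var}\!\Big(\varepsilon \sum_{t=1}^{N} u_t\Big)
  = \varepsilon^{2}\sigma^{2}\Big[\,N + 2\sum_{k=1}^{N-1}(N-k)\,\rho_k\Big],
\qquad
\begin{cases}
\rho_k = 1: & \text{bracket} = N^{2},\\[2pt]
\rho_k = r^{k},\ 0 \le r < 1: & \text{bracket} \sim N\,\dfrac{1+r}{1-r}
  \quad (N \to \infty).
\end{cases}
```

Under geometric decay the accumulation is only linear in N, in which case 1/sqrt(N) scaling would already suffice asymptotically. The quadratic regime, and with it the need for 1/N, requires correlations that do not decay summably.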
minor comments (2)
- [Abstract] The abstract states that prior 1/√L scaling is 'insufficient' and that experiments 'confirm' the new rule, but supplies no dataset names, model dimensions, or loop counts; a one-sentence addition would improve readability.
- [Derivation] Notation for the free parameter λ should be introduced explicitly when the factored form is first stated, and its range or selection procedure clarified.
Simulated Author's Rebuttal
We thank the referee for the constructive comments. We address each major point below and indicate where revisions will be made to strengthen the manuscript.
Point-by-point responses
-
Referee: [Derivation of ε=1/N and ε=λ/(N√L)] The variance analysis deriving the 1/N factor (abstract and derivation) rests on the assumption of stationary, full pairwise correlation between successive applications of the shared f. Because each step updates the hidden state (h ← h + ε f(h)), the inputs to f become non-stationary; any non-linearity can cause correlation between f(h_t) and f(h_{t+k}) to decay with k. The manuscript must either bound this decay or demonstrate that the quadratic variance accumulation still holds; otherwise the stronger 1/N (and the claim that optimal LR depends only on L) may not be required.
Authors: We derive the 1/N factor by modeling the variance of the accumulated residual updates under the correlation structure induced by weight sharing. Non-stationarity can in principle cause the correlation to decay, but the shared parameters keep the updates correlated across iterations, so the magnitude of the accumulated update grows linearly in N rather than as sqrt(N); equivalently, its variance grows quadratically rather than linearly. To address the concern, we will revise the manuscript to include an empirical verification that this quadratic variance accumulation holds in the looped Transformer setting across the range of N considered, supporting the claim that the stronger scaling remains necessary for stability. revision: yes
-
Referee: [Experiments confirming 1/N scaling and transferability] The central claim that optimal learning rate depends only on L (not N) is load-bearing for the transferability result. The experiments must report explicit controls that vary L and N independently, together with quantitative metrics (loss values, training curves, or tables) showing that a learning rate tuned at one N transfers without retuning to other N at fixed L.
Authors: The current experiments vary N while holding L fixed and compare scaling rules, but do not include a full set of independent L/N controls with explicit transfer metrics. We will revise the experimental section to add tables and curves that vary L and N independently, reporting final losses and training dynamics to demonstrate that a learning rate selected at one N transfers directly to other values of N at the same L without retuning. revision: yes
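The empirical check discussed in the first response, whether updates stay correlated across iterations, could be probed along the following lines. This is a sketch on a toy shared block f(h) = tanh(h @ W), which is an assumption of the example; the function name and sizes are hypothetical and the result says nothing about a trained Transformer.

```python
import numpy as np

def update_cosines(W, h0, num_loops, eps):
    """Cosine similarity between the first update f(h_0) and each later f(h_t)."""
    h = h0.copy()
    updates = []
    for _ in range(num_loops):
        u = np.tanh(h @ W)          # toy shared block (assumption)
        updates.append(u)
        h = h + eps * u
    first = updates[0]
    cosines = []
    for u in updates:
        num = np.sum(first * u, axis=-1)
        den = np.linalg.norm(first, axis=-1) * np.linalg.norm(u, axis=-1)
        cosines.append(np.mean(num / den))   # average over the batch
    return np.array(cosines)

rng = np.random.default_rng(0)
d, N = 32, 32                                # hypothetical sizes
W = rng.normal(0.0, 1.0 / np.sqrt(d), size=(d, d))
h0 = rng.normal(size=(16, d))
cos = update_cosines(W, h0, N, eps=1.0 / N)
```

Plotting cos against the iteration index would show whether the correlation stays near 1 or decays with lag, which is the distinction the referee asks the authors to establish.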
Circularity Check
Variance derivation of 1/N scaling is independent of fitted inputs or self-citations
full rationale
The paper presents ε = 1/N and the factored form ε = λ/(N√L) as results of a variance analysis on correlated residual updates induced by weight sharing. No quoted equations reduce the claimed scaling to a fitted parameter renamed as prediction, a self-citation chain, or a self-definitional loop. The derivation is framed as first-principles analysis of update correlations and is therefore self-contained; external benchmarks or code reproduction would be needed to assess correctness but are outside the circularity criteria.
Axiom & Free-Parameter Ledger
free parameters (1)
- λ
axioms (1)
- domain assumption: Weight sharing in looped transformers induces correlated residual updates across iterations that dominate stability requirements.