On the Residual Scaling of Looped Transformers: Stability and Transferability

Bingrui Li; Ge Zhang; Jian Li; Shaowen Wang; Shen Yan; Wenhao Huang

arxiv: 2606.18524 · v1 · pith:UXDRSHU7new · submitted 2026-06-16 · 💻 cs.LG

Shaowen Wang, Bingrui Li, Ge Zhang, Wenhao Huang, Shen Yan, Jian Li

Pith reviewed 2026-06-27 00:50 UTC · model grok-4.3

classification 💻 cs.LG

keywords looped transformers · residual scaling · weight sharing · training stability · hyperparameter transfer · transformer depth · recurrent transformers

The pith

Looped Transformers require residual scaling of 1/N to handle correlations from weight sharing, unlike the standard 1/sqrt(L) for unshared layers.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Looped Transformers reuse one residual block N times to build effective depth without adding parameters. Prior scaling rules that set the residual multiplier to 1/sqrt(L) fail here because weight sharing correlates the updates across the N iterations. The analysis shows that a stronger 1/N factor is needed to keep the updates stable. When the shared block itself contains L distinct layers, the combined rule becomes lambda / (N sqrt(L)). Because the multiplier absorbs all of the dependence on N, the best learning rate depends only on L and transfers directly when N increases.

Core claim

In looped residual networks the shared block f is applied N times via h ← h + ε f(h). Weight sharing induces correlations across iterations that the usual 1/sqrt(L) scaling does not cancel, so ε must be set to 1/N. For a block of L unique layers looped N times the parameterization factors as ε = λ / (N sqrt(L)), with the 1/N term controlling loop-internal correlation and the 1/sqrt(L) term controlling cross-layer variance. The optimal learning rate therefore depends on L alone and can be transferred unchanged to any larger N.
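The update rule and the factored multiplier can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the block `f` is a toy `tanh(h @ w)` map standing in for a Transformer layer, and the names `residual_multiplier` and `looped_forward`, the width, and the default λ = 1 are hypothetical choices, not taken from the paper.

```python
import numpy as np

def residual_multiplier(num_loops, num_layers, lam=1.0):
    """Factored multiplier eps = lam / (N * sqrt(L)) from the core claim."""
    return lam / (num_loops * np.sqrt(num_layers))

def looped_forward(h, layers, num_loops, lam=1.0):
    """Apply a block of L distinct layers N times, reusing the same weights."""
    eps = residual_multiplier(num_loops, len(layers), lam)
    for _ in range(num_loops):            # N passes over the shared block
        for w in layers:                  # L unique layers inside the block
            # Toy stand-in for f (assumption): a single tanh layer.
            h = h + eps * np.tanh(h @ w)  # h <- h + eps * f(h)
    return h

rng = np.random.default_rng(0)
width, L, N = 64, 2, 8
layers = [rng.standard_normal((width, width)) / np.sqrt(width) for _ in range(L)]
h0 = rng.standard_normal((4, width))
h_out = looped_forward(h0, layers, N)
print(residual_multiplier(N, L), h_out.shape)
```

Doubling `num_loops` halves the multiplier while leaving everything else in the configuration untouched, which is the sense in which the rule is meant to let other hyperparameters carry over to larger N.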

What carries the argument

The factored residual multiplier ε = λ / (N sqrt(L)) that separates within-loop correlation from across-layer variance.

If this is right

1/N scaling improves trainability and final loss compared with 1/sqrt(N) scaling for any loop count N.
The optimal learning rate can be chosen using only the number of unique layers L and reused for any larger N without retuning.
The two sources of growth (loop correlation and layer variance) can be controlled independently by the two factors in the scaling rule.
Hyperparameter search performed on small-N models transfers directly to large-N models of the same looped architecture.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same correlation argument could be tested on other weight-tied recurrent structures such as looped RNNs or state-space models.
If additional sources of correlation appear at extreme N, the scaling may need an extra term beyond the current factorization.
The result suggests that depth simulation by looping can be made as stable as explicit stacking once the scaling accounts for reuse.
Practitioners could first tune the learning rate at the target L with a small loop count and then scale N upward while keeping that learning rate fixed.

Load-bearing premise

The variance analysis treats weight-sharing correlations as the main source of instability and assumes no dominant confounding effects from the optimizer or other architecture details.

What would settle it

Train the same looped transformer at large N with epsilon set to 1/sqrt(N) versus 1/N and check whether the 1/sqrt(N) run shows exploding activations or higher final loss.
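A much cheaper proxy for this test can be run without any training. The sketch below is a toy under strong assumptions, not the paper's experiment: it replaces f with a shared linear map `h @ w`, where `w` is symmetric positive semidefinite with largest eigenvalue 1, so that every iteration pushes in the same direction (the fully correlated worst case). The function name `growth_ratio` and all sizes are hypothetical.

```python
import numpy as np

def growth_ratio(eps, num_loops, w, h0):
    """Return ||h_N|| / ||h_0|| after N applications of h <- h + eps * (h @ w)."""
    h = h0.copy()
    for _ in range(num_loops):
        h = h + eps * (h @ w)
    return np.linalg.norm(h) / np.linalg.norm(h0)

rng = np.random.default_rng(0)
width = 64
a = rng.standard_normal((width, width))
w = a @ a.T                    # symmetric PSD: updates never cancel (assumption)
w /= np.linalg.norm(w, 2)      # normalise the largest eigenvalue to 1
h0 = rng.standard_normal((16, width))

for n in (16, 64, 256):
    r_sqrt = growth_ratio(1 / np.sqrt(n), n, w, h0)
    r_inv = growth_ratio(1 / n, n, w, h0)
    print(f"N={n:4d}  eps=1/sqrt(N): {r_sqrt:.3e}   eps=1/N: {r_inv:.3f}")
```

In this linear setting the growth along the top eigenvector is (1 + ε)^N, which stays below e for ε = 1/N and grows roughly like exp(sqrt(N)) for ε = 1/sqrt(N). Whether a trained, non-linear Transformer block behaves the same way is exactly what the proposed training comparison would have to show.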

Original abstract

Looped (weight-tied) Transformers apply a shared residual block $N$ times ($h \leftarrow h + \varepsilon\,f(h)$, same $f$ at each step), increasing effective depth without adding parameters. Prior depth-scaling analyses prescribe $\varepsilon = 1/\!\sqrt{L}$ for depth-$L$ residual networks. We show that this is insufficient for looped architectures: weight sharing makes residual updates correlated across iterations, requiring the stronger scaling $\varepsilon = 1/N$. For multi-layer blocks ($L$ unique layers looped $N$ times), we derive a factored parameterization $\varepsilon = \lambda/(N\!\sqrt{L})$ that separates the two sources of growth: $1/N$ controls the within-layer loop correlation, and $1/\!\sqrt{L}$ controls the across-layer variance. A key consequence is that the optimal learning rate depends only on the number of unique layers $L$, not on the loop count $N$, enabling direct hyperparameter transfer from small to large $N$ without retuning. Experiments on looped Transformers confirm that $1/N$ scaling improves trainability and yields better loss than $1/\!\sqrt{N}$ scaling across loop counts.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

Looped transformers need an extra 1/N residual factor on top of the usual 1/sqrt(L) because weight sharing correlates the updates, and this makes optimal LR depend only on the number of unique layers.

The paper's central result is a derivation showing that standard residual scaling breaks for looped (weight-tied) blocks. Because the same f is applied repeatedly, the residual updates are correlated, so the variance of the accumulated update grows with N squared rather than N; that forces the stronger ε = 1/N term. They factor it as ε = λ/(N sqrt(L)) so the two sources of growth stay separate, and the practical payoff is that you can pick the learning rate from a small-N run and keep it when you increase N at the same L.
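The N-versus-N-squared statement follows from a textbook variance identity. The worked step below is a sketch under an idealised model that is an assumption here, not the paper's derivation: the per-iteration updates u_t = f(h_t) are treated as scalars with common variance σ² and one constant pairwise correlation ρ.

```latex
% Idealised model (assumption): equal variances \sigma^2, constant correlation \rho.
\[
S_N = \varepsilon \sum_{t=1}^{N} u_t,
\qquad
\operatorname{Var}(S_N)
  = \varepsilon^2 \sigma^2 \bigl[\, N + N(N-1)\,\rho \,\bigr].
\]
\[
\rho = 0:\quad \operatorname{Var}(S_N) = \varepsilon^2 \sigma^2 N
  \;\Rightarrow\; \varepsilon = 1/\sqrt{N} \ \text{keeps it } O(1);
\]
\[
\rho = 1:\quad \operatorname{Var}(S_N) = \varepsilon^2 \sigma^2 N^2
  \;\Rightarrow\; \varepsilon = 1/N \ \text{keeps it } O(1).
\]
```

For any fixed ρ > 0 the N² term dominates at large N, so in this model slower growth is only possible if the correlation decays with the lag between iterations.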

The derivation comes from a correlation analysis of the updates rather than post-hoc fitting, and the experiments test the claim directly by comparing 1/N against 1/sqrt(N) scaling across loop counts on transformer models. That combination of explicit math and targeted runs is the useful part.

The soft spot is the assumption that pairwise correlations between f(h_t) and f(h_{t+k}) stay high and stationary. Once the hidden state is updated each step, non-linearities can make later applications less correlated, so the total variance may grow slower than N squared. If the paper only assumes full correlation without measuring the decay on the actual runs, the 1/N factor could be stronger than necessary. The abstract does not show those checks, so that is the main place a referee would press.
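The decay in question is straightforward to measure. The following sketch shows one way to do it, assuming a toy untrained block `tanh(h @ w)` in place of a real Transformer layer; the helper name `update_cosines` and all sizes are hypothetical, and the numbers it prints say nothing about the paper's trained models.

```python
import numpy as np

def update_cosines(f, h0, eps, num_loops):
    """Cosine similarity between f(h_0) and f(h_k) for k = 0..N-1."""
    h = h0.copy()
    updates = []
    for _ in range(num_loops):
        u = f(h)                   # update proposed at this iteration
        updates.append(u.ravel())
        h = h + eps * u            # h <- h + eps * f(h)
    u0 = updates[0]
    return np.array(
        [u0 @ u / (np.linalg.norm(u0) * np.linalg.norm(u)) for u in updates]
    )

rng = np.random.default_rng(0)
width, n = 64, 32
w = rng.standard_normal((width, width)) / np.sqrt(width)
f = lambda h: np.tanh(h @ w)       # toy shared block (assumption)
h0 = rng.standard_normal((8, width))

cos = update_cosines(f, h0, 1 / n, n)
print(np.round(cos[[0, 1, n // 2, n - 1]], 3))
```

Running the same probe on checkpoints of an actual looped model, at several values of N, would show directly whether the similarity stays near 1 (supporting the 1/N factor) or falls off with k.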

The work is aimed at people already working on looped or recurrent-style transformers and on scaling rules for residual networks. Anyone tuning large-N versions would find the hyperparameter-transfer claim directly usable. The derivation plus the controlled experiments are enough to justify sending it to review; the correlation-decay question is addressable in revision rather than a load-bearing flaw.

Referee Report

2 major / 2 minor

Summary. The manuscript claims that standard residual scaling ε=1/√L is insufficient for looped (weight-tied) Transformers because weight sharing induces correlations across iterations in the residual updates, requiring the stronger ε=1/N. For multi-layer blocks (L unique layers looped N times) it derives the factored form ε=λ/(N√L) that separates loop-induced correlation (1/N) from layer variance (1/√L). A key consequence is that optimal learning rates depend only on L, not N, enabling direct hyperparameter transfer across loop counts. Experiments on looped Transformers are reported to confirm that 1/N scaling improves trainability and final loss relative to 1/√N scaling.

Significance. If the derivation and supporting experiments hold, the work supplies a theoretically motivated scaling rule that stabilizes looped architectures and permits hyperparameter transfer from small to large N at fixed L. This factorization of the two sources of growth could facilitate efficient scaling of effective depth without parameter growth, with relevance to parameter-efficient deep models in language modeling and related domains.

major comments (2)

[Derivation of ε=1/N and ε=λ/(N√L)] The variance analysis deriving the 1/N factor (abstract and derivation) rests on the assumption of stationary, full pairwise correlation between successive applications of the shared f. Because each step updates the hidden state (h ← h + ε f(h)), the inputs to f become non-stationary; any non-linearity can cause correlation between f(h_t) and f(h_{t+k}) to decay with k. The manuscript must either bound this decay or demonstrate that the quadratic variance accumulation still holds; otherwise the stronger 1/N (and the claim that optimal LR depends only on L) may not be required.
[Experiments confirming 1/N scaling and transferability] The central claim that optimal learning rate depends only on L (not N) is load-bearing for the transferability result. The experiments must report explicit controls that vary L and N independently, together with quantitative metrics (loss values, training curves, or tables) showing that a learning rate tuned at one N transfers without retuning to other N at fixed L.

minor comments (2)

[Abstract] The abstract states that prior 1/√L scaling is 'insufficient' and that experiments 'confirm' the new rule, but supplies no dataset names, model dimensions, or loop counts; a one-sentence addition would improve readability.
[Derivation] Notation for the free parameter λ should be introduced explicitly when the factored form is first stated, and its range or selection procedure clarified.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major point below and indicate where revisions will be made to strengthen the manuscript.

Referee: [Derivation of ε=1/N and ε=λ/(N√L)] The variance analysis deriving the 1/N factor (abstract and derivation) rests on the assumption of stationary, full pairwise correlation between successive applications of the shared f. Because each step updates the hidden state (h ← h + ε f(h)), the inputs to f become non-stationary; any non-linearity can cause correlation between f(h_t) and f(h_{t+k}) to decay with k. The manuscript must either bound this decay or demonstrate that the quadratic variance accumulation still holds; otherwise the stronger 1/N (and the claim that optimal LR depends only on L) may not be required.

Authors: The derivation of the 1/N factor is obtained by modeling the variance of the accumulated residual updates under the correlation structure induced by weight sharing. While non-stationarity can in principle cause correlation decay, the shared parameters ensure that the updates remain correlated across iterations in a manner that leads to linear (rather than square-root) growth in the magnitude of the accumulated update with N, that is, quadratic growth in its variance. To address the concern, we will revise the manuscript to include an empirical verification that the quadratic variance accumulation holds in the looped Transformer setting across the range of N considered, thereby supporting that the stronger scaling remains necessary for stability. revision: yes
Referee: [Experiments confirming 1/N scaling and transferability] The central claim that optimal learning rate depends only on L (not N) is load-bearing for the transferability result. The experiments must report explicit controls that vary L and N independently, together with quantitative metrics (loss values, training curves, or tables) showing that a learning rate tuned at one N transfers without retuning to other N at fixed L.

Authors: The current experiments vary N while holding L fixed and compare scaling rules, but do not include a full set of independent L/N controls with explicit transfer metrics. We will revise the experimental section to add tables and curves that vary L and N independently, reporting final losses and training dynamics to demonstrate that a learning rate selected at one N transfers directly to other values of N at the same L without retuning. revision: yes

Circularity Check

0 steps flagged

Variance derivation of 1/N scaling is independent of fitted inputs or self-citations

The paper presents ε = 1/N and the factored form ε = λ/(N√L) as results of a variance analysis on correlated residual updates induced by weight sharing. No quoted equations reduce the claimed scaling to a fitted parameter renamed as prediction, a self-citation chain, or a self-definitional loop. The derivation is framed as first-principles analysis of update correlations and is therefore self-contained; external benchmarks or code reproduction would be needed to assess correctness but are outside the circularity criteria.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

Ledger constructed from abstract only; full derivation unavailable.

free parameters (1)

λ
Scaling prefactor appearing in the multi-layer looped formula; its status (derived constant or fitted) is not specified in the abstract.

axioms (1)

domain assumption: Weight sharing in looped transformers induces correlated residual updates across iterations that dominate stability requirements.
This premise is invoked to conclude that prior 1/sqrt(L) scaling is insufficient and 1/N is required.

[1] [1]

Thomas Bachlechner, Bodhisattwa Prasad Ma- jumder, Huanru Henry Mao, Gary Cottrell, and Julian J. McAuley. ReZero is all you need: fast convergence at large depth. In Cassio P. de Cam- pos, Marloes H. Maathuis, and Erik Quaeghe- beur, editors, Proceedings of the Thirty-Seventh Conference on Uncertainty in Artificial Intelligence, UAI 2021, Virtual Event, ...

2021

[2] [2]

Depthwise hyperpa- rameter transfer in residual networks: Dynamics and scaling limit

Blake Bordelon, Lorenzo Noci, Mufan Bill Li, Boris Hanin, and Cengiz Pehlevan. Depthwise hyperpa- rameter transfer in residual networks: Dynamics and scaling limit. In The Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024. OpenRe- view.net, 2024. URL https://openreview.net/ forum?id=KZJehvRKGD

2024

[3] [3]

Concentration Inequalities - A Nonasymptotic Theory of Independence

Stéphane Boucheron, Gábor Lugosi, and Pas- cal Massart. Concentration Inequalities - A Nonasymptotic Theory of Independence. Oxford University Press, 2013. ISBN 978-0-19-953525-

2013

[4] [4]

2013.How to Build a Brain: A Neural Architecture for Biological Cognition

doi: 10.1093/ACPROF:OSO/9780199535255. 001.0001. URL https://doi.org/10.1093/acprof: oso/9780199535255.001.0001

work page doi:10.1093/acprof:oso/9780199535255

[5] [5]

Neural ordi- nary differential equations

Tian Qi Chen, Yulia Rubanova, Jesse Bet- tencourt, and David Duvenaud. Neural ordi- nary differential equations. In Samy Bengio, Hanna M. Wallach, Hugo Larochelle, Kristen Grauman, Nicolò Cesa-Bianchi, and Roman Garnett, editors, Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, NeurIPS ...

2018

[6] [6]

Univer- sal transformers

Mostafa Dehghani, Stephan Gouws, Oriol Vinyals, Jakob Uszkoreit, and Lukasz Kaiser. Univer- sal transformers. In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019. OpenReview.net,

2019

[7] [7]

URL https://openreview.net/forum?id= HyzdRiR9Y7

[8] [8]

Don’t be lazy: CompleteP enables compute-efficient deep transform- ers

Nolan Dey, Bin Claire Zhang, Lorenzo Noci, Mufan Li, Blake Bordelon, Shane Bergsma, Cengiz Pehle- van, Boris Hanin, and Joel Hestness. Don’t be lazy: CompleteP enables compute-efficient deep transform- ers. 2025. doi: 10.48550/ARXIV.2505.01618. URL https://arxiv.org/abs/2505.01618

work page doi:10.48550/arxiv.2505.01618 2025

[9] [9]

Looped transformers for length generalization

Ying Fan, Yilun Du, Kannan Ramchandran, and Kangwook Lee. Looped transformers for length generalization. In The Thirteenth International Conference on Learning Representations, ICLR 2025, Singapore, April24-28, 2025. OpenReview.net,

2025

[10] [10]

URL https://openreview.net/forum?id= 2edigk8yoU

[11] [11]

Reddi, Stefanie Jegelka, and Sanjiv Kumar

Khashayar Gatmiry, Nikunj Saunshi, Sashank J. Reddi, Stefanie Jegelka, and Sanjiv Kumar. Can looped transformers learn to implement multi-step gradient descent for in-context learning? In Rus- lan Salakhutdinov, Zico Kolter, Katherine A. Heller, Adrian Weller, Nuria Oliver, Jonathan Scarlett, and Felix Berkenkamp, editors,Forty-firstInternational Conferen...

2024

[12] [12]

On Milman’s inequality and ran- dom subspaces which escape through a mesh inRn

Yehoram Gordon. On Milman’s inequality and ran- dom subspaces which escape through a mesh inRn. In Geometric Aspects of Functional Analysis, vol- ume 1317 ofLecture Notes in Mathematics, pages 84–106. Springer, 1988. doi: 10.1007/BFb0081737

work page doi:10.1007/bfb0081737 1988

[13] [13]

Width and depth limits commute in residual networks

Soufiane Hayou and Greg Yang. Width and depth limits commute in residual networks. In Andreas Krause, Emma Brunskill, Kyunghyun Cho, Barbara Engelhardt, Sivan Sabato, and Jonathan Scarlett, editors,International Conference on Machine Learning, ICML 2023, 23-29 July 2023, Honolulu, Hawaii, USA, Proceedings of Machine Learning Research, pages 12700–12723. P...

2023

[14] [14]

ALBERT: A lite BERT for self-supervised learning of language representations

Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, and Radu Soricut. ALBERT: A lite BERT for self-supervised learning of language representations. In 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020. OpenReview.net, 2020. URL https://openreview. net/forum?id=H1eA7AEtvS

2020

[15] [15]

The Llama 3 Herd of Models

Llama Team. The llama 3 herd of models.CoRR, abs/2407.21783, 2024. doi: 10.48550/ARXIV.2407. 21783. URL https://doi.org/10.48550/arXiv. 2407.21783

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2407 2024

[16] [16]

Decoupled weight decay regularization

Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019. Open- Review.net, 2019. URLhttps://openreview.net/ forum?id=Bkg6RiCqY7

2019

[17] [17]

Generalization bounds for neu- ral ordinary differential equations and deep residual networks

Pierre Marion. Generalization bounds for neu- ral ordinary differential equations and deep residual networks. In Advances in Neural 10 Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, December 10 - 16, 2023, 2023. URL http:// papers.nips.cc/paper_files/paper/2023/hash/ ...

2023

[18] [18]

Scaling resnets in the large-depth regime

Pierre Marion, Adeline Fermanian, Gérard Biau, and Jean-Philippe Vert. Scaling resnets in the large-depth regime. J. Mach. Learn. Res., 26:56:1– 56:48, 2025. URL https://jmlr.org/papers/v26/ 22-0664.html

2025

[19] [19]

Completed hyperparameter trans- fer across modules, width, depth, batch and duration

Bruno Mlodozeniec, Pierre Ablin, Louis Béthune, Dan Busbridge, Michal Klein, Jason Ramapuram, and Marco Cuturi. Completed hyperparameter trans- fer across modules, width, depth, batch and duration

[20] [20]

URLhttps://arxiv.org/abs/2512.22382

arXiv

[21] [21]

Loop neural net- works for parameter sharing

Kei-Sing Ng and Qingchen Wang. Loop neural net- works for parameter sharing. 2024. URL https: //arxiv.org/abs/2409.14199

arXiv 2024

[22] [22]

Raffel, Leandro von Werra, and Thomas Wolf

Guilherme Penedo, Hynek Kydlícek, Loubna Ben Allal, Anton Lozhkov, Margaret Mitchell, Colin A. Raffel, Leandro von Werra, and Thomas Wolf. The FineWeb datasets: De- canting the web for the finest text data at scale. In Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2024, NeurIPS 2024, Vanco...

2024

[23] [23]

Using the output embedding to improve language models

Ofir Press and Lior Wolf. Using the output embedding to improve language models. In Mirella Lapata, Phil Blunsom, and Alexander Koller, editors, Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics, EACL 2017, Valencia, Spain, April 3-7, 2017, Volume 2: Short Papers, pages 157–163. Association f...

doi:10.18653/v1/e17-2025 2017

[24] [24]

Nikunj Saunshi, Nishanth Dikkala, Zhiyuan Li, Sanjiv Kumar, and Sashank J. Reddi. Reasoning with latent thoughts: On the power of looped transformers. In The Thirteenth International Conference on Learning Representations, ICLR 2025, Singapore, April 24-28, 2025. OpenReview.net, 2025. URL https://openreview.net/forum?id=din0lGfZFd

2025

[25] [25]

LLaMA: Open and Efficient Foundation Language Models

Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurélien Rodriguez, Armand Joulin, Edouard Grave, and Guillaume Lample. LLaMA: Open and efficient foundation language models. CoRR, abs/2302.13971, 2023. doi: 10.48550/ARXIV.2302.13971. URL http...

arXiv doi:10.48550/arxiv.2302.13971 2023

[26] [26]

Attention is all you need

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Isabelle Guyon, Ulrike von Luxburg, Samy Bengio, Hanna M. Wallach, Rob Fergus, S. V. N. Vishwanathan, and Roman Garnett, editors, Advances in Neural Information Processing Systems 30: Annual Confer...

2017

[27] [27]

DeepNet: Scaling transformers to 1,000 layers

Hongyu Wang, Shuming Ma, Li Dong, Shaohan Huang, Dongdong Zhang, and Furu Wei. DeepNet: Scaling transformers to 1,000 layers. IEEE Trans. Pattern Anal. Mach. Intell., 46(10):6761–6774, 2024. doi: 10.1109/TPAMI.2024.3386927. URL https://doi.org/10.1109/TPAMI.2024.3386927

doi:10.1109/tpami.2024.3386927 2024

[28] [28]

Understanding warmup-stable-decay learning rates: A river valley loss landscape perspective

Kaiyue Wen, Zhiyuan Li, Jason S. Wang, David Hall, Percy Liang, and Tengyu Ma. Understanding warmup-stable-decay learning rates: A river valley loss landscape perspective. CoRR, abs/2410.05192, 2024. doi: 10.48550/ARXIV.2410.05192. URL https://doi.org/10.48550/arXiv.2410.05192

arXiv doi:10.48550/arxiv.2410.05192 2024

[30] [30]

On layer normalization in the transformer architecture

Ruibin Xiong, Yunchang Yang, Di He, Kai Zheng, Shuxin Zheng, Chen Xing, Huishuai Zhang, Yanyan Lan, Liwei Wang, and Tie-Yan Liu. On layer normalization in the transformer architecture. In Proceedings of the 37th International Conference on Machine Learning, ICML 2020, 13-18 July 2020, Virtual Event, Proceedings of Machine Learning Research, pages 10524–...

2020

[31] [31]

Tensor programs V: Tuning large neural networks via zero-shot hyperparameter transfer

Greg Yang, Edward J. Hu, Igor Babuschkin, Szymon Sidor, Xiaodong Liu, David Farhi, Nick Ryder, Jakub Pachocki, Weizhu Chen, and Jianfeng Gao. Tensor programs V: Tuning large neural networks via zero-shot hyperparameter transfer. 2022. doi: 10.48550/ARXIV.2203.03466. URL https://arxiv.org/abs/2203.03466

arXiv 2022

[32] [32]

Looped transformers are better at learning learning algorithms

Liu Yang, Kangwook Lee, Robert D. Nowak, and Dimitris Papailiopoulos. Looped transformers are better at learning learning algorithms. In The Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024. OpenReview.net, 2024. URL https://openreview.net/forum?id=HHbRxoDTxE

2024

[33] [33]

Root mean square layer normalization

Biao Zhang and Rico Sennrich. Root mean square layer normalization. In Hanna M. Wallach, Hugo Larochelle, Alina Beygelzimer, Florence d’Alché-Buc, Emily B. Fox, and Roman Garnett, editors, Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, December 8-14, 2019, Vancouver, B...

2019

[34] [34]

Fixup initialization: Residual learning without normalization

Hongyi Zhang, Yann N. Dauphin, and Tengyu Ma. Fixup initialization: Residual learning without normalization. In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019. OpenReview.net, 2019. URL https://openreview.net/forum?id=H1gsz30cKX

2019

[36] [36]

Scaling Latent Reasoning via Looped Language Models

Rui-Jie Zhu, Zixuan Wang, Kai Hua, Tianyu Zhang, Ziniu Li, Haoran Que, Boyi Wei, Zixin Wen, Fan Yin, He Xing, Lu Li, Jiajun Shi, Kaijing Ma, Shanda Li, Taylor Kergan, Andrew Smith, Xingwei Qu, Mude Hui, Bohong Wu, Qiyang Min, Hongzhi Huang, Xun Zhou, Wei Ye, Jiaheng Liu, Jian Yang, Yunfeng Shi, Chenghua Lin, Enduo Zhao, Tianle Cai, Ge Zhang, Wenhao Huan...

arXiv doi:10.48550/arxiv.2510 2025