On Milman’s inequality and random subspaces which escape through a mesh in R^n
1 Pith paper cites this work. Polarity classification is still indexing.
Pith papers citing it: 1
Fields: cs.LG (1)
Years: 2026 (1)
Verdicts: UNVERDICTED (1)

Representative citing papers
On the Residual Scaling of Looped Transformers: Stability and Transferability
Looped Transformers require residual scaling ε = 1/N, rather than the standard 1/sqrt(L), because weight sharing makes the updates across loops correlated; this scaling enables learning-rate transfer independent of the loop count N.
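
The intuition behind the two scalings can be illustrated with a toy NumPy sketch. It is not the paper's derivation: it assumes the extreme case where weight sharing makes every loop contribute the same unit-norm update, and compares it with independent random updates. The function name `accumulated_norm` and all parameters are hypothetical, chosen only for this illustration.

```python
import numpy as np

def accumulated_norm(n_steps, dim, shared, rng):
    """Norm of the sum of n_steps unit-norm updates, before residual scaling.

    shared=True  : every step reuses the same direction (toy stand-in for
                   perfectly correlated updates; an assumption, not the paper's model).
    shared=False : every step draws an independent random direction.
    """
    if shared:
        u = rng.standard_normal(dim)
        u /= np.linalg.norm(u)
        updates = np.tile(u, (n_steps, 1))
    else:
        updates = rng.standard_normal((n_steps, dim))
        updates /= np.linalg.norm(updates, axis=1, keepdims=True)
    return np.linalg.norm(updates.sum(axis=0))

rng = np.random.default_rng(0)
dim = 4096
for n in (4, 16, 64):
    corr = accumulated_norm(n, dim, shared=True, rng=rng)
    indep = accumulated_norm(n, dim, shared=False, rng=rng)
    # Correlated sums grow like n, so scaling by 1/n keeps them O(1);
    # independent sums grow roughly like sqrt(n), so 1/sqrt(n) suffices.
    print(n, corr / n, indep / np.sqrt(n))
```

In this toy setting the correlated ratio stays at 1 and the independent ratio stays close to 1 as n grows, which is the sense in which a 1/N factor plays the role for correlated updates that 1/sqrt(L) plays for independent ones.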