Uniform Scaling Limits in AdamW-Trained Transformers
AdamW-trained transformer hidden states and backpropagated variables converge uniformly in L2 to a forward-backward ODE system (a McKean-Vlasov system in the non-causal case) at rate O(L^{-1} + L^{-1/3} H^{-1/2}) as the depth L and the number of heads H increase, with bounds independent of the number of tokens.
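The stated rate combines a depth term L^{-1} and a cross term L^{-1/3} H^{-1/2}. A minimal sketch of how the bound shrinks as depth and head count grow jointly (the function name `rate_bound` and the constants `c1`, `c2` are hypothetical; the paper's actual constants are not given in the abstract):

```python
def rate_bound(L: int, H: int, c1: float = 1.0, c2: float = 1.0) -> float:
    """Evaluate the claimed convergence-rate bound c1*L^{-1} + c2*L^{-1/3}*H^{-1/2}.

    L: transformer depth (number of layers), H: number of attention heads.
    The constants c1, c2 are placeholders, not values from the paper.
    """
    return c1 / L + c2 / (L ** (1 / 3) * H ** 0.5)

# The bound decreases monotonically as L and H grow together.
for L, H in [(12, 8), (48, 16), (192, 32), (768, 64)]:
    print(f"L={L:4d} H={H:3d}  bound={rate_bound(L, H):.4f}")
```

Note that for fixed H the bound is eventually dominated by the L^{-1/3} H^{-1/2} term, so increasing depth alone gives only a cube-root improvement; the full rate requires scaling the head count as well.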