The same Transformer architecture follows different spectral scaling laws under different optimizers, with Muon achieving linear hard-rank scaling on tail representations while AdamW shows weak scaling, even when perplexity is matched.
The spectrum of covariance matrices of randomly connected recurrent neuronal networks with linear dynamics. PLoS Computational Biology
1 Pith paper cites this work. Polarity classification is still indexing.
Fields: cs.LG (1)
Years: 2026 (1)
Verdicts: UNVERDICTED (1)
Representative citing papers
Same Architecture, Different Capacity: Optimizer-Induced Spectral Scaling Laws