modded-nanogpt: Speedrunning the nanogpt baseline
2 Pith papers cite this work. Polarity classification is still indexing.
Fields: cs.LG (2)
Years: 2026 (2)
Representative citing papers
- Same Architecture, Different Capacity: Optimizer-Induced Spectral Scaling Laws
The same Transformer architecture follows different spectral scaling laws under different optimizers, with Muon achieving linear hard-rank scaling on tail representations while AdamW shows weak scaling, even when perplexity is matched.
- DynMuon: A Dynamic Spectral Shaping View of Muon