modded-nanogpt: Speedrunning the nanogpt baseline
2 Pith papers cite this work.
Fields: cs.LG (2)
Years: 2026 (2)
Verdicts: UNVERDICTED (2)
Representative citing papers
The same Transformer architecture follows different spectral scaling laws under different optimizers, with Muon achieving linear hard-rank scaling on tail representations while AdamW shows weak scaling, even when perplexity is matched.
citing papers explorer
- DynMuon: A Dynamic Spectral Shaping View of Muon
DynMuon dynamically schedules the spectral exponent p in Muon-style updates according to curvature, noise, and training stage, yielding lower validation loss with 10-26% fewer steps than fixed Muon.
- Same Architecture, Different Capacity: Optimizer-Induced Spectral Scaling Laws
The same Transformer architecture follows different spectral scaling laws under different optimizers, with Muon achieving linear hard-rank scaling on tail representations while AdamW shows weak scaling, even when perplexity is matched.