Muon: An optimizer for hidden layers in neural networks. URL https://kellerjordan

Keller Jordan, Yuchen Jin, Vlado Boza, You Jiacheng, Franz Cesista, Laker Newhouse, Jeremy Bernstein · 2024

1 Pith paper cites this work. Polarity classification is still indexing.

representative citing papers

Same Architecture, Different Capacity: Optimizer-Induced Spectral Scaling Laws

cs.LG · 2026-05-20 · unverdicted · novelty 6.0 · ref 10

The same Transformer architecture follows different spectral scaling laws under different optimizers, with Muon achieving linear hard-rank scaling on tail representations while AdamW shows weak scaling, even when perplexity is matched.
