Muon optimizer accelerates grokking. arXiv preprint arXiv:2504.16041

Amund Tveit, Bjørn Remseth, Arve Skogvold · arXiv 2504.16041

2 Pith papers cite this work. Polarity classification is still being indexed.

representative citing papers

When and Why SignSGD Outperforms SGD: A Theoretical Study Based on $\ell_1$-norm Lower Bounds

cs.LG · 2026-05-07 · unverdicted · novelty 8.0

SignSGD provably beats SGD by a factor of d under sparse noise via matched ℓ1-norm upper and lower bounds, with an equivalent result for Muon on matrices, and this predicts faster GPT-2 pretraining.

Spectral Flattening Is All Muon Needs: How Orthogonalization Controls Learning Rate and Convergence

cs.LG · 2026-05-13 · unverdicted · novelty 6.0

Muon achieves faster convergence and larger stable learning rates by flattening the singular value spectrum of the momentum buffer through orthogonalization, scaling step size with average rather than maximum singular values.

citing papers explorer

Showing 2 of 2 citing papers.

When and Why SignSGD Outperforms SGD: A Theoretical Study Based on $\ell_1$-norm Lower Bounds
cs.LG · 2026-05-07 · unverdicted · none · ref 39
Spectral Flattening Is All Muon Needs: How Orthogonalization Controls Learning Rate and Convergence
cs.LG · 2026-05-13 · unverdicted · none · ref 14
