To Use or not to Use Muon: How Simplicity Bias in Optimizers Matters
While Adam has long been the ubiquitous default optimizer for deep neural networks, Muon has recently seen rapid adoption due to its superior training speed. Although much of the literature focuses on validating the benefits of Muon, our work investigates the potential downsides of the mechanism driving this speedup. On the theoretical front, we analyze the learning dynamics of simplified Muon on deep linear networks and linear attention. Our analysis reveals that Muon gains speed by avoiding saddle points, but does so at the expense of the simplicity bias characteristic of Gradient Descent (GD), where the complexity of the functional solution learned grows sequentially. Experiments demonstrate the consequences of losing the simplicity bias, showing that Muon struggles to uncover common underlying structure across tasks and may be prone to fitting spurious features. More broadly, this paper serves as a reminder that faster optimization is rarely a free lunch; improvements in optimization can come at the cost of changes in the inductive biases that shape generalization.
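The contrast the abstract draws between GD and Muon can be illustrated with a small sketch. It assumes that "simplified Muon" replaces the gradient matrix with its orthogonal polar factor U V^T (no momentum, exact SVD rather than an iterative approximation); the abstract does not spell out the paper's definition, so this is an assumption. The helper name `orthogonalize` and the toy spectrum are hypothetical. The sketch only shows that a GD step keeps the gradient's singular-value spread, while the orthogonalized step moves equally along every singular direction, which is one way to picture modes being learned at once rather than one after another.

```python
import numpy as np


def orthogonalize(grad):
    """Return the polar factor U @ V^T of grad.

    Assumption: this stands in for the 'simplified Muon' update direction.
    All nonzero singular values of the result equal 1.
    """
    u, _, vt = np.linalg.svd(grad, full_matrices=False)
    return u @ vt


rng = np.random.default_rng(0)

# Build a toy "gradient" with a known, widely spread spectrum (illustrative values).
left, _ = np.linalg.qr(rng.standard_normal((4, 3)))   # 4x3, orthonormal columns
right, _ = np.linalg.qr(rng.standard_normal((3, 3)))  # 3x3, orthogonal
spectrum = np.array([10.0, 1.0, 0.1])
grad = left @ np.diag(spectrum) @ right.T

gd_step = grad                        # plain GD direction (learning rate omitted)
muon_like_step = orthogonalize(grad)  # orthogonalized direction

gd_sv = np.linalg.svd(gd_step, compute_uv=False)
muon_sv = np.linalg.svd(muon_like_step, compute_uv=False)

print("GD step singular values:       ", np.round(gd_sv, 3))
print("Muon-like step singular values:", np.round(muon_sv, 3))
```

Under GD the weakest direction here receives a step 100 times smaller than the strongest one, whereas the orthogonalized step treats all three directions equally. Practical Muon implementations typically approximate the polar factor iteratively and apply it to a momentum buffer; both details are omitted in this sketch.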
Forward citations
Cited by 3 Pith papers
- Muon learns balanced solutions in matrix factorization without slow saddle-to-saddle dynamics
  Muon in matrix factorization avoids saddle-to-saddle dynamics, learns top modes simultaneously, conserves sqrt(P^TP) - sqrt(Q^TQ), and reaches balanced solutions from small initialization with a two-step alignment schedule.
- Muon in Vision Transformers: Optimizer-Recipe Interactions and Gradient Spectra
  Muon optimizer outperforms AdamW in ViT training on two image datasets, with gains that depend on data augmentation strength and are linked to wider singular-value spread in QKV gradients and prevention of late-traini...
- MiMuon: Mixed Muon Optimizer with Improved Generalization for Large Models
  MiMuon is a hybrid optimizer that achieves a generalization error bound of O(1/N) independent of the small singular-value gap that limits the original Muon bound, while retaining the same O(1/T^{1/4}) convergence rate.