
arxiv: 2603.00742 · v2 · pith:EG6A6LUHnew · submitted 2026-02-28 · 💻 cs.LG

To Use or not to Use Muon: How Simplicity Bias in Optimizers Matters

classification 💻 cs.LG
keywords muon · bias · simplicity · deep · linear · networks · optimization · speed

While Adam has long been the ubiquitous default optimizer for deep neural networks, Muon has recently seen rapid adoption due to its superior training speed. Although much of the literature focuses on validating the benefits of Muon, our work investigates the potential downsides of the mechanism driving this speedup. On the theoretical front, we analyze the learning dynamics of simplified Muon on deep linear networks and linear attention. Our analysis reveals that Muon gains speed by avoiding saddle points, but does so at the expense of the simplicity bias characteristic of Gradient Descent (GD), where the complexity of the functional solution learned grows sequentially. Experiments demonstrate the consequences of losing the simplicity bias, showing that Muon struggles to uncover common underlying structure across tasks and may be prone to fitting spurious features. More broadly, this paper serves as a reminder that faster optimization is rarely a free lunch; improvements in optimization can come at the cost of changes in the inductive biases that shape generalization.
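The contrast the abstract draws between GD and Muon can be illustrated with a minimal NumPy sketch. It assumes that "simplified Muon" means replacing the gradient G = U S V^T with its orthogonal factor U V^T; the abstract does not state the paper's exact definition, and practical Muon implementations add momentum and approximate this factor with a Newton-Schulz iteration rather than an exact SVD. The function names and the toy gradient below are hypothetical.

```python
import numpy as np

def orthogonalize(grad):
    """Return U @ V^T for grad = U S V^T, i.e. set every singular value to 1.

    Assumption: this stands in for Muon's orthogonalization step; an exact
    SVD is used here for clarity instead of a Newton-Schulz approximation.
    """
    u, _, vt = np.linalg.svd(grad, full_matrices=False)
    return u @ vt

def gd_step(w, grad, lr):
    """Plain gradient descent: step size per direction scales with the gradient."""
    return w - lr * grad

def muon_like_step(w, grad, lr):
    """Orthogonalized step: every singular direction moves by the same amount."""
    return w - lr * orthogonalize(grad)

rng = np.random.default_rng(0)

# Hypothetical full-rank gradient with strong, medium, and weak directions.
u, _ = np.linalg.qr(rng.standard_normal((4, 3)))
v, _ = np.linalg.qr(rng.standard_normal((3, 3)))
grad = u @ np.diag([10.0, 1.0, 0.1]) @ v.T

w0 = np.zeros((4, 3))
lr = 0.1
gd_update = w0 - gd_step(w0, grad, lr)
muon_update = w0 - muon_like_step(w0, grad, lr)

# Singular values of each update show how the step is spread across directions.
gd_sv = np.linalg.svd(gd_update, compute_uv=False)
muon_sv = np.linalg.svd(muon_update, compute_uv=False)
print("GD update singular values:        ", gd_sv)
print("Muon-like update singular values: ", muon_sv)
```

In this toy case the GD update keeps the gradient's spread of singular values (1.0, 0.1, 0.01), so the dominant direction is learned first, while the orthogonalized update moves all three directions by the same amount (0.1). That equalization is one way to picture both the speedup and the loss of sequential, simple-to-complex learning that the abstract describes.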



Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Muon learns balanced solutions in matrix factorization without slow saddle-to-saddle dynamics

    cs.LG 2026-06 unverdicted novelty 6.0

    Muon in matrix factorization avoids saddle-to-saddle dynamics, learns top modes simultaneously, conserves sqrt(P^TP) - sqrt(Q^TQ), and reaches balanced solutions from small initialization with a two-step alignment schedule.

  2. Muon in Vision Transformers: Optimizer-Recipe Interactions and Gradient Spectra

    cs.LG 2026-05 conditional novelty 5.0

    Muon optimizer outperforms AdamW in ViT training on two image datasets, with gains that depend on data augmentation strength and are linked to wider singular-value spread in QKV gradients and prevention of late-traini...

  3. MiMuon: Mixed Muon Optimizer with Improved Generalization for Large Models

    cs.LG 2026-05 unverdicted novelty 5.0

    MiMuon is a hybrid optimizer that achieves a generalization error bound of O(1/N) independent of the small singular-value gap that limits the original Muon bound, while retaining the same O(1/T^{1/4}) convergence rate.