arxiv: 2602.16340 · v3 · pith:W4JVQ3RAnew · submitted 2026-02-18 · 💻 cs.LG · stat.ML

The Implicit Bias of Adam and Muon on Smooth Homogeneous Neural Networks

classification 💻 cs.LG stat.ML
keywords: norm, bias, descent, homogeneous, margin, models, steepest, adam

We study the implicit bias of momentum-based optimizers on smooth homogeneous models. We show that \textit{momentum steepest descent} algorithms like Muon (spectral norm), MomentumGD ($\ell_2$ norm), and Signum ($\ell_\infty$ norm) are \textit{approximate} steepest descent trajectories under a decaying learning rate schedule, proving that these algorithms have a bias towards KKT points of the corresponding margin maximization problem. We extend the analysis to Adam (without the stability constant), which maximizes the $\ell_\infty$ margin, and to Muon-Signum and Muon-Adam, which maximize a hybrid norm. Our experiments corroborate the theory and show that the identity of the margin maximized depends on the choice of optimizer. Overall, our results extend earlier lines of work on steepest descent in homogeneous models and momentum-based optimizers in linear models.
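The update rules named in the abstract can be illustrated with a minimal NumPy sketch. It is not taken from the paper: the helper names (`steepest_direction`, `momentum_steepest_step`, `adam_no_eps_step`), the exponential-moving-average form of the momentum buffer, and the hyperparameters are all assumptions made for illustration. The sketch shows how the choice of norm changes the direction applied to the same momentum buffer, and that Adam without the stability constant reduces to a sign update on its first step.

```python
import numpy as np


def steepest_direction(m, norm):
    """Direction d with unit `norm` that maximizes <m, d> (hypothetical helper).

    Assumes m is nonzero for "l2" and a 2-D matrix for "spectral".
    """
    if norm == "l2":
        # l2 (Frobenius) norm: normalize the buffer, as in normalized momentum GD.
        return m / np.linalg.norm(m)
    if norm == "linf":
        # l_inf norm: elementwise sign, as in Signum.
        return np.sign(m)
    if norm == "spectral":
        # Spectral norm: orthogonal polar factor U V^T, the direction Muon targets.
        u, _, vt = np.linalg.svd(m, full_matrices=False)
        return u @ vt
    raise ValueError(f"unknown norm: {norm}")


def momentum_steepest_step(w, m, grad, lr, beta, norm):
    """One momentum steepest descent step (illustrative form, not the paper's exact one)."""
    m = beta * m + (1.0 - beta) * grad          # momentum buffer (EMA of gradients)
    w = w - lr * steepest_direction(m, norm)    # move along the norm-dependent direction
    return w, m


def adam_no_eps_step(w, m, v, grad, lr, beta1, beta2, t):
    """One Adam step with the stability constant removed (t starts at 1).

    Assumes every entry of the second-moment estimate is nonzero;
    otherwise the division is undefined.
    """
    m = beta1 * m + (1.0 - beta1) * grad
    v = beta2 * v + (1.0 - beta2) * grad**2
    m_hat = m / (1.0 - beta1**t)                # bias-corrected first moment
    v_hat = v / (1.0 - beta2**t)                # bias-corrected second moment
    w = w - lr * m_hat / np.sqrt(v_hat)
    return w, m, v


if __name__ == "__main__":
    grad = np.array([[3.0, 0.0], [0.0, -4.0]])
    for norm in ("l2", "linf", "spectral"):
        w, _ = momentum_steepest_step(np.zeros((2, 2)), np.zeros((2, 2)),
                                      grad, lr=0.1, beta=0.9, norm=norm)
        print(norm, w.tolist())
```

On the first step, `adam_no_eps_step` moves every coordinate by exactly `lr` in the direction of `-sign(grad)`, which gives some intuition for why Adam without the stability constant is associated with the $\ell_\infty$ geometry.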

Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Scale-Invariant Neural Network Optimization: Norm Geometry and Heavy-Tailed Noise

    math.OC 2026-05 unverdicted novelty 6.0

    Establishes matching lower and upper oracle complexity bounds for scale-invariant methods with spectral norm under heavy-tailed noise, plus improved rates with higher-order smoothness, and practical tests on neural networks.

  2. Scale-Invariant Neural Network Optimization: Norm Geometry and Heavy-Tailed Noise

    math.OC 2026-05 unverdicted novelty 6.0

    Establishes matching Ω and O(min{m,n} ε^{-(3p-2)/(p-1)}) bounds for scale-invariant spectral-norm methods under heavy-tailed noise, plus an improved O(min{m,n} ε^{-(5p-3)/(2p-2)}) rate via transported Scion under Hessian ...

  3. Convergence of Spectral Descent for Non-smooth Optimization

    cs.LG 2026-05 unverdicted novelty 5.0

    Proves linear convergence of Spectral Descent (SD) and Truncated SD for non-smooth convex problems under stated conditions, sublinear rates for regularized versions via Frank-Wolfe, and recovery guarantees for robust ...