arXiv preprint arXiv:2602.16340

The Implicit Bias of Adam, Muon on Smooth Homogeneous Neural Networks · 2026 · cs.LG · arXiv 2602.16340

2 Pith papers cite this work. Polarity classification is still indexing.


abstract

We study the implicit bias of momentum-based optimizers on smooth homogeneous models. We show that \textit{momentum steepest descent} algorithms like Muon (spectral norm), MomentumGD ($\ell_2$ norm), and Signum ($\ell_\infty$ norm) are \textit{approximate} steepest descent trajectories under a decaying learning rate schedule, proving that these algorithms have a bias towards KKT points of the corresponding margin maximization problem. We extend the analysis to Adam (without the stability constant), which maximizes the $\ell_\infty$ margin, and to Muon-Signum and Muon-Adam, which maximize a hybrid norm. Our experiments corroborate the theory and show that the identity of the margin maximized depends on the choice of optimizer. Overall, our results extend earlier lines of work on steepest descent in homogeneous models and momentum-based optimizers in linear models.
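The three norms named in the abstract each induce a different steepest descent direction. The following is a minimal sketch of one momentum steepest descent step, not the paper's implementation: the function names, the normalized (unit-norm) form of the update, and the exponential-moving-average momentum are all assumptions made for illustration. In particular, MomentumGD is usually written without normalization; the normalized $\ell_2$ direction is used here only so the three cases share one form.

```python
import numpy as np

def steepest_direction(m, norm):
    """Unit-norm direction d minimizing <m, d> under the given norm (hypothetical helper)."""
    if norm == "l2":
        # Normalized gradient direction (Frobenius norm for matrices).
        return -m / (np.linalg.norm(m) + 1e-12)
    if norm == "linf":
        # Sign update, as in Signum.
        return -np.sign(m)
    if norm == "spectral":
        # Orthogonal factor of the SVD, the direction Muon approximates.
        u, _, vt = np.linalg.svd(m, full_matrices=False)
        return -u @ vt
    raise ValueError(f"unknown norm: {norm}")

def momentum_steepest_step(w, m, grad, lr, beta, norm):
    """One step: update the momentum buffer, then move along the steepest direction."""
    m = beta * m + (1.0 - beta) * grad  # assumed EMA form of momentum
    w = w + lr * steepest_direction(m, norm)
    return w, m
```

In a training loop, `lr` would follow a decaying schedule, which is the setting the abstract analyzes; the specific schedule is left open here because the abstract does not state it.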

representative citing papers

Convergence of Spectral Descent for Non-smooth Optimization

cs.LG · 2026-05-26 · unverdicted · novelty 5.0

Proves linear convergence of Spectral Descent (SD) and Truncated SD for non-smooth convex problems under stated conditions, sublinear rates for regularized versions via Frank-Wolfe, and recovery guarantees for robust low-rank matrix recovery.

Scale-Invariant Neural Network Optimization: Norm Geometry and Heavy-Tailed Noise

math.OC · 2026-05-18

