Convergence Bound and Critical Batch Size of Muon Optimizer
read the original abstract
Muon, a recently proposed optimizer that leverages the inherent matrix structure of neural network parameters, has demonstrated strong empirical performance, indicating its potential as a successor to standard optimizers such as AdamW. This paper presents theoretical analysis to support its practical success. We provide convergence proofs for Muon across four practical settings, systematically examining its behavior with and without the inclusion of Nesterov momentum and weight decay. We then demonstrate that the addition of weight decay ensures almost-sure boundedness of the parameter and gradient norms -- without relying on the commonly imposed bounded-gradient assumption -- and clarify the interplay between the weight decay coefficient and the learning rate. Finally, we derive a lower bound on the critical batch size for Muon -- the batch size that minimizes the stochastic first-order oracle (SFO) complexity of training. Because the resulting formula involves problem-dependent quantities that are not directly observable (gradient variance, target precision, effective rank), it does not predict the critical batch size in absolute terms; rather, it reveals how the hyperparameters $\beta$ (momentum) and $\lambda$ (weight decay) govern the qualitative scaling of this value. Our experiments validate these hyperparameter-dependent predictions across workloads including image classification and language modeling.
This paper has not been read by Pith yet.
Forward citations
Cited by 10 Pith papers
-
When and Why SignSGD Outperforms SGD: A Theoretical Study Based on $\ell_1$-norm Lower Bounds
SignSGD provably beats SGD by a factor of d under sparse noise via matched ℓ1-norm upper and lower bounds, with an equivalent result for Muon on matrices, and this predicts faster GPT-2 pretraining.
-
DP-Muon: Differentially Private Optimization via Matrix-Orthogonalized Momentum
DP-Muon adapts matrix-orthogonalized momentum optimization to differential privacy via per-matrix clipping and noise addition, with proofs of inherited privacy and optimization guarantees plus a bias-corrected version...
-
Intrinsic Muon: Spectral Optimization on Riemannian Matrix Manifolds
Intrinsic Muon provides closed-form linear maximization oracles on multiple Riemannian matrix manifolds for unitarily invariant norms, with convergence rates depending only on manifold dimension or rank.
-
Muon with Nesterov Momentum: Heavy-Tailed Noise and (Randomized) Inexact Polar Decomposition
Muon with Nesterov momentum and inexact polar decomposition achieves optimal convergence rates of O(ε^(-(3α-2)/(α-1))) under heavy-tailed noise for ε-stationary points in non-convex settings.
-
Accelerating LMO-Based Optimization via Implicit Gradient Transport
LMO-IGT achieves O(ε^{-3.5}) iteration complexity for stochastic LMO optimization via implicit gradient transport with a single gradient per step and introduces the regularized support function as a unified stationari...
-
Convergence Rate Analysis of SOAP with Arbitrary Orthogonal Projection Matrices
SOAP and its generalizations with arbitrary orthogonal projections converge at a provable rate when the projections are conditionally independent of the current gradient.
-
Muon Does Not Converge on Convex Lipschitz Functions
Muon does not converge on convex Lipschitz functions regardless of learning rate, while error feedback restores theoretical convergence but degrades performance on CIFAR-10 and nanoGPT tasks.
-
Orth-Dion: Eliminating Geometric Mismatch in Distributed Low-Rank Spectral Optimization
Orth-Dion uses QR factorization on the right factor instead of column normalization to eliminate the geometric mismatch in low-rank approximations of spectral optimizers like Muon, achieving O(sqrt(L_r/T)) rate under ...
-
Anytime Training with Schedule-Free Spectral Optimization
SF-NorMuon is a new schedule-free spectral optimizer that closes the gap with tuned AdamW on 125M-772M parameter models across 1-8x Chinchilla horizons while providing stationarity guarantees.
-
Low-rank Orthogonalization for Large-scale Matrix Optimization with Applications to Foundation Model Training
Proposes low-rank orthogonalization and derives low-rank Muon and MSGD variants that outperform standard Muon on GPT-2 and LLaMA pretraining while providing iteration complexity bounds.
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.