LiMuon: Light and Fast Muon Optimizer for Large Models
11 Pith papers cite this work.
abstract
Large models have recently been widely applied in machine learning, so their efficient training has received widespread attention. More recently, the Muon optimizer was designed specifically for the matrix-structured parameters of large models. Although some works have begun to study the Muon optimizer, the existing Muon and its variants still suffer from high sample complexity or high memory cost on large models. To fill this gap, we propose a light and fast Muon (LiMuon) optimizer for training large models, which builds on a momentum-based variance-reduction technique and randomized Singular Value Decomposition (SVD). In particular, LiMuon simultaneously has lower memory cost and lower sample complexity than Muon and its variants. Moreover, we prove that the lower-memory LiMuon achieves a sample complexity of $O(\epsilon^{-3})$ for finding an $\epsilon$-stationary solution of non-convex stochastic optimization under the generalized smoothness condition. To further narrow the gap between theory and practice, we also prove that LiMuon with Newton-Schulz steps has a lower sample complexity than Muon with Newton-Schulz steps. Numerical experiments on training Mamba-130M, Qwen2.5-0.5B and ViT models demonstrate the effectiveness of LiMuon.
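The two ingredients the abstract names, momentum-based variance reduction and randomized SVD, can be illustrated in generic form. The sketch below is not the paper's algorithm: it assumes a STORM-style momentum recursion and a standard Gaussian range-finder randomized SVD, and every function name and parameter (`randomized_svd`, `storm_momentum`, `orthogonalized_direction`, `rank`, `oversample`, `beta`) is hypothetical.

```python
import numpy as np

def randomized_svd(a, rank, oversample=5, seed=0):
    """Approximate the top-`rank` SVD of `a` with a Gaussian range finder."""
    rng = np.random.default_rng(seed)
    m, n = a.shape
    k = min(rank + oversample, min(m, n))
    # Sketch the column space of `a` with a random Gaussian test matrix.
    omega = rng.standard_normal((n, k))
    q, _ = np.linalg.qr(a @ omega)  # orthonormal basis, shape (m, k)
    # Project onto the sketched subspace and decompose the small matrix exactly.
    b = q.T @ a  # shape (k, n)
    u_small, s, vt = np.linalg.svd(b, full_matrices=False)
    u = q @ u_small
    return u[:, :rank], s[:rank], vt[:rank, :]

def storm_momentum(grad_new, grad_old, momentum_prev, beta):
    """STORM-style variance-reduced momentum (an assumed, generic form).

    grad_new and grad_old are stochastic gradients evaluated on the same
    sample at the current and previous iterates, respectively.
    """
    return grad_new + (1.0 - beta) * (momentum_prev - grad_old)

def orthogonalized_direction(momentum, rank):
    """Rank-`rank` orthogonalized update U V^T built from the randomized SVD."""
    u, _, vt = randomized_svd(momentum, rank)
    return u @ vt

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    # Toy 20x15 momentum matrix of exact rank 3.
    momentum = rng.standard_normal((20, 3)) @ rng.standard_normal((3, 15))
    direction = orthogonalized_direction(momentum, rank=3)
    print(direction.shape)
```

In this sketch only the thin factors `u` and `vt` are needed to form the update direction, which is one plausible way a low-rank factorization could reduce storage; whether LiMuon stores its state this way is not stated in the abstract.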
years: 2026 (11)
verdicts: UNVERDICTED (11)
citing papers explorer
- When and Why SignSGD Outperforms SGD: A Theoretical Study Based on $\ell_1$-norm Lower Bounds
  SignSGD provably beats SGD by a factor of d under sparse noise via matched ℓ1-norm upper and lower bounds, with an equivalent result for Muon on matrices, and this predicts faster GPT-2 pretraining.
- Convergence Analysis of Muon-type Methods with Inexact LMO in the Degenerate Case
  Convergence rates are derived for Muon-type methods with inexact LMO in the degenerate case under novel assumptions and layer-wise (L^0, L^1)-smoothness for non-convex and star-convex objectives with weight decay.
- OptMuon: Closed-Loop Orthogonalized Momentum Methods for Stochastic Optimization with Zero-Noise Optimality
  OptMuon combines orthogonalized momentum with trajectory-dependent AdaGrad-Norm adaptation to obtain expected-stationarity rates of order T^{-1/2} + sigma^{1/2}T^{-1/4} or T^{-1/2} + sigma^{1/3}T^{-1/3} that reduce to near-optimal deterministic first-order rates in the zero-noise regime.
- LionMuon: Alternating Spectral and Sign Descent for Efficient Training
  LionMuon alternates Lion and Muon steps with a shared dual-EMA buffer to Pareto-dominate existing optimizers in loss and compute on models up to 720M parameters.
- Scale-Invariant Neural Network Optimization: Norm Geometry and Heavy-Tailed Noise
  Establishes matching Ω and O(min{m,n} ε^{-(3p-2)/(p-1)}) bounds for scale-invariant spectral-norm methods under heavy-tailed noise, plus an improved O(min{m,n} ε^{-(5p-3)/(2p-2)}) rate via transported Scion under Hessian Lipschitz continuity.
- Symmetry-Compatible Principle for Optimizer Design: Embeddings, LM Heads, SwiGLU MLPs, and MoE Routers
  Proposes equivariant optimizer updates matched to layer symmetries for embeddings, SwiGLU MLPs, and MoE routers, with reported gains in validation loss and training stability on several language model architectures.
- MuonEq: Balancing Before Orthogonalization with Lightweight Equilibration
  MuonEq introduces pre-orthogonalization equilibration schemes that improve Muon optimizer performance during large language model pretraining.
- Zero-order Parameter-free Optimization for LMO-based Methods: Novel Approach for Efficient Fine-tuning
  AdaNAGED combines zeroth-order gradient-free training, automatic parameter adaptation, and LMO-based non-Euclidean geometry with claimed convergence guarantees, demonstrated on OPT-1.3B fine-tuning.
- Softsign: Smooth Sign in Your Optimizer For Better Parameter Heterogeneity Handling
  SoftSignum replaces hard sign with soft-sign in optimizers via temperature control and quantile scheduling, extends to SoftMuon, provides a convergence proof for stochastic non-convex settings, and reports better performance than sign-based methods and AdamW on deep learning tasks.
- MiMuon: Mixed Muon Optimizer with Improved Generalization for Large Models
  MiMuon is a hybrid optimizer that achieves a generalization error bound of O(1/N) independent of the small singular-value gap that limits the original Muon bound, while retaining the same O(1/T^{1/4}) convergence rate.
- Communication-Efficient Gluon in Federated Learning
  Compressed Gluon variants using unbiased/contraction compressors and SARAH-style variance reduction achieve convergence guarantees and lower communication costs in federated learning under layer-wise smoothness.