LiMuon: Light and Fast Muon Optimizer for Large Models

· 2025 · cs.LG · arXiv 2509.14562

11 Pith papers cite this work. Polarity classification is still indexing.

11 Pith papers citing it

open full Pith review browse 11 citing papers arXiv PDF

abstract

Large models recently are widely applied in machine learning, so efficient training of large models has received widespread attention. More recently, the useful Muon optimizer is specifically designed for matrix-structured parameters of large models. Although some works have begun to study the Muon optimizer, the existing Muon and its variants still suffer from high sample complexity or high memory for large models. To fill this gap, we propose a light and fast Muon (LiMuon) optimizer for training large models, which builds on the momentum-based variance reduced technique and randomized Singular Value Decomposition (SVD). In particular, our LiMuon simultaneously has a lower memory and lower sample complexity than the Muon and its variants. Moreover, we prove that our LiMuon with lower memory has a lower sample complexity of $O(\epsilon^{-3})$ for finding an $\epsilon$-stationary solution of non-convex stochastic optimization under the generalized smoothness condition. To further narrow practice and theory gap, we also prove that our LiMuon with Newton-Schulz steps has a lower sample complexity than the Muon with Newton-Schulz steps. Numerical experimental results on training Mamba-130M, Qwen2.5-0.5B and ViT models demonstrate effectiveness of our LiMuon.

citation-role summary

background 1 method 1

citation-polarity summary

background 1 use method 1

representative citing papers

When and Why SignSGD Outperforms SGD: A Theoretical Study Based on $\ell_1$-norm Lower Bounds

cs.LG · 2026-05-07 · unverdicted · novelty 8.0

SignSGD provably beats SGD by a factor of d under sparse noise via matched ℓ1-norm upper and lower bounds, with an equivalent result for Muon on matrices, and this predicts faster GPT-2 pretraining.

Convergence Analysis of Muon-type Methods with Inexact LMO in the Degenerate Case

math.OC · 2026-06-19 · unverdicted · novelty 6.0

Convergence rates are derived for Muon-type methods with inexact LMO in the degenerate case under novel assumptions and layer-wise (L^0, L^1)-smoothness for non-convex and star-convex objectives with weight decay.

OptMuon: Closed-Loop Orthogonalized Momentum Methods for Stochastic Optimization with Zero-Noise Optimality

math.OC · 2026-06-07 · unverdicted · novelty 6.0 · 2 refs

OptMuon combines orthogonalized momentum with trajectory-dependent AdaGrad-Norm adaptation to obtain expected-stationarity rates of order T^{-1/2} + sigma^{1/2}T^{-1/4} or T^{-1/2} + sigma^{1/3}T^{-1/3} that reduce to near-optimal deterministic first-order rates in the zero-noise regime.

LionMuon: Alternating Spectral and Sign Descent for Efficient Training

cs.LG · 2026-05-19 · unverdicted · novelty 6.0 · 2 refs

LionMuon alternates Lion and Muon steps with shared dual-EMA buffer to Pareto-dominate existing optimizers in loss and compute on models up to 720M parameters.

Scale-Invariant Neural Network Optimization: Norm Geometry and Heavy-Tailed Noise

math.OC · 2026-05-18 · unverdicted · novelty 6.0

Establishes matching Ω and O(min{m,n} ε^-(3p-2)/(p-1)) bounds for scale-invariant spectral-norm methods under heavy-tailed noise, plus an improved O(min{m,n} ε^-(5p-3)/(2p-2)) rate via transported Scion under Hessian Lipschitz continuity.

Symmetry-Compatible Principle for Optimizer Design: Embeddings, LM Heads, SwiGLU MLPs, and MoE Routers

math.OC · 2026-05-18 · unverdicted · novelty 6.0 · 2 refs

Proposes equivariant optimizer updates matched to layer symmetries for embeddings, SwiGLU MLPs, and MoE routers, with reported gains in validation loss and training stability on several language model architectures.

MuonEq: Balancing Before Orthogonalization with Lightweight Equilibration

cs.LG · 2026-03-30 · unverdicted · novelty 6.0

MuonEq introduces pre-orthogonalization equilibration schemes that improve Muon optimizer performance during large language model pretraining.

Zero-order Parameter-free Optimization for LMO-based Methods: Novel Approach for Efficient Fine-tuning

cs.LG · 2026-06-12 · unverdicted · novelty 5.0

AdaNAGED combines zeroth-order gradient-free training, automatic parameter adaptation, and LMO-based non-Euclidean geometry with claimed convergence guarantees, demonstrated on OPT-1.3B fine-tuning.

Softsign: Smooth Sign in Your Optimizer For Better Parameter Heterogeneity Handling

cs.LG · 2026-05-29 · unverdicted · novelty 5.0

SoftSignum replaces hard sign with soft-sign in optimizers via temperature control and quantile scheduling, extends to SoftMuon, provides a convergence proof for stochastic non-convex settings, and reports better performance than sign-based methods and AdamW on deep learning tasks.

MiMuon: Mixed Muon Optimizer with Improved Generalization for Large Models

cs.LG · 2026-05-19 · unverdicted · novelty 5.0

MiMuon is a hybrid optimizer that achieves a generalization error bound of O(1/N) independent of the small singular-value gap that limits the original Muon bound, while retaining the same O(1/T^{1/4}) convergence rate.

Communication-Efficient Gluon in Federated Learning

cs.LG · 2026-04-12 · unverdicted · novelty 5.0

Compressed Gluon variants using unbiased/contraction compressors and SARAH-style variance reduction achieve convergence guarantees and lower communication costs in federated learning under layer-wise smoothness.

citing papers explorer

Showing 11 of 11 citing papers after filters.

When and Why SignSGD Outperforms SGD: A Theoretical Study Based on $\ell_1$-norm Lower Bounds cs.LG · 2026-05-07 · unverdicted · none · ref 14 · internal anchor
SignSGD provably beats SGD by a factor of d under sparse noise via matched ℓ1-norm upper and lower bounds, with an equivalent result for Muon on matrices, and this predicts faster GPT-2 pretraining.
Convergence Analysis of Muon-type Methods with Inexact LMO in the Degenerate Case math.OC · 2026-06-19 · unverdicted · none · ref 48 · internal anchor
Convergence rates are derived for Muon-type methods with inexact LMO in the degenerate case under novel assumptions and layer-wise (L^0, L^1)-smoothness for non-convex and star-convex objectives with weight decay.
OptMuon: Closed-Loop Orthogonalized Momentum Methods for Stochastic Optimization with Zero-Noise Optimality math.OC · 2026-06-07 · unverdicted · none · ref 8 · 2 links · internal anchor
OptMuon combines orthogonalized momentum with trajectory-dependent AdaGrad-Norm adaptation to obtain expected-stationarity rates of order T^{-1/2} + sigma^{1/2}T^{-1/4} or T^{-1/2} + sigma^{1/3}T^{-1/3} that reduce to near-optimal deterministic first-order rates in the zero-noise regime.
LionMuon: Alternating Spectral and Sign Descent for Efficient Training cs.LG · 2026-05-19 · unverdicted · none · ref 12 · 2 links · internal anchor
LionMuon alternates Lion and Muon steps with shared dual-EMA buffer to Pareto-dominate existing optimizers in loss and compute on models up to 720M parameters.
Scale-Invariant Neural Network Optimization: Norm Geometry and Heavy-Tailed Noise math.OC · 2026-05-18 · unverdicted · none · ref 54 · internal anchor
Establishes matching Ω and O(min{m,n} ε^-(3p-2)/(p-1)) bounds for scale-invariant spectral-norm methods under heavy-tailed noise, plus an improved O(min{m,n} ε^-(5p-3)/(2p-2)) rate via transported Scion under Hessian Lipschitz continuity.
Symmetry-Compatible Principle for Optimizer Design: Embeddings, LM Heads, SwiGLU MLPs, and MoE Routers math.OC · 2026-05-18 · unverdicted · none · ref 71 · 2 links · internal anchor
Proposes equivariant optimizer updates matched to layer symmetries for embeddings, SwiGLU MLPs, and MoE routers, with reported gains in validation loss and training stability on several language model architectures.
MuonEq: Balancing Before Orthogonalization with Lightweight Equilibration cs.LG · 2026-03-30 · unverdicted · none · ref 33 · internal anchor
MuonEq introduces pre-orthogonalization equilibration schemes that improve Muon optimizer performance during large language model pretraining.
Zero-order Parameter-free Optimization for LMO-based Methods: Novel Approach for Efficient Fine-tuning cs.LG · 2026-06-12 · unverdicted · none · ref 71 · internal anchor
AdaNAGED combines zeroth-order gradient-free training, automatic parameter adaptation, and LMO-based non-Euclidean geometry with claimed convergence guarantees, demonstrated on OPT-1.3B fine-tuning.
Softsign: Smooth Sign in Your Optimizer For Better Parameter Heterogeneity Handling cs.LG · 2026-05-29 · unverdicted · none · ref 17 · internal anchor
SoftSignum replaces hard sign with soft-sign in optimizers via temperature control and quantile scheduling, extends to SoftMuon, provides a convergence proof for stochastic non-convex settings, and reports better performance than sign-based methods and AdamW on deep learning tasks.
MiMuon: Mixed Muon Optimizer with Improved Generalization for Large Models cs.LG · 2026-05-19 · unverdicted · none · ref 12 · internal anchor
MiMuon is a hybrid optimizer that achieves a generalization error bound of O(1/N) independent of the small singular-value gap that limits the original Muon bound, while retaining the same O(1/T^{1/4}) convergence rate.
Communication-Efficient Gluon in Federated Learning cs.LG · 2026-04-12 · unverdicted · none · ref 12 · internal anchor
Compressed Gluon variants using unbiased/contraction compressors and SARAH-style variance reduction achieve convergence guarantees and lower communication costs in federated learning under layer-wise smoothness.

LiMuon: Light and Fast Muon Optimizer for Large Models

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer