hub Canonical reference

On the Convergence Analysis of Muon

· 2025 · stat.ML · arXiv 2505.23737

Canonical reference. 83% of citing Pith papers cite this work as background.

35 Pith papers citing it

Background 83% of classified citations

open full Pith review browse 35 citing papers arXiv PDF

abstract

The majority of parameters in neural networks are naturally represented as matrices. However, most commonly used optimizers treat these matrix parameters as flattened vectors during optimization, potentially overlooking their inherent structural properties. Recently, an optimizer called Muon has been proposed, specifically designed to optimize matrix-structured parameters. Extensive empirical evidence shows that Muon can significantly outperform traditional optimizers when training neural networks. Nonetheless, the theoretical understanding of Muon's convergence behavior and the reasons behind its superior performance remain limited. In this work, we present a comprehensive convergence rate analysis of Muon and its comparison with Gradient Descent (GD). We characterize the conditions under which Muon can outperform GD. Our theoretical results reveal that Muon can benefit from the low-rank structure of Hessian matrices, a phenomenon widely observed in practical neural network training. Our experimental results support and corroborate the theoretical findings.

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 6

citation-polarity summary

background 5 support 1

representative citing papers

When and Why SignSGD Outperforms SGD: A Theoretical Study Based on $\ell_1$-norm Lower Bounds

cs.LG · 2026-05-07 · unverdicted · novelty 8.0

SignSGD provably beats SGD by a factor of d under sparse noise via matched ℓ1-norm upper and lower bounds, with an equivalent result for Muon on matrices, and this predicts faster GPT-2 pretraining.

Why Muon Outperforms Adam: A Curvature Perspective

cs.LG · 2026-06-03 · conditional · novelty 7.0

Muon outperforms Adam by reducing curvature penalty via lower Normalized Directional Sharpness, as shown via Taylor approximation on LLM training and proven on stylized quadratic problems with heterogeneous curvature.

AMUSE: Anytime Muon with Stable Gradient Evaluation

cs.LG · 2026-05-21 · unverdicted · novelty 7.0

AMUSE is a new optimizer integrating Muon orthogonalization with Schedule-Free averaging via adaptive interpolation for schedule-free anytime training that improves Pareto frontiers on vision and LLM tasks.

DP-Muon: Differentially Private Optimization via Matrix-Orthogonalized Momentum

cs.LG · 2026-05-13 · unverdicted · novelty 7.0

DP-Muon adapts matrix-orthogonalized momentum optimization to differential privacy via per-matrix clipping and noise addition, with proofs of inherited privacy and optimization guarantees plus a bias-corrected version that improves private fine-tuning utility.

Gradient Clipping Beyond Vector Norms: A Spectral Approach for Matrix-Valued Parameters

cs.LG · 2026-05-12 · unverdicted · novelty 7.0

Spectral clipping of leading singular values in gradient matrices stabilizes SGD for non-convex problems with heavy-tailed noise and achieves the optimal convergence rate O(K^{(2-2α)/(3α-2)}).

Phases of Muon: When Muon Eclipses SignSGD

math.OC · 2026-05-10 · unverdicted · novelty 7.0

On power-law covariance least squares problems, SignSVD (Muon) and SignSGD (Adam proxy) show three phases of relative performance depending on data exponent α and target exponent β.

Intrinsic Muon: Spectral Optimization on Riemannian Matrix Manifolds

cs.LG · 2026-05-10 · unverdicted · novelty 7.0

Intrinsic Muon provides closed-form linear maximization oracles on multiple Riemannian matrix manifolds for unitarily invariant norms, with convergence rates depending only on manifold dimension or rank.

Muon with Nesterov Momentum: Heavy-Tailed Noise and (Randomized) Inexact Polar Decomposition

math.OC · 2026-05-07 · unverdicted · novelty 7.0

Muon with Nesterov momentum and inexact polar decomposition achieves optimal convergence rates of O(ε^(-(3α-2)/(α-1))) under heavy-tailed noise for ε-stationary points in non-convex settings.

Convergence Rate Analysis of SOAP with Arbitrary Orthogonal Projection Matrices

math.OC · 2026-04-23 · unverdicted · novelty 7.0

SOAP and its generalizations with arbitrary orthogonal projections converge at a provable rate when the projections are conditionally independent of the current gradient.

Sharp Capacity Scaling of Spectral Optimizers in Learning Associative Memory

cs.LG · 2026-03-27 · unverdicted · novelty 7.0

Muon achieves higher storage capacity than SGD and matches Newton's method in one-step recovery rates for associative memory under power-law distributions, while saturating at larger critical batch sizes and showing faster initial multi-step dynamics.

On the Convergence of Muon and Beyond

cs.LG · 2025-09-19 · unverdicted · novelty 7.0

Muon-MVR2 attains the optimal anytime convergence rate of ~O(T^{-1/3}) in stochastic non-convex settings under horizon-free schedules.

FOGO: Forgetting-aware Orthogonalization Optimizer

cs.LG · 2026-06-09 · unverdicted · novelty 6.0

FOGO introduces spectral orthogonalization of momentum updates plus a random-projection codebook memory to detect and correct gradient interference, improving convergence and retention over Adam and Muon on imbalanced, continual, and large-model tasks.

OptMuon: Closed-Loop Orthogonalized Momentum Methods for Stochastic Optimization with Zero-Noise Optimality

math.OC · 2026-06-07 · unverdicted · novelty 6.0 · 2 refs

OptMuon combines orthogonalized momentum with trajectory-dependent AdaGrad-Norm adaptation to obtain expected-stationarity rates of order T^{-1/2} + sigma^{1/2}T^{-1/4} or T^{-1/2} + sigma^{1/3}T^{-1/3} that reduce to near-optimal deterministic first-order rates in the zero-noise regime.

LionMuon: Alternating Spectral and Sign Descent for Efficient Training

cs.LG · 2026-05-19 · unverdicted · novelty 6.0 · 2 refs

LionMuon alternates Lion and Muon steps with shared dual-EMA buffer to Pareto-dominate existing optimizers in loss and compute on models up to 720M parameters.

Scale-Invariant Neural Network Optimization: Norm Geometry and Heavy-Tailed Noise

math.OC · 2026-05-18 · unverdicted · novelty 6.0

Establishes matching Ω and O(min{m,n} ε^-(3p-2)/(p-1)) bounds for scale-invariant spectral-norm methods under heavy-tailed noise, plus an improved O(min{m,n} ε^-(5p-3)/(2p-2)) rate via transported Scion under Hessian Lipschitz continuity.

Symmetry-Compatible Principle for Optimizer Design: Embeddings, LM Heads, SwiGLU MLPs, and MoE Routers

math.OC · 2026-05-18 · unverdicted · novelty 6.0 · 2 refs

Proposes equivariant optimizer updates matched to layer symmetries for embeddings, SwiGLU MLPs, and MoE routers, with reported gains in validation loss and training stability on several language model architectures.

Muon Does Not Converge on Convex Lipschitz Functions

cs.LG · 2026-05-09 · unverdicted · novelty 6.0

Muon does not converge on convex Lipschitz functions regardless of learning rate, while error feedback restores theoretical convergence but degrades performance on CIFAR-10 and nanoGPT tasks.

Muon-OGD: Muon-based Spectral Orthogonal Gradient Projection for LLM Continual Learning

cs.LG · 2026-05-09 · unverdicted · novelty 6.0 · 2 refs

Muon-OGD introduces a spectral-norm constrained orthogonal projection method solved via dual iterations and Newton-Schulz approximations to improve stability-plasticity trade-off in sequential LLM adaptation.

ZAYA1-8B Technical Report

cs.AI · 2026-05-06 · unverdicted · novelty 6.0

ZAYA1-8B is a reasoning MoE model with 700M active parameters that matches larger models on math and coding benchmarks and reaches 91.9% on AIME'25 via Markovian RSA test-time compute.

SignMuon: Communication-Efficient Distributed Muon Optimization

cs.LG · 2026-05-04 · unverdicted · novelty 6.0

SignMuon merges majority-vote sign aggregation from signSGD with Muon's polar-factor steps to create a communication-efficient distributed optimizer that matches signSGD rates under symmetric noise and shows strong empirical results on CIFAR and nanoGPT.

SUDA-Muon: Structural Design Principles and Boundaries for Fully Decentralized Muon

math.OC · 2026-04-27 · unverdicted · novelty 6.0

SUDA-Muon modularizes decentralized Muon via the SUDA template, proving a topology-separated convergence rate of O((1+σ/√N)K^{-1/4}) in nuclear-norm geometry while establishing that tracking-before-polarization is required to avoid non-stationary fixed points and that local-polarize-then-average is

MuonEq: Balancing Before Orthogonalization with Lightweight Equilibration

cs.LG · 2026-03-30 · unverdicted · novelty 6.0

MuonEq introduces pre-orthogonalization equilibration schemes that improve Muon optimizer performance during large language model pretraining.

Muon Learns More Robust and Transferable Features than Adam

cs.LG · 2026-06-08 · unverdicted · novelty 5.0

Muon learns more robust and transferable features than Adam and SGD, shown via corruption robustness tests, transfer experiments, layer-wise probes, effective rank measurements, and a theoretical proof on margins in a multi-component classification problem.

Can Entry-Wise Clipping Give Spectral Control of Stochastic Gradients?

cs.LG · 2026-05-26 · unverdicted · novelty 5.0

Entry-wise clipping achieves spectral control of gradients via localization under heavy-tailed contamination, with O(ε^{-4}) convergence and empirical savings on NanoGPT pretraining.

citing papers explorer

Showing 33 of 33 citing papers after filters.

When and Why SignSGD Outperforms SGD: A Theoretical Study Based on $\ell_1$-norm Lower Bounds cs.LG · 2026-05-07 · unverdicted · none · ref 32 · internal anchor
SignSGD provably beats SGD by a factor of d under sparse noise via matched ℓ1-norm upper and lower bounds, with an equivalent result for Muon on matrices, and this predicts faster GPT-2 pretraining.
AMUSE: Anytime Muon with Stable Gradient Evaluation cs.LG · 2026-05-21 · unverdicted · none · ref 39 · internal anchor
AMUSE is a new optimizer integrating Muon orthogonalization with Schedule-Free averaging via adaptive interpolation for schedule-free anytime training that improves Pareto frontiers on vision and LLM tasks.
DP-Muon: Differentially Private Optimization via Matrix-Orthogonalized Momentum cs.LG · 2026-05-13 · unverdicted · none · ref 7 · internal anchor
DP-Muon adapts matrix-orthogonalized momentum optimization to differential privacy via per-matrix clipping and noise addition, with proofs of inherited privacy and optimization guarantees plus a bias-corrected version that improves private fine-tuning utility.
Gradient Clipping Beyond Vector Norms: A Spectral Approach for Matrix-Valued Parameters cs.LG · 2026-05-12 · unverdicted · none · ref 56 · internal anchor
Spectral clipping of leading singular values in gradient matrices stabilizes SGD for non-convex problems with heavy-tailed noise and achieves the optimal convergence rate O(K^{(2-2α)/(3α-2)}).
Phases of Muon: When Muon Eclipses SignSGD math.OC · 2026-05-10 · unverdicted · none · ref 65 · internal anchor
On power-law covariance least squares problems, SignSVD (Muon) and SignSGD (Adam proxy) show three phases of relative performance depending on data exponent α and target exponent β.
Intrinsic Muon: Spectral Optimization on Riemannian Matrix Manifolds cs.LG · 2026-05-10 · unverdicted · none · ref 54 · internal anchor
Intrinsic Muon provides closed-form linear maximization oracles on multiple Riemannian matrix manifolds for unitarily invariant norms, with convergence rates depending only on manifold dimension or rank.
Muon with Nesterov Momentum: Heavy-Tailed Noise and (Randomized) Inexact Polar Decomposition math.OC · 2026-05-07 · unverdicted · none · ref 45 · internal anchor
Muon with Nesterov momentum and inexact polar decomposition achieves optimal convergence rates of O(ε^(-(3α-2)/(α-1))) under heavy-tailed noise for ε-stationary points in non-convex settings.
Convergence Rate Analysis of SOAP with Arbitrary Orthogonal Projection Matrices math.OC · 2026-04-23 · unverdicted · none · ref 18 · internal anchor
SOAP and its generalizations with arbitrary orthogonal projections converge at a provable rate when the projections are conditionally independent of the current gradient.
Sharp Capacity Scaling of Spectral Optimizers in Learning Associative Memory cs.LG · 2026-03-27 · unverdicted · none · ref 49 · internal anchor
Muon achieves higher storage capacity than SGD and matches Newton's method in one-step recovery rates for associative memory under power-law distributions, while saturating at larger critical batch sizes and showing faster initial multi-step dynamics.
On the Convergence of Muon and Beyond cs.LG · 2025-09-19 · unverdicted · none · ref 43 · internal anchor
Muon-MVR2 attains the optimal anytime convergence rate of ~O(T^{-1/3}) in stochastic non-convex settings under horizon-free schedules.
FOGO: Forgetting-aware Orthogonalization Optimizer cs.LG · 2026-06-09 · unverdicted · none · ref 38 · internal anchor
FOGO introduces spectral orthogonalization of momentum updates plus a random-projection codebook memory to detect and correct gradient interference, improving convergence and retention over Adam and Muon on imbalanced, continual, and large-model tasks.
OptMuon: Closed-Loop Orthogonalized Momentum Methods for Stochastic Optimization with Zero-Noise Optimality math.OC · 2026-06-07 · unverdicted · none · ref 75 · 2 links · internal anchor
OptMuon combines orthogonalized momentum with trajectory-dependent AdaGrad-Norm adaptation to obtain expected-stationarity rates of order T^{-1/2} + sigma^{1/2}T^{-1/4} or T^{-1/2} + sigma^{1/3}T^{-1/3} that reduce to near-optimal deterministic first-order rates in the zero-noise regime.
LionMuon: Alternating Spectral and Sign Descent for Efficient Training cs.LG · 2026-05-19 · unverdicted · none · ref 19 · 2 links · internal anchor
LionMuon alternates Lion and Muon steps with shared dual-EMA buffer to Pareto-dominate existing optimizers in loss and compute on models up to 720M parameters.
Scale-Invariant Neural Network Optimization: Norm Geometry and Heavy-Tailed Noise math.OC · 2026-05-18 · unverdicted · none · ref 109 · internal anchor
Establishes matching Ω and O(min{m,n} ε^-(3p-2)/(p-1)) bounds for scale-invariant spectral-norm methods under heavy-tailed noise, plus an improved O(min{m,n} ε^-(5p-3)/(2p-2)) rate via transported Scion under Hessian Lipschitz continuity.
Symmetry-Compatible Principle for Optimizer Design: Embeddings, LM Heads, SwiGLU MLPs, and MoE Routers math.OC · 2026-05-18 · unverdicted · none · ref 138 · 2 links · internal anchor
Proposes equivariant optimizer updates matched to layer symmetries for embeddings, SwiGLU MLPs, and MoE routers, with reported gains in validation loss and training stability on several language model architectures.
Muon Does Not Converge on Convex Lipschitz Functions cs.LG · 2026-05-09 · unverdicted · none · ref 82 · internal anchor
Muon does not converge on convex Lipschitz functions regardless of learning rate, while error feedback restores theoretical convergence but degrades performance on CIFAR-10 and nanoGPT tasks.
Muon-OGD: Muon-based Spectral Orthogonal Gradient Projection for LLM Continual Learning cs.LG · 2026-05-09 · unverdicted · none · ref 14 · 2 links · internal anchor
Muon-OGD introduces a spectral-norm constrained orthogonal projection method solved via dual iterations and Newton-Schulz approximations to improve stability-plasticity trade-off in sequential LLM adaptation.
ZAYA1-8B Technical Report cs.AI · 2026-05-06 · unverdicted · none · ref 201 · internal anchor
ZAYA1-8B is a reasoning MoE model with 700M active parameters that matches larger models on math and coding benchmarks and reaches 91.9% on AIME'25 via Markovian RSA test-time compute.
SignMuon: Communication-Efficient Distributed Muon Optimization cs.LG · 2026-05-04 · unverdicted · none · ref 32 · internal anchor
SignMuon merges majority-vote sign aggregation from signSGD with Muon's polar-factor steps to create a communication-efficient distributed optimizer that matches signSGD rates under symmetric noise and shows strong empirical results on CIFAR and nanoGPT.
SUDA-Muon: Structural Design Principles and Boundaries for Fully Decentralized Muon math.OC · 2026-04-27 · unverdicted · none · ref 22 · internal anchor
SUDA-Muon modularizes decentralized Muon via the SUDA template, proving a topology-separated convergence rate of O((1+σ/√N)K^{-1/4}) in nuclear-norm geometry while establishing that tracking-before-polarization is required to avoid non-stationary fixed points and that local-polarize-then-average is
MuonEq: Balancing Before Orthogonalization with Lightweight Equilibration cs.LG · 2026-03-30 · unverdicted · none · ref 12 · internal anchor
MuonEq introduces pre-orthogonalization equilibration schemes that improve Muon optimizer performance during large language model pretraining.
Muon Learns More Robust and Transferable Features than Adam cs.LG · 2026-06-08 · unverdicted · none · ref 55 · internal anchor
Muon learns more robust and transferable features than Adam and SGD, shown via corruption robustness tests, transfer experiments, layer-wise probes, effective rank measurements, and a theoretical proof on margins in a multi-component classification problem.
Can Entry-Wise Clipping Give Spectral Control of Stochastic Gradients? cs.LG · 2026-05-26 · unverdicted · none · ref 40 · internal anchor
Entry-wise clipping achieves spectral control of gradients via localization under heavy-tailed contamination, with O(ε^{-4}) convergence and empirical savings on NanoGPT pretraining.
Convergence of Spectral Descent for Non-smooth Optimization cs.LG · 2026-05-26 · unverdicted · none · ref 22 · internal anchor
Proves linear convergence of Spectral Descent (SD) and Truncated SD for non-smooth convex problems under stated conditions, sublinear rates for regularized versions via Frank-Wolfe, and recovery guarantees for robust low-rank matrix recovery.
When Muon Optimizer Meets Adversarial Training: A Theoretical and Empirical Study cs.LG · 2026-05-26 · unverdicted · none · ref 42 · internal anchor
Muon optimizer in adversarial training imposes spectral-norm stability on matrix updates and matches or exceeds SGD/AdamW robustness on CNNs and ViTs under lp attacks.
Anytime Training with Schedule-Free Spectral Optimization cs.LG · 2026-05-21 · unverdicted · none · ref 61 · internal anchor
SF-NorMuon is a new schedule-free spectral optimizer that closes the gap with tuned AdamW on 125M-772M parameter models across 1-8x Chinchilla horizons while providing stationarity guarantees.
MiMuon: Mixed Muon Optimizer with Improved Generalization for Large Models cs.LG · 2026-05-19 · unverdicted · none · ref 36 · internal anchor
MiMuon is a hybrid optimizer that achieves a generalization error bound of O(1/N) independent of the small singular-value gap that limits the original Muon bound, while retaining the same O(1/T^{1/4}) convergence rate.
Position: Zeroth-Order Optimization in Deep Learning Is Underexplored, Not Underpowered cs.LG · 2026-05-15 · unverdicted · none · ref 31 · internal anchor
Zeroth-order optimization is underexplored rather than underpowered in deep learning, with limitations stemming from full-space designs that can be addressed via subspace, spectral, and systems-aware approaches.
Communication-Efficient Gluon in Federated Learning cs.LG · 2026-04-12 · unverdicted · none · ref 36 · internal anchor
Compressed Gluon variants using unbiased/contraction compressors and SARAH-style variance reduction achieve convergence guarantees and lower communication costs in federated learning under layer-wise smoothness.
HTMuon: Improving Muon via Heavy-Tailed Spectral Correction cs.LG · 2026-03-10 · unverdicted · none · ref 24 · internal anchor
HTMuon modifies Muon to produce heavier-tailed updates and weight spectra via HT-SR theory, yielding up to 0.98 lower perplexity on LLaMA pretraining and serving as a plug-in for other Muon variants.
Low-rank Orthogonalization for Large-scale Matrix Optimization with Applications to Foundation Model Training cs.LG · 2025-09-15 · unverdicted · none · ref 48 · internal anchor
Proposes low-rank orthogonalization and derives low-rank Muon and MSGD variants that outperform standard Muon on GPT-2 and LLaMA pretraining while providing iteration complexity bounds.
ZONOS2 Technical Report cs.SD · 2026-06-23 · unverdicted · none · ref 242 · internal anchor
ZONOS2 8B is a scaled MoE TTS model with 900M active parameters trained on 6M hours of data that reports competitive SOTA results on naturalness, speaker similarity, WER, and a new ZTTS1-Eval benchmark while releasing weights and code.
Statistical Properties of Training & Generalization stat.ML · 2026-06-18 · unverdicted · none · ref 120 · internal anchor
Review of neural scaling laws and their relation to constraints and inductive biases when applying machine learning to physics problems.

On the Convergence Analysis of Muon

hub tools

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer