On the Convergence Analysis of Muon

Cong Shen; Jiawei Zhang; Minhui Huang; Ruichuan Huang; Wei Shen

arXiv: 2505.23737 · v2 · submitted 2025-05-29 · stat.ML · cs.IT · cs.LG · math.IT · math.OC

Pith reviewed 2026-05-19 13:15 UTC · model grok-4.3

keywords: Muon optimizer · convergence analysis · low-rank Hessian · matrix parameters · neural network training · gradient descent comparison · optimization theory

The pith

Muon can outperform gradient descent by benefiting from the low-rank structure of Hessian matrices during neural network training.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops a convergence analysis for the Muon optimizer, which is designed specifically for the matrix-shaped parameters common in neural networks rather than treating them as flat vectors. It compares Muon directly to standard gradient descent and identifies the conditions under which Muon achieves better rates. The central result is that Muon exploits low-rank structure in the Hessian, a pattern frequently seen in practice. This supplies a theoretical reason for Muon's observed empirical gains. The analysis therefore ties optimizer choice to the intrinsic geometry of the loss surface rather than treating all parameters uniformly.
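
To make the matrix-versus-vector distinction concrete, here is a minimal NumPy sketch of one Muon-style step next to a plain gradient-descent step. It assumes the commonly described form of Muon (momentum accumulated as a matrix, then replaced by its orthogonal polar factor $UV^\top$ from the SVD); the paper's own algorithm statement is not reproduced on this page, so the helper names `orthogonalize`, `muon_step`, `gd_step` and the values of `lr` and `beta` are illustrative assumptions. Practical implementations typically approximate the polar factor with a Newton-Schulz iteration rather than an exact SVD.

```python
import numpy as np

def orthogonalize(M):
    """Return U @ V^T from the thin SVD M = U diag(s) V^T (polar factor of M)."""
    U, _, Vt = np.linalg.svd(M, full_matrices=False)
    return U @ Vt

def muon_step(W, grad, momentum, lr=0.1, beta=0.9):
    """One Muon-style update on a matrix parameter (hypothetical helper)."""
    momentum = beta * momentum + grad        # momentum kept in matrix form
    W = W - lr * orthogonalize(momentum)     # step along the orthogonalized direction
    return W, momentum

def gd_step(W, grad, lr=0.1):
    """Plain gradient descent: the update inherits the gradient's spectrum."""
    return W - lr * grad

rng = np.random.default_rng(0)
W = rng.standard_normal((4, 6))              # toy matrix parameter (assumed shape)
G = rng.standard_normal((4, 6))              # stand-in for a gradient

W_new, m = muon_step(W, G, np.zeros_like(W), lr=0.1)
update = (W - W_new) / 0.1                   # recover the orthogonalized direction
singular_values = np.linalg.svd(update, compute_uv=False)
print(singular_values)
```

For a full-rank gradient the orthogonalized update has all singular values equal to 1, so the step size alone sets the spectral norm of the update, whereas the GD update keeps the singular values of the gradient.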

Core claim

Muon achieves improved convergence rates over gradient descent precisely when the Hessian matrices of the loss exhibit low-rank structure, a property that holds under the modeling conditions examined and that matches what is widely observed when training neural networks with matrix parameters.

What carries the argument

Convergence-rate comparison between Muon and gradient descent that isolates the benefit from low-rank Hessian structure.

If this is right

Muon can converge faster than gradient descent when the Hessian is low-rank and the paper's other conditions hold.
The advantage scales with the degree of rank deficiency in the Hessian.
Standard vector-based optimizers ignore this structural property and therefore cannot realize the same improvement.
The result supplies a concrete criterion for choosing Muon over gradient descent in matrix-parameterized models.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar low-rank exploitation might be engineered into other matrix-aware optimizers.
The analysis suggests testing Muon on tasks where Hessian rank can be explicitly controlled, such as low-rank matrix completion or factored models.
If low-rank Hessians are the main source of Muon's edge, then hybrid methods that detect rank deficiency on the fly could further improve performance.

Load-bearing premise

Hessian matrices arising in neural network training possess low-rank structure under the conditions studied.

What would settle it

A direct measurement showing that Muon and gradient descent converge at the same rate (or Muon is slower) on a problem where the Hessian has full rank and satisfies the other modeling assumptions.

Figures

Figures reproduced from arXiv: 2505.23737 by Cong Shen, Jiawei Zhang, Minhui Huang, Ruichuan Huang, Wei Shen.

**Figure 1.** Experiments on a quadratic function $f(W) = \operatorname{tr}\big((W - W^*)^\top Q (W - W^*)\big)$. Detailed settings can be found in Section E. … with dataset $\{(x_i, y_i)\}_{i=1}^{B}$, where $x_i \in \mathbb{R}^d$ is the feature data, $y_i \in \mathbb{R}^c$ is the corresponding $c$-dimensional one-hot vector of the label, $c$ is the number of classes and $B$ is the number of samples. We first consider a linear model $g(W; x) = Wx \in \mathbb{R}^c$, where $W \in \mathbb{R}^{c \times d}$ is the parameter, and…

**Figure 2.** (a, b): Spectra of $X_{\text{MNIST}}$ and $X_{\text{Gaussian}}$. (c, d, e, f): Optimizing (3) with different $X$ and $Y$. Actually, we find that the feature matrix $X$ in practical machine learning problems is typically very "low rank", or more precisely, its singular values are highly concentrated.

**Figure 3.** Comparison of $J_t$, $L_t$, $\|\nabla f(W_t)\|_*^2$ and $\|\nabla f(W_t)\|_F^2$ over the training process of GD and Muon (Algorithm 2) on an MLP model with three matrix parameters $W_1 \in \mathbb{R}^{128 \times 784}$, $W_2 \in \mathbb{R}^{64 \times 128}$, $W_3 \in \mathbb{R}^{10 \times 64}$. We show the gradients and Hessians with respect to $W_2$ in this figure. Detailed settings can be found in Appendix E. (a) Loss (b) AdamW: $J_t$ and $L_t$ (c) AdamW: $\|\nabla f(W_t)\|_*^2$ and $\|\nabla f(W_t)\|_F^2$ (d) AdamW: $\|\nabla f(W_t)\|_*^2/$…

**Figure 4.** Comparison of $J_t$, $L_t$, $\|\nabla f(W_t)\|_*^2$, $\|\nabla f(W_t)\|_F^2$ and the average ratio over the training process of a six-layer GPT-2 style model with AdamW and Muon. (e): The average ratio ($\|\nabla f(W_t)\|_*^2 L_t / (\|\nabla f(W_t)\|_F^2 J_t)$) of all matrix parameters optimized by Muon. In fact, not only does the average ratio of all matrix parameters satisfy Equation (7), but every matrix parameter optimized by Muon also satisfies Equation …

**Figure 5.** Spectra of $X_{\text{CIFAR10}}$, $X_{\text{Text}}$, $X_{\text{Gaussian},1}$ and $X_{\text{Gaussian},2}$. Then, we apply GD and Muon to optimize this function, both starting from $W_0 = 0$ with 4000 iterations. We choose the optimal constant stepsize $1/L$ for GD and choose the stepsize for Muon such that Muon can converge in 4000 iterations with the best function value. For each iteration of both algorithms, we record the difference in function value from…
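
The quadratic experiment described in the captions can be sketched roughly as follows. This is a toy reconstruction under stated assumptions, not the authors' setup: the matrix `Q`, its $1/i^2$ spectrum, the shapes, and the iteration count are all hypothetical, and Muon's step size is chosen here by exact line search along the orthogonalized direction (closed-form for a quadratic) instead of the tuned step size the caption mentions. GD uses the constant step $1/L$ with $L = 2\lambda_{\max}(Q)$, and both runs start from $W_0 = 0$.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, steps = 20, 10, 300                     # hypothetical sizes and iteration count

# Hypothetical Q: symmetric positive definite with a fast-decaying spectrum.
eigvals = 1.0 / np.arange(1, m + 1) ** 2
basis, _ = np.linalg.qr(rng.standard_normal((m, m)))
Q = basis @ np.diag(eigvals) @ basis.T
Q = 0.5 * (Q + Q.T)                           # remove rounding asymmetry
W_star = rng.standard_normal((m, n))

def loss(W):
    E = W - W_star
    return np.trace(E.T @ Q @ E)              # f(W) = tr((W - W*)^T Q (W - W*))

def grad(W):
    return 2.0 * Q @ (W - W_star)             # valid because Q is symmetric

L = 2.0 * eigvals.max()                       # Frobenius-norm smoothness constant

def run_gd():
    W, history = np.zeros((m, n)), []
    for _ in range(steps):
        history.append(loss(W))
        W = W - grad(W) / L                   # constant step size 1/L
    return np.array(history)

def run_muon():
    W, history = np.zeros((m, n)), []
    for _ in range(steps):
        history.append(loss(W))
        U, s, Vt = np.linalg.svd(grad(W), full_matrices=False)
        O = U @ Vt                            # orthogonalized gradient direction
        curvature = np.trace(O.T @ Q @ O)     # positive because Q is positive definite
        eta = s.sum() / (2.0 * curvature)     # exact line search (an assumption here)
        W = W - eta * O
    return np.array(history)

gd_hist, muon_hist = run_gd(), run_muon()
print(gd_hist[-1], muon_hist[-1])
```

Both loss curves are non-increasing by construction (the $1/L$ step and the exact line search each guarantee descent on a quadratic). Which curve ends lower depends on the spectrum of `Q`, so the sketch makes no claim about that; changing `eigvals` is the natural way to probe the low-rank regime the paper discusses.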

Original abstract

The majority of parameters in neural networks are naturally represented as matrices. However, most commonly used optimizers treat these matrix parameters as flattened vectors during optimization, potentially overlooking their inherent structural properties. Recently, an optimizer called Muon has been proposed, specifically designed to optimize matrix-structured parameters. Extensive empirical evidence shows that Muon can significantly outperform traditional optimizers when training neural networks. Nonetheless, the theoretical understanding of Muon's convergence behavior and the reasons behind its superior performance remain limited. In this work, we present a comprehensive convergence rate analysis of Muon and its comparison with Gradient Descent (GD). We characterize the conditions under which Muon can outperform GD. Our theoretical results reveal that Muon can benefit from the low-rank structure of Hessian matrices, a phenomenon widely observed in practical neural network training. Our experimental results support and corroborate the theoretical findings.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper gives Muon a convergence analysis that credits its edge over GD to low-rank Hessians, but the advantage looks inserted as an assumption more than extracted from the dynamics.

The main point is that this work supplies a convergence-rate comparison between Muon and gradient descent, with the claimed improvement for Muon resting on the Hessian having low rank. That matches the empirical observation that neural-network Hessians often behave this way, so the framing is reasonable on its face. The paper also runs experiments that line up with the stated conditions, which is useful for grounding the theory even if the setups are limited. Credit is due for trying to move beyond the usual vector-flattening view and for spelling out when the matrix-aware update should pull ahead. The math appears to follow standard local-quadratic or quadratic analysis, which keeps it accessible.

The soft spot is exactly the one flagged in the stress-test note. The low-rank structure enters as a modeling premise rather than something the proof shows is preserved or actively used by Muon's update along the trajectory. If the contraction factor simply gets a rank-dependent term plugged in without bounding how the iterates maintain that rank or how small r must be before the gap appears, the link between the analysis and practical superiority stays incomplete. The paper does not seem to address non-quadratic regimes or provide checks on whether the optimizer itself induces the low-rank property. This is a moderate rather than fatal issue, but it limits how far the results can be taken without further work.

The paper is for people who design or analyze matrix-parameter optimizers and want a first theoretical handle on why they sometimes beat plain GD. A reader already following Muon or low-rank Hessian literature will get the most out of it. It is coherent enough on its own terms to deserve referee time, even though the current version would likely need revisions on the assumption handling and tightness of the bounds. I would send it to review.

Referee Report

2 major / 2 minor

Summary. The manuscript analyzes the convergence of the Muon optimizer, which operates directly on matrix-structured parameters rather than vectorized versions, and compares its rates to those of gradient descent (GD). The central theoretical claim is that Muon can exploit low-rank structure in the Hessian (a phenomenon described as widely observed in neural network training) to achieve faster convergence under certain conditions, while GD does not; this is supported by derived rates and corroborated by experiments.

Significance. If the derivations are correct, the work supplies a concrete theoretical explanation for Muon's observed empirical gains by linking them to low-rank Hessians rather than generic matrix-aware updates. This is a useful contribution to the growing literature on structure-exploiting optimizers, especially since the paper ships explicit rate comparisons and identifies the rank-dependent regime. The result would be more impactful if the low-rank benefit were shown to arise from the dynamics rather than inserted as a modeling assumption.

major comments (2)

§4 (Convergence Analysis): The claimed improvement in Muon's contraction factor when the Hessian has rank r ≪ d is obtained by restricting the update to the dominant subspace; however, the analysis does not derive that the trajectory preserves or exploits this low-rank structure from the optimization dynamics. It is introduced as an external modeling choice (see Assumption 3.2), so the link between the low-rank premise and the outperformance condition remains an assumption rather than a derived property.
Theorem 4.3 and Corollary 4.4: The comparison to GD shows a rank-dependent gap only after substituting the low-rank Hessian model into Muon's step-size choice; without a separate bound showing that GD cannot similarly benefit from the same low-rank information (or that Muon's matrix update is the only mechanism that captures it), the claim that Muon 'benefits from the low-rank structure' while GD does not is not fully secured.

minor comments (2)

[§3] Notation for the matrix norm and the projection onto the rank-r subspace is introduced in §3 but used without re-statement in the proofs of §4; adding a short reminder would improve readability.
[§5] The experimental section reports wall-clock speedups but does not include the condition number or effective rank of the Hessians on the tested models; adding these diagnostics would directly connect the plots to the low-rank regime analyzed in the theory.
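
The diagnostics requested in the last comment could be computed along these lines. This is only a sketch: the report does not say which notion of "effective rank" it has in mind, so the stable rank $\|H\|_F^2 / \|H\|_2^2$ and the nuclear-to-spectral ratio used below are assumptions, as are the function name `spectrum_diagnostics`, the tolerance `tol`, and the toy matrix `H`. The condition number is taken over the nonzero singular values, and `H` is assumed to be nonzero.

```python
import numpy as np

def spectrum_diagnostics(H, tol=1e-12):
    """Summarize how concentrated the singular values of a nonzero matrix H are."""
    s = np.linalg.svd(H, compute_uv=False)    # singular values, descending order
    nonzero = s[s > tol * s[0]]               # drop numerically zero values
    return {
        "condition_number": s[0] / nonzero[-1],         # over nonzero singular values
        "stable_rank": (s ** 2).sum() / s[0] ** 2,      # ||H||_F^2 / ||H||_2^2
        "nuclear_over_spectral": s.sum() / s[0],        # ||H||_* / ||H||_2
    }

# Toy stand-in for a Hessian block with one dominant direction.
H = np.diag([4.0, 2.0, 1.0, 0.0])
d = spectrum_diagnostics(H)
print(d)
```

Both rank proxies lie between 1 and the true rank, so values close to 1 indicate a spectrum dominated by a few directions. For real networks the full Hessian is rarely formed explicitly, so in practice these quantities would have to be estimated, which this sketch does not attempt.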

Simulated Author's Rebuttal

2 responses · 0 unresolved

We are grateful to the referee for the careful reading of our manuscript and the insightful comments. We have prepared point-by-point responses to the major comments and will incorporate revisions as detailed below to improve the clarity and strength of our theoretical claims.

Referee: §4 (Convergence Analysis): The claimed improvement in Muon's contraction factor when the Hessian has rank r ≪ d is obtained by restricting the update to the dominant subspace; however, the analysis does not derive that the trajectory preserves or exploits this low-rank structure from the optimization dynamics. It is introduced as an external modeling choice (see Assumption 3.2), so the link between the low-rank premise and the outperformance condition remains an assumption rather than a derived property.

Authors: We agree that Assumption 3.2 introduces the low-rank Hessian structure as a modeling assumption rather than deriving its preservation from the optimization dynamics. This assumption is motivated by the extensive empirical literature on low-rank Hessians in neural network training, which we reference in the introduction. The contribution of Section 4 is to show the resulting convergence benefit for Muon under this condition. In the revision we will add explicit language in Section 3 and the concluding discussion to clarify the role of the assumption and to identify the derivation of low-rank structure from the dynamics as an open direction for future work.
revision: yes

Referee: Theorem 4.3 and Corollary 4.4: The comparison to GD shows a rank-dependent gap only after substituting the low-rank Hessian model into Muon's step-size choice; without a separate bound showing that GD cannot similarly benefit from the same low-rank information (or that Muon's matrix update is the only mechanism that captures it), the claim that Muon 'benefits from the low-rank structure' while GD does not is not fully secured.

Authors: The rate comparison in Theorem 4.3 and Corollary 4.4 is obtained by inserting the low-rank model into the respective convergence bounds, with Muon's matrix-parameter update permitting a step-size choice that operates directly on the dominant subspace. Standard GD is formulated on the vectorized parameters and therefore does not exploit the matrix structure in the same manner. We acknowledge that the manuscript does not supply an auxiliary bound ruling out all possible adaptations of GD. In the revision we will insert a clarifying paragraph after Theorem 4.3 that emphasizes the distinction arising from Muon's matrix-aware mechanism and notes that any comparable benefit for GD would require additional structural assumptions not present in the standard algorithm.
revision: yes

Circularity Check

0 steps flagged

No circularity: analysis derives rates from explicit low-rank assumption without self-reduction

The paper states its central result as a convergence comparison between Muon and GD that holds under the modeling premise of low-rank Hessian structure (an external empirical observation, not derived inside the paper). No quoted step equates a claimed prediction or rate to a fitted quantity, self-citation chain, or definitional tautology; the low-rank condition is inserted as an assumption into the contraction analysis rather than being smuggled in or forced by the optimizer definition itself. The derivation chain therefore remains self-contained against the stated assumptions and does not reduce to its inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the modeling assumption that Hessians possess low-rank structure; no free parameters or invented entities are mentioned in the abstract.

axioms (1)

domain assumption: Hessian matrices arising in neural network training possess low-rank structure.
Invoked to explain Muon's advantage over GD; stated as widely observed in practice.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: "Our theoretical results reveal that Muon can benefit from the low-rank structure of Hessian matrices, a phenomenon widely observed in practical neural network training."

IndisputableMonolith/Foundation/DimensionForcing.lean · alexander_duality_circle_linking · unclear
Paper passage: "when $H_t$ can be represented by $P_t \otimes Q_t$ and $Q_t$, $P_t$ are relatively low-rank such that $\sum_i \sigma_{p,i}\sigma_{q,i} \ll r\,\sigma_{p,1}\sigma_{q,1}$, then $J \ll rL$"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 26 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

When and Why SignSGD Outperforms SGD: A Theoretical Study Based on $\ell_1$-norm Lower Bounds
cs.LG 2026-05 unverdicted novelty 8.0
SignSGD provably beats SGD by a factor of d under sparse noise via matched ℓ1-norm upper and lower bounds, with an equivalent result for Muon on matrices, and this predicts faster GPT-2 pretraining.

AMUSE: Anytime Muon with Stable Gradient Evaluation
cs.LG 2026-05 unverdicted novelty 7.0
AMUSE is a new optimizer integrating Muon orthogonalization with Schedule-Free averaging via adaptive interpolation for schedule-free anytime training that improves Pareto frontiers on vision and LLM tasks.

Symmetry-Compatible Principle for Optimizer Design: Embeddings, LM Heads, SwiGLU MLPs, and MoE Routers
math.OC 2026-05 conditional novelty 7.0
Proposes equivariant optimizers matched to the symmetry groups of embeddings, SwiGLU projections and MoE routers, with experiments showing consistent gains over AdamW on language model pre-training.

DP-Muon: Differentially Private Optimization via Matrix-Orthogonalized Momentum
cs.LG 2026-05 unverdicted novelty 7.0
DP-Muon adapts matrix-orthogonalized momentum optimization to differential privacy via per-matrix clipping and noise addition, with proofs of inherited privacy and optimization guarantees plus a bias-corrected version...

Gradient Clipping Beyond Vector Norms: A Spectral Approach for Matrix-Valued Parameters
cs.LG 2026-05 unverdicted novelty 7.0
Spectral clipping of leading singular values in gradient matrices stabilizes SGD for non-convex problems with heavy-tailed noise and achieves the optimal convergence rate O(K^{(2-2α)/(3α-2)}).

Phases of Muon: When Muon Eclipses SignSGD
math.OC 2026-05 unverdicted novelty 7.0
On power-law covariance least squares problems, SignSVD (Muon) and SignSGD (Adam proxy) show three phases of relative performance depending on data exponent α and target exponent β.

Intrinsic Muon: Spectral Optimization on Riemannian Matrix Manifolds
cs.LG 2026-05 unverdicted novelty 7.0
Intrinsic Muon provides closed-form linear maximization oracles on multiple Riemannian matrix manifolds for unitarily invariant norms, with convergence rates depending only on manifold dimension or rank.

Muon with Nesterov Momentum: Heavy-Tailed Noise and (Randomized) Inexact Polar Decomposition
math.OC 2026-05 unverdicted novelty 7.0
Muon with Nesterov momentum and inexact polar decomposition achieves optimal convergence rates of O(ε^(-(3α-2)/(α-1))) under heavy-tailed noise for ε-stationary points in non-convex settings.

Convergence Rate Analysis of SOAP with Arbitrary Orthogonal Projection Matrices
math.OC 2026-04 unverdicted novelty 7.0
SOAP and its generalizations with arbitrary orthogonal projections converge at a provable rate when the projections are conditionally independent of the current gradient.

Sharp Capacity Scaling of Spectral Optimizers in Learning Associative Memory
cs.LG 2026-03 unverdicted novelty 7.0
Muon achieves higher storage capacity than SGD and matches Newton's method in one-step recovery rates for associative memory under power-law distributions, while saturating at larger critical batch sizes and showing f...

On the Convergence of Muon and Beyond
cs.LG 2025-09 unverdicted novelty 7.0
Muon-MVR2 attains the optimal anytime convergence rate of ~O(T^{-1/3}) in stochastic non-convex settings under horizon-free schedules.

LionMuon: Alternating Spectral and Sign Descent for Efficient Training
cs.LG 2026-05 unverdicted novelty 6.0
LionMuon alternates Lion sign steps and Muon spectral steps with shared dual-EMA momentum to match Lion memory while outperforming both at P=2 on 124M-720M models, backed by heavy-tailed complexity bounds that predict...

Muon Does Not Converge on Convex Lipschitz Functions
cs.LG 2026-05 unverdicted novelty 6.0
Muon does not converge on convex Lipschitz functions regardless of learning rate, while error feedback restores theoretical convergence but degrades performance on CIFAR-10 and nanoGPT tasks.

Muon-OGD: Muon-based Spectral Orthogonal Gradient Projection for LLM Continual Learning
cs.LG 2026-05 unverdicted novelty 6.0
Muon-OGD introduces a spectral-norm constrained orthogonal projection method solved via dual iterations and Newton-Schulz approximations to improve stability-plasticity trade-off in sequential LLM adaptation.

ZAYA1-8B Technical Report
cs.AI 2026-05 unverdicted novelty 6.0
ZAYA1-8B is a reasoning MoE model with 700M active parameters that matches larger models on math and coding benchmarks and reaches 91.9% on AIME'25 via Markovian RSA test-time compute.

SignMuon: Communication-Efficient Distributed Muon Optimization
cs.LG 2026-05 unverdicted novelty 6.0
SignMuon merges majority-vote sign aggregation from signSGD with Muon's polar-factor steps to create a communication-efficient distributed optimizer that matches signSGD rates under symmetric noise and shows strong em...

SUDA-Muon: Structural Design Principles and Boundaries for Fully Decentralized Muon
math.OC 2026-04 unverdicted novelty 6.0
SUDA-Muon modularizes decentralized Muon via the SUDA template, proving a topology-separated convergence rate of O((1+σ/√N)K^{-1/4}) in nuclear-norm geometry while establishing that tracking-before-polarization is req...

MuonEq: Balancing Before Orthogonalization with Lightweight Equilibration
cs.LG 2026-03 unverdicted novelty 6.0
MuonEq introduces pre-orthogonalization equilibration schemes that improve Muon optimizer performance during large language model pretraining.

Anytime Training with Schedule-Free Spectral Optimization
cs.LG 2026-05 unverdicted novelty 5.0
SF-NorMuon is a new schedule-free spectral optimizer that closes the gap with tuned AdamW on 125M-772M parameter models across 1-8x Chinchilla horizons while providing stationarity guarantees.

MiMuon: Mixed Muon Optimizer with Improved Generalization for Large Models
cs.LG 2026-05 unverdicted novelty 5.0
MiMuon is a hybrid optimizer that achieves a generalization error bound of O(1/N) independent of the small singular-value gap that limits the original Muon bound, while retaining the same O(1/T^{1/4}) convergence rate.

Position: Zeroth-Order Optimization in Deep Learning Is Underexplored, Not Underpowered
cs.LG 2026-05 unverdicted novelty 5.0
Zeroth-order optimization is underexplored rather than underpowered in deep learning, with limitations stemming from full-space designs that can be addressed via subspace, spectral, and systems-aware approaches.

Muon-OGD: Muon-based Spectral Orthogonal Gradient Projection for LLM Continual Learning
cs.LG 2026-05 unverdicted novelty 5.0
Muon-OGD integrates Muon-style spectral-norm geometry with orthogonal gradient constraints to improve the stability-plasticity trade-off during sequential LLM adaptation.

Communication-Efficient Gluon in Federated Learning
cs.LG 2026-04 unverdicted novelty 5.0
Compressed Gluon variants using unbiased/contraction compressors and SARAH-style variance reduction achieve convergence guarantees and lower communication costs in federated learning under layer-wise smoothness.

RMNP: Row-Momentum Normalized Preconditioning for Scalable Matrix-Based Optimization
cs.LG 2026-03 conditional novelty 5.0
RMNP preconditions matrix updates via row-wise L2 normalization instead of Newton-Schulz iteration, reducing complexity to O(mn) while matching Muon's non-convex convergence rate and empirical performance.

HTMuon: Improving Muon via Heavy-Tailed Spectral Correction
cs.LG 2026-03 unverdicted novelty 5.0
HTMuon modifies Muon to produce heavier-tailed updates and weight spectra via HT-SR theory, yielding up to 0.98 lower perplexity on LLaMA pretraining and serving as a plug-in for other Muon variants.

Low-rank Orthogonalization for Large-scale Matrix Optimization with Applications to Foundation Model Training
cs.LG 2025-09 unverdicted novelty 5.0
Proposes low-rank orthogonalization and derives low-rank Muon and MSGD variants that outperform standard Muon on GPT-2 and LLaMA pretraining while providing iteration complexity bounds.

Reference graph

Works this paper leans on

32 extracted references · 32 canonical work pages · cited by 25 Pith papers · 11 internal anchors

[1]

GPT-4 Technical Report

URL https://arxiv. org/abs/2303.08774, 2:6,

work page internal anchor Pith review Pith/arXiv arXiv
[2]

Dion: Distributed Orthonormalized Updates

Kwangjun Ahn and Byron Xu. Dion: A communication-efficient optimizer for large models.arXiv preprint arXiv:2504.05295,

work page arXiv
[3]

Asgo: Adaptive structured gradient optimization.arXiv preprint arXiv:2503.20762,

Kang An, Yuxing Liu, Rui Pan, Shiqian Ma, Donald Goldfarb, and Tong Zhang. Asgo: Adaptive structured gradient optimization.arXiv preprint arXiv:2503.20762,

work page arXiv
[4]

Old Optimizer, New Norm: An Anthology

12 Jeremy Bernstein and Laker Newhouse. Old optimizer, new norm: An anthology.arXiv preprint arXiv:2409.20325,

work page internal anchor Pith review Pith/arXiv arXiv
[5]

An exploration of non-euclidean gradient descent: Muon and its many variants.arXiv preprint arXiv:2510.09827,

Michael Crawshaw, Chirag Modi, Mingrui Liu, and Robert M Gower. An exploration of non-euclidean gradient descent: Muon and its many variants.arXiv preprint arXiv:2510.09827,

work page arXiv
[6]

When do spectral gradient updates help in deep learning?arXiv preprint arXiv:2512.04299,

Damek Davis and Dmitriy Drusvyatskiy. When do spectral gradient updates help in deep learning? arXiv preprint arXiv:2512.04299,

work page arXiv
[7]

Towards quantifying the hessian structure of neural networks.arXiv preprint arXiv:2505.02809,

Zhaorui Dong, Yushun Zhang, Zhi-Quan Luo, Jianfeng Yao, and Ruoyu Sun. Towards quantifying the hessian structure of neural networks.arXiv preprint arXiv:2505.02809,

work page arXiv
[8]

Handbook of convergence theorems for (stochastic) gradient methods.arXiv preprint arXiv:2301.11235,

Guillaume Garrigos and Robert M Gower. Handbook of convergence theorems for (stochastic) gradient methods.arXiv preprint arXiv:2301.11235,

work page arXiv
[9]

Gradient Descent Happens in a Tiny Subspace

Guy Gur-Ari, Daniel A Roberts, and Ethan Dyer. Gradient descent happens in a tiny subspace.arXiv preprint arXiv:1812.04754,

work page internal anchor Pith review Pith/arXiv arXiv
[10]

Mora: High-rank updating for parameter- efficient fine-tuning.arXiv preprint arXiv:2405.12130, 2024

Ting Jiang, Shaohan Huang, Shengyue Luo, Zihan Zhang, Haizhen Huang, Furu Wei, Weiwei Deng, Feng Sun, Qi Zhang, Deqing Wang, et al. Mora: High-rank updating for parameter-efficient fine- tuning.arXiv preprint arXiv:2405.12130,

work page arXiv
[11]

arXiv preprint arXiv:2503.12645 , year=

Dmitry Kovalev. Understanding gradient orthogonalization for deep learning via non-euclidean trust- region optimization.arXiv preprint arXiv:2503.12645,

work page arXiv
[12]

Learning multiple layers of features from tiny images.(2009),

Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images.(2009),

work page 2009
[13]

Li and M

Jiaxiang Li and Mingyi Hong. A note on the convergence of muon and further.arXiv preprint arXiv:2502.02900,

work page arXiv
[14]

Muon is Scalable for LLM Training

Jingyuan Liu, Jianlin Su, Xingcheng Yao, Zhejun Jiang, Guokun Lai, Yulun Du, Yidao Qin, Weixin Xu, Enzhe Lu, Junjie Yan, et al. Muon is scalable for llm training.arXiv preprint arXiv:2502.16982, 2025a. Liming Liu, Zhenghao Xu, Zixuan Zhang, Hao Kang, Zichong Li, Chen Liang, Weizhu Chen, and Tuo Zhao. Cosmos: A hybrid adaptive optimizer for memory-efficien...

work page internal anchor Pith review Pith/arXiv arXiv 1907
[15]

Charles H Martin and Christopher Hinrichs

Jianhao Ma, Yu Huang, Yuejie Chi, and Yuxin Chen. Preconditioning benefits of spectral orthogonal- ization in muon.arXiv preprint arXiv:2601.13474,

work page arXiv
[16]

A new perspective on shampoo’s preconditioner.arXiv preprint arXiv:2406.17748,

Depen Morwani, Itai Shapira, Nikhil Vyas, Eran Malach, Sham Kakade, and Lucas Janson. A new perspective on shampoo’s preconditioner.arXiv preprint arXiv:2406.17748,

work page arXiv
[17]

Training Deep Learning Models with Norm-Constrained LMOs

Thomas Pethick, Wanyun Xie, Kimon Antonakopoulos, Zhenyu Zhu, Antonio Silveti-Falls, and Volkan Cevher. Training deep learning models with norm-constrained lmos.arXiv preprint arXiv:2502.07529,

work page internal anchor Pith review arXiv
[18]

Muon is provably faster with momentum variance reduction.arXiv preprint arXiv:2512.16598, 2025

Xun Qian, Hussein Rammal, Dmitry Kovalev, and Peter Richtarik. Muon is provably faster with momentum variance reduction.arXiv preprint arXiv:2512.16598,

work page arXiv
[19]

Levent Sagun, Leon Bottou, and Yann LeCun. Eigenvalues of the hessian in deep learning: Singularity and beyond. arXiv preprint arXiv:1611.07476.

[20]
Levent Sagun, Utku Evci, V Ugur Guney, Yann Dauphin, and Leon Bottou. Empirical analysis of the hessian of over-parametrized neural networks. arXiv preprint arXiv:1706.04454.

[21]
Weijie Su. Isotropic curvature model for understanding deep learning optimization: Is gradient orthogonalization optimal? arXiv preprint arXiv:2511.00674.

[22]
Nikhil Vyas, Depen Morwani, Rosie Zhao, Mujin Kwun, Itai Shapira, David Brandfonbrener, Lucas Janson, and Sham Kakade. Soap: Improving and stabilizing shampoo using adam. arXiv preprint arXiv:2409.11321.

[23]
Shuche Wang, Fengzhuo Zhang, Jiaxiang Li, Cunxiao Du, Chao Du, Tianyu Pang, Zhuoran Yang, Mingyi Hong, and Vincent YF Tan. Muon outperforms adam in tail-end associative memory learning. arXiv preprint arXiv:2509.26030.

[24]
Yikai Wu, Xingyu Zhu, Chenwei Wu, Annie Wang, and Rong Ge. Dissecting hessian: Understanding common structure of hessian in neural networks. arXiv preprint arXiv:2010.04261.

[25]
Shuo Xie, Tianhao Wang, Sashank Reddi, Sanjiv Kumar, and Zhiyuan Li. Structured preconditioners in adaptive optimization: A unified analysis. arXiv preprint arXiv:2503.10537.

[26]
Yang You, Jing Li, Sashank Reddi, Jonathan Hseu, Sanjiv Kumar, Srinadh Bhojanapalli, Xiaodan Song, James Demmel, Kurt Keutzer, and Cho-Jui Hsieh. Large batch optimization for deep learning: Training bert in 76 minutes. arXiv preprint arXiv:1904.00962.

[27]
Yushun Zhang, Congliang Chen, Tian Ding, Ziniu Li, Ruoyu Sun, and Zhiquan Luo. Why transformers need adam: A hessian perspective. Advances in Neural Information Processing Systems, 37:131786–131823, 2024a.
Yushun Zhang, Congliang Chen, Ziniu Li, Tian Ding, Chenwei Wu, Diederik P Kingma, Yinyu Ye, Zhi-Quan Luo, and Ruoyu Sun. Adam-mini: Use fewer learning rates to gain more. arXiv preprint arXiv:2406.16793.

[28]
Jiawei Zhao, Zhenyu Zhang, Beidi Chen, Zhangyang Wang, Anima Anandkumar, and Yuandong Tian. Galore: Memory-efficient llm training by gradient low-rank projection. arXiv preprint arXiv:2403.03507.

[29]
Yi Zhou, Junjie Yang, Huishuai Zhang, Yingbin Liang, and Vahid Tarokh. Sgd converges to global minimum in deep learning via star-convex path. arXiv preprint arXiv:1901.00451.

Appendix

The Appendix is organized as follows. In Section A, we introduce some lemmas that will be utilized in the subsequent proofs, and we also give the proofs of Theorem 4.12 and Theorem A.3. In Section B, we present the proofs of theorems in the nonconvex setting. In Section C, we present the proofs of theorems in the star-convex setting. In Sectio...

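If $\tilde{J} > 0$ and $\eta = \min\left\{ \frac{D_{\mathrm{op}}}{T} \log\left( \frac{T\Delta}{D_{\mathrm{op}}^2 \tilde{J}} \right),\ D_{\mathrm{op}} \right\}$, we have

$$
f(W_T) - f^* \le \frac{D_{\mathrm{op}}^2 \tilde{J}}{T} + \frac{D_{\mathrm{op}}^2 \tilde{J}}{2T} \left[ \log\left( \frac{T\Delta}{D_{\mathrm{op}}^2 \tilde{J}} \right) \right]^2 + \frac{sr^{3/2} D_{\mathrm{op}}^3}{6T^2} \left[ \log\left( \frac{T\Delta}{D_{\mathrm{op}}^2 \tilde{J}} \right) \right]^2 \le \tilde{O}\left( \frac{D_{\mathrm{op}}^2 \tilde{J}}{T} \right).
$$

When $\tilde{J} \le 0$, we can set $\eta = \min\left\{ \frac{D_{\mathrm{op}}}{T} \log\left( \frac{T^2\Delta}{D_{\mathrm{op}}^3 sr^{3/2}} \right),\ D_{\mathrm{op}} \right\}$, and we have

$$
f(W_T) - f^* \le \frac{sr^{3/2} D_{\mathrm{op}}^3}{T^2} + \frac{sr^{3/2} D_{\mathrm{op}}^3}{6T^2} \left[ \log\left( \frac{T^2\Delta}{D_{\mathrm{op}}^3 sr^{3/2}} \right) \right]^2 \le \tilde{O}\left( \frac{sr^{3/2} D_{\mathrm{op}}^3}{T^2} \right).
$$

D Low-rank St...
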
For the Shakespeare dataset, we take the first 3000 characters and use the RoBERTa [Liu et al., 2019] tokenizer and embedding model to convert the text into an embedding matrix $X_{\mathrm{Text}} \in \mathbb{R}^{768 \times 836}$, where 768 is the embedding dimension and 836 is the token length. We compute its singular values and compare them with those of a Gaussian random matrix of the same size, $X_{\mathrm{Gaussian},2} \in \mathbb{R}^{768 \times 836}$ ...
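The singular-value comparison described above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the RoBERTa embedding step is not reproduced, so `x_standin` is a hypothetical synthetic matrix (a rank-16 signal plus small noise) standing in for the text embedding matrix, and the helper names `singular_values`, `stable_rank`, and `energy_rank` are chosen here and do not come from the paper. Only the matrix shape 768 × 836 is taken from the text.

```python
import numpy as np


def singular_values(x):
    """Return the singular values of x in descending order."""
    return np.linalg.svd(x, compute_uv=False)


def stable_rank(s):
    """Stable rank ||X||_F^2 / ||X||_op^2 computed from singular values s."""
    return float(np.sum(s**2) / s[0] ** 2)


def energy_rank(s, frac=0.9):
    """Smallest k whose top-k squared singular values hold `frac` of the energy."""
    energy = np.cumsum(s**2) / np.sum(s**2)
    return int(np.searchsorted(energy, frac) + 1)


rng = np.random.default_rng(0)
d, n = 768, 836  # embedding dimension and token length from the text

# Gaussian reference matrix of the same size.
x_gauss = rng.standard_normal((d, n))

# ASSUMPTION: synthetic stand-in for the text embedding matrix.
# A rank-16 signal plus small noise; the real matrix would come from RoBERTa.
k = 16
x_standin = (
    rng.standard_normal((d, k)) @ rng.standard_normal((k, n))
    + 0.01 * rng.standard_normal((d, n))
)

s_gauss = singular_values(x_gauss)
s_standin = singular_values(x_standin)

print("stable rank (Gaussian):", stable_rank(s_gauss))
print("stable rank (stand-in):", stable_rank(s_standin))
print("90% energy rank (Gaussian):", energy_rank(s_gauss))
print("90% energy rank (stand-in):", energy_rank(s_standin))
```

With a real embedding matrix in place of `x_standin`, the same two summaries (stable rank and energy rank) give a compact way to compare how quickly the two spectra decay; which summary the authors use, if any, is not stated in this excerpt.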

[1]
GPT-4 Technical Report. URL https://arxiv.org/abs/2303.08774, 2:6.

[2]
Kwangjun Ahn and Byron Xu. Dion: A communication-efficient optimizer for large models. arXiv preprint arXiv:2504.05295.

[3]
Kang An, Yuxing Liu, Rui Pan, Shiqian Ma, Donald Goldfarb, and Tong Zhang. Asgo: Adaptive structured gradient optimization. arXiv preprint arXiv:2503.20762.

[4]
Jeremy Bernstein and Laker Newhouse. Old optimizer, new norm: An anthology. arXiv preprint arXiv:2409.20325.

[5]
Michael Crawshaw, Chirag Modi, Mingrui Liu, and Robert M Gower. An exploration of non-euclidean gradient descent: Muon and its many variants. arXiv preprint arXiv:2510.09827.

[6]
Damek Davis and Dmitriy Drusvyatskiy. When do spectral gradient updates help in deep learning? arXiv preprint arXiv:2512.04299.

[7]
Zhaorui Dong, Yushun Zhang, Zhi-Quan Luo, Jianfeng Yao, and Ruoyu Sun. Towards quantifying the hessian structure of neural networks. arXiv preprint arXiv:2505.02809.

[8]
Guillaume Garrigos and Robert M Gower. Handbook of convergence theorems for (stochastic) gradient methods. arXiv preprint arXiv:2301.11235.

[9]
Guy Gur-Ari, Daniel A Roberts, and Ethan Dyer. Gradient descent happens in a tiny subspace. arXiv preprint arXiv:1812.04754.

[10]
Ting Jiang, Shaohan Huang, Shengyue Luo, Zihan Zhang, Haizhen Huang, Furu Wei, Weiwei Deng, Feng Sun, Qi Zhang, Deqing Wang, et al. Mora: High-rank updating for parameter-efficient fine-tuning. arXiv preprint arXiv:2405.12130, 2024.

[11]
Dmitry Kovalev. Understanding gradient orthogonalization for deep learning via non-euclidean trust-region optimization. arXiv preprint arXiv:2503.12645.

[12]
Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. 2009.

[13]
Jiaxiang Li and Mingyi Hong. A note on the convergence of muon and further. arXiv preprint arXiv:2502.02900.

[14]
Jingyuan Liu, Jianlin Su, Xingcheng Yao, Zhejun Jiang, Guokun Lai, Yulun Du, Yidao Qin, Weixin Xu, Enzhe Lu, Junjie Yan, et al. Muon is scalable for llm training. arXiv preprint arXiv:2502.16982, 2025a.
Liming Liu, Zhenghao Xu, Zixuan Zhang, Hao Kang, Zichong Li, Chen Liang, Weizhu Chen, and Tuo Zhao. Cosmos: A hybrid adaptive optimizer for memory-efficien...

[15]
Jianhao Ma, Yu Huang, Yuejie Chi, and Yuxin Chen. Preconditioning benefits of spectral orthogonalization in muon. arXiv preprint arXiv:2601.13474.

[16]
Depen Morwani, Itai Shapira, Nikhil Vyas, Eran Malach, Sham Kakade, and Lucas Janson. A new perspective on shampoo’s preconditioner. arXiv preprint arXiv:2406.17748.

[17]
Thomas Pethick, Wanyun Xie, Kimon Antonakopoulos, Zhenyu Zhu, Antonio Silveti-Falls, and Volkan Cevher. Training deep learning models with norm-constrained lmos. arXiv preprint arXiv:2502.07529.

[18]
Xun Qian, Hussein Rammal, Dmitry Kovalev, and Peter Richtarik. Muon is provably faster with momentum variance reduction. arXiv preprint arXiv:2512.16598, 2025.
