
arxiv: 2505.23737 · v2 · submitted 2025-05-29 · stat.ML · cs.IT · cs.LG · math.IT · math.OC

On the Convergence Analysis of Muon

Pith reviewed 2026-05-19 13:15 UTC · model grok-4.3

classification stat.ML · cs.IT · cs.LG · math.IT · math.OC
keywords Muon optimizer · convergence analysis · low-rank Hessian · matrix parameters · neural network training · gradient descent comparison · optimization theory

The pith

Muon can outperform gradient descent by benefiting from the low-rank structure of Hessian matrices during neural network training.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops a convergence analysis for the Muon optimizer, which is designed specifically for the matrix-shaped parameters common in neural networks rather than treating them as flat vectors. It compares Muon directly to standard gradient descent and identifies the conditions under which Muon achieves better rates. The central result is that Muon exploits low-rank structure in the Hessian, a pattern frequently seen in practice. This supplies a theoretical reason for Muon's observed empirical gains. The analysis therefore ties optimizer choice to the intrinsic geometry of the loss surface rather than treating all parameters uniformly.

Core claim

Muon achieves improved convergence rates over gradient descent precisely when the Hessian matrices of the loss exhibit low-rank structure, a property that holds under the modeling conditions examined and that matches what is widely observed when training neural networks with matrix parameters.

What carries the argument

Convergence-rate comparison between Muon and gradient descent that isolates the benefit from low-rank Hessian structure.
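
The kind of comparison described here can be illustrated with a standard one-step descent argument. This is a sketch under stated assumptions, not the paper's derivation: the constants $L_F$ and $L_{op}$ are hypothetical names for smoothness constants in the Frobenius and spectral norms, and the step size $\eta$ is chosen ideally from the current gradient, which a practical Muon implementation would not do.

```latex
% Assumptions (not taken from the source):
%   f is L_F-smooth in the Frobenius norm, and L_{op}-smooth in the spectral norm:
%   f(W') \le f(W) + \langle \nabla f(W), W' - W \rangle
%            + \tfrac{L_{op}}{2} \| W' - W \|_{op}^2 .
% Let \nabla f(W) = U \Sigma V^\top be a compact SVD.
\begin{align*}
\text{GD step } W^{+} = W - \tfrac{1}{L_F} \nabla f(W):
  \quad & f(W^{+}) \le f(W) - \frac{\|\nabla f(W)\|_F^2}{2 L_F}, \\
\text{orthogonalized step } W^{+} = W - \eta\, U V^\top:
  \quad & f(W^{+}) \le f(W) - \eta\, \|\nabla f(W)\|_* + \frac{L_{op}}{2}\, \eta^2 .
\end{align*}
% The second line uses \langle \nabla f(W), U V^\top \rangle = \operatorname{tr}(\Sigma)
% = \|\nabla f(W)\|_* and \| U V^\top \|_{op} = 1.
% Choosing \eta = \|\nabla f(W)\|_* / L_{op} gives
\begin{equation*}
f(W^{+}) \le f(W) - \frac{\|\nabla f(W)\|_*^2}{2 L_{op}},
\qquad\text{which is at least as large a decrease as GD's when}\qquad
\frac{\|\nabla f(W)\|_*^2 \, L_F}{\|\nabla f(W)\|_F^2 \, L_{op}} \ge 1 .
\end{equation*}
```

The final ratio has the same shape as the quantity reported in the Figure 4 caption, with that caption's $L_t$ and $J_t$ in the places of $L_F$ and $L_{op}$. Those two symbols are not defined on this page, so the correspondence should be treated as suggestive only.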

If this is right

  • Under the paper's modeling assumptions, Muon's convergence bound improves on gradient descent's when the Hessian is low-rank.
  • The advantage scales with the degree of rank deficiency in the Hessian.
  • Standard vector-based optimizers ignore this structural property and therefore cannot realize the same improvement.
  • The result supplies a concrete criterion for choosing Muon over gradient descent in matrix-parameterized models.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar low-rank exploitation might be engineered into other matrix-aware optimizers.
  • The analysis suggests testing Muon on tasks where Hessian rank can be explicitly controlled, such as low-rank matrix completion or factored models.
  • If low-rank Hessians are the main source of Muon's edge, then hybrid methods that detect rank deficiency on the fly could further improve performance.
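
A test bed with explicitly controlled Hessian rank, as suggested in the list above, could be set up along the following lines. This is a minimal sketch: the helper names `make_psd` and `quadratic_hessian` are hypothetical, and the quadratic $f(W) = \operatorname{tr}((W - W^*)^\top Q (W - W^*))$ is borrowed from the Figure 1 caption. For an $m \times n$ parameter $W$, the Hessian with respect to the column-stacked $\operatorname{vec}(W)$ is $2 (I_n \otimes Q)$, so its rank is $n \cdot \operatorname{rank}(Q)$.

```python
import numpy as np


def make_psd(m, rank, rng):
    """Random symmetric PSD matrix of size m x m with the requested rank."""
    A = rng.standard_normal((m, rank))
    return A @ A.T


def quadratic_hessian(Q, n):
    """Hessian of tr((W - W*)^T Q (W - W*)) w.r.t. column-stacked vec(W).

    W is m x n; the loss is a sum of w_j^T Q w_j over columns, so the Hessian
    is block diagonal with n copies of 2Q, i.e. 2 * kron(I_n, Q).
    """
    return 2.0 * np.kron(np.eye(n), Q)


rng = np.random.default_rng(0)
m, n = 8, 4  # illustrative sizes, not taken from the paper

# Map each requested rank of Q to the numerical rank of the full Hessian.
hessian_ranks = {}
for r in (1, 3, 8):
    Q = make_psd(m, r, rng)
    H = quadratic_hessian(Q, n)
    hessian_ranks[r] = int(np.linalg.matrix_rank(H))
```

Sweeping the rank of `Q` from 1 up to `m` moves the problem from a strongly low-rank Hessian to a full-rank one, which is the regime change the comparison between Muon and gradient descent would need to be tested against.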

Load-bearing premise

Hessian matrices arising in neural network training possess low-rank structure under the conditions studied.

What would settle it

A direct measurement showing that Muon and gradient descent converge at the same rate (or Muon is slower) on a problem where the Hessian has full rank and satisfies the other modeling assumptions.

Figures

Figures reproduced from arXiv: 2505.23737 by Cong Shen, Jiawei Zhang, Minhui Huang, Ruichuan Huang, Wei Shen.

Figure 1: Experiments on a quadratic function $f(W) = \operatorname{tr}\big((W - W^*)^\top Q (W - W^*)\big)$. Detailed settings can be found in Section E. … with dataset $\{(x_i, y_i)\}_{i=1}^{B}$, where $x_i \in \mathbb{R}^d$ is the feature data, $y_i \in \mathbb{R}^c$ is the corresponding $c$-dimensional one-hot vector of the label, $c$ is the number of classes and $B$ is the number of samples. We first consider a linear model $g(W; x) = Wx \in \mathbb{R}^c$, where $W \in \mathbb{R}^{c \times d}$ is the parameter, and…
Figure 2: (a, b) Spectra of $X_{\text{MNIST}}$ and $X_{\text{Gaussian}}$. (c, d, e, f) Optimizing (3) with different $X$ and $Y$. Actually, we find that the feature matrix $X$ in practical machine learning problems is typically very "low rank", or more precisely, its singular values are highly concentrated (see…
Figure 3: Comparison of $J_t$, $L_t$, $\|\nabla f(W_t)\|_*^2$ and $\|\nabla f(W_t)\|_F^2$ over the training process of GD and Muon (Algorithm 2) on an MLP model with three matrix parameters $W_1 \in \mathbb{R}^{128 \times 784}$, $W_2 \in \mathbb{R}^{64 \times 128}$, $W_3 \in \mathbb{R}^{10 \times 64}$. We show the gradients and Hessians with respect to $W_2$ in this figure. Detailed settings can be found in Appendix E. (a) Loss (b) AdamW: $J_t$ and $L_t$ (c) AdamW: $\|\nabla f(W_t)\|_*^2$ and $\|\nabla f(W_t)\|_F^2$ (d) AdamW: $\|\nabla f(W_t)\|_*^2$/…
Figure 4: Comparison of $J_t$, $L_t$, $\|\nabla f(W_t)\|_*^2$, $\|\nabla f(W_t)\|_F^2$ and the average ratio over the training process of a six-layer GPT-2 style model with AdamW and Muon. (e) The average ratio $\|\nabla f(W_t)\|_*^2 L_t / (\|\nabla f(W_t)\|_F^2 J_t)$ of all matrix parameters optimized by Muon. In fact, not only does the average ratio of all matrix parameters satisfy Equation (7), but every matrix parameter optimized by Muon also satisfies Equation…
Figure 5: Spectra of $X_{\text{CIFAR10}}$, $X_{\text{Text}}$, $X_{\text{Gaussian},1}$ and $X_{\text{Gaussian},2}$. Then, we apply GD and Muon to optimize this function, both starting from $W_0 = 0$ with 4000 iterations. We choose the optimal constant stepsize $1/L$ for GD and choose the stepsize for Muon such that Muon can converge in 4000 iterations with the best function value. For each iteration of both algorithms, we record the difference in function value from…
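
The quadratic experiment referred to in the captions above can be approximated with a small NumPy sketch. Several things here are assumptions rather than the paper's setup: the problem sizes, the random low-rank `Q`, the momentum-free "Muon-like" update that orthogonalizes the gradient with an exact SVD, and the decaying step size `eta0 / (t + 1)`. The paper's Algorithm 2 and its step-size tuning are not reproduced on this page, so no outcome is claimed for the comparison.

```python
import numpy as np


def loss(W, W_star, Q):
    """f(W) = tr((W - W*)^T Q (W - W*))."""
    D = W - W_star
    return float(np.trace(D.T @ Q @ D))


def grad(W, W_star, Q):
    """Gradient of f for symmetric Q: 2 Q (W - W*)."""
    return 2.0 * Q @ (W - W_star)


def orthogonalize(G, rel_tol=1e-10):
    """Return U V^T over the numerically nonzero singular directions of G."""
    U, s, Vt = np.linalg.svd(G, full_matrices=False)
    if s[0] == 0.0:
        return np.zeros_like(G)
    keep = s > rel_tol * s[0]
    return U[:, keep] @ Vt[keep, :]


def run_gd(W0, W_star, Q, steps):
    """Gradient descent with constant step 1/L, where L = 2 * lambda_max(Q)."""
    L = 2.0 * np.linalg.eigvalsh(Q)[-1]
    W = W0.copy()
    history = [loss(W, W_star, Q)]
    for _ in range(steps):
        W = W - grad(W, W_star, Q) / L
        history.append(loss(W, W_star, Q))
    return W, history


def run_muon_like(W0, W_star, Q, steps, eta0):
    """Orthogonalized-gradient steps without momentum (a simplification)."""
    W = W0.copy()
    history = [loss(W, W_star, Q)]
    for t in range(steps):
        eta = eta0 / (t + 1)  # illustrative decaying schedule, an assumption
        W = W - eta * orthogonalize(grad(W, W_star, Q))
        history.append(loss(W, W_star, Q))
    return W, history


rng = np.random.default_rng(0)
m, n, r = 20, 10, 3                 # illustrative sizes, not from the paper
A = rng.standard_normal((m, r))
Q = A @ A.T                         # symmetric PSD with rank r
W_star = rng.standard_normal((m, n))
W0 = np.zeros((m, n))

_, gd_hist = run_gd(W0, W_star, Q, steps=200)
_, muon_hist = run_muon_like(W0, W_star, Q, steps=200, eta0=1.0)
```

Plotting `gd_hist` against `muon_hist` on a log scale, and repeating the run while varying `r`, gives a rough way to probe how the rank of `Q` affects the two methods. Because the Muon-like step size is fixed in advance instead of tuned, the curves should not be read as a reproduction of Figure 1.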
Original abstract

The majority of parameters in neural networks are naturally represented as matrices. However, most commonly used optimizers treat these matrix parameters as flattened vectors during optimization, potentially overlooking their inherent structural properties. Recently, an optimizer called Muon has been proposed, specifically designed to optimize matrix-structured parameters. Extensive empirical evidence shows that Muon can significantly outperform traditional optimizers when training neural networks. Nonetheless, the theoretical understanding of Muon's convergence behavior and the reasons behind its superior performance remain limited. In this work, we present a comprehensive convergence rate analysis of Muon and its comparison with Gradient Descent (GD). We characterize the conditions under which Muon can outperform GD. Our theoretical results reveal that Muon can benefit from the low-rank structure of Hessian matrices, a phenomenon widely observed in practical neural network training. Our experimental results support and corroborate the theoretical findings.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript analyzes the convergence of the Muon optimizer, which operates directly on matrix-structured parameters rather than vectorized versions, and compares its rates to those of gradient descent (GD). The central theoretical claim is that Muon can exploit low-rank structure in the Hessian (a phenomenon described as widely observed in neural network training) to achieve faster convergence under certain conditions, while GD does not; this is supported by derived rates and corroborated by experiments.

Significance. If the derivations are correct, the work supplies a concrete theoretical explanation for Muon's observed empirical gains by linking them to low-rank Hessians rather than generic matrix-aware updates. This is a useful contribution to the growing literature on structure-exploiting optimizers, especially since the paper ships explicit rate comparisons and identifies the rank-dependent regime. The result would be more impactful if the low-rank benefit were shown to arise from the dynamics rather than inserted as a modeling assumption.

major comments (2)
  1. [§4] §4 (Convergence Analysis): The claimed improvement in Muon's contraction factor when the Hessian has rank r ≪ d is obtained by restricting the update to the dominant subspace; however, the analysis does not derive that the trajectory preserves or exploits this low-rank structure from the optimization dynamics. It is introduced as an external modeling choice (see Assumption 3.2), so the link between the low-rank premise and the outperformance condition remains an assumption rather than a derived property.
  2. [Theorem 4.3] Theorem 4.3 and Corollary 4.4: The comparison to GD shows a rank-dependent gap only after substituting the low-rank Hessian model into Muon's step-size choice; without a separate bound showing that GD cannot similarly benefit from the same low-rank information (or that Muon's matrix update is the only mechanism that captures it), the claim that Muon 'benefits from the low-rank structure' while GD does not is not fully secured.
minor comments (2)
  1. [§3] Notation for the matrix norm and the projection onto the rank-r subspace is introduced in §3 but used without re-statement in the proofs of §4; adding a short reminder would improve readability.
  2. [§5] The experimental section reports wall-clock speedups but does not include the condition number or effective rank of the Hessians on the tested models; adding these diagnostics would directly connect the plots to the low-rank regime analyzed in the theory.
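
The diagnostics requested in the second minor comment could be computed as in the sketch below for a Hessian small enough to form explicitly. The function name `hessian_diagnostics`, the tolerance, and the choice of stable rank ($\|H\|_F^2 / \|H\|_2^2$) as the effective-rank measure are assumptions made for illustration; the referee report does not specify a definition, and Hessians of real networks would need matrix-free estimators instead.

```python
import numpy as np


def hessian_diagnostics(H, rel_tol=1e-8):
    """Spectrum-based diagnostics for a symmetric PSD matrix H (nonzero)."""
    eig = np.linalg.eigvalsh(H)[::-1]      # eigenvalues, largest first
    eig = np.clip(eig, 0.0, None)          # remove tiny negative round-off
    top = eig[0]
    if top == 0.0:
        raise ValueError("H has no positive eigenvalue")
    nonzero = eig[eig > rel_tol * top]     # numerically nonzero part
    return {
        "numerical_rank": int(nonzero.size),
        # Stable rank: ||H||_F^2 / ||H||_2^2, always between 1 and rank(H).
        "stable_rank": float(np.sum(eig ** 2) / top ** 2),
        # Condition number restricted to the nonzero eigenvalues.
        "effective_condition_number": float(top / nonzero[-1]),
    }


rng = np.random.default_rng(0)
A = rng.standard_normal((50, 5))
H = A @ A.T                                # 50 x 50 PSD matrix of rank 5
diag = hessian_diagnostics(H)
```

Reporting these three numbers next to each convergence plot would indicate whether a given experiment sits in the low-rank regime that the theory addresses.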

Simulated Author's Rebuttal

2 responses · 0 unresolved

We are grateful to the referee for the careful reading of our manuscript and the insightful comments. We have prepared point-by-point responses to the major comments and will incorporate revisions as detailed below to improve the clarity and strength of our theoretical claims.

Point-by-point responses
  1. Referee: [§4] §4 (Convergence Analysis): The claimed improvement in Muon's contraction factor when the Hessian has rank r ≪ d is obtained by restricting the update to the dominant subspace; however, the analysis does not derive that the trajectory preserves or exploits this low-rank structure from the optimization dynamics. It is introduced as an external modeling choice (see Assumption 3.2), so the link between the low-rank premise and the outperformance condition remains an assumption rather than a derived property.

    Authors: We agree that Assumption 3.2 introduces the low-rank Hessian structure as a modeling assumption rather than deriving its preservation from the optimization dynamics. This assumption is motivated by the extensive empirical literature on low-rank Hessians in neural network training, which we reference in the introduction. The contribution of Section 4 is to show the resulting convergence benefit for Muon under this condition. In the revision we will add explicit language in Section 3 and the concluding discussion to clarify the role of the assumption and to identify the derivation of low-rank structure from the dynamics as an open direction for future work. revision: yes

  2. Referee: [Theorem 4.3] Theorem 4.3 and Corollary 4.4: The comparison to GD shows a rank-dependent gap only after substituting the low-rank Hessian model into Muon's step-size choice; without a separate bound showing that GD cannot similarly benefit from the same low-rank information (or that Muon's matrix update is the only mechanism that captures it), the claim that Muon 'benefits from the low-rank structure' while GD does not is not fully secured.

    Authors: The rate comparison in Theorem 4.3 and Corollary 4.4 is obtained by inserting the low-rank model into the respective convergence bounds, with Muon's matrix-parameter update permitting a step-size choice that operates directly on the dominant subspace. Standard GD is formulated on the vectorized parameters and therefore does not exploit the matrix structure in the same manner. We acknowledge that the manuscript does not supply an auxiliary bound ruling out all possible adaptations of GD. In the revision we will insert a clarifying paragraph after Theorem 4.3 that emphasizes the distinction arising from Muon's matrix-aware mechanism and notes that any comparable benefit for GD would require additional structural assumptions not present in the standard algorithm. revision: yes

Circularity Check

0 steps flagged

No circularity: analysis derives rates from explicit low-rank assumption without self-reduction

Full rationale

The paper states its central result as a convergence comparison between Muon and GD that holds under the modeling premise of low-rank Hessian structure (an external empirical observation, not derived inside the paper). No quoted step equates a claimed prediction or rate to a fitted quantity, self-citation chain, or definitional tautology; the low-rank condition is inserted as an assumption into the contraction analysis rather than being smuggled in or forced by the optimizer definition itself. The derivation chain therefore remains self-contained against the stated assumptions and does not reduce to its inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the modeling assumption that Hessians possess low-rank structure; no free parameters or invented entities are mentioned in the abstract.

axioms (1)
  • Domain assumption: Hessian matrices arising in neural network training possess low-rank structure.
    Invoked to explain Muon's advantage over GD; stated as widely observed in practice.




Forward citations

Cited by 26 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. When and Why SignSGD Outperforms SGD: A Theoretical Study Based on $\ell_1$-norm Lower Bounds

    cs.LG 2026-05 unverdicted novelty 8.0

    SignSGD provably beats SGD by a factor of d under sparse noise via matched ℓ1-norm upper and lower bounds, with an equivalent result for Muon on matrices, and this predicts faster GPT-2 pretraining.

  2. AMUSE: Anytime Muon with Stable Gradient Evaluation

    cs.LG 2026-05 unverdicted novelty 7.0

    AMUSE is a new optimizer integrating Muon orthogonalization with Schedule-Free averaging via adaptive interpolation for schedule-free anytime training that improves Pareto frontiers on vision and LLM tasks.

  3. Symmetry-Compatible Principle for Optimizer Design: Embeddings, LM Heads, SwiGLU MLPs, and MoE Routers

    math.OC 2026-05 conditional novelty 7.0

    Proposes equivariant optimizers matched to the symmetry groups of embeddings, SwiGLU projections and MoE routers, with experiments showing consistent gains over AdamW on language model pre-training.

  4. DP-Muon: Differentially Private Optimization via Matrix-Orthogonalized Momentum

    cs.LG 2026-05 unverdicted novelty 7.0

    DP-Muon adapts matrix-orthogonalized momentum optimization to differential privacy via per-matrix clipping and noise addition, with proofs of inherited privacy and optimization guarantees plus a bias-corrected version...

  5. Gradient Clipping Beyond Vector Norms: A Spectral Approach for Matrix-Valued Parameters

    cs.LG 2026-05 unverdicted novelty 7.0

    Spectral clipping of leading singular values in gradient matrices stabilizes SGD for non-convex problems with heavy-tailed noise and achieves the optimal convergence rate O(K^{(2-2α)/(3α-2)}).

  6. Phases of Muon: When Muon Eclipses SignSGD

    math.OC 2026-05 unverdicted novelty 7.0

    On power-law covariance least squares problems, SignSVD (Muon) and SignSGD (Adam proxy) show three phases of relative performance depending on data exponent α and target exponent β.

  7. Intrinsic Muon: Spectral Optimization on Riemannian Matrix Manifolds

    cs.LG 2026-05 unverdicted novelty 7.0

    Intrinsic Muon provides closed-form linear maximization oracles on multiple Riemannian matrix manifolds for unitarily invariant norms, with convergence rates depending only on manifold dimension or rank.

  8. Muon with Nesterov Momentum: Heavy-Tailed Noise and (Randomized) Inexact Polar Decomposition

    math.OC 2026-05 unverdicted novelty 7.0

    Muon with Nesterov momentum and inexact polar decomposition achieves optimal convergence rates of O(ε^(-(3α-2)/(α-1))) under heavy-tailed noise for ε-stationary points in non-convex settings.

  9. Convergence Rate Analysis of SOAP with Arbitrary Orthogonal Projection Matrices

    math.OC 2026-04 unverdicted novelty 7.0

    SOAP and its generalizations with arbitrary orthogonal projections converge at a provable rate when the projections are conditionally independent of the current gradient.

  10. Sharp Capacity Scaling of Spectral Optimizers in Learning Associative Memory

    cs.LG 2026-03 unverdicted novelty 7.0

    Muon achieves higher storage capacity than SGD and matches Newton's method in one-step recovery rates for associative memory under power-law distributions, while saturating at larger critical batch sizes and showing f...

  11. On the Convergence of Muon and Beyond

    cs.LG 2025-09 unverdicted novelty 7.0

    Muon-MVR2 attains the optimal anytime convergence rate of ~O(T^{-1/3}) in stochastic non-convex settings under horizon-free schedules.

  12. LionMuon: Alternating Spectral and Sign Descent for Efficient Training

    cs.LG 2026-05 unverdicted novelty 6.0

    LionMuon alternates Lion sign steps and Muon spectral steps with shared dual-EMA momentum to match Lion memory while outperforming both at P=2 on 124M-720M models, backed by heavy-tailed complexity bounds that predict...

  13. Muon Does Not Converge on Convex Lipschitz Functions

    cs.LG 2026-05 unverdicted novelty 6.0

    Muon does not converge on convex Lipschitz functions regardless of learning rate, while error feedback restores theoretical convergence but degrades performance on CIFAR-10 and nanoGPT tasks.

  14. Muon-OGD: Muon-based Spectral Orthogonal Gradient Projection for LLM Continual Learning

    cs.LG 2026-05 unverdicted novelty 6.0

    Muon-OGD introduces a spectral-norm constrained orthogonal projection method solved via dual iterations and Newton-Schulz approximations to improve stability-plasticity trade-off in sequential LLM adaptation.

  15. ZAYA1-8B Technical Report

    cs.AI 2026-05 unverdicted novelty 6.0

    ZAYA1-8B is a reasoning MoE model with 700M active parameters that matches larger models on math and coding benchmarks and reaches 91.9% on AIME'25 via Markovian RSA test-time compute.

  16. SignMuon: Communication-Efficient Distributed Muon Optimization

    cs.LG 2026-05 unverdicted novelty 6.0

    SignMuon merges majority-vote sign aggregation from signSGD with Muon's polar-factor steps to create a communication-efficient distributed optimizer that matches signSGD rates under symmetric noise and shows strong em...

  17. SUDA-Muon: Structural Design Principles and Boundaries for Fully Decentralized Muon

    math.OC 2026-04 unverdicted novelty 6.0

    SUDA-Muon modularizes decentralized Muon via the SUDA template, proving a topology-separated convergence rate of O((1+σ/√N)K^{-1/4}) in nuclear-norm geometry while establishing that tracking-before-polarization is req...

  18. MuonEq: Balancing Before Orthogonalization with Lightweight Equilibration

    cs.LG 2026-03 unverdicted novelty 6.0

    MuonEq introduces pre-orthogonalization equilibration schemes that improve Muon optimizer performance during large language model pretraining.

  19. Anytime Training with Schedule-Free Spectral Optimization

    cs.LG 2026-05 unverdicted novelty 5.0

    SF-NorMuon is a new schedule-free spectral optimizer that closes the gap with tuned AdamW on 125M-772M parameter models across 1-8x Chinchilla horizons while providing stationarity guarantees.

  20. MiMuon: Mixed Muon Optimizer with Improved Generalization for Large Models

    cs.LG 2026-05 unverdicted novelty 5.0

    MiMuon is a hybrid optimizer that achieves a generalization error bound of O(1/N) independent of the small singular-value gap that limits the original Muon bound, while retaining the same O(1/T^{1/4}) convergence rate.

  21. Position: Zeroth-Order Optimization in Deep Learning Is Underexplored, Not Underpowered

    cs.LG 2026-05 unverdicted novelty 5.0

    Zeroth-order optimization is underexplored rather than underpowered in deep learning, with limitations stemming from full-space designs that can be addressed via subspace, spectral, and systems-aware approaches.

  22. Muon-OGD: Muon-based Spectral Orthogonal Gradient Projection for LLM Continual Learning

    cs.LG 2026-05 unverdicted novelty 5.0

    Muon-OGD integrates Muon-style spectral-norm geometry with orthogonal gradient constraints to improve the stability-plasticity trade-off during sequential LLM adaptation.

  23. Communication-Efficient Gluon in Federated Learning

    cs.LG 2026-04 unverdicted novelty 5.0

    Compressed Gluon variants using unbiased/contraction compressors and SARAH-style variance reduction achieve convergence guarantees and lower communication costs in federated learning under layer-wise smoothness.

  24. RMNP: Row-Momentum Normalized Preconditioning for Scalable Matrix-Based Optimization

    cs.LG 2026-03 conditional novelty 5.0

    RMNP preconditions matrix updates via row-wise L2 normalization instead of Newton-Schulz iteration, reducing complexity to O(mn) while matching Muon's non-convex convergence rate and empirical performance.

  25. HTMuon: Improving Muon via Heavy-Tailed Spectral Correction

    cs.LG 2026-03 unverdicted novelty 5.0

    HTMuon modifies Muon to produce heavier-tailed updates and weight spectra via HT-SR theory, yielding up to 0.98 lower perplexity on LLaMA pretraining and serving as a plug-in for other Muon variants.

  26. Low-rank Orthogonalization for Large-scale Matrix Optimization with Applications to Foundation Model Training

    cs.LG 2025-09 unverdicted novelty 5.0

    Proposes low-rank orthogonalization and derives low-rank Muon and MSGD variants that outperform standard Muon on GPT-2 and LLaMA pretraining while providing iteration complexity bounds.
