MiMuon: Mixed Muon Optimizer with Improved Generalization for Large Models

Feihu Huang; Songcan Chen; Yuning Luo

REVIEW · 2 major objections · 1 minor · cited by 1

MiMuon achieves a generalization error of O(1/N) for matrix parameters by mixing orthogonalization with momentum SGD, improving on Muon's O(1/(N κ^T)) bound.

Reviewed by Pith at T0; open to challenge. T0 means a machine referee read the full paper against a public rubric.

T0 review · grok-4.3

2026-05-20 08:05 UTC pith:NYJPOMCY

load-bearing objection: MiMuon gives the first generalization bound for Muon and a hybrid version with an O(1/N) claim, but the practical win rests on an unmeasured assumption about the singular value gap κ.

arxiv 2605.19619 v1 · pith:NYJPOMCY · submitted 2026-05-19 · cs.LG cs.AI math.OC stat.ML

classification cs.LG cs.AI math.OC stat.ML

keywords Muon optimizer · MiMuon · generalization error · algorithmic stability · matrix parameters · large language models · momentum SGD · orthogonalization

verification ladder T0 review T1 audit T2 compute T3 formal T4 reserved

The pith

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes generalization bounds for the Muon optimizer and then introduces MiMuon to improve them. It shows through algorithmic stability and induction that Muon has a generalization error scaling as O(1/(N κ^T)), where κ is the minimum gap between singular values of the gradient estimate. MiMuon carefully combines Muon's orthogonalization step with momentum-based SGD updates to remove the dependence on κ and obtain the tighter bound O(1/N). The paper further proves that this mixing preserves the same convergence rate of O(1/T^{1/4}) as the original Muon. These results matter for training large models with matrix-structured weights because a tighter generalization bound implies smaller expected error on unseen data for a given training set size.

Core claim

The central claim is that the MiMuon optimizer, formed by cautiously applying orthogonalization to the gradient before a momentum update, has a generalization error of O(1/N) derived from algorithmic stability, which is strictly lower than the O(1/(N κ^T)) bound proved for the pure Muon optimizer when κ is small. The paper also shows that MiMuon retains the convergence rate O(1/T^{1/4}) of Muon. Experiments on models such as Qwen3-0.6B and YOLO26m illustrate the practical benefits of this mixed approach for matrix parameters.
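
To make the comparison concrete, the two bounds can be set side by side. The constants C_1 and C_2 below are placeholders assumed for illustration, since this page reports only the orders of the bounds; read it as an aid, not as the paper's statement.

```latex
% C_1, C_2 > 0 are hypothetical constants hidden by the O(.) notation.
\[
  \varepsilon_{\mathrm{Muon}} \le \frac{C_1}{N\,\kappa^{T}},
  \qquad
  \varepsilon_{\mathrm{MiMuon}} \le \frac{C_2}{N},
  \qquad
  \frac{C_1/(N\kappa^{T})}{C_2/N} = \frac{C_1}{C_2}\,\kappa^{-T}.
\]
```

For 0 < κ < 1 the ratio grows exponentially in T, which is the regime the claim relies on. For κ ≥ 1 the ratio does not grow with T, so the Muon bound would no longer be the looser one in order.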

What carries the argument

MiMuon, a hybrid optimizer that applies orthogonalization to the gradient estimate only in a controlled, mixed fashion together with momentum SGD updates.
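
A minimal sketch of what such a mixed update could look like, under stated assumptions. This page does not reproduce the paper's algorithm, so the switching rule used here (orthogonalize with probability p, otherwise take a plain momentum step) follows the description given in the simulated rebuttal further down. The names `orthogonalize`, `mixed_step`, `lr`, `beta` and `p` are hypothetical, and an exact SVD stands in for whatever orthogonalization routine the paper actually uses.

```python
import numpy as np


def orthogonalize(M):
    """Return U @ Vt from the thin SVD of M.

    For a full-rank M this replaces every singular value by 1.
    """
    U, _, Vt = np.linalg.svd(M, full_matrices=False)
    return U @ Vt


def mixed_step(W, M, G, lr=0.02, beta=0.9, p=0.5, rng=None):
    """One hypothetical mixed update for a matrix parameter W.

    ASSUMPTION: the branch is chosen at random with probability p; the
    paper's actual mixing rule may differ.
    Returns the new weights, the new momentum and the branch that was taken.
    """
    rng = np.random.default_rng() if rng is None else rng
    M_new = beta * M + (1.0 - beta) * G          # momentum (moving average of gradients)
    use_orth = bool(rng.random() < p)            # True: Muon-style branch
    direction = orthogonalize(M_new) if use_orth else M_new
    return W - lr * direction, M_new, use_orth


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    W = rng.standard_normal((4, 3))
    M = np.zeros_like(W)
    for step in range(3):
        G = rng.standard_normal(W.shape)         # stand-in for a stochastic gradient
        W, M, used_orth = mixed_step(W, M, G, rng=rng)
        print(step, "orthogonalized" if used_orth else "momentum SGD")
```

Setting `p=1.0` recovers a Muon-style step on every iteration and `p=0.0` recovers plain momentum SGD, which makes the two extremes easy to compare in an experiment.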

Load-bearing premise

That the minimum singular-value gap κ of the gradient estimate is generally very small, rendering the Muon generalization bound practically loose.

What would settle it

Compute the empirical value of κ from gradient singular values across iterations on a matrix-parameter model and check whether it remains small enough that 1/κ^T grows faster than any constant factor as T increases.
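
The check described above could be sketched as follows. It assumes κ is the smallest gap between consecutive singular values, minimized over the recorded steps, which is the reading given in the simulated rebuttal; the paper's formal definition is not shown on this page. The function names are hypothetical, and the random matrices in the demo are stand-ins for real gradient estimates.

```python
import math

import numpy as np


def min_singular_gap(G):
    """Smallest gap between consecutive singular values of a 2-D array G."""
    s = np.linalg.svd(G, compute_uv=False)       # singular values, descending
    if s.size < 2:
        raise ValueError("need at least two singular values to form a gap")
    return float(np.min(s[:-1] - s[1:]))


def kappa_over_run(grad_estimates):
    """Minimum of the per-step gaps over an iterable of gradient estimates."""
    return min(min_singular_gap(G) for G in grad_estimates)


def log10_inverse_kappa_power(kappa, T):
    """log10 of 1 / kappa**T, computed in log space to avoid overflow."""
    if kappa <= 0:
        raise ValueError("kappa must be positive")
    return -T * math.log10(kappa)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    grads = [rng.standard_normal((8, 6)) for _ in range(5)]  # stand-in data only
    kappa = kappa_over_run(grads)
    for T in (10, 100, 1000):
        print(T, kappa, log10_inverse_kappa_power(kappa, T))
```

Working in log space matters here: for any κ below 1 the factor 1/κ^T overflows double precision long before T reaches realistic training lengths.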

If this is right

MiMuon trains matrix-parameter models such as those in large language models with a generalization bound independent of the singular-value gap κ.
The optimizer reaches the same convergence rate O(1/T^{1/4}) as Muon, so the number of iterations needed for a given accuracy does not increase in order.
The improved bound applies directly to models whose parameters appear as matrices, including attention weights and convolutional filters.
Numerical results on Qwen3-0.6B and YOLO26m confirm that the mixed updates remain efficient in practice.
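
As a small worked example of what a shared O(1/T^{1/4}) rate means in iterations: if the optimization error is at most C / T^{1/4}, reaching accuracy eps takes T of order (C / eps)^4. The constant `C` and the function name below are assumptions for illustration; the page does not give the constants for either optimizer.

```python
import math


def iterations_for_accuracy(eps, C=1.0):
    """Smallest integer T with C / T**0.25 <= eps, up to floating-point rounding.

    C is a hypothetical constant hidden by the O(.) notation.
    """
    if eps <= 0 or C <= 0:
        raise ValueError("eps and C must be positive")
    return math.ceil((C / eps) ** 4)


if __name__ == "__main__":
    for eps in (0.5, 0.25, 0.125):
        print(eps, iterations_for_accuracy(eps))
```

Halving the target accuracy multiplies the iteration count by 16 under this rate, and that factor is the same for both optimizers. Per-iteration cost is a separate question that the rate does not address.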

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar controlled mixing of orthogonalization steps with momentum could be tested on other matrix-aware optimizers to tighten their stability bounds.
Empirical plots of generalization gap against training set size N could directly verify whether MiMuon's error scales closer to 1/N than Muon's does.
The approach highlights a trade-off in which selective use of expensive orthogonalization steps can improve statistical properties without sacrificing convergence speed.
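
The scaling check suggested in the second item can be sketched as a log-log fit. This assumes generalization gaps (for example, test loss minus training loss) have already been measured at several training set sizes N; the function name and the synthetic numbers are illustrative only, not results from the paper.

```python
import math


def loglog_slope(ns, gaps):
    """Least-squares slope of log(gap) against log(N).

    A slope near -1 is consistent with a gap that scales like 1/N.
    All sizes and gaps must be positive.
    """
    if len(ns) != len(gaps) or len(ns) < 2:
        raise ValueError("need at least two (N, gap) pairs of equal length")
    xs = [math.log(n) for n in ns]
    ys = [math.log(g) for g in gaps]
    x_mean = sum(xs) / len(xs)
    y_mean = sum(ys) / len(ys)
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    return sxy / sxx


if __name__ == "__main__":
    ns = [1_000, 2_000, 4_000, 8_000]
    synthetic_gaps = [3.0 / n for n in ns]       # synthetic data, exactly c / N
    print(loglog_slope(ns, synthetic_gaps))
```

Fitting the slope separately for Muon and MiMuon runs at matched N would give a direct, if noisy, comparison of how each gap shrinks with data.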

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit.

Desk Editor's Note

MiMuon gives the first generalization bound for Muon and a hybrid version with an O(1/N) claim, but the practical win rests on an unmeasured assumption about the singular value gap κ.

The paper supplies the first generalization analysis for the Muon optimizer, using algorithmic stability and induction, then builds MiMuon as a cautious hybrid with momentum SGD that removes the κ dependence and reaches an O(1/N) bound while keeping Muon's O(1/T^{1/4}) convergence rate. The hybrid construction is simple and keeps the matrix-friendly orthogonalization that Muon uses for large models. The runs on Qwen3-0.6B and YOLO26m show training curves and final accuracy that are comparable or slightly better, which is useful to see at that scale.

Referee Report

2 major / 1 minor

Summary. The manuscript claims to establish generalization bounds for the Muon optimizer using algorithmic stability and mathematical induction, deriving a generalization error of O(1/(N κ^T)), where κ is the minimum difference between singular values of the gradient estimate. It introduces the MiMuon optimizer as a hybrid of Muon and momentum SGD to achieve an improved bound of O(1/N), while preserving the convergence rate of O(1/T^{1/4}). Numerical experiments on large models like Qwen3-0.6B and YOLO26m are used to illustrate the efficiency of MiMuon.

Significance. If the theoretical claims are rigorously established with explicit derivations and the key assumption on κ is empirically validated through measurements, the work could provide useful theoretical grounding for hybrid matrix optimizers in large models. The idea of cautiously mixing orthogonalization steps to remove κ dependence is a reasonable direction for improving generalization bounds.

major comments (2)

Abstract: The claim that MiMuon has generalization error O(1/N) 'since κ generally is very small' invokes an empirical observation to justify superiority over the Muon bound O(1/(N κ^T)). No quantitative lower bound on κ, no formal statement of how κ is computed from the gradient estimate, and no measurements of singular-value gaps on the Qwen3-0.6B or YOLO26m training runs are supplied, rendering the asserted practical improvement dependent on an unverified premise rather than on the proofs.
Abstract: The manuscript asserts that both the Muon and MiMuon generalization bounds, as well as the shared O(1/T^{1/4}) convergence rate, are proved via algorithmic stability and induction. However, no derivation steps, precise stability assumptions (e.g., Lipschitz constants or boundedness conditions on the orthogonalized updates), or verification that the hybrid MiMuon step preserves the induction hypothesis are provided. This absence is load-bearing for the central theoretical contribution.

minor comments (1)

The definition of κ as the 'minimum difference between singular values of gradient estimate' should be stated formally with an equation in the main text or appendix to avoid ambiguity in the bound statements.
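
One way the requested equation could read is sketched below. It assumes the interpretation given in the simulated rebuttal (smallest gap between consecutive singular values, minimized over iterations). The symbols M_t for the gradient estimate at step t and r for the number of singular values are notation chosen here, not taken from the paper.

```latex
% Hypothetical formalization; M_t \in \mathbb{R}^{m \times n}, r = \min(m, n).
\[
  \kappa \;=\; \min_{0 \le t \le T}\;\; \min_{1 \le i < r}\;
  \bigl(\sigma_i(M_t) - \sigma_{i+1}(M_t)\bigr),
  \qquad
  \sigma_1(M_t) \ge \sigma_2(M_t) \ge \dots \ge \sigma_r(M_t) \ge 0 .
\]
```

Under this reading, κ > 0 holds only if every M_t has pairwise distinct singular values, which is itself an assumption worth stating next to the bound. Whether the gap between the smallest singular value and zero should also be included is left open here.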

Simulated Author's Rebuttal

2 responses · 0 unresolved

Thank you for the constructive review. We appreciate the emphasis on empirical validation of assumptions and clarity of proofs. We address each major comment below and have made revisions to strengthen the manuscript.

Referee: Abstract: The claim that MiMuon has generalization error O(1/N) 'since κ generally is very small' invokes an empirical observation to justify superiority over the Muon bound O(1/(N κ^T)). No quantitative lower bound on κ, no formal statement of how κ is computed from the gradient estimate, and no measurements of singular-value gaps on the Qwen3-0.6B or YOLO26m training runs are supplied, rendering the asserted practical improvement dependent on an unverified premise rather than on the proofs.

Authors: We agree that the original phrasing was informal and that empirical support strengthens the claim. In the revision, we formally define κ as the minimum over iterations t of the smallest gap between consecutive singular values of the gradient estimate matrix at step t. We have added new figures and tables reporting measured singular-value gaps from the Qwen3-0.6B and YOLO26m runs, which show κ typically lies between 10^{-4} and 10^{-2}. While a model-independent quantitative lower bound on κ is not derived (as it would require strong assumptions on data and architecture), the provided measurements directly support the practical improvement asserted for MiMuon. The abstract and a new subsection have been updated accordingly. revision: yes
Referee: Abstract: The manuscript asserts that both the Muon and MiMuon generalization bounds, as well as the shared O(1/T^{1/4}) convergence rate, are proved via algorithmic stability and induction. However, no derivation steps, precise stability assumptions (e.g., Lipschitz constants or boundedness conditions on the orthogonalized updates), or verification that the hybrid MiMuon step preserves the induction hypothesis are provided. This absence is load-bearing for the central theoretical contribution.

Authors: The full proofs using algorithmic stability and induction are contained in Sections 3 (Muon generalization), 4 (MiMuon generalization), and 5 (convergence). The loss is assumed L-Lipschitz and the orthogonalized updates are bounded in operator norm by a constant B; these are stated at the beginning of Section 3. The induction tracks the stability parameter across iterations and produces the κ^T factor for Muon. For MiMuon the hybrid step (orthogonalization with probability p, momentum SGD otherwise) is shown to preserve the induction hypothesis by separately bounding the stability contribution of each branch and taking a convex combination. To address the concern about accessibility, we have added a concise proof sketch to the abstract and expanded the statement of assumptions plus the induction verification paragraph in Section 4. revision: partial

Circularity Check

0 steps flagged

No significant circularity detected in the derivation chain.

The paper derives Muon generalization error O(1/(N κ^T)) via algorithmic stability and induction on orthogonalized steps, then constructs MiMuon as a hybrid that yields an independent O(1/N) bound without κ dependence. These are formal mathematical results whose steps do not reduce to each other by construction, nor rely on self-citation chains or fitted inputs renamed as predictions. The phrase 'since κ generally is very small' appears only as motivational context for practical relevance and is not part of the proof structure or any equation. No load-bearing premise collapses into a prior result by the same authors or an ansatz smuggled via citation. The theoretical claims remain self-contained against the stated assumptions.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claims rest on standard algorithmic-stability assumptions for generalization bounds and on the empirical observation that the singular-value gap κ is small; no new entities are postulated and no free parameters are fitted inside the paper itself.

axioms (1)

domain assumption Algorithmic stability of the optimizer iterates can be bounded via mathematical induction on the singular-value gap of the gradient estimate
Invoked to derive the O(1/(N κ^T)) generalization error for Muon and the improved O(1/N) bound for MiMuon.
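
For readers less familiar with the stability route, the standard link between uniform stability and generalization (Hardt, Recht and Singer, 2016) is recalled below. Whether the paper uses exactly this notion of stability is an assumption; the notation f for the loss, R for the population risk and R_S for the empirical risk on a sample S of size N is chosen here for illustration.

```latex
% A is a randomized learning algorithm; S and S' are samples of size N.
\[
  \sup_{z}\; \mathbb{E}_{A}\bigl[f(A(S); z) - f(A(S'); z)\bigr] \le \varepsilon
  \quad \text{for all } S, S' \text{ differing in one example}
  \;\;\Longrightarrow\;\;
  \Bigl|\,\mathbb{E}_{S,A}\bigl[R(A(S)) - R_S(A(S))\bigr]\Bigr| \le \varepsilon .
\]
```

A generalization bound of order 1/N then follows once the stability parameter ε is itself shown to be of order 1/N, which is where the induction over iterations enters.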

pith-pipeline@v0.9.0 · 5852 in / 1451 out tokens · 84100 ms · 2026-05-20T08:05:33.154090+00:00 · methodology

Original abstract

Matrix-structured parameters frequently appear in many artificial intelligence models such as large language models. More recently, an efficient Muon optimizer was designed for matrix parameters of large-scale models, and shows markedly faster convergence than the vector-wise algorithms. Although some works have begun to study convergence properties (i.e., optimization error) of the Muon optimizer, its generalization properties (i.e., generalization error) are still not established. Thus, in this paper, we study generalization error of the Muon optimizer based on algorithmic stability and mathematical induction, and prove that the Muon has a generalization error of $O\big(\frac{1}{N\kappa^{T}}\big)$, where $N$ is training sample size, $T$ denotes iteration number, and $\kappa>0$ denotes minimum difference between singular values of gradient estimate. To enhance generalization of the Muon, we propose an effective mixed Muon (MiMuon) optimizer by cautiously using orthogonalization of gradient, which is a hybrid of Muon and momentum-based SGD optimizers. Then we prove that our MiMuon optimizer has a lower generalization error of $O\big(\frac{1}{N}\big)$ than $O\big(\frac{1}{N\kappa^{T}}\big)$ of Muon optimizer, since $\kappa$ generally is very small. Meanwhile, we also study the convergence properties of our MiMuon algorithm, and prove that our MiMuon algorithm has the same convergence rate of $O(\frac{1}{T^{1/4}})$ as the Muon algorithm. Some numerical experimental results on training large models including Qwen3-0.6B and YOLO26m demonstrate the efficiency of the MiMuon optimizer.

Figures

Figures reproduced from arXiv: 2605.19619 by Feihu Huang, Songcan Chen, Yuning Luo.

**Figure 2.** Optimization behavior on Qwen3-0.6B.

**Figure 3.** Training loss components on YOLO26m. (a) mAP50. (b) mAP50:95

**Figure 4.** Validation accuracy (mAP50 and mAP50-95) on

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: prove that the Muon has a generalization error of O(1/(N κ^T)) ... since κ generally is very small

IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · absolute_floor_iff_bare_distinguishability
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: MiMuon ... hybrid of Muon and momentum-based SGD ... O(1/N)

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

When Does Muon Help Agentic Reinforcement Learning?
cs.LG 2026-07 conditional novelty 6.0

Under a shared KL/clipping recipe in agentic RL, fan-in Muon at 3e-5 delivers a larger stable update and improves late success over a fixed AdamW 1e-6 baseline, with the effect tied to update magnitude rather than spe...


[1] [1]

SIAM, 2017

Amir Beck.First-order methods in optimization. SIAM, 2017

work page 2017

[2] [2]

Old Optimizer, New Norm: An Anthology

JeremyBernsteinandLakerNewhouse. Oldoptimizer,newnorm: Ananthology.arXivpreprintarXiv:2409.20325, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[3] [3]

Optimization methods for large-scale machine learning.SIAM review, 60(2):223–311, 2018

Léon Bottou, Frank E Curtis, and Jorge Nocedal. Optimization methods for large-scale machine learning.SIAM review, 60(2):223–311, 2018

work page 2018

[4] [4]

On the Convergence of Muon and Beyond

Da Chang, Yongxiang Liu, and Ganzhao Yuan. On the convergence of muon and beyond.arXiv preprint arXiv:2509.15816, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[5] [5]

Symbolic discovery of optimization algorithms

Xiangning Chen, Chen Liang, Da Huang, Esteban Real, Kaiyuan Wang, Hieu Pham, Xuanyi Dong, Thang Luong, Cho-Jui Hsieh, Yifeng Lu, et al. Symbolic discovery of optimization algorithms. InProceedings of the 37th International Conference on Neural Information Processing Systems, pages 49205–49233, 2023

work page 2023

[6] [6]

To Use or not to Use Muon: How Simplicity Bias in Optimizers Matters

Sara Dragutinović and Rajesh Ranganath. To use or not to use muon: How simplicity bias in optimizers matters. arXiv preprint arXiv:2603.00742, 2026

work page internal anchor Pith review arXiv 2026

[7] [7]

Combining axes precondi- tioners through kronecker approximation for deep learning

Sai Surya Duvvuri, Fnu Devvrit, Rohan Anil, Cho-Jui Hsieh, and Inderjit S Dhillon. Combining axes precondi- tioners through kronecker approximation for deep learning. InThe Twelfth International Conference on Learning Representations, 2024

work page 2024

[8] [8]

Shampoo: Preconditioned stochastic tensor optimization

Vineet Gupta, Tomer Koren, and Yoram Singer. Shampoo: Preconditioned stochastic tensor optimization. In International Conference on Machine Learning, pages 1842–1850. PMLR, 2018

work page 2018

[9] [9]

Trainfaster,generalizebetter: Stabilityofstochasticgradientdescent

MoritzHardt,BenRecht,andYoramSinger. Trainfaster,generalizebetter: Stabilityofstochasticgradientdescent. InInternational conference on machine learning, pages 1225–1234. PMLR, 2016

work page 2016

[10] [10]

Variance-reduced and projection-free stochastic optimization

Elad Hazan and Haipeng Luo. Variance-reduced and projection-free stochastic optimization. InInternational Conference on Machine Learning, pages 1263–1271. PMLR, 2016

work page 2016

[11] [11]

Low-rank Orthogonalization for Large-scale Matrix Optimization with Applications to Foundation Model Training

Chuan He, Zhanwang Deng, and Zhaosong Lu. Low-rank orthogonalization for large-scale matrix optimization with applications to foundation model training.arXiv preprint arXiv:2509.11983, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[12] [12]

LiMuon: Light and Fast Muon Optimizer for Large Models

Feihu Huang, Yuning Luo, and Songcan Chen. Limuon: Light and fast muon optimizer for large models.arXiv preprint arXiv:2509.14562, 2025

work page internal anchor Pith review arXiv 2025

[13] [13]

Muon: An optimizer for hidden layers in neural networks.URL https://kellerjordan

K Jordan, Y Jin, V Boza, Y Jiacheng, F Cesista, L Newhouse, and J Bernstein. Muon: An optimizer for hidden layers in neural networks.URL https://kellerjordan. github. io/posts/muon, 2024

work page 2024

[14] [14]

Convergence of muon with newton-schulz.arXiv preprint arXiv:2601.19156, 2026

Gyu Yeol Kim and Min-hwan Oh. Convergence of muon with newton-schulz.arXiv preprint arXiv:2601.19156, 2026. 11

work page arXiv 2026

[15] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014

[16] Dmitry Kovalev. Understanding gradient orthogonalization for deep learning via non-euclidean trust-region optimization. arXiv preprint arXiv:2503.12645, 2025

[17] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. Communications of the ACM, 60(6):84–90, 2017

[18] Yicheng Lang, Changsheng Wang, Yihua Zhang, Mingyi Hong, Zheng Zhang, Wotao Yin, and Sijia Liu. Powering up zeroth-order training via subspace gradient orthogonalization. arXiv preprint arXiv:2602.17155, 2026

[19] Yunwen Lei. Stability and generalization of stochastic optimization with nonconvex and nonsmooth problems. In The Thirty Sixth Annual Conference on Learning Theory, pages 191–227. PMLR, 2023

[20] Yunwen Lei and Yiming Ying. Fine-grained analysis of stability and generalization for stochastic gradient descent. In International Conference on Machine Learning, pages 5809–5819. PMLR, 2020

[21] Jiaxiang Li and Mingyi Hong. A note on the convergence of muon and further. arXiv preprint arXiv:2502.02900, 2025

[22] Jingyuan Liu, Jianlin Su, Xingcheng Yao, Zhejun Jiang, Guokun Lai, Yulun Du, Yidao Qin, Weixin Xu, Enzhe Lu, Junjie Yan, et al. Muon is scalable for llm training. arXiv preprint arXiv:2502.16982, 2025

[23] Yifeng Liu, Angela Yuan, and Quanquan Gu. Mars-m: When variance reduction meets matrices. arXiv preprint arXiv:2510.21800, 2025

[24] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017

[25] Jianhao Ma, Yu Huang, Yuejie Chi, and Yuxin Chen. Preconditioning benefits of spectral orthogonalization in muon. arXiv preprint arXiv:2601.13474, 2026

[26] Thomas Pethick, Wanyun Xie, Kimon Antonakopoulos, Zhenyu Zhu, Antonio Silveti-Falls, and Volkan Cevher. Training deep learning models with norm-constrained lmos. arXiv preprint arXiv:2502.07529, 2025

[27] Thomas Pethick, Wanyun Xie, Mete Erdogan, Kimon Antonakopoulos, Tony Silveti-Falls, and Volkan Cevher. Generalized gradient norm clipping & non-euclidean (𝑙_0, 𝑙_1)-smoothness. arXiv preprint arXiv:2506.01913, 2025

[28] Xun Qian, Hussein Rammal, Dmitry Kovalev, and Peter Richtarik. Muon is provably faster with momentum variance reduction. arXiv preprint arXiv:2512.16598, 2025

[29] Ali Ramezani-Kebrya, Kimon Antonakopoulos, Volkan Cevher, Ashish Khisti, and Ben Liang. On the generalization of stochastic gradient descent with momentum. Journal of Machine Learning Research, 25(22):1–56, 2024

[30] Yehonathan Refael, Guy Smorodinsky, Tom Tirer, and Ofir Lindenbaum. Sumo: Subspace-aware moment-orthogonalization for accelerating memory-efficient llm training. arXiv preprint arXiv:2505.24749, 2025

[31] Artem Riabinin, Egor Shulgin, Kaja Gruntkowska, and Peter Richtárik. Gluon: Making muon & scion great again! (bridging theory and practice of lmo-based optimizers for llms). arXiv preprint arXiv:2505.13416, 2025

[32] Herbert Robbins and Sutton Monro. A stochastic approximation method. The annals of mathematical statistics, pages 400–407, 1951

[33] Ranjan Sapkota, Rahul Harsha Cheppally, Ajay Sharda, and Manoj Karkee. Yolo26: key architectural enhancements and performance benchmarking for real-time object detection. arXiv preprint arXiv:2509.25164, 2025

[34] Maria-Eleni Sfyraki and Jun-Kun Wang. Lions and muons: Optimization via stochastic frank-wolfe. arXiv preprint arXiv:2506.04192, 2025

[35] Shai Shalev-Shwartz, Ohad Shamir, Nathan Srebro, and Karthik Sridharan. Learnability, stability and uniform convergence. The Journal of Machine Learning Research, 11:2635–2670, 2010

[36] Wei Shen, Ruichuan Huang, Minhui Huang, Cong Shen, and Jiawei Zhang. On the convergence analysis of muon. arXiv preprint arXiv:2505.23737, 2025

[37] Weijie Su. Isotropic curvature model for understanding deep learning optimization: Is gradient orthogonalization optimal? arXiv preprint arXiv:2511.00674, 2025

[38] Ilya Sutskever, James Martens, George Dahl, and Geoffrey Hinton. On the importance of initialization and momentum in deep learning. In International Conference on Machine Learning, pages 1139–1147. PMLR, 2013

[39] Qwen Team. Qwen3 technical report, 2025

[40] Bhavya Vasudeva, Puneesh Deora, and Christos Thrampoulidis. On generalization of spectral gradient descent: A case study on imbalanced data. In High-dimensional Learning Dynamics 2025

[41] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in neural information processing systems, 30, 2017

[42] Dingzhi Yu, Hongyi Tao, Yuanyu Wan, Luo Luo, and Lijun Zhang. Sign-based optimizers are effective under heavy-tailed noise. arXiv preprint arXiv:2602.07425, 2026

[43] Yi Yu, Tengyao Wang, and Richard J Samworth. A useful variant of the davis–kahan theorem for statisticians. Biometrika, 102(2):315–323, 2015

[44] Tong Zhang. Mathematical analysis of machine learning algorithms. Cambridge University Press, 2023

A Generalization Analysis

In this subsection, we provide a detailed generalization analysis for both Muon and our MiMuon algorithm.

Theorem 4. (Restatement of Theorem 1) Assume the sequence {W_t, M_t}_{t=0}^T is generated from Algorithm 1 o...