MiMuon: Mixed Muon Optimizer with Improved Generalization for Large Models
Pith reviewed 2026-05-20 08:05 UTC · model grok-4.3
The pith
MiMuon achieves a generalization error of O(1/N) for matrix parameters by mixing orthogonalization with momentum SGD, improving on Muon's O(1/(N κ^T)) bound.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that the MiMuon optimizer, formed by cautiously applying orthogonalization to the gradient before a momentum update, has a generalization error of O(1/N) derived from algorithmic stability, which is strictly lower than the O(1/(N κ^T)) bound proved for the pure Muon optimizer when κ is small. The paper also shows that MiMuon retains the convergence rate O(1/T^{1/4}) of Muon. Experiments on models such as Qwen3-0.6B and YOLO26m illustrate the practical benefits of this mixed approach for matrix parameters.
What carries the argument
MiMuon, a hybrid optimizer that applies orthogonalization to the gradient estimate only in a controlled, mixed fashion together with momentum SGD updates.
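The hybrid idea can be illustrated with a minimal sketch. This is not the paper's Algorithm 1: the mixing rule (take the orthogonalized branch with probability `p`), the use of an exact SVD in place of whatever orthogonalization routine the authors use, and all names and default values (`mixed_step`, `lr`, `beta`, `p`) are assumptions made here for illustration only.

```python
import numpy as np

def orthogonalize(G):
    """Return U @ V^T from the thin SVD of G, i.e. G with its singular values set to 1."""
    U, _, Vt = np.linalg.svd(G, full_matrices=False)
    return U @ Vt

def mixed_step(W, M, G, lr=0.02, beta=0.9, p=0.5, rng=None):
    """One hypothetical mixed update on a matrix parameter W.

    M is the momentum buffer (the gradient estimate) and G the stochastic gradient.
    The probability-p mixing rule is an assumption, not the paper's stated rule.
    """
    rng = np.random.default_rng() if rng is None else rng
    M = beta * M + (1.0 - beta) * G       # momentum-style gradient estimate
    if rng.random() < p:
        D = orthogonalize(M)              # Muon-style branch: orthogonalized direction
    else:
        D = M                             # momentum-SGD branch: raw gradient estimate
    return W - lr * D, M
```

Setting `p=1.0` gives a pure orthogonalized update and `p=0.0` gives plain momentum SGD, which makes it easy to compare the two extremes on the same data.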
If this is right
- MiMuon trains matrix-parameter models such as those in large language models with a generalization bound independent of the singular-value gap κ.
- The optimizer reaches the same convergence rate O(1/T^{1/4}) as Muon, so the improved generalization bound does not come at the cost of a slower rate in the number of iterations T.
- The improved bound applies directly to models whose parameters appear as matrices, including attention weights and convolutional filters.
- Numerical results on Qwen3-0.6B and YOLO26m confirm that the mixed updates remain efficient in practice.
Where Pith is reading between the lines
- Similar controlled mixing of orthogonalization steps with momentum could be tested on other matrix-aware optimizers to tighten their stability bounds.
- Empirical plots of generalization gap against training set size N could directly verify whether MiMuon's error scales closer to 1/N than Muon's does.
- The approach highlights a trade-off in which selective use of expensive orthogonalization steps can improve statistical properties without sacrificing convergence speed.
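The scaling check suggested in the list above could be sketched as a log-log fit. The function name `scaling_exponent` and the idea of reading the slope as an empirical exponent are assumptions of this sketch; a fitted slope near -1 would be consistent with a 1/N trend but would not by itself verify an upper bound.

```python
import numpy as np

def scaling_exponent(sample_sizes, gaps):
    """Least-squares slope of log(gap) against log(N).

    sample_sizes: training set sizes N (positive numbers).
    gaps: measured generalization gaps (test loss minus train loss), all positive.
    """
    x = np.log(np.asarray(sample_sizes, dtype=float))
    y = np.log(np.asarray(gaps, dtype=float))
    slope, _intercept = np.polyfit(x, y, 1)   # coefficients come highest degree first
    return float(slope)
```

In use, one would train with the same budget at several values of N for each optimizer and compare the two fitted slopes.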
Load-bearing premise
That the minimum singular-value gap κ of the gradient estimate is generally very small, rendering the Muon generalization bound practically loose.
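A short worked comparison shows why the premise matters. The constants C_1 and C_2 stand for whatever factors the O(·) notation hides and are hypothetical here; κ = 10^{-2} is the upper end of the range quoted in the simulated rebuttal on this page, and T = 10 is chosen purely for illustration.

```latex
\frac{C_1 / (N \kappa^{T})}{C_2 / N} \;=\; \frac{C_1}{C_2}\,\kappa^{-T},
\qquad
\kappa = 10^{-2},\; T = 10 \;\Longrightarrow\; \kappa^{-T} = 10^{20}.
```

The ratio grows with T only when κ < 1; for κ ≥ 1 the Muon bound would be no looser than O(1/N), which is why the size of κ carries the comparison.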
What would settle it
Compute the empirical value of κ from the singular values of the gradient estimate across iterations on a matrix-parameter model, and check whether it stays well below 1, so that the factor 1/κ^T grows exponentially in T and the Muon bound becomes loose.
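A minimal sketch of that measurement follows. It takes κ to be the smallest gap between consecutive singular values, minimized over the recorded iterations, as described in the simulated rebuttal on this page. The helper names are hypothetical, and how the gradient estimates are collected from a training run is left to the reader.

```python
import numpy as np

def min_singular_gap(G):
    """Smallest gap between consecutive singular values of one matrix G."""
    s = np.linalg.svd(G, compute_uv=False)   # singular values, sorted descending
    if s.size < 2:
        return float("inf")                  # a single singular value has no gap
    return float(np.min(s[:-1] - s[1:]))

def empirical_kappa(gradient_estimates):
    """Minimum gap over an iterable of gradient-estimate matrices (one per step)."""
    return min(min_singular_gap(G) for G in gradient_estimates)

def log10_muon_factor(kappa, T):
    """log10 of 1/kappa^T, computed in log space to avoid overflow; requires kappa > 0."""
    return float(-T * np.log10(kappa))
```

Reporting `log10_muon_factor` rather than 1/κ^T itself keeps the number readable when T is in the thousands.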
Original abstract
Matrix-structured parameters frequently appear in many artificial intelligence models such as large language models. More recently, an efficient Muon optimizer is designed for matrix parameters of large-scale models, and shows markedly faster convergence than the vector-wise algorithms. Although some works have begun to study convergence properties (i.e., optimization error) of the Muon optimizer, its generalization properties (i.e., generalization error) is still not established. Thus, in this paper, we study generalization error of the Muon optimizer based on algorithmic stability and mathematical induction, and prove that the Muon has a generalization error of $O\big(\frac{1}{N\kappa^{T}}\big)$, where $N$ is training sample size, and $T$ denotes iteration number, and $\kappa>0$ denotes minimum difference between singular values of gradient estimate. To enhance generalization of the Muon, we propose an effective mixed Muon (MiMuon) optimizer by cautiously using orthogonalization of gradient, which is a hybrid of Muon and momentum-based SGD optimizers. Then we prove that our MiMuon optimizer has a lower generalization error of $O\big(\frac{1}{N}\big)$ than $O\big(\frac{1}{N\kappa^{T}}\big)$ of Muon optimizer, since $\kappa$ generally is very small. Meanwhile, we also studied the convergence properties of our MiMuon algorithm, and prove that our MiMuon algorithm has the same convergence rate of $O(\frac{1}{T^{1/4}})$ as the Muon algorithm. Some numerical experimental results on training large models including Qwen3-0.6B and YOLO26m demonstrate efficiency of the MiMuon optimizer.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript claims to establish generalization bounds for the Muon optimizer using algorithmic stability and mathematical induction, deriving a generalization error of O(1/(N κ^T)), where κ is the minimum difference between singular values of the gradient estimate. It introduces the MiMuon optimizer as a hybrid of Muon and momentum SGD to achieve an improved bound of O(1/N), while preserving the convergence rate of O(1/T^{1/4}). Numerical experiments on large models like Qwen3-0.6B and YOLO26m are used to illustrate the efficiency of MiMuon.
Significance. If the theoretical claims are rigorously established with explicit derivations and the key assumption on κ is empirically validated through measurements, the work could provide useful theoretical grounding for hybrid matrix optimizers in large models. The idea of cautiously mixing orthogonalization steps to remove κ dependence is a reasonable direction for improving generalization bounds.
major comments (2)
- Abstract: The claim that MiMuon has generalization error O(1/N) 'since κ generally is very small' invokes an empirical observation to justify superiority over the Muon bound O(1/(N κ^T)). No quantitative lower bound on κ, no formal statement of how κ is computed from the gradient estimate, and no measurements of singular-value gaps on the Qwen3-0.6B or YOLO26m training runs are supplied, rendering the asserted practical improvement dependent on an unverified premise rather than on the proofs.
- Abstract: The manuscript asserts that both the Muon and MiMuon generalization bounds, as well as the shared O(1/T^{1/4}) convergence rate, are proved via algorithmic stability and induction. However, no derivation steps, precise stability assumptions (e.g., Lipschitz constants or boundedness conditions on the orthogonalized updates), or verification that the hybrid MiMuon step preserves the induction hypothesis are provided. This absence is load-bearing for the central theoretical contribution.
minor comments (1)
- The definition of κ as the 'minimum difference between singular values of gradient estimate' should be stated formally with an equation in the main text or appendix to avoid ambiguity in the bound statements.
Simulated Author's Rebuttal
Thank you for the constructive review. We appreciate the emphasis on empirical validation of assumptions and clarity of proofs. We address each major comment below and have made revisions to strengthen the manuscript.
Referee: Abstract: The claim that MiMuon has generalization error O(1/N) 'since κ generally is very small' invokes an empirical observation to justify superiority over the Muon bound O(1/(N κ^T)). No quantitative lower bound on κ, no formal statement of how κ is computed from the gradient estimate, and no measurements of singular-value gaps on the Qwen3-0.6B or YOLO26m training runs are supplied, rendering the asserted practical improvement dependent on an unverified premise rather than on the proofs.
Authors: We agree that the original phrasing was informal and that empirical support strengthens the claim. In the revision, we formally define κ as the minimum over iterations t of the smallest gap between consecutive singular values of the gradient estimate matrix at step t. We have added new figures and tables reporting measured singular-value gaps from the Qwen3-0.6B and YOLO26m runs, which show κ typically lies between 10^{-4} and 10^{-2}. While a model-independent quantitative lower bound on κ is not derived (as it would require strong assumptions on data and architecture), the provided measurements directly support the practical improvement asserted for MiMuon. The abstract and a new subsection have been updated accordingly.
Revision: yes
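The verbal definition of κ in this response can be written as a formula. This is one reading, not the paper's own equation: the symbol G_t for the gradient estimate at step t, the rank-like count r_t, and the index range of t are notational assumptions.

```latex
\kappa \;=\; \min_{0 \le t \le T}\ \min_{1 \le i < r_t}
\bigl( \sigma_i(G_t) - \sigma_{i+1}(G_t) \bigr),
\qquad
\sigma_1(G_t) \ge \sigma_2(G_t) \ge \dots \ge \sigma_{r_t}(G_t).
```

Under this reading, κ > 0 requires the singular values of every G_t to be distinct.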
Referee: Abstract: The manuscript asserts that both the Muon and MiMuon generalization bounds, as well as the shared O(1/T^{1/4}) convergence rate, are proved via algorithmic stability and induction. However, no derivation steps, precise stability assumptions (e.g., Lipschitz constants or boundedness conditions on the orthogonalized updates), or verification that the hybrid MiMuon step preserves the induction hypothesis are provided. This absence is load-bearing for the central theoretical contribution.
Authors: The full proofs using algorithmic stability and induction are contained in Sections 3 (Muon generalization), 4 (MiMuon generalization), and 5 (convergence). The loss is assumed L-Lipschitz and the orthogonalized updates are bounded in operator norm by a constant B; these are stated at the beginning of Section 3. The induction tracks the stability parameter across iterations and produces the κ^T factor for Muon. For MiMuon the hybrid step (orthogonalization with probability p, momentum SGD otherwise) is shown to preserve the induction hypothesis by separately bounding the stability contribution of each branch and taking a convex combination. To address the concern about accessibility, we have added a concise proof sketch to the abstract and expanded the statement of assumptions plus the induction verification paragraph in Section 4.
Revision: partial
Circularity Check
No significant circularity detected in the derivation chain.
The paper derives Muon generalization error O(1/(N κ^T)) via algorithmic stability and induction on orthogonalized steps, then constructs MiMuon as a hybrid that yields an independent O(1/N) bound without κ dependence. These are formal mathematical results whose steps do not reduce to each other by construction, nor rely on self-citation chains or fitted inputs renamed as predictions. The phrase 'since κ generally is very small' appears only as motivational context for practical relevance and is not part of the proof structure or any equation. No load-bearing premise collapses into a prior result by the same authors or an ansatz smuggled via citation. The theoretical claims remain self-contained against the stated assumptions.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Algorithmic stability of the optimizer iterates can be bounded via mathematical induction on the singular-value gap of the gradient estimate
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean, washburn_uniqueness_aczel (unclear)
  Relation between the paper passage and the cited Recognition theorem.
  Paper passage: prove that the Muon has a generalization error of O(1/(N κ^T)) ... since κ generally is very small
- IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean, absolute_floor_iff_bare_distinguishability (unclear)
  Relation between the paper passage and the cited Recognition theorem.
  Paper passage: MiMuon ... hybrid of Muon and momentum-based SGD ... O(1/N)
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.