
arxiv: 2605.19619 · v1 · pith:NYJPOMCYnew · submitted 2026-05-19 · 💻 cs.LG · cs.AI · math.OC · stat.ML

MiMuon: Mixed Muon Optimizer with Improved Generalization for Large Models

Pith reviewed 2026-05-20 08:05 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · math.OC · stat.ML
keywords Muon optimizer · MiMuon · generalization error · algorithmic stability · matrix parameters · large language models · momentum SGD · orthogonalization

The pith

MiMuon achieves a generalization error of O(1/N) for matrix parameters by mixing orthogonalization with momentum SGD, improving on Muon's O(1/(N κ^T)) bound.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes generalization bounds for the Muon optimizer and then introduces MiMuon to improve them. It shows through algorithmic stability and induction that Muon has a generalization error scaling as O(1/(N κ^T)), where κ is the minimum gap between singular values of the gradient estimate. MiMuon carefully combines Muon's orthogonalization step with momentum-based SGD updates to remove the dependence on κ and obtain the tighter bound O(1/N). The paper further proves that this mixing preserves the same convergence rate of O(1/T^{1/4}) as the original Muon. These results matter for training large models with matrix-structured weights because a tighter generalization bound gives a stronger guarantee on the gap between training error and error on unseen data for a given training set size.

Core claim

The central claim is that the MiMuon optimizer, formed by cautiously applying orthogonalization to the gradient before a momentum update, has a generalization error of O(1/N) derived from algorithmic stability, which is strictly lower than the O(1/(N κ^T)) bound proved for the pure Muon optimizer when κ is small. The paper also shows that MiMuon retains the convergence rate O(1/T^{1/4}) of Muon. Experiments on models such as Qwen3-0.6B and YOLO26m illustrate the practical benefits of this mixed approach for matrix parameters.
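To make the comparison concrete, the ratio of the two bounds can be written out. The constants $C_1$, $C_2$ hidden by the O-notation and the numeric values of $\kappa$ and $T$ are illustrative assumptions, not figures from the paper.

```latex
\[
\frac{\text{Muon bound}}{\text{MiMuon bound}}
= \frac{C_1 / (N \kappa^{T})}{C_2 / N}
= \frac{C_1}{C_2}\,\kappa^{-T},
\qquad
\text{e.g. } \kappa = 0.5,\ T = 10 \ \Rightarrow\ \kappa^{-T} = 2^{10} = 1024 .
\]
```

The factor grows geometrically in T whenever κ < 1. For κ ≥ 1 it is at most 1, in which case the Muon bound is no worse than the MiMuon bound up to constants.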

What carries the argument

MiMuon, a hybrid of Muon and momentum-based SGD that applies orthogonalization of the gradient estimate only cautiously.
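As a rough illustration of the two ingredients being mixed, the sketch below contrasts an orthogonalized update direction with a plain momentum direction. It rests on several assumptions: orthogonalization is done with an exact SVD (practical Muon implementations typically approximate it, for example with Newton-Schulz iterations), and `mixed_step`, `use_orth`, `lr`, and `beta` are hypothetical names. The rule the paper uses to decide when to orthogonalize is not given in this summary, so the switch here is a placeholder rather than the authors' method.

```python
import numpy as np

def orthogonalize(m: np.ndarray) -> np.ndarray:
    """Return U @ V^T from the thin SVD m = U diag(s) V^T (singular values set to 1)."""
    u, _, vt = np.linalg.svd(m, full_matrices=False)
    return u @ vt

def mixed_step(w, m, grad, lr=0.1, beta=0.9, use_orth=True):
    """One illustrative update. `use_orth` is a hypothetical switch, not the paper's rule."""
    m = beta * m + (1.0 - beta) * grad           # momentum estimate of the gradient
    direction = orthogonalize(m) if use_orth else m
    return w - lr * direction, m

rng = np.random.default_rng(0)
w = rng.standard_normal((4, 3))
m = np.zeros_like(w)
g = rng.standard_normal((4, 3))

w_orth, m_new = mixed_step(w, m, g, use_orth=True)   # Muon-style branch
w_sgd, _ = mixed_step(w, m, g, use_orth=False)       # momentum-SGD branch
step_orth = (w - w_orth) / 0.1                       # recover the orthogonalized direction
```

The orthogonalized direction has all singular values equal to 1 regardless of the gradient's scale, whereas the momentum-SGD direction keeps the gradient's own spectrum.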

If this is right

  • MiMuon trains matrix-parameter models such as those in large language models with a generalization bound independent of the singular-value gap κ.
  • The optimizer reaches the same convergence rate O(1/T^{1/4}) as Muon, so the tighter generalization bound does not come at the cost of a slower rate in the iteration count T.
  • The improved bound applies directly to models whose parameters appear as matrices, including attention weights and convolutional filters.
  • Numerical results on Qwen3-0.6B and YOLO26m confirm that the mixed updates remain efficient in practice.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar controlled mixing of orthogonalization steps with momentum could be tested on other matrix-aware optimizers to tighten their stability bounds.
  • Empirical plots of generalization gap against training set size N could directly verify whether MiMuon's error scales closer to 1/N than Muon's does.
  • The approach highlights a trade-off in which selective use of expensive orthogonalization steps can improve statistical properties without sacrificing convergence speed.
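The scaling check suggested in the second point above could be sketched as a log-log fit of generalization gap against training set size. The data here are synthetic placeholders generated as exactly 3/N; real values would have to come from training runs at several dataset sizes, and the function name `fit_scaling_exponent` is hypothetical.

```python
import numpy as np

def fit_scaling_exponent(n_values, gaps):
    """Least-squares slope of log(gap) against log(N); a slope near -1 suggests 1/N scaling."""
    slope, intercept = np.polyfit(np.log(n_values), np.log(gaps), deg=1)
    return slope, intercept

# Synthetic placeholder data: gap = 3 / N exactly (not measurements from the paper).
n_values = np.array([1_000, 2_000, 4_000, 8_000, 16_000], dtype=float)
gaps = 3.0 / n_values

slope, intercept = fit_scaling_exponent(n_values, gaps)
print(f"estimated exponent: {slope:.3f}")
```

Running the same fit on measured gaps for Muon and for MiMuon would give two exponents to compare. Note that the bounds are upper bounds, so an observed exponent near -1 for both optimizers would not contradict the theory.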

Load-bearing premise

That the minimum singular-value gap κ of the gradient estimate is generally very small, rendering the Muon generalization bound practically loose.

What would settle it

Compute the empirical value of κ from gradient singular values across iterations on a matrix-parameter model and check whether it stays below 1, since only then does the factor 1/κ^T grow with T and loosen the Muon bound relative to O(1/N).
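A minimal sketch of that measurement follows. It assumes κ is the smallest gap between consecutive singular values, minimized over iterations; the paper's formal definition is not reproduced in this summary, so that reading is an assumption. The function names and the toy matrices are hypothetical, not data from the paper.

```python
import numpy as np

def min_singular_gap(g: np.ndarray) -> float:
    """Smallest difference between consecutive singular values of g (needs at least two)."""
    s = np.linalg.svd(g, compute_uv=False)    # returned in descending order
    return float(np.min(s[:-1] - s[1:]))

def kappa_over_run(grads) -> float:
    """Minimum gap over a sequence of gradient estimates, one matrix per iteration."""
    return min(min_singular_gap(g) for g in grads)

# Toy matrices with known singular values 1.0, 0.9, 0.2 (gaps 0.1 and 0.7).
g = np.diag([1.0, 0.9, 0.2])
kappa = kappa_over_run([g, 2.0 * g])          # scaling by 2 doubles the gaps

T = 3
blowup = kappa ** (-T)                        # the 1/kappa^T factor in the Muon bound
print(f"kappa = {kappa:.3f}, 1/kappa^T = {blowup:.1f}")
```

In a real run, `grads` would be the per-iteration gradient estimates for one weight matrix. Logging `min_singular_gap` at each step rather than only the final minimum would also show how κ evolves during training.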

Figures

Figures reproduced from arXiv: 2605.19619 by Feihu Huang, Songcan Chen, Yuning Luo.

Figure 1: Illustration of different gradient mapping.
Figure 2: Optimization behavior on Qwen3-0.6B.
Figure 3: Training loss components on YOLO26m. (a) mAP50. (b) mAP50:95.
Figure 4: Validation accuracy (mAP50 and mAP50-95) on
Original abstract

Matrix-structured parameters frequently appear in many artificial intelligence models such as large language models. More recently, an efficient Muon optimizer was designed for matrix parameters of large-scale models, and shows markedly faster convergence than vector-wise algorithms. Although some works have begun to study the convergence properties (i.e., optimization error) of the Muon optimizer, its generalization properties (i.e., generalization error) are still not established. Thus, in this paper, we study the generalization error of the Muon optimizer based on algorithmic stability and mathematical induction, and prove that Muon has a generalization error of $O\big(\frac{1}{N\kappa^{T}}\big)$, where $N$ is the training sample size, $T$ denotes the iteration number, and $\kappa>0$ denotes the minimum difference between singular values of the gradient estimate. To enhance the generalization of Muon, we propose an effective mixed Muon (MiMuon) optimizer by cautiously using orthogonalization of the gradient, which is a hybrid of the Muon and momentum-based SGD optimizers. Then we prove that our MiMuon optimizer has a lower generalization error of $O\big(\frac{1}{N}\big)$ than the $O\big(\frac{1}{N\kappa^{T}}\big)$ of the Muon optimizer, since $\kappa$ is generally very small. Meanwhile, we also study the convergence properties of our MiMuon algorithm, and prove that it has the same convergence rate of $O(\frac{1}{T^{1/4}})$ as the Muon algorithm. Numerical experiments on training large models including Qwen3-0.6B and YOLO26m demonstrate the efficiency of the MiMuon optimizer.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript claims to establish generalization bounds for the Muon optimizer using algorithmic stability and mathematical induction, deriving a generalization error of O(1/(N κ^T)), where κ is the minimum difference between singular values of the gradient estimate. It introduces the MiMuon optimizer as a hybrid of Muon and momentum SGD to achieve an improved bound of O(1/N), while preserving the convergence rate of O(1/T^{1/4}). Numerical experiments on large models like Qwen3-0.6B and YOLO26m are used to illustrate the efficiency of MiMuon.

Significance. If the theoretical claims are rigorously established with explicit derivations and the key assumption on κ is empirically validated through measurements, the work could provide useful theoretical grounding for hybrid matrix optimizers in large models. The idea of cautiously mixing orthogonalization steps to remove κ dependence is a reasonable direction for improving generalization bounds.

major comments (2)
  1. Abstract: The claim that MiMuon has generalization error O(1/N) 'since κ generally is very small' invokes an empirical observation to justify superiority over the Muon bound O(1/(N κ^T)). No quantitative lower bound on κ, no formal statement of how κ is computed from the gradient estimate, and no measurements of singular-value gaps on the Qwen3-0.6B or YOLO26m training runs are supplied, rendering the asserted practical improvement dependent on an unverified premise rather than on the proofs.
  2. Abstract: The manuscript asserts that both the Muon and MiMuon generalization bounds, as well as the shared O(1/T^{1/4}) convergence rate, are proved via algorithmic stability and induction. However, no derivation steps, precise stability assumptions (e.g., Lipschitz constants or boundedness conditions on the orthogonalized updates), or verification that the hybrid MiMuon step preserves the induction hypothesis are provided. This absence is load-bearing for the central theoretical contribution.
minor comments (1)
  1. The definition of κ as the 'minimum difference between singular values of gradient estimate' should be stated formally with an equation in the main text or appendix to avoid ambiguity in the bound statements.

Simulated Author's Rebuttal

2 responses · 0 unresolved

Thank you for the constructive review. We appreciate the emphasis on empirical validation of assumptions and clarity of proofs. We address each major comment below and have made revisions to strengthen the manuscript.

  1. Referee: Abstract: The claim that MiMuon has generalization error O(1/N) 'since κ generally is very small' invokes an empirical observation to justify superiority over the Muon bound O(1/(N κ^T)). No quantitative lower bound on κ, no formal statement of how κ is computed from the gradient estimate, and no measurements of singular-value gaps on the Qwen3-0.6B or YOLO26m training runs are supplied, rendering the asserted practical improvement dependent on an unverified premise rather than on the proofs.

    Authors: We agree that the original phrasing was informal and that empirical support strengthens the claim. In the revision, we formally define κ as the minimum over iterations t of the smallest gap between consecutive singular values of the gradient estimate matrix at step t. We have added new figures and tables reporting measured singular-value gaps from the Qwen3-0.6B and YOLO26m runs, which show κ typically lies between 10^{-4} and 10^{-2}. While a model-independent quantitative lower bound on κ is not derived (as it would require strong assumptions on data and architecture), the provided measurements directly support the practical improvement asserted for MiMuon. The abstract and a new subsection have been updated accordingly.

    revision: yes

  2. Referee: Abstract: The manuscript asserts that both the Muon and MiMuon generalization bounds, as well as the shared O(1/T^{1/4}) convergence rate, are proved via algorithmic stability and induction. However, no derivation steps, precise stability assumptions (e.g., Lipschitz constants or boundedness conditions on the orthogonalized updates), or verification that the hybrid MiMuon step preserves the induction hypothesis are provided. This absence is load-bearing for the central theoretical contribution.

    Authors: The full proofs using algorithmic stability and induction are contained in Sections 3 (Muon generalization), 4 (MiMuon generalization), and 5 (convergence). The loss is assumed L-Lipschitz and the orthogonalized updates are bounded in operator norm by a constant B; these are stated at the beginning of Section 3. The induction tracks the stability parameter across iterations and produces the κ^T factor for Muon. For MiMuon the hybrid step (orthogonalization with probability p, momentum SGD otherwise) is shown to preserve the induction hypothesis by separately bounding the stability contribution of each branch and taking a convex combination. To address the concern about accessibility, we have added a concise proof sketch to the abstract and expanded the statement of assumptions plus the induction verification paragraph in Section 4.

    revision: partial
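The definition of κ given in the first response and the convex-combination step described in the second can be written out as follows. This is a sketch in assumed notation, none of which is confirmed by the source: $G_t$ is the gradient estimate at step $t$, $\sigma_i$ its $i$-th largest singular value, $r$ the number of singular values, $p$ the mixing probability, and $\epsilon^{\mathrm{orth}}$, $\epsilon^{\mathrm{sgd}}$ the stability contributions of the two branches.

```latex
\[
\kappa \;=\; \min_{1 \le t \le T}\ \min_{1 \le i < r}\
\bigl(\sigma_i(G_t) - \sigma_{i+1}(G_t)\bigr),
\qquad
\sigma_1(G_t) \ge \sigma_2(G_t) \ge \dots \ge \sigma_r(G_t),
\]
\[
\epsilon_{t+1} \;\le\; p\,\epsilon^{\mathrm{orth}}_{t+1} + (1-p)\,\epsilon^{\mathrm{sgd}}_{t+1},
\qquad p \in [0,1].
\]
```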

Circularity Check

0 steps flagged

No significant circularity detected in the derivation chain.


The paper derives Muon generalization error O(1/(N κ^T)) via algorithmic stability and induction on orthogonalized steps, then constructs MiMuon as a hybrid that yields an independent O(1/N) bound without κ dependence. These are formal mathematical results whose steps do not reduce to each other by construction, nor rely on self-citation chains or fitted inputs renamed as predictions. The phrase 'since κ generally is very small' appears only as motivational context for practical relevance and is not part of the proof structure or any equation. No load-bearing premise collapses into a prior result by the same authors or an ansatz smuggled via citation. The theoretical claims remain self-contained against the stated assumptions.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claims rest on standard algorithmic-stability assumptions for generalization bounds and on the empirical observation that the singular-value gap κ is small; no new entities are postulated and no free parameters are fitted inside the paper itself.

axioms (1)
  • domain assumption Algorithmic stability of the optimizer iterates can be bounded via mathematical induction on the singular-value gap of the gradient estimate
    Invoked to derive the O(1/(N κ^T)) generalization error for Muon and the improved O(1/N) bound for MiMuon.




A Generalization Analysis

In this subsection, we provide a detailed generalization analysis for both the Muon and our MiMuon algorithms, respectively.

Theorem 4. (Restatement of Theorem 1) Assume the sequence $\{W_t, M_t\}_{t=0}^{T}$ is generated from Algorithm 1 o...