Recognition: 2 theorem links
· Lean Theorem
Muon with Nesterov Momentum: Heavy-Tailed Noise and (Randomized) Inexact Polar Decomposition
Pith reviewed 2026-05-11 00:47 UTC · model grok-4.3
The pith
Muon optimizer with Nesterov momentum achieves optimal convergence rates under heavy-tailed noise and inexact polar decomposition.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Under a unified framework for inexact polar decomposition, Muon with Nesterov momentum finds an ε-stationary point in O(ε^(-(3α-2)/(α-1))) iterations and samples when the stochastic gradients are heavy-tailed with index α ∈ (1,2]. For the inexact-polar setting with σ1 = 0, the guarantees hold without prior knowledge of α. A randomized low-rank variant is also analyzed and shown to be compatible with the theory.
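As a quick sanity check (not taken from the paper), the sketch below evaluates the claimed exponent at a few tail indices; at α = 2 it recovers the familiar ε^(-4) complexity of non-convex stochastic first-order methods, and the cost grows as α approaches 1.

```python
# Evaluate the claimed iteration-complexity exponent (3*alpha - 2)/(alpha - 1)
# for a few heavy-tail indices alpha in (1, 2].
def complexity_exponent(alpha: float) -> float:
    assert 1.0 < alpha <= 2.0, "heavy-tail index must lie in (1, 2]"
    return (3.0 * alpha - 2.0) / (alpha - 1.0)

for alpha in (2.0, 1.75, 1.5, 1.25):
    print(f"alpha = {alpha:4.2f}  ->  iterations ~ eps^(-{complexity_exponent(alpha):.2f})")
```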
What carries the argument
The unified framework for inexact polar decomposition that quantifies error propagation from iterative approximations such as Newton-Schulz through the optimization dynamics.
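For concreteness, a minimal sketch of the classical cubic Newton-Schulz iteration for the polar factor (practical Muon implementations use tuned higher-order polynomials, and the paper's exact scheme and step count may differ); the residual it prints is the kind of inexactness the framework is meant to absorb.

```python
import numpy as np

def newton_schulz_polar(M: np.ndarray, steps: int = 10) -> np.ndarray:
    """Approximate the polar factor U of M = U P via the cubic Newton-Schulz
    iteration X <- 1.5 X - 0.5 X X^T X. Scaling by the Frobenius norm keeps
    all singular values of X_0 in [0, 1], where nonzero values converge to 1."""
    X = M / (np.linalg.norm(M, ord="fro") + 1e-12)
    for _ in range(steps):
        X = 1.5 * X - 0.5 * X @ X.T @ X
    return X

# The exact polar factor has all singular values equal to 1; the residual
# below measures the approximation error after a fixed number of steps.
rng = np.random.default_rng(0)
M = rng.standard_normal((64, 32))
U = newton_schulz_polar(M, steps=10)
print(np.abs(np.linalg.svd(U, compute_uv=False) - 1.0).max())
```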
Load-bearing premise
The heavy-tailed noise must satisfy moment conditions indexed by α in (1,2] and the approximation errors from inexact polar decomposition must remain controlled as they propagate through the dynamics.
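A common way to formalize the first condition (the paper's exact statement may use a different norm or constant; G_k, X_k, and σ are assumed notation here) is a bounded α-th central moment of the stochastic gradient, which permits infinite variance whenever α < 2:

```latex
% Heavy-tailed noise assumption, standard form (notation assumed, not quoted):
\mathbb{E}\!\left[\,\bigl\|G_k - \nabla f(X_k)\bigr\\|^{\alpha} \,\middle|\, X_k\right]
\;\le\; \sigma^{\alpha},
\qquad \alpha \in (1,2].
```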
What would settle it
An experiment on a synthetic non-convex matrix problem with known heavy-tail index α where the observed iterations to reach ε-stationarity scale worse than the predicted O(ε^(-(3α-2)/(α-1))) would falsify the complexity bound.
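A hedged sketch of how the synthetic noise in such a test could be generated with a known tail index (the distribution, sample size, and moment orders here are illustrative assumptions, not the paper's experimental setup):

```python
import numpy as np

def heavy_tailed_noise(shape, alpha, rng):
    """Symmetric Pareto-tailed noise: moments of order < alpha are finite,
    moments of order >= alpha diverge, so the tail index is known by design."""
    signs = rng.choice([-1.0, 1.0], size=shape)
    return signs * (rng.pareto(alpha, size=shape) + 1.0)

rng = np.random.default_rng(1)
noise = heavy_tailed_noise((100_000,), alpha=1.5, rng=rng)
# A low-order empirical moment stays stable; the second moment keeps growing
# with sample size, which is the infinite-variance regime the theory targets.
print(np.mean(np.abs(noise) ** 1.2), np.mean(noise ** 2))
```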
Original abstract
Most first-order optimizers treat matrix-valued parameters as vectors, ignoring the intrinsic geometry of hidden-layer weights in neural networks. Muon addresses this mismatch by updating along the polar factor of a momentum matrix, but its theoretical understanding has lagged behind practice. In particular, practical implementations incorporate Nesterov momentum, compute the polar factor only approximately, and operate with stochastic gradients that may be heavy-tailed. We close this gap by developing a convergence theory for Muon with Nesterov momentum and inexact polar decomposition in non-convex matrix optimization under heavy-tailed noise. Our analysis builds on a unified framework for inexact polar decomposition that captures practical iterative approximations such as Newton-Schulz and quantifies how their errors propagate through the optimization dynamics. Under this framework, we establish an optimal iteration and sample complexity of $O \left(\varepsilon^{\frac{-(3\alpha-2)}{(\alpha-1)}} \right)$ for finding an $\varepsilon$-stationary point, where $\alpha\in(1,2]$ denotes the heavy-tail index. For the inexact-polar setting with $\sigma_1=0$, we also provide guarantees that do not require prior knowledge of $\alpha$. We analyze a randomized low-rank polar decomposition that is substantially more efficient than full-space methods while remaining compatible with our theory. Numerical experiments further demonstrate the effectiveness of the proposed inexact and randomized variants.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript develops a convergence theory for Muon with Nesterov momentum and inexact (including randomized low-rank) polar decomposition in non-convex matrix optimization under heavy-tailed stochastic gradients. It builds on a unified framework for inexact polar decomposition (capturing Newton-Schulz and similar methods) that quantifies error propagation, and claims an optimal iteration/sample complexity of O(ε^{-(3α-2)/(α-1)}) for an ε-stationary point where α∈(1,2] is the heavy-tail index; it also gives α-independent guarantees for the σ1=0 inexact-polar case.
Significance. If the error-propagation analysis holds, the work supplies the first rigorous rates for a geometry-aware optimizer under heavy tails, which is significant given the prevalence of matrix parameters in neural networks and the practical use of inexact polar factors. The unified framework, the randomized low-rank variant, and the numerical experiments are strengths that enhance applicability.
major comments (2)
- [Section 3] Unified framework for inexact polar decomposition: the framework bounds polar errors additively and feeds them into the heavy-tail descent lemma, but it lacks an explicit lemma establishing that the nonlinear polar factor (via singular vectors) preserves the α-moment condition on the effective noise when the momentum matrix has only α-moments for α < 2. This preservation is load-bearing for the claimed rate in Theorem 4.1, since heavier tails in the composite update could introduce extra factors or invalidate the O(ε^{-(3α-2)/(α-1)}) bound.
- [Proof of Theorem 4.1] The analysis treats the inexact-polar error as an additive perturbation whose norm is controlled, yet it provides no uniform-in-iteration bound showing that the α-moment of the composed noise remains finite and independent of conditioning; without this, the optimality claim and the α-independent guarantee for σ1 = 0 may not hold as stated.
minor comments (2)
- Notation for the heavy-tail index α and the parameters σ1, σ2 could be introduced earlier with a brief reminder of their roles in the moment assumptions.
- Figure captions for the numerical experiments should explicitly state the matrix dimensions and the value of α used in the synthetic heavy-tailed noise.
Simulated Author's Rebuttal
We thank the referee for the careful reading and constructive feedback. We address the major comments point by point below. We will revise the manuscript to add explicit statements on moment preservation for improved clarity.
Point-by-point responses
- Referee [Section 3]: the framework bounds polar errors additively and feeds them into the heavy-tail descent lemma, but it lacks an explicit lemma establishing that the nonlinear polar factor (via singular vectors) preserves the α-moment condition on the effective noise when the momentum matrix has only α-moments for α < 2. This preservation is load-bearing for the claimed rate in Theorem 4.1, since heavier tails in the composite update could introduce extra factors or invalidate the O(ε^{-(3α-2)/(α-1)}) bound.
Authors: We appreciate the referee highlighting the need for an explicit lemma on this point. The polar factor has operator norm exactly 1, so the effective update direction is bounded in norm by 1; this immediately yields that its α-moment is at most 1 (uniformly in the iteration and independent of conditioning). The additive inexact-polar error is controlled separately by the framework. To address the concern directly, we will insert an explicit lemma in the revised Section 3 establishing preservation of the α-moment for the composed noise term under the stated conditions. revision: yes
- Referee [Proof of Theorem 4.1]: the analysis treats the inexact-polar error as an additive perturbation whose norm is controlled, yet it provides no uniform-in-iteration bound showing that the α-moment of the composed noise remains finite and independent of conditioning; without this, the optimality claim and the α-independent guarantee for σ1 = 0 may not hold as stated.
Authors: We agree that an explicit uniform-in-iteration bound strengthens the proof. Because the polar factor has operator norm at most 1, the α-moment of the composed noise (polar factor plus additive inexact error) is bounded by a constant such as (1 + σ1)^α that is independent of the iteration and of conditioning. We will add this uniform bound explicitly to the proof of Theorem 4.1 (with a cross-reference to the new lemma in Section 3), which also secures the α-independent guarantee when σ1 = 0. revision: yes
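A minimal reconstruction of the bound the response appeals to, assuming the update direction decomposes as the polar factor U_k plus an inexactness term E_k with operator norm at most σ1 (the paper's precise decomposition may differ):

```latex
\|U_k\|_{\mathrm{op}} = 1
\quad\text{and}\quad
\|E_k\|_{\mathrm{op}} \le \sigma_1
\;\Longrightarrow\;
\mathbb{E}\!\left[\|U_k + E_k\|_{\mathrm{op}}^{\alpha} \,\middle|\, \mathcal{F}_k\right]
\le (1+\sigma_1)^{\alpha}
\quad\text{for every } k \text{ and every } \alpha \in (1,2].
```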
Circularity Check
Analysis builds on external unified framework with no self-referential reduction in rates
Full rationale
The paper derives optimal iteration/sample complexity O(ε^(-(3α-2)/(α-1))) for ε-stationary points under heavy-tailed noise (α∈(1,2]) by building on a unified framework for inexact polar decomposition (Newton-Schulz etc.) that quantifies error propagation. No quoted equations or steps show the final bound reducing by construction to a fitted parameter, self-defined quantity, or load-bearing self-citation chain; the framework is invoked as external support, and the σ1=0 case even removes the need for prior α knowledge. This leaves the central claim with independent analytical content, consistent with only minor self-citation at most.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: Stochastic gradients are heavy-tailed with index α ∈ (1,2]
- domain assumption: Inexact polar decomposition errors are bounded and propagate controllably through the dynamics
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel (unclear)
  Relation between the paper passage and the cited Recognition theorem is unclear.
  Paper passage: "We establish an optimal iteration and sample complexity of O(ε^−(3α−2)/(α−1)) for finding an ε-stationary point, where α∈(1,2] denotes the heavy-tail index. ... unified framework for inexact polar decomposition ... Newton-Schulz"
- IndisputableMonolith/Foundation/BranchSelection.lean · branch_selection (unclear)
  Relation between the paper passage and the cited Recognition theorem is unclear.
  Paper passage: "Assumption 2.1 ... E[⟨M_k, T(M_k)⟩ | M_k] ≥ (1 − γ_k)∥M_k∥_* and E[∥T(M_k)∥_op² | M_k] ≤ (1 + ν_k)²"
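For intuition, a hedged numerical sketch of the two quoted conditions using a rank-r truncated SVD as a stand-in oracle T; the dimensions, rank, and realized γ_k, ν_k below are illustrative assumptions, not the paper's randomized scheme.

```python
import numpy as np

# Hypothetical check of the two quoted conditions for a rank-r truncated-SVD
# "polar factor", used as a crude deterministic stand-in for the paper's
# randomized low-rank scheme (whose sampling mechanism is not reproduced).
rng = np.random.default_rng(0)
M = rng.standard_normal((64, 32))
r = 8
U, s, Vt = np.linalg.svd(M, full_matrices=False)
T = U[:, :r] @ Vt[:r, :]                       # low-rank approximation T(M)

alignment = np.trace(M.T @ T) / s.sum()        # realized <M, T(M)> / ||M||_*,
                                               # must be >= 1 - gamma_k
op_norm_sq = np.linalg.norm(T, 2) ** 2         # realized ||T(M)||_op^2,
                                               # must be <= (1 + nu_k)^2
print(f"alignment = {alignment:.3f}, ||T(M)||_op^2 = {op_norm_sq:.3f}")
```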
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.