OptMuon: Closed-Loop Orthogonalized Momentum Methods for Stochastic Optimization with Zero-Noise Optimality
Pith reviewed 2026-06-29 05:31 UTC · model grok-4.3
The pith
OptMuon pairs Muon-style orthogonal directions with a trajectory-calibrated AdaGrad-Norm schedule to obtain noise-adaptive stationarity rates that reduce to the optimal deterministic rate without retuning when noise vanishes.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
OptMuon combines Muon-style polar-factor directions with a trajectory-dependent AdaGrad-Norm-type coefficient schedule whose magnitude is determined by the observed gradient and momentum history rather than by a prescribed Lipschitz-dependent rule. Under lower-boundedness, unbiased stochastic gradients with bounded variance, smoothness, and an almost-sure bounded stochastic-gradient condition, OptMuon-A achieves the noise-adaptive rate Õ(T^{-1/2} + σ^{1/2} T^{-1/4}) under average smoothness, while OptMuon-I achieves Õ(T^{-1/2} + σ^{1/3} T^{-1/3}) under individual smoothness. Both bounds reduce to Õ(T^{-1/2}) when the noise level is zero, without any manual hyperparameter retuning.
What carries the argument
The closed-loop AdaGrad-Norm-type coefficient schedule applied to Muon-style polar-factor directions, equipped with a running-maximum correction to avoid coefficient collapse from isolated spikes.
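To make the two ingredients named above concrete, here is a minimal Python sketch. It is not the paper's algorithm. It assumes the polar factor is approximated by the classical cubic Newton-Schulz iteration, and that the scalar coefficient takes the textbook AdaGrad-Norm form alpha / sqrt(b0^2 + sum of squared gradient norms). The function names, `alpha`, `b0`, and the iteration count are hypothetical. The running-maximum correction and the momentum recursion are left out because this summary does not specify them.

```python
import math


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(a):
    """Return the transpose of a list-of-rows matrix."""
    return [list(col) for col in zip(*a)]


def fro_norm(a):
    """Frobenius norm of a list-of-rows matrix."""
    return math.sqrt(sum(v * v for row in a for v in row))


def polar_factor(m, iters=60):
    """Approximate the orthogonal polar factor of a nonzero, full-rank matrix.

    Assumption: classical cubic Newton-Schulz iteration
    X <- 1.5 X - 0.5 X X^T X, started from m / ||m||_F so that all
    singular values lie in (0, 1]. Each singular value is driven to 1.
    """
    scale = fro_norm(m)
    x = [[v / scale for v in row] for row in m]
    for _ in range(iters):
        xxtx = matmul(matmul(x, transpose(x)), x)
        x = [[1.5 * p - 0.5 * q for p, q in zip(rx, rq)]
             for rx, rq in zip(x, xxtx)]
    return x


def adagrad_norm_coeff(grad_norms, alpha=1.0, b0=1e-8):
    """Textbook AdaGrad-Norm scalar built from the observed gradient norms.

    Hypothetical stand-in for the paper's closed-loop schedule; it uses
    no smoothness, variance, or gradient-bound constants.
    """
    return alpha / math.sqrt(b0 ** 2 + sum(g * g for g in grad_norms))


def step(weights, momentum, grad_norms, alpha=1.0):
    """One illustrative update: move along the polar factor of the
    momentum matrix, scaled by the trajectory-dependent coefficient."""
    eta = adagrad_norm_coeff(grad_norms, alpha)
    direction = polar_factor(momentum)
    return [[w - eta * d for w, d in zip(rw, rd)]
            for rw, rd in zip(weights, direction)]
```

For instance, `polar_factor([[3.0, 1.0], [1.0, 2.0]])` should return approximately the identity, since the polar factor of a symmetric positive definite matrix is the identity. The direction therefore carries no scale information, and the step length comes entirely from the scalar coefficient.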
If this is right
- The magnitude schedule requires no knowledge of the smoothness constant, variance level, or bounded-gradient constant.
- The running-maximum correction prevents isolated gradient spikes from causing excessive coefficient collapse.
- Both OptMuon variants retain noise adaptivity while automatically recovering the nearly optimal deterministic first-order rate up to logarithmic factors when noise is absent.
- The guarantees apply to stochastic nonconvex optimization under the stated assumptions.
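The zero-noise claim and the location of the noise-dominated regime follow from the stated rates by elementary algebra. The worked comparison below ignores the logarithmic factors hidden in the Õ notation and uses only the two bounds quoted above. It is a reading aid, not a step from the paper's proof.

```latex
% Zero noise: both bounds collapse to the deterministic rate.
\[
\sigma = 0:\quad
T^{-1/2} + \sigma^{1/2} T^{-1/4} = T^{-1/2},
\qquad
T^{-1/2} + \sigma^{1/3} T^{-1/3} = T^{-1/2}.
\]
% When does the noise term dominate? The threshold is the same in both cases.
\[
\sigma^{1/2} T^{-1/4} \ge T^{-1/2} \iff \sigma \ge T^{-1/2},
\qquad
\sigma^{1/3} T^{-1/3} \ge T^{-1/2} \iff \sigma \ge T^{-1/2}.
\]
% Ratio of the two noise terms for sigma > 0.
\[
\frac{\sigma^{1/3} T^{-1/3}}{\sigma^{1/2} T^{-1/4}}
= \left(\sigma^{2} T\right)^{-1/12} \le 1
\iff \sigma \ge T^{-1/2}.
\]
```

So, up to logarithmic factors, both bounds sit at the deterministic rate while σ ≤ T^{-1/2}. Beyond that threshold the OptMuon-I noise term is the smaller of the two, keeping in mind that the two bounds are stated under different smoothness assumptions.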
Where Pith is reading between the lines
- The distinction between average and individual smoothness versions suggests the method may behave differently on heterogeneous versus homogeneous data distributions.
- Closed-loop scalar adaptation of this form could be combined with other orthogonalization techniques beyond the Muon polar-factor construction.
- Relaxing the almost-sure bounded-gradient condition while preserving the rates would be a direct next theoretical question.
Load-bearing premise
The analysis requires an almost-sure bounded stochastic-gradient condition beyond standard bounded-variance and smoothness assumptions to keep the running-maximum correction from failing.
What would settle it
A concrete sequence of unbiased stochastic gradients with bounded variance but not almost-surely bounded, for which the claimed rates fail to hold under the running-maximum correction.
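As a small illustration of the hypothesis gap only, zero-mean Gaussian noise is unbiased and has bounded variance, yet it is not almost-surely bounded, so the running maximum of its magnitude keeps growing with the number of samples. The Python sketch below shows this with hypothetical helper names. It does not construct the counter-example described above and says nothing about whether the claimed rates actually fail.

```python
import random


def running_max_abs(samples):
    """Return the running maximum of |s| over the sample sequence."""
    out, best = [], 0.0
    for s in samples:
        best = max(best, abs(s))
        out.append(best)
    return out


def gaussian_noise(n, sigma=1.0, seed=0):
    """Draw n zero-mean Gaussian noise values with standard deviation sigma.

    Assumption: scalar noise stands in for a stochastic-gradient error;
    the variance is sigma**2, but no finite bound holds almost surely.
    """
    rng = random.Random(seed)
    return [rng.gauss(0.0, sigma) for _ in range(n)]


noise = gaussian_noise(100_000)
peaks = running_max_abs(noise)

# The running maximum never decreases and has no finite ceiling as n grows.
for n in (100, 10_000, 100_000):
    print(n, round(peaks[n - 1], 3))
```

Any analysis that relies on a fixed almost-sure bound G on the stochastic gradient cannot be applied to such noise as is, which is why the extra condition is a genuine restriction rather than a technicality.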
Original abstract
Orthogonalized momentum updates, as used in Muon-style optimizers, have recently shown strong empirical stability in large-scale deep learning. However, most current orthogonalized methods are still paired with fixed, externally scheduled, or otherwise open-loop magnitude rules, so their scale is not directly calibrated from the realized optimization trajectory. Motivated by the closed-loop perspective behind Lipschitz-free and noise-adaptive methods, we propose OptMuon, a family of adaptive momentum orthogonalization methods for stochastic nonconvex optimization. OptMuon combines Muon-style polar-factor directions with a trajectory-dependent AdaGrad-Norm-type coefficient schedule, so that the update magnitude is determined by the observed gradient and momentum history rather than by a prescribed Lipschitz-dependent rule. The schedule does not use the smoothness constant, the variance level, or the bounded-gradient constant in parameter selection, and its running-maximum correction prevents isolated gradient spikes from causing excessive coefficient collapse. Under lower-boundedness, unbiased stochastic gradients with bounded variance, smoothness, and an almost-sure bounded stochastic-gradient condition, we prove two complementary expected-stationarity guarantees. OptMuon-A achieves the noise-adaptive rate \(\tilde{\mathcal O}(T^{-1/2}+\sigma^{1/2}T^{-1/4})\) under average smoothness, while OptMuon-I achieves \(\tilde{\mathcal O}(T^{-1/2}+\sigma^{1/3}T^{-1/3})\) under individual smoothness. In the zero-noise regime, both bounds automatically reduce to a nearly optimal deterministic first-order rate \(\tilde{\mathcal O}(T^{-1/2})\) without manual hyperparameter retuning. These results show that closed-loop scalar adaptation can be combined with Muon-style momentum orthogonalization while retaining noise adaptivity and zero-noise optimality up to logarithmic factors.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes OptMuon, a family of closed-loop adaptive orthogonalized momentum methods for stochastic nonconvex optimization. It pairs Muon-style polar-factor directions with a trajectory-dependent AdaGrad-Norm-type scalar schedule whose magnitude is determined by observed gradient and momentum history (with a running-maximum correction). Under lower-boundedness, unbiased stochastic gradients with bounded variance, smoothness, and an additional almost-sure bounded stochastic-gradient condition, two variants are analyzed: OptMuon-A yields the noise-adaptive rate Õ(T^{-1/2} + σ^{1/2} T^{-1/4}) under average smoothness, while OptMuon-I yields Õ(T^{-1/2} + σ^{1/3} T^{-1/3}) under individual smoothness; both reduce automatically to Õ(T^{-1/2}) in the zero-noise regime without retuning.
Significance. If the stated rates and zero-noise optimality hold, the work shows that Muon-style orthogonalization can be combined with fully closed-loop scalar adaptation while retaining noise adaptivity and deterministic optimality up to logarithmic factors. This would be a useful addition to the literature on adaptive first-order methods that avoid manual Lipschitz or noise tuning.
major comments (2)
- [Abstract / assumptions paragraph] Abstract (and the theorems whose statements appear there): the almost-sure bounded stochastic-gradient condition is listed as an explicit hypothesis required for the running-maximum correction to prevent coefficient collapse. Bounded variance alone permits sample paths with arbitrarily large gradients on sets of positive probability, so the extra assumption is not redundant; the manuscript should either (a) exhibit a concrete counter-example showing that the claimed rates can fail without it or (b) derive the rates under only the standard bounded-variance hypothesis.
- [Abstract / convergence theorems] Abstract: the two complementary rates are stated for average vs. individual smoothness, yet the manuscript does not indicate whether the individual-smoothness result (OptMuon-I) can be recovered from the average-smoothness analysis (OptMuon-A) by a standard localization argument or whether a genuinely different proof technique is required; this affects how load-bearing the individual-smoothness case is.
minor comments (2)
- [Abstract] The abstract uses the notation Õ without defining the hidden logarithmic factors; a brief parenthetical clarification would improve readability.
- [Abstract] The phrase “zero-noise optimality” is used; it would be helpful to state explicitly whether the deterministic rate matches the known lower bound up to the precise logarithmic factors or only up to constants.
Simulated Author's Rebuttal
We thank the referee for the careful reading and constructive comments. Below we respond point-by-point to the major comments, indicating planned revisions where appropriate.
- Referee: [Abstract / assumptions paragraph] Abstract (and the theorems whose statements appear there): the almost-sure bounded stochastic-gradient condition is listed as an explicit hypothesis required for the running-maximum correction to prevent coefficient collapse. Bounded variance alone permits sample paths with arbitrarily large gradients on sets of positive probability, so the extra assumption is not redundant; the manuscript should either (a) exhibit a concrete counter-example showing that the claimed rates can fail without it or (b) derive the rates under only the standard bounded-variance hypothesis.
  Authors: We agree that the almost-sure bounded stochastic-gradient condition is strictly stronger than bounded variance and is used to guarantee that the running-maximum correction prevents coefficient collapse on sample paths containing arbitrarily large gradient realizations. Bounded variance alone does not preclude such paths. Constructing an explicit counter-example under only bounded variance would require designing a specific stochastic process that exploits the adaptive coefficient mechanism to violate the claimed rates; this lies outside the scope of the present analysis. Likewise, removing the assumption would necessitate a substantially different argument (e.g., truncation or higher-moment controls) that is not immediate from the existing proof. In the revision we will add a dedicated remark clarifying the role of the assumption and noting that its relaxation is left for future work.
  revision: partial
- Referee: [Abstract / convergence theorems] Abstract: the two complementary rates are stated for average vs. individual smoothness, yet the manuscript does not indicate whether the individual-smoothness result (OptMuon-I) can be recovered from the average-smoothness analysis (OptMuon-A) by a standard localization argument or whether a genuinely different proof technique is required; this affects how load-bearing the individual-smoothness case is.
  Authors: The OptMuon-I guarantee under individual smoothness requires a genuinely different proof technique. The OptMuon-A analysis exploits a uniform (average) smoothness bound that permits a global control on the trajectory-dependent coefficients, whereas the individual-smoothness setting involves iteration-dependent Lipschitz constants that couple directly with the history-dependent adaptation; a standard localization argument does not close the estimates. We will revise the manuscript to state this distinction explicitly and to outline the principal differences between the two proof strategies.
  revision: yes
- Exhibiting a concrete counter-example under bounded variance alone or deriving the rates without the almost-sure bounded stochastic-gradient condition.
Circularity Check
No circularity: rates derived from external assumptions via analysis
The paper proposes OptMuon and states convergence rates proved under lower-boundedness, unbiased gradients with bounded variance, smoothness, and an additional almost-sure bounded stochastic-gradient condition. These rates are expressed directly in terms of T and σ; no equations or self-citations reduce the claimed Õ(T^{-1/2} + ...) bounds to fitted parameters, self-defined quantities, or prior author results by construction. The derivation chain relies on standard stochastic optimization analysis under the listed hypotheses rather than any of the enumerated circular patterns.
Axiom & Free-Parameter Ledger
axioms (4)
- domain assumption: The objective function is lower-bounded
- domain assumption: Stochastic gradients are unbiased with bounded variance
- domain assumption: The objective is smooth (average or individual)
- domain assumption: Almost-sure bounded stochastic-gradient condition