OptMuon: Closed-Loop Orthogonalized Momentum Methods for Stochastic Optimization with Zero-Noise Optimality

Ganzhao Yuan

arxiv: 2606.08783 · v2 · pith:G5TBRIGQnew · submitted 2026-06-07 · 🧮 math.OC · cs.LG · cs.NA · math.NA

Pith reviewed 2026-06-29 05:31 UTC · model grok-4.3

classification 🧮 math.OC · cs.LG · cs.NA · math.NA

keywords stochastic nonconvex optimization · orthogonalized momentum · closed-loop adaptation · noise-adaptive rates · zero-noise optimality · AdaGrad-Norm schedule · Muon-style orthogonalization

The pith

OptMuon pairs Muon-style orthogonal directions with a trajectory-calibrated AdaGrad-Norm schedule to obtain noise-adaptive stationarity rates that reduce to the optimal deterministic rate without retuning when noise vanishes.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces OptMuon, a family of methods that apply Muon-style polar-factor orthogonalization to the momentum while setting the update magnitude from the observed gradient and momentum history via a closed-loop AdaGrad-Norm schedule. Choosing the hyperparameters therefore requires no knowledge of the smoothness constant, the noise variance, or the bounded-gradient constant. Under lower-boundedness, unbiased stochastic gradients with bounded variance, smoothness, and an almost-sure bounded stochastic-gradient condition, OptMuon-A attains an expected-stationarity rate of \(\tilde{\mathcal O}(T^{-1/2}+\sigma^{1/2}T^{-1/4})\) under average smoothness, while OptMuon-I attains \(\tilde{\mathcal O}(T^{-1/2}+\sigma^{1/3}T^{-1/3})\) under individual smoothness. Both rates reduce automatically to \(\tilde{\mathcal O}(T^{-1/2})\) in the zero-noise limit \(\sigma = 0\). A sympathetic reader would care because the same algorithm, with the same settings, covers both the noisy stochastic regime and the clean deterministic one.

Core claim

OptMuon combines Muon-style polar-factor directions with a trajectory-dependent AdaGrad-Norm-type coefficient schedule whose magnitude is determined by the observed gradient and momentum history rather than by a prescribed Lipschitz-dependent rule. Under lower-boundedness, unbiased stochastic gradients with bounded variance, smoothness, and an almost-sure bounded stochastic-gradient condition, OptMuon-A achieves the noise-adaptive rate \(\tilde{\mathcal O}(T^{-1/2}+\sigma^{1/2}T^{-1/4})\) under average smoothness while OptMuon-I achieves \(\tilde{\mathcal O}(T^{-1/2}+\sigma^{1/3}T^{-1/3})\) under individual smoothness; both bounds reduce to \(\tilde{\mathcal O}(T^{-1/2})\) when the noise level is zero without any manual hyperparameter retuning.
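
To make the zero-noise reduction concrete, the sketch below evaluates only the leading terms of the two bounds. It drops all constants and logarithmic factors, which is an assumption of this illustration, and the function names are hypothetical rather than taken from the paper. With the noise level set to zero the second term vanishes in both expressions, and in both cases the noise term equals the deterministic term at the horizon T = sigma^(-2).

```python
def rate_a(T, sigma):
    """Leading terms of the OptMuon-A bound: T^(-1/2) + sigma^(1/2) * T^(-1/4)."""
    return T ** -0.5 + sigma ** 0.5 * T ** -0.25


def rate_i(T, sigma):
    """Leading terms of the OptMuon-I bound: T^(-1/2) + sigma^(1/3) * T^(-1/3)."""
    return T ** -0.5 + sigma ** (1.0 / 3.0) * T ** (-1.0 / 3.0)


def crossover_horizon(sigma):
    """Horizon at which the noise term equals T^(-1/2); it is sigma^(-2) for both bounds."""
    return sigma ** -2.0


if __name__ == "__main__":
    for sigma in (0.0, 0.1):
        for T in (10**2, 10**4, 10**6):
            print(sigma, T, rate_a(T, sigma), rate_i(T, sigma))
```

Past the crossover horizon the individual-smoothness expression is the smaller of the two, since its noise term decays as T^(-1/3) rather than T^(-1/4).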

What carries the argument

The closed-loop AdaGrad-Norm-type coefficient schedule applied to Muon-style polar-factor directions, equipped with a running-maximum correction to avoid coefficient collapse from isolated spikes.
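
The two ingredients named above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the paper's algorithm: the coefficient uses the generic AdaGrad-Norm form alpha / sqrt(b0^2 + sum of squared gradient norms) driven by gradients only, the running-maximum correction is omitted because its exact form is not given on this page, and the names `closed_loop_step`, `beta`, `alpha`, and `accum` are hypothetical. The polar factor is computed with an exact SVD for clarity, whereas Muon implementations typically approximate it with Newton-Schulz iterations.

```python
import numpy as np


def polar_factor(M):
    """Return the polar factor U @ Vt of M, computed from a thin SVD."""
    U, _, Vt = np.linalg.svd(M, full_matrices=False)
    return U @ Vt


def closed_loop_step(X, M, G, accum, beta=0.9, alpha=1.0):
    """One illustrative update; beta, alpha and accum are hypothetical names.

    accum holds an initial b0^2 plus the squared Frobenius norms of all
    gradients seen so far, so the coefficient alpha / sqrt(accum) is set by
    the trajectory and never by a smoothness or variance constant.
    """
    M = beta * M + (1.0 - beta) * G       # momentum average of gradients
    accum = accum + float(np.sum(G * G))  # AdaGrad-Norm accumulator
    eta = alpha / np.sqrt(accum)          # closed-loop coefficient
    X = X - eta * polar_factor(M)         # step along the orthogonalized direction
    return X, M, accum, eta


rng = np.random.default_rng(0)
X = rng.standard_normal((4, 3))
M = np.zeros_like(X)
accum = 1e-8  # assumed small positive initial value b0^2
etas = []
for _ in range(5):
    # Noisy gradient of the toy objective 0.5 * ||X||_F^2 (an assumed test problem).
    G = X + 0.1 * rng.standard_normal(X.shape)
    X, M, accum, eta = closed_loop_step(X, M, G, accum)
    etas.append(eta)
```

Because the accumulator only grows, the coefficient in this simplified form can only shrink; a single large gradient therefore depresses every later step, which is the collapse that the running-maximum correction is described as guarding against.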

If this is right

The magnitude schedule requires no knowledge of the smoothness constant, variance level, or bounded-gradient constant.
The running-maximum correction prevents isolated gradient spikes from causing excessive coefficient collapse.
Both OptMuon variants retain noise adaptivity while automatically recovering the nearly optimal deterministic first-order rate up to logarithmic factors when noise is absent.
The guarantees apply to stochastic nonconvex optimization under the stated assumptions.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The distinction between average and individual smoothness versions suggests the method may behave differently on heterogeneous versus homogeneous data distributions.
Closed-loop scalar adaptation of this form could be combined with other orthogonalization techniques beyond the Muon polar-factor construction.
Relaxing the almost-sure bounded-gradient condition while preserving the rates would be a direct next theoretical question.

Load-bearing premise

The analysis requires an almost-sure bounded stochastic-gradient condition beyond standard bounded-variance and smoothness assumptions to keep the running-maximum correction from failing.

What would settle it

A concrete sequence of unbiased stochastic gradients with bounded variance but not almost-surely bounded, for which the claimed rates fail to hold under the running-maximum correction.
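
As a hedged illustration of the gap between the two hypotheses, additive Gaussian noise is unbiased and has bounded variance, yet it is not almost-surely bounded. The sketch below tracks the running maximum of the noise magnitude; the names `noisy_gradient` and `running_max_abs` are hypothetical. It only shows that such noise lies outside the bounded-gradient hypothesis. It does not show that the claimed rates fail, which is what a settling counter-example would have to establish.

```python
import random


def noisy_gradient(true_grad, sigma, rng):
    """Scalar stochastic gradient: true value plus N(0, sigma^2) noise."""
    return true_grad + sigma * rng.gauss(0.0, 1.0)


def running_max_abs(samples):
    """Running maximum of |sample| along the sequence."""
    peaks, current = [], 0.0
    for s in samples:
        current = max(current, abs(s))
        peaks.append(current)
    return peaks


rng = random.Random(0)
samples = [noisy_gradient(0.0, 1.0, rng) for _ in range(20000)]
peaks = running_max_abs(samples)

mean = sum(samples) / len(samples)
variance = sum((s - mean) ** 2 for s in samples) / len(samples)
print("sample mean:", mean)
print("sample variance:", variance)
print("running max after 10, 1000, 20000 draws:", peaks[9], peaks[999], peaks[-1])
```

The sample mean and variance stay near 0 and 1, while no fixed constant bounds the running maximum for every sample size.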

Original abstract

Orthogonalized momentum updates, as used in Muon-style optimizers, have recently shown strong empirical stability in large-scale deep learning. However, most current orthogonalized methods are still paired with fixed, externally scheduled, or otherwise open-loop magnitude rules, so their scale is not directly calibrated from the realized optimization trajectory. Motivated by the closed-loop perspective behind Lipschitz-free and noise-adaptive methods, we propose OptMuon, a family of adaptive momentum orthogonalization methods for stochastic nonconvex optimization. OptMuon combines Muon-style polar-factor directions with a trajectory-dependent AdaGrad-Norm-type coefficient schedule, so that the update magnitude is determined by the observed gradient and momentum history rather than by a prescribed Lipschitz-dependent rule. The schedule does not use the smoothness constant, the variance level, or the bounded-gradient constant in parameter selection, and its running-maximum correction prevents isolated gradient spikes from causing excessive coefficient collapse. Under lower-boundedness, unbiased stochastic gradients with bounded variance, smoothness, and an almost-sure bounded stochastic-gradient condition, we prove two complementary expected-stationarity guarantees. OptMuon-A achieves the noise-adaptive rate \(\tilde{\mathcal O}(T^{-1/2}+\sigma^{1/2}T^{-1/4})\) under average smoothness, while OptMuon-I achieves \(\tilde{\mathcal O}(T^{-1/2}+\sigma^{1/3}T^{-1/3})\) under individual smoothness. In the zero-noise regime, both bounds automatically reduce to a nearly optimal deterministic first-order rate \(\tilde{\mathcal O}(T^{-1/2})\) without manual hyperparameter retuning. These results show that closed-loop scalar adaptation can be combined with Muon-style momentum orthogonalization while retaining noise adaptivity and zero-noise optimality up to logarithmic factors.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

OptMuon pairs Muon orthogonalization with a closed-loop AdaGrad-Norm schedule to claim noise-adaptive rates that hit deterministic optimality without retuning, but the extra almost-sure bounded-gradient assumption carries real weight.

The paper's core move is to take Muon-style polar-factor directions and replace the usual fixed or open-loop magnitude rule with a trajectory-dependent AdaGrad-Norm schedule that uses a running-maximum correction. This produces two expected-stationarity bounds: one under average smoothness that scales as \(\tilde{\mathcal O}(T^{-1/2}+\sigma^{1/2}T^{-1/4})\) and another under individual smoothness that scales as \(\tilde{\mathcal O}(T^{-1/2}+\sigma^{1/3}T^{-1/3})\). Both collapse to the near-optimal deterministic rate \(\tilde{\mathcal O}(T^{-1/2})\) when \(\sigma = 0\), and the schedule itself does not require the user to supply smoothness, variance, or bound constants.

That combination of orthogonal momentum with closed-loop scalar adaptation is not a direct restatement of the Muon or Lipschitz-free lines cited in the abstract, so the pairing itself is new. The running-maximum correction is a concrete device meant to stop isolated spikes from driving the coefficient to zero, and the zero-noise optimality without retuning is a clean selling point if the analysis holds.

The clearest limitation is the additional almost-sure bounded stochastic-gradient condition. Bounded variance alone permits paths where gradients become arbitrarily large on sets of positive probability, so the extra assumption is not implied by the other hypotheses and is explicitly linked in the abstract to keeping the correction from collapsing. That makes the noise-adaptive and zero-noise claims rest on a stronger hypothesis than the standard bounded-variance setup. The abstract states the rates under lower-boundedness, unbiased gradients, bounded variance, smoothness, and this condition, but without the full derivation it is hard to judge how much slack the correction actually buys or whether the rates remain meaningful when the condition is relaxed.

The work is aimed at readers who track theoretical adaptive methods for nonconvex stochastic optimization and want to see orthogonal momentum analyzed under closed-loop scaling. It is coherent on its own terms and engages the relevant literature, so it deserves a serious referee to check the proofs and the necessity of the bounded-gradient hypothesis.

Referee Report

2 major / 2 minor

Summary. The paper proposes OptMuon, a family of closed-loop adaptive orthogonalized momentum methods for stochastic nonconvex optimization. It pairs Muon-style polar-factor directions with a trajectory-dependent AdaGrad-Norm-type scalar schedule whose magnitude is determined by observed gradient and momentum history (with a running-maximum correction). Under lower-boundedness, unbiased stochastic gradients with bounded variance, smoothness, and an additional almost-sure bounded stochastic-gradient condition, two variants are analyzed: OptMuon-A yields the noise-adaptive rate Õ(T^{-1/2} + σ^{1/2} T^{-1/4}) under average smoothness, while OptMuon-I yields Õ(T^{-1/2} + σ^{1/3} T^{-1/3}) under individual smoothness; both reduce automatically to Õ(T^{-1/2}) in the zero-noise regime without retuning.

Significance. If the stated rates and zero-noise optimality hold, the work shows that Muon-style orthogonalization can be combined with fully closed-loop scalar adaptation while retaining noise adaptivity and deterministic optimality up to logarithmic factors. This would be a useful addition to the literature on adaptive first-order methods that avoid manual tuning to the Lipschitz constant or the noise level.

major comments (2)

[Abstract / assumptions paragraph] Abstract (and the theorems whose statements appear there): the almost-sure bounded stochastic-gradient condition is listed as an explicit hypothesis required for the running-maximum correction to prevent coefficient collapse. Bounded variance alone permits sample paths with arbitrarily large gradients on sets of positive probability, so the extra assumption is not redundant; the manuscript should either (a) exhibit a concrete counter-example showing that the claimed rates can fail without it or (b) derive the rates under only the standard bounded-variance hypothesis.
[Abstract / convergence theorems] Abstract: the two complementary rates are stated for average vs. individual smoothness, yet the manuscript does not indicate whether the individual-smoothness result (OptMuon-I) can be recovered from the average-smoothness analysis (OptMuon-A) by a standard localization argument or whether a genuinely different proof technique is required; this affects how load-bearing the individual-smoothness case is.

minor comments (2)

[Abstract] The abstract uses the notation Õ without defining the hidden logarithmic factors; a brief parenthetical clarification would improve readability.
[Abstract] The phrase “zero-noise optimality” is used; it would be helpful to state explicitly whether the deterministic rate matches the known lower bound up to the precise logarithmic factors or only up to constants.

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for the careful reading and constructive comments. Below we respond point-by-point to the major comments, indicating planned revisions where appropriate.

Referee: [Abstract / assumptions paragraph] Abstract (and the theorems whose statements appear there): the almost-sure bounded stochastic-gradient condition is listed as an explicit hypothesis required for the running-maximum correction to prevent coefficient collapse. Bounded variance alone permits sample paths with arbitrarily large gradients on sets of positive probability, so the extra assumption is not redundant; the manuscript should either (a) exhibit a concrete counter-example showing that the claimed rates can fail without it or (b) derive the rates under only the standard bounded-variance hypothesis.

Authors: We agree that the almost-sure bounded stochastic-gradient condition is strictly stronger than bounded variance and is used to guarantee that the running-maximum correction prevents coefficient collapse on sample paths containing arbitrarily large gradient realizations. Bounded variance alone does not preclude such paths. Constructing an explicit counter-example under only bounded variance would require designing a specific stochastic process that exploits the adaptive coefficient mechanism to violate the claimed rates; this lies outside the scope of the present analysis. Likewise, removing the assumption would necessitate a substantially different argument (e.g., truncation or higher-moment controls) that is not immediate from the existing proof. In the revision we will add a dedicated remark clarifying the role of the assumption and noting that its relaxation is left for future work.

revision: partial

Referee: [Abstract / convergence theorems] Abstract: the two complementary rates are stated for average vs. individual smoothness, yet the manuscript does not indicate whether the individual-smoothness result (OptMuon-I) can be recovered from the average-smoothness analysis (OptMuon-A) by a standard localization argument or whether a genuinely different proof technique is required; this affects how load-bearing the individual-smoothness case is.

Authors: The OptMuon-I guarantee under individual smoothness requires a genuinely different proof technique. The OptMuon-A analysis exploits a uniform (average) smoothness bound that permits a global control on the trajectory-dependent coefficients, whereas the individual-smoothness setting involves iteration-dependent Lipschitz constants that couple directly with the history-dependent adaptation; a standard localization argument does not close the estimates. We will revise the manuscript to state this distinction explicitly and to outline the principal differences between the two proof strategies.

revision: yes

standing simulated objections not resolved

Exhibiting a concrete counter-example under bounded variance alone or deriving the rates without the almost-sure bounded stochastic-gradient condition.

Circularity Check

0 steps flagged

No circularity: rates derived from external assumptions via analysis

The paper proposes OptMuon and states convergence rates proved under lower-boundedness, unbiased gradients with bounded variance, smoothness, and an additional almost-sure bounded stochastic-gradient condition. These rates are expressed directly in terms of \(T\) and \(\sigma\); no equations or self-citations reduce the claimed \(\tilde{\mathcal O}(T^{-1/2} + \dots)\) bounds to fitted parameters, self-defined quantities, or prior author results by construction. The derivation chain relies on standard stochastic optimization analysis under the listed hypotheses rather than any of the enumerated circular patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 4 axioms · 0 invented entities

The central claims rest on three domain assumptions standard in stochastic nonconvex optimization plus one additional bounded-gradient condition required for the adaptive schedule, for four axioms in total; no free parameters are introduced or fitted in the method or rates.

axioms (4)

domain assumption: The objective function is lower-bounded. Invoked to guarantee convergence to stationarity.
domain assumption: Stochastic gradients are unbiased with bounded variance. Used to control the noise term in the expected-stationarity bounds.
domain assumption: The objective is smooth (average or individual). Required for the two complementary rate statements.
domain assumption: Almost-sure bounded stochastic-gradient condition. Additional assumption needed for the running-maximum correction to avoid collapse.
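
For orientation, the four ledger entries are usually written in the following textbook forms. These are assumed generic statements, not the paper's own: the symbols \(f_*\), \(L\), \(G\), the sample \(\xi\), and the choice of norm are placeholders introduced here, and the paper may state the conditions for matrix variables with different norms. Only \(\sigma\) appears in the source.

```latex
\begin{align*}
&\text{(lower-boundedness)} && f_* := \inf_x f(x) > -\infty, \\
&\text{(unbiasedness, bounded variance)} && \mathbb{E}_\xi[\nabla f(x;\xi)] = \nabla f(x), \quad
  \mathbb{E}_\xi\|\nabla f(x;\xi) - \nabla f(x)\|^2 \le \sigma^2, \\
&\text{(average smoothness)} && \mathbb{E}_\xi\|\nabla f(x;\xi) - \nabla f(y;\xi)\|^2 \le L^2 \|x - y\|^2, \\
&\text{(individual smoothness)} && \|\nabla f(x;\xi) - \nabla f(y;\xi)\| \le L \|x - y\| \ \text{for almost every } \xi, \\
&\text{(a.s. bounded stochastic gradient)} && \|\nabla f(x;\xi)\| \le G \ \text{almost surely.}
\end{align*}
```

In these forms the last condition implies a variance bound of order \(G^2\) but is not implied by any finite \(\sigma\), which is why the ledger counts it as a separate axiom.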

