Recognition: unknown
A unified convergence theory for adaptive first-order methods in the nonconvex case, including AdaNorm, full and diagonal AdaGrad, Shampoo and Muo
Pith reviewed 2026-05-10 07:16 UTC · model grok-4.3
The pith
A single framework proves global convergence rates for AdaGrad, Shampoo and related adaptive methods in nonconvex stochastic optimization.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We introduce a unified framework for first-order optimization algorithms that rely on adaptively preconditioned gradients. The framework includes full and diagonal AdaGrad, AdaNorm, and adaptive variants of Shampoo and Muon, and it permits heterogeneous geometries across different groups of variables while keeping a single convergence proof. For every method covered, we give a fully stochastic global rate-of-convergence analysis both with and without two forms of momentum, relying solely on variance assumptions for the gradient oracle.
What carries the argument
The unified framework of adaptively preconditioned gradients that permits heterogeneous geometries across variable groups while supporting a common convergence analysis.
If this is right
- Every method inside the framework, including those with momentum, achieves a global rate of convergence to stationary points under the stated variance assumptions.
- The same analysis covers both full-matrix and diagonal preconditioners as well as combinations across variable groups.
- No separate proof is needed for each new adaptive variant that fits the framework.
- Convergence holds without assuming bounded stochastic gradients or sufficiently small stepsizes.
Where Pith is reading between the lines
- New adaptive preconditioners could be designed simply by verifying they fit inside the framework, thereby inheriting the convergence guarantees automatically.
- The allowance for heterogeneous geometries suggests that mixed-precision or block-wise adaptive schemes may converge under the same mild variance conditions.
- The result opens the possibility of testing whether other popular but unanalyzed adaptive methods can be recast in this form to obtain immediate rate guarantees.
Load-bearing premise
The stochastic gradient oracle has finite variance, without any bound on the gradients themselves or restriction on stepsize size.
What would settle it
A concrete nonconvex problem together with a stochastic gradient oracle of finite variance on which diagonal AdaGrad (or any other method inside the framework) fails to reach a stationary point at the claimed rate.
read the original abstract
A unified framework for first-order optimization algorithms fornonconvex unconstrained optimization is proposed that uses adaptivelypreconditioned gradients and includes popular methods such as full anddiagonal AdaGrad, AdaNorm, as well as adpative variants of Shampoo andMuon. This framework also allows combining heterogeneous geometriesacross different groups of variables while preserving a unifiedconvergence analysis. A fully stochastic global rate-of-convergenceanalysis is conducted for all methods in the framework, with andwithout two types of momentum, using reasonable assumptions on thevariance of the gradient oracle and without assuming boundedstochastic gradients or small enough stepsize.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces a unified framework for adaptive first-order methods in nonconvex unconstrained optimization using adaptively preconditioned gradients. It includes full and diagonal AdaGrad, AdaNorm, adaptive Shampoo, and Muon, and allows for heterogeneous geometries across variable groups. The paper provides a fully stochastic global rate-of-convergence analysis for these methods with and without momentum, under assumptions on the variance of the gradient oracle, without assuming bounded stochastic gradients or small step sizes.
Significance. If the analysis is correct, this work would offer a substantial advance in the theoretical understanding of adaptive optimizers in nonconvex settings, which are prevalent in machine learning. By relaxing standard assumptions and providing a unified treatment including momentum and heterogeneous preconditioners, it could bridge the gap between theory and practice for methods like AdaGrad and Shampoo.
major comments (1)
- [Convergence Analysis] The central claim that reasonable variance assumptions on the gradient oracle suffice for global rates without bounded gradients or small stepsizes is load-bearing. The skeptic's concern is valid: a standard bound E[||noise||^2] <= sigma^2 may not control the growth of adaptive preconditioners (e.g., cumulative 1/sqrt(v_t) terms) in unbounded gradient regimes common in nonconvex problems. The paper should explicitly show in the proof how the variance bound closes the Lyapunov analysis for arbitrary gradient magnitudes; if this is done via a specific lemma, it needs to be clearly stated as it underpins all results.
minor comments (2)
- [Abstract] There are minor typos: 'adpative' should be 'adaptive', and 'Muo' should be 'Muon'. Also, 'nonconvex' is missing a space after 'for' in the first sentence.
- [General] The paper would benefit from a clear statement of the main theorem early on, with all assumptions listed explicitly.
Simulated Author's Rebuttal
We thank the referee for the careful reading and for recognizing the potential significance of the unified framework. We address the major comment on the convergence analysis below and will revise the manuscript accordingly to improve explicitness.
read point-by-point responses
-
Referee: [Convergence Analysis] The central claim that reasonable variance assumptions on the gradient oracle suffice for global rates without bounded gradients or small stepsizes is load-bearing. The skeptic's concern is valid: a standard bound E[||noise||^2] <= sigma^2 may not control the growth of adaptive preconditioners (e.g., cumulative 1/sqrt(v_t) terms) in unbounded gradient regimes common in nonconvex problems. The paper should explicitly show in the proof how the variance bound closes the Lyapunov analysis for arbitrary gradient magnitudes; if this is done via a specific lemma, it needs to be clearly stated as it underpins all results.
Authors: We thank the referee for highlighting this critical point. The analysis does rely on the variance bound E[||noise||^2] <= sigma^2 (Assumption 2.2) to obtain the stated rates without bounded gradients or small step sizes. In the current proof of Theorem 3.1, this bound is applied to the stochastic error term after expanding the Lyapunov function decrease; the adaptive preconditioner (whether diagonal, full-matrix, or heterogeneous as in Shampoo/Muon) enters the analysis via a telescoping sum that absorbs the gradient magnitude into the denominator, preventing uncontrolled growth. However, we agree that the connection is not stated with sufficient clarity for a skeptical reader. We will revise the manuscript by adding an explicit new lemma (Lemma 3.4) that isolates and proves the bounding step for arbitrary gradient magnitudes, showing precisely how the variance assumption closes the inequality without invoking boundedness. This lemma will be referenced at the key step in the main proof and in the discussion of each method (AdaGrad, AdaNorm, Shampoo, Muon). revision: yes
Circularity Check
No circularity: rates derived from explicit variance assumptions
full rationale
The paper defines a unified framework for adaptively preconditioned first-order methods and derives global stochastic convergence rates directly from stated assumptions on the variance of the gradient oracle, without bounded gradients or small stepsize restrictions. The analysis proceeds via a potential or Lyapunov argument that incorporates the adaptive preconditioners (AdaGrad, Shampoo, etc.) under those variance bounds. No step reduces a claimed result to a fitted parameter, self-referential definition, or load-bearing self-citation whose validity is presupposed by the present work. The derivation remains independent of the target rates and does not rename known empirical patterns or smuggle ansatzes.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption reasonable assumptions on the variance of the gradient oracle
Reference graph
Works this paper leans on
-
[1]
Alacaoglu and H
A. Alacaoglu and H. Lyu. Convergence of first-order methods for constrained nonconvex optimization with dependent data. InInternational Conference on Machine Learning. PMLR, pages 458–489, 2023
2023
-
[2]
Alacaoglu, Y
A. Alacaoglu, Y. Malitsky, and V. Cevher. Convergence od adaptive algorithms for constrained weakly-convex optimization. InProceedings of the 35th Conference on Neural Information Processing Systems (NEURIPS 2021), 2021
2021
-
[3]
R. Anil, V. Gupta, T. Koren, K. Regan, and Y. Singer. Scalable second order optimization for deep learning. InProceedings of the 37th International Conference on Machine Learning (ICML), volume 119 ofProceedings of Machine Learning Research, pages 459–469. PMLR, 2020
2020
-
[4]
A. Attia and T. Koren. SGD with AdaGrad stepsizes: Full adaptivity with high probability to unknown parameters, unbounded gradients and affine variance. arXiv:2302.08783, 2023
-
[5]
The geometry of sign gradient descent.arXiv preprint arXiv:2002.08056,
L. Balles, F. Pedregosa, and N. Le Roux. The geometry of sign gradient descent. arXiv:2002.08056, 2020
-
[6]
Bellavia, G
S. Bellavia, G. Gratton, B. Morini, and Ph. L. Toint. Fast stochastic Adagrad for nonconvex bound-constrained optimization. arXiv:2505:06374, 2025
2025
-
[7]
Bellavia, G
S. Bellavia, G. Gratton, B. Morini, and Ph. L. Toint. An objective-function-free algorithm for general smooth constrained optimization. arXiv:2602:11770, 2026
2026
-
[8]
J. Bernstein and L. Newhouse. Modular duality in deep learning. ArXiv:2410.21265v2, 2024
-
[9]
Old optimizer, new norm: An anthology.arXiv preprint arXiv:2409.20325, 2024
J. Bernstein and L. Newhouse. Old optimizer, new norm: an anthology. ArXiv:2409.20325v2, 2024
-
[10]
Bhatia.Matrix Analysis
R. Bhatia.Matrix Analysis. Number 169 in Graduate Texts in Mathematics. Springer Verlag, Heidelberg, Berlin, New York, 1996
1996
-
[11]
Bottou, F
L. Bottou, F. E. Curtis, and J. Nocedal. Optimization methods for large-scale machine learning.SIAM Review, 60(2):223–311, 2018
2018
-
[12]
Boyd and L
S. Boyd and L. Vandenberghe.Convex Optimization. Cambridge University Press, Cambridge, England, 2004
2004
-
[13]
Davis and D
D. Davis and D. Drusvyatskiy. Stochastic model-based minimization of weakly convex functions.SIAM Journal on Optimization, 29(1):207–239, 2019
2019
-
[14]
D´ efossez, L
A. D´ efossez, L. Bottou, F. Bach, and N. Usunier. A simple convergence proof for Adam and Adagrad. Transactions on Machine Learning Research, October 2022
2022
-
[15]
Duchi, E
J. Duchi, E. Hazan, and Y. Singer. Adaptive subgradient methods for online learning and stochastic optimiza- tion.Journal of Machine Learning Research, 12, July 2011
2011
-
[16]
Y. Fang, S. Na, M. Mahoney, and M. Kolar. Fully stochastic trust-region sequential quadratic programming for equality constrained optimization problems.SIAM Journal on Optimization, 34(2):2007–2037, 2024
2007
- [17]
-
[18]
M. Faw, I. Tziotis, C. Caramanis, A. Mokhtari, S. Shakkottai, and R. Ward. The power of adaptivity in SGD: Self-tuning step sizes with unbounded gradients and affine variance. InProceedings of 35th Conference on Learning Theory, volume 178, pages 313–355, 2022
2022
- [19]
-
[20]
B. Ginsburg, P. Castonguay, O. Hrinchuk, O. Kuchaiev, R. Leary, V. Lavrukhin, J. Li, H. Nguyen, Y. Zhang, and J. M. Cohen. Training deep networks with stochastic gradient normalized by layerwise adaptive second moments. arXiv:1905.11286v3, 2019
-
[21]
Gratton, S
S. Gratton, S. Jerad, and Ph. L. Toint. Complexity and performance for two classes of noise-tolerant first-order algorithms.Optimization Methods and Software, 40(6):1473–1499, 2025
2025
-
[22]
S. Gratton and Ph. L. Toint. An objective-function-free algorithm for nonconvex stochastic optimization with deterministic equality and inequality constraints. arXiv:2603.29685, 2026
-
[23]
Z. Guo, W. Yin, R. Jin, and T. Yang. A novel convergence analysis for algorithms of the Adam family. In Proceedings of OPT2021: 13th Annual Workshop on Optimization for Machine Learning, 2021
2021
-
[24]
arXiv preprint arXiv:1706.06569 , year =
V. Gupta, T. Koren, and Y. Singer. A unified approach to adaptive regularization in online and stochastic optimization. arXiv:1706.06569, 2017
-
[25]
Gupta, T
V. Gupta, T. Koren, and Y. Singer. Shampoo: Preconditioned stochstic tensor optimization. InProceedings of the International Conference on Machine Learning, pages 1842–1850, 2018
2018
-
[26]
Cosmic shear with small scales: DES-Y3, KiDS-1000 and HSC-DR1
Y. Hong and J. Lin. Revisting convergence of Adagrad with relaxed assumptions. arXiv:2403.13794v2, 2024
-
[27]
R. Jiang, D. Maladkar, and A. Mokhtari. Convergence analysis of adaptive gradient methods under refined smoothness and noise assumptions. arXiv:2406.04592, 2024
-
[28]
Jordan, Y
K. Jordan, Y. Jin, V. Boza, Y. Jiacheng, F. Cecista, L. Newhouse, and J. Bernstein. URL:https://kellerjordan.github.io.posts/muon, 2024. Gratton & Toint: Unified convergence theory for adaptive first-order methods25
2024
-
[29]
Kavis, K.-Y
A. Kavis, K.-Y. Levy, and V. Cevher. High probability bounds for a class of nonconvex algorithms with AdaGrad stepsize. InInternational Conference on Learning Representations, 2022
2022
-
[30]
Kingma and J
D. Kingma and J. Ba. Adam: A method for stochastic optimization. InProceedings in the International Conference on Learning Representations (ICLR), 2015
2015
- [31]
-
[32]
H. Li, Y. Dong, and Z Lin. On theO( √ d/t1/4) convergence rate for RMSProp and its momentum extension measured inℓ 1 norm.Journal of Machine Learning Research, 26:1–25, 2025
2025
-
[33]
J. Li and M. Hong. A note on the convergence of Muon. arXiv:2502.02900, 2025
-
[34]
Li and F
X. Li and F. Orabona. On the convergence of stochastic gradient descent with adaptive stepsizes. InThe 22nd International Conference on Artificial Intelligence and Statistics, page 983–992, 2019
2019
- [35]
-
[36]
Z. Liu, T. D. Nguyen, A. Ene, and H. Nguyen. On the convergence of Adagrad(Norm) onRd: Beyond convexity, non-asymptotic rate and acceleration. In International Conference on Learning Representations, 2023
2023
-
[37]
L¨ owner
K. L¨ owner. ¨Uber monotone Matrixfunktionen.Mathematisch Zeitung, 38:177–216, 1934
1934
-
[38]
Martens and R
J. Martens and R. Grosse. Optimizing neural networks with Kronecker-factored approximate curvature. In International Conference on Machine Learning, 2015
2015
-
[39]
McMahan and M
B. McMahan and M. Streeter. Adaptive bound optimization for online convex optimization. InConference on Learning Theory, page 244sq, 2010
2010
-
[40]
M. C. Mukkamala and M. Hein. Variants of RMSProp and Adagrad with logarithmic regret bounds. In Proceedings of the 34th International Conference on Machine Learning, page 2545–2553, 2017
2017
-
[41]
Training deep learning models with norm-constrained lmos.arXiv preprint arXiv:2502.07529, 2025
Th. Pethick, W. Xie, K. Antonakopoulos, Z. Zhu, A. Silveti-Falls, and V. Cevher. Training deep learning models with norm-constrained LMOs. arXiv:2502.07529v2, 2025
-
[42]
Reddi, S
S. Reddi, S. Kale, and S. Kumar. On the convergence of Adam and beyond. InProceedings in the International Conference on Learning Representations (ICLR), 2018
2018
-
[43]
Robbins and S
H. Robbins and S. Monroe. A stochastic approximation method.The Annals of Mathematical Statistics, 22(3):400––407, 1951
1951
- [44]
-
[45]
Streeter and B
M. Streeter and B. McMahan. Less regret via online conditioning, 2010
2010
-
[46]
Tieleman and G
T. Tieleman and G. Hinton. Lecture 6.5-RMSPROP. COURSERA: Neural Networks for Machine Learning, 2012
2012
-
[47]
Ph. L. Toint. Divergence of the ADAM algorithm with fixed-stepsize: a (very) simple example. In K. Fackeldey, A. Kannan, S. Pokutta, K. Sharma, D. Walter, A. Walther, and M. Weiser, editors,Mathematical Optimization for Machine Learning: Proceedings of the MATH+ Thematic Einstein Semester 2023, pages 195–199. De Gruyter Brill, 2025
2023
-
[48]
N. Vyas, D. Morwani, R. Zhao, M. Kwun, I. Shapira, B. Brandfonbrener, L. Janson, and S. Kakade. SOAP: Improving and stabilizing Shampoo using Adam. arXiv:2409.11321, 2024
work page internal anchor Pith review arXiv 2024
-
[49]
B. Wang, H. Zhang, Z. Ma, and W. Chen. Convergence of Adagrad for non-convex objectives: simple proofs and relaxed assumptions. InProceedings of the Thirty-Sixth Conference on Learning Theory, pages 161–190, 2023
2023
- [50]
-
[51]
R. Ward, X. Wu, and L. Bottou. Adagrad stepsizes: sharp convergence over nonconvex landscapes. In Proceedings in the International Conference on Machine Learning (ICML2019), 2019
2019
- [52]
-
[53]
N. Xiao, X. Hu, X. Lin, and K.-C. Toh. Adam family methods for nonsmooth optimization with convergence guarantees.Journal of Machine Learning Research, 25, 2024
2024
- [54]
- [55]
-
[56]
AdaGrad meets Muon: Adaptive stepsizes for orthogonal updates
M. Zhang, Y. Liu, and H. Schaeffer. AdaGrad meets Muon: Adaptive stepsizes for orthogonal updates. arXiv:2509.02981v2, 2025
-
[57]
Zhang, Y
Q. Zhang, Y. Zhou, and S. Zou. Convergence guarantees for RMSProp and Adam in generalized-smooth non-convex optimization with affine noise variance, 2025
2025
-
[58]
Zhang and Z
T. Zhang and Z. Xu. Convergence of stochastic gradient descent for non-convex problems. InProceedings for the COLT Conference, 2012. Gratton & Toint: Unified convergence theory for adaptive first-order methods26 Proofs for Theorem 3.14 and Corollary 3.15 The definition (2.6) immediately implies the following bounds. Lemma A.1We have that, for allk≥0 and a...
2012
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.