A unified convergence theory for adaptive first-order methods in the nonconvex case, including AdaNorm, full and diagonal AdaGrad, Shampoo and Muo
Pith reviewed 2026-05-10 07:16 UTC · model grok-4.3
The pith
A single framework proves global convergence rates for AdaGrad, Shampoo and related adaptive methods in nonconvex stochastic optimization.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We introduce a unified framework for first-order optimization algorithms that rely on adaptively preconditioned gradients. The framework includes full and diagonal AdaGrad, AdaNorm, and adaptive variants of Shampoo and Muon, and it permits heterogeneous geometries across different groups of variables while keeping a single convergence proof. For every method covered, we give a fully stochastic global rate-of-convergence analysis both with and without two forms of momentum, relying solely on variance assumptions for the gradient oracle.
What carries the argument
The unified framework of adaptively preconditioned gradients that permits heterogeneous geometries across variable groups while supporting a common convergence analysis.
If this is right
- Every method inside the framework, including those with momentum, achieves a global rate of convergence to stationary points under the stated variance assumptions.
- The same analysis covers both full-matrix and diagonal preconditioners as well as combinations across variable groups.
- No separate proof is needed for each new adaptive variant that fits the framework.
- Convergence holds without assuming bounded stochastic gradients or sufficiently small stepsizes.
Where Pith is reading between the lines
- New adaptive preconditioners could be designed simply by verifying they fit inside the framework, thereby inheriting the convergence guarantees automatically.
- The allowance for heterogeneous geometries suggests that mixed-precision or block-wise adaptive schemes may converge under the same mild variance conditions.
- The result opens the possibility of testing whether other popular but unanalyzed adaptive methods can be recast in this form to obtain immediate rate guarantees.
Load-bearing premise
The stochastic gradient oracle has finite variance, without any bound on the gradients themselves or restriction on stepsize size.
What would settle it
A concrete nonconvex problem together with a stochastic gradient oracle of finite variance on which diagonal AdaGrad (or any other method inside the framework) fails to reach a stationary point at the claimed rate.
read the original abstract
A unified framework for first-order optimization algorithms fornonconvex unconstrained optimization is proposed that uses adaptivelypreconditioned gradients and includes popular methods such as full anddiagonal AdaGrad, AdaNorm, as well as adpative variants of Shampoo andMuon. This framework also allows combining heterogeneous geometriesacross different groups of variables while preserving a unifiedconvergence analysis. A fully stochastic global rate-of-convergenceanalysis is conducted for all methods in the framework, with andwithout two types of momentum, using reasonable assumptions on thevariance of the gradient oracle and without assuming boundedstochastic gradients or small enough stepsize.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces a unified framework for adaptive first-order methods in nonconvex unconstrained optimization using adaptively preconditioned gradients. It includes full and diagonal AdaGrad, AdaNorm, adaptive Shampoo, and Muon, and allows for heterogeneous geometries across variable groups. The paper provides a fully stochastic global rate-of-convergence analysis for these methods with and without momentum, under assumptions on the variance of the gradient oracle, without assuming bounded stochastic gradients or small step sizes.
Significance. If the analysis is correct, this work would offer a substantial advance in the theoretical understanding of adaptive optimizers in nonconvex settings, which are prevalent in machine learning. By relaxing standard assumptions and providing a unified treatment including momentum and heterogeneous preconditioners, it could bridge the gap between theory and practice for methods like AdaGrad and Shampoo.
major comments (1)
- [Convergence Analysis] The central claim that reasonable variance assumptions on the gradient oracle suffice for global rates without bounded gradients or small stepsizes is load-bearing. The skeptic's concern is valid: a standard bound E[||noise||^2] <= sigma^2 may not control the growth of adaptive preconditioners (e.g., cumulative 1/sqrt(v_t) terms) in unbounded gradient regimes common in nonconvex problems. The paper should explicitly show in the proof how the variance bound closes the Lyapunov analysis for arbitrary gradient magnitudes; if this is done via a specific lemma, it needs to be clearly stated as it underpins all results.
minor comments (2)
- [Abstract] There are minor typos: 'adpative' should be 'adaptive', and 'Muo' should be 'Muon'. Also, 'nonconvex' is missing a space after 'for' in the first sentence.
- [General] The paper would benefit from a clear statement of the main theorem early on, with all assumptions listed explicitly.
Simulated Author's Rebuttal
We thank the referee for the careful reading and for recognizing the potential significance of the unified framework. We address the major comment on the convergence analysis below and will revise the manuscript accordingly to improve explicitness.
read point-by-point responses
-
Referee: [Convergence Analysis] The central claim that reasonable variance assumptions on the gradient oracle suffice for global rates without bounded gradients or small stepsizes is load-bearing. The skeptic's concern is valid: a standard bound E[||noise||^2] <= sigma^2 may not control the growth of adaptive preconditioners (e.g., cumulative 1/sqrt(v_t) terms) in unbounded gradient regimes common in nonconvex problems. The paper should explicitly show in the proof how the variance bound closes the Lyapunov analysis for arbitrary gradient magnitudes; if this is done via a specific lemma, it needs to be clearly stated as it underpins all results.
Authors: We thank the referee for highlighting this critical point. The analysis does rely on the variance bound E[||noise||^2] <= sigma^2 (Assumption 2.2) to obtain the stated rates without bounded gradients or small step sizes. In the current proof of Theorem 3.1, this bound is applied to the stochastic error term after expanding the Lyapunov function decrease; the adaptive preconditioner (whether diagonal, full-matrix, or heterogeneous as in Shampoo/Muon) enters the analysis via a telescoping sum that absorbs the gradient magnitude into the denominator, preventing uncontrolled growth. However, we agree that the connection is not stated with sufficient clarity for a skeptical reader. We will revise the manuscript by adding an explicit new lemma (Lemma 3.4) that isolates and proves the bounding step for arbitrary gradient magnitudes, showing precisely how the variance assumption closes the inequality without invoking boundedness. This lemma will be referenced at the key step in the main proof and in the discussion of each method (AdaGrad, AdaNorm, Shampoo, Muon). revision: yes
Circularity Check
No circularity: rates derived from explicit variance assumptions
full rationale
The paper defines a unified framework for adaptively preconditioned first-order methods and derives global stochastic convergence rates directly from stated assumptions on the variance of the gradient oracle, without bounded gradients or small stepsize restrictions. The analysis proceeds via a potential or Lyapunov argument that incorporates the adaptive preconditioners (AdaGrad, Shampoo, etc.) under those variance bounds. No step reduces a claimed result to a fitted parameter, self-referential definition, or load-bearing self-citation whose validity is presupposed by the present work. The derivation remains independent of the target rates and does not rename known empirical patterns or smuggle ansatzes.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption reasonable assumptions on the variance of the gradient oracle
Forward citations
Cited by 2 Pith papers
-
bAdag: an adaptive block coordinate gradient method for smooth nonconvex functions
Introduces bAdag, an AdaGrad-based block coordinate gradient method with ergodic sublinear convergence proofs for smooth nonconvex objectives under block Lipschitz gradient assumptions, covering cyclic, uniform random...
-
Stochastic convergence of parallel asynchronous adaptive first-order methods
Introduces a class of asynchronous adaptive first-order methods and establishes O(1/sqrt t) convergence (up to logs) for non-convex stochastic optimization under reasonable assumptions.
Reference graph
Works this paper leans on
-
[1]
A. Alacaoglu and H. Lyu. Convergence of first-order methods for constrained nonconvex optimization with dependent data. InInternational Conference on Machine Learning. PMLR, pages 458–489, 2023
work page 2023
-
[2]
A. Alacaoglu, Y. Malitsky, and V. Cevher. Convergence od adaptive algorithms for constrained weakly-convex optimization. InProceedings of the 35th Conference on Neural Information Processing Systems (NEURIPS 2021), 2021
work page 2021
-
[3]
R. Anil, V. Gupta, T. Koren, K. Regan, and Y. Singer. Scalable second order optimization for deep learning. InProceedings of the 37th International Conference on Machine Learning (ICML), volume 119 ofProceedings of Machine Learning Research, pages 459–469. PMLR, 2020
work page 2020
-
[4]
A. Attia and T. Koren. SGD with AdaGrad stepsizes: Full adaptivity with high probability to unknown parameters, unbounded gradients and affine variance. arXiv:2302.08783, 2023
-
[5]
L. Balles, F. Pedregosa, and N. Le Roux. The geometry of sign gradient descent. arXiv:2002.08056, 2020
-
[6]
S. Bellavia, G. Gratton, B. Morini, and Ph. L. Toint. Fast stochastic Adagrad for nonconvex bound-constrained optimization. arXiv:2505:06374, 2025
work page 2025
-
[7]
S. Bellavia, G. Gratton, B. Morini, and Ph. L. Toint. An objective-function-free algorithm for general smooth constrained optimization. arXiv:2602:11770, 2026
work page 2026
-
[8]
J. Bernstein and L. Newhouse. Modular duality in deep learning. ArXiv:2410.21265v2, 2024
-
[9]
Old Optimizer, New Norm: An Anthology
J. Bernstein and L. Newhouse. Old optimizer, new norm: an anthology. ArXiv:2409.20325v2, 2024
work page internal anchor Pith review arXiv 2024
-
[10]
R. Bhatia.Matrix Analysis. Number 169 in Graduate Texts in Mathematics. Springer Verlag, Heidelberg, Berlin, New York, 1996
work page 1996
- [11]
-
[12]
S. Boyd and L. Vandenberghe.Convex Optimization. Cambridge University Press, Cambridge, England, 2004
work page 2004
-
[13]
D. Davis and D. Drusvyatskiy. Stochastic model-based minimization of weakly convex functions.SIAM Journal on Optimization, 29(1):207–239, 2019
work page 2019
-
[14]
A. D´ efossez, L. Bottou, F. Bach, and N. Usunier. A simple convergence proof for Adam and Adagrad. Transactions on Machine Learning Research, October 2022
work page 2022
- [15]
-
[16]
Y. Fang, S. Na, M. Mahoney, and M. Kolar. Fully stochastic trust-region sequential quadratic programming for equality constrained optimization problems.SIAM Journal on Optimization, 34(2):2007–2037, 2024
work page 2007
- [17]
-
[18]
M. Faw, I. Tziotis, C. Caramanis, A. Mokhtari, S. Shakkottai, and R. Ward. The power of adaptivity in SGD: Self-tuning step sizes with unbounded gradients and affine variance. InProceedings of 35th Conference on Learning Theory, volume 178, pages 313–355, 2022
work page 2022
- [19]
-
[20]
B. Ginsburg, P. Castonguay, O. Hrinchuk, O. Kuchaiev, R. Leary, V. Lavrukhin, J. Li, H. Nguyen, Y. Zhang, and J. M. Cohen. Training deep networks with stochastic gradient normalized by layerwise adaptive second moments. arXiv:1905.11286v3, 2019
-
[21]
S. Gratton, S. Jerad, and Ph. L. Toint. Complexity and performance for two classes of noise-tolerant first-order algorithms.Optimization Methods and Software, 40(6):1473–1499, 2025
work page 2025
-
[22]
S. Gratton and Ph. L. Toint. An objective-function-free algorithm for nonconvex stochastic optimization with deterministic equality and inequality constraints. arXiv:2603.29685, 2026
-
[23]
Z. Guo, W. Yin, R. Jin, and T. Yang. A novel convergence analysis for algorithms of the Adam family. In Proceedings of OPT2021: 13th Annual Workshop on Optimization for Machine Learning, 2021
work page 2021
-
[24]
A Unified Approach to Adaptive Regularization in Online and Stochastic Optimization
V. Gupta, T. Koren, and Y. Singer. A unified approach to adaptive regularization in online and stochastic optimization. arXiv:1706.06569, 2017
work page Pith review arXiv 2017
- [25]
-
[26]
Y. Hong and J. Lin. Revisting convergence of Adagrad with relaxed assumptions. arXiv:2403.13794v2, 2024
-
[27]
R. Jiang, D. Maladkar, and A. Mokhtari. Convergence analysis of adaptive gradient methods under refined smoothness and noise assumptions. arXiv:2406.04592, 2024
- [28]
-
[29]
A. Kavis, K.-Y. Levy, and V. Cevher. High probability bounds for a class of nonconvex algorithms with AdaGrad stepsize. InInternational Conference on Learning Representations, 2022
work page 2022
-
[30]
D. Kingma and J. Ba. Adam: A method for stochastic optimization. InProceedings in the International Conference on Learning Representations (ICLR), 2015
work page 2015
- [31]
-
[32]
H. Li, Y. Dong, and Z Lin. On theO( √ d/t1/4) convergence rate for RMSProp and its momentum extension measured inℓ 1 norm.Journal of Machine Learning Research, 26:1–25, 2025
work page 2025
- [33]
- [34]
- [35]
-
[36]
Z. Liu, T. D. Nguyen, A. Ene, and H. Nguyen. On the convergence of Adagrad(Norm) onRd: Beyond convexity, non-asymptotic rate and acceleration. In International Conference on Learning Representations, 2023
work page 2023
- [37]
-
[38]
J. Martens and R. Grosse. Optimizing neural networks with Kronecker-factored approximate curvature. In International Conference on Machine Learning, 2015
work page 2015
-
[39]
B. McMahan and M. Streeter. Adaptive bound optimization for online convex optimization. InConference on Learning Theory, page 244sq, 2010
work page 2010
-
[40]
M. C. Mukkamala and M. Hein. Variants of RMSProp and Adagrad with logarithmic regret bounds. In Proceedings of the 34th International Conference on Machine Learning, page 2545–2553, 2017
work page 2017
-
[41]
Training Deep Learning Models with Norm-Constrained LMOs
Th. Pethick, W. Xie, K. Antonakopoulos, Z. Zhu, A. Silveti-Falls, and V. Cevher. Training deep learning models with norm-constrained LMOs. arXiv:2502.07529v2, 2025
work page internal anchor Pith review arXiv 2025
- [42]
-
[43]
H. Robbins and S. Monroe. A stochastic approximation method.The Annals of Mathematical Statistics, 22(3):400––407, 1951
work page 1951
- [44]
- [45]
-
[46]
T. Tieleman and G. Hinton. Lecture 6.5-RMSPROP. COURSERA: Neural Networks for Machine Learning, 2012
work page 2012
-
[47]
Ph. L. Toint. Divergence of the ADAM algorithm with fixed-stepsize: a (very) simple example. In K. Fackeldey, A. Kannan, S. Pokutta, K. Sharma, D. Walter, A. Walther, and M. Weiser, editors,Mathematical Optimization for Machine Learning: Proceedings of the MATH+ Thematic Einstein Semester 2023, pages 195–199. De Gruyter Brill, 2025
work page 2023
-
[48]
N. Vyas, D. Morwani, R. Zhao, M. Kwun, I. Shapira, B. Brandfonbrener, L. Janson, and S. Kakade. SOAP: Improving and stabilizing Shampoo using Adam. arXiv:2409.11321, 2024
work page internal anchor Pith review arXiv 2024
-
[49]
B. Wang, H. Zhang, Z. Ma, and W. Chen. Convergence of Adagrad for non-convex objectives: simple proofs and relaxed assumptions. InProceedings of the Thirty-Sixth Conference on Learning Theory, pages 161–190, 2023
work page 2023
- [50]
-
[51]
R. Ward, X. Wu, and L. Bottou. Adagrad stepsizes: sharp convergence over nonconvex landscapes. In Proceedings in the International Conference on Machine Learning (ICML2019), 2019
work page 2019
- [52]
-
[53]
N. Xiao, X. Hu, X. Lin, and K.-C. Toh. Adam family methods for nonsmooth optimization with convergence guarantees.Journal of Machine Learning Research, 25, 2024
work page 2024
- [54]
-
[55]
A. W. Yu, L. Huang, Q. Lin, R. Salakhutdinov, and J. Carbonell. Block-normalized gradient method; an empirical study for training deep neural network. arXiv:1707.04822v2, 2017
work page Pith review arXiv 2017
- [56]
- [57]
-
[58]
T. Zhang and Z. Xu. Convergence of stochastic gradient descent for non-convex problems. InProceedings for the COLT Conference, 2012. Gratton & Toint: Unified convergence theory for adaptive first-order methods26 Proofs for Theorem 3.14 and Corollary 3.15 The definition (2.6) immediately implies the following bounds. Lemma A.1We have that, for allk≥0 and a...
work page 2012
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.