pith. machine review for the scientific record. sign in

arxiv: 2604.17423 · v2 · submitted 2026-04-19 · 💻 cs.LG

Recognition: unknown

A unified convergence theory for adaptive first-order methods in the nonconvex case, including AdaNorm, full and diagonal AdaGrad, Shampoo and Muo

Authors on Pith no claims yet

Pith reviewed 2026-05-10 07:16 UTC · model grok-4.3

classification 💻 cs.LG
keywords adaptive optimizationnonconvex optimizationstochastic convergenceAdaGradShampoofirst-order methodsmomentum methods
0
0 comments X

The pith

A single framework proves global convergence rates for AdaGrad, Shampoo and related adaptive methods in nonconvex stochastic optimization.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper constructs a unified framework for first-order methods that precondition gradients adaptively and covers full and diagonal AdaGrad, AdaNorm, plus adaptive versions of Shampoo and Muon. It supplies a fully stochastic convergence analysis that applies to every method in the framework, both with and without momentum. The analysis uses only assumptions on the variance of the stochastic gradient oracle and avoids any requirement that gradients remain bounded or that stepsizes stay small enough. A reader would care because these methods are widely used in nonconvex problems such as neural network training, so a common theory clarifies when and why they succeed.

Core claim

We introduce a unified framework for first-order optimization algorithms that rely on adaptively preconditioned gradients. The framework includes full and diagonal AdaGrad, AdaNorm, and adaptive variants of Shampoo and Muon, and it permits heterogeneous geometries across different groups of variables while keeping a single convergence proof. For every method covered, we give a fully stochastic global rate-of-convergence analysis both with and without two forms of momentum, relying solely on variance assumptions for the gradient oracle.

What carries the argument

The unified framework of adaptively preconditioned gradients that permits heterogeneous geometries across variable groups while supporting a common convergence analysis.

If this is right

  • Every method inside the framework, including those with momentum, achieves a global rate of convergence to stationary points under the stated variance assumptions.
  • The same analysis covers both full-matrix and diagonal preconditioners as well as combinations across variable groups.
  • No separate proof is needed for each new adaptive variant that fits the framework.
  • Convergence holds without assuming bounded stochastic gradients or sufficiently small stepsizes.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • New adaptive preconditioners could be designed simply by verifying they fit inside the framework, thereby inheriting the convergence guarantees automatically.
  • The allowance for heterogeneous geometries suggests that mixed-precision or block-wise adaptive schemes may converge under the same mild variance conditions.
  • The result opens the possibility of testing whether other popular but unanalyzed adaptive methods can be recast in this form to obtain immediate rate guarantees.

Load-bearing premise

The stochastic gradient oracle has finite variance, without any bound on the gradients themselves or restriction on stepsize size.

What would settle it

A concrete nonconvex problem together with a stochastic gradient oracle of finite variance on which diagonal AdaGrad (or any other method inside the framework) fails to reach a stationary point at the claimed rate.

read the original abstract

A unified framework for first-order optimization algorithms fornonconvex unconstrained optimization is proposed that uses adaptivelypreconditioned gradients and includes popular methods such as full anddiagonal AdaGrad, AdaNorm, as well as adpative variants of Shampoo andMuon. This framework also allows combining heterogeneous geometriesacross different groups of variables while preserving a unifiedconvergence analysis. A fully stochastic global rate-of-convergenceanalysis is conducted for all methods in the framework, with andwithout two types of momentum, using reasonable assumptions on thevariance of the gradient oracle and without assuming boundedstochastic gradients or small enough stepsize.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The manuscript introduces a unified framework for adaptive first-order methods in nonconvex unconstrained optimization using adaptively preconditioned gradients. It includes full and diagonal AdaGrad, AdaNorm, adaptive Shampoo, and Muon, and allows for heterogeneous geometries across variable groups. The paper provides a fully stochastic global rate-of-convergence analysis for these methods with and without momentum, under assumptions on the variance of the gradient oracle, without assuming bounded stochastic gradients or small step sizes.

Significance. If the analysis is correct, this work would offer a substantial advance in the theoretical understanding of adaptive optimizers in nonconvex settings, which are prevalent in machine learning. By relaxing standard assumptions and providing a unified treatment including momentum and heterogeneous preconditioners, it could bridge the gap between theory and practice for methods like AdaGrad and Shampoo.

major comments (1)
  1. [Convergence Analysis] The central claim that reasonable variance assumptions on the gradient oracle suffice for global rates without bounded gradients or small stepsizes is load-bearing. The skeptic's concern is valid: a standard bound E[||noise||^2] <= sigma^2 may not control the growth of adaptive preconditioners (e.g., cumulative 1/sqrt(v_t) terms) in unbounded gradient regimes common in nonconvex problems. The paper should explicitly show in the proof how the variance bound closes the Lyapunov analysis for arbitrary gradient magnitudes; if this is done via a specific lemma, it needs to be clearly stated as it underpins all results.
minor comments (2)
  1. [Abstract] There are minor typos: 'adpative' should be 'adaptive', and 'Muo' should be 'Muon'. Also, 'nonconvex' is missing a space after 'for' in the first sentence.
  2. [General] The paper would benefit from a clear statement of the main theorem early on, with all assumptions listed explicitly.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the careful reading and for recognizing the potential significance of the unified framework. We address the major comment on the convergence analysis below and will revise the manuscript accordingly to improve explicitness.

read point-by-point responses
  1. Referee: [Convergence Analysis] The central claim that reasonable variance assumptions on the gradient oracle suffice for global rates without bounded gradients or small stepsizes is load-bearing. The skeptic's concern is valid: a standard bound E[||noise||^2] <= sigma^2 may not control the growth of adaptive preconditioners (e.g., cumulative 1/sqrt(v_t) terms) in unbounded gradient regimes common in nonconvex problems. The paper should explicitly show in the proof how the variance bound closes the Lyapunov analysis for arbitrary gradient magnitudes; if this is done via a specific lemma, it needs to be clearly stated as it underpins all results.

    Authors: We thank the referee for highlighting this critical point. The analysis does rely on the variance bound E[||noise||^2] <= sigma^2 (Assumption 2.2) to obtain the stated rates without bounded gradients or small step sizes. In the current proof of Theorem 3.1, this bound is applied to the stochastic error term after expanding the Lyapunov function decrease; the adaptive preconditioner (whether diagonal, full-matrix, or heterogeneous as in Shampoo/Muon) enters the analysis via a telescoping sum that absorbs the gradient magnitude into the denominator, preventing uncontrolled growth. However, we agree that the connection is not stated with sufficient clarity for a skeptical reader. We will revise the manuscript by adding an explicit new lemma (Lemma 3.4) that isolates and proves the bounding step for arbitrary gradient magnitudes, showing precisely how the variance assumption closes the inequality without invoking boundedness. This lemma will be referenced at the key step in the main proof and in the discussion of each method (AdaGrad, AdaNorm, Shampoo, Muon). revision: yes

Circularity Check

0 steps flagged

No circularity: rates derived from explicit variance assumptions

full rationale

The paper defines a unified framework for adaptively preconditioned first-order methods and derives global stochastic convergence rates directly from stated assumptions on the variance of the gradient oracle, without bounded gradients or small stepsize restrictions. The analysis proceeds via a potential or Lyapunov argument that incorporates the adaptive preconditioners (AdaGrad, Shampoo, etc.) under those variance bounds. No step reduces a claimed result to a fitted parameter, self-referential definition, or load-bearing self-citation whose validity is presupposed by the present work. The derivation remains independent of the target rates and does not rename known empirical patterns or smuggle ansatzes.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on assumptions about gradient variance that are described as reasonable but not derived within the paper.

axioms (1)
  • domain assumption reasonable assumptions on the variance of the gradient oracle
    Invoked to enable the fully stochastic global rate analysis without bounded gradients.

pith-pipeline@v0.9.0 · 5410 in / 1185 out tokens · 33870 ms · 2026-05-10T07:16:18.497600+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

58 extracted references · 22 canonical work pages · 1 internal anchor

  1. [1]

    Alacaoglu and H

    A. Alacaoglu and H. Lyu. Convergence of first-order methods for constrained nonconvex optimization with dependent data. InInternational Conference on Machine Learning. PMLR, pages 458–489, 2023

  2. [2]

    Alacaoglu, Y

    A. Alacaoglu, Y. Malitsky, and V. Cevher. Convergence od adaptive algorithms for constrained weakly-convex optimization. InProceedings of the 35th Conference on Neural Information Processing Systems (NEURIPS 2021), 2021

  3. [3]

    R. Anil, V. Gupta, T. Koren, K. Regan, and Y. Singer. Scalable second order optimization for deep learning. InProceedings of the 37th International Conference on Machine Learning (ICML), volume 119 ofProceedings of Machine Learning Research, pages 459–469. PMLR, 2020

  4. [4]

    Attia and T

    A. Attia and T. Koren. SGD with AdaGrad stepsizes: Full adaptivity with high probability to unknown parameters, unbounded gradients and affine variance. arXiv:2302.08783, 2023

  5. [5]

    The geometry of sign gradient descent.arXiv preprint arXiv:2002.08056,

    L. Balles, F. Pedregosa, and N. Le Roux. The geometry of sign gradient descent. arXiv:2002.08056, 2020

  6. [6]

    Bellavia, G

    S. Bellavia, G. Gratton, B. Morini, and Ph. L. Toint. Fast stochastic Adagrad for nonconvex bound-constrained optimization. arXiv:2505:06374, 2025

  7. [7]

    Bellavia, G

    S. Bellavia, G. Gratton, B. Morini, and Ph. L. Toint. An objective-function-free algorithm for general smooth constrained optimization. arXiv:2602:11770, 2026

  8. [8]

    arXiv:2410.21265 , year=

    J. Bernstein and L. Newhouse. Modular duality in deep learning. ArXiv:2410.21265v2, 2024

  9. [9]

    Old optimizer, new norm: An anthology.arXiv preprint arXiv:2409.20325, 2024

    J. Bernstein and L. Newhouse. Old optimizer, new norm: an anthology. ArXiv:2409.20325v2, 2024

  10. [10]

    Bhatia.Matrix Analysis

    R. Bhatia.Matrix Analysis. Number 169 in Graduate Texts in Mathematics. Springer Verlag, Heidelberg, Berlin, New York, 1996

  11. [11]

    Bottou, F

    L. Bottou, F. E. Curtis, and J. Nocedal. Optimization methods for large-scale machine learning.SIAM Review, 60(2):223–311, 2018

  12. [12]

    Boyd and L

    S. Boyd and L. Vandenberghe.Convex Optimization. Cambridge University Press, Cambridge, England, 2004

  13. [13]

    Davis and D

    D. Davis and D. Drusvyatskiy. Stochastic model-based minimization of weakly convex functions.SIAM Journal on Optimization, 29(1):207–239, 2019

  14. [14]

    D´ efossez, L

    A. D´ efossez, L. Bottou, F. Bach, and N. Usunier. A simple convergence proof for Adam and Adagrad. Transactions on Machine Learning Research, October 2022

  15. [15]

    Duchi, E

    J. Duchi, E. Hazan, and Y. Singer. Adaptive subgradient methods for online learning and stochastic optimiza- tion.Journal of Machine Learning Research, 12, July 2011

  16. [16]

    Y. Fang, S. Na, M. Mahoney, and M. Kolar. Fully stochastic trust-region sequential quadratic programming for equality constrained optimization problems.SIAM Journal on Optimization, 34(2):2007–2037, 2024

  17. [17]

    M. Faw, L. Rout, C. Caramanis, and S. Shakkottai. Beyond uniform smoothness: A stopped analysis of adaptive SGD. arXiv:2302.06570, 2023

  18. [18]

    M. Faw, I. Tziotis, C. Caramanis, A. Mokhtari, S. Shakkottai, and R. Ward. The power of adaptivity in SGD: Self-tuning step sizes with unbounded gradients and affine variance. InProceedings of 35th Conference on Learning Theory, volume 178, pages 313–355, 2022

  19. [19]

    Th. Flynn. The duality structure gradient descent algorithm: analysis and application to neural networks. arXiv:1708.00523v8, 2017

  20. [20]

    Ginsburg, P

    B. Ginsburg, P. Castonguay, O. Hrinchuk, O. Kuchaiev, R. Leary, V. Lavrukhin, J. Li, H. Nguyen, Y. Zhang, and J. M. Cohen. Training deep networks with stochastic gradient normalized by layerwise adaptive second moments. arXiv:1905.11286v3, 2019

  21. [21]

    Gratton, S

    S. Gratton, S. Jerad, and Ph. L. Toint. Complexity and performance for two classes of noise-tolerant first-order algorithms.Optimization Methods and Software, 40(6):1473–1499, 2025

  22. [22]

    Gratton and Ph

    S. Gratton and Ph. L. Toint. An objective-function-free algorithm for nonconvex stochastic optimization with deterministic equality and inequality constraints. arXiv:2603.29685, 2026

  23. [23]

    Z. Guo, W. Yin, R. Jin, and T. Yang. A novel convergence analysis for algorithms of the Adam family. In Proceedings of OPT2021: 13th Annual Workshop on Optimization for Machine Learning, 2021

  24. [24]

    arXiv preprint arXiv:1706.06569 , year =

    V. Gupta, T. Koren, and Y. Singer. A unified approach to adaptive regularization in online and stochastic optimization. arXiv:1706.06569, 2017

  25. [25]

    Gupta, T

    V. Gupta, T. Koren, and Y. Singer. Shampoo: Preconditioned stochstic tensor optimization. InProceedings of the International Conference on Machine Learning, pages 1842–1850, 2018

  26. [26]

    Cosmic shear with small scales: DES-Y3, KiDS-1000 and HSC-DR1

    Y. Hong and J. Lin. Revisting convergence of Adagrad with relaxed assumptions. arXiv:2403.13794v2, 2024

  27. [27]

    Provable complexity improvement of adagrad over sgd: Upper and lower bounds in stochastic non-convex optimization

    R. Jiang, D. Maladkar, and A. Mokhtari. Convergence analysis of adaptive gradient methods under refined smoothness and noise assumptions. arXiv:2406.04592, 2024

  28. [28]

    Jordan, Y

    K. Jordan, Y. Jin, V. Boza, Y. Jiacheng, F. Cecista, L. Newhouse, and J. Bernstein. URL:https://kellerjordan.github.io.posts/muon, 2024. Gratton & Toint: Unified convergence theory for adaptive first-order methods25

  29. [29]

    Kavis, K.-Y

    A. Kavis, K.-Y. Levy, and V. Cevher. High probability bounds for a class of nonconvex algorithms with AdaGrad stepsize. InInternational Conference on Learning Representations, 2022

  30. [30]

    Kingma and J

    D. Kingma and J. Ba. Adam: A method for stochastic optimization. InProceedings in the International Conference on Learning Representations (ICLR), 2015

  31. [31]

    D. Kovalev. SGD with adaptive preconditioning: Unified analysis and momentum acceleration. arXiv÷2506.23804, 2025

  32. [32]

    H. Li, Y. Dong, and Z Lin. On theO( √ d/t1/4) convergence rate for RMSProp and its momentum extension measured inℓ 1 norm.Journal of Machine Learning Research, 26:1–25, 2025

  33. [33]

    Jiaxiang Li and Mingyi Hong

    J. Li and M. Hong. A note on the convergence of Muon. arXiv:2502.02900, 2025

  34. [34]

    Li and F

    X. Li and F. Orabona. On the convergence of stochastic gradient descent with adaptive stepsizes. InThe 22nd International Conference on Artificial Intelligence and Statistics, page 983–992, 2019

  35. [35]

    L. Liu, Z. Xu, Z. Zhang, H. Kang, Z. Li, C. Liang, W. Chen, and T. Zhao. COSMOS: A hybrid adaptive optimizer for memory-efficient training of LLMs. arXiv:2502.17410, 2025

  36. [36]

    Z. Liu, T. D. Nguyen, A. Ene, and H. Nguyen. On the convergence of Adagrad(Norm) onRd: Beyond convexity, non-asymptotic rate and acceleration. In International Conference on Learning Representations, 2023

  37. [37]

    L¨ owner

    K. L¨ owner. ¨Uber monotone Matrixfunktionen.Mathematisch Zeitung, 38:177–216, 1934

  38. [38]

    Martens and R

    J. Martens and R. Grosse. Optimizing neural networks with Kronecker-factored approximate curvature. In International Conference on Machine Learning, 2015

  39. [39]

    McMahan and M

    B. McMahan and M. Streeter. Adaptive bound optimization for online convex optimization. InConference on Learning Theory, page 244sq, 2010

  40. [40]

    M. C. Mukkamala and M. Hein. Variants of RMSProp and Adagrad with logarithmic regret bounds. In Proceedings of the 34th International Conference on Machine Learning, page 2545–2553, 2017

  41. [41]

    Training deep learning models with norm-constrained lmos.arXiv preprint arXiv:2502.07529, 2025

    Th. Pethick, W. Xie, K. Antonakopoulos, Z. Zhu, A. Silveti-Falls, and V. Cevher. Training deep learning models with norm-constrained LMOs. arXiv:2502.07529v2, 2025

  42. [42]

    Reddi, S

    S. Reddi, S. Kale, and S. Kumar. On the convergence of Adam and beyond. InProceedings in the International Conference on Learning Representations (ICLR), 2018

  43. [43]

    Robbins and S

    H. Robbins and S. Monroe. A stochastic approximation method.The Annals of Mathematical Statistics, 22(3):400––407, 1951

  44. [44]

    C. Si, D. Zhang, and W. Shen. AdaMuon: Adaptive Muon optimizer. arXiv2507.11005, 2025

  45. [45]

    Streeter and B

    M. Streeter and B. McMahan. Less regret via online conditioning, 2010

  46. [46]

    Tieleman and G

    T. Tieleman and G. Hinton. Lecture 6.5-RMSPROP. COURSERA: Neural Networks for Machine Learning, 2012

  47. [47]

    Ph. L. Toint. Divergence of the ADAM algorithm with fixed-stepsize: a (very) simple example. In K. Fackeldey, A. Kannan, S. Pokutta, K. Sharma, D. Walter, A. Walther, and M. Weiser, editors,Mathematical Optimization for Machine Learning: Proceedings of the MATH+ Thematic Einstein Semester 2023, pages 195–199. De Gruyter Brill, 2025

  48. [48]

    N. Vyas, D. Morwani, R. Zhao, M. Kwun, I. Shapira, B. Brandfonbrener, L. Janson, and S. Kakade. SOAP: Improving and stabilizing Shampoo using Adam. arXiv:2409.11321, 2024

  49. [49]

    B. Wang, H. Zhang, Z. Ma, and W. Chen. Convergence of Adagrad for non-convex objectives: simple proofs and relaxed assumptions. InProceedings of the Thirty-Sixth Conference on Learning Theory, pages 161–190, 2023

  50. [50]

    Wang, Ch

    Q. Wang, Ch. Peirmarini, Y. Zhu, and F. E. Curtis. Projected stochastic momentum methods for nonlinear equality-constrained optimization for machine learning. arXiv:2601.11785, 2026

  51. [51]

    R. Ward, X. Wu, and L. Bottou. Adagrad stepsizes: sharp convergence over nonconvex landscapes. In Proceedings in the International Conference on Machine Learning (ICML2019), 2019

  52. [52]

    X. Wu, R. Ward, and L. Bottou. WNGRAD: Learn the learning rate in gradient descent. arXiv:1803.02865, 2018

  53. [53]

    N. Xiao, X. Hu, X. Lin, and K.-C. Toh. Adam family methods for nonsmooth optimization with convergence guarantees.Journal of Machine Learning Research, 25, 2024

  54. [54]

    S. Xie, T. Wang, S. Reddi, S. Kumar, and Z. Li. Structured preconditioners in adaptive optimization: A unified analysis. arXiv:2503.10537, 2025

  55. [55]

    A. W. Yu, L. Huang, Q. Lin, R. Salakhutdinov, and J. Carbonell. Block-normalized gradient method; an empirical study for training deep neural network. arXiv:1707.04822v2, 2017

  56. [56]

    AdaGrad meets Muon: Adaptive stepsizes for orthogonal updates

    M. Zhang, Y. Liu, and H. Schaeffer. AdaGrad meets Muon: Adaptive stepsizes for orthogonal updates. arXiv:2509.02981v2, 2025

  57. [57]

    Zhang, Y

    Q. Zhang, Y. Zhou, and S. Zou. Convergence guarantees for RMSProp and Adam in generalized-smooth non-convex optimization with affine noise variance, 2025

  58. [58]

    Zhang and Z

    T. Zhang and Z. Xu. Convergence of stochastic gradient descent for non-convex problems. InProceedings for the COLT Conference, 2012. Gratton & Toint: Unified convergence theory for adaptive first-order methods26 Proofs for Theorem 3.14 and Corollary 3.15 The definition (2.6) immediately implies the following bounds. Lemma A.1We have that, for allk≥0 and a...