arxiv: 2606.21433 · v1 · pith:G55AMEKKnew · submitted 2026-06-19 · 🧮 math.PR · cs.LG · math.ST · stat.TH

Central limit theorem for the averaged Adam optimizer

Pith reviewed 2026-06-26 13:27 UTC · model grok-4.3

classification 🧮 math.PR · cs.LG · math.ST · stat.TH
keywords central limit theorem · Adam optimizer · averaged iterates · stochastic approximation · convergence rate · vector field · attractor

The pith

The averaged Adam optimizer satisfies a central limit theorem, with error of order $n^{-1/2}$ in the number of steps $n$.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper analyses the convergence of averaged iterates of the Adam optimizer toward an attracting zero of the Adam vector field. It establishes a central limit theorem that gives the precise asymptotic distribution after normalization by the square root of the number of steps. The limiting rate matches the classical order seen in stochastic approximation methods. The covariance matrix of the limit is expressed directly in terms of the Adam algorithm's behavior at the attractor. A reader would care because the result supplies an exact quantitative description of the error for this widely used optimizer.

Core claim

Under the assumption that the averaged Adam iterates converge to an attracting zero of the Adam vector field, a central limit theorem holds for these iterates: after centering at the attractor and scaling by the square root of the number of steps n, the distribution converges to a normal law whose covariance is determined by the properties of the Adam algorithm evaluated in the state of the attractor.
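
In symbols, the claim can be written roughly as follows. This is a sketch in assumed notation, not the paper's own: $\theta_k$ denotes the $k$-th Adam iterate, $\bar\theta_n$ the averaged iterate (taken here, as an assumption, to be the plain arithmetic mean), $\theta^*$ the attracting zero of the Adam vector field, and $\Sigma$ the limiting covariance, whose explicit form the abstract does not state.

$$
\bar\theta_n = \frac{1}{n}\sum_{k=1}^{n}\theta_k,
\qquad
\sqrt{n}\,\bigl(\bar\theta_n - \theta^*\bigr) \xrightarrow{\;d\;} \mathcal{N}(0,\Sigma)
\quad \text{as } n \to \infty .
$$

Read this way, the error $\bar\theta_n - \theta^*$ is of order $n^{-1/2}$ in distribution, which is the rate quoted in the abstract.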

What carries the argument

The Adam vector field and its attracting zero, together with the averaged iterates whose fluctuations are described by the central limit theorem.

If this is right

  • The error of the averaged Adam iterates is of order $n^{-1/2}$ in the number of steps $n$.
  • The asymptotic covariance is given in terms of properties of the Adam algorithm at the attractor.
  • The rate coincides with the order known for classical stochastic approximation schemes.
  • The theorem concerns the averaged Adam iterates, not the raw iterates.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the theorem holds, then averaged Adam can be equipped with the same asymptotic error bounds used in Robbins-Monro analysis for practical uncertainty quantification.
  • The explicit covariance formula could be used to construct confidence regions around the estimated optimum in high-dimensional training.
  • The approach might extend to other adaptive methods such as RMSprop once their vector fields are shown to possess attracting zeros.
  • Testing the predicted covariance on benchmark optimization problems would provide a direct check independent of the proof.
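
As an illustration of the confidence-region idea above, a CLT of this form would turn into per-coordinate intervals as sketched below. This is a minimal sketch under loud assumptions: the function name `clt_confidence_intervals` is hypothetical, the covariance `sigma` is treated as already known or estimated (the page does not say how to obtain it), and the intervals are marginal, not a joint region.

```python
from statistics import NormalDist

import numpy as np


def clt_confidence_intervals(theta_bar, sigma, n, level=0.95):
    """Per-coordinate intervals theta_bar_i +/- z * sqrt(sigma_ii / n).

    Assumes sqrt(n) * (theta_bar - theta_star) is approximately N(0, sigma).
    """
    theta_bar = np.asarray(theta_bar, dtype=float)
    sigma = np.asarray(sigma, dtype=float)
    # Two-sided standard normal quantile for the requested level.
    z = NormalDist().inv_cdf(0.5 + level / 2.0)
    half_width = z * np.sqrt(np.diag(sigma) / n)
    return theta_bar - half_width, theta_bar + half_width


# Hypothetical inputs for illustration only; sigma would have to be
# estimated in practice.
lower, upper = clt_confidence_intervals(
    [0.0, 1.0], [[4.0, 0.0], [0.0, 1.0]], n=100
)
print(lower, upper)
```

The half-widths shrink like $n^{-1/2}$, so quadrupling the number of steps halves each interval, provided the asymptotic regime has been reached.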

Load-bearing premise

There exists an attracting zero of the Adam vector field to which the averaged iterates converge.

What would settle it

Simulations in which the averaged Adam iterates do converge to the attractor, but the error decays at an order other than $n^{-1/2}$, would contradict the central limit theorem.
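
A toy version of such an experiment might look like the sketch below. Everything specific here is an assumption made for illustration and is not taken from the paper: the one-dimensional objective $\theta^2/2$ with standard normal gradient noise, the step sizes $c\,k^{-\rho}$, the hyperparameters, the bias-corrected form of Adam, and the plain arithmetic average. By symmetry the target in this toy problem is $0$. If the $n^{-1/2}$ order holds, the root-mean-square error of the averaged iterate should roughly halve when $n$ is quadrupled, up to finite-sample effects.

```python
import numpy as np


def averaged_adam_rmse(checkpoints, runs=2000, c=0.5, rho=0.6,
                       beta1=0.9, beta2=0.999, eps=1e-8, seed=0):
    """RMSE of the averaged Adam iterate at each checkpoint, over independent runs."""
    rng = np.random.default_rng(seed)
    theta = np.zeros(runs)  # one scalar iterate per independent run
    m = np.zeros(runs)      # first-moment estimate
    v = np.zeros(runs)      # second-moment estimate
    avg = np.zeros(runs)    # running mean of theta_1, ..., theta_k
    out = {}
    for k in range(1, max(checkpoints) + 1):
        # Noisy gradient of theta^2 / 2 (assumed toy objective).
        g = theta + rng.standard_normal(runs)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        m_hat = m / (1 - beta1 ** k)  # bias correction
        v_hat = v / (1 - beta2 ** k)
        # Decaying step size c * k^(-rho) (an assumption of this sketch).
        theta = theta - c * k ** (-rho) * m_hat / (np.sqrt(v_hat) + eps)
        avg += (theta - avg) / k
        if k in checkpoints:
            out[k] = float(np.sqrt(np.mean(avg ** 2)))
    return out


rmse = averaged_adam_rmse(checkpoints=(1000, 4000))
ratio = rmse[1000] / rmse[4000]
print(rmse, ratio)
```

A ratio near 2 is consistent with the predicted order; a ratio that stays far from 2 as both checkpoints grow would point the other way. A single toy problem cannot confirm the theorem, and it says nothing about the covariance formula.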

Original abstract

In this article, we analyse convergence of the averaged Adam optimizer to an attracting zero of the Adam vector field. We provide a central limit theorem that, in particular, quantifies exactly the speed of convergence. The order of convergence is $n^{-1/2}$ in the number of steps of the algorithm which coincides with the order observed for classical stochastic approximation algorithms. The covariance in the central limit theorem is given in terms of properties of the Adam algorithm in the state of the attractor.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 0 minor

Summary. The manuscript claims to analyze the convergence of averaged Adam iterates to an attracting zero of the Adam vector field and to establish a central limit theorem for these iterates. The CLT quantifies the convergence speed as $n^{-1/2}$ (matching classical stochastic approximation), with the limiting covariance expressed in terms of properties of the algorithm at the attractor.

Significance. If the result holds under verifiable conditions, it would supply a precise asymptotic characterization of averaged Adam, extending classical SA theory to this adaptive optimizer and potentially aiding analysis of convergence rates in non-convex optimization.

major comments (1)
  1. [Abstract / Introduction] The stated CLT presupposes that the averaged iterates converge to an attracting zero of the (nonlinear, state-dependent) Adam vector field, yet no primitive conditions are supplied on the objective, gradient noise, or Jacobian spectrum that would guarantee existence or global attraction of such a zero. This renders the theorem conditional on an unverified dynamical assumption that is load-bearing for the central claim.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their careful reading and constructive comments on our manuscript. We respond to the major comment below.

Point-by-point responses
  1. Referee: [Abstract / Introduction] The stated CLT presupposes that the averaged iterates converge to an attracting zero of the (nonlinear, state-dependent) Adam vector field, yet no primitive conditions are supplied on the objective, gradient noise, or Jacobian spectrum that would guarantee existence or global attraction of such a zero. This renders the theorem conditional on an unverified dynamical assumption that is load-bearing for the central claim.

    Authors: We acknowledge that the central limit theorem is established conditionally on the averaged Adam iterates converging to an attracting zero of the (nonlinear) Adam vector field. The manuscript's contribution lies in deriving the precise $n^{-1/2}$ rate and the explicit form of the limiting covariance expressed via the attractor properties, thereby extending the classical stochastic approximation CLT to this adaptive setting. Establishing primitive conditions on the objective, noise, or Jacobian spectrum that would guarantee existence and global attraction of such zeros is a distinct and technically demanding question concerning the mean-field dynamics; it lies outside the scope of the present work. We will revise the introduction to state this modeling assumption more explicitly and to clarify the conditional nature of the result.

    Revision: partial

Circularity Check

0 steps flagged

No circularity: the CLT is derived under an explicit dynamical assumption, and its rate matches external classical SA results.

Full rationale

The provided abstract and context present a conditional central limit theorem for averaged Adam iterates, assuming convergence to an attracting zero of the Adam vector field and stating that the $n^{-1/2}$ rate coincides with (but is not derived from) classical stochastic approximation algorithms. No equations, self-citations, fitted parameters, or ansatzes are shown that reduce the claimed result to its own inputs by construction. The derivation chain is therefore self-contained as a standard application of CLT methods to the given stochastic recursion under the stated hypothesis.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The abstract supplies no information on free parameters, axioms, or invented entities used in the proof.

pith-pipeline@v0.9.1-grok · 5601 in / 920 out tokens · 18914 ms · 2026-06-26T13:27:05.718102+00:00 · methodology

