Central limit theorem for the averaged Adam optimizer
Pith reviewed 2026-06-26 13:27 UTC · model grok-4.3
The pith
The averaged Adam optimizer satisfies a central limit theorem with convergence rate exactly n^{-1/2}.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Under the assumption that the averaged Adam iterates converge to an attracting zero of the Adam vector field, a central limit theorem holds for these iterates: after centering at the attractor and scaling by the square root of the number of steps n, their distribution converges to a normal law whose covariance is determined by properties of the Adam algorithm evaluated at the attractor.
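In symbols, the claim has roughly the following shape. The notation is an assumption made here for illustration and is not taken from the paper: $\theta_k$ denotes the Adam iterates, $\bar\theta_n$ their average (assumed to be a plain arithmetic mean), $\theta^{*}$ the attracting zero, and $\Sigma$ the limiting covariance.

```latex
\bar\theta_n = \frac{1}{n}\sum_{k=1}^{n}\theta_k,
\qquad
\sqrt{n}\,\bigl(\bar\theta_n - \theta^{*}\bigr)
\;\xrightarrow{\;d\;}\;
\mathcal{N}(0,\Sigma)
\quad\text{as } n\to\infty .
```

Read this way, the error $\bar\theta_n - \theta^{*}$ is of order $n^{-1/2}$, and $\Sigma$ is the quantity the abstract describes as given in terms of properties of the Adam algorithm at the attractor.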
What carries the argument
The Adam vector field and its attracting zero, together with the averaged iterates whose fluctuations are described by the central limit theorem.
If this is right
- The convergence speed of the averaged Adam iterates is of order n^{-1/2}.
- The asymptotic covariance is given explicitly by quantities derived from the Adam algorithm at the attractor.
- The rate matches the one known for classical non-adaptive stochastic approximation schemes.
- The theorem concerns the averaged Adam iterates, not the raw iterates.
Where Pith is reading between the lines
- If the theorem holds, then averaged Adam can be equipped with the same asymptotic error bounds used in Robbins-Monro analysis for practical uncertainty quantification.
- The explicit covariance formula could be used to construct confidence regions around the estimated optimum in high-dimensional training.
- The approach might extend to other adaptive methods such as RMSprop once their vector fields are shown to possess attracting zeros.
- Testing the predicted covariance on benchmark optimization problems would provide a direct check independent of the proof.
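The confidence-region idea above can be sketched as follows. This is a minimal illustration of how any CLT of the form sqrt(n) * (averaged iterate - attractor) -> N(0, Sigma) turns into per-coordinate intervals. The function name `clt_confidence_intervals` and its arguments are hypothetical, and the covariance estimate `sigma_diag` is an input the user must supply; the paper's covariance formula is not reproduced here.

```python
import math
from statistics import NormalDist


def clt_confidence_intervals(theta_bar, sigma_diag, n_steps, level=0.95):
    """Per-coordinate asymptotic confidence intervals implied by a CLT
    sqrt(n) * (theta_bar - theta_star) -> N(0, Sigma).

    theta_bar  : averaged iterate, one entry per coordinate
    sigma_diag : diagonal of an estimate of Sigma (hypothetical input,
                 not the paper's formula)
    n_steps    : number of optimizer steps n
    """
    # Two-sided standard normal quantile, about 1.96 for level=0.95.
    z = NormalDist().inv_cdf(0.5 + level / 2.0)
    intervals = []
    for mean, var in zip(theta_bar, sigma_diag):
        half_width = z * math.sqrt(var / n_steps)
        intervals.append((mean - half_width, mean + half_width))
    return intervals


if __name__ == "__main__":
    # Illustrative numbers only: two coordinates, 100 steps.
    print(clt_confidence_intervals([0.0, 1.0], [1.0, 4.0], n_steps=100))
```

With a unit variance and 100 steps, the half-width of the 95% interval is about 0.196. The intervals are only asymptotically valid, and only under the premise that the averaged iterates converge to the attractor.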
Load-bearing premise
There exists an attracting zero of the Adam vector field to which the averaged iterates converge.
What would settle it
Numerical experiments or simulations in which the premise holds (the averaged Adam iterates do converge to the attractor) but the error decays at a rate other than n^{-1/2} would contradict the central limit theorem.
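A toy version of such an experiment might look like the sketch below. Everything in it is an assumption chosen for illustration and is not taken from the paper: the objective f(theta) = theta^2 / 2 with Gaussian gradient noise, the standard Adam update of Kingma and Ba with a step size proportional to k^{-0.6}, a plain arithmetic average of the iterates, and the helper names `averaged_adam_error` and `scaled_rms_error`. If the n^{-1/2} rate holds, the root-mean-square error multiplied by sqrt(n) should stay of order one as n grows instead of drifting toward zero or infinity.

```python
import math
import random


def averaged_adam_error(n_steps, rng, lr0=0.5, decay=0.6,
                        beta1=0.9, beta2=0.999, eps=1e-8, noise_std=1.0):
    """Run Adam on f(theta) = theta**2 / 2 with noisy gradients and return
    the arithmetic average of the iterates. The minimizer is 0, so the
    returned value is also the error of the averaged iterate."""
    theta, m, v, running_sum = 1.0, 0.0, 0.0, 0.0
    for k in range(1, n_steps + 1):
        grad = theta + noise_std * rng.gauss(0.0, 1.0)  # noisy gradient
        m = beta1 * m + (1.0 - beta1) * grad            # first-moment estimate
        v = beta2 * v + (1.0 - beta2) * grad * grad     # second-moment estimate
        m_hat = m / (1.0 - beta1 ** k)                  # bias corrections
        v_hat = v / (1.0 - beta2 ** k)
        step = lr0 * k ** (-decay)                      # assumed step-size schedule
        theta -= step * m_hat / (math.sqrt(v_hat) + eps)
        running_sum += theta
    return running_sum / n_steps


def scaled_rms_error(n_steps, n_runs=100, seed=0):
    """sqrt(n) times the RMS error of the averaged iterate over n_runs runs."""
    rng = random.Random(seed)
    errors = [averaged_adam_error(n_steps, rng) for _ in range(n_runs)]
    rms = math.sqrt(sum(e * e for e in errors) / n_runs)
    return math.sqrt(n_steps) * rms


if __name__ == "__main__":
    for n in (200, 800, 3200):
        print(n, round(scaled_rms_error(n), 3))
```

A one-dimensional quadratic cannot confirm the theorem. It only shows what a check would measure, and a real test would also compare the empirical covariance with the formula stated in the paper.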
Original abstract
In this article, we analyse convergence of the averaged Adam optimizer to an attracting zero of the Adam vector field. We provide a central limit theorem that, in particular, quantifies exactly the speed of convergence. The order of convergence is $n^{-1/2}$ in the number of steps of the algorithm which coincides with the order observed for classical stochastic approximation algorithms. The covariance in the central limit theorem is given in terms of properties of the Adam algorithm in the state of the attractor.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript claims to analyze the convergence of averaged Adam iterates to an attracting zero of the Adam vector field and to establish a central limit theorem for these iterates. The CLT quantifies the convergence speed as n^{-1/2} (matching classical stochastic approximation) with the limiting covariance expressed in terms of the attractor properties.
Significance. If the result holds under verifiable conditions, it would supply a precise asymptotic characterization of averaged Adam, extending classical SA theory to this adaptive optimizer and potentially aiding analysis of convergence rates in non-convex optimization.
Major comments (1)
- [Abstract / Introduction] The stated CLT presupposes that the averaged iterates converge to an attracting zero of the (nonlinear, state-dependent) Adam vector field, yet no primitive conditions are supplied on the objective, gradient noise, or Jacobian spectrum that would guarantee existence or global attraction of such a zero. This renders the theorem conditional on an unverified dynamical assumption that is load-bearing for the central claim.
Simulated Author's Rebuttal
We thank the referee for their careful reading and constructive comments on our manuscript. We respond to the major comment below.
Point-by-point responses
- Referee: [Abstract / Introduction] The stated CLT presupposes that the averaged iterates converge to an attracting zero of the (nonlinear, state-dependent) Adam vector field, yet no primitive conditions are supplied on the objective, gradient noise, or Jacobian spectrum that would guarantee existence or global attraction of such a zero. This renders the theorem conditional on an unverified dynamical assumption that is load-bearing for the central claim.
Authors: We acknowledge that the central limit theorem is established conditionally on the averaged Adam iterates converging to an attracting zero of the (nonlinear) Adam vector field. The manuscript's contribution lies in deriving the precise n^{-1/2} rate and the explicit form of the limiting covariance expressed via the attractor properties, thereby extending the classical stochastic approximation CLT to this adaptive setting. Establishing primitive conditions on the objective, noise, or Jacobian spectrum that would guarantee existence and global attraction of such zeros is a distinct and technically demanding question concerning the mean-field dynamics; it lies outside the scope of the present work. We will revise the introduction to state this modeling assumption more explicitly and to clarify the conditional nature of the result.
Revision: partial
Circularity Check
No circularity: CLT derived under explicit dynamical assumption with rate matching external classical SA results
Full rationale
The provided abstract and context present a conditional central limit theorem for averaged Adam iterates, assuming convergence to an attracting zero of the Adam vector field and stating that the n^{-1/2} rate coincides with (but is not derived from) classical stochastic approximation algorithms. No equations, self-citations, fitted parameters, or ansatzes are shown that reduce the claimed result to its own inputs by construction. The derivation chain is therefore self-contained as a standard application of CLT methods to the given stochastic recursion under the stated hypothesis.