Central limit theorem for the averaged Adam optimizer

Arnulf Jentzen; Steffen Dereich

arxiv: 2606.21433 · v1 · pith:G55AMEKKnew · submitted 2026-06-19 · math.PR · cs.LG · math.ST · stat.TH


Pith reviewed 2026-06-26 13:27 UTC · model grok-4.3


keywords: central limit theorem, Adam optimizer, averaged iterates, stochastic approximation, convergence rate, vector field, attractor


The pith

The averaged Adam optimizer satisfies a central limit theorem, with convergence of exact order $n^{-1/2}$ in the number of steps $n$.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper analyses the convergence of averaged iterates of the Adam optimizer toward an attracting zero of the Adam vector field. It establishes a central limit theorem that gives the precise asymptotic distribution after normalization by the square root of the number of steps. The limiting rate matches the classical order seen in stochastic approximation methods. The covariance matrix of the limit is expressed directly in terms of the Adam algorithm's behavior at the attractor. A reader would care because the result supplies an exact quantitative description of the error for this widely used optimizer.

Core claim

Under the assumption that the averaged Adam iterates converge to an attracting zero of the Adam vector field, a central limit theorem holds for these iterates: after centering at the attractor and scaling by the square root of the number of steps n, the distribution converges to a normal law whose covariance is determined by the properties of the Adam algorithm evaluated in the state of the attractor.
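In symbols, the claim can be written schematically as below. The notation is an assumption introduced here for illustration, since the abstract fixes none: $\bar\theta_n$ stands for the averaged iterate after $n$ steps, $\theta^*$ for the attracting zero, and $\Sigma$ for the limiting covariance, whose explicit form is not given in the abstract. The precise mode of convergence used in the paper may differ from plain convergence in distribution.

```latex
\[
  \sqrt{n}\,\bigl(\bar\theta_n - \theta^*\bigr)
  \;\xrightarrow{\;d\;}\; \mathcal{N}(0, \Sigma)
  \qquad \text{as } n \to \infty .
\]
```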

What carries the argument

The Adam vector field and its attracting zero, together with the averaged iterates whose fluctuations are described by the central limit theorem.

If this is right

The convergence speed of the averaged Adam iterates is of order n to the power of minus one half.
The asymptotic covariance is given explicitly by quantities derived from the Adam algorithm at the attractor.
The result recovers the same rate previously known only for classical non-adaptive stochastic approximation schemes.
The theorem applies specifically after averaging the Adam updates.
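To make the object of these statements concrete, here is a minimal sketch of an averaged Adam run. Everything in it is an assumption chosen for illustration rather than the authors' setup: the toy objective 0.5 * (theta - 1)^2, the standard normal gradient noise, the step sizes c * k^(-rho), the textbook bias-corrected Adam update, and uniform averaging over all iterates. The names `averaged_adam` and `noisy_quadratic_grad` are hypothetical.

```python
import math
import random


def averaged_adam(grad, theta0, n_steps, rng, c=0.5, rho=0.75,
                  beta1=0.9, beta2=0.999, eps=1e-8):
    """Run scalar Adam with step sizes c * k**(-rho) and return the
    running average of the iterates theta_1, ..., theta_n."""
    theta, m, v, avg = theta0, 0.0, 0.0, 0.0
    for k in range(1, n_steps + 1):
        g = grad(theta, rng)                     # noisy gradient sample
        m = beta1 * m + (1.0 - beta1) * g        # first-moment (momentum) estimate
        v = beta2 * v + (1.0 - beta2) * g * g    # second-moment estimate
        m_hat = m / (1.0 - beta1 ** k)           # bias-corrected moments
        v_hat = v / (1.0 - beta2 ** k)
        theta -= c * k ** (-rho) * m_hat / (math.sqrt(v_hat) + eps)
        avg += (theta - avg) / k                 # incremental mean of the iterates
    return avg


def noisy_quadratic_grad(theta, rng):
    # Assumed toy model: gradient of 0.5 * (theta - 1)^2 plus N(0, 1) noise.
    return (theta - 1.0) + rng.gauss(0.0, 1.0)


rng = random.Random(0)
theta_bar = averaged_adam(noisy_quadratic_grad, theta0=0.0, n_steps=20000, rng=rng)
print(theta_bar)
```

In this toy model the minimizer is 1, so the printed average should land near 1. The statements above concern how the remaining error shrinks as `n_steps` grows.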

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

If the theorem holds, then averaged Adam can be equipped with the same asymptotic error bounds used in Robbins-Monro analysis for practical uncertainty quantification.
The explicit covariance formula could be used to construct confidence regions around the estimated optimum in high-dimensional training.
The approach might extend to other adaptive methods such as RMSprop once their vector fields are shown to possess attracting zeros.
Testing the predicted covariance on benchmark optimization problems would provide a direct check independent of the proof.
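As one hedged illustration of the uncertainty-quantification idea, the sketch below builds a normal-approximation confidence interval from independent replicate estimates. It does not use the paper's covariance formula, which is not reproduced here. It swaps in the empirical variance across replicates, and it covers only the scalar case. The function name `normal_confidence_interval` and the input numbers are hypothetical.

```python
import math
import statistics


def normal_confidence_interval(replicate_estimates, level=0.95):
    """Normal-approximation interval around the mean of independent
    replicate estimates, e.g. averaged iterates from separate runs."""
    r = len(replicate_estimates)
    center = statistics.fmean(replicate_estimates)
    # Standard error of the mean from the sample standard deviation.
    std_err = statistics.stdev(replicate_estimates) / math.sqrt(r)
    # Two-sided standard normal quantile for the requested level.
    z = statistics.NormalDist().inv_cdf(0.5 + level / 2.0)
    return center - z * std_err, center + z * std_err


# Made-up averaged iterates from three independent runs (illustration only).
low, high = normal_confidence_interval([0.9, 1.0, 1.1])
print(low, high)
```

With so few replicates a Student t quantile would be the more cautious choice. The normal quantile is used here only because it mirrors the Gaussian limit in the theorem.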

Load-bearing premise

There exists an attracting zero of the Adam vector field to which the averaged iterates converge.

What would settle it

Numerical experiments or simulations in which the averaged Adam iterates converge to the attractor at a rate other than n to the power of minus one half would contradict the central limit theorem.
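A simulation of this kind could be sketched as follows. All modelling choices are assumptions made for illustration and are not taken from the paper: the toy objective 0.5 * theta^2 with minimizer 0, standard normal gradient noise, step sizes c * k^(-rho), the textbook Adam update, the run lengths 500 and 8000, and 100 runs per length. The names `averaged_adam_error` and `rmse` are hypothetical. A two-point log-log slope is a crude estimate, so it can only flag gross departures from the predicted exponent.

```python
import math
import random


def averaged_adam_error(n_steps, rng, c=0.5, rho=0.75,
                        beta1=0.9, beta2=0.999, eps=1e-8):
    """Averaged Adam iterate on 0.5 * theta^2 with N(0, 1) gradient noise.
    The minimizer is 0, so the returned average is also the error."""
    theta, m, v, avg = 0.0, 0.0, 0.0, 0.0
    for k in range(1, n_steps + 1):
        g = theta + rng.gauss(0.0, 1.0)
        m = beta1 * m + (1.0 - beta1) * g
        v = beta2 * v + (1.0 - beta2) * g * g
        m_hat = m / (1.0 - beta1 ** k)
        v_hat = v / (1.0 - beta2 ** k)
        theta -= c * k ** (-rho) * m_hat / (math.sqrt(v_hat) + eps)
        avg += (theta - avg) / k
    return avg


def rmse(n_steps, n_runs, rng):
    # Root mean squared error of the averaged iterate over independent runs.
    total = sum(averaged_adam_error(n_steps, rng) ** 2 for _ in range(n_runs))
    return math.sqrt(total / n_runs)


rng = random.Random(1)
n_small, n_large = 500, 8000
rmse_small = rmse(n_small, 100, rng)
rmse_large = rmse(n_large, 100, rng)
# Empirical exponent: slope of log(RMSE) against log(n) between the two lengths.
slope = math.log(rmse_large / rmse_small) / math.log(n_large / n_small)
print(slope)
```

A slope near -0.5 would be consistent with the predicted order. A slope far from it in a setting that meets the theorem's hypotheses would be the kind of evidence described above.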

Original abstract

In this article, we analyse convergence of the averaged Adam optimizer to an attracting zero of the Adam vector field. We provide a central limit theorem that, in particular, quantifies exactly the speed of convergence. The order of convergence is $n^{-1/2}$ in the number of steps of the algorithm which coincides with the order observed for classical stochastic approximation algorithms. The covariance in the central limit theorem is given in terms of properties of the Adam algorithm in the state of the attractor.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

CLT for averaged Adam recovers n^{-1/2} rate but stays conditional on unverified convergence to an attractor of the nonlinear vector field.


This paper derives a central limit theorem for the averaged Adam optimizer that recovers the usual n^{-1/2} rate, with covariance expressed at the attractor. The result is conditional on the averaged iterates converging to an attracting zero of the Adam vector field.

What is new is the direct application of stochastic approximation CLT machinery to Adam's update rule, which folds in momentum and the running second-moment estimate. The covariance formula is written in terms of the vector field evaluated at that point.
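For orientation, the textbook Adam recursion of Kingma and Ba, extended by a running average of the iterates, reads as below. This is an assumption about the general shape only. The variant analysed in the paper (step sizes $\gamma_n$, bias correction, placement of $\varepsilon$, averaging window) is not specified in the abstract and may differ. Here $g_n$ denotes a stochastic gradient evaluated at $\theta_{n-1}$, and squares, roots and quotients act componentwise.

```latex
\begin{align*}
  m_n &= \beta_1 m_{n-1} + (1-\beta_1)\, g_n, \\
  v_n &= \beta_2 v_{n-1} + (1-\beta_2)\, g_n^{2}, \\
  \theta_n &= \theta_{n-1}
     - \gamma_n \, \frac{m_n / (1-\beta_1^{n})}{\sqrt{v_n / (1-\beta_2^{n})} + \varepsilon}, \\
  \bar\theta_n &= \frac{1}{n} \sum_{k=1}^{n} \theta_k .
\end{align*}
```

The momentum term $m_n$ and the second-moment estimate $v_n$ are the two moment recursions that make the associated vector field state-dependent.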

The work does a clean job of matching the rate to classical stochastic approximation, which lines up with observed behavior for both Adam and simpler methods.

The soft spot is the missing primitive conditions. The Adam vector field is nonlinear and state-dependent through the moment recursions, so existence or global attraction of the zero is not automatic. The paper supplies no assumptions on the objective (strong convexity, smoothness, or noise moments) that would make the convergence hold in general. The stress-test note is accurate on this point; the theorem remains conditional.

This is for readers already working on asymptotic analysis of adaptive optimizers. Someone who wants an explicit CLT expression for Adam would get something concrete from it.

It deserves peer review so the proofs can be checked and the scope of the assumption clarified.

Referee Report

1 major / 0 minor

Summary. The manuscript claims to analyze the convergence of averaged Adam iterates to an attracting zero of the Adam vector field and to establish a central limit theorem for these iterates. The CLT quantifies the convergence speed as n^{-1/2} (matching classical stochastic approximation) with the limiting covariance expressed in terms of the attractor properties.

Significance. If the result holds under verifiable conditions, it would supply a precise asymptotic characterization of averaged Adam, extending classical SA theory to this adaptive optimizer and potentially aiding analysis of convergence rates in non-convex optimization.

major comments (1)

[Abstract / Introduction] The stated CLT presupposes that the averaged iterates converge to an attracting zero of the (nonlinear, state-dependent) Adam vector field, yet no primitive conditions are supplied on the objective, gradient noise, or Jacobian spectrum that would guarantee existence or global attraction of such a zero. This renders the theorem conditional on an unverified dynamical assumption that is load-bearing for the central claim.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their careful reading and constructive comments on our manuscript. We respond to the major comment below.


Referee: [Abstract / Introduction] The stated CLT presupposes that the averaged iterates converge to an attracting zero of the (nonlinear, state-dependent) Adam vector field, yet no primitive conditions are supplied on the objective, gradient noise, or Jacobian spectrum that would guarantee existence or global attraction of such a zero. This renders the theorem conditional on an unverified dynamical assumption that is load-bearing for the central claim.

Authors: We acknowledge that the central limit theorem is established conditionally on the averaged Adam iterates converging to an attracting zero of the (nonlinear) Adam vector field. The manuscript's contribution lies in deriving the precise n^{-1/2} rate and the explicit form of the limiting covariance expressed via the attractor properties, thereby extending the classical stochastic approximation CLT to this adaptive setting. Establishing primitive conditions on the objective, noise, or Jacobian spectrum that would guarantee existence and global attraction of such zeros is a distinct and technically demanding question concerning the mean-field dynamics; it lies outside the scope of the present work. We will revise the introduction to state this modeling assumption more explicitly and to clarify the conditional nature of the result.

Revision: partial

Circularity Check

0 steps flagged

No circularity: CLT derived under explicit dynamical assumption with rate matching external classical SA results


The provided abstract and context present a conditional central limit theorem for averaged Adam iterates, assuming convergence to an attracting zero of the Adam vector field and stating that the n^{-1/2} rate coincides with (but is not derived from) classical stochastic approximation algorithms. No equations, self-citations, fitted parameters, or ansatzes are shown that reduce the claimed result to its own inputs by construction. The derivation chain is therefore self-contained as a standard application of CLT methods to the given stochastic recursion under the stated hypothesis.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The abstract supplies no information on free parameters, axioms, or invented entities used in the proof.

pith-pipeline@v0.9.1-grok · 5601 in / 920 out tokens · 18914 ms · 2026-06-26T13:27:05.718102+00:00 · methodology


