A Polyak-Ruppert Central Limit Theorem for SA-Adam with Momentum and Non-Convergent Adaptive Preconditioning

Sunyoung An; Xiaoming Huo

arxiv: 2606.17364 · v1 · pith:24WJS6F4new · submitted 2026-06-15 · 🧮 math.ST · math.OC· stat.ML· stat.TH

A Polyak-Ruppert Central Limit Theorem for SA-Adam with Momentum and Non-Convergent Adaptive Preconditioning

Sunyoung An , Xiaoming Huo This is my paper

Pith reviewed 2026-06-27 01:44 UTC · model grok-4.3

classification 🧮 math.ST math.OCstat.MLstat.TH

keywords Polyak-Ruppert averagingcentral limit theoremstochastic approximationAdam optimizermomentumadaptive preconditioningstochastic gradient descentone-pass inference

0 comments

The pith

SA-Adam with momentum and non-convergent preconditioning obeys the same Polyak-Ruppert CLT as plain SGD.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proves that Polyak-Ruppert averaged iterates of SA-Adam satisfy a central limit theorem whose covariance is exactly the classical sandwich form from ordinary stochastic gradient descent. This holds when the augmented state of iterate plus momentum buffer is treated as a time-varying linear stochastic approximation that locally stabilizes, and when the momentum gain vanishes at a sub-linear rate. The adaptivity therefore becomes invisible in the limiting distribution. The same result extends to the ridge-penalized sandwich when L2 weight decay is present. The analysis supplies the required positive-drift stability and projection identity to reach the marginal covariance claim.

Core claim

Treating the augmented state (iterate, momentum buffer) as a time-varying linear stochastic approximation, positive drift stability and a non-autonomous Polyak-Ruppert CLT are established together with a projection identity; the resulting iterate-marginal covariance equals the plain SGD sandwich H^{-1} S H^{-1}, so adaptivity is asymptotically invisible. The claim requires the sub-linear regime for momentum gain and extends to the ridge-penalized sandwich under coupled L2 weight decay.

What carries the argument

Augmented state of iterate and momentum buffer viewed as time-varying linear stochastic approximation, together with the projection identity that isolates the marginal covariance.

If this is right

The central limit theorem for averaged iterates is identical to that of SGD even with momentum and non-convergent adaptive preconditioning.
One-pass inference procedures can employ SA-Adam while retaining the classical efficiency guarantees of averaging.
Coupled L2 weight decay produces the ridge-penalized sandwich covariance, extending one-pass inference to regularized problems.
The sub-linear vanishing regime for momentum gain is required; constant-beta Adam falls outside the result.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar augmented-state arguments may apply to other momentum-based adaptive methods whose preconditioners fail to converge.
Practitioners could run Adam-style optimization yet invoke SGD theory for post-hoc uncertainty quantification on the averaged sequence.
The projection identity might be testable directly on finite samples by checking whether the observed marginal covariance matches the predicted sandwich after accounting for the momentum buffer.

Load-bearing premise

The augmented state of iterate and momentum buffer must locally stabilize when treated as a time-varying linear stochastic approximation.

What would settle it

An empirical covariance computed from many independent runs of Polyak-Ruppert averaged SA-Adam that differs from the SGD sandwich form under verified local stabilization and sub-linear momentum gain.

Figures

Figures reproduced from arXiv: 2606.17364 by Sunyoung An, Xiaoming Huo.

**Figure 2.** Figure 2: Necessity of γ < 1 (exact scalar evaluation). Left: V (γ, n) = n Var[xn] versus n for several γ; curves with γ < 1 descend toward the sandwich V = 1 (more slowly as γ ↑ 1), while γ = 1 plateaus at the predicted inflated limit 1 + 1/(2c1 − 1 − α). Right: V (γ, 107 ), showing the sharp transition at γ = 1. the rate of approach slows as γ → 1. Finally, the P-independence at the heart of the identity is checke… view at source ↗

**Figure 3.** Figure 3: Semi-synthetic coverage of averaged SA-Adam vs. averaged SGD (real [PITH_FULL_IMAGE:figures/full_fig_p030_3.png] view at source ↗

read the original abstract

Adaptive optimizers combining preconditioning, momentum, and weight decay (Adam and AdamW) are, under Polyak-Ruppert averaging, candidate engines for one-pass inference. Does the averaged iterate keep the classical Polyak-Ruppert central limit theorem (CLT), with sandwich covariance $H^{-1}SH^{-1}$ (Hessian $H$, gradient covariance $S$), under momentum and non-convergent preconditioning? The preconditioner-only analysis does not carry over: with momentum the canonical decomposition collapses to a tautology. Treating the augmented state (iterate, momentum buffer) as a time-varying linear stochastic approximation (SA), we prove (under local stabilization) positive drift stability, a non-autonomous Polyak-Ruppert CLT, and a projection identity. The upshot: the iterate-marginal covariance is exactly the plain stochastic gradient descent (SGD) sandwich $H^{-1}SH^{-1}$, so the adaptivity is asymptotically invisible. This holds for SA-Adam (sub-linearly vanishing momentum gain, $\gamma\in(\alpha,1)$; the sub-linear regime is essential), not constant-$\beta$ deployed Adam. Coupled $L_2$ weight decay yields the ridge-penalized sandwich, extending one-pass inference to regularized problems.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper claims SA-Adam with sublinear momentum yields the same averaged CLT covariance as plain SGD, but the argument rests on an assumed local stabilization of the augmented state that is not automatic with non-convergent preconditioning.

read the letter

The headline result is that averaged iterates from this SA-Adam variant keep the classical sandwich covariance H^{-1} S H^{-1} even with momentum and a non-convergent preconditioner. The adaptivity drops out of the marginal distribution.

What is new is the shift to an augmented state that includes the momentum buffer, treated as a time-varying linear stochastic approximation. This leads to a non-autonomous Polyak-Ruppert CLT and a projection identity that recovers the SGD covariance for the iterate marginal. The extension to coupled L2 weight decay, giving the ridge-penalized sandwich, is a clean addition for regularized one-pass problems.

The paper does a reasonable job showing why the preconditioner-only arguments break once momentum is present and why the sub-linear momentum schedule gamma in (alpha,1) matters. That part is clearly motivated.

The soft spot is the local stabilization assumption on the augmented process. The abstract presents it as a hypothesis needed for positive drift stability, not something derived from the remaining conditions. With an explicitly non-convergent preconditioner, this step is not obvious and no explicit Lyapunov function or radius is supplied in the summary to confirm when it holds. If stabilization fails, both the CLT and the projection identity stop applying. The full proof would need to be checked on that point.

This is for readers working on asymptotic theory for stochastic approximation and adaptive methods in statistics. Someone already familiar with Polyak-Ruppert results and linear SA would get the most out of the technical extensions.

The work engages honestly with the literature on these CLTs and identifies a genuine gap. It deserves a serious referee to verify the stabilization step and the details of the non-autonomous argument.

Recommendation: send it to peer review.

Referee Report

3 major / 1 minor

Summary. The paper proves a Polyak-Ruppert CLT for SA-Adam (momentum with sub-linear gain γ ∈ (α,1) and non-convergent adaptive preconditioning). Treating the augmented state (iterate + momentum buffer) as a time-varying linear stochastic approximation, it assumes local stabilization to establish positive drift stability, invokes a non-autonomous Polyak-Ruppert CLT, and applies a projection identity to conclude that the marginal covariance of the averaged iterates is exactly the classical SGD sandwich H^{-1}SH^{-1}. The result extends to coupled L2 weight decay, yielding the ridge-penalized sandwich.

Significance. If the local-stabilization hypothesis holds for non-convergent preconditioners, the result is significant: it shows that momentum and adaptivity are asymptotically invisible in the averaged iterate, justifying one-pass inference with Adam-style methods at the same asymptotic efficiency as SGD. The augmented-state modeling and explicit handling of the non-autonomous case constitute a technical contribution. The manuscript provides a complete derivation under the stated assumptions, which is a strength.

major comments (3)

[augmented-state modeling and local-stabilization hypothesis] The local stabilization of the augmented state is assumed rather than derived (abstract and the augmented-state modeling section). Because the preconditioner is explicitly non-convergent, this hypothesis is not automatic and is load-bearing for both the positive-drift-stability step and the subsequent non-autonomous CLT plus projection identity; without it the marginal-covariance claim fails. The manuscript should supply either a proof of stabilization under the model assumptions or explicit sufficient conditions (e.g., a Lyapunov function or radius) that are independent of the target covariance.
[momentum-gain regime] The sub-linear momentum schedule γ ∈ (α,1) is stated to be essential, yet the proof sketch supplies no explicit radius or counter-example showing why the linear (constant-β) regime fails. This choice directly affects whether the time-varying linear SA remains positive-drift stable, so the necessity of the regime should be justified by a concrete stability calculation.
[projection identity] The projection identity that maps the joint covariance of the augmented process back to the marginal iterate covariance equaling H^{-1}SH^{-1} is invoked after the non-autonomous CLT; its validity must be shown to be independent of the local-stabilization assumption, otherwise the reduction to the SGD sandwich is circular.

minor comments (1)

Notation for the time-varying linear SA coefficients and the precise definition of the projection operator should be collected in a single preliminary section for readability.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the careful reading and constructive comments. We address each major comment below.

read point-by-point responses

Referee: [augmented-state modeling and local-stabilization hypothesis] The local stabilization of the augmented state is assumed rather than derived (abstract and the augmented-state modeling section). Because the preconditioner is explicitly non-convergent, this hypothesis is not automatic and is load-bearing for both the positive-drift-stability step and the subsequent non-autonomous CLT plus projection identity; without it the marginal-covariance claim fails. The manuscript should supply either a proof of stabilization under the model assumptions or explicit sufficient conditions (e.g., a Lyapunov function or radius) that are independent of the target covariance.

Authors: We agree the local-stabilization hypothesis is load-bearing. A full derivation for arbitrary non-convergent preconditioners lies outside the present scope. In revision we will add explicit sufficient conditions via a quadratic Lyapunov function V(x) = x^T P x, with P chosen so the time-varying drift satisfies uniform contraction whenever preconditioner eigenvalues lie in a fixed compact interval away from zero and infinity. These conditions are independent of the gradient covariance S. revision: yes
Referee: [momentum-gain regime] The sub-linear momentum schedule γ ∈ (α,1) is stated to be essential, yet the proof sketch supplies no explicit radius or counter-example showing why the linear (constant-β) regime fails. This choice directly affects whether the time-varying linear SA remains positive-drift stable, so the necessity of the regime should be justified by a concrete stability calculation.

Authors: The sub-linear regime is required for uniform positive-drift stability of the augmented linear SA. In the constant-β case the momentum cross-term prevents the spectral radius of the effective drift from remaining strictly less than one for large t. Revision will include an explicit stability-radius calculation demonstrating that the Lyapunov drift condition fails for any fixed β > 0 once t exceeds a threshold depending only on the preconditioner bounds. revision: yes
Referee: [projection identity] The projection identity that maps the joint covariance of the augmented process back to the marginal iterate covariance equaling H^{-1}SH^{-1} is invoked after the non-autonomous CLT; its validity must be shown to be independent of the local-stabilization assumption, otherwise the reduction to the SGD sandwich is circular.

Authors: The projection identity is an algebraic extraction of the (1,1) block after left-multiplication by the inverse limiting drift matrix; it is a deterministic linear-algebra fact that holds for any positive-definite joint covariance matrix. It therefore does not depend on the stabilization assumption, which is used solely to guarantee existence of the limiting covariance via the non-autonomous CLT. revision: no

Circularity Check

0 steps flagged

No significant circularity; result is conditional on explicit local stabilization assumption

full rationale

The abstract and description state the results hold 'under local stabilization' of the augmented state treated as time-varying linear SA. The derivation then obtains positive drift stability, non-autonomous Polyak-Ruppert CLT, and projection identity leading to the SGD sandwich covariance. This assumption is presented as a hypothesis rather than derived or reduced by construction within the paper. No self-definitional steps, fitted inputs renamed as predictions, or load-bearing self-citation chains are quoted or evident. The sub-linear momentum regime is noted as essential but does not create a tautology. The paper is self-contained under its stated assumptions.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Based solely on the abstract; the central claim rests on the local stabilization assumption and the requirement that momentum gain vanishes sub-linearly.

axioms (1)

domain assumption local stabilization of the augmented (iterate, momentum) state
Invoked to obtain positive drift stability and the non-autonomous Polyak-Ruppert CLT

pith-pipeline@v0.9.1-grok · 5775 in / 1169 out tokens · 39215 ms · 2026-06-27T01:44:04.708820+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

36 extracted references · 2 linked inside Pith

[1]

An S, Huo X (2026) When does dynamic preconditioning preserve the Polyak–Ruppert CLT? A stabi- lization threshold, arXiv preprint arXiv:2604.23498

Pith/arXiv arXiv 2026
[2]

Barakat A, Bianchi P (2021) Convergence and dynamical behavior of the ADAM algorithm for noncon- vex stochastic optimization.SIAM Journal on Optimization31(1):244–274

2021
[3]

Borkar VS (2008)Stochastic Approximation: A Dynamical Systems Viewpoint(Cambridge University Press and Hindustan Book Agency)

2008
[4]

Boyer C, Godichon-Baggioni A (2023) On the asymptotic rate of convergence of stochastic Newton algo- rithms and their weighted averaged versions.Computational Optimization and Applications84(3):921– 972

2023
[5]

Chen X, Lee JD, Tong XT, Zhang Y (2020) Statistical inference for model parameters in stochastic gradient descent.Annals of Statistics48(1):251–273

2020
[6]

Transactions on Machine Learning Research

D´ efossez A, Bottou L, Bach F, Usunier N (2022) A simple convergence proof of Adam and AdaGrad. Transactions on Machine Learning Research

2022
[7]

Dieuleveut A, Durmus A, Bach F (2020) Bridging the gap between constant step size stochastic gradient descent and Markov chains.Annals of Statistics48(3):1348–1382

2020
[8]

Duchi J, Hazan E, Singer Y (2011) Adaptive subgradient methods for online learning and stochastic optimization.Journal of Machine Learning Research12:2121–2159

2011
[9]

Efron B, Hastie T, Johnstone I, Tibshirani R (2004) Least angle regression.The Annals of Statistics 32(2):407–499. 42

2004
[10]

Gadat S, Panloup F, Saadane S (2018) Stochastic heavy ball.Electronic Journal of Statistics12(1):461– 529

2018
[11]

Hall P, Heyde CC (1980)Martingale Limit Theory and Its Application(Academic Press)

1980
[12]

Machine Learning69(2–3):169–192

Hazan E, Agarwal A, Kale S (2007) Logarithmic regret algorithms for online convex optimization. Machine Learning69(2–3):169–192

2007
[13]

Horn RA, Johnson CR (2013)Matrix Analysis(Cambridge University Press), 2nd edition

2013
[14]

Kaledin M, Moulines E, Naumov A, Tadic V, Wai HT (2020) Finite time analysis of linear two-timescale stochastic approximation with Markovian noise.Proceedings of the 33rd Conference on Learning Theory (COLT), volume 125 ofPMLR, 2144–2203, arXiv:2002.01268

arXiv 2020
[15]

Kingma DP, Ba J (2015) Adam: A method for stochastic optimization.International Conference on Learning Representations

2015
[16]

Annals of Applied Probability14(2):796–819

Konda VR, Tsitsiklis JN (2004) Convergence rate of linear two-time-scale stochastic approximation. Annals of Applied Probability14(2):796–819

2004
[17]

arXiv preprint arXiv:2506.23803

Kovalev D (2025) SGD with adaptive preconditioning: unified analysis and momentum acceleration. arXiv preprint arXiv:2506.23803

arXiv 2025
[18]

Lee S, Liao Y, Seo MH, Shin Y (2022) Fast and robust online inference with stochastic gradient descent via random scaling.Proceedings of the AAAI Conference on Artificial Intelligence36(7):7381–7389

2022
[19]

Leluc R, Portier F (2023) Asymptotic analysis of conditioned stochastic gradient descent.Transactions on Machine Learning Research

2023
[20]

Lessard L, Recht B, Packard A (2016) Analysis and design of optimization algorithms via integral quadratic constraints.SIAM Journal on Optimization26(1):57–95

2016
[21]

Loshchilov I, Hutter F (2019) Decoupled weight decay regularization.International Conference on Learn- ing Representations

2019
[22]

Mokkadem A, Pelletier M (2006) Convergence rate and averaging of nonlinear two-time-scale stochastic approximation algorithms.Annals of Applied Probability16(3):1671–1702

2006
[23]

Mou W, Li CJ, Wainwright MJ, Bartlett PL, Jordan MI (2020) On linear stochastic approximation: Fine-grained Polyak–Ruppert and non-asymptotic concentration.Proceedings of the 33rd Conference on Learning Theory (COLT), volume 125 ofPMLR, 2947–2997

2020
[24]

(2011) Scikit-learn: Machine learning in Python.Journal of Machine Learning Research12:2825–2830

Pedregosa F, et al. (2011) Scikit-learn: Machine learning in Python.Journal of Machine Learning Research12:2825–2830

2011
[25]

Polyak BT (1964) Some methods of speeding up the convergence of iteration methods.USSR Compu- tational Mathematics and Mathematical Physics4(5):1–17

1964
[26]

Polyak BT, Juditsky AB (1992) Acceleration of stochastic approximation by averaging.SIAM Journal on Control and Optimization30(4):838–855

1992
[27]

Reddi SJ, Kale S, Kumar S (2018) On the convergence of Adam and beyond.International Conference on Learning Representations

2018
[28]

Technical Report 781, Cornell University Operations Research and Industrial Engineering

Ruppert D (1988) Efficient estimators from a slowly convergent Robbins–Monro process. Technical Report 781, Cornell University Operations Research and Industrial Engineering

1988
[29]

Sebbouh O, Gower RM, Defazio A (2021) Almost sure convergence rates for stochastic gradient descent and stochastic heavy ball.Proceedings of the 34th Conference on Learning Theory (COLT), volume 134 ofPMLR, 3935–3971. 43

2021
[30]

Surendran S, Fermanian A, Godichon-Baggioni A, Le Corff S (2024) Non-asymptotic analysis of biased adaptive stochastic approximation.Advances in Neural Information Processing Systems 37 (NeurIPS 2024), 12897–12943

2024
[31]

Tang K, Liu W, Zhang Y, Chen X (2023) Acceleration of stochastic gradient descent with momentum by averaging: finite-sample rates and asymptotic normality.arXiv preprint arXiv:2305.17665

arXiv 2023
[32]

COURSERA: Neural Networks for Machine Learning

Tieleman T, Hinton G (2012) Lecture 6.5—RMSProp: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning

2012
[33]

van der Vaart AW (1998)Asymptotic Statistics(Cambridge University Press)

1998
[34]

Proceedings of the 43rd International Conference on Machine Learning (ICML), volume 306 ofPMLR, arXiv preprint arXiv:2604.23436

Wang H, Du X, Na S (2026) Inference of online Newton methods with Nesterov’s accelerated sketching. Proceedings of the 43rd International Conference on Machine Learning (ICML), volume 306 ofPMLR, arXiv preprint arXiv:2604.23436

Pith/arXiv arXiv 2026
[35]

Wei Z, Zhu W, Wu WB (2025) Weighted averaged stochastic gradient descent: asymptotic normality and optimality.arXiv preprint arXiv:2307.06915Version 3, 2025; first version 2023

arXiv 2025
[36]

Journal of the American Statistical Association118(541):393–404

Zhu W, Chen X, Wu WB (2023) Online covariance matrix estimation in stochastic gradient descent. Journal of the American Statistical Association118(541):393–404. 44

2023

[1] [1]

An S, Huo X (2026) When does dynamic preconditioning preserve the Polyak–Ruppert CLT? A stabi- lization threshold, arXiv preprint arXiv:2604.23498

Pith/arXiv arXiv 2026

[2] [2]

Barakat A, Bianchi P (2021) Convergence and dynamical behavior of the ADAM algorithm for noncon- vex stochastic optimization.SIAM Journal on Optimization31(1):244–274

2021

[3] [3]

Borkar VS (2008)Stochastic Approximation: A Dynamical Systems Viewpoint(Cambridge University Press and Hindustan Book Agency)

2008

[4] [4]

Boyer C, Godichon-Baggioni A (2023) On the asymptotic rate of convergence of stochastic Newton algo- rithms and their weighted averaged versions.Computational Optimization and Applications84(3):921– 972

2023

[5] [5]

Chen X, Lee JD, Tong XT, Zhang Y (2020) Statistical inference for model parameters in stochastic gradient descent.Annals of Statistics48(1):251–273

2020

[6] [6]

Transactions on Machine Learning Research

D´ efossez A, Bottou L, Bach F, Usunier N (2022) A simple convergence proof of Adam and AdaGrad. Transactions on Machine Learning Research

2022

[7] [7]

Dieuleveut A, Durmus A, Bach F (2020) Bridging the gap between constant step size stochastic gradient descent and Markov chains.Annals of Statistics48(3):1348–1382

2020

[8] [8]

Duchi J, Hazan E, Singer Y (2011) Adaptive subgradient methods for online learning and stochastic optimization.Journal of Machine Learning Research12:2121–2159

2011

[9] [9]

Efron B, Hastie T, Johnstone I, Tibshirani R (2004) Least angle regression.The Annals of Statistics 32(2):407–499. 42

2004

[10] [10]

Gadat S, Panloup F, Saadane S (2018) Stochastic heavy ball.Electronic Journal of Statistics12(1):461– 529

2018

[11] [11]

Hall P, Heyde CC (1980)Martingale Limit Theory and Its Application(Academic Press)

1980

[12] [12]

Machine Learning69(2–3):169–192

Hazan E, Agarwal A, Kale S (2007) Logarithmic regret algorithms for online convex optimization. Machine Learning69(2–3):169–192

2007

[13] [13]

Horn RA, Johnson CR (2013)Matrix Analysis(Cambridge University Press), 2nd edition

2013

[14] [14]

Kaledin M, Moulines E, Naumov A, Tadic V, Wai HT (2020) Finite time analysis of linear two-timescale stochastic approximation with Markovian noise.Proceedings of the 33rd Conference on Learning Theory (COLT), volume 125 ofPMLR, 2144–2203, arXiv:2002.01268

arXiv 2020

[15] [15]

Kingma DP, Ba J (2015) Adam: A method for stochastic optimization.International Conference on Learning Representations

2015

[16] [16]

Annals of Applied Probability14(2):796–819

Konda VR, Tsitsiklis JN (2004) Convergence rate of linear two-time-scale stochastic approximation. Annals of Applied Probability14(2):796–819

2004

[17] [17]

arXiv preprint arXiv:2506.23803

Kovalev D (2025) SGD with adaptive preconditioning: unified analysis and momentum acceleration. arXiv preprint arXiv:2506.23803

arXiv 2025

[18] [18]

Lee S, Liao Y, Seo MH, Shin Y (2022) Fast and robust online inference with stochastic gradient descent via random scaling.Proceedings of the AAAI Conference on Artificial Intelligence36(7):7381–7389

2022

[19] [19]

Leluc R, Portier F (2023) Asymptotic analysis of conditioned stochastic gradient descent.Transactions on Machine Learning Research

2023

[20] [20]

Lessard L, Recht B, Packard A (2016) Analysis and design of optimization algorithms via integral quadratic constraints.SIAM Journal on Optimization26(1):57–95

2016

[21] [21]

Loshchilov I, Hutter F (2019) Decoupled weight decay regularization.International Conference on Learn- ing Representations

2019

[22] [22]

Mokkadem A, Pelletier M (2006) Convergence rate and averaging of nonlinear two-time-scale stochastic approximation algorithms.Annals of Applied Probability16(3):1671–1702

2006

[23] [23]

Mou W, Li CJ, Wainwright MJ, Bartlett PL, Jordan MI (2020) On linear stochastic approximation: Fine-grained Polyak–Ruppert and non-asymptotic concentration.Proceedings of the 33rd Conference on Learning Theory (COLT), volume 125 ofPMLR, 2947–2997

2020

[24] [24]

(2011) Scikit-learn: Machine learning in Python.Journal of Machine Learning Research12:2825–2830

Pedregosa F, et al. (2011) Scikit-learn: Machine learning in Python.Journal of Machine Learning Research12:2825–2830

2011

[25] [25]

Polyak BT (1964) Some methods of speeding up the convergence of iteration methods.USSR Compu- tational Mathematics and Mathematical Physics4(5):1–17

1964

[26] [26]

Polyak BT, Juditsky AB (1992) Acceleration of stochastic approximation by averaging.SIAM Journal on Control and Optimization30(4):838–855

1992

[27] [27]

Reddi SJ, Kale S, Kumar S (2018) On the convergence of Adam and beyond.International Conference on Learning Representations

2018

[28] [28]

Technical Report 781, Cornell University Operations Research and Industrial Engineering

Ruppert D (1988) Efficient estimators from a slowly convergent Robbins–Monro process. Technical Report 781, Cornell University Operations Research and Industrial Engineering

1988

[29] [29]

Sebbouh O, Gower RM, Defazio A (2021) Almost sure convergence rates for stochastic gradient descent and stochastic heavy ball.Proceedings of the 34th Conference on Learning Theory (COLT), volume 134 ofPMLR, 3935–3971. 43

2021

[30] [30]

Surendran S, Fermanian A, Godichon-Baggioni A, Le Corff S (2024) Non-asymptotic analysis of biased adaptive stochastic approximation.Advances in Neural Information Processing Systems 37 (NeurIPS 2024), 12897–12943

2024

[31] [31]

Tang K, Liu W, Zhang Y, Chen X (2023) Acceleration of stochastic gradient descent with momentum by averaging: finite-sample rates and asymptotic normality.arXiv preprint arXiv:2305.17665

arXiv 2023

[32] [32]

COURSERA: Neural Networks for Machine Learning

Tieleman T, Hinton G (2012) Lecture 6.5—RMSProp: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning

2012

[33] [33]

van der Vaart AW (1998)Asymptotic Statistics(Cambridge University Press)

1998

[34] [34]

Proceedings of the 43rd International Conference on Machine Learning (ICML), volume 306 ofPMLR, arXiv preprint arXiv:2604.23436

Wang H, Du X, Na S (2026) Inference of online Newton methods with Nesterov’s accelerated sketching. Proceedings of the 43rd International Conference on Machine Learning (ICML), volume 306 ofPMLR, arXiv preprint arXiv:2604.23436

Pith/arXiv arXiv 2026

[35] [35]

Wei Z, Zhu W, Wu WB (2025) Weighted averaged stochastic gradient descent: asymptotic normality and optimality.arXiv preprint arXiv:2307.06915Version 3, 2025; first version 2023

arXiv 2025

[36] [36]

Journal of the American Statistical Association118(541):393–404

Zhu W, Chen X, Wu WB (2023) Online covariance matrix estimation in stochastic gradient descent. Journal of the American Statistical Association118(541):393–404. 44

2023