pith. sign in

arxiv: 2606.17364 · v1 · pith:24WJS6F4new · submitted 2026-06-15 · 🧮 math.ST · math.OC· stat.ML· stat.TH

A Polyak-Ruppert Central Limit Theorem for SA-Adam with Momentum and Non-Convergent Adaptive Preconditioning

Pith reviewed 2026-06-27 01:44 UTC · model grok-4.3

classification 🧮 math.ST math.OCstat.MLstat.TH
keywords Polyak-Ruppert averagingcentral limit theoremstochastic approximationAdam optimizermomentumadaptive preconditioningstochastic gradient descentone-pass inference
0
0 comments X

The pith

SA-Adam with momentum and non-convergent preconditioning obeys the same Polyak-Ruppert CLT as plain SGD.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proves that Polyak-Ruppert averaged iterates of SA-Adam satisfy a central limit theorem whose covariance is exactly the classical sandwich form from ordinary stochastic gradient descent. This holds when the augmented state of iterate plus momentum buffer is treated as a time-varying linear stochastic approximation that locally stabilizes, and when the momentum gain vanishes at a sub-linear rate. The adaptivity therefore becomes invisible in the limiting distribution. The same result extends to the ridge-penalized sandwich when L2 weight decay is present. The analysis supplies the required positive-drift stability and projection identity to reach the marginal covariance claim.

Core claim

Treating the augmented state (iterate, momentum buffer) as a time-varying linear stochastic approximation, positive drift stability and a non-autonomous Polyak-Ruppert CLT are established together with a projection identity; the resulting iterate-marginal covariance equals the plain SGD sandwich H^{-1} S H^{-1}, so adaptivity is asymptotically invisible. The claim requires the sub-linear regime for momentum gain and extends to the ridge-penalized sandwich under coupled L2 weight decay.

What carries the argument

Augmented state of iterate and momentum buffer viewed as time-varying linear stochastic approximation, together with the projection identity that isolates the marginal covariance.

If this is right

  • The central limit theorem for averaged iterates is identical to that of SGD even with momentum and non-convergent adaptive preconditioning.
  • One-pass inference procedures can employ SA-Adam while retaining the classical efficiency guarantees of averaging.
  • Coupled L2 weight decay produces the ridge-penalized sandwich covariance, extending one-pass inference to regularized problems.
  • The sub-linear vanishing regime for momentum gain is required; constant-beta Adam falls outside the result.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar augmented-state arguments may apply to other momentum-based adaptive methods whose preconditioners fail to converge.
  • Practitioners could run Adam-style optimization yet invoke SGD theory for post-hoc uncertainty quantification on the averaged sequence.
  • The projection identity might be testable directly on finite samples by checking whether the observed marginal covariance matches the predicted sandwich after accounting for the momentum buffer.

Load-bearing premise

The augmented state of iterate and momentum buffer must locally stabilize when treated as a time-varying linear stochastic approximation.

What would settle it

An empirical covariance computed from many independent runs of Polyak-Ruppert averaged SA-Adam that differs from the SGD sandwich form under verified local stabilization and sub-linear momentum gain.

Figures

Figures reproduced from arXiv: 2606.17364 by Sunyoung An, Xiaoming Huo.

Figure 1
Figure 1. Figure 1: Projection identity and momentum-invisibility (streaming Toeplitz regression, [PITH_FULL_IMAGE:figures/full_fig_p028_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Necessity of γ < 1 (exact scalar evaluation). Left: V (γ, n) = n Var[xn] versus n for several γ; curves with γ < 1 descend toward the sandwich V = 1 (more slowly as γ ↑ 1), while γ = 1 plateaus at the predicted inflated limit 1 + 1/(2c1 − 1 − α). Right: V (γ, 107 ), showing the sharp transition at γ = 1. the rate of approach slows as γ → 1. Finally, the P-independence at the heart of the identity is checke… view at source ↗
Figure 3
Figure 3. Figure 3: Semi-synthetic coverage of averaged SA-Adam vs. averaged SGD (real [PITH_FULL_IMAGE:figures/full_fig_p030_3.png] view at source ↗
read the original abstract

Adaptive optimizers combining preconditioning, momentum, and weight decay (Adam and AdamW) are, under Polyak-Ruppert averaging, candidate engines for one-pass inference. Does the averaged iterate keep the classical Polyak-Ruppert central limit theorem (CLT), with sandwich covariance $H^{-1}SH^{-1}$ (Hessian $H$, gradient covariance $S$), under momentum and non-convergent preconditioning? The preconditioner-only analysis does not carry over: with momentum the canonical decomposition collapses to a tautology. Treating the augmented state (iterate, momentum buffer) as a time-varying linear stochastic approximation (SA), we prove (under local stabilization) positive drift stability, a non-autonomous Polyak-Ruppert CLT, and a projection identity. The upshot: the iterate-marginal covariance is exactly the plain stochastic gradient descent (SGD) sandwich $H^{-1}SH^{-1}$, so the adaptivity is asymptotically invisible. This holds for SA-Adam (sub-linearly vanishing momentum gain, $\gamma\in(\alpha,1)$; the sub-linear regime is essential), not constant-$\beta$ deployed Adam. Coupled $L_2$ weight decay yields the ridge-penalized sandwich, extending one-pass inference to regularized problems.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The paper proves a Polyak-Ruppert CLT for SA-Adam (momentum with sub-linear gain γ ∈ (α,1) and non-convergent adaptive preconditioning). Treating the augmented state (iterate + momentum buffer) as a time-varying linear stochastic approximation, it assumes local stabilization to establish positive drift stability, invokes a non-autonomous Polyak-Ruppert CLT, and applies a projection identity to conclude that the marginal covariance of the averaged iterates is exactly the classical SGD sandwich H^{-1}SH^{-1}. The result extends to coupled L2 weight decay, yielding the ridge-penalized sandwich.

Significance. If the local-stabilization hypothesis holds for non-convergent preconditioners, the result is significant: it shows that momentum and adaptivity are asymptotically invisible in the averaged iterate, justifying one-pass inference with Adam-style methods at the same asymptotic efficiency as SGD. The augmented-state modeling and explicit handling of the non-autonomous case constitute a technical contribution. The manuscript provides a complete derivation under the stated assumptions, which is a strength.

major comments (3)
  1. [augmented-state modeling and local-stabilization hypothesis] The local stabilization of the augmented state is assumed rather than derived (abstract and the augmented-state modeling section). Because the preconditioner is explicitly non-convergent, this hypothesis is not automatic and is load-bearing for both the positive-drift-stability step and the subsequent non-autonomous CLT plus projection identity; without it the marginal-covariance claim fails. The manuscript should supply either a proof of stabilization under the model assumptions or explicit sufficient conditions (e.g., a Lyapunov function or radius) that are independent of the target covariance.
  2. [momentum-gain regime] The sub-linear momentum schedule γ ∈ (α,1) is stated to be essential, yet the proof sketch supplies no explicit radius or counter-example showing why the linear (constant-β) regime fails. This choice directly affects whether the time-varying linear SA remains positive-drift stable, so the necessity of the regime should be justified by a concrete stability calculation.
  3. [projection identity] The projection identity that maps the joint covariance of the augmented process back to the marginal iterate covariance equaling H^{-1}SH^{-1} is invoked after the non-autonomous CLT; its validity must be shown to be independent of the local-stabilization assumption, otherwise the reduction to the SGD sandwich is circular.
minor comments (1)
  1. Notation for the time-varying linear SA coefficients and the precise definition of the projection operator should be collected in a single preliminary section for readability.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the careful reading and constructive comments. We address each major comment below.

read point-by-point responses
  1. Referee: [augmented-state modeling and local-stabilization hypothesis] The local stabilization of the augmented state is assumed rather than derived (abstract and the augmented-state modeling section). Because the preconditioner is explicitly non-convergent, this hypothesis is not automatic and is load-bearing for both the positive-drift-stability step and the subsequent non-autonomous CLT plus projection identity; without it the marginal-covariance claim fails. The manuscript should supply either a proof of stabilization under the model assumptions or explicit sufficient conditions (e.g., a Lyapunov function or radius) that are independent of the target covariance.

    Authors: We agree the local-stabilization hypothesis is load-bearing. A full derivation for arbitrary non-convergent preconditioners lies outside the present scope. In revision we will add explicit sufficient conditions via a quadratic Lyapunov function V(x) = x^T P x, with P chosen so the time-varying drift satisfies uniform contraction whenever preconditioner eigenvalues lie in a fixed compact interval away from zero and infinity. These conditions are independent of the gradient covariance S. revision: yes

  2. Referee: [momentum-gain regime] The sub-linear momentum schedule γ ∈ (α,1) is stated to be essential, yet the proof sketch supplies no explicit radius or counter-example showing why the linear (constant-β) regime fails. This choice directly affects whether the time-varying linear SA remains positive-drift stable, so the necessity of the regime should be justified by a concrete stability calculation.

    Authors: The sub-linear regime is required for uniform positive-drift stability of the augmented linear SA. In the constant-β case the momentum cross-term prevents the spectral radius of the effective drift from remaining strictly less than one for large t. Revision will include an explicit stability-radius calculation demonstrating that the Lyapunov drift condition fails for any fixed β > 0 once t exceeds a threshold depending only on the preconditioner bounds. revision: yes

  3. Referee: [projection identity] The projection identity that maps the joint covariance of the augmented process back to the marginal iterate covariance equaling H^{-1}SH^{-1} is invoked after the non-autonomous CLT; its validity must be shown to be independent of the local-stabilization assumption, otherwise the reduction to the SGD sandwich is circular.

    Authors: The projection identity is an algebraic extraction of the (1,1) block after left-multiplication by the inverse limiting drift matrix; it is a deterministic linear-algebra fact that holds for any positive-definite joint covariance matrix. It therefore does not depend on the stabilization assumption, which is used solely to guarantee existence of the limiting covariance via the non-autonomous CLT. revision: no

Circularity Check

0 steps flagged

No significant circularity; result is conditional on explicit local stabilization assumption

full rationale

The abstract and description state the results hold 'under local stabilization' of the augmented state treated as time-varying linear SA. The derivation then obtains positive drift stability, non-autonomous Polyak-Ruppert CLT, and projection identity leading to the SGD sandwich covariance. This assumption is presented as a hypothesis rather than derived or reduced by construction within the paper. No self-definitional steps, fitted inputs renamed as predictions, or load-bearing self-citation chains are quoted or evident. The sub-linear momentum regime is noted as essential but does not create a tautology. The paper is self-contained under its stated assumptions.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Based solely on the abstract; the central claim rests on the local stabilization assumption and the requirement that momentum gain vanishes sub-linearly.

axioms (1)
  • domain assumption local stabilization of the augmented (iterate, momentum) state
    Invoked to obtain positive drift stability and the non-autonomous Polyak-Ruppert CLT

pith-pipeline@v0.9.1-grok · 5775 in / 1169 out tokens · 39215 ms · 2026-06-27T01:44:04.708820+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

36 extracted references · 2 linked inside Pith

  1. [1]

    An S, Huo X (2026) When does dynamic preconditioning preserve the Polyak–Ruppert CLT? A stabi- lization threshold, arXiv preprint arXiv:2604.23498

  2. [2]

    Barakat A, Bianchi P (2021) Convergence and dynamical behavior of the ADAM algorithm for noncon- vex stochastic optimization.SIAM Journal on Optimization31(1):244–274

  3. [3]

    Borkar VS (2008)Stochastic Approximation: A Dynamical Systems Viewpoint(Cambridge University Press and Hindustan Book Agency)

  4. [4]

    Boyer C, Godichon-Baggioni A (2023) On the asymptotic rate of convergence of stochastic Newton algo- rithms and their weighted averaged versions.Computational Optimization and Applications84(3):921– 972

  5. [5]

    Chen X, Lee JD, Tong XT, Zhang Y (2020) Statistical inference for model parameters in stochastic gradient descent.Annals of Statistics48(1):251–273

  6. [6]

    Transactions on Machine Learning Research

    D´ efossez A, Bottou L, Bach F, Usunier N (2022) A simple convergence proof of Adam and AdaGrad. Transactions on Machine Learning Research

  7. [7]

    Dieuleveut A, Durmus A, Bach F (2020) Bridging the gap between constant step size stochastic gradient descent and Markov chains.Annals of Statistics48(3):1348–1382

  8. [8]

    Duchi J, Hazan E, Singer Y (2011) Adaptive subgradient methods for online learning and stochastic optimization.Journal of Machine Learning Research12:2121–2159

  9. [9]

    Efron B, Hastie T, Johnstone I, Tibshirani R (2004) Least angle regression.The Annals of Statistics 32(2):407–499. 42

  10. [10]

    Gadat S, Panloup F, Saadane S (2018) Stochastic heavy ball.Electronic Journal of Statistics12(1):461– 529

  11. [11]

    Hall P, Heyde CC (1980)Martingale Limit Theory and Its Application(Academic Press)

  12. [12]

    Machine Learning69(2–3):169–192

    Hazan E, Agarwal A, Kale S (2007) Logarithmic regret algorithms for online convex optimization. Machine Learning69(2–3):169–192

  13. [13]

    Horn RA, Johnson CR (2013)Matrix Analysis(Cambridge University Press), 2nd edition

  14. [14]

    Kaledin M, Moulines E, Naumov A, Tadic V, Wai HT (2020) Finite time analysis of linear two-timescale stochastic approximation with Markovian noise.Proceedings of the 33rd Conference on Learning Theory (COLT), volume 125 ofPMLR, 2144–2203, arXiv:2002.01268

  15. [15]

    Kingma DP, Ba J (2015) Adam: A method for stochastic optimization.International Conference on Learning Representations

  16. [16]

    Annals of Applied Probability14(2):796–819

    Konda VR, Tsitsiklis JN (2004) Convergence rate of linear two-time-scale stochastic approximation. Annals of Applied Probability14(2):796–819

  17. [17]

    arXiv preprint arXiv:2506.23803

    Kovalev D (2025) SGD with adaptive preconditioning: unified analysis and momentum acceleration. arXiv preprint arXiv:2506.23803

  18. [18]

    Lee S, Liao Y, Seo MH, Shin Y (2022) Fast and robust online inference with stochastic gradient descent via random scaling.Proceedings of the AAAI Conference on Artificial Intelligence36(7):7381–7389

  19. [19]

    Leluc R, Portier F (2023) Asymptotic analysis of conditioned stochastic gradient descent.Transactions on Machine Learning Research

  20. [20]

    Lessard L, Recht B, Packard A (2016) Analysis and design of optimization algorithms via integral quadratic constraints.SIAM Journal on Optimization26(1):57–95

  21. [21]

    Loshchilov I, Hutter F (2019) Decoupled weight decay regularization.International Conference on Learn- ing Representations

  22. [22]

    Mokkadem A, Pelletier M (2006) Convergence rate and averaging of nonlinear two-time-scale stochastic approximation algorithms.Annals of Applied Probability16(3):1671–1702

  23. [23]

    Mou W, Li CJ, Wainwright MJ, Bartlett PL, Jordan MI (2020) On linear stochastic approximation: Fine-grained Polyak–Ruppert and non-asymptotic concentration.Proceedings of the 33rd Conference on Learning Theory (COLT), volume 125 ofPMLR, 2947–2997

  24. [24]

    (2011) Scikit-learn: Machine learning in Python.Journal of Machine Learning Research12:2825–2830

    Pedregosa F, et al. (2011) Scikit-learn: Machine learning in Python.Journal of Machine Learning Research12:2825–2830

  25. [25]

    Polyak BT (1964) Some methods of speeding up the convergence of iteration methods.USSR Compu- tational Mathematics and Mathematical Physics4(5):1–17

  26. [26]

    Polyak BT, Juditsky AB (1992) Acceleration of stochastic approximation by averaging.SIAM Journal on Control and Optimization30(4):838–855

  27. [27]

    Reddi SJ, Kale S, Kumar S (2018) On the convergence of Adam and beyond.International Conference on Learning Representations

  28. [28]

    Technical Report 781, Cornell University Operations Research and Industrial Engineering

    Ruppert D (1988) Efficient estimators from a slowly convergent Robbins–Monro process. Technical Report 781, Cornell University Operations Research and Industrial Engineering

  29. [29]

    Sebbouh O, Gower RM, Defazio A (2021) Almost sure convergence rates for stochastic gradient descent and stochastic heavy ball.Proceedings of the 34th Conference on Learning Theory (COLT), volume 134 ofPMLR, 3935–3971. 43

  30. [30]

    Surendran S, Fermanian A, Godichon-Baggioni A, Le Corff S (2024) Non-asymptotic analysis of biased adaptive stochastic approximation.Advances in Neural Information Processing Systems 37 (NeurIPS 2024), 12897–12943

  31. [31]

    Tang K, Liu W, Zhang Y, Chen X (2023) Acceleration of stochastic gradient descent with momentum by averaging: finite-sample rates and asymptotic normality.arXiv preprint arXiv:2305.17665

  32. [32]

    COURSERA: Neural Networks for Machine Learning

    Tieleman T, Hinton G (2012) Lecture 6.5—RMSProp: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning

  33. [33]

    van der Vaart AW (1998)Asymptotic Statistics(Cambridge University Press)

  34. [34]

    Proceedings of the 43rd International Conference on Machine Learning (ICML), volume 306 ofPMLR, arXiv preprint arXiv:2604.23436

    Wang H, Du X, Na S (2026) Inference of online Newton methods with Nesterov’s accelerated sketching. Proceedings of the 43rd International Conference on Machine Learning (ICML), volume 306 ofPMLR, arXiv preprint arXiv:2604.23436

  35. [35]

    Wei Z, Zhu W, Wu WB (2025) Weighted averaged stochastic gradient descent: asymptotic normality and optimality.arXiv preprint arXiv:2307.06915Version 3, 2025; first version 2023

  36. [36]

    Journal of the American Statistical Association118(541):393–404

    Zhu W, Chen X, Wu WB (2023) Online covariance matrix estimation in stochastic gradient descent. Journal of the American Statistical Association118(541):393–404. 44