Recognition: no theorem link
Why Adam Can Beat SGD: Second-Moment Normalization Yields Sharper Tails
Pith reviewed 2026-05-15 16:29 UTC · model grok-4.3
The pith
Adam achieves δ^{-1/2} high-probability convergence while SGD requires at least δ^{-1} under bounded variance.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Under the classical bounded-variance model, Adam's second-moment normalization yields high-probability convergence guarantees whose dependence on the parameter δ is only δ to the power of negative one half, whereas any high-probability guarantee for SGD must incur at least a δ to the power of negative one dependence. The separation is obtained by applying a stopping-time and martingale argument that tracks the effect of the normalization on the tail behavior of the iterates.
What carries the argument
Second-moment normalization, the division of each gradient step by the square root of an exponentially weighted estimate of the second moment of the gradients.
If this is right
- High-probability statements for Adam are strictly stronger than those available for SGD under the same variance assumption.
- The separation holds for any algorithm whose updates lack the second-moment rescaling.
- The martingale analysis supplies a template for comparing tail behavior across other adaptive and non-adaptive first-order methods.
- Reliability at small δ improves measurably when the normalization is present.
Where Pith is reading between the lines
- Other adaptive schemes that perform comparable normalization may inherit the same improved δ scaling.
- In applications that demand very small failure probability, the choice between Adam and SGD becomes quantitatively consequential.
- The stopping-time technique could be reused to obtain tail bounds for variance-reduced or momentum-based variants.
Load-bearing premise
The gradients are assumed to have uniformly bounded second moments.
What would settle it
A single explicit stochastic optimization problem with bounded gradient variance in which an SGD trajectory achieves a high-probability bound with better than δ^{-1} scaling.
read the original abstract
Despite Adam demonstrating faster empirical convergence than SGD in many applications, much of the existing theory yields guarantees essentially comparable to those of SGD, leaving the empirical performance gap insufficiently explained. In this paper, we uncover a key second-moment normalization in Adam and develop a stopping-time/martingale analysis that provably distinguishes Adam from SGD under the classical bounded variance model (a second moment assumption). In particular, we establish the first theoretical separation between the high-probability convergence behaviors of the two methods: Adam achieves a $\delta^{-1/2}$ dependence on the confidence parameter $\delta$, whereas corresponding high-probability guarantee for SGD necessarily incurs at least a $\delta^{-1}$ dependence.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper develops a stopping-time and martingale analysis under the classical bounded-variance (second-moment) model to argue that Adam's second-moment normalization produces sharper tails for the optimization error than plain SGD. It claims to establish the first theoretical separation: Adam obtains high-probability convergence bounds with only δ^{-1/2} dependence on the failure probability δ, while any corresponding high-probability guarantee for SGD must incur at least δ^{-1} dependence.
Significance. If the separation were valid inside the stated model, the result would supply a clean, assumption-minimal explanation for Adam's empirical edge in high-confidence regimes and would highlight the role of adaptive normalization in controlling tail behavior. The manuscript ships a direct derivation from standard martingale tools with no fitted parameters.
major comments (2)
- [Abstract and §3] Abstract and §3 (martingale analysis): the central claim that 'corresponding high-probability guarantee for SGD necessarily incurs at least a δ^{-1} dependence' is inconsistent with the bounded-variance assumption adopted by the paper. Under finite second moment, Markov's inequality directly yields P(|X|>t) ≤ E[X²]/t², so a δ^{-1/2} dependence is always attainable; tails no better than 1/t would make E[X²] diverge, violating the model. This contradiction is load-bearing for the asserted separation.
- [§4] §4 (SGD tail bound derivation): the necessity argument for δ^{-1} appears to rely on a tail bound that cannot coexist with the second-moment hypothesis used for both algorithms; the manuscript does not reconcile how the SGD error can simultaneously possess finite variance and exhibit only 1/t tails.
minor comments (1)
- [§2] Notation for the stopping time τ_δ is introduced without an explicit definition of the filtration; a one-line clarification would improve readability.
Simulated Author's Rebuttal
We thank the referee for the careful reading and for identifying a potential inconsistency between our high-probability claims and the bounded-variance model. We address each major comment below and indicate the revisions we will make.
read point-by-point responses
-
Referee: [Abstract and §3] Abstract and §3 (martingale analysis): the central claim that 'corresponding high-probability guarantee for SGD necessarily incurs at least a δ^{-1} dependence' is inconsistent with the bounded-variance assumption adopted by the paper. Under finite second moment, Markov's inequality directly yields P(|X|>t) ≤ E[X²]/t², so a δ^{-1/2} dependence is always attainable; tails no better than 1/t would make E[X²] diverge, violating the model. This contradiction is load-bearing for the asserted separation.
Authors: We acknowledge the force of this observation. Our necessity argument for a δ^{-1} term in any SGD high-probability bound was derived from a specific tail construction in the stopping-time analysis that, upon re-examination, does not preserve finite second moments. Markov's inequality indeed supplies a δ^{-1/2} bound under the stated model for any random variable with finite variance. We will revise the abstract, §3, and the introduction to remove the claim that SGD 'necessarily incurs at least a δ^{-1} dependence' and instead present Adam's bound as achieving δ^{-1/2} via its normalized martingale while noting that standard second-moment arguments for SGD also permit δ^{-1/2} (albeit possibly with worse leading constants). The separation will be reframed around the applicability and tightness of the respective martingale constructions rather than the exponent on δ. revision: yes
-
Referee: [§4] §4 (SGD tail bound derivation): the necessity argument for δ^{-1} appears to rely on a tail bound that cannot coexist with the second-moment hypothesis used for both algorithms; the manuscript does not reconcile how the SGD error can simultaneously possess finite variance and exhibit only 1/t tails.
Authors: We agree that the derivation in §4 contains an inconsistency. The counter-example distribution used to argue that SGD error tails cannot improve beyond 1/t produces infinite second moments, violating the model assumption. We will revise §4 to excise the necessity argument, replace it with a correct statement that both methods admit δ^{-1/2} bounds via Markov under finite second moments, and add a brief discussion clarifying that the practical advantage of Adam lies in the normalized variance estimate rather than in a fundamentally different tail exponent. This will be incorporated in the next manuscript version. revision: yes
Circularity Check
No circularity: derivation proceeds from bounded-variance assumption via martingale tools
full rationale
The paper states it develops a stopping-time/martingale analysis directly from the classical bounded second-moment assumption to obtain the high-probability bounds. No self-definitional steps, fitted parameters renamed as predictions, load-bearing self-citations, or ansatzes smuggled via prior work are present in the provided abstract or described chain. The claimed separation is presented as a consequence of the analysis rather than reducing to its own inputs by construction. The result is therefore self-contained against the stated model.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Bounded second moment of stochastic gradients
Forward citations
Cited by 1 Pith paper
-
PowerStep: Memory-Efficient Adaptive Optimization via $\ell_p$-Norm Steepest Descent
PowerStep delivers coordinate-wise adaptive optimization by nonlinearly transforming a momentum buffer under an lp-norm steepest-descent geometry, matching Adam convergence with half the memory and supporting aggressi...
Reference graph
Works this paper leans on
-
[1]
Alexandre D´efossez, L ´eon Bottou, Francis R. Bach, and Nicolas Usunier. A simple convergence proof of adam and adagrad.Trans. Mach. Learn. Res., 2022,
work page 2022
-
[2]
Saeed Ghadimi and Guanghui Lan. Stochastic first- and zeroth-order methods for nonconvex stochastic programming.SIAM Journal on Optimization, 23(4):2341–2368, 2013a. Saeed Ghadimi and Guanghui Lan. Stochastic first-and zeroth-order methods for nonconvex stochastic programming.SIAM journal on optimization, 23(4):2341–2368, 2013b. Zhishuai Guo, Yi Xu, Wotao...
-
[3]
On convergence of adam for stochastic optimization under relaxed assumptions.CoRR, abs/2402.03982,
Yusu Hong and Junhong Lin. On convergence of adam for stochastic optimization under relaxed assumptions.CoRR, abs/2402.03982,
-
[4]
Ruinan Jin, Xiao Li, Yaoliang Yu, and Baoxiang Wang. A comprehensive framework for analyzing the convergence of adam: Bridging the gap with sgd.arXiv preprint arXiv:2410.04458,
-
[5]
Adam: A Method for Stochastic Optimization
Diederik P Kingma. Adam: A method for stochastic optimization.arXiv preprint arXiv:1412.6980,
work page internal anchor Pith review Pith/arXiv arXiv
-
[6]
Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In Yoshua Bengio and Yann LeCun, editors,3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings,
work page 2015
-
[7]
Haochuan Li, Jian Qian, Yi Tian, Alexander Rakhlin, and Ali Jadbabaie. Convex and non-convex optimization under generalized smoothness.Advances in Neural Information Processing Systems, 36:40238–40271, 2023a. Haochuan Li, Alexander Rakhlin, and Ali Jadbabaie. Convergence of adam under relaxed assump- tions. In Alice Oh, Tristan Naumann, Amir Globerson, Ka...
-
[8]
Rui Pan, Yuxing Liu, Xiaoyu Wang, and Tong Zhang. Accelerated convergence of stochastic heavy ball method under anisotropic gradient noise.arXiv preprint arXiv:2312.14567,
-
[9]
Yan Pan and Yuanzhi Li. Toward understanding why adam converges faster than sgd for transform- ers.arXiv preprint arXiv:2306.00204,
-
[10]
Reddi, Satyen Kale, and Sanjiv Kumar
Sashank J. Reddi, Satyen Kale, and Sanjiv Kumar. On the convergence of adam and beyond. In 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018, Conference Track Proceedings. OpenReview.net,
work page 2018
-
[11]
Bohan Wang, Jingwen Fu, Huishuai Zhang, Nanning Zheng, and Wei Chen. Closing the gap be- tween the upper bound and lower bound of adam’s iteration complexity.Advances in Neural Information Processing Systems, 36, 2024a. Bohan Wang, Huishuai Zhang, Qi Meng, Ruoyu Sun, Zhi-Ming Ma, and Wei Chen. On the con- vergence of adam under non-uniform smoothness: Sep...
-
[12]
dX i=1 (γt−1,i −γ t,i)(∇f(y t))imt−1,i #2 + 2 β1 1−β 1 2 E
+ τG∧T−1X t=1 Pt + τG∧T−1X t=1 Dt .(38) We refer to the left-hand side as theW τG∧T .Then, we aim to bound the p 6-th moment ofW τG∧T . Step 3 (Bounding the( p 6)-th Moment of theW τG∧T viaS τG∧T−1 ). For anyp≥12,we take the( p 6)-th moment on both sides of eq. (38), which yields E h W p 6 τG∧T i ≤ ¯f p 6 (y1) +E τG∧T−1X t=1 Pt ! p 6 +E τG∧T−1...
work page 2024
-
[13]
run forT≥10iterations with parametersβ 1 ∈[0,1), 42 WHYADAMCANBEATSGD: SECOND-MOMENTNORMALIZATIONYIELDSSHARPERTAILS β2 = 1−1/T, and step sizeγ=η/ √ T. For any0< δ <1, assume that the step size constantη satisfies 1 η ≥max ( 8dϵ v3/2 + 8√v , Dβ1β1 √ dL 1−β 1 ,1 ) , whereD β1 is defined in Lemma B.5. Then, with probability at least1−δ, one has 1 T TX t=1 ∥∇...
work page 2016
-
[14]
Lemma B.6 (Property 4 in Jin et al. (2024))For any iteration stept, the following inequality holds: m2 t,i −m 2 t−1,i ≤ −(1−β 1)m2 t−1,i + (1−β 1)g2 t,i. Lemma B.7Let ¯f(x) :=f(x)−f ⋆ + 1, wheref ⋆ := inf x∈Rd f(x). For any iteration stept, the following inequality holds: ¯f(y t)≤2 ¯f(x t) + Lβ2 1 (1−β 1)2 ∥γt−1 ⊙m t−1∥2. Lemma B.8Consider the coordinatew...
work page 2024
-
[15]
Therefore E[U Y µ∧n]≤ r r−1 ∥Xµ∧n∥Ls =s∥X µ∧n∥Ls. Taking the supremum over all suchUyields ∥Yµ∧n∥Ls ≤s∥X µ∧n∥Ls. This completes the proof. Appendix D. Lower bound of SGD To establish a lower bound, it is standard to exhibit a single hard instance (or a distribution over instances)—namely, a function and an oracle/noise model satisfying the assumed conditi...
work page 2009
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.