Why Adam Can Beat SGD: Second-Moment Normalization Yields Sharper Tails
Pith reviewed 2026-05-21 11:42 UTC · model grok-4.3
The pith
Adam's second-moment normalization produces high-probability convergence bounds that scale as the square root of the failure probability, unlike SGD which requires a linear dependence.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Under the classical bounded-variance model, the second-moment normalization inside Adam yields a high-probability convergence guarantee whose dependence on the parameter δ is only δ to the power of negative one-half, while any corresponding guarantee for SGD must incur at least a δ to the power of negative one dependence; the separation is proved via a stopping-time and martingale analysis that isolates the effect of the normalization.
What carries the argument
Second-moment normalization, the operation in Adam that divides each coordinate update by the square root of a running average of squared gradients, which the stopping-time martingale argument shows produces lighter tails in the concentration bound.
If this is right
- Adam supplies strictly stronger high-probability guarantees than SGD when the same second-moment bound is available.
- The separation arises solely from the coordinate-wise normalization rather than from any momentum or adaptive learning-rate schedule.
- Practitioners who require failure probabilities below 10^{-4} can expect Adam to need far fewer iterations than SGD to reach the same guarantee.
- The martingale stopping-time technique isolates the normalization effect and can be reused for other adaptive methods that perform similar rescaling.
Where Pith is reading between the lines
- The same normalization may improve tail behavior in non-convex landscapes where high-probability escape from saddle points matters.
- Hybrid algorithms that add explicit second-moment rescaling to existing SGD variants could inherit the improved dependence without adopting the full Adam update rule.
- Empirical studies that measure actual failure rates across repeated runs with varying δ would directly test the predicted gap.
Load-bearing premise
The gradients are assumed to have bounded second moments at every point.
What would settle it
A concrete stochastic optimization problem in which the high-probability bound for SGD is shown to achieve better than linear dependence on δ, or a direct counter-example in which Adam fails to achieve the square-root dependence, under the same bounded-variance assumption.
read the original abstract
Despite Adam demonstrating faster empirical convergence than SGD in many applications, much of the existing theory yields guarantees essentially comparable to those of SGD, leaving the empirical performance gap insufficiently explained. In this paper, we uncover a key second-moment normalization in Adam and develop a stopping-time/martingale analysis that provably distinguishes Adam from SGD under the classical bounded variance model (a second moment assumption). In particular, we establish the first theoretical separation between the high-probability convergence behaviors of the two methods: Adam achieves a $\delta^{-1/2}$ dependence on the confidence parameter $\delta$, whereas corresponding high-probability guarantee for SGD necessarily incurs at least a $\delta^{-1}$ dependence.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims to provide the first theoretical separation between the high-probability convergence of Adam and SGD under the classical bounded-variance model. Using a stopping-time and martingale analysis, it argues that Adam's second-moment normalization yields a δ^{-1/2} dependence on the failure probability δ, while any high-probability guarantee for SGD must incur at least a δ^{-1} dependence.
Significance. If the separation is rigorously established, the result would be significant for explaining Adam's empirical edge over SGD via tail behavior under standard second-moment assumptions alone. The deployment of classical martingale tools and stopping times to handle adaptivity is a methodological strength that could inform future high-probability analyses of adaptive optimizers.
major comments (2)
- [Section 4] Section 4 (SGD lower bound construction): the necessity claim that any high-probability bound for SGD must incur δ^{-1} under only E[||g_t - ∇f(x_t)||^2] ≤ σ^2 requires an explicit noise distribution with finite second moment (but heavier tails) whose lower bound is derived; without the concrete distribution and verification that the same oracle permits Adam's δ^{-1/2} bound, the separation is not yet load-bearing.
- [Theorem 3.2] Theorem 3.2 (Adam high-probability bound): the stopping-time argument must explicitly show that the running second-moment estimate does not introduce implicit higher-moment control that would be unavailable to SGD; the current derivation steps for the δ^{-1/2} exponent need to be checked against the bounded-variance assumption alone.
minor comments (2)
- [Section 2] Notation for the gradient estimator g_t and the second-moment accumulator should be introduced with a dedicated preliminary subsection for clarity.
- [Abstract] The abstract and introduction could more precisely state the precise form of the high-probability bound (e.g., the dependence on T, σ, and the function gap) rather than only the δ exponent.
Simulated Author's Rebuttal
We thank the referee for the careful and constructive review. The comments identify important areas where additional explicitness will strengthen the separation result. We address each major comment below and will revise the manuscript accordingly.
read point-by-point responses
-
Referee: [Section 4] Section 4 (SGD lower bound construction): the necessity claim that any high-probability bound for SGD must incur δ^{-1} under only E[||g_t - ∇f(x_t)||^2] ≤ σ^2 requires an explicit noise distribution with finite second moment (but heavier tails) whose lower bound is derived; without the concrete distribution and verification that the same oracle permits Adam's δ^{-1/2} bound, the separation is not yet load-bearing.
Authors: We agree that an explicit construction is needed to make the lower bound fully load-bearing. In the revised manuscript we will add a concrete noise distribution that satisfies the bounded-variance assumption yet possesses heavier tails, derive the δ^{-1} high-probability lower bound for SGD under this oracle, and verify that the identical distribution permits Adam to attain the δ^{-1/2} rate stated in Theorem 3.2. These additions will appear in Section 4. revision: yes
-
Referee: [Theorem 3.2] Theorem 3.2 (Adam high-probability bound): the stopping-time argument must explicitly show that the running second-moment estimate does not introduce implicit higher-moment control that would be unavailable to SGD; the current derivation steps for the δ^{-1/2} exponent need to be checked against the bounded-variance assumption alone.
Authors: We will expand the proof of Theorem 3.2 to include an explicit verification that the stopping-time construction and the running second-moment estimate rely solely on the bounded-variance assumption. The revised derivation will break down each step to confirm that no higher-moment bounds are invoked and that the δ^{-1/2} dependence follows directly from the second-moment normalization together with standard martingale concentration, without any control unavailable to SGD. revision: yes
Circularity Check
No circularity: derivation uses standard martingale tools on bounded-variance assumption
full rationale
The paper's central separation (Adam's δ^{-1/2} high-probability rate versus SGD's necessary δ^{-1} under E[||g_t - ∇f(x_t)||^2] ≤ σ^2) is obtained via stopping-time analysis and classical martingale inequalities applied directly to the second-moment normalization. No equation reduces the claimed exponent difference to a fitted quantity, a self-citation chain, or a redefinition of the input assumption. The lower-bound construction for SGD is presented as a concrete noise distribution with only finite second moment; the Adam analysis exploits the running second-moment estimate without importing uniqueness theorems or ansatzes from prior author work. The derivation is therefore self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Classical bounded variance model (second moment assumption)
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.leanwashburn_uniqueness_aczel unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
second-moment normalization ... quadratic variation [x]_T ... v_t,i ≈ (1/T) Σ g² ... log(1 + Σ ||g_t||²) ... δ^{-1/2} vs δ^{-1} separation
-
IndisputableMonolith/Foundation/ArithmeticFromLogic.leanLogicNat induction and embedding unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
stopping time τ_G := min{t : f-bar(x_t) ≥ G} ... BDG on martingale differences D_t
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 1 Pith paper
-
PowerStep: Memory-Efficient Adaptive Optimization via $\ell_p$-Norm Steepest Descent
PowerStep delivers coordinate-wise adaptive optimization by nonlinearly transforming a momentum buffer under an lp-norm steepest-descent geometry, matching Adam convergence with half the memory and supporting aggressi...
Reference graph
Works this paper leans on
-
[1]
Alexandre D´efossez, L ´eon Bottou, Francis R. Bach, and Nicolas Usunier. A simple convergence proof of adam and adagrad.Trans. Mach. Learn. Res., 2022,
work page 2022
-
[2]
Saeed Ghadimi and Guanghui Lan. Stochastic first- and zeroth-order methods for nonconvex stochastic programming.SIAM Journal on Optimization, 23(4):2341–2368, 2013a. Saeed Ghadimi and Guanghui Lan. Stochastic first-and zeroth-order methods for nonconvex stochastic programming.SIAM journal on optimization, 23(4):2341–2368, 2013b. Zhishuai Guo, Yi Xu, Wotao...
-
[3]
On convergence of adam for stochastic optimization under relaxed assumptions.CoRR, abs/2402.03982,
Yusu Hong and Junhong Lin. On convergence of adam for stochastic optimization under relaxed assumptions.CoRR, abs/2402.03982,
-
[4]
Ruinan Jin, Xiao Li, Yaoliang Yu, and Baoxiang Wang. A comprehensive framework for analyzing the convergence of adam: Bridging the gap with sgd.arXiv preprint arXiv:2410.04458,
-
[5]
Adam: A Method for Stochastic Optimization
Diederik P Kingma. Adam: A method for stochastic optimization.arXiv preprint arXiv:1412.6980,
work page internal anchor Pith review Pith/arXiv arXiv
-
[6]
Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In Yoshua Bengio and Yann LeCun, editors,3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings,
work page 2015
-
[7]
Haochuan Li, Jian Qian, Yi Tian, Alexander Rakhlin, and Ali Jadbabaie. Convex and non-convex optimization under generalized smoothness.Advances in Neural Information Processing Systems, 36:40238–40271, 2023a. Haochuan Li, Alexander Rakhlin, and Ali Jadbabaie. Convergence of adam under relaxed assump- tions. In Alice Oh, Tristan Naumann, Amir Globerson, Ka...
-
[8]
Rui Pan, Yuxing Liu, Xiaoyu Wang, and Tong Zhang. Accelerated convergence of stochastic heavy ball method under anisotropic gradient noise.arXiv preprint arXiv:2312.14567,
-
[9]
Yan Pan and Yuanzhi Li. Toward understanding why adam converges faster than sgd for transform- ers.arXiv preprint arXiv:2306.00204,
-
[10]
Reddi, Satyen Kale, and Sanjiv Kumar
Sashank J. Reddi, Satyen Kale, and Sanjiv Kumar. On the convergence of adam and beyond. In 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018, Conference Track Proceedings. OpenReview.net,
work page 2018
-
[11]
Bohan Wang, Jingwen Fu, Huishuai Zhang, Nanning Zheng, and Wei Chen. Closing the gap be- tween the upper bound and lower bound of adam’s iteration complexity.Advances in Neural Information Processing Systems, 36, 2024a. Bohan Wang, Huishuai Zhang, Qi Meng, Ruoyu Sun, Zhi-Ming Ma, and Wei Chen. On the con- vergence of adam under non-uniform smoothness: Sep...
-
[12]
dX i=1 (γt−1,i −γ t,i)(∇f(y t))imt−1,i #2 + 2 β1 1−β 1 2 E
+ τG∧T−1X t=1 Pt + τG∧T−1X t=1 Dt .(38) We refer to the left-hand side as theW τG∧T .Then, we aim to bound the p 6-th moment ofW τG∧T . Step 3 (Bounding the( p 6)-th Moment of theW τG∧T viaS τG∧T−1 ). For anyp≥12,we take the( p 6)-th moment on both sides of eq. (38), which yields E h W p 6 τG∧T i ≤ ¯f p 6 (y1) +E τG∧T−1X t=1 Pt ! p 6 +E τG∧T−1...
work page 2024
-
[13]
run forT≥10iterations with parametersβ 1 ∈[0,1), 42 WHYADAMCANBEATSGD: SECOND-MOMENTNORMALIZATIONYIELDSSHARPERTAILS β2 = 1−1/T, and step sizeγ=η/ √ T. For any0< δ <1, assume that the step size constantη satisfies 1 η ≥max ( 8dϵ v3/2 + 8√v , Dβ1β1 √ dL 1−β 1 ,1 ) , whereD β1 is defined in Lemma B.5. Then, with probability at least1−δ, one has 1 T TX t=1 ∥∇...
work page 2016
-
[14]
Lemma B.6 (Property 4 in Jin et al. (2024))For any iteration stept, the following inequality holds: m2 t,i −m 2 t−1,i ≤ −(1−β 1)m2 t−1,i + (1−β 1)g2 t,i. Lemma B.7Let ¯f(x) :=f(x)−f ⋆ + 1, wheref ⋆ := inf x∈Rd f(x). For any iteration stept, the following inequality holds: ¯f(y t)≤2 ¯f(x t) + Lβ2 1 (1−β 1)2 ∥γt−1 ⊙m t−1∥2. Lemma B.8Consider the coordinatew...
work page 2024
-
[15]
Therefore E[U Y µ∧n]≤ r r−1 ∥Xµ∧n∥Ls =s∥X µ∧n∥Ls. Taking the supremum over all suchUyields ∥Yµ∧n∥Ls ≤s∥X µ∧n∥Ls. This completes the proof. Appendix D. Lower bound of SGD To establish a lower bound, it is standard to exhibit a single hard instance (or a distribution over instances)—namely, a function and an oracle/noise model satisfying the assumed conditi...
work page 2009
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.