arxiv: 2603.03099 · v7 · submitted 2026-03-03 · 💻 cs.LG · cs.AI

Recognition: no theorem link

Why Adam Can Beat SGD: Second-Moment Normalization Yields Sharper Tails

Ruinan Jin , Yingbin Liang , Shaofeng Zou

Authors on Pith no claims yet

Pith reviewed 2026-05-15 16:29 UTC · model grok-4.3

classification 💻 cs.LG cs.AI

keywords AdamSGDhigh-probability convergencesecond-moment normalizationbounded variancemartingale analysisstochastic optimization

0 comments

The pith

Adam achieves δ^{-1/2} high-probability convergence while SGD requires at least δ^{-1} under bounded variance.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes the first theoretical separation between the high-probability behaviors of Adam and SGD. Adam's second-moment normalization produces convergence bounds with only a square-root dependence on the failure probability δ. SGD necessarily incurs a linear dependence on δ in any corresponding high-probability guarantee. A reader would care because the result supplies a rigorous account of why Adam often reaches reliable solutions faster in practice, derived from a stopping-time and martingale analysis applied directly to the classical bounded-variance setting.

Core claim

Under the classical bounded-variance model, Adam's second-moment normalization yields high-probability convergence guarantees whose dependence on the parameter δ is only δ to the power of negative one half, whereas any high-probability guarantee for SGD must incur at least a δ to the power of negative one dependence. The separation is obtained by applying a stopping-time and martingale argument that tracks the effect of the normalization on the tail behavior of the iterates.

What carries the argument

Second-moment normalization, the division of each gradient step by the square root of an exponentially weighted estimate of the second moment of the gradients.

If this is right

High-probability statements for Adam are strictly stronger than those available for SGD under the same variance assumption.
The separation holds for any algorithm whose updates lack the second-moment rescaling.
The martingale analysis supplies a template for comparing tail behavior across other adaptive and non-adaptive first-order methods.
Reliability at small δ improves measurably when the normalization is present.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Other adaptive schemes that perform comparable normalization may inherit the same improved δ scaling.
In applications that demand very small failure probability, the choice between Adam and SGD becomes quantitatively consequential.
The stopping-time technique could be reused to obtain tail bounds for variance-reduced or momentum-based variants.

Load-bearing premise

The gradients are assumed to have uniformly bounded second moments.

What would settle it

A single explicit stochastic optimization problem with bounded gradient variance in which an SGD trajectory achieves a high-probability bound with better than δ^{-1} scaling.

read the original abstract

Despite Adam demonstrating faster empirical convergence than SGD in many applications, much of the existing theory yields guarantees essentially comparable to those of SGD, leaving the empirical performance gap insufficiently explained. In this paper, we uncover a key second-moment normalization in Adam and develop a stopping-time/martingale analysis that provably distinguishes Adam from SGD under the classical bounded variance model (a second moment assumption). In particular, we establish the first theoretical separation between the high-probability convergence behaviors of the two methods: Adam achieves a $\delta^{-1/2}$ dependence on the confidence parameter $\delta$, whereas corresponding high-probability guarantee for SGD necessarily incurs at least a $\delta^{-1}$ dependence.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The claimed separation fails because SGD cannot be forced to δ^{-1} tails under the paper's own bounded-variance assumption.

read the letter

The punchline is that the paper's central separation cannot exist under its own assumptions. Claiming that SGD high-probability convergence necessarily requires a δ^{-1} dependence on the failure probability δ contradicts the classical bounded-variance model, since Markov's inequality on the second moment gives at most δ^{-1/2} for any random variable with finite variance. To get δ^{-1} you would need tails no better than 1/t, which makes the second moment diverge and violates the assumption the authors use throughout. The stress-test note lands cleanly here. The paper applies a stopping-time martingale argument to Adam's second-moment normalization and obtains the δ^{-1/2} tail for Adam. That technique is used in a new way for this setting and gives a direct way to see the benefit of the normalization step. The abstract states the claim without circular definitions or fitted parameters. The main weakness is the necessity part for SGD. Nothing in the provided text indicates extra assumptions that would produce the heavier tails required for the lower bound, so the separation does not hold inside the stated model. This work targets theorists who study high-probability bounds for adaptive optimizers. A reader might find the martingale construction worth looking at, but the core result does not stand up. I would not bring it to a reading group or cite it. I recommend against sending this version to peer review; the contradiction is load-bearing and needs to be fixed first.

Referee Report

2 major / 1 minor

Summary. The paper develops a stopping-time and martingale analysis under the classical bounded-variance (second-moment) model to argue that Adam's second-moment normalization produces sharper tails for the optimization error than plain SGD. It claims to establish the first theoretical separation: Adam obtains high-probability convergence bounds with only δ^{-1/2} dependence on the failure probability δ, while any corresponding high-probability guarantee for SGD must incur at least δ^{-1} dependence.

Significance. If the separation were valid inside the stated model, the result would supply a clean, assumption-minimal explanation for Adam's empirical edge in high-confidence regimes and would highlight the role of adaptive normalization in controlling tail behavior. The manuscript ships a direct derivation from standard martingale tools with no fitted parameters.

major comments (2)

[Abstract and §3] Abstract and §3 (martingale analysis): the central claim that 'corresponding high-probability guarantee for SGD necessarily incurs at least a δ^{-1} dependence' is inconsistent with the bounded-variance assumption adopted by the paper. Under finite second moment, Markov's inequality directly yields P(|X|>t) ≤ E[X²]/t², so a δ^{-1/2} dependence is always attainable; tails no better than 1/t would make E[X²] diverge, violating the model. This contradiction is load-bearing for the asserted separation.
[§4] §4 (SGD tail bound derivation): the necessity argument for δ^{-1} appears to rely on a tail bound that cannot coexist with the second-moment hypothesis used for both algorithms; the manuscript does not reconcile how the SGD error can simultaneously possess finite variance and exhibit only 1/t tails.

minor comments (1)

[§2] Notation for the stopping time τ_δ is introduced without an explicit definition of the filtration; a one-line clarification would improve readability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful reading and for identifying a potential inconsistency between our high-probability claims and the bounded-variance model. We address each major comment below and indicate the revisions we will make.

read point-by-point responses

Referee: [Abstract and §3] Abstract and §3 (martingale analysis): the central claim that 'corresponding high-probability guarantee for SGD necessarily incurs at least a δ^{-1} dependence' is inconsistent with the bounded-variance assumption adopted by the paper. Under finite second moment, Markov's inequality directly yields P(|X|>t) ≤ E[X²]/t², so a δ^{-1/2} dependence is always attainable; tails no better than 1/t would make E[X²] diverge, violating the model. This contradiction is load-bearing for the asserted separation.

Authors: We acknowledge the force of this observation. Our necessity argument for a δ^{-1} term in any SGD high-probability bound was derived from a specific tail construction in the stopping-time analysis that, upon re-examination, does not preserve finite second moments. Markov's inequality indeed supplies a δ^{-1/2} bound under the stated model for any random variable with finite variance. We will revise the abstract, §3, and the introduction to remove the claim that SGD 'necessarily incurs at least a δ^{-1} dependence' and instead present Adam's bound as achieving δ^{-1/2} via its normalized martingale while noting that standard second-moment arguments for SGD also permit δ^{-1/2} (albeit possibly with worse leading constants). The separation will be reframed around the applicability and tightness of the respective martingale constructions rather than the exponent on δ. revision: yes
Referee: [§4] §4 (SGD tail bound derivation): the necessity argument for δ^{-1} appears to rely on a tail bound that cannot coexist with the second-moment hypothesis used for both algorithms; the manuscript does not reconcile how the SGD error can simultaneously possess finite variance and exhibit only 1/t tails.

Authors: We agree that the derivation in §4 contains an inconsistency. The counter-example distribution used to argue that SGD error tails cannot improve beyond 1/t produces infinite second moments, violating the model assumption. We will revise §4 to excise the necessity argument, replace it with a correct statement that both methods admit δ^{-1/2} bounds via Markov under finite second moments, and add a brief discussion clarifying that the practical advantage of Adam lies in the normalized variance estimate rather than in a fundamentally different tail exponent. This will be incorporated in the next manuscript version. revision: yes

Circularity Check

0 steps flagged

No circularity: derivation proceeds from bounded-variance assumption via martingale tools

full rationale

The paper states it develops a stopping-time/martingale analysis directly from the classical bounded second-moment assumption to obtain the high-probability bounds. No self-definitional steps, fitted parameters renamed as predictions, load-bearing self-citations, or ansatzes smuggled via prior work are present in the provided abstract or described chain. The claimed separation is presented as a consequence of the analysis rather than reducing to its own inputs by construction. The result is therefore self-contained against the stated model.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The proof rests on the classical bounded second-moment assumption and standard martingale theory; no free parameters or new entities are introduced in the abstract.

axioms (1)

domain assumption Bounded second moment of stochastic gradients
Classical assumption stated in the abstract as the setting for the separation.

pith-pipeline@v0.9.0 · 5413 in / 1032 out tokens · 34347 ms · 2026-05-15T16:29:40.792470+00:00 · methodology

discussion (0)

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

PowerStep: Memory-Efficient Adaptive Optimization via $\ell_p$-Norm Steepest Descent
cs.LG 2026-05 unverdicted novelty 6.0

PowerStep delivers coordinate-wise adaptive optimization by nonlinearly transforming a momentum buffer under an lp-norm steepest-descent geometry, matching Adam convergence with half the memory and supporting aggressi...

Reference graph

Works this paper leans on

15 extracted references · 15 canonical work pages · cited by 1 Pith paper · 1 internal anchor

[1]

Bach, and Nicolas Usunier

Alexandre D´efossez, L ´eon Bottou, Francis R. Bach, and Nicolas Usunier. A simple convergence proof of adam and adagrad.Trans. Mach. Learn. Res., 2022,

work page 2022
[2]

Stochastic first- and zeroth-order methods for nonconvex stochastic programming.SIAM Journal on Optimization, 23(4):2341–2368, 2013a

Saeed Ghadimi and Guanghui Lan. Stochastic first- and zeroth-order methods for nonconvex stochastic programming.SIAM Journal on Optimization, 23(4):2341–2368, 2013a. Saeed Ghadimi and Guanghui Lan. Stochastic first-and zeroth-order methods for nonconvex stochastic programming.SIAM journal on optimization, 23(4):2341–2368, 2013b. Zhishuai Guo, Yi Xu, Wotao...

work page arXiv
[3]

On convergence of adam for stochastic optimization under relaxed assumptions.CoRR, abs/2402.03982,

Yusu Hong and Junhong Lin. On convergence of adam for stochastic optimization under relaxed assumptions.CoRR, abs/2402.03982,

work page arXiv
[4]

A comprehensive framework for analyzing the convergence of adam: Bridging the gap with sgd.arXiv preprint arXiv:2410.04458,

Ruinan Jin, Xiao Li, Yaoliang Yu, and Baoxiang Wang. A comprehensive framework for analyzing the convergence of adam: Bridging the gap with sgd.arXiv preprint arXiv:2410.04458,

work page arXiv
[5]

Adam: A Method for Stochastic Optimization

Diederik P Kingma. Adam: A method for stochastic optimization.arXiv preprint arXiv:1412.6980,

work page internal anchor Pith review Pith/arXiv arXiv
[6]

Kingma and Jimmy Ba

Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In Yoshua Bengio and Yann LeCun, editors,3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings,

work page 2015
[7]

Convex and non-convex optimization under generalized smoothness.Advances in Neural Information Processing Systems, 36:40238–40271, 2023a

Haochuan Li, Jian Qian, Yi Tian, Alexander Rakhlin, and Ali Jadbabaie. Convex and non-convex optimization under generalized smoothness.Advances in Neural Information Processing Systems, 36:40238–40271, 2023a. Haochuan Li, Alexander Rakhlin, and Ali Jadbabaie. Convergence of adam under relaxed assump- tions. In Alice Oh, Tristan Naumann, Amir Globerson, Ka...

work page arXiv 2023
[8]

Accelerated convergence of stochastic heavy ball method under anisotropic gradient noise.arXiv preprint arXiv:2312.14567,

Rui Pan, Yuxing Liu, Xiaoyu Wang, and Tong Zhang. Accelerated convergence of stochastic heavy ball method under anisotropic gradient noise.arXiv preprint arXiv:2312.14567,

work page arXiv
[9]

Toward understanding why adam converges faster than sgd for transform- ers.arXiv preprint arXiv:2306.00204,

Yan Pan and Yuanzhi Li. Toward understanding why adam converges faster than sgd for transform- ers.arXiv preprint arXiv:2306.00204,

work page arXiv
[10]

Reddi, Satyen Kale, and Sanjiv Kumar

Sashank J. Reddi, Satyen Kale, and Sanjiv Kumar. On the convergence of adam and beyond. In 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018, Conference Track Proceedings. OpenReview.net,

work page 2018
[11]

Closing the gap be- tween the upper bound and lower bound of adam’s iteration complexity.Advances in Neural Information Processing Systems, 36, 2024a

Bohan Wang, Jingwen Fu, Huishuai Zhang, Nanning Zheng, and Wei Chen. Closing the gap be- tween the upper bound and lower bound of adam’s iteration complexity.Advances in Neural Information Processing Systems, 36, 2024a. Bohan Wang, Huishuai Zhang, Qi Meng, Ruoyu Sun, Zhi-Ming Ma, and Wei Chen. On the con- vergence of adam under non-uniform smoothness: Sep...

work page arXiv
[12]

dX i=1 (γt−1,i −γ t,i)(∇f(y t))imt−1,i #2 + 2 β1 1−β 1 2 E  

+ τG∧T−1X t=1 Pt + τG∧T−1X t=1 Dt .(38) We refer to the left-hand side as theW τG∧T .Then, we aim to bound the p 6-th moment ofW τG∧T . Step 3 (Bounding the( p 6)-th Moment of theW τG∧T viaS τG∧T−1 ). For anyp≥12,we take the( p 6)-th moment on both sides of eq. (38), which yields E h W p 6 τG∧T i ≤ ¯f p 6 (y1) +E   τG∧T−1X t=1 Pt ! p 6   +E   τG∧T−1...

work page 2024
[13]

TX t=1 ∥gt∥2 # =dv+ 1 T E

run forT≥10iterations with parametersβ 1 ∈[0,1), 42 WHYADAMCANBEATSGD: SECOND-MOMENTNORMALIZATIONYIELDSSHARPERTAILS β2 = 1−1/T, and step sizeγ=η/ √ T. For any0< δ <1, assume that the step size constantη satisfies 1 η ≥max ( 8dϵ v3/2 + 8√v , Dβ1β1 √ dL 1−β 1 ,1 ) , whereD β1 is defined in Lemma B.5. Then, with probability at least1−δ, one has 1 T TX t=1 ∥∇...

work page 2016
[14]

(2024))For any iteration stept, the following inequality holds: m2 t,i −m 2 t−1,i ≤ −(1−β 1)m2 t−1,i + (1−β 1)g2 t,i

Lemma B.6 (Property 4 in Jin et al. (2024))For any iteration stept, the following inequality holds: m2 t,i −m 2 t−1,i ≤ −(1−β 1)m2 t−1,i + (1−β 1)g2 t,i. Lemma B.7Let ¯f(x) :=f(x)−f ⋆ + 1, wheref ⋆ := inf x∈Rd f(x). For any iteration stept, the following inequality holds: ¯f(y t)≤2 ¯f(x t) + Lβ2 1 (1−β 1)2 ∥γt−1 ⊙m t−1∥2. Lemma B.8Consider the coordinatew...

work page 2024
[15]

hard- instance

Therefore E[U Y µ∧n]≤ r r−1 ∥Xµ∧n∥Ls =s∥X µ∧n∥Ls. Taking the supremum over all suchUyields ∥Yµ∧n∥Ls ≤s∥X µ∧n∥Ls. This completes the proof. Appendix D. Lower bound of SGD To establish a lower bound, it is standard to exhibit a single hard instance (or a distribution over instances)—namely, a function and an oracle/noise model satisfying the assumed conditi...

work page 2009