pith. sign in

arxiv: 2603.03099 · v8 · pith:7IRQ65OXnew · submitted 2026-03-03 · 💻 cs.LG · cs.AI

Why Adam Can Beat SGD: Second-Moment Normalization Yields Sharper Tails

Pith reviewed 2026-05-21 11:42 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords Adam optimizerSGDhigh-probability convergencesecond-moment normalizationmartingale analysisbounded variancestochastic optimizationadaptive methods
0
0 comments X

The pith

Adam's second-moment normalization produces high-probability convergence bounds that scale as the square root of the failure probability, unlike SGD which requires a linear dependence.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes a theoretical distinction in the high-probability convergence of Adam and SGD under the standard bounded-variance assumption on gradients. By focusing on the second-moment normalization step that Adam uses to rescale updates, the authors apply a stopping-time argument to martingale sequences and show that Adam obtains a δ to the power of negative one-half dependence on the failure probability δ. SGD, by contrast, cannot improve beyond a δ to the power of negative one dependence in the same setting. This separation supplies the first rigorous account of why Adam can deliver more reliable performance guarantees than plain stochastic gradient descent when practitioners demand high .

Core claim

Under the classical bounded-variance model, the second-moment normalization inside Adam yields a high-probability convergence guarantee whose dependence on the parameter δ is only δ to the power of negative one-half, while any corresponding guarantee for SGD must incur at least a δ to the power of negative one dependence; the separation is proved via a stopping-time and martingale analysis that isolates the effect of the normalization.

What carries the argument

Second-moment normalization, the operation in Adam that divides each coordinate update by the square root of a running average of squared gradients, which the stopping-time martingale argument shows produces lighter tails in the concentration bound.

If this is right

  • Adam supplies strictly stronger high-probability guarantees than SGD when the same second-moment bound is available.
  • The separation arises solely from the coordinate-wise normalization rather than from any momentum or adaptive learning-rate schedule.
  • Practitioners who require failure probabilities below 10^{-4} can expect Adam to need far fewer iterations than SGD to reach the same guarantee.
  • The martingale stopping-time technique isolates the normalization effect and can be reused for other adaptive methods that perform similar rescaling.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same normalization may improve tail behavior in non-convex landscapes where high-probability escape from saddle points matters.
  • Hybrid algorithms that add explicit second-moment rescaling to existing SGD variants could inherit the improved dependence without adopting the full Adam update rule.
  • Empirical studies that measure actual failure rates across repeated runs with varying δ would directly test the predicted gap.

Load-bearing premise

The gradients are assumed to have bounded second moments at every point.

What would settle it

A concrete stochastic optimization problem in which the high-probability bound for SGD is shown to achieve better than linear dependence on δ, or a direct counter-example in which Adam fails to achieve the square-root dependence, under the same bounded-variance assumption.

read the original abstract

Despite Adam demonstrating faster empirical convergence than SGD in many applications, much of the existing theory yields guarantees essentially comparable to those of SGD, leaving the empirical performance gap insufficiently explained. In this paper, we uncover a key second-moment normalization in Adam and develop a stopping-time/martingale analysis that provably distinguishes Adam from SGD under the classical bounded variance model (a second moment assumption). In particular, we establish the first theoretical separation between the high-probability convergence behaviors of the two methods: Adam achieves a $\delta^{-1/2}$ dependence on the confidence parameter $\delta$, whereas corresponding high-probability guarantee for SGD necessarily incurs at least a $\delta^{-1}$ dependence.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims to provide the first theoretical separation between the high-probability convergence of Adam and SGD under the classical bounded-variance model. Using a stopping-time and martingale analysis, it argues that Adam's second-moment normalization yields a δ^{-1/2} dependence on the failure probability δ, while any high-probability guarantee for SGD must incur at least a δ^{-1} dependence.

Significance. If the separation is rigorously established, the result would be significant for explaining Adam's empirical edge over SGD via tail behavior under standard second-moment assumptions alone. The deployment of classical martingale tools and stopping times to handle adaptivity is a methodological strength that could inform future high-probability analyses of adaptive optimizers.

major comments (2)
  1. [Section 4] Section 4 (SGD lower bound construction): the necessity claim that any high-probability bound for SGD must incur δ^{-1} under only E[||g_t - ∇f(x_t)||^2] ≤ σ^2 requires an explicit noise distribution with finite second moment (but heavier tails) whose lower bound is derived; without the concrete distribution and verification that the same oracle permits Adam's δ^{-1/2} bound, the separation is not yet load-bearing.
  2. [Theorem 3.2] Theorem 3.2 (Adam high-probability bound): the stopping-time argument must explicitly show that the running second-moment estimate does not introduce implicit higher-moment control that would be unavailable to SGD; the current derivation steps for the δ^{-1/2} exponent need to be checked against the bounded-variance assumption alone.
minor comments (2)
  1. [Section 2] Notation for the gradient estimator g_t and the second-moment accumulator should be introduced with a dedicated preliminary subsection for clarity.
  2. [Abstract] The abstract and introduction could more precisely state the precise form of the high-probability bound (e.g., the dependence on T, σ, and the function gap) rather than only the δ exponent.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful and constructive review. The comments identify important areas where additional explicitness will strengthen the separation result. We address each major comment below and will revise the manuscript accordingly.

read point-by-point responses
  1. Referee: [Section 4] Section 4 (SGD lower bound construction): the necessity claim that any high-probability bound for SGD must incur δ^{-1} under only E[||g_t - ∇f(x_t)||^2] ≤ σ^2 requires an explicit noise distribution with finite second moment (but heavier tails) whose lower bound is derived; without the concrete distribution and verification that the same oracle permits Adam's δ^{-1/2} bound, the separation is not yet load-bearing.

    Authors: We agree that an explicit construction is needed to make the lower bound fully load-bearing. In the revised manuscript we will add a concrete noise distribution that satisfies the bounded-variance assumption yet possesses heavier tails, derive the δ^{-1} high-probability lower bound for SGD under this oracle, and verify that the identical distribution permits Adam to attain the δ^{-1/2} rate stated in Theorem 3.2. These additions will appear in Section 4. revision: yes

  2. Referee: [Theorem 3.2] Theorem 3.2 (Adam high-probability bound): the stopping-time argument must explicitly show that the running second-moment estimate does not introduce implicit higher-moment control that would be unavailable to SGD; the current derivation steps for the δ^{-1/2} exponent need to be checked against the bounded-variance assumption alone.

    Authors: We will expand the proof of Theorem 3.2 to include an explicit verification that the stopping-time construction and the running second-moment estimate rely solely on the bounded-variance assumption. The revised derivation will break down each step to confirm that no higher-moment bounds are invoked and that the δ^{-1/2} dependence follows directly from the second-moment normalization together with standard martingale concentration, without any control unavailable to SGD. revision: yes

Circularity Check

0 steps flagged

No circularity: derivation uses standard martingale tools on bounded-variance assumption

full rationale

The paper's central separation (Adam's δ^{-1/2} high-probability rate versus SGD's necessary δ^{-1} under E[||g_t - ∇f(x_t)||^2] ≤ σ^2) is obtained via stopping-time analysis and classical martingale inequalities applied directly to the second-moment normalization. No equation reduces the claimed exponent difference to a fitted quantity, a self-citation chain, or a redefinition of the input assumption. The lower-bound construction for SGD is presented as a concrete noise distribution with only finite second moment; the Adam analysis exploits the running second-moment estimate without importing uniqueness theorems or ansatzes from prior author work. The derivation is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The analysis rests on the standard bounded-variance assumption common to stochastic optimization; no free parameters, new entities, or ad-hoc axioms are indicated in the abstract.

axioms (1)
  • domain assumption Classical bounded variance model (second moment assumption)
    Invoked to enable the martingale analysis that separates Adam from SGD.

pith-pipeline@v0.9.0 · 5644 in / 1225 out tokens · 34644 ms · 2026-05-21T11:42:46.373929+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. PowerStep: Memory-Efficient Adaptive Optimization via $\ell_p$-Norm Steepest Descent

    cs.LG 2026-05 unverdicted novelty 6.0

    PowerStep delivers coordinate-wise adaptive optimization by nonlinearly transforming a momentum buffer under an lp-norm steepest-descent geometry, matching Adam convergence with half the memory and supporting aggressi...

Reference graph

Works this paper leans on

15 extracted references · 15 canonical work pages · cited by 1 Pith paper · 1 internal anchor

  1. [1]

    Bach, and Nicolas Usunier

    Alexandre D´efossez, L ´eon Bottou, Francis R. Bach, and Nicolas Usunier. A simple convergence proof of adam and adagrad.Trans. Mach. Learn. Res., 2022,

  2. [2]

    Stochastic first- and zeroth-order methods for nonconvex stochastic programming.SIAM Journal on Optimization, 23(4):2341–2368, 2013a

    Saeed Ghadimi and Guanghui Lan. Stochastic first- and zeroth-order methods for nonconvex stochastic programming.SIAM Journal on Optimization, 23(4):2341–2368, 2013a. Saeed Ghadimi and Guanghui Lan. Stochastic first-and zeroth-order methods for nonconvex stochastic programming.SIAM journal on optimization, 23(4):2341–2368, 2013b. Zhishuai Guo, Yi Xu, Wotao...

  3. [3]

    On convergence of adam for stochastic optimization under relaxed assumptions.CoRR, abs/2402.03982,

    Yusu Hong and Junhong Lin. On convergence of adam for stochastic optimization under relaxed assumptions.CoRR, abs/2402.03982,

  4. [4]

    A comprehensive framework for analyzing the convergence of adam: Bridging the gap with sgd.arXiv preprint arXiv:2410.04458,

    Ruinan Jin, Xiao Li, Yaoliang Yu, and Baoxiang Wang. A comprehensive framework for analyzing the convergence of adam: Bridging the gap with sgd.arXiv preprint arXiv:2410.04458,

  5. [5]

    Adam: A Method for Stochastic Optimization

    Diederik P Kingma. Adam: A method for stochastic optimization.arXiv preprint arXiv:1412.6980,

  6. [6]

    Kingma and Jimmy Ba

    Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In Yoshua Bengio and Yann LeCun, editors,3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings,

  7. [7]

    Convex and non-convex optimization under generalized smoothness.Advances in Neural Information Processing Systems, 36:40238–40271, 2023a

    Haochuan Li, Jian Qian, Yi Tian, Alexander Rakhlin, and Ali Jadbabaie. Convex and non-convex optimization under generalized smoothness.Advances in Neural Information Processing Systems, 36:40238–40271, 2023a. Haochuan Li, Alexander Rakhlin, and Ali Jadbabaie. Convergence of adam under relaxed assump- tions. In Alice Oh, Tristan Naumann, Amir Globerson, Ka...

  8. [8]

    Accelerated convergence of stochastic heavy ball method under anisotropic gradient noise.arXiv preprint arXiv:2312.14567,

    Rui Pan, Yuxing Liu, Xiaoyu Wang, and Tong Zhang. Accelerated convergence of stochastic heavy ball method under anisotropic gradient noise.arXiv preprint arXiv:2312.14567,

  9. [9]

    Toward understanding why adam converges faster than sgd for transform- ers.arXiv preprint arXiv:2306.00204,

    Yan Pan and Yuanzhi Li. Toward understanding why adam converges faster than sgd for transform- ers.arXiv preprint arXiv:2306.00204,

  10. [10]

    Reddi, Satyen Kale, and Sanjiv Kumar

    Sashank J. Reddi, Satyen Kale, and Sanjiv Kumar. On the convergence of adam and beyond. In 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018, Conference Track Proceedings. OpenReview.net,

  11. [11]

    Closing the gap be- tween the upper bound and lower bound of adam’s iteration complexity.Advances in Neural Information Processing Systems, 36, 2024a

    Bohan Wang, Jingwen Fu, Huishuai Zhang, Nanning Zheng, and Wei Chen. Closing the gap be- tween the upper bound and lower bound of adam’s iteration complexity.Advances in Neural Information Processing Systems, 36, 2024a. Bohan Wang, Huishuai Zhang, Qi Meng, Ruoyu Sun, Zhi-Ming Ma, and Wei Chen. On the con- vergence of adam under non-uniform smoothness: Sep...

  12. [12]

    dX i=1 (γt−1,i −γ t,i)(∇f(y t))imt−1,i #2 + 2 β1 1−β 1 2 E  

    + τG∧T−1X t=1 Pt + τG∧T−1X t=1 Dt .(38) We refer to the left-hand side as theW τG∧T .Then, we aim to bound the p 6-th moment ofW τG∧T . Step 3 (Bounding the( p 6)-th Moment of theW τG∧T viaS τG∧T−1 ). For anyp≥12,we take the( p 6)-th moment on both sides of eq. (38), which yields E h W p 6 τG∧T i ≤ ¯f p 6 (y1) +E   τG∧T−1X t=1 Pt ! p 6   +E   τG∧T−1...

  13. [13]

    TX t=1 ∥gt∥2 # =dv+ 1 T E

    run forT≥10iterations with parametersβ 1 ∈[0,1), 42 WHYADAMCANBEATSGD: SECOND-MOMENTNORMALIZATIONYIELDSSHARPERTAILS β2 = 1−1/T, and step sizeγ=η/ √ T. For any0< δ <1, assume that the step size constantη satisfies 1 η ≥max ( 8dϵ v3/2 + 8√v , Dβ1β1 √ dL 1−β 1 ,1 ) , whereD β1 is defined in Lemma B.5. Then, with probability at least1−δ, one has 1 T TX t=1 ∥∇...

  14. [14]

    (2024))For any iteration stept, the following inequality holds: m2 t,i −m 2 t−1,i ≤ −(1−β 1)m2 t−1,i + (1−β 1)g2 t,i

    Lemma B.6 (Property 4 in Jin et al. (2024))For any iteration stept, the following inequality holds: m2 t,i −m 2 t−1,i ≤ −(1−β 1)m2 t−1,i + (1−β 1)g2 t,i. Lemma B.7Let ¯f(x) :=f(x)−f ⋆ + 1, wheref ⋆ := inf x∈Rd f(x). For any iteration stept, the following inequality holds: ¯f(y t)≤2 ¯f(x t) + Lβ2 1 (1−β 1)2 ∥γt−1 ⊙m t−1∥2. Lemma B.8Consider the coordinatew...

  15. [15]

    hard- instance

    Therefore E[U Y µ∧n]≤ r r−1 ∥Xµ∧n∥Ls =s∥X µ∧n∥Ls. Taking the supremum over all suchUyields ∥Yµ∧n∥Ls ≤s∥X µ∧n∥Ls. This completes the proof. Appendix D. Lower bound of SGD To establish a lower bound, it is standard to exhibit a single hard instance (or a distribution over instances)—namely, a function and an oracle/noise model satisfying the assumed conditi...