
arxiv: 2606.23676 · v1 · pith:QUXHLJ6Pnew · submitted 2026-06-22 · 💻 cs.LG · cs.AI · math.OC · stat.ML

Open Problem: Is AdamW Effective Under Heavy-Tailed Noise?

Pith reviewed 2026-06-26 09:07 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · math.OC · stat.ML
keywords AdamW · heavy-tailed noise · convergence analysis · stochastic optimization · large language models · open problem · second-moment accumulator

The pith

AdamW lacks a convergence proof under heavy-tailed gradient noise typical of LLM training.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper observes that AdamW serves as the standard optimizer for large language models, yet existing convergence analyses assume finite variance in the stochastic gradients. Empirical observations suggest the noise is often heavy-tailed, and recent results establish convergence for sign-based methods and AdaGrad in that regime. The authors therefore pose whether AdamW itself converges under matching heavy-tailed assumptions or whether its second-moment accumulator introduces an obstruction. They support the question by proving a positive result in a weighted metric and exhibiting a corridor-style lower bound that illustrates how the accumulator can suppress the effect of large gradient entries.

Core claim

The paper formulates the convergence of AdamW under heavy-tailed stochastic gradient noise as an open problem. It establishes a positive benchmark result in a suitably weighted metric and supplies a corridor lower-bound construction showing how the denominator memory can hide large gradient components.

What carries the argument

AdamW's second-moment accumulator: an exponential moving average of squared gradient entries whose square root divides the momentum-averaged gradient, coordinate by coordinate.
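
To make the accumulator concrete, here is a minimal sketch of one AdamW step in plain Python. It follows the commonly used formulation with bias correction and decoupled weight decay; the function name `adamw_step`, the list-of-floats representation, and the default hyperparameters are illustrative assumptions, not taken from the paper.

```python
import math

def adamw_step(x, g, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=0.01):
    """One AdamW step on lists of floats (hypothetical helper, t starts at 1).

    Returns the updated parameters, first moments, and second moments.
    """
    new_x, new_m, new_v = [], [], []
    for xi, gi, mi, vi in zip(x, g, m, v):
        mi = beta1 * mi + (1.0 - beta1) * gi        # first-moment EMA
        vi = beta2 * vi + (1.0 - beta2) * gi * gi   # second-moment accumulator
        m_hat = mi / (1.0 - beta1 ** t)             # bias corrections
        v_hat = vi / (1.0 - beta2 ** t)
        step = m_hat / (math.sqrt(v_hat) + eps)     # per-coordinate normalization
        new_x.append(xi - lr * (step + weight_decay * xi))  # decoupled decay
        new_m.append(mi)
        new_v.append(vi)
    return new_x, new_m, new_v
```

On the very first step the normalized update has magnitude close to 1 in every coordinate, regardless of the gradient's scale, which is why the denominator matters so much when some gradient entries are extremely large.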

If this is right

  • If AdamW converges, its practical success in LLM training would receive a theoretical justification that extends beyond finite-variance regimes.
  • If the accumulator creates an obstruction, the weighted-metric benchmark still shows that modified analyses can recover positive guarantees.
  • The corridor mechanism identifies a concrete way in which historical second-moment information can suppress the contribution of outlier gradients.
  • Resolution of the open problem would clarify whether existing heavy-tailed analyses for Lion, Muon, and AdaGrad extend to the AdamW family.
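
The denominator-memory effect mentioned in the bullets above can be illustrated with a toy simulation. This is not the paper's corridor construction, only a sketch under simplifying assumptions: a single coordinate, no momentum, no bias correction, and hypothetical names (`ratios_after_spike`, `spike`, `later_grad`). One large gradient inflates the accumulator, and every later gradient is then divided by that stale denominator.

```python
import math

def ratios_after_spike(spike=1000.0, later_grad=1.0, steps=2000, beta2=0.999):
    """Return |g_t| / sqrt(v_t) for the steps that follow one large gradient.

    Toy setting (assumption): v starts at 0, sees `spike` once, and then
    sees the constant gradient `later_grad` for `steps` iterations.
    """
    v = (1.0 - beta2) * spike * spike               # accumulator right after the spike
    ratios = []
    for _ in range(steps):
        v = beta2 * v + (1.0 - beta2) * later_grad * later_grad
        ratios.append(abs(later_grad) / math.sqrt(v))
    return ratios
```

With the defaults, the normalized magnitude starts near 0.03 and is still below 0.1 after 2000 steps, whereas it would be 1 if the accumulator had only ever seen `later_grad`. The recovery time scales roughly like 1/(1 - beta2), which is the "memory" in question.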

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • A negative resolution would motivate systematic replacement of AdamW by sign-based methods when heavy tails are expected.
  • A positive resolution could be tested by checking whether AdamW reaches the same iteration complexity as Lion on the same heavy-tailed benchmark problems.
  • The corridor construction may generalize to other adaptive methods that maintain exponential moving averages of squared gradients.
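
For contrast with AdamW, a sign-based update can be sketched as follows. This follows the Lion update as it is usually stated (sign of an interpolated momentum, then a separate momentum refresh); the helper names and default hyperparameters are assumptions for illustration and are not drawn from the paper.

```python
def sign(z):
    """Return -1, 0, or 1 according to the sign of z."""
    return (z > 0) - (z < 0)

def lion_step(x, g, m, lr=1e-4, beta1=0.9, beta2=0.99, weight_decay=0.0):
    """One Lion-style step on lists of floats (hypothetical helper)."""
    new_x, new_m = [], []
    for xi, gi, mi in zip(x, g, m):
        c = beta1 * mi + (1.0 - beta1) * gi                    # interpolated direction
        new_x.append(xi - lr * (sign(c) + weight_decay * xi))  # bounded sign update
        new_m.append(beta2 * mi + (1.0 - beta2) * gi)          # momentum refresh
    return new_x, new_m
```

Each coordinate moves by exactly `lr` (before weight decay) no matter how large the gradient entry is, and no squared-gradient history is kept, so there is no denominator that an outlier could inflate.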

Load-bearing premise

Stochastic gradient noise in LLM pretraining is typically heavy-tailed.
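
As a rough illustration of what "heavy-tailed" means here, the sketch below draws symmetric Pareto noise with tail index `alpha`. For 1 < alpha < 2 this distribution has a finite mean but infinite variance. The choice of a Pareto law and the helper names are assumptions for illustration; the paper's actual noise model is not specified in this summary.

```python
import random

def symmetric_pareto_noise(n, alpha=1.5, seed=0):
    """Draw n symmetric samples with tail index alpha (magnitudes are >= 1)."""
    rng = random.Random(seed)
    return [rng.choice((-1.0, 1.0)) * rng.paretovariate(alpha) for _ in range(n)]

def running_second_moment(samples):
    """Running average of squared samples.

    For infinite-variance noise this average keeps jumping whenever a new
    outlier arrives instead of settling to a limit.
    """
    total, out = 0.0, []
    for k, s in enumerate(samples, start=1):
        total += s * s
        out.append(total / k)
    return out
```

Plotting `running_second_moment(symmetric_pareto_noise(100000))` against the same curve for Gaussian samples is a quick way to see why finite-variance analyses do not transfer directly.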

What would settle it

A concrete counter-example sequence of heavy-tailed gradients on which AdamW diverges, or a matching convergence proof under the same heavy-tailed assumptions used for sign-based optimizers.

Original abstract

AdamW is the de facto optimizer for training large language models (LLMs), yet the theory behind it still lives mostly in finite-variance regimes. This is increasingly unsatisfying, as empirical evidence indicates that stochastic gradient noise in LLM pretraining is typically heavy-tailed. Recent work shows that sign-based optimizers such as Lion and Muon achieve sharp heavy-tailed rates, and that AdaGrad can also converge under heavy-tailed noise. However, no rigorous convergence theory for AdamW has yet been established in this regime. Can AdamW converge under the same heavy-tailed assumptions, or does its second-moment accumulator create a genuine obstruction? We formulate this as an open problem, prove a positive weighted-metric benchmark, and give a corridor lower-bound mechanism showing how denominator memory can hide large gradients.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 1 minor

Summary. The manuscript poses an open problem on the convergence of AdamW under heavy-tailed stochastic gradient noise, motivated by its use in LLM pretraining. It contrasts this with recent positive results for sign-based methods and AdaGrad, formulates the question of whether AdamW's second-moment accumulator creates an obstruction, proves a positive benchmark result in a weighted metric, and constructs a corridor lower-bound mechanism illustrating how denominator memory can mask large gradients.

Significance. Resolving the open problem would clarify the theoretical status of the dominant optimizer for large-scale training under empirically relevant noise distributions. The weighted-metric benchmark demonstrates that AdamW remains viable in at least one natural distance, while the lower-bound construction supplies a concrete mechanism that future analyses must rule out or accommodate. These elements provide a precise framing that can guide subsequent work on heavy-tailed optimization without relying on fitted parameters or circular definitions.

minor comments (1)
  1. The abstract and problem statement would benefit from an explicit statement of the precise heavy-tailed assumption (e.g., moment index or tail index) used in the benchmark and lower-bound constructions.
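
For orientation, one common way to state such an assumption in the heavy-tailed optimization literature is a bounded conditional $p$th moment. The display below is a sketch of that standard form, not a quotation from the paper: the noise symbol $\xi_t = g_t - \nabla f(x_t)$ matches the extracted proof fragment, while $\sigma$, the filtration $\mathcal{F}_{t-1}$, and the choice of norm are assumptions.

```latex
\mathbb{E}\left[\xi_t \mid \mathcal{F}_{t-1}\right] = 0,
\qquad
\mathbb{E}\left[\|\xi_t\|^{p} \mid \mathcal{F}_{t-1}\right] \le \sigma^{p},
\qquad
p \in (1, 2],
\qquad
\xi_t = g_t - \nabla f(x_t).
```

Setting $p = 2$ recovers the usual finite-variance condition, while $p < 2$ allows infinite variance.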

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their positive assessment of the manuscript and for recommending acceptance. We appreciate the recognition that the open problem, the weighted-metric benchmark, and the corridor lower-bound together provide a precise framing for future work on heavy-tailed optimization.

Circularity Check

0 steps flagged

No circularity: open problem with independent analytic constructions

full rationale

The paper poses an open question on AdamW convergence under heavy-tailed noise and supplies a weighted-metric positive benchmark plus a corridor lower-bound construction. These are presented as new analytic results rather than reductions of any quantity to a fitted parameter, self-referential definition, or load-bearing self-citation chain. The heavy-tailed premise appears only as empirical motivation, not as an input that forces the claimed results by construction. No derivation step equates to its own inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The paper introduces no fitted parameters or new entities; it relies on standard heavy-tailed noise models from recent optimization literature to frame the open problem.

axioms (1)
  • domain assumption: Heavy-tailed noise models established in prior work on sign-based and AdaGrad optimizers
    The relevance of the open problem and benchmark rests on the empirical claim that LLM gradient noise is heavy-tailed, drawn from cited recent studies.

pith-pipeline@v0.9.1-grok · 5675 in / 1121 out tokens · 25984 ms · 2026-06-26T09:07:26.958758+00:00 · methodology


Reference graph

Works this paper leans on

14 extracted references · 6 linked inside Pith

  1. [1]

    The geometry of sign gradient descent

    Lukas Balles, Fabian Pedregosa, and Nicolas Le Roux. The geometry of sign gradient descent. arXiv preprint arXiv:2002.08056,

  2. [2]

    Improved analysis for sign-based methods with momentum updates. arXiv preprint arXiv:2507.12091,

    Wei Jiang, Dingzhi Yu, Sifan Yang, Wenhao Yang, and Lijun Zhang. Improved analysis for sign-based methods with momentum updates. arXiv preprint arXiv:2507.12091,

  3. [3]

    Kimi K2: Open agentic intelligence

    URL https://github.com/karpathy/nanoGPT. Kimi Team. Kimi K2: Open agentic intelligence. arXiv preprint arXiv:2507.20534,

  4. [4]

    Muon is scalable for LLM training. arXiv preprint arXiv:2502.16982, 2025a

    Jingyuan Liu, Jianlin Su, Xingcheng Yao, Zhejun Jiang, Guokun Lai, Yulun Du, Yidao Qin, Weixin Xu, Enzhe Lu, Junjie Yan, et al. Muon is scalable for LLM training. arXiv preprint arXiv:2502.16982, 2025a. Yuxing Liu, Rui Pan, and Tong Zhang. AdaGrad under anisotropic smoothness. In International Conference on Learning Representations (ICLR), pages 19574–19608...

  5. [5]

    Can Adaptive Gradient Methods Converge under Heavy-Tailed Noise? A Case Study of AdaGrad. arXiv preprint arXiv:2605.18694,

    Zijian Liu. Can Adaptive Gradient Methods Converge under Heavy-Tailed Noise? A Case Study of AdaGrad. arXiv preprint arXiv:2605.18694,

  6. [6]

    Simple convergence proof of Adam from a sign-like descent perspective. arXiv preprint arXiv:2507.05966,

    Hanyang Peng, Shuang Qin, Yue Yu, Fangqing Jiang, Hui Wang, and Zhouchen Lin. Simple convergence proof of Adam from a sign-like descent perspective. arXiv preprint arXiv:2507.05966,

  7. [7]

    Benchmarking optimizers for large language model pretraining. arXiv preprint arXiv:2509.01440,

    Andrei Semenov, Matteo Pagliardini, and Martin Jaggi. Benchmarking optimizers for large language model pretraining. arXiv preprint arXiv:2509.01440,

  8. [8]

    Practical efficiency of Muon for pretraining. arXiv preprint arXiv:2505.02222,

    Ishaan Shah, Anthony M. Polloreno, Karl Stratos, Philip Monk, Adarsh Chaluvaraju, Andrew Hojel, Andrew Ma, Anil Thomas, Ashish Tanwer, Darsh J. Shah, et al. Practical efficiency of Muon for pretraining. arXiv preprint arXiv:2505.02222,

  9. [9]

    When and Why SignSGD Outperforms SGD: A Theoretical Study Based on ℓ1-norm Lower Bounds. arXiv preprint arXiv:2605.06615,

    Hongyi Tao, Dingzhi Yu, and Lijun Zhang. When and Why SignSGD Outperforms SGD: A Theoretical Study Based on ℓ1-norm Lower Bounds. arXiv preprint arXiv:2605.06615,

  10. [10]

    Fantastic pretraining optimizers and where to find them. arXiv preprint arXiv:2509.02046,

    Kaiyue Wen, David Hall, Tengyu Ma, and Percy Liang. Fantastic pretraining optimizers and where to find them. arXiv preprint arXiv:2509.02046,

  11. [11]

    Provable benefit of sign descent: A minimal model under heavy-tail class imbalance

    Robin Yadav, Shuo Xie, Tianhao Wang, and Zhiyuan Li. Provable benefit of sign descent: A minimal model under heavy-tail class imbalance. In OPT 2025: Optimization for Machine Learning,

  12. [12]

    Mirror descent under generalized smoothness. arXiv preprint arXiv:2502.00753,

    Dingzhi Yu, Wei Jiang, Hongyi Tao, Yuanyu Wan, and Lijun Zhang. Mirror descent under generalized smoothness. arXiv preprint arXiv:2502.00753,

  13. [13]

    StoSignSGD: Unbiased Structural Stochasticity Fixes SignSGD for Training Large Language Models. arXiv preprint arXiv:2604.15416, 2026a

    Dingzhi Yu, Rui Pan, Yuxing Liu, and Tong Zhang. StoSignSGD: Unbiased Structural Stochasticity Fixes SignSGD for Training Large Language Models. arXiv preprint arXiv:2604.15416, 2026a. Dingzhi Yu, Hongyi Tao, Yuanyu Wan, Luo Luo, and Lijun Zhang. Sign-based optimizers are effective under heavy-tailed noise. arXiv preprint arXiv:2602.07425, 2026b. Aohan Zeng...

  14. [14]

    Since $\varepsilon_t \ge 0$, this gives the deterministic update bound $|u_{t,i}| \le C_\beta = \frac{1-\beta_1}{\sqrt{(1-\beta_2)(1-\beta_1^2/\beta_2)}}$. (B) Proof of Proposition 2. Let $\rho = 1-\beta_1$, $\xi_t = g_t - \nabla f(x_t)$, and $e_t = m_t - \nabla f(x_t)$

    $v_t$. Since $\varepsilon_t \ge 0$, this gives the deterministic update bound $|u_{t,i}| \le C_\beta = \frac{1-\beta_1}{\sqrt{(1-\beta_2)(1-\beta_1^2/\beta_2)}}$. (B) Proof of Proposition 2. Let $\rho = 1-\beta_1$, $\xi_t = g_t - \nabla f(x_t)$, and $e_t = m_t - \nabla f(x_t)$. We use the von Bahr–Esseen inequality (von Bahr and Esseen, 1965). For conditionally mean-zero random variables and $p \in [1,2]$, the $p$th moment of their sum is bounded, up to a uni...
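
The deterministic update bound in the fragment above can be checked numerically. The sketch below assumes the update is the ratio of the uncorrected moving averages, u_t = m_t / sqrt(v_t), with zero initialization and a single coordinate; those choices and the helper names are assumptions, since the surrounding definitions are not part of the extracted text. The bound itself requires beta1^2 < beta2.

```python
import math

def update_bound(beta1, beta2):
    """C_beta = (1 - beta1) / sqrt((1 - beta2) * (1 - beta1**2 / beta2))."""
    return (1.0 - beta1) / math.sqrt((1.0 - beta2) * (1.0 - beta1 ** 2 / beta2))

def max_update_ratio(grads, beta1=0.9, beta2=0.999):
    """Largest |m_t| / sqrt(v_t) seen along a gradient sequence.

    Assumption: no bias correction, m_0 = v_0 = 0, and eps = 0, which is
    the worst case because any eps >= 0 only shrinks the ratio.
    """
    m = v = 0.0
    worst = 0.0
    for g in grads:
        m = beta1 * m + (1.0 - beta1) * g
        v = beta2 * v + (1.0 - beta2) * g * g
        if v > 0.0:
            worst = max(worst, abs(m) / math.sqrt(v))
    return worst
```

For `beta1=0.9` and `beta2=0.999` the constant is roughly 7.27, and the ratio stays below it for any gradient sequence, however spiky, by a Cauchy-Schwarz argument on the two moving averages.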