Open Problem: Is AdamW Effective Under Heavy-Tailed Noise?

Dingzhi Yu; Hongyi Tao; Lijun Zhang; Luo Luo; Yuanyu Wan

arXiv: 2606.23676 · v1 · pith:QUXHLJ6Pnew · submitted 2026-06-22 · cs.LG · cs.AI · math.OC · stat.ML

Pith reviewed 2026-06-26 09:07 UTC · model grok-4.3

keywords: AdamW · heavy-tailed noise · convergence analysis · stochastic optimization · large language models · open problem · second-moment accumulator

The pith

AdamW lacks a convergence proof under heavy-tailed gradient noise typical of LLM training.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper observes that AdamW serves as the standard optimizer for large language models, yet existing convergence analyses assume finite variance in the stochastic gradients. Empirical observations suggest the noise is often heavy-tailed, and recent results establish convergence for sign-based methods and AdaGrad in that regime. The authors therefore pose whether AdamW itself converges under matching heavy-tailed assumptions or whether its second-moment accumulator introduces an obstruction. They support the question by proving a positive result in a weighted metric and exhibiting a corridor-style lower bound that illustrates how the accumulator can suppress the effect of large gradient entries.

Core claim

The paper formulates the convergence of AdamW under heavy-tailed stochastic gradient noise as an open problem. It establishes a positive benchmark result in a suitably weighted metric and supplies a corridor lower-bound construction showing how the denominator memory can hide large gradient components.

What carries the argument

AdamW's second-moment accumulator: a running estimate (an exponential moving average) of squared gradient magnitudes, whose square root divides the momentum-averaged gradient coordinate-wise in every update.
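
For orientation, here is a minimal NumPy sketch of one AdamW step in its commonly stated form (decoupled weight decay with bias correction). It is an illustration rather than code from the paper; the function name `adamw_step` and the default hyperparameters are assumptions chosen for the example. The line that updates `v` is the second-moment accumulator discussed above.

```python
import numpy as np

def adamw_step(x, g, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=0.01):
    """One AdamW step in the commonly stated form (hypothetical helper).

    x: parameters, g: stochastic gradient, m/v: moment estimates,
    t: 1-based step counter. Returns the updated (x, m, v).
    """
    m = beta1 * m + (1.0 - beta1) * g          # first-moment EMA (momentum)
    v = beta2 * v + (1.0 - beta2) * g * g      # second-moment accumulator
    m_hat = m / (1.0 - beta1 ** t)             # bias corrections
    v_hat = v / (1.0 - beta2 ** t)
    update = m_hat / (np.sqrt(v_hat) + eps)    # coordinate-wise normalization
    x = x - lr * (update + weight_decay * x)   # decoupled weight decay
    return x, m, v

# Usage: on the very first step the normalized update is close to sign(g).
x0 = np.array([1.0, -2.0])
g0 = np.array([0.5, -4.0])
x1, m1, v1 = adamw_step(x0, g0, np.zeros(2), np.zeros(2), t=1)
```

Because `v` decays at rate `beta2`, which is typically much closer to 1 than `beta1`, the denominator remembers past gradient magnitudes far longer than the numerator does.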

If this is right

If AdamW converges, its practical success in LLM training would receive a theoretical justification that extends beyond finite-variance regimes.
If the accumulator creates an obstruction, the weighted-metric benchmark still shows that modified analyses can recover positive guarantees.
The corridor mechanism identifies a concrete way in which historical second-moment information can suppress the contribution of outlier gradients.
Resolution of the open problem would clarify whether existing heavy-tailed analyses for Lion, Muon, and AdaGrad extend to the AdamW family.
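
The suppression effect in the third point can be seen in a toy scalar experiment. The sketch below is not the paper's corridor construction; it is a hypothetical sequence (constant gradient 1, one outlier of size 1000, then constant gradient 1 again) chosen only to show how a stale denominator shrinks later updates. Bias correction is omitted and the state starts at the steady state `m = v = 1`, both simplifying assumptions.

```python
import math

def normalized_updates(grads, beta1=0.9, beta2=0.999, m0=0.0, v0=0.0, eps=1e-12):
    """Return u_t = m_t / (sqrt(v_t) + eps) for a scalar gradient sequence."""
    m, v, out = m0, v0, []
    for g in grads:
        m = beta1 * m + (1.0 - beta1) * g      # momentum forgets at rate beta1
        v = beta2 * v + (1.0 - beta2) * g * g  # accumulator forgets at rate beta2
        out.append(m / (math.sqrt(v) + eps))
    return out

# Hypothetical toy sequence: 10 ordinary steps, one outlier, 50 ordinary steps.
grads = [1.0] * 10 + [1000.0] + [1.0] * 50
u = normalized_updates(grads, m0=1.0, v0=1.0)
print(f"before outlier: {u[9]:.3f}  at outlier: {u[10]:.3f}  50 steps later: {u[-1]:.3f}")
```

In this toy run the normalized update is about 1 before the outlier and only about 0.05 fifty steps after it, even though the gradients are identical in both phases: the momentum has forgotten the outlier, while the accumulator has not.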

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

A negative resolution would motivate systematic replacement of AdamW by sign-based methods when heavy tails are expected.
A positive resolution could be tested by checking whether AdamW reaches the same iteration complexity as Lion on the same heavy-tailed benchmark problems.
The corridor construction may generalize to other adaptive methods that maintain exponential moving averages of squared gradients.
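
For the comparison with Lion suggested above, a minimal sketch of a Lion step in its commonly stated form may help. It is not taken from the paper; `lion_step` and its defaults are assumptions for illustration. The point of contrast is that the update uses only the sign of an interpolated momentum, so there is no second-moment denominator that could carry memory of past outliers.

```python
import numpy as np

def lion_step(x, g, m, lr=1e-4, beta1=0.9, beta2=0.99, weight_decay=0.0):
    """One Lion step in the commonly stated form (hypothetical helper)."""
    c = beta1 * m + (1.0 - beta1) * g             # interpolation used for the sign
    x = x - lr * (np.sign(c) + weight_decay * x)  # every coordinate moves by lr
    m = beta2 * m + (1.0 - beta2) * g             # momentum is updated with beta2
    return x, m

# Usage: a huge and a tiny gradient entry produce steps of equal magnitude.
x_new, m_new = lion_step(np.zeros(2), np.array([1e6, -1e-6]), np.zeros(2))
```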

Load-bearing premise

Stochastic gradient noise in LLM pretraining is typically heavy-tailed.
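
To make "heavy-tailed" concrete: a distribution can have a finite mean, or a finite p-th moment for some p below 2, while its variance is infinite. The sketch below uses the Pareto distribution purely as a textbook illustration (it is an assumption of this example, not a noise model from the paper) and evaluates its raw moments in closed form.

```python
import math

def pareto_raw_moment(alpha, p, x_min=1.0):
    """E[X^p] for X ~ Pareto(alpha, x_min).

    Equals alpha * x_min**p / (alpha - p) when p < alpha, and is infinite otherwise.
    """
    if alpha <= 0 or x_min <= 0:
        raise ValueError("alpha and x_min must be positive")
    if p >= alpha:
        return math.inf
    return alpha * x_min ** p / (alpha - p)

# Tail index 1.5: the mean exists, the second moment does not.
mean_15 = pareto_raw_moment(1.5, 1.0)
second_15 = pareto_raw_moment(1.5, 2.0)
```

Finite-variance analyses correspond to the case where the second moment exists; the heavy-tailed regime is the one where only some lower-order moment is finite.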

What would settle it

A concrete counter-example sequence of heavy-tailed gradients on which AdamW diverges, or a matching convergence proof under the same heavy-tailed assumptions used for sign-based optimizers.

Original abstract

AdamW is the de facto optimizer for training large language models (LLMs), yet the theory behind it still lives mostly in finite-variance regimes. This is increasingly unsatisfying, as empirical evidence indicates that stochastic gradient noise in LLM pretraining is typically heavy-tailed. Recent work shows that sign-based optimizers such as Lion and Muon achieve sharp heavy-tailed rates, and that AdaGrad can also converge under heavy-tailed noise. However, no rigorous convergence theory for AdamW has yet been established in this regime. Can AdamW converge under the same heavy-tailed assumptions, or does its second-moment accumulator create a genuine obstruction? We formulate this as an open problem, prove a positive weighted-metric benchmark, and give a corridor lower-bound mechanism showing how denominator memory can hide large gradients.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

This paper frames AdamW under heavy-tailed noise as an open problem and supplies a weighted-metric benchmark plus a corridor lower-bound construction that extend prior work on other optimizers.

The main takeaway is that this is an open-problem paper asking whether AdamW can converge when gradients have heavy tails, with two concrete supporting pieces rather than a full resolution.

What is new is the direct focus on AdamW together with the weighted-metric positive benchmark and the corridor mechanism that shows how the second-moment accumulator can mask large updates. These build on earlier heavy-tailed analyses for Lion, Muon, and AdaGrad without repeating them. The paper does a clean job stating why the gap matters: AdamW is the default for large models yet its theory stops at finite variance.

The heavy-tailed premise is presented only as empirical motivation, which is fine for an open-problem statement but means the urgency rests on outside evidence. The constructions themselves look like genuine analytic additions rather than post-hoc fitting. No internal contradictions appear in the problem setup or the partial results described.

This is for researchers working on convergence rates of adaptive methods under non-standard noise. Someone already following the Lion/AdaGrad heavy-tailed line will see immediate next steps. It is worth sending to peer review because the question is sharply posed and the benchmark and lower-bound give reviewers and readers something specific to check and extend.

Referee Report

0 major / 1 minor

Summary. The manuscript poses an open problem on the convergence of AdamW under heavy-tailed stochastic gradient noise, motivated by its use in LLM pretraining. It contrasts this with recent positive results for sign-based methods and AdaGrad, formulates the question of whether AdamW's second-moment accumulator creates an obstruction, proves a positive benchmark result in a weighted metric, and constructs a corridor lower-bound mechanism illustrating how denominator memory can mask large gradients.

Significance. Resolving the open problem would clarify the theoretical status of the dominant optimizer for large-scale training under empirically relevant noise distributions. The weighted-metric benchmark demonstrates that AdamW remains viable in at least one natural distance, while the lower-bound construction supplies a concrete mechanism that future analyses must rule out or accommodate. These elements provide a precise framing that can guide subsequent work on heavy-tailed optimization without relying on fitted parameters or circular definitions.

minor comments (1)

The abstract and problem statement would benefit from an explicit statement of the precise heavy-tailed assumption (e.g., moment index or tail index) used in the benchmark and lower-bound constructions.
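
As an illustration of the kind of statement this comment asks for, a formalization frequently used in the heavy-tailed optimization literature is the bounded p-th moment condition below. Whether the paper adopts exactly this form (choice of norm, global versus coordinate-wise bound) is not stated in this review, so the symbols $g_t$, $x_t$, $\sigma$, and $p$ should be read as assumptions.

```latex
\mathbb{E}\left[ g_t \mid x_t \right] = \nabla f(x_t),
\qquad
\mathbb{E}\left[ \left\| g_t - \nabla f(x_t) \right\|^{p} \mid x_t \right] \le \sigma^{p}
\quad \text{for some } p \in (1, 2].
```

Setting $p = 2$ recovers the usual finite-variance assumption; for $p < 2$ the variance of the noise may be infinite.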

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their positive assessment of the manuscript and for recommending acceptance. We appreciate the recognition that the open problem, the weighted-metric benchmark, and the corridor lower-bound together provide a precise framing for future work on heavy-tailed optimization.

Circularity Check

0 steps flagged

No circularity: open problem with independent analytic constructions

The paper poses an open question on AdamW convergence under heavy-tailed noise and supplies a weighted-metric positive benchmark plus a corridor lower-bound construction. These are presented as new analytic results rather than reductions of any quantity to a fitted parameter, self-referential definition, or load-bearing self-citation chain. The heavy-tailed premise appears only as empirical motivation, not as an input that forces the claimed results by construction. No derivation step equates to its own inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The paper introduces no fitted parameters or new entities; it relies on standard heavy-tailed noise models from recent optimization literature to frame the open problem.

axioms (1)

domain assumption: Heavy-tailed noise models established in prior work on sign-based and AdaGrad optimizers.
The relevance of the open problem and benchmark rests on the empirical claim that LLM gradient noise is heavy-tailed, drawn from cited recent studies.
