Sign-Based Optimizers Are Effective Under Heavy-Tailed Noise

Dingzhi Yu; Hongyi Tao; Lijun Zhang; Luo Luo; Yuanyu Wan

arxiv: 2602.07425 · v2 · submitted 2026-02-07 · 💻 cs.LG · cs.CL· math.OC

Sign-Based Optimizers Are Effective Under Heavy-Tailed Noise

Dingzhi Yu , Hongyi Tao , Yuanyu Wan , Luo Luo , Lijun Zhang This is my paper

Pith reviewed 2026-05-16 06:39 UTC · model grok-4.3

classification 💻 cs.LG cs.CLmath.OC

keywords sign-based optimizersheavy-tailed noiseconvergence ratesLion optimizerMuon optimizerlarge language modelsstochastic optimization

0 comments

The pith

Sign-based optimizers achieve sharp convergence rates under a generalized heavy-tailed noise model that fits large language model training.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a new generalized heavy-tailed noise condition to describe stochastic gradients in LLM training, relaxing the usual finite-variance assumption. Under this model the authors prove that SignSGD and Lion attain sharp convergence rates on generalized smooth functions, matching or improving prior bounds. They also give the first rigorous convergence analysis for the matrix optimizers Muon and Muonlight when gradients are heavy-tailed. The results supply a theoretical reason for the practical advantage of sign-based methods over variance-adaptive ones such as AdamW.

Core claim

Under the generalized heavy-tailed noise condition, SignSGD and Lion reach sharp convergence rates for generalized smooth objectives that match or surpass earlier bounds; the same framework yields the first rigorous rates for Muon and Muonlight on matrix problems, confirming that sign-based updates are naturally adapted to the heavy-tailed stochasticity observed in language-model training.

What carries the argument

The generalized heavy-tailed noise condition, which permits infinite variance and models observed LLM gradient tails, allowing direct analysis of coordinate-wise sign updates without moment-based rescaling.

If this is right

SignSGD and Lion receive optimal convergence guarantees precisely when gradients follow heavy tails.
Muon and Muonlight obtain provable rates for matrix-parameter optimization under the same noise.
LLM training should favor sign-based updates over variance-adaptive methods such as AdamW.
The noise model aligns with measured gradient statistics in language-model pretraining.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Optimizer design for large models should emphasize robustness to infinite-variance noise rather than adaptation based on estimated moments.
The same heavy-tailed analysis could be applied to non-convex problems outside language modeling where gradient tails are heavy.
Measuring tail indices on specific model scales could inform whether to switch to sign-based methods for a given training run.

Load-bearing premise

The generalized heavy-tailed noise condition accurately captures the gradient noise that appears when training large language models.

What would settle it

Empirical measurements showing that gradient noise in an actual LLM pretraining run has finite variance or tails incompatible with the proposed model would remove the justification for the derived rates.

read the original abstract

While adaptive gradient methods are the workhorse of modern machine learning, sign-based optimization algorithms such as Lion and Muon have recently demonstrated superior empirical performance over AdamW in training large language models (LLM). However, a theoretical understanding of why sign-based updates outperform variance-adapted methods remains elusive. In this paper, we aim to bridge the gap between theory and practice through the lens of heavy-tailed gradient noise, a phenomenon frequently observed in language modeling tasks. Theoretically, we introduce a novel generalized heavy-tailed noise condition that captures the behavior of LLMs more accurately than standard finite variance assumptions. Under this noise model, we establish sharp convergence rates of SignSGD and Lion for generalized smooth function classes, matching or surpassing previous best-known bounds. Furthermore, we extend our analysis to Muon and Muonlight, providing what is, to our knowledge, the first rigorous analysis of matrix optimization under heavy-tailed stochasticity. These results offer a strong theoretical justification for the empirical superiority of sign-based optimizers, showcasing that they are naturally suited to handle the noisy gradients associated with heavy tails. Empirically, LLM pretraining experiments validate our theoretical insights and confirm that our proposed noise models are well-aligned with practice.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper delivers the first Muon analysis under heavy-tailed noise plus matching rates for SignSGD and Lion, but the noise model's claimed superiority over finite-variance alternatives rests on thin empirical comparison.

read the letter

The main point is that this work supplies the first rigorous convergence analysis for Muon and Muonlight under heavy-tailed stochastic gradients, together with rates for SignSGD and Lion on generalized smooth functions that match or beat earlier bounds under the new noise condition. That extension to matrix methods is genuinely new. The LLM pretraining runs also tie the theory back to the motivating setting in a direct way. Those are the concrete advances. The generalized heavy-tailed noise model itself is a sensible modeling step beyond finite variance, and the paper shows how sign-based updates are naturally robust to it. That part holds up on its own terms. The soft spot is the justification for using this noise model in the first place. The claim that it captures LLM gradient behavior more accurately than standard assumptions is stated but not backed by quantitative model selection on the same data, such as direct likelihood comparisons or tail-index tests against a Gaussian baseline. Without that, the rates are formally correct under the assumption but the practical relevance stays partly qualitative. The abstract promises sharp rates, yet the lack of explicit constants or full derivation checks in what we see leaves some uncertainty about looseness. This is for readers working on optimization theory for large-scale training who want theoretical grounding for why Lion or Muon can outperform AdamW under realistic noise. It deserves peer review because the Muon extension is timely and the rates are a clear step forward, even if the empirical section on model fit needs tightening to make the motivation fully convincing.

Referee Report

2 major / 2 minor

Summary. The paper introduces a novel generalized heavy-tailed noise condition for stochastic gradients, claimed to model LLM training more accurately than finite-variance assumptions. Under this condition it derives sharp convergence rates for SignSGD and Lion on generalized smooth functions (matching or exceeding prior bounds) and provides the first rigorous analysis of Muon and Muonlight for matrix optimization under heavy-tailed stochasticity. Empirical LLM pretraining experiments are presented to validate the rates and the alignment of the noise model with practice.

Significance. If the derived rates are indeed sharp and the noise model is the appropriate one for LLM gradients, the work supplies a concrete theoretical explanation for the observed superiority of sign-based methods over variance-adapted ones such as AdamW. The extension to matrix-valued updates (Muon) is a genuine novelty; no prior analysis of this form existed under heavy-tailed noise. The combination of generalized smoothness, explicit rates, and first-time Muon analysis would be a useful addition to the literature on non-convex optimization with realistic noise.

major comments (2)

[Empirical validation (likely §5)] The central justification for the new noise model is that it 'captures the behavior of LLMs more accurately than standard finite variance assumptions.' The empirical section demonstrates the presence of heavy tails (kurtosis/QQ plots) but does not report quantitative model-selection metrics (log-likelihood ratios, AIC/BIC, or tail-index estimates) comparing the generalized condition against a Gaussian baseline on the same gradient samples. This weakens the claim that the derived rates are the relevant ones for the motivating application.
[Theoretical results (likely §3–4, Theorems on SignSGD/Lion)] The abstract and introduction assert 'sharp' rates for SignSGD/Lion that 'match or surpass previous best-known bounds.' No table or explicit comparison of constants, dependence on tail parameters, or smoothness constants appears in the provided material; without this, it is impossible to verify whether the new bounds are strictly sharper or merely different under the novel assumption.

minor comments (2)

[§2 (Preliminaries)] Notation for the generalized heavy-tailed condition should be introduced with an explicit equation number and contrasted immediately with the classical finite-variance assumption.
[Introduction / Related work] The Muon analysis is described as the 'first rigorous analysis'; a short related-work paragraph citing any prior matrix-sign analyses (even under lighter noise) would strengthen this claim.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful review and constructive feedback. The two major comments identify opportunities to strengthen the empirical support for the noise model and to make the theoretical rate comparisons more explicit. We address each point below and will incorporate the suggested additions in the revised manuscript.

read point-by-point responses

Referee: [Empirical validation (likely §5)] The central justification for the new noise model is that it 'captures the behavior of LLMs more accurately than standard finite variance assumptions.' The empirical section demonstrates the presence of heavy tails (kurtosis/QQ plots) but does not report quantitative model-selection metrics (log-likelihood ratios, AIC/BIC, or tail-index estimates) comparing the generalized condition against a Gaussian baseline on the same gradient samples. This weakens the claim that the derived rates are the relevant ones for the motivating application.

Authors: We agree that quantitative model-selection metrics would provide stronger evidence. In the revised manuscript we will add tail-index estimates computed via the Hill estimator on the gradient samples collected during the LLM pretraining runs. We will also report log-likelihood ratios (and, where appropriate, AIC values) comparing the fit of the proposed generalized heavy-tailed model against a Gaussian baseline on the identical set of gradient vectors. These additions will appear in an expanded §5 and will directly quantify the improved alignment with observed LLM gradient statistics. revision: yes
Referee: [Theoretical results (likely §3–4, Theorems on SignSGD/Lion)] The abstract and introduction assert 'sharp' rates for SignSGD/Lion that 'match or surpass previous best-known bounds.' No table or explicit comparison of constants, dependence on tail parameters, or smoothness constants appears in the provided material; without this, it is impossible to verify whether the new bounds are strictly sharper or merely different under the novel assumption.

Authors: We appreciate the request for an explicit side-by-side comparison. The sharpness claim rests on the fact that our bounds recover the best-known rates under finite-variance noise as a special case while improving the dependence on the tail index for heavier tails. To make this transparent, the revised version will include a new comparison table (placed after the main theorems) that lists the leading constants, the dependence on the smoothness parameter, the tail parameter, and the noise assumptions for our results versus the prior bounds of [relevant citations]. This table will allow readers to verify at a glance where the new rates match or improve upon existing ones. revision: yes

Circularity Check

0 steps flagged

No significant circularity in derivation chain

full rationale

The paper introduces a novel generalized heavy-tailed noise condition as an explicit modeling assumption chosen to better match observed LLM gradient statistics than finite-variance alternatives. Under this assumption it derives convergence rates for SignSGD, Lion, Muon and Muonlight on generalized smooth objectives. The rates are obtained by standard analysis techniques applied to the stated noise model; no equation defines the noise parameters in terms of the target rates, nor are any rates obtained by fitting or by renaming a self-citation. The empirical section is presented as separate validation that the assumption aligns with practice, not as input to the theoretical bounds. Consequently the claimed results are not equivalent to their inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the appropriateness of the generalized heavy-tailed noise model for LLM gradients and on generalized smoothness of the objective; no explicit free parameters or invented entities are named in the abstract.

axioms (1)

domain assumption Generalized heavy-tailed noise condition
Assumed to describe LLM gradient noise more accurately than finite-variance models.

pith-pipeline@v0.9.0 · 5522 in / 1161 out tokens · 68332 ms · 2026-05-16T06:39:37.606848+00:00 · methodology

discussion (0)

Forward citations

Cited by 5 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

LionMuon: Alternating Spectral and Sign Descent for Efficient Training
cs.LG 2026-05 unverdicted novelty 6.0

LionMuon alternates Lion sign steps and Muon spectral steps with shared dual-EMA momentum to match Lion memory while outperforming both at P=2 on 124M-720M models, backed by heavy-tailed complexity bounds that predict...
Scale-Invariant Neural Network Optimization: Norm Geometry and Heavy-Tailed Noise
math.OC 2026-05 unverdicted novelty 6.0

Establishes matching lower and upper oracle complexity bounds for scale-invariant methods with spectral norm under heavy-tailed noise, plus improved rates with higher-order smoothness, and practical tests on neural networks.
StoSignSGD: Unbiased Structural Stochasticity Fixes SignSGD for Training Large Language Models
cs.LG 2026-04 unverdicted novelty 6.0

StoSignSGD resolves SignSGD divergence on non-smooth objectives via structural stochasticity, matching optimal convex rates and improving non-convex bounds while delivering 1.44-2.14x speedups in FP8 LLM pretraining.
CLion: Efficient Cautious Lion Optimizer with Enhanced Generalization
cs.LG 2026-04 unverdicted novelty 6.0

CLion achieves O(1/N) generalization error and O(√d / T^{1/4}) convergence for nonconvex stochastic optimization, improving on Lion's O(1/(N τ^T)) bound.
MiMuon: Mixed Muon Optimizer with Improved Generalization for Large Models
cs.LG 2026-05 unverdicted novelty 5.0

MiMuon is a hybrid optimizer that achieves a generalization error bound of O(1/N) independent of the small singular-value gap that limits the original Muon bound, while retaining the same O(1/T^{1/4}) convergence rate.