The sharpness disparity principle in transformers for accelerating language model pre-training. arXiv preprint arXiv:2502.19002

18 Jinbo Wang, Mingze Wang, Zhanpeng Zhou, Junchi Yan, Lei Wu · arXiv 2502.19002

4 Pith papers cite this work. Polarity classification is still indexing.


representative citing papers

One LR Doesn't Fit All: Heavy-Tail Guided Layerwise Learning Rates for LLMs

cs.LG · 2026-05-21 · conditional · novelty 6.0

Heavy-tail guided layerwise learning rates improve LLM convergence speed and generalization across LLaMA and GPT variants, with both AdamW and Muon optimizers, at scales from 60M to 1B parameters.

Revealing Modular Gradient Noise Imbalance in LLMs: Calibrating Adam via Signal-to-Noise Ratio

cs.LG · 2026-05-07 · unverdicted · novelty 6.0

MoLS scales Adam updates using module-level SNR estimates to correct gradient noise imbalance and improve LLM training convergence and generalization.

HTMuon: Improving Muon via Heavy-Tailed Spectral Correction

cs.LG · 2026-03-10 · unverdicted · novelty 5.0

HTMuon modifies Muon to produce heavier-tailed updates and weight spectra via HT-SR theory, yielding up to 0.98 lower perplexity on LLaMA pretraining and serving as a plug-in for other Muon variants.

GradPower: Powering Gradients for Faster Language Model Pre-Training

cs.LG · 2025-05-30 · unverdicted · novelty 5.0

GradPower applies sign-power to gradients before optimization and achieves lower terminal loss in language model pre-training across architectures, scales, datasets, and schedules.

citing papers explorer

Showing 4 of 4 citing papers.

One LR Doesn't Fit All: Heavy-Tail Guided Layerwise Learning Rates for LLMs cs.LG · 2026-05-21 · conditional · none · ref 15
Revealing Modular Gradient Noise Imbalance in LLMs: Calibrating Adam via Signal-to-Noise Ratio cs.LG · 2026-05-07 · unverdicted · none · ref 37
HTMuon: Improving Muon via Heavy-Tailed Spectral Correction cs.LG · 2026-03-10 · unverdicted · none · ref 29
GradPower: Powering Gradients for Faster Language Model Pre-Training cs.LG · 2025-05-30 · unverdicted · none · ref 15
