Heavy-tail guided layerwise learning rates improve LLM convergence speed and generalization across LLaMA, GPT variants, AdamW and Muon optimizers from 60M to 1B parameters.
The sharpness disparity principle in transformers for accelerating language model pre-training.arXiv preprint arXiv:2502.19002
4 Pith papers cite this work. Polarity classification is still indexing.
fields
cs.LG 4representative citing papers
MoLS scales Adam updates using module-level SNR estimates to correct gradient noise imbalance and improve LLM training convergence and generalization.
HTMuon modifies Muon to produce heavier-tailed updates and weight spectra via HT-SR theory, yielding up to 0.98 lower perplexity on LLaMA pretraining and serving as a plug-in for other Muon variants.
GradPower applies sign-power to gradients before optimization and achieves lower terminal loss in language model pre-training across architectures, scales, datasets, and schedules.
citing papers explorer
-
One LR Doesn't Fit All: Heavy-Tail Guided Layerwise Learning Rates for LLMs
Heavy-tail guided layerwise learning rates improve LLM convergence speed and generalization across LLaMA, GPT variants, AdamW and Muon optimizers from 60M to 1B parameters.
-
Revealing Modular Gradient Noise Imbalance in LLMs: Calibrating Adam via Signal-to-Noise Ratio
MoLS scales Adam updates using module-level SNR estimates to correct gradient noise imbalance and improve LLM training convergence and generalization.
-
HTMuon: Improving Muon via Heavy-Tailed Spectral Correction
HTMuon modifies Muon to produce heavier-tailed updates and weight spectra via HT-SR theory, yielding up to 0.98 lower perplexity on LLaMA pretraining and serving as a plug-in for other Muon variants.
-
GradPower: Powering Gradients for Faster Language Model Pre-Training
GradPower applies sign-power to gradients before optimization and achieves lower terminal loss in language model pre-training across architectures, scales, datasets, and schedules.