GradPower applies sign-power to gradients before optimization and achieves lower terminal loss in language model pre-training across architectures, scales, datasets, and schedules.
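As a rough illustration of the idea, the sketch below assumes "sign-power" means the elementwise map g -> sign(g) * |g|^p for some exponent p > 0, applied to the gradient before it reaches the base optimizer. The function names, the exponent values, and the use of plain SGD as the base optimizer are illustrative assumptions, not details taken from this page.

```python
import math


def sign_power(grad, p):
    """Elementwise sign-power transform: sign(g) * |g|**p (assumed form, p > 0)."""
    return [math.copysign(abs(g) ** p, g) for g in grad]


def sgd_step(params, grad, lr, p):
    """One plain SGD step on the sign-powered gradient.

    SGD is used only to keep the sketch short; any base optimizer
    could consume the transformed gradient in the same way.
    """
    powered = sign_power(grad, p)
    return [w - lr * g for w, g in zip(params, powered)]


if __name__ == "__main__":
    grad = [4.0, -4.0, 0.0]
    print(sign_power(grad, 0.5))   # magnitudes are square-rooted, signs kept
    print(sign_power(grad, 1.0))   # p = 1 leaves the gradient unchanged
    print(sgd_step([1.0, 1.0, 1.0], grad, lr=0.5, p=0.5))
```

With p = 1 the transform is the identity, so the base optimizer is recovered unchanged; other values of p rescale each coordinate's magnitude while preserving its sign.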
For larger batch sizes (2048, 4096, 8192), we tune the max_lr over {6e-4, 1e-3, 2e-3, 4e-3, 8e-3} for Adam
GradPower: Powering Gradients for Faster Language Model Pre-Training