GradPower applies sign-power to gradients before optimization and achieves lower terminal loss in language model pre-training across architectures, scales, datasets, and schedules.
It is a large-scale public language dataset, widely used for LLM pre-training, for example in T5 (Raffel et al., 2020) and in prior pre-training studies (Zhao et al., 2024; 2025).
1 Pith paper cites this work. Polarity classification is still indexing.
fields: cs.LG (1)
years: 2025 (1)
verdicts: UNVERDICTED (1)
representative citing papers
GradPower: Powering Gradients for Faster Language Model Pre-Training
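As a rough illustration of what "sign-power" means here, the sketch below maps each gradient entry g to sign(g) * |g|^p before it reaches the optimizer. The function names, the default exponent `p=1.2`, and the plain SGD update are assumptions made for this example only; the summary above does not specify the exponent or the base optimizer.

```python
import math


def sign_power(grad, p=1.2):
    """Elementwise sign(g) * |g|**p over a flat list of gradient values.

    `p` is a hypothetical default chosen for illustration.
    """
    return [math.copysign(abs(g) ** p, g) for g in grad]


def sgd_step(params, grad, lr=0.1, p=1.2):
    """One plain SGD update on the sign-powered gradient.

    SGD is a stand-in base optimizer, assumed here for brevity.
    """
    powered = sign_power(grad, p)
    return [w - lr * g for w, g in zip(params, powered)]


if __name__ == "__main__":
    grad = [-4.0, 0.0, 4.0]
    print(sign_power(grad, p=0.5))
    print(sgd_step([1.0, 1.0, 1.0], grad, lr=0.5, p=0.5))
```

With `p=1` the transform leaves the gradient unchanged, so the base optimizer behaves as usual; other exponents rescale each entry's magnitude while preserving its sign.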