GradPower: Powering Gradients for Faster Language Model Pre-Training
Pith reviewed 2026-05-22 02:43 UTC · model grok-4.3
The pith
A fixed sign-power transform on gradients reaches lower terminal loss in language model pre-training.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
GradPower applies the elementwise sign-power transformation φ_p(g) = (sign(g_i) |g_i|^p)_i for a fixed p > 0 to the incoming gradient vector and passes the result directly to an unmodified base optimizer. When used with Adam this produces consistently lower terminal loss than the untransformed baseline across the tested model families, scales, datasets, and schedules, with the largest improvements appearing in modern mixture-of-experts models trained under warmup-stable-decay schedules.
What carries the argument
The elementwise sign-power transformation that multiplies the sign of each gradient component by its absolute value raised to a fixed positive power p.
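As a minimal sketch of this transform in plain Python: the function name `sign_power` and the list-of-floats representation are illustrative assumptions, since the paper states only the formula.

```python
import math

def sign_power(grad, p):
    """Elementwise sign-power map: sign(g_i) * |g_i|**p for a fixed p > 0."""
    if p <= 0:
        raise ValueError("p must be positive")
    # copysign keeps the sign of each component and applies the power to its magnitude.
    return [math.copysign(abs(g) ** p, g) for g in grad]

# p = 1 is the identity map on the gradient.
print(sign_power([-4.0, 0.0, 0.25], 0.5))  # [-2.0, 0.0, 0.5]
```

For p > 1 the map shrinks components with |g_i| < 1 and amplifies those with |g_i| > 1; for p < 1 it does the opposite.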
If this is right
- AdamPower yields lower terminal loss on both dense LLaMA models and sparse Qwen2MoE models.
- The improvement appears at every tested scale from 66 million to 2 billion parameters.
- The same single-line change also improves the Muon optimizer without further tuning.
- The largest observed gains occur when training mixture-of-experts models with warmup-stable-decay schedules.
Where Pith is reading between the lines
- The benefit may stem from how the power transform alters the distribution of gradient noise, which could be tested by measuring gradient statistics before and after the transform.
- Because no schedule retuning is required, the method could be dropped into existing large-scale training runs with minimal engineering cost.
- If the optimal p turns out to be stable across many tasks, the same transform might accelerate training in domains outside language modeling where gradient magnitudes are similarly heterogeneous.
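The first point above suggests comparing gradient statistics before and after the transform. One way such a measurement might look is sketched below. The choice of statistic (per-coordinate mean squared over variance across minibatch gradients) and all function names are assumptions made for illustration, not something the paper specifies.

```python
import math
import statistics

def sign_power(grad, p):
    """Elementwise sign-power map sign(g_i) * |g_i|**p."""
    return [math.copysign(abs(g) ** p, g) for g in grad]

def coordinate_snr(samples):
    """Per-coordinate signal-to-noise ratio mean^2 / variance over minibatch gradients."""
    snr = []
    for coord in zip(*samples):  # one tuple per coordinate, across all samples
        mean = statistics.fmean(coord)
        var = statistics.pvariance(coord)
        snr.append(mean * mean / var if var > 0 else math.inf)
    return snr

def snr_before_after(samples, p):
    """Return the SNR of the raw gradients and of the transformed gradients."""
    transformed = [sign_power(g, p) for g in samples]
    return coordinate_snr(samples), coordinate_snr(transformed)

# Hypothetical minibatch gradients, for illustration only.
before, after = snr_before_after([[1.0, -0.5], [3.0, 0.5]], p=1.2)
```

With p = 1 the two results coincide, which gives a quick sanity check before trying other powers.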
Load-bearing premise
A single fixed power p improves final loss without any need to retune the base optimizer's internal hyperparameters or the learning-rate schedule.
What would settle it
Running AdamPower on a 1-billion-parameter Qwen2MoE model trained on OpenWebText with a warmup-stable-decay schedule and finding that the terminal loss is not lower than the loss obtained with unmodified Adam.
Original abstract
We propose GradPower, a lightweight gradient-transformation technique for accelerating language model pre-training. Given a gradient vector $g=(g_i)_i$, GradPower first applies the elementwise sign-power transformation: $\varphi_p(g)=({\rm sign}(g_i)|g_i|^p)_{i}$ for a fixed $p>0$, and then feeds the transformed gradient into a base optimizer. Notably, GradPower requires only a single-line code change and no modifications to the base optimizer's internal logic, including the hyperparameters. When applied to Adam (termed AdamPower), GradPower consistently achieves lower terminal loss across diverse architectures (LLaMA, Qwen2MoE), parameter scales (66M to 2B), datasets (C4, OpenWebText), and learning-rate schedules (cosine, warmup-stable-decay). The most pronounced gains are observed when training modern mixture-of-experts models with warmup-stable-decay schedules. GradPower also integrates seamlessly with other state-of-the-art optimizers, such as Muon, yielding further improvements. Finally, we provide theoretical analyses that reveal the underlying mechanism of GradPower and highlight the influence of gradient noise.
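To illustrate what a "single-line code change" could look like, the sketch below wraps a textbook Adam step. The `Adam` class is a stand-in written for this example (it follows Kingma and Ba's standard update, not any particular library), and the parameter and gradient values are made up. The default p = 1.2 is the value quoted in the appendix excerpt further down this page.

```python
import math

def sign_power(grad, p):
    """Elementwise sign-power map sign(g_i) * |g_i|**p."""
    return [math.copysign(abs(g) ** p, g) for g in grad]

class Adam:
    """Textbook Adam on a flat list of parameters; stands in for any base optimizer."""
    def __init__(self, params, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
        self.params, self.lr = params, lr
        self.beta1, self.beta2, self.eps = beta1, beta2, eps
        self.m = [0.0] * len(params)
        self.v = [0.0] * len(params)
        self.t = 0

    def step(self, grad):
        self.t += 1
        for i, g in enumerate(grad):
            self.m[i] = self.beta1 * self.m[i] + (1 - self.beta1) * g
            self.v[i] = self.beta2 * self.v[i] + (1 - self.beta2) * g * g
            m_hat = self.m[i] / (1 - self.beta1 ** self.t)  # bias-corrected first moment
            v_hat = self.v[i] / (1 - self.beta2 ** self.t)  # bias-corrected second moment
            self.params[i] -= self.lr * m_hat / (math.sqrt(v_hat) + self.eps)

opt = Adam([1.0, -2.0], lr=0.1)
grad = [0.5, -3.0]               # hypothetical raw gradient
grad = sign_power(grad, p=1.2)   # the one added line; the optimizer itself is untouched
opt.step(grad)
```

On the very first step textbook Adam moves each parameter by roughly lr · sign(g) for any p, so in this sketch the transform only makes a difference through how m and v accumulate over later steps.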
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes GradPower, a lightweight elementwise gradient transformation φ_p(g) = (sign(g_i) |g_i|^p)_i for fixed p > 0 that is applied to the raw gradient before it is passed to a base optimizer (e.g., Adam, yielding AdamPower). The central empirical claim is that this single-line change, with no modifications to the base optimizer’s internal hyperparameters or learning-rate schedule, produces consistently lower terminal loss across LLaMA and Qwen2MoE architectures (66 M–2 B parameters), C4 and OpenWebText datasets, and both cosine and warmup-stable-decay schedules; additional gains are reported when combined with Muon. A theoretical section analyzes the effect through the lens of gradient noise.
Significance. If the reported gains are robust and not an artifact of implicit learning-rate retuning, GradPower would constitute a minimal-overhead, optimizer-agnostic improvement to large-scale language-model pre-training. The breadth of tested configurations and the attempt at a mechanistic explanation via gradient noise are strengths; however, the absence of quantitative deltas, error bars, or run counts in the abstract and the potential interaction with Adam’s moment estimates limit immediate impact.
major comments (3)
- [Abstract / Experiments] Abstract and Experiments section: the claim that “GradPower requires only a single-line code change and no modifications to the base optimizer’s internal logic, including the hyperparameters” is load-bearing for the central contribution. Because φ_p multiplies each coordinate by |g_i|^{p-1}, it directly rescales the inputs to both the bias-corrected first moment m_t and the second-moment denominator v_t in Adam; this alters the effective per-parameter step size even when the nominal η is left unchanged. The manuscript should therefore include an ablation in which the learning rate is re-optimized for each p (or at least for the best-reported p) and demonstrate that the terminal-loss advantage survives.
- [Theoretical analysis] Theoretical analysis section: the discussion of gradient noise does not derive the resulting shift in the stationary point or the modified convergence rate under the coordinate-wise rescaling induced by φ_p. A short derivation or bound showing how the noise statistics and effective Lipschitz constant change would strengthen the mechanistic claim.
- [Results] Results: the assertion of “consistently achieves lower terminal loss” is presented without quantitative deltas, standard deviations across runs, number of independent seeds, or statistical tests. This omission makes it impossible to judge the magnitude and reliability of the improvement, especially given the low-confidence soundness assessment.
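To make the first major comment concrete, the block below writes out textbook Adam with the transformed gradient substituted for the raw one. The symbols β₁, β₂, ε and the hat notation follow the standard presentation of Adam and are assumed here; they are not taken from the manuscript.

```latex
\begin{aligned}
\tilde g_{t,i} &= \varphi_p(g_t)_i = \operatorname{sign}(g_{t,i})\,|g_{t,i}|^{p} = |g_{t,i}|^{p-1}\, g_{t,i},\\
m_t &= \beta_1 m_{t-1} + (1-\beta_1)\,\tilde g_t, \qquad
v_t = \beta_2 v_{t-1} + (1-\beta_2)\,\tilde g_t^{\,2},\\
\hat m_t &= \frac{m_t}{1-\beta_1^{t}}, \qquad \hat v_t = \frac{v_t}{1-\beta_2^{t}},\\
\theta_{t+1} &= \theta_t - \eta\,\frac{\hat m_t}{\sqrt{\hat v_t}+\epsilon}.
\end{aligned}
```

A uniform rescaling g ↦ c·g would cancel in the ratio (up to ε), but the factor |g_{t,i}|^{p-1} varies with both the coordinate and the step, so it does not cancel in general. That is why the nominal η can stay fixed while the effective per-parameter step changes.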
minor comments (2)
- [Method] Notation: the definition φ_p(g) = (sign(g_i) |g_i|^p)_i should explicitly state the domain of p and whether p is chosen once for all experiments or tuned per model.
- [Figures] Figure clarity: loss curves comparing Adam and AdamPower should include shaded regions for multiple runs rather than single trajectories.
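The shaded regions requested for the loss curves, and the standard deviations requested in the Results comment, could be computed along these lines. The numbers below are hypothetical and are not results from the paper; `aggregate_runs` is a name chosen for this sketch.

```python
import statistics

def aggregate_runs(curves):
    """curves: equal-length loss curves, one per seed. Returns per-step means and stds."""
    if len(curves) < 2:
        raise ValueError("need at least two seeds to estimate a spread")
    means, stds = [], []
    for step_losses in zip(*curves):  # losses of all seeds at the same step
        means.append(statistics.fmean(step_losses))
        stds.append(statistics.stdev(step_losses))  # sample standard deviation
    return means, stds

# Hypothetical loss curves from three seeds, for illustration only.
adam_runs = [[3.0, 2.5, 2.0], [3.2, 2.7, 2.2], [3.1, 2.6, 2.1]]
means, stds = aggregate_runs(adam_runs)
band = [(m - s, m + s) for m, s in zip(means, stds)]  # lower and upper edges of the shaded region
```

Reporting the number of seeds next to each band lets a reader judge whether a gap between two curves exceeds the run-to-run spread.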
Simulated Author's Rebuttal
We thank the referee for their constructive comments on our work. We address each of the major comments in detail below and outline the revisions we will make to the manuscript.
-
Referee: [Abstract / Experiments] Abstract and Experiments section: the claim that “GradPower requires only a single-line code change and no modifications to the base optimizer’s internal logic, including the hyperparameters” is load-bearing for the central contribution. Because φ_p multiplies each coordinate by |g_i|^{p-1}, it directly rescales the inputs to both the bias-corrected first moment m_t and the second-moment denominator v_t in Adam; this alters the effective per-parameter step size even when the nominal η is left unchanged. The manuscript should therefore include an ablation in which the learning rate is re-optimized for each p (or at least for the best-reported p) and demonstrate that the terminal-loss advantage survives.
Authors: We agree that the elementwise power transformation affects the scale of the gradients fed into Adam's moment estimators, thereby influencing the effective step sizes. However, the core contribution remains that GradPower is a simple, external transformation applied to the raw gradient with no alterations to the optimizer's code or its user-specified hyperparameters. To directly address the concern and strengthen our claims, we will add an ablation study re-optimizing the learning rate for the optimal p and show that the performance gains persist compared to the baseline with its tuned learning rate. This revision will be included in the updated Experiments section.
revision: partial
-
Referee: [Theoretical analysis] Theoretical analysis section: the discussion of gradient noise does not derive the resulting shift in the stationary point or the modified convergence rate under the coordinate-wise rescaling induced by φ_p. A short derivation or bound showing how the noise statistics and effective Lipschitz constant change would strengthen the mechanistic claim.
Authors: We appreciate this suggestion for enhancing the theoretical analysis. In the revised manuscript, we will expand the Theoretical analysis section to include a derivation of the shift in the stationary point and a bound on the modified convergence rate. This will involve analyzing the coordinate-wise rescaling's impact on gradient noise statistics and the effective Lipschitz constant, providing a more rigorous mechanistic explanation.
revision: yes
-
Referee: [Results] Results: the assertion of “consistently achieves lower terminal loss” is presented without quantitative deltas, standard deviations across runs, number of independent seeds, or statistical tests. This omission makes it impossible to judge the magnitude and reliability of the improvement, especially given the low-confidence soundness assessment.
Authors: We acknowledge the importance of providing quantitative measures of improvement and statistical reliability. In the revised version, we will report specific deltas in terminal loss, include standard deviations from multiple independent runs with the number of seeds clearly stated, and incorporate appropriate statistical tests to support the claims of consistent improvement. These additions will be made throughout the Results section and in relevant figures/tables.
revision: yes
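The learning-rate ablation discussed in the first exchange could be organized as in the sketch below: for each p, sweep the learning rate and compare each configuration at its own best value. Everything here is a toy stand-in chosen for illustration (a noisy one-dimensional quadratic, a hypothetical grid, three seeds), and it says nothing about what the outcome would be on a language model.

```python
import math
import random

def sign_power(grad, p):
    """Elementwise sign-power map sign(g_i) * |g_i|**p."""
    return [math.copysign(abs(g) ** p, g) for g in grad]

def train_toy(p, lr, steps=200, seed=0):
    """Adam on a noisy 1-D quadratic f(x) = x^2 / 2; returns the final noiseless loss."""
    rng = random.Random(seed)
    x, m, v = 1.0, 0.0, 0.0
    b1, b2, eps = 0.9, 0.999, 1e-8
    for t in range(1, steps + 1):
        g = x + rng.gauss(0.0, 0.1)   # stochastic gradient of the toy objective
        (g,) = sign_power([g], p)     # GradPower transform; p = 1 recovers plain Adam
        m = b1 * m + (1 - b1) * g
        v = b2 * v + (1 - b2) * g * g
        x -= lr * (m / (1 - b1 ** t)) / (math.sqrt(v / (1 - b2 ** t)) + eps)
    return 0.5 * x * x

def lr_ablation(powers, lr_grid, seeds=(0, 1, 2)):
    """For each p, sweep the learning rate and keep the best mean final loss."""
    best = {}
    for p in powers:
        scores = {lr: sum(train_toy(p, lr, seed=s) for s in seeds) / len(seeds)
                  for lr in lr_grid}
        best_lr = min(scores, key=scores.get)
        best[p] = (best_lr, scores[best_lr])
    return best

result = lr_ablation(powers=[1.0, 1.2], lr_grid=[1e-3, 3e-3, 1e-2, 3e-2])
```

The comparison the referee asks for is then `result[1.2]` against `result[1.0]`, each at its own tuned learning rate rather than at a shared one.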
Circularity Check
Empirical proposal with independent theoretical analysis of gradient noise
The paper introduces GradPower as a simple elementwise sign-power transformation φ_p(g) applied before any base optimizer, with all claims resting on experimental results across architectures, scales, datasets, and schedules plus a separate theoretical section analyzing gradient noise effects. No derivation chain, fitted parameter, or first-principles result is shown to reduce by construction to the input transformation itself; the central performance claims are externally falsifiable via re-optimization of learning rate or direct comparison on held-out runs. Self-citations, if present, are not load-bearing for the uniqueness or correctness of the method. This is a standard non-circular empirical contribution.
Axiom & Free-Parameter Ledger
free parameters (1)
- p
axioms (1)
- domain assumption: Gradient noise in language-model training has statistical properties that a sign-power map can exploit to improve optimization trajectories.
Forward citations
Cited by 1 Pith paper
-
Revealing Modular Gradient Noise Imbalance in LLMs: Calibrating Adam via Signal-to-Noise Ratio
MoLS scales Adam updates using module-level SNR estimates to correct gradient noise imbalance and improve LLM training convergence and generalization.
Reference graph
Works this paper leans on
-
[1]
Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. GPT-4 technical report. arXiv preprint arXiv:2303.08774,
-
[2]
Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901,
-
[3]
Xi Chen, Kaituo Feng, Changsheng Li, Xunhao Lai, Xiangyu Yue, Ye Yuan, and Guoren Wang. Fira: Can we achieve full-rank training of llms under low-rank constraint? arXiv preprint arXiv:2410.01623, 2024a.
Xiangning Chen, Chen Liang, Da Huang, Esteban Real, Kaiyuan Wang, Hieu Pham, Xuanyi Dong, Thang Luong, Cho-Jui Hsieh, Yifeng Lu, et al. Symbolic dis...
-
[4]
Alex Damian, Eshaan Nichani, and Jason D Lee
Jeremy M Cohen, Behrooz Ghorbani, Shankar Krishnan, Naman Agarwal, Sourabh Medapati, Michal Badura, Daniel Suo, David Cardoze, Zachary Nado, George E Dahl, et al. Adaptive gradient methods at the edge of stability. arXiv preprint arXiv:2207.14484,
-
[5]
Understanding optimization in deep learning with central flows
Jeremy M Cohen, Alex Damian, Ameet Talwalkar, Zico Kolter, and Jason D Lee. Understanding optimization in deep learning with central flows. arXiv preprint arXiv:2410.24206,
-
[6]
Training Compute-Optimal Large Language Models
Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, et al. Training compute-optimal large language models. arXiv preprint arXiv:2203.15556,
-
[7]
MiniCPM: Unveiling the Potential of Small Language Models with Scalable Training Strategies
Shengding Hu, Yuge Tu, Xu Han, Chaoqun He, Ganqu Cui, Xiang Long, Zhi Zheng, Yewei Fang, Yuxiang Huang, Weilin Zhao, et al. Minicpm: Unveiling the potential of small language models with scalable training strategies. arXiv preprint arXiv:2404.06395,
-
[8]
Adam: A Method for Stochastic Optimization
Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980,
-
[9]
Cautious optimizers: Improving training with one line of code
Kaizhao Liang, Lizhang Chen, Bo Liu, and Qiang Liu. Cautious optimizers: Improving training with one line of code. arXiv preprint arXiv:2411.16085,
-
[10]
Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, et al. Deepseek-v3 technical report. arXiv preprint arXiv:2412.19437, 2024a.
Hong Liu, Zhiyuan Li, David Hall, Percy Liang, and Tengyu Ma. Sophia: A scalable stochastic second-order optimizer for language model pre-training...
-
[11]
Focus: First order concentrated updating scheme
Yizhou Liu, Ziming Liu, and Jeff Gore. Focus: First order concentrated updating scheme. arXiv preprint arXiv:2501.12243, 2025b.
Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101,
-
[12]
An Empirical Model of Large-Batch Training
Sam McCandlish, Jared Kaplan, Dario Amodei, and OpenAI Dota Team. An empirical model of large-batch training. arXiv preprint arXiv:1812.06162,
-
[13]
LLaMA: Open and Efficient Foundation Language Models
Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971,
-
[14]
SOAP: Improving and Stabilizing Shampoo using Adam
Nikhil Vyas, Depen Morwani, Rosie Zhao, Itai Shapira, David Brandfonbrener, Lucas Janson, and Sham Kakade. Soap: Improving and stabilizing shampoo using adam. arXiv preprint arXiv:2409.11321,
-
[15]
Jinbo Wang, Mingze Wang, Zhanpeng Zhou, Junchi Yan, Lei Wu, et al. The sharpness disparity principle in transformers for accelerating language model pre-training. arXiv preprint arXiv:2502.19002,
-
[16]
An Yang, Baosong Yang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Zhou, Chengpeng Li, Chengyuan Li, Dayiheng Liu, Fei Huang, et al. Qwen2 technical report. arXiv preprint arXiv:2407.10671, 2024a.
An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, et al. Qwen2.5 technical re...
-
[17]
We pre-train LLaMA models of sizes ranging from 66M to 2B parameters
is a popular dense decoder-only Transformer architecture, incorporating Rotary Positional Encoding (RoPE) (Su et al., 2024), Swish-Gated Linear Unit (SwiGLU), and Root mean square layer normalization (RMSNorm). We pre-train LLaMA models of sizes ranging from 66M to 2B parameters. Additional model configurations are detailed in Table
-
[18]
Datasets. Models are pre-trained on the following datasets:
• Colossal Clean Crawled Corpus (C4) (Raffel et al., 2020). It is a large-scale public language dataset, widely used for LLM pre-training such as T5 (Raffel et al., 2020), and prior pre-training studies (Zhao et al., 2024; 2025). We use the T5 tokenizer, with the vocabulary size 32100.
• OpenWebTex...
-
[19]
Following Karpathy (2022); Liu et al
and nanoGPT (Karpathy, 2022). Following Karpathy (2022); Liu et al. (2024b), we use the GPT-2 tokenizer, with the vocabulary size 50304.
LR schedulers. We evaluate two popular LR scheduling strategies:
• cos (cosine scheduler) (Karpathy, 2022; Touvron et al., 2023): a linear warm-up to peak lr_max, followed by cosine decay to a terminal LR lr_min.
• wsd (warmu...
-
[20]
The training includes 1,000 warm-up steps
Following the Chinchilla scaling law (Hoffmann et al., 2022), the total number of training tokens is set to be approximately 20 times the number of model parameters. The training includes 1,000 warm-up steps. The grid search for lr_max is performed over {1e-4, 2e-4, 3e-4, 6e-4, 1e-3, 1.5e-3}. Optimal learning rates for each model are detailed in Tables 1 and
-
[21]
All other optimizer hyperparameters are kept identical to those used for the Adam baselines
AdamPower experiments. We adopt p = 1.2 as the default in all experiments in Sections 3.2 and 3.3. All other optimizer hyperparameters are kept identical to those used for the Adam baselines. Importantly, the power p = 1.2 proves to be highly robust.
A.2 Experimental details for Section 3.4
Adam with Blockwise LR. Following Wang et al. (2025), we adopt the same p...
-
[22]
For batch size 512, the tuned max_lr is 1e-3 (Table 1). For larger batch sizes (2048, 4096, 8192), we tune the max_lr over {6e-4, 1e-3, 2e-3, 4e-3, 8e-3} for Adam. We find that 1e-3 consistently yields the best results across all batch sizes. For each batch size, we evaluate AdamPower with multiple values of p, and record their validation loss when the op...
-
[23]
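3/2 ⩽ σ^p. Therefore, we have the estimate:
$$\frac{\log(\epsilon\log(1/\sigma))}{\log(\sigma/e^{3})} \leqslant p^\star \leqslant \frac{\log(\epsilon\log(1/\sigma))}{\log\sigma}.$$
Noticing σ ≪ 1, we obtain:
$$p^\star = \Theta\!\left(\frac{\log(\epsilon\log(1/\sigma))}{\log\sigma}\right).$$
C Proofs in Section 4.2
Recall that the update rule of AdagradPower (with power p) follows:
$$\theta_{t+1} = \theta_t - \eta u_t, \qquad u_t = \frac{\varphi_p(g_t)}{\sqrt{v_t+\epsilon}}, \qquad v_t = \sum_{s=1}^{t} \varphi_p^2(g_s).$$
In general, our proof is inspired by the main techniques to ...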
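The experimental-details excerpts above describe a cosine schedule (linear warm-up to lr_max, then cosine decay to lr_min), 1,000 warm-up steps, and a token budget of roughly 20 tokens per parameter. A minimal sketch of those two rules follows. The function names, the warm-up starting point, and the example values are assumptions made for illustration. The wsd schedule is truncated in the excerpt and is therefore not sketched.

```python
import math

def cosine_lr(step, total_steps, lr_max, lr_min, warmup_steps=1000):
    """Linear warm-up to lr_max, then cosine decay to lr_min."""
    if step < warmup_steps:
        # Assumed warm-up shape: ramps linearly and reaches lr_max on the last warm-up step.
        return lr_max * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(1.0, progress)
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * progress))

def chinchilla_tokens(num_params, tokens_per_param=20):
    """Approximate training-token budget: about 20 tokens per model parameter."""
    return tokens_per_param * num_params

# Example: a 66M-parameter model gets a budget of about 1.32 billion tokens.
budget = chinchilla_tokens(66_000_000)
```

The learning rate returned at `step == total_steps` equals lr_min, matching the "terminal LR" wording in the excerpt.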