arXiv: 2505.24275 · v3 · pith:3JKBJ6XXnew · submitted 2025-05-30 · 💻 cs.LG · math.OC · stat.ML

GradPower: Powering Gradients for Faster Language Model Pre-Training

Pith reviewed 2026-05-22 02:43 UTC · model grok-4.3

classification 💻 cs.LG · math.OC · stat.ML
keywords gradient transformation · sign-power · Adam optimizer · language model pre-training · terminal loss · mixture-of-experts · optimizer acceleration

The pith

A fixed sign-power transform on gradients reaches lower terminal loss in language model pre-training.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces GradPower, a lightweight method that applies an elementwise sign-power operation to each component of the gradient vector before it enters a standard optimizer. The operation takes the sign of the gradient element and multiplies it by the absolute value raised to a fixed power p, then feeds the result into the base optimizer such as Adam. This requires only a single line of code and leaves all optimizer hyperparameters and learning-rate schedules untouched. Across experiments with LLaMA and Qwen2MoE models ranging from 66 million to 2 billion parameters, on C4 and OpenWebText data, and with both cosine and warmup-stable-decay schedules, the transformed gradients produce lower final loss values. The gains are largest for mixture-of-experts models under warmup-stable-decay schedules, and the same transform also improves performance when paired with the Muon optimizer.
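The "single line of code" can be pictured with a minimal sketch. It uses only the Python standard library; `sign_power`, the toy `Adam` class, and `loss_grad` are illustrative names rather than the authors' code, and the toy Adam omits the weight decay and gradient clipping that the paper's baseline uses. The quadratic loss is likewise only a stand-in for a real backward pass.

```python
import math


def sign_power(grad, p=1.2):
    """Elementwise sign-power transform: sign(g_i) * |g_i|**p."""
    return [math.copysign(abs(g) ** p, g) for g in grad]


class Adam:
    """Plain Adam with bias correction (assumption: no weight decay, no clipping)."""

    def __init__(self, size, lr=1e-3, beta1=0.9, beta2=0.95, eps=1e-8):
        self.lr, self.beta1, self.beta2, self.eps = lr, beta1, beta2, eps
        self.m = [0.0] * size  # first-moment estimate
        self.v = [0.0] * size  # second-moment estimate
        self.t = 0

    def step(self, params, grad):
        """Update params in place from one (possibly transformed) gradient."""
        self.t += 1
        c1 = 1.0 - self.beta1 ** self.t
        c2 = 1.0 - self.beta2 ** self.t
        for i, g in enumerate(grad):
            self.m[i] = self.beta1 * self.m[i] + (1.0 - self.beta1) * g
            self.v[i] = self.beta2 * self.v[i] + (1.0 - self.beta2) * g * g
            m_hat = self.m[i] / c1
            v_hat = self.v[i] / c2
            params[i] -= self.lr * m_hat / (math.sqrt(v_hat) + self.eps)


def loss_grad(params):
    """Gradient of the toy loss 0.5 * sum(x**2); stands in for backprop."""
    return list(params)


params = [1.0, -0.5, 0.01]
opt = Adam(len(params))
for _ in range(100):
    grad = loss_grad(params)
    grad = sign_power(grad, p=1.2)  # the only GradPower-specific line
    opt.step(params, grad)
```

The optimizer class never sees `p`, which is the sense in which the base optimizer stays unmodified: removing the `sign_power` line recovers plain Adam.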

Core claim

GradPower applies the elementwise sign-power transformation φ_p(g) = (sign(g_i) |g_i|^p)_i for a fixed p > 0 to the incoming gradient vector and passes the result directly to an unmodified base optimizer. When used with Adam this produces consistently lower terminal loss than the untransformed baseline across the tested model families, scales, datasets, and schedules, with the largest improvements appearing in modern mixture-of-experts models trained under warmup-stable-decay schedules.
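A short worked example may help fix the definition. The numbers below are computed here for illustration and are not taken from the paper; p = 1.2 is the default the paper reports, while the three input values are arbitrary.

```latex
\varphi_p(g)_i = \operatorname{sign}(g_i)\,|g_i|^{p} = g_i\,|g_i|^{p-1},
\qquad p = 1.2:
\quad
\begin{aligned}
\varphi_{1.2}(10^{-3}) &= 10^{-3.6} \approx 2.5\times 10^{-4},\\
\varphi_{1.2}(-10^{-1}) &= -10^{-1.2} \approx -6.3\times 10^{-2},\\
\varphi_{1.2}(10) &= 10^{1.2} \approx 15.8.
\end{aligned}
```

For p > 1 the map shrinks components with |g_i| < 1 and enlarges those with |g_i| > 1, so the smaller a component is, the more strongly it is damped relative to the others; the sign is always preserved.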

What carries the argument

The elementwise sign-power transformation that multiplies the sign of each gradient component by its absolute value raised to a fixed positive power p.

If this is right

  • AdamPower yields lower terminal loss on both dense LLaMA models and sparse Qwen2MoE models.
  • The improvement appears at every tested scale from 66 million to 2 billion parameters.
  • The same single-line change also improves the Muon optimizer without further tuning.
  • The largest observed gains occur when training mixture-of-experts models with warmup-stable-decay schedules.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The benefit may stem from how the power transform alters the distribution of gradient noise, which could be tested by measuring gradient statistics before and after the transform.
  • Because no schedule retuning is required, the method could be dropped into existing large-scale training runs with minimal engineering cost.
  • If the optimal p turns out to be stable across many tasks, the same transform might accelerate training in domains outside language modeling where gradient magnitudes are similarly heterogeneous.
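The first extension above suggests comparing gradient statistics before and after the transform. One way to sketch such a measurement is shown below, under loudly stated assumptions: a single gradient coordinate is modeled as Gaussian noise around a fixed mean (the paper's noise model may differ), and `signal_ratio` and `compare` are hypothetical helper names. The ratio |mean| / sqrt(mean of squares) is chosen because it mirrors the first-moment over root-second-moment quantity that Adam forms.

```python
import math
import random


def sign_power(x, p):
    """Scalar sign-power transform: sign(x) * |x|**p."""
    return math.copysign(abs(x) ** p, x)


def signal_ratio(samples):
    """|mean| / sqrt(mean of squares): 1 without noise, close to 0 for zero-mean noise."""
    n = len(samples)
    mean = sum(samples) / n
    second_moment = sum(s * s for s in samples) / n
    return abs(mean) / math.sqrt(second_moment)


def compare(mu, sigma, p, n=20000, seed=0):
    """Return the ratio for raw and for transformed samples of one coordinate."""
    rng = random.Random(seed)
    raw = [rng.gauss(mu, sigma) for _ in range(n)]  # assumption: Gaussian noise
    powered = [sign_power(g, p) for g in raw]
    return signal_ratio(raw), signal_ratio(powered)


for p in (0.8, 1.0, 1.2):
    before, after = compare(mu=1e-3, sigma=1e-2, p=p)
    print(f"p={p}: ratio before={before:.4f}, after={after:.4f}")
```

In a real run the samples would come from logged per-step gradients of a chosen parameter rather than from a synthetic distribution.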

Load-bearing premise

A single fixed power p improves final loss without any need to retune the base optimizer's internal hyperparameters or the learning-rate schedule.

What would settle it

Running AdamPower on a 1-billion-parameter Qwen2MoE model trained on OpenWebText with a warmup-stable-decay schedule and finding that the terminal loss is not lower than the loss obtained with unmodified Adam.

Figures

Figures reproduced from arXiv: 2505.24275 by Jiaqi Zhang, Jinbo Wang, Lei Wu, Mingze Wang, Peng Pei, Weinan E, Wei Wang, Xunliang Cai.

Figure 1: Scaling-law comparison of AdamPower and Adam on the C4 dataset for dense LLaMA models and mixture-of-experts Qwen2MoE models. 1.1 Related Works. Optimizer design in LLM pre-training. In LLM pre-training, Adam (Kingma and Ba, 2014) has become the de facto optimizer. Recent efforts to improve its efficiency focus on two aspects: accelerating convergence and reducing memory usage. Techniques for accelerating c…
Figure 2: Pre-training LLaMA (0.2B) on C4 using AdamPower with different values of the power p. The optimal power is 1.2. Adam Baselines. We use the standard Adam optimizer (with decoupled weight decay) as the baseline in most experiments (except Section 3.4). The baseline is configured with hyperparameters β1 = 0.9, β2 = 0.95, weight decay λ = 0.1, and a gradient clipping threshold of 1.0, following protocols used in LLaMA pre…
Figure 3: AdamPower (with p = 1.2) compared with vanilla Adam across a range of settings, including LLaMA models of size 66M, 0.2B, 0.4B, 1B, and 2B; both cos and wsd LR schedulers; and the C4 and OpenWebText datasets. Across all experiments, AdamPower consistently achieves a lower terminal loss than the well-tuned Adam baseline. To further assess its scalability, we visualize the scaling laws of…
Figure 4: AdamPower (p = 1.2) consistently outperforms Adam in QwenMoE pre-training tasks on C4, across varying model sizes. The learning rate schedule is wsd.
Figure 5: (left) AdamPower with Blockwise LR outperforms both AdamPower and Adam with…
Figure 6: The influence of batch size on the optimal power p. Finally, we investigate how batch size influences the performance of GradPower. Batch size plays a critical role in deep learning, with larger batch sizes producing lower gradient noise and more accurate gradients (Keskar et al., 2017; McCandlish et al., 2018). Unlike the previous experimental settings, here we conduct the experiments on the C4 dataset, varyin…
Figure 7: Numerical results for Example 4.1. We plot the value of u_t at t = 10^6 for AdamPower across different values of p under varying noise-to-signal ratios. For each curve, the optimal and suboptimal p values are marked with stars. µ is set to 10^-6. Other hyperparameters follow standard values: β1 = 0.9, β2 = 0.95, ϵ = 10^-8, and λ = 0. The learning rate η does not affect the result. These findings closel…
Original abstract

We propose GradPower, a lightweight gradient-transformation technique for accelerating language model pre-training. Given a gradient vector $g=(g_i)_i$, GradPower first applies the elementwise sign-power transformation: $\varphi_p(g)=({\rm sign}(g_i)|g_i|^p)_{i}$ for a fixed $p>0$, and then feeds the transformed gradient into a base optimizer. Notably, GradPower requires only a single-line code change and no modifications to the base optimizer's internal logic, including the hyperparameters. When applied to Adam (termed AdamPower), GradPower consistently achieves lower terminal loss across diverse architectures (LLaMA, Qwen2MoE), parameter scales (66M to 2B), datasets (C4, OpenWebText), and learning-rate schedules (cosine, warmup-stable-decay). The most pronounced gains are observed when training modern mixture-of-experts models with warmup-stable-decay schedules. GradPower also integrates seamlessly with other state-of-the-art optimizers, such as Muon, yielding further improvements. Finally, we provide theoretical analyses that reveal the underlying mechanism of GradPower and highlight the influence of gradient noise.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript proposes GradPower, a lightweight elementwise gradient transformation φ_p(g) = sign(g_i) |g_i|^p for fixed p > 0 that is applied to the raw gradient before it is passed to a base optimizer (e.g., Adam, yielding AdamPower). The central empirical claim is that this single-line change, with no modifications to the base optimizer’s internal hyperparameters or learning-rate schedule, produces consistently lower terminal loss across LLaMA and Qwen2MoE architectures (66 M–2 B parameters), C4 and OpenWebText datasets, and both cosine and warmup-stable-decay schedules; additional gains are reported when combined with Muon. A theoretical section analyzes the effect through the lens of gradient noise.

Significance. If the reported gains are robust and not an artifact of implicit learning-rate retuning, GradPower would constitute a minimal-overhead, optimizer-agnostic improvement to large-scale language-model pre-training. The breadth of tested configurations and the attempt at a mechanistic explanation via gradient noise are strengths; however, the absence of quantitative deltas, error bars, or run counts in the abstract and the potential interaction with Adam’s moment estimates limit immediate impact.

major comments (3)
  1. [Abstract / Experiments] Abstract and Experiments section: the claim that “GradPower requires only a single-line code change and no modifications to the base optimizer’s internal logic, including the hyperparameters” is load-bearing for the central contribution. Because φ_p multiplies each coordinate by |g_i|^{p-1}, it directly rescales the inputs to both the bias-corrected first moment m_t and the second-moment denominator v_t in Adam; this alters the effective per-parameter step size even when the nominal η is left unchanged. The manuscript should therefore include an ablation in which the learning rate is re-optimized for each p (or at least for the best-reported p) and demonstrate that the terminal-loss advantage survives.
  2. [Theoretical analysis] Theoretical analysis section: the discussion of gradient noise does not derive the resulting shift in the stationary point or the modified convergence rate under the coordinate-wise rescaling induced by φ_p. A short derivation or bound showing how the noise statistics and effective Lipschitz constant change would strengthen the mechanistic claim.
  3. [Results] Results: the assertion of “consistently achieves lower terminal loss” is presented without quantitative deltas, standard deviations across runs, number of independent seeds, or statistical tests. This omission makes it impossible to judge the magnitude and reliability of the improvement, especially given the low-confidence soundness assessment.
minor comments (2)
  1. [Method] Notation: the definition φ_p(g) = (sign(g_i) |g_i|^p)_i should explicitly state the domain of p and whether p is chosen once for all experiments or tuned per model.
  2. [Figures] Figure clarity: loss curves comparing Adam and AdamPower should include shaded regions for multiple runs rather than single trajectories.
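
Major comment 1 can be made concrete with a short calculation. This is an editorial sketch, not a derivation from the manuscript: it assumes the standard bias-corrected Adam update applied coordinate-wise with h_t = φ_p(g_t) as input, and the constant-gradient case is an idealization.

```latex
\begin{aligned}
h_t &= \varphi_p(g_t) = \operatorname{sign}(g_t)\,|g_t|^{p},\\
m_t &= \beta_1 m_{t-1} + (1-\beta_1)\,h_t, \qquad \hat m_t = \frac{m_t}{1-\beta_1^{t}},\\
v_t &= \beta_2 v_{t-1} + (1-\beta_2)\,h_t^{2}, \qquad \hat v_t = \frac{v_t}{1-\beta_2^{t}},\\
\theta_{t+1} &= \theta_t - \eta\,\frac{\hat m_t}{\sqrt{\hat v_t}+\epsilon}.
\end{aligned}
% Idealized case: g_t = g is constant in t, with m_0 = v_0 = 0.
\hat m_t = \operatorname{sign}(g)\,|g|^{p}, \qquad \hat v_t = |g|^{2p}
\;\Longrightarrow\;
\theta_{t+1}-\theta_t = -\eta\,\operatorname{sign}(g)\,\frac{|g|^{p}}{|g|^{p}+\epsilon}.
```

In this noise-free limit p enters only through the comparison of |g|^p with ϵ, so any further change in effective step size has to come from how φ_p reshapes the ratio of the moment estimates when gradients fluctuate. That is the regime a learning-rate ablation would need to probe.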

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive comments on our work. We address each of the major comments in detail below and outline the revisions we will make to the manuscript.

Point-by-point responses
  1. Referee: [Abstract / Experiments] Abstract and Experiments section: the claim that “GradPower requires only a single-line code change and no modifications to the base optimizer’s internal logic, including the hyperparameters” is load-bearing for the central contribution. Because φ_p multiplies each coordinate by |g_i|^{p-1}, it directly rescales the inputs to both the bias-corrected first moment m_t and the second-moment denominator v_t in Adam; this alters the effective per-parameter step size even when the nominal η is left unchanged. The manuscript should therefore include an ablation in which the learning rate is re-optimized for each p (or at least for the best-reported p) and demonstrate that the terminal-loss advantage survives.

    Authors: We agree that the elementwise power transformation affects the scale of the gradients fed into Adam's moment estimators, thereby influencing the effective step sizes. However, the core contribution remains that GradPower is a simple, external transformation applied to the raw gradient with no alterations to the optimizer's code or its user-specified hyperparameters. To directly address the concern and strengthen our claims, we will add an ablation study re-optimizing the learning rate for the optimal p and show that the performance gains persist compared to the baseline with its tuned learning rate. This revision will be included in the updated Experiments section. revision: partial

  2. Referee: [Theoretical analysis] Theoretical analysis section: the discussion of gradient noise does not derive the resulting shift in the stationary point or the modified convergence rate under the coordinate-wise rescaling induced by φ_p. A short derivation or bound showing how the noise statistics and effective Lipschitz constant change would strengthen the mechanistic claim.

    Authors: We appreciate this suggestion for enhancing the theoretical analysis. In the revised manuscript, we will expand the Theoretical analysis section to include a derivation of the shift in the stationary point and a bound on the modified convergence rate. This will involve analyzing the coordinate-wise rescaling's impact on gradient noise statistics and the effective Lipschitz constant, providing a more rigorous mechanistic explanation. revision: yes

  3. Referee: [Results] Results: the assertion of “consistently achieves lower terminal loss” is presented without quantitative deltas, standard deviations across runs, number of independent seeds, or statistical tests. This omission makes it impossible to judge the magnitude and reliability of the improvement, especially given the low-confidence soundness assessment.

    Authors: We acknowledge the importance of providing quantitative measures of improvement and statistical reliability. In the revised version, we will report specific deltas in terminal loss, include standard deviations from multiple independent runs with the number of seeds clearly stated, and incorporate appropriate statistical tests to support the claims of consistent improvement. These additions will be made throughout the Results section and in relevant figures/tables. revision: yes

Circularity Check

0 steps flagged

Empirical proposal with independent theoretical analysis of gradient noise

full rationale

The paper introduces GradPower as a simple elementwise sign-power transformation φ_p(g) applied before any base optimizer, with all claims resting on experimental results across architectures, scales, datasets, and schedules plus a separate theoretical section analyzing gradient noise effects. No derivation chain, fitted parameter, or first-principles result is shown to reduce by construction to the input transformation itself; the central performance claims are externally falsifiable via re-optimization of learning rate or direct comparison on held-out runs. Self-citations, if present, are not load-bearing for the uniqueness or correctness of the method. This is a standard non-circular empirical contribution.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The approach introduces one tunable exponent p and relies on the domain assumption that gradient noise in LM training can be beneficially reshaped by power transformations without harming convergence.

free parameters (1)
  • p
    The fixed exponent in the sign-power function is chosen once and held constant across experiments; its specific value is not derived from first principles.
axioms (1)
  • domain assumption: Gradient noise in language-model training has statistical properties that a sign-power map can exploit to improve optimization trajectories.
    Invoked to explain why the transformation yields lower terminal loss.

pith-pipeline@v0.9.0 · 5761 in / 1270 out tokens · 57939 ms · 2026-05-22T02:43:54.076306+00:00 · methodology


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Revealing Modular Gradient Noise Imbalance in LLMs: Calibrating Adam via Signal-to-Noise Ratio

    cs.LG 2026-05 unverdicted novelty 6.0

    MoLS scales Adam updates using module-level SNR estimates to correct gradient noise imbalance and improve LLM training convergence and generalization.

Reference graph

Works this paper leans on

23 extracted references · 23 canonical work pages · cited by 1 Pith paper · 9 internal anchors

  1. [1]

    GPT-4 Technical Report

    Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. GPT-4 technical report. arXiv preprint arXiv:2303.08774.

  2. [2]

    Language models are few-shot learners

    Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in Neural Information Processing Systems, 33:1877–1901.

  3. [3]

    Fira: Can we achieve full-rank training of LLMs under low-rank constraint?

    Xi Chen, Kaituo Feng, Changsheng Li, Xunhao Lai, Xiangyu Yue, Ye Yuan, and Guoren Wang. Fira: Can we achieve full-rank training of LLMs under low-rank constraint? arXiv preprint arXiv:2410.01623, 2024a. Xiangning Chen, Chen Liang, Da Huang, Esteban Real, Kaiyuan Wang, Hieu Pham, Xuanyi Dong, Thang Luong, Cho-Jui Hsieh, Yifeng Lu, et al. Symbolic dis...

  4. [4]

    Alex Damian, Eshaan Nichani, and Jason D Lee

    Jeremy M Cohen, Behrooz Ghorbani, Shankar Krishnan, Naman Agarwal, Sourabh Medapati, Michal Badura, Daniel Suo, David Cardoze, Zachary Nado, George E Dahl, et al. Adaptive gradient methods at the edge of stability. arXiv preprint arXiv:2207.14484.

  5. [5]

    Understanding optimization in deep learning with central flows

    Jeremy M Cohen, Alex Damian, Ameet Talwalkar, Zico Kolter, and Jason D Lee. Understanding optimization in deep learning with central flows. arXiv preprint arXiv:2410.24206.

  6. [6]

    Training Compute-Optimal Large Language Models

    Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, et al. Training compute-optimal large language models. arXiv preprint arXiv:2203.15556.

  7. [7]

    MiniCPM: Unveiling the Potential of Small Language Models with Scalable Training Strategies

    Shengding Hu, Yuge Tu, Xu Han, Chaoqun He, Ganqu Cui, Xiang Long, Zhi Zheng, Yewei Fang, Yuxiang Huang, Weilin Zhao, et al. MiniCPM: Unveiling the potential of small language models with scalable training strategies. arXiv preprint arXiv:2404.06395.

  8. [8]

    Adam: A Method for Stochastic Optimization

    Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

  9. [9]

    Cautious optimizers: Improving training with one line of code

    Kaizhao Liang, Lizhang Chen, Bo Liu, and Qiang Liu. Cautious optimizers: Improving training with one line of code. arXiv preprint arXiv:2411.16085.

  10. [10]

    DeepSeek-V3 Technical Report

    Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, et al. DeepSeek-V3 technical report. arXiv preprint arXiv:2412.19437, 2024a. Hong Liu, Zhiyuan Li, David Hall, Percy Liang, and Tengyu Ma. Sophia: A scalable stochastic second-order optimizer for language model pre-training...

  11. [11]

    Focus: First order concentrated updating scheme

    Yizhou Liu, Ziming Liu, and Jeff Gore. Focus: First order concentrated updating scheme. arXiv preprint arXiv:2501.12243, 2025b. Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101.

  12. [12]

    An Empirical Model of Large-Batch Training

    Sam McCandlish, Jared Kaplan, Dario Amodei, and OpenAI Dota Team. An empirical model of large-batch training. arXiv preprint arXiv:1812.06162.

  13. [13]

    LLaMA: Open and Efficient Foundation Language Models

    Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971.

  14. [14]

    SOAP: Improving and Stabilizing Shampoo using Adam

    Nikhil Vyas, Depen Morwani, Rosie Zhao, Itai Shapira, David Brandfonbrener, Lucas Janson, and Sham Kakade. SOAP: Improving and stabilizing Shampoo using Adam. arXiv preprint arXiv:2409.11321.

  15. [15]

    The sharpness disparity principle in transformers for accelerating language model pre-training

    Jinbo Wang, Mingze Wang, Zhanpeng Zhou, Junchi Yan, Lei Wu, et al. The sharpness disparity principle in transformers for accelerating language model pre-training. arXiv preprint arXiv:2502.19002.

  16. [16]

    Qwen2 Technical Report

    An Yang, Baosong Yang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Zhou, Chengpeng Li, Chengyuan Li, Dayiheng Liu, Fei Huang, et al. Qwen2 technical report. arXiv preprint arXiv:2407.10671, 2024a. An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, et al. Qwen2.5 technical re...

  17. [17]

    We pre-train LLaMA models of sizes ranging from 66M to 2B parameters

    is a popular dense decoder-only Transformer architecture, incorporating Rotary Positional Encoding (RoPE) (Su et al., 2024), Swish-Gated Linear Unit (SwiGLU), and Root Mean Square Layer Normalization (RMSNorm). We pre-train LLaMA models of sizes ranging from 66M to 2B parameters. Additional model configurations are detailed in Table

  18. [18]

    It is a large-scale public language dataset, widely used for LLM pre-training such as T5 (Raffel et al., 2020), and prior pre-training studies (Zhao et al., 2024; 2025)

    Datasets. Models are pre-trained on the following datasets: • Colossal Clean Crawled Corpus (C4) (Raffel et al., 2020). It is a large-scale public language dataset, widely used for LLM pre-training such as T5 (Raffel et al., 2020), and prior pre-training studies (Zhao et al., 2024; 2025). We use the T5 tokenizer, with a vocabulary size of 32100. • OpenWebTex...

  19. [19]

    Following Karpathy (2022); Liu et al

    and nanoGPT (Karpathy, 2022). Following Karpathy (2022); Liu et al. (2024b), we use the GPT-2 tokenizer, with a vocabulary size of 50304. LR schedulers. We evaluate two popular LR scheduling strategies: • cos (cosine scheduler) (Karpathy, 2022; Touvron et al., 2023): a linear warm-up to peak lr_max, followed by cosine decay to a terminal LR lr_min. • wsd (warmu...

  20. [20]

    The training includes 1,000 warm-up steps

    Following the Chinchilla scaling law (Hoffmann et al., 2022), the total number of training tokens is set to be approximately 20 times the number of model parameters. The training includes 1,000 warm-up steps. The grid search for lr_max is performed over {1e-4, 2e-4, 3e-4, 6e-4, 1e-3, 1.5e-3}. Optimal learning rates for each model are detailed in Tables 1 and

  21. [21]

    All other optimizer hyperparameters are kept identical to those used for the Adam baselines

    AdamPower experiments. We adopt p = 1.2 as the default in all experiments in Sections 3.2 and 3.3. All other optimizer hyperparameters are kept identical to those used for the Adam baselines. Importantly, the power p = 1.2 proves to be highly robust. A.2 Experimental details for Section 3.4. Adam with Blockwise LR. Following Wang et al. (2025), we adopt the same p...

  22. [22]

    For larger batch sizes (2048, 4096, 8192), we tune the max_lr over {6e-4, 1e-3, 2e-3, 4e-3, 8e-3} for Adam

    For batch size 512, the tuned max_lr is 1e-3 (Table 1). For larger batch sizes (2048, 4096, 8192), we tune the max_lr over {6e-4, 1e-3, 2e-3, 4e-3, 8e-3} for Adam. We find that 1e-3 consistently yields the best results across all batch sizes. For each batch size, we evaluate AdamPower with multiple values of p, and record their validation loss when the op...

  23. [23]

    φ2 p(g) v+ϵ # . Given that q Et[φ2p(g)]⩽ √˜v+ϵand q Et[φ2p(g)]⩽R p, we can simplify the above estimate as: Et[I1]⩽ σ 4 |G|2 √˜v+ϵ + Rp σ Et

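    3/2 ⩽σ p. Therefore, we have the estimate: $\frac{\log(\epsilon\log(1/\sigma))}{\log(\sigma/e^{3})} \leqslant p^{\star} \leqslant \frac{\log(\epsilon\log(1/\sigma))}{\log\sigma}$. Noticing $\sigma \ll 1$, we obtain: $p^{\star} = \Theta\left(\frac{\log(\epsilon\log(1/\sigma))}{\log\sigma}\right)$. C Proofs in Section 4.2. Recall that the update rule of AdagradPower (with power $p$) follows: $\theta_{t+1} = \theta_t - \eta u_t$, $u_t = \frac{\varphi_p(g_t)}{\sqrt{v_t+\epsilon}}$, $v_t = \sum_{s=1}^{t}\varphi_p^{2}(g_s)$. In general, our proof is inspired by the main techniques to ...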
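
The AdagradPower update rule quoted in the last excerpt can be sketched in a few lines. This is an illustration under assumptions, not the authors' implementation: it treats a single scalar parameter, places ϵ inside the square root (the extracted formula is ambiguous on this point), and uses a toy quadratic objective. The names `sign_power` and `adagrad_power` are hypothetical.

```python
import math


def sign_power(g, p):
    """Scalar sign-power transform: sign(g) * |g|**p."""
    return math.copysign(abs(g) ** p, g)


def adagrad_power(grad_fn, theta, p=1.2, lr=0.1, eps=1e-8, steps=200):
    """Scalar AdagradPower sketch.

    v_t accumulates phi_p(g_s)**2 over all steps so far, and each update is
    lr * phi_p(g_t) / sqrt(v_t + eps)  (assumption: eps inside the root).
    """
    v = 0.0
    for _ in range(steps):
        h = sign_power(grad_fn(theta), p)
        v += h * h
        theta -= lr * h / math.sqrt(v + eps)
    return theta


# Toy objective f(theta) = 0.5 * theta**2, so the gradient is theta itself.
theta_final = adagrad_power(lambda th: th, theta=1.0)
```

Setting `p=1.0` reduces the sketch to ordinary scalar Adagrad with the same ϵ placement, which gives a simple baseline for comparison.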