GradPower: Powering Gradients for Faster Language Model Pre-Training

Jiaqi Zhang; Jinbo Wang; Lei Wu; Mingze Wang; Peng Pei; Weinan E; Wei Wang; Xunliang Cai

arXiv: 2505.24275 · v3 · pith:3JKBJ6XXnew · submitted 2025-05-30 · 💻 cs.LG · math.OC · stat.ML

Pith reviewed 2026-05-22 02:43 UTC · model grok-4.3

classification 💻 cs.LG · math.OC · stat.ML

keywords gradient transformation · sign-power · Adam optimizer · language model pre-training · terminal loss · mixture-of-experts · optimizer acceleration

The pith

A fixed sign-power transform on gradients reaches lower terminal loss in language model pre-training.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces GradPower, a lightweight method that applies an elementwise sign-power operation to each component of the gradient vector before it enters a standard optimizer. The operation takes the sign of the gradient element and multiplies it by the absolute value raised to a fixed power p, then feeds the result into the base optimizer such as Adam. This requires only a single line of code and leaves all optimizer hyperparameters and learning-rate schedules untouched. Across experiments with LLaMA and Qwen2MoE models ranging from 66 million to 2 billion parameters, on C4 and OpenWebText data, and with both cosine and warmup-stable-decay schedules, the transformed gradients produce lower final loss values. The gains are largest for mixture-of-experts models under warmup-stable-decay schedules, and the same transform also improves performance when paired with the Muon optimizer.

Core claim

GradPower applies the elementwise sign-power transformation φ_p(g) = (sign(g_i) |g_i|^p)_i for a fixed p > 0 to the incoming gradient vector and passes the result directly to an unmodified base optimizer. When used with Adam this produces consistently lower terminal loss than the untransformed baseline across the tested model families, scales, datasets, and schedules, with the largest improvements appearing in modern mixture-of-experts models trained under warmup-stable-decay schedules.
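The transformation in the core claim can be sketched in a few lines. The following is a minimal NumPy illustration, not the authors' code; the function name `sign_power` and the sample vector are assumptions made for this example.

```python
import numpy as np

def sign_power(grad: np.ndarray, p: float) -> np.ndarray:
    """Elementwise sign-power map: sign(g_i) * |g_i|**p for a fixed p > 0."""
    if p <= 0:
        raise ValueError("p must be positive")
    return np.sign(grad) * np.abs(grad) ** p

# Illustrative gradient vector (hypothetical values).
g = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])

identity = sign_power(g, 1.0)   # p = 1 leaves the gradient unchanged
powered = sign_power(g, 1.2)    # p > 1 shrinks |g_i| < 1 and grows |g_i| > 1
```

With p > 1, components with magnitude below 1 shrink and components above 1 grow, while signs and zeros are preserved. Which regime matters in practice depends on the actual gradient scale in a given training run.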

What carries the argument

The elementwise sign-power transformation that multiplies the sign of each gradient component by its absolute value raised to a fixed positive power p.

If this is right

AdamPower yields lower terminal loss on both dense LLaMA models and sparse Qwen2MoE models.
The improvement appears at every tested scale from 66 million to 2 billion parameters.
The same single-line change also improves the Muon optimizer without further tuning.
The largest observed gains occur when training mixture-of-experts models with warmup-stable-decay schedules.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The benefit may stem from how the power transform alters the distribution of gradient noise, which could be tested by measuring gradient statistics before and after the transform.
Because no schedule retuning is required, the method could be dropped into existing large-scale training runs with minimal engineering cost.
If the optimal p turns out to be stable across many tasks, the same transform might accelerate training in domains outside language modeling where gradient magnitudes are similarly heterogeneous.
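The first extension above suggests measuring gradient statistics before and after the transform. One way this could be set up is sketched below with synthetic data. The helper names, the noise model, and the per-coordinate statistic (standard deviation over absolute mean) are all assumptions for illustration and are not taken from the paper.

```python
import numpy as np

def noise_to_signal(samples: np.ndarray) -> np.ndarray:
    """Per-coordinate std / |mean| over the sample axis (axis 0)."""
    mean = samples.mean(axis=0)
    std = samples.std(axis=0)
    return std / (np.abs(mean) + 1e-12)  # small constant avoids division by zero

def compare_transform(samples: np.ndarray, p: float):
    """Return the statistic for raw samples and for sign-power transformed samples."""
    transformed = np.sign(samples) * np.abs(samples) ** p
    return noise_to_signal(samples), noise_to_signal(transformed)

# Hypothetical stand-in for minibatch gradients: 4096 noisy draws of a
# 3-coordinate gradient. Real measurements would use gradients from training.
rng = np.random.default_rng(0)
true_grad = np.array([0.01, 0.1, 1.0])
samples = true_grad + 0.05 * rng.standard_normal((4096, 3))

before, after = compare_transform(samples, p=1.2)
```

Comparing `before` and `after` coordinate by coordinate would show how the transform reshapes the noise profile under this toy noise model. It says nothing by itself about real language-model gradients.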

Load-bearing premise

A single fixed power p improves final loss without any need to retune the base optimizer's internal hyperparameters or the learning-rate schedule.

What would settle it

Running AdamPower on a 1-billion-parameter Qwen2MoE model trained on OpenWebText with a warmup-stable-decay schedule and finding that the terminal loss is not lower than the loss obtained with unmodified Adam.

Figures

Figures reproduced from arXiv: 2505.24275 by Jiaqi Zhang, Jinbo Wang, Lei Wu, Mingze Wang, Peng Pei, Weinan E, Wei Wang, Xunliang Cai.

**Figure 1.** Scaling-law comparison of AdamPower and Adam on the C4 dataset for dense LLaMA models and mixture-of-experts Qwen2MoE models. 1.1 Related Works. Optimizer design in LLM pre-training. In LLM pre-training, Adam (Kingma and Ba, 2014) has become the de facto optimizer. Recent efforts to improve its efficiency focus on two aspects: accelerating convergence and reducing memory usage. Techniques for accelerating c…

**Figure 2.** Pre-training LLaMA (0.2B) on C4 using AdamPower with different powers p. The optimal power is 1.2. Adam Baselines. We use the standard Adam optimizer (with decoupled weight decay) as the baseline in most experiments (except Section 3.4). The baseline is configured with hyperparameters β1 = 0.9, β2 = 0.95, weight decay λ = 0.1, and gradient clipping threshold of 1.0, following protocols used in LLaMA pre…

**Figure 3.** Figure 3 compares the performance of AdamPower (with p = 1.2) to that of vanilla Adam across a range of settings, including LLaMA models of size 66M, 0.2B, 0.4B, 1B and 2B; both cos and wsd LR schedulers; and the C4 and OpenWebText datasets. Across all experiments, AdamPower consistently achieves a lower terminal loss than the well-tuned Adam baseline. To further assess its scalability, we visualize the scaling laws of…

**Figure 4.** AdamPower (p = 1.2) consistently outperforms Adam in Qwen2MoE pre-training tasks on C4, across varying model sizes. The learning rate schedule is wsd.

**Figure 5.** (left) AdamPower with Blockwise LR outperforms both AdamPower and Adam with …

**Figure 6.** The influence of batch size on the optimal power p. Finally, we investigate how batch size influences the performance of GradPower. Batch size plays a critical role in deep learning, with larger batch sizes producing lower gradient noise and more accurate gradients (Keskar et al., 2017; McCandlish et al., 2018). Unlike the previous experimental settings, here we conduct the experiments on the C4 dataset, varyin…

**Figure 7.** Numerical results for Example 4.1. We plot the value of u_t at t = 10^6 for AdamPower across different values of p under varying noise-to-signal ratios. For each curve, the optimal and suboptimal p values are marked with stars. We set µ = 10^{-6}. Other hyperparameters follow standard values: β1 = 0.9, β2 = 0.95, ϵ = 10^{-8}, and λ = 0. The learning rate η does not affect the result. These findings closel…

Original abstract

We propose GradPower, a lightweight gradient-transformation technique for accelerating language model pre-training. Given a gradient vector $g=(g_i)_i$, GradPower first applies the elementwise sign-power transformation: $\varphi_p(g)=({\rm sign}(g_i)|g_i|^p)_{i}$ for a fixed $p>0$, and then feeds the transformed gradient into a base optimizer. Notably, GradPower requires only a single-line code change and no modifications to the base optimizer's internal logic, including the hyperparameters. When applied to Adam (termed AdamPower), GradPower consistently achieves lower terminal loss across diverse architectures (LLaMA, Qwen2MoE), parameter scales (66M to 2B), datasets (C4, OpenWebText), and learning-rate schedules (cosine, warmup-stable-decay). The most pronounced gains are observed when training modern mixture-of-experts models with warmup-stable-decay schedules. GradPower also integrates seamlessly with other state-of-the-art optimizers, such as Muon, yielding further improvements. Finally, we provide theoretical analyses that reveal the underlying mechanism of GradPower and highlight the influence of gradient noise.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

GradPower applies a fixed power transform to gradients before Adam or Muon and reports lower terminal loss across LM scales, but the gains may partly reflect an implicit shift in effective learning rate.

The main thing to know is that this work proposes a simple elementwise power transform on gradients, φ_p(g) = sign(g) * |g|^p, plugged into Adam or Muon, and reports lower final losses on a range of language model pre-training setups without touching the optimizer's other settings. The experiments are the strongest part. They cover LLaMA and Qwen2MoE models from 66M up to 2B parameters, two datasets, and both cosine and warmup-stable-decay learning rate schedules. The gains look more noticeable on the MoE models with the warmup-stable-decay schedule. Adding the transform to Muon also helps. The fact that it's just one line of code makes it easy to test. The authors include some theoretical discussion about gradient noise to explain why it might help.

On the downside, the central claim that a single fixed p works without any hyperparameter retuning is worth checking closely. Because the power changes the gradient magnitudes differently for each coordinate, it alters the first and second moments in Adam, which effectively changes the per-parameter update sizes. The paper's theory looks at noise but does not seem to quantify how this rescaling shifts the stationary point or the optimal learning rate. If the reported improvements hold only because the original learning rate happens to be better for the transformed gradients, then the benefit is less general than claimed. It would help to see whether re-tuning the base learning rate for the standard optimizer closes the gap.

This paper is aimed at people who train large language models and are looking for small optimizations to the training loop. A reader who cares about practical improvements in pre-training efficiency would get value from the empirical results. The work shows clear thinking in setting up the experiments across scales and architectures, so it deserves a serious referee even if the theory needs tightening. I recommend sending it out for peer review.
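The concern about Adam's moment estimates can be made concrete with a small sketch of one Adam step that applies the sign-power map first. This is a minimal NumPy illustration under stated assumptions: the names `adam_power_step` and `init_state` are hypothetical, the default β1 = 0.9 and β2 = 0.95 mirror the baseline settings quoted in the Figure 2 excerpt, and weight decay and gradient clipping are omitted for brevity.

```python
import numpy as np

def init_state(shape):
    """Fresh Adam state: step counter plus zeroed first and second moments."""
    return {"t": 0, "m": np.zeros(shape), "v": np.zeros(shape)}

def adam_power_step(theta, grad, state, lr=1e-3, p=1.2,
                    beta1=0.9, beta2=0.95, eps=1e-8):
    """One Adam step on the sign-power transformed gradient (no weight decay)."""
    g = np.sign(grad) * np.abs(grad) ** p          # the single-line change
    state["t"] += 1
    state["m"] = beta1 * state["m"] + (1 - beta1) * g
    state["v"] = beta2 * state["v"] + (1 - beta2) * g * g
    m_hat = state["m"] / (1 - beta1 ** state["t"])  # bias-corrected first moment
    v_hat = state["v"] / (1 - beta2 ** state["t"])  # bias-corrected second moment
    return theta - lr * m_hat / (np.sqrt(v_hat) + eps)

# Illustrative first step with hypothetical gradient values.
theta0 = np.zeros(2)
grad = np.array([1e-3, -1.0])
state = init_state(theta0.shape)
theta1 = adam_power_step(theta0, grad, state)
```

On the very first step the bias-corrected moments reduce to g and g², so the update is close to lr · sign(g) for any p as long as |g|^p is much larger than ϵ. Any difference between Adam and AdamPower therefore has to come from how the transform interacts with gradients that vary over time, which is why an ablation with a re-tuned learning rate would be informative.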

Referee Report

3 major / 2 minor

Summary. The manuscript proposes GradPower, a lightweight elementwise gradient transformation φ_p(g) = (sign(g_i) |g_i|^p)_i for fixed p > 0 that is applied to the raw gradient before it is passed to a base optimizer (e.g., Adam, yielding AdamPower). The central empirical claim is that this single-line change, with no modifications to the base optimizer’s internal hyperparameters or learning-rate schedule, produces consistently lower terminal loss across LLaMA and Qwen2MoE architectures (66M to 2B parameters), C4 and OpenWebText datasets, and both cosine and warmup-stable-decay schedules; additional gains are reported when combined with Muon. A theoretical section analyzes the effect through the lens of gradient noise.

Significance. If the reported gains are robust and not an artifact of implicit learning-rate retuning, GradPower would constitute a minimal-overhead, optimizer-agnostic improvement to large-scale language-model pre-training. The breadth of tested configurations and the attempt at a mechanistic explanation via gradient noise are strengths; however, the absence of quantitative deltas, error bars, or run counts in the abstract and the potential interaction with Adam’s moment estimates limit immediate impact.

major comments (3)

[Abstract / Experiments] Abstract and Experiments section: the claim that “GradPower requires only a single-line code change and no modifications to the base optimizer’s internal logic, including the hyperparameters” is load-bearing for the central contribution. Because φ_p multiplies each coordinate by |g_i|^{p-1}, it directly rescales the inputs to both the bias-corrected first moment m_t and the second-moment denominator v_t in Adam; this alters the effective per-parameter step size even when the nominal η is left unchanged. The manuscript should therefore include an ablation in which the learning rate is re-optimized for each p (or at least for the best-reported p) and demonstrate that the terminal-loss advantage survives.
[Theoretical analysis] Theoretical analysis section: the discussion of gradient noise does not derive the resulting shift in the stationary point or the modified convergence rate under the coordinate-wise rescaling induced by φ_p. A short derivation or bound showing how the noise statistics and effective Lipschitz constant change would strengthen the mechanistic claim.
[Results] Results: the assertion of “consistently achieves lower terminal loss” is presented without quantitative deltas, standard deviations across runs, number of independent seeds, or statistical tests. This omission makes it impossible to judge the magnitude and reliability of the improvement, especially given the low-confidence soundness assessment.
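The reporting requested in the third major comment could take roughly the following shape. This is a sketch using only the Python standard library; the function name `summarize` is hypothetical, and the loss values are placeholders invented for illustration, not results from the paper.

```python
from math import sqrt
from statistics import mean, stdev

def summarize(baseline, treated):
    """Mean delta (treated - baseline), per-arm std, and Welch's t statistic."""
    mb, mt = mean(baseline), mean(treated)
    sb, st = stdev(baseline), stdev(treated)
    se = sqrt(sb ** 2 / len(baseline) + st ** 2 / len(treated))
    return {
        "delta": mt - mb,
        "std_baseline": sb,
        "std_treated": st,
        "welch_t": (mt - mb) / se,
        "n_seeds": (len(baseline), len(treated)),
    }

# Placeholder terminal losses over three seeds each (hypothetical numbers).
adam_losses = [1.00, 1.02, 1.01]
adampower_losses = [0.98, 0.99, 0.97]
report = summarize(adam_losses, adampower_losses)
```

A negative `delta` together with a large-magnitude `welch_t` would indicate a lower terminal loss that is unlikely to be seed noise. With only a handful of seeds the statistic should be read cautiously.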

minor comments (2)

[Method] Notation: the definition φ_p(g) = (sign(g_i) |g_i|^p)_i should explicitly state the domain of p and whether p is chosen once for all experiments or tuned per model.
[Figures] Figure clarity: loss curves comparing Adam and AdamPower should include shaded regions for multiple runs rather than single trajectories.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive comments on our work. We address each of the major comments in detail below and outline the revisions we will make to the manuscript.

Referee: [Abstract / Experiments] Abstract and Experiments section: the claim that “GradPower requires only a single-line code change and no modifications to the base optimizer’s internal logic, including the hyperparameters” is load-bearing for the central contribution. Because φ_p multiplies each coordinate by |g_i|^{p-1}, it directly rescales the inputs to both the bias-corrected first moment m_t and the second-moment denominator v_t in Adam; this alters the effective per-parameter step size even when the nominal η is left unchanged. The manuscript should therefore include an ablation in which the learning rate is re-optimized for each p (or at least for the best-reported p) and demonstrate that the terminal-loss advantage survives.

Authors: We agree that the elementwise power transformation affects the scale of the gradients fed into Adam's moment estimators, thereby influencing the effective step sizes. However, the core contribution remains that GradPower is a simple, external transformation applied to the raw gradient with no alterations to the optimizer's code or its user-specified hyperparameters. To directly address the concern and strengthen our claims, we will add an ablation study re-optimizing the learning rate for the optimal p and show that the performance gains persist compared to the baseline with its tuned learning rate. This revision will be included in the updated Experiments section. revision: partial
Referee: [Theoretical analysis] Theoretical analysis section: the discussion of gradient noise does not derive the resulting shift in the stationary point or the modified convergence rate under the coordinate-wise rescaling induced by φ_p. A short derivation or bound showing how the noise statistics and effective Lipschitz constant change would strengthen the mechanistic claim.

Authors: We appreciate this suggestion for enhancing the theoretical analysis. In the revised manuscript, we will expand the Theoretical analysis section to include a derivation of the shift in the stationary point and a bound on the modified convergence rate. This will involve analyzing the coordinate-wise rescaling's impact on gradient noise statistics and the effective Lipschitz constant, providing a more rigorous mechanistic explanation. revision: yes
Referee: [Results] Results: the assertion of “consistently achieves lower terminal loss” is presented without quantitative deltas, standard deviations across runs, number of independent seeds, or statistical tests. This omission makes it impossible to judge the magnitude and reliability of the improvement, especially given the low-confidence soundness assessment.

Authors: We acknowledge the importance of providing quantitative measures of improvement and statistical reliability. In the revised version, we will report specific deltas in terminal loss, include standard deviations from multiple independent runs with the number of seeds clearly stated, and incorporate appropriate statistical tests to support the claims of consistent improvement. These additions will be made throughout the Results section and in relevant figures/tables. revision: yes

Circularity Check

0 steps flagged

Empirical proposal with independent theoretical analysis of gradient noise

The paper introduces GradPower as a simple elementwise sign-power transformation φ_p(g) applied before any base optimizer, with all claims resting on experimental results across architectures, scales, datasets, and schedules plus a separate theoretical section analyzing gradient noise effects. No derivation chain, fitted parameter, or first-principles result is shown to reduce by construction to the input transformation itself; the central performance claims are externally falsifiable via re-optimization of learning rate or direct comparison on held-out runs. Self-citations, if present, are not load-bearing for the uniqueness or correctness of the method. This is a standard non-circular empirical contribution.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The approach introduces one tunable exponent p and relies on the domain assumption that gradient noise in LM training can be beneficially reshaped by power transformations without harming convergence.

free parameters (1)

p
The fixed exponent in the sign-power function is chosen once and held constant across experiments; its specific value is not derived from first principles.

axioms (1)

Domain assumption: Gradient noise in language-model training has statistical properties that a sign-power map can exploit to improve optimization trajectories.
Invoked to explain why the transformation yields lower terminal loss.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Revealing Modular Gradient Noise Imbalance in LLMs: Calibrating Adam via Signal-to-Noise Ratio
cs.LG 2026-05 unverdicted novelty 6.0

MoLS scales Adam updates using module-level SNR estimates to correct gradient noise imbalance and improve LLM training convergence and generalization.

Reference graph

Works this paper leans on

23 extracted references · 23 canonical work pages · cited by 1 Pith paper · 9 internal anchors

[1]

GPT-4 Technical Report

Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. GPT-4 technical report. arXiv preprint arXiv:2303.08774.
[2]

Language Models are Few-Shot Learners

Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in Neural Information Processing Systems, 33:1877–1901.
[3]

Fira: Can we achieve full-rank training of LLMs under low-rank constraint?

Xi Chen, Kaituo Feng, Changsheng Li, Xunhao Lai, Xiangyu Yue, Ye Yuan, and Guoren Wang. Fira: Can we achieve full-rank training of LLMs under low-rank constraint? arXiv preprint arXiv:2410.01623, 2024a. Xiangning Chen, Chen Liang, Da Huang, Esteban Real, Kaiyuan Wang, Hieu Pham, Xuanyi Dong, Thang Luong, Cho-Jui Hsieh, Yifeng Lu, et al. Symbolic dis...
[4]

Adaptive gradient methods at the edge of stability

Jeremy M Cohen, Behrooz Ghorbani, Shankar Krishnan, Naman Agarwal, Sourabh Medapati, Michal Badura, Daniel Suo, David Cardoze, Zachary Nado, George E Dahl, et al. Adaptive gradient methods at the edge of stability. arXiv preprint arXiv:2207.14484.
[5]

Understanding optimization in deep learning with central flows

Jeremy M Cohen, Alex Damian, Ameet Talwalkar, Zico Kolter, and Jason D Lee. Understanding optimization in deep learning with central flows. arXiv preprint arXiv:2410.24206.
[6]

Training Compute-Optimal Large Language Models

Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, et al. Training compute-optimal large language models. arXiv preprint arXiv:2203.15556.
[7]

MiniCPM: Unveiling the Potential of Small Language Models with Scalable Training Strategies

Shengding Hu, Yuge Tu, Xu Han, Chaoqun He, Ganqu Cui, Xiang Long, Zhi Zheng, Yewei Fang, Yuxiang Huang, Weilin Zhao, et al. MiniCPM: Unveiling the potential of small language models with scalable training strategies. arXiv preprint arXiv:2404.06395.
[8]

Adam: A Method for Stochastic Optimization

Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
[9]

Cautious optimizers: Improving training with one line of code

Kaizhao Liang, Lizhang Chen, Bo Liu, and Qiang Liu. Cautious optimizers: Improving training with one line of code. arXiv preprint arXiv:2411.16085.
[10]

DeepSeek-V3 Technical Report

Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, et al. DeepSeek-V3 technical report. arXiv preprint arXiv:2412.19437, 2024a. Hong Liu, Zhiyuan Li, David Hall, Percy Liang, and Tengyu Ma. Sophia: A scalable stochastic second-order optimizer for language model pre-training...
[11]

Focus: First order concentrated updating scheme

Yizhou Liu, Ziming Liu, and Jeff Gore. Focus: First order concentrated updating scheme. arXiv preprint arXiv:2501.12243, 2025b. Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101.
[12]

An Empirical Model of Large-Batch Training

Sam McCandlish, Jared Kaplan, Dario Amodei, and OpenAI Dota Team. An empirical model of large-batch training. arXiv preprint arXiv:1812.06162.
[13]

LLaMA: Open and Efficient Foundation Language Models

Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971.
[14]

SOAP: Improving and Stabilizing Shampoo using Adam

Nikhil Vyas, Depen Morwani, Rosie Zhao, Itai Shapira, David Brandfonbrener, Lucas Janson, and Sham Kakade. SOAP: Improving and stabilizing Shampoo using Adam. arXiv preprint arXiv:2409.11321.
[15]

The sharpness disparity principle in transformers for accelerating language model pre-training

Jinbo Wang, Mingze Wang, Zhanpeng Zhou, Junchi Yan, Lei Wu, et al. The sharpness disparity principle in transformers for accelerating language model pre-training. arXiv preprint arXiv:2502.19002, 2025a.
[16]

Qwen2 Technical Report

An Yang, Baosong Yang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Zhou, Chengpeng Li, Chengyuan Li, Dayiheng Liu, Fei Huang, et al. Qwen2 technical report. arXiv preprint arXiv:2407.10671, 2024a. An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, et al. Qwen2.5 technical re...
[17]

We pre-train LLaMA models of sizes ranging from 66M to 2B parameters

is a popular dense decoder-only Transformer architecture, incorporating Rotary Positional Encoding (RoPE) (Su et al., 2024), Swish-Gated Linear Unit (SwiGLU), and root mean square layer normalization (RMSNorm). We pre-train LLaMA models of sizes ranging from 66M to 2B parameters. Additional model configurations are detailed in Table
[18]

It is a large-scale public language dataset, widely used for LLM pre-training such as T5 (Raffel et al., 2020), and prior pre-training studies (Zhao et al., 2024; 2025)

Datasets. Models are pre-trained on the following datasets: • Colossal Clean Crawled Corpus (C4) (Raffel et al., 2020). It is a large-scale public language dataset, widely used for LLM pre-training such as T5 (Raffel et al., 2020), and prior pre-training studies (Zhao et al., 2024; 2025). We use the T5 tokenizer, with the vocabulary size 32100. • OpenWebTex...
[19]

Following Karpathy (2022); Liu et al

and nanoGPT (Karpathy, 2022). Following Karpathy (2022); Liu et al. (2024b), we use the GPT-2 tokenizer, with the vocabulary size 50304. LR schedulers. We evaluate two popular LR scheduling strategies: • cos (cosine scheduler) (Karpathy, 2022; Touvron et al., 2023): a linear warm-up to peak lr_max, followed by cosine decay to a terminal LR lr_min. • wsd (warmu...
[20]

The training includes 1,000 warm-up steps

Following the Chinchilla scaling law (Hoffmann et al., 2022), the total number of training tokens is set to be approximately 20 times the number of model parameters. The training includes 1,000 warm-up steps. The grid search for lr_max is performed over {1e-4, 2e-4, 3e-4, 6e-4, 1e-3, 1.5e-3}. Optimal learning rates for each model are detailed in Tables 1 and
[21]

All other optimizer hyperparameters are kept identical to those used for the Adam baselines

AdamPower experiments. We adopt p = 1.2 as the default in all experiments in Sections 3.2 and 3.3. All other optimizer hyperparameters are kept identical to those used for the Adam baselines. Importantly, the power p = 1.2 proves to be highly robust. A.2 Experimental details for Section 3.4. Adam with Blockwise LR. Following Wang et al. (2025), we adopt the same p...
[22]

For larger batch sizes (2048, 4096, 8192), we tune the max_lr over {6e-4, 1e-3, 2e-3, 4e-3, 8e-3} for Adam

For batch size 512, the tuned max_lr is 1e-3 (Table 1). For larger batch sizes (2048, 4096, 8192), we tune the max_lr over {6e-4, 1e-3, 2e-3, 4e-3, 8e-3} for Adam. We find that 1e-3 consistently yields the best results across all batch sizes. For each batch size, we evaluate AdamPower with multiple values of p, and record their validation loss when the op...
[23]

φ2 p(g) v+ϵ # . Given that q Et[φ2p(g)]⩽ √˜v+ϵand q Et[φ2p(g)]⩽R p, we can simplify the above estimate as: Et[I1]⩽ σ 4 |G|2 √˜v+ϵ + Rp σ Et

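3/2 ⩽σ p. Therefore, we have the estimate: $\frac{\log(\epsilon\log(1/\sigma))}{\log(\sigma/e^{3})} \le p^{\star} \le \frac{\log(\epsilon\log(1/\sigma))}{\log\sigma}$. Noticing $\sigma \ll 1$, we obtain: $p^{\star} = \Theta\left(\frac{\log(\epsilon\log(1/\sigma))}{\log\sigma}\right)$. C Proofs in Section 4.2. Recall that the update rule of AdagradPower (with power $p$) follows: $\theta_{t+1} = \theta_t - \eta u_t$, $u_t = \frac{\varphi_p(g_t)}{\sqrt{v_t + \epsilon}}$, $v_t = \sum_{s=1}^{t} \varphi_p^2(g_s)$. In general, our proof is inspired by the main techniques to ...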

[1] [1]

GPT-4 Technical Report

Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. GPT-4 technical report.arXiv preprint arXiv:2303.08774,

work page internal anchor Pith review Pith/arXiv arXiv

[2] [2]

Language models are few-shot learners.Advances in neural information processing systems, 33:1877–1901,

16 Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners.Advances in neural information processing systems, 33:1877–1901,

work page 1901

[3] [3]

Fira: Can we achieve full-rank training of llms under low-rank constraint?arXiv preprint arXiv:2410.01623, 2024a

1 Xi Chen, Kaituo Feng, Changsheng Li, Xunhao Lai, Xiangyu Yue, Ye Yuan, and Guoren Wang. Fira: Can we achieve full-rank training of llms under low-rank constraint?arXiv preprint arXiv:2410.01623, 2024a. 3 10 Xiangning Chen, Chen Liang, Da Huang, Esteban Real, Kaiyuan Wang, Hieu Pham, Xuanyi Dong, Thang Luong, Cho-Jui Hsieh, Yifeng Lu, et al. Symbolic dis...

work page arXiv

[4] [4]

Alex Damian, Eshaan Nichani, and Jason D Lee

3 Jeremy M Cohen, Behrooz Ghorbani, Shankar Krishnan, Naman Agarwal, Sourabh Medapati, Michal Badura, Daniel Suo, David Cardoze, Zachary Nado, George E Dahl, et al. Adaptive gradient methods at the edge of stability.arXiv preprint arXiv:2207.14484,

work page arXiv

[5] [5]

Understanding optimization in deep learning with central flows.arXiv preprint arXiv:2410.24206,

3 Jeremy M Cohen, Alex Damian, Ameet Talwalkar, Zico Kolter, and Jason D Lee. Understanding optimization in deep learning with central flows.arXiv preprint arXiv:2410.24206,

work page arXiv

[6] [6]

Training Compute-Optimal Large Language Models

4, 6, 14 Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, et al. Training compute-optimal large language models.arXiv preprint arXiv:2203.15556,

work page internal anchor Pith review Pith/arXiv arXiv

[7] [7]

MiniCPM: Unveiling the Potential of Small Language Models with Scalable Training Strategies

3 Shengding Hu, Yuge Tu, Xu Han, Chaoqun He, Ganqu Cui, Xiang Long, Zhi Zheng, Yewei Fang, Yuxiang Huang, Weilin Zhao, et al. Minicpm: Unveiling the potential of small language models with scalable training strategies.arXiv preprint arXiv:2404.06395,

work page internal anchor Pith review Pith/arXiv arXiv

[8] [8]

Adam: A Method for Stochastic Optimization

7 Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization.arXiv preprint arXiv:1412.6980,

work page internal anchor Pith review Pith/arXiv arXiv

[9] [9]

Cautious optimizers: Improving training with one line of code.arXiv preprint arXiv:2411.16085,

3 Kaizhao Liang, Lizhang Chen, Bo Liu, and Qiang Liu. Cautious optimizers: Improving training with one line of code.arXiv preprint arXiv:2411.16085,

work page arXiv

[10] [10]

DeepSeek-V3 Technical Report

2 Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, et al. Deepseek-v3 technical report.arXiv preprint arXiv:2412.19437, 2024a. 1, 5, 6 Hong Liu, Zhiyuan Li, David Hall, Percy Liang, and Tengyu Ma. Sophia: A scalable stochastic second-order optimizer for language model pre-training...

work page internal anchor Pith review Pith/arXiv arXiv 1907

[11] [11]

arXiv preprint arXiv:2501.12243 , year=

4, 14 Yizhou Liu, Ziming Liu, and Jeff Gore. Focus: First order concentrated updating scheme.arXiv preprint arXiv:2501.12243, 2025b. 1, 3 Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization.arXiv preprint arXiv:1711.05101,

work page arXiv

[12] [12]

An Empirical Model of Large-Batch Training

1 Sam McCandlish, Jared Kaplan, Dario Amodei, and OpenAI Dota Team. An empirical model of large-batch training.arXiv preprint arXiv:1812.06162,


[13]

LLaMA: Open and Efficient Foundation Language Models

Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971.


[14]

SOAP: Improving and Stabilizing Shampoo using Adam

Nikhil Vyas, Depen Morwani, Rosie Zhao, Itai Shapira, David Brandfonbrener, Lucas Janson, and Sham Kakade. SOAP: Improving and stabilizing Shampoo using Adam. arXiv preprint arXiv:2409.11321.


[15]

The Sharpness Disparity Principle in Transformers for Accelerating Language Model Pre-Training

Jinbo Wang, Mingze Wang, Zhanpeng Zhou, Junchi Yan, Lei Wu, et al. The sharpness disparity principle in transformers for accelerating language model pre-training. arXiv preprint arXiv:2502.19002, 2025a.


[16]

Qwen2 Technical Report

An Yang, Baosong Yang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Zhou, Chengpeng Li, Chengyuan Li, Dayiheng Liu, Fei Huang, et al. Qwen2 technical report. arXiv preprint arXiv:2407.10671, 2024a.

An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, et al. Qwen2.5 technical re...


[17]

We pre-train LLaMA models of sizes ranging from 66M to 2B parameters

LLaMA is a popular dense decoder-only Transformer architecture, incorporating Rotary Positional Encoding (RoPE) (Su et al., 2024), Swish-Gated Linear Unit (SwiGLU), and root mean square layer normalization (RMSNorm). We pre-train LLaMA models of sizes ranging from 66M to 2B parameters. Additional model configurations are detailed in Table


[18]

It is a large-scale public language dataset, widely used for LLM pre-training such as T5 (Raffel et al., 2020), and prior pre-training studies (Zhao et al., 2024; 2025)

Datasets. Models are pre-trained on the following datasets:
• Colossal Clean Crawled Corpus (C4) (Raffel et al., 2020). It is a large-scale public language dataset, widely used for LLM pre-training such as T5 (Raffel et al., 2020), and in prior pre-training studies (Zhao et al., 2024; 2025). We use the T5 tokenizer, with a vocabulary size of 32100.
• OpenWebTex...


[19]

Following Karpathy (2022); Liu et al

and nanoGPT (Karpathy, 2022). Following Karpathy (2022) and Liu et al. (2024b), we use the GPT-2 tokenizer, with a vocabulary size of 50304.

LR schedulers. We evaluate two popular LR scheduling strategies:
• cos (cosine scheduler) (Karpathy, 2022; Touvron et al., 2023): a linear warm-up to the peak lr_max, followed by cosine decay to a terminal LR lr_min.
• wsd (warmu...
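The cos schedule described above can be sketched in a few lines. This is a minimal illustration, not the authors' code: the function name `cosine_lr`, its arguments, and the choice to warm up from zero are assumptions made here; only lr_max and lr_min come from the text.

```python
import math

def cosine_lr(step, total_steps, warmup_steps, lr_max, lr_min):
    """Linear warm-up to lr_max, then cosine decay to lr_min (hypothetical helper)."""
    if step < warmup_steps:
        # Assumed: warm-up starts at 0 and reaches lr_max at step == warmup_steps.
        return lr_max * step / warmup_steps
    # Fraction of the decay phase completed, clamped to [0, 1].
    decay_steps = max(1, total_steps - warmup_steps)
    progress = min(1.0, (step - warmup_steps) / decay_steps)
    # Cosine interpolation from lr_max (progress = 0) to lr_min (progress = 1).
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * progress))
```

For example, `cosine_lr(1000, 10000, 1000, 1e-3, 1e-4)` evaluates to the peak rate, and the value decays smoothly toward `lr_min` as `step` approaches `total_steps`.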


[20]

The training includes 1,000 warm-up steps

Following the Chinchilla scaling law (Hoffmann et al., 2022), the total number of training tokens is set to be approximately 20 times the number of model parameters. The training includes 1,000 warm-up steps. The grid search for lr_max is performed over {1e-4, 2e-4, 3e-4, 6e-4, 1e-3, 1.5e-3}. Optimal learning rates for each model are detailed in Tables 1 and
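As a rough worked example of the token budget rule above (about 20 training tokens per model parameter), the following sketch computes the budget for the smallest and largest model sizes mentioned. The helper name `token_budget` is hypothetical, and the parameter counts are the rounded figures 66M and 2B, not exact model sizes.

```python
def token_budget(num_params, tokens_per_param=20):
    """Approximate training-token budget: tokens_per_param tokens per parameter."""
    return tokens_per_param * num_params

# Learning-rate grid quoted in the text for lr_max.
LR_GRID = [1e-4, 2e-4, 3e-4, 6e-4, 1e-3, 1.5e-3]

for name, n in [("66M", 66_000_000), ("2B", 2_000_000_000)]:
    print(f"{name}: about {token_budget(n) / 1e9:.2f}B tokens")
```

Under these rounded sizes, the budget spans roughly 1.3 billion to 40 billion tokens.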


[21]

All other optimizer hyperparameters are kept identical to those used for the Adam baselines

AdamPower experiments. We adopt p = 1.2 as the default in all experiments in Sections 3.2 and 3.3. All other optimizer hyperparameters are kept identical to those used for the Adam baselines. Importantly, the power p = 1.2 proves to be highly robust.

A.2 Experimental details for Section 3.4

Adam with Blockwise LR. Following Wang et al. (2025), we adopt the same p...
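To make the role of the power p concrete, here is a minimal sketch of the elementwise sign-power transform φ_p(g) = sign(g_i)|g_i|^p with the default p = 1.2. It uses plain Python lists so that it runs without dependencies; the function name `sign_power` is an assumption of this sketch, and in a tensor framework the same operation would be a single elementwise expression applied to each gradient before the optimizer step.

```python
import math

def sign_power(grad, p=1.2):
    """Elementwise sign-power transform: sign(g_i) * |g_i|**p (hypothetical helper)."""
    # copysign keeps the sign of each entry while the magnitude is raised to the power p.
    return [math.copysign(abs(g) ** p, g) for g in grad]

grad = [0.01, -0.5, 2.0]
print(sign_power(grad))         # transformed gradient that would be fed to the base optimizer
print(sign_power(grad, p=1.0))  # p = 1 leaves the gradient unchanged
```

With p > 1, entries with magnitude below 1 shrink and entries with magnitude above 1 grow, while every sign is preserved.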


[22]

For larger batch sizes (2048, 4096, 8192), we tune the max_lr over {6e-4, 1e-3, 2e-3, 4e-3, 8e-3} for Adam

For batch size 512, the tuned max_lr is 1e-3 (Table 1). For larger batch sizes (2048, 4096, 8192), we tune the max_lr over {6e-4, 1e-3, 2e-3, 4e-3, 8e-3} for Adam. We find that 1e-3 consistently yields the best results across all batch sizes. For each batch size, we evaluate AdamPower with multiple values of p, and record their validation loss when the op...


[23]

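$\cdots\,\frac{\varphi_p^2(g)}{v+\epsilon}\Big]$. Given that $\sqrt{\mathbb{E}_t[\varphi_p^2(g)]} \leq \sqrt{\tilde{v}+\epsilon}$ and $\sqrt{\mathbb{E}_t[\varphi_p^2(g)]} \leq R^p$, we can simplify the above estimate as: $\mathbb{E}_t[I_1] \leq \frac{\sigma}{4}\frac{|G|^2}{\sqrt{\tilde{v}+\epsilon}} + \frac{R^p}{\sigma}\,\mathbb{E}_t\cdots$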

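$\cdots\,3/2 \leq \sigma^p$. Therefore, we have the estimate:

$$\frac{\log(\epsilon\log(1/\sigma))}{\log(\sigma/e^3)} \leq p^\star \leq \frac{\log(\epsilon\log(1/\sigma))}{\log\sigma}.$$

Noticing $\sigma \ll 1$, we obtain:

$$p^\star = \Theta\!\left(\frac{\log(\epsilon\log(1/\sigma))}{\log\sigma}\right).$$

C Proofs in Section 4.2

Recall that the update rule of AdagradPower (with power $p$) follows:

$$\theta_{t+1} = \theta_t - \eta u_t, \qquad u_t = \frac{\varphi_p(g_t)}{\sqrt{v_t+\epsilon}}, \qquad v_t = \sum_{s=1}^{t} \varphi_p^2(g_s).$$

In general, our proof is inspired by the main techniques to ...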
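The AdagradPower update recalled above can be illustrated with a short dependency-free sketch. It assumes, as read from the flattened formula, that ϵ sits inside the square root and that v accumulates the squared transformed gradients over all steps; the names `adagrad_power_step`, `lr`, and `eps` are hypothetical and introduced only for this example.

```python
import math

def sign_power(g, p):
    """Scalar sign-power transform: sign(g) * |g|**p."""
    return math.copysign(abs(g) ** p, g)

def adagrad_power_step(theta, grad, v, lr, p, eps=1e-8):
    """One AdagradPower step on lists of parameters (a sketch under stated assumptions)."""
    new_theta, new_v = [], []
    for th, g, vi in zip(theta, grad, v):
        phi = sign_power(g, p)            # transformed gradient phi_p(g_t)
        vi = vi + phi * phi               # running sum of squared transformed gradients
        u = phi / math.sqrt(vi + eps)     # normalized update direction u_t
        new_theta.append(th - lr * u)     # theta_{t+1} = theta_t - eta * u_t
        new_v.append(vi)
    return new_theta, new_v

theta, v = [1.0, -1.0], [0.0, 0.0]
theta, v = adagrad_power_step(theta, [0.5, -2.0], v, lr=0.1, p=1.2)
print(theta, v)
```

Setting p = 1 in this sketch recovers the plain Adagrad update, which is the baseline the analysis compares against.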
