
arXiv: 2506.16659 · v3 · pith:5GLYRQWInew · submitted 2025-06-20 · cs.LG · cs.AI · math.OC

Memory-Efficient LLM Pretraining via Minimalist Optimizer Design

Pith reviewed 2026-05-22 13:20 UTC · model grok-4.3

classification: cs.LG · cs.AI · math.OC
keywords: memory-efficient optimizer · LLM pretraining · column-wise gradient normalization · last-layer momentum · SCALE optimizer · Adam alternative · LLaMA training

The pith

Two minimal changes to SGD—column-wise gradient normalization and momentum only on the output layer—produce an optimizer that matches Adam performance while using 35-45% of the memory.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper starts from the question of how few alterations to plain SGD are required to reach the pretraining results of adaptive methods like Adam. It isolates two: column-wise normalization of gradients, which improves SGD without any momentum, and the restriction of first-order momentum to the output layer, where gradient variance is highest. Together these operations form SCALE, which delivers Adam-level or better perplexity on models from 60M to 1B parameters while using only 35-45% of the total memory Adam needs. The same recipe also beats earlier memory-efficient proposals on a 7B LLaMA run. The work matters because current LLM training remains memory-bound; a method that drops the second-order state and most momentum buffers could let practitioners train larger models on existing hardware.

Core claim

SCALE is constructed by applying column-wise gradient normalization to every layer and first-order momentum exclusively to the output layer. On models ranging from 60M to 1B parameters this combination matches or exceeds Adam’s pretraining performance while consuming only 35-45% of the total memory. SCALE further outperforms GaLore, Fira, APOLLO and Muon on LLaMA-7B in both final perplexity and memory footprint.

What carries the argument

SCALE, the optimizer formed by column-wise gradient normalization across layers plus first-order momentum restricted to the output layer.
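
To make the two ingredients concrete, here is a minimal, dependency-free sketch of one SCALE-style update step. It is an illustration under stated assumptions, not the authors' implementation: the function names, the `lm_head` layer name, the exponential-moving-average form of the momentum, the choice to normalize the momentum buffer after updating it, the `eps` term, and the convention that matrices are stored as lists of rows are all hypothetical choices made for this example.

```python
import math


def column_normalize(grad, eps=1e-8):
    """Divide every column of a 2-D gradient (list of rows) by its L2 norm.

    Assumption: "column" means the second index of the stored matrix; which
    axis is the output dimension depends on the weight layout in use.
    """
    n_rows, n_cols = len(grad), len(grad[0])
    norms = [
        math.sqrt(sum(grad[i][j] ** 2 for i in range(n_rows)))
        for j in range(n_cols)
    ]
    return [
        [grad[i][j] / (norms[j] + eps) for j in range(n_cols)]
        for i in range(n_rows)
    ]


def scale_step(params, grads, momentum, lr=0.1, beta=0.9, last_layer="lm_head"):
    """One hypothetical SCALE-style step over a dict of named weight matrices.

    Only `last_layer` gets a momentum buffer; every other layer is updated
    from its raw (column-normalized) gradient and stores no optimizer state.
    """
    for name, g in grads.items():
        if name == last_layer:
            # Assumed EMA momentum: m <- beta * m + (1 - beta) * g
            m = momentum.get(name) or [[0.0] * len(g[0]) for _ in g]
            m = [
                [beta * mv + (1.0 - beta) * gv for mv, gv in zip(m_row, g_row)]
                for m_row, g_row in zip(m, g)
            ]
            momentum[name] = m
            g = m
        update = column_normalize(g)
        params[name] = [
            [w - lr * u for w, u in zip(w_row, u_row)]
            for w_row, u_row in zip(params[name], update)
        ]
    return params, momentum
```

After any number of steps the `momentum` dictionary holds a single entry, for the output layer. That one buffer is the only optimizer state kept beyond the weights, which is where the memory saving in this sketch comes from.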

If this is right

  • Training runs of the same length become feasible on hardware with substantially smaller GPU memory.
  • Second-order moment buffers can be eliminated without loss of final model quality under the tested regimes.
  • The memory advantage persists when scaling from 60M to 1B parameters and appears on 7B-scale LLaMA.
  • The method can be used as a drop-in replacement for Adam in existing pretraining pipelines with only minor code changes.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If output-layer variance remains the dominant source of instability at larger scales, the same selective-momentum rule may continue to work without further tuning.
  • Column-wise normalization could be combined with existing gradient-compression schemes to obtain still lower memory footprints.
  • The approach invites direct measurement of per-layer gradient statistics to test whether the output layer is uniquely noisy in other architectures.
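
The per-layer measurement suggested in the last point could be sketched as follows. This is one common estimator (the unbiased per-coordinate variance of minibatch gradients, averaged over coordinates), chosen here as an assumption; the page does not state which estimator the paper uses, and the function and layer names are hypothetical.

```python
def gradient_variance(samples):
    """Estimate gradient variance for one layer.

    samples: list of flattened gradient vectors, one per minibatch, all
    evaluated at the same weights. Returns the unbiased per-coordinate
    variance averaged over coordinates.
    """
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two gradient samples")
    dim = len(samples[0])
    mean = [sum(s[k] for s in samples) / n for k in range(dim)]
    squared_dev = sum((s[k] - mean[k]) ** 2 for s in samples for k in range(dim))
    return squared_dev / ((n - 1) * dim)


def rank_layers_by_variance(per_layer_samples):
    """Return (layer name, variance) pairs sorted from noisiest to quietest."""
    scores = {
        name: gradient_variance(samples)
        for name, samples in per_layer_samples.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)
```

Running `rank_layers_by_variance` on gradients collected from another architecture would show directly whether the output layer still comes first in the ranking.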

Load-bearing premise

The two identified changes remain sufficient when models, datasets, or training lengths grow beyond the tested range.

What would settle it

A controlled pretraining run on a model larger than 7B or on a new dataset in which SCALE’s final perplexity exceeds Adam’s or the memory reduction falls below 50%.

Figures

Figures reproduced from arXiv: 2506.16659 by Andi Han, Athanasios Glentis, Jiaxiang Li, Mingyi Hong.

Figure 1: Perplexity vs. memory consumption among a number of SOTA algorithms. Solutions …
Figure 2: Comparison of SGD and Adam training loss and evaluation perplexity curves on LLaMA …
Figure 3: Estimated variance of the stochastic gradients (and momentum when applicable) for …
Figure 4: The evaluation perplexity curves of different methods on LLaMA 1B pretraining. Note …
Figure 5: Learning rate sensitivity analysis, comparing Stable-SPAM (a stabilized version of Adam) …
Original abstract

Training large language models (LLMs) relies on adaptive optimizers such as Adam, which introduce extra operations and require significantly more memory to maintain first- and second-order moments than SGD. While recent works such as GaLore, Fira and APOLLO have proposed state-compressed memory-efficient variants, a fundamental question remains: What are the minimum modifications to plain SGD needed to match state-of-the-art pretraining performance? We systematically investigate this question using a bottom-up approach, and identify two simple yet highly (memory- and compute-) efficient techniques: (1) column-wise gradient normalization (normalizing the gradient along the output dimension), that boosts SGD performance without momentum; and (2) applying first-order momentum only to the output layer, where gradient variance is highest. Combining these two techniques lead to SCALE (Stochastic Column-normAlized Last-layer momEntum), a simple optimizer for memory efficient pretraining. Across multiple models (60M-1B), SCALE matches or exceeds the performance of Adam while using only 35-45% of the total memory. It also consistently outperforms memory-efficient optimizers such as GaLore, Fira and APOLLO, making it a strong candidate for large-scale pretraining under memory constraints. For LLaMA 7B, SCALE outperforms the state-of-the-art memory-efficient methods APOLLO and Muon in both perplexity and memory consumption.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces SCALE, a minimalist optimizer formed by combining column-wise gradient normalization (along the output dimension) with first-order momentum applied exclusively to the output layer. It claims this design matches or exceeds Adam performance while using 35-45% of the memory on models from 60M to 1B parameters and outperforms APOLLO and Muon on LLaMA 7B pretraining.

Significance. If the empirical results hold, the work is significant for demonstrating that targeted, low-overhead modifications to SGD can rival full adaptive optimizers in LLM pretraining, with clear memory benefits. The bottom-up identification of the two specific techniques provides useful insight into optimizer components and could simplify large-scale training under hardware constraints.

major comments (2)
  1. [§4, LLaMA 7B results] §4 (Experiments), LLaMA 7B subsection and associated tables: the single-run comparison showing SCALE outperforming APOLLO and Muon does not include layer-wise gradient variance measurements or an ablation confirming that output-layer momentum alone suffices for hidden layers at 7B scale; this directly bears on the central sufficiency claim when scaling beyond the 60M-1B regime.
  2. [Results tables (60M-1B)] Results tables for 60M-1B models: reported perplexity or loss values for SCALE versus Adam lack standard deviations across seeds or statistical significance tests, weakening the assertion of consistent matching or exceeding performance.
minor comments (2)
  1. [Abstract] Abstract: the phrase 'Combining these two techniques lead to SCALE' contains a subject-verb agreement error and should read 'leads'.
  2. [§3] §3 (Method): the column-wise normalization operation would benefit from an explicit equation or pseudocode to clarify the exact normalization axis and any scaling factors.
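
For illustration only, an equation of the kind the second minor comment asks for might read as below. This is a reading of the abstract's description, not the paper's own notation: the symbols $G$ (gradient of a weight matrix $W$ with $n$ columns), $\eta$ (learning rate) and $\epsilon$ (a small stabilizing constant) are assumptions, as is the choice of the Euclidean norm and of which stored axis counts as a column.

```latex
\tilde{G}_{:,j} \;=\; \frac{G_{:,j}}{\lVert G_{:,j} \rVert_2 + \epsilon},
\qquad j = 1, \dots, n,
\qquad
W \;\leftarrow\; W - \eta\, \tilde{G}.
```

Under this reading every column of the update has norm close to one, so the step size per column is set by $\eta$ alone and no per-coordinate second-moment buffer is needed.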

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their detailed and constructive review. We address the major comments point by point below, providing clarifications and indicating revisions to the manuscript where applicable.

Point-by-point responses
  1. Referee: [§4, LLaMA 7B results] §4 (Experiments), LLaMA 7B subsection and associated tables: the single-run comparison showing SCALE outperforming APOLLO and Muon does not include layer-wise gradient variance measurements or an ablation confirming that output-layer momentum alone suffices for hidden layers at 7B scale; this directly bears on the central sufficiency claim when scaling beyond the 60M-1B regime.

    Authors: We acknowledge the referee's concern about validating the design choices at the 7B scale. The layer-wise gradient variance analysis and ablations for last-layer momentum were performed on models up to 1B parameters as part of our bottom-up investigation, showing that the output layer has the highest gradient variance and that momentum there suffices. At 7B scale, we performed single-run experiments due to significant computational requirements. The results demonstrate SCALE's superiority over APOLLO and Muon, consistent with the smaller-scale findings. In the revised manuscript, we have expanded the discussion in §4 to explicitly reference the variance measurements from the 1B model and explain why we expect the same principles to hold at 7B. We believe this provides sufficient support for the scaling claim.

    Revision: partial

  2. Referee: [Results tables (60M-1B)] Results tables for 60M-1B models: reported perplexity or loss values for SCALE versus Adam lack standard deviations across seeds or statistical significance tests, weakening the assertion of consistent matching or exceeding performance.

    Authors: We agree that including standard deviations and discussing statistical significance would enhance the robustness of our claims. Given the high cost of LLM pretraining, our main experiments were run with a single seed. However, we have performed additional experiments with multiple seeds for the 60M and 350M models and updated the corresponding tables to report means and standard deviations. For the 1B model, we have added a footnote noting the single-run nature and the observed consistency with smaller models. We have also included a short paragraph on the reliability of the results across scales in the revised version.

    Revision: yes
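
The multi-seed reporting discussed in this exchange can be sketched with the standard library alone. The helper names are hypothetical, Welch's t statistic is used as one reasonable choice of test rather than the one the authors adopted, and any numbers passed in are the caller's own measurements; none are taken from the paper.

```python
import math
from statistics import mean, variance


def summarize(values):
    """Return (mean, sample standard deviation) of per-seed final perplexities."""
    return mean(values), math.sqrt(variance(values))


def welch_t(a, b):
    """Welch's t statistic for two groups of per-seed results.

    A negative value means group `a` has the lower mean (lower perplexity).
    """
    standard_error = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / standard_error
```

With only a handful of seeds per optimizer the statistic is noisy, so it is best reported next to the raw per-seed values rather than in place of them.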

Circularity Check

0 steps flagged

No circularity: empirical identification of optimizer modifications


The paper follows a bottom-up empirical investigation to identify column-wise gradient normalization and output-layer-only momentum as sufficient modifications to SGD. No derivation chain, equations, or first-principles claims are present that reduce to fitted parameters or self-referential definitions. Performance claims rest on direct comparisons against Adam, GaLore, Fira, APOLLO, and Muon across 60M–1B models plus one LLaMA-7B run, which are external benchmarks. The central result is therefore self-contained against measured outcomes rather than constructed by re-labeling inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The paper is purely empirical and introduces no new mathematical axioms, free parameters, or invented entities; the two techniques are presented as direct, parameter-light modifications to SGD without additional fitted constants or postulated objects.



Forward citations

Cited by 5 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Symmetry-Compatible Principle for Optimizer Design: Embeddings, LM Heads, SwiGLU MLPs, and MoE Routers

    math.OC 2026-05 conditional novelty 7.0

    Proposes equivariant optimizers matched to the symmetry groups of embeddings, SwiGLU projections and MoE routers, with experiments showing consistent gains over AdamW on language model pre-training.

  2. Budget-aware Auto Optimizer Configurator

    cs.AI 2026-05 unverdicted novelty 6.0

    BAOC samples gradient streams to compute per-block risk metrics for cheap optimizer configs then solves a constrained optimization to minimize total risk under memory and time budgets while preserving training quality.

  3. Demystifying Manifold Constraints in LLM Pre-training

    cs.LG 2026-05 unverdicted novelty 6.0

    Manifold constraints via the new MACRO optimizer independently bound activation scales and enforce rotational equilibrium in LLM pre-training, subsuming RMS normalization and decoupled weight decay while delivering co...

  4. MuonEq: Balancing Before Orthogonalization with Lightweight Equilibration

    cs.LG 2026-03 unverdicted novelty 6.0

    MuonEq introduces pre-orthogonalization equilibration schemes that improve Muon optimizer performance during large language model pretraining.

  5. Low-rank Orthogonalization for Large-scale Matrix Optimization with Applications to Foundation Model Training

    cs.LG 2025-09 unverdicted novelty 5.0

    Proposes low-rank orthogonalization and derives low-rank Muon and MSGD variants that outperform standard Muon on GPT-2 and LLaMA pretraining while providing iteration complexity bounds.

Appendix A.1: Details of memory estimation for 1B and 7B models

Here we compute the memory estimate for both 1B and 7B LLaMA models. We only compute the major parameters, including embedding layers, attention and MLP layers. We follow prior works (Zhao et al., 2024; Han et al., …) in estimating the memory using bfloat16 format, where each floating point number occupies 2 bytes.

7B model: Pre-last layers include 6.607B parameters and the last layer includes 0.131B parameters, which in total leads to 6.738B parameters.

  • SGD: Only the parameter states are stored, which amount to 13.476G memory.
  • Adafactor: Apart from the parameter state...
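
The 7B arithmetic above can be reproduced in a few lines. The parameter counts and the 2-byte bfloat16 size come from the text. The Adam figure assumes the standard layout of weights plus two moment buffers per parameter. The SCALE figure is derived here from the description of the method (weights plus one momentum buffer for the last layer only) and is not a number quoted on this page. The constant names are hypothetical.

```python
BYTES_PER_VALUE = 2      # bfloat16, as stated in the appendix text
PRE_LAST_B = 6.607       # billions of parameters before the last layer
LAST_B = 0.131           # billions of parameters in the last layer


def memory_g(params_billion, copies=1):
    """Memory in G for `copies` bfloat16 tensors of the given size."""
    return params_billion * BYTES_PER_VALUE * copies


total_b = PRE_LAST_B + LAST_B

# SGD: weights only.
sgd_g = memory_g(total_b)

# Adam (assumed standard layout): weights + first moment + second moment.
adam_g = memory_g(total_b, copies=3)

# SCALE (derived, see lead-in): weights + momentum for the last layer only.
scale_g = memory_g(total_b) + memory_g(LAST_B)

print(f"SGD   {sgd_g:.3f}G")
print(f"Adam  {adam_g:.3f}G")
print(f"SCALE {scale_g:.3f}G ({scale_g / adam_g:.1%} of Adam)")
```

Because the last layer holds under 2% of the parameters, the extra momentum buffer adds little on top of the SGD footprint under these assumptions.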