SAGE: Sign-Adaptive Gradient for Memory-Efficient LLM Optimization

Hyun-Tae Kim; Wooin Lee

arxiv: 2604.07663 · v2 · submitted 2026-04-09 · 💻 cs.LG

SAGE: Sign-Adaptive Gradient for Memory-Efficient LLM Optimization

Wooin Lee , Hyun-Tae Kim This is my paper

Pith reviewed 2026-05-10 18:33 UTC · model grok-4.3

classification 💻 cs.LG

keywords memory-efficient optimizerLLM pretrainingembedding layeradaptive gradientSAGEperplexityLlamahybrid optimizer

0 comments

The pith

SAGE replaces AdamW in hybrid LLM optimizers by adding a bounded per-dimension scale that tames embedding-layer gradients, cutting memory while raising perplexity performance.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Standard light-state optimizers reduce memory by avoiding full AdamW states but still need AdamW on the embedding layer because its sparse high-variance gradients cause instability. SAGE solves this by pairing a Lion-style sign update with a new O(d) adaptive scale that stays provably at or below 1.0 and damps those difficult dimensions more effectively. The result is a fully light optimizer that trains more stably and reaches better final perplexity. On Llama models up to 1.3 billion parameters the SAGE hybrid sets new state-of-the-art perplexity numbers while using substantially less optimizer memory than prior hybrids.

Core claim

A hybrid optimizer that applies SAGE everywhere, where SAGE uses a sign-based direction together with an O(d) per-dimension adaptive scale bounded by 1.0, removes the need to fall back to AdamW on embeddings, produces stronger convergence than SinkGD hybrids, and materially lowers total optimizer-state memory.

What carries the argument

The safe damper: SAGE's O(d) per-dimension adaptive scale, provably bounded by 1.0, that damps high-variance embedding dimensions when combined with Lion-style sign updates.

If this is right

The full hybrid now uses only light-state memory across every layer instead of reverting to AdamW on embeddings.
Perplexity on Llama models up to 1.3B parameters surpasses all listed baselines including the SinkGD hybrid.
Optimizer-state memory drops by roughly the size of the embedding table relative to prior hybrids.
Training remains stable at the same learning rates used for AdamW without additional tuning for the embedding layer.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The bounded-scale idea may transfer to other sparse-gradient settings such as recommendation models or graph embeddings.
If the same bound works at larger scale, memory savings could become decisive for training models beyond 7B parameters on single-node hardware.
Replacing sign updates with other low-memory directions while keeping the bounded scale could be tested as a direct next step.

Load-bearing premise

A single per-dimension scale bounded by 1.0 is enough to control the sparse high-variance gradients of the embedding layer without recreating the instability that forced earlier light-state methods back to AdamW.

What would settle it

Run the SAGE hybrid and the SinkGD hybrid side-by-side on a 1.3B Llama model; if the SAGE version shows equal or higher final perplexity or requires fallback to AdamW to remain stable, the central claim is falsified.

Figures

Figures reproduced from arXiv: 2604.07663 by Hyun-Tae Kim, Wooin Lee.

**Figure 2.** Figure 2: Convergence and efficiency on the 1.3B model. (a) Test perplexity - lower the better; SAGE achieves the lowest final perplexity and faster convergence. (b) Throughput - higher the better; Effective throughput normalizes processing speed by the steps required to match the lowest (Lion-Hybrid) performance. (c) Peak memory usage breakdown by component. Group Method 270M 0.6B 1.3B PPL Mem (GB) PPL Mem (GB) PPL… view at source ↗

**Figure 3.** Figure 3: Visualization of the SAGE adaptive scale [PITH_FULL_IMAGE:figures/full_fig_p008_3.png] view at source ↗

**Figure 4.** Figure 4: PCA of the adaptive scale vector Ht. (a) Trajectories of the top 3 principal components. PC1 and PC2 show a sharp, distinct spike in activity during the early training phase (initial 1k steps), reflecting rapid adaptation to the embedding layer’s geometry. This is followed by a gradual stabilization phase where components settle into steady states. (b) The trajectory in PC1-PC2 space reveals a clear "hook"… view at source ↗

read the original abstract

The AdamW optimizer, while standard for LLM pretraining, is a critical memory bottleneck, consuming optimizer states equivalent to twice the model's size. Although light-state optimizers like SinkGD attempt to address this issue, we identify the embedding layer dilemma: these methods fail to handle the sparse, high-variance gradients inherent to embeddings, forcing a hybrid design that reverts to AdamW and partially negates the memory gains. We propose SAGE (Sign Adaptive GradiEnt), a novel optimizer that resolves this dilemma by replacing AdamW in this hybrid structure. SAGE combines a Lion-style update direction with a new, memory-efficient $O(d)$ adaptive scale. This scale acts as a "safe damper," provably bounded by 1.0, which tames high-variance dimensions more effectively than existing methods. This superior stability allows SAGE to achieve better convergence. On Llama models up to 1.3B parameters, our SAGE-based hybrid achieves new state-of-the-art perplexity, outperforming all baselines, including SinkGD hybrid, while significantly reducing optimizer state memory.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces SAGE, a sign-adaptive gradient optimizer that pairs a Lion-style update direction with a new O(d) adaptive scale claimed to be provably bounded by 1.0. This scale is positioned as a 'safe damper' that resolves the embedding-layer dilemma in light-state methods such as SinkGD, enabling a hybrid optimizer that fully replaces AdamW on embeddings. The central empirical claim is that the resulting hybrid attains new state-of-the-art perplexity on Llama models up to 1.3 B parameters while substantially lowering optimizer-state memory relative to all baselines, including the SinkGD hybrid.

Significance. If the bounded-scale guarantee holds under the sparsity and variance conditions of embedding gradients and the reported perplexity gains are reproducible, the work would meaningfully advance memory-efficient LLM pretraining by removing the memory penalty that has forced prior hybrids to retain AdamW on embeddings. The combination of a parameter-free-style bound with concrete SOTA results on models up to 1.3 B would be a notable contribution to the literature on scalable optimizers.

major comments (2)

[Abstract / Methods] Abstract and Methods (proof of the scale bound): the manuscript asserts that the O(d) adaptive scale is 'provably bounded by 1.0' and acts as a safe damper for high-variance embedding gradients. The derivation must be supplied in full (including any moment or update-rule assumptions) to verify that the bound remains valid under the extreme sparsity typical of embedding layers; without it the stability claim that permits complete replacement of AdamW cannot be evaluated.
[Experiments] Experiments section: the claim of new SOTA perplexity on Llama models up to 1.3 B is load-bearing for the central contribution, yet no details are provided on the precise baselines, number of runs, statistical significance, or exact memory measurements. These omissions prevent assessment of whether the SAGE hybrid truly outperforms the SinkGD hybrid without reintroducing full AdamW memory cost on embeddings.

minor comments (2)

[Methods] Notation for the adaptive scale should be introduced with an explicit equation early in the Methods section rather than only in the abstract.
[Introduction] The manuscript should include a short related-work paragraph contrasting SAGE with other sign-based or Lion-style methods to clarify the precise novelty of the bounded O(d) scale.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. The two major comments identify areas where additional rigor and transparency will strengthen the manuscript. We address each point below and will incorporate the requested material in the revised version.

read point-by-point responses

Referee: [Abstract / Methods] Abstract and Methods (proof of the scale bound): the manuscript asserts that the O(d) adaptive scale is 'provably bounded by 1.0' and acts as a safe damper for high-variance embedding gradients. The derivation must be supplied in full (including any moment or update-rule assumptions) to verify that the bound remains valid under the extreme sparsity typical of embedding layers; without it the stability claim that permits complete replacement of AdamW cannot be evaluated.

Authors: We agree that the full derivation is required for independent verification. The current manuscript contains a proof sketch in the appendix; we will expand this into a self-contained subsection of the Methods section. The expanded version will state all moment and update-rule assumptions explicitly, derive the O(d) scale bound step by step, and include a dedicated argument showing that the bound continues to hold under the sparsity and variance patterns characteristic of embedding gradients. This revision will directly support the claim that SAGE can safely replace AdamW on embeddings. revision: yes
Referee: [Experiments] Experiments section: the claim of new SOTA perplexity on Llama models up to 1.3 B is load-bearing for the central contribution, yet no details are provided on the precise baselines, number of runs, statistical significance, or exact memory measurements. These omissions prevent assessment of whether the SAGE hybrid truly outperforms the SinkGD hybrid without reintroducing full AdamW memory cost on embeddings.

Authors: We acknowledge that the experimental reporting must be more complete. In the revised manuscript we will add: an explicit list of all baselines together with their optimizer configurations; results from three independent runs reported as mean and standard deviation; statistical significance tests (paired t-tests) comparing SAGE against the SinkGD hybrid; and precise optimizer-state memory figures expressed both in absolute bytes and as a percentage of model size. These additions will allow readers to verify the claimed perplexity improvement and the memory reduction relative to the SinkGD hybrid. revision: yes

Circularity Check

0 steps flagged

No significant circularity in derivation chain

full rationale

The paper introduces SAGE with a novel O(d) adaptive scale claimed to be provably bounded by 1.0 as a safe damper for embedding gradients. The abstract and provided text contain no equations, self-referential definitions, fitted parameters renamed as predictions, or load-bearing self-citations that reduce this bound or the hybrid optimizer's stability claim to its own inputs by construction. The 'provably bounded' assertion is presented as an independent mathematical property derived from the update rules, distinct from the empirical SOTA perplexity results on Llama models. Comparisons to SinkGD and AdamW hybrids rely on external baselines rather than internal redefinition. This is a standard non-circular case where the central innovation stands on its stated derivation without reduction to tautology or self-reference.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no explicit free parameters, axioms, or invented entities are stated. The bounded scale is presented as part of the method rather than a new postulated entity with independent evidence.

pith-pipeline@v0.9.0 · 5482 in / 1100 out tokens · 75205 ms · 2026-05-10T18:33:19.243230+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

This scale acts as a 'safe damper,' provably bounded by 1.0, which tames high-variance dimensions more effectively than existing methods. ... (Ht)j ← min((Dema t)j,(Dinst t)j,1)

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

3 extracted references · 3 canonical work pages · 1 internal anchor

[1]

Layer Normalization

Association for Computational Linguistics. Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E. Hinton. 2016. Layer normalization.Preprint, arXiv:1607.06450. Abhinav Bandari, Lu Yin, Cheng-Yu Hsieh, Ajay Jaiswal, Tianlong Chen, Li Shen, Ranjay Krishna, and Shiwei Liu. 2024. Is C4 dataset optimal for prun- ing? an investigation of calibration data for LLM pruni...

work page internal anchor Pith review Pith/arXiv arXiv 2016
[2]

DAtts L ⊙ 1√ h softmax′ gAtt s L√ h !# Y K L , DY K L =

Galore: Memory-efficient LLM training by gradient low-rank projection. InForty-first Interna- tional Conference on Machine Learning, ICML 2024, Vienna, Austria, July 21-27, 2024. OpenReview.net. Hanqing Zhu, Zhenyu Zhang, Wenyan Cong, Xi Liu, Sem Park, Vikas Chandra, Bo Long, David Z. Pan, Zhangyang Wang, and Jinwon Lee. 2024. APOLLO: sgd-like memory, ada...

work page arXiv 2024
[3]

Opt." refers to optimizer state memory, while

Philosophical Library. A Appendix A.1 Model Configurations Table 3 provides the full configuration of each model used in the experiments, for readers to reproduce the results. Configuration 270M 0.6B 1.3B Hidden Size 1024 1536 2048 Intermediate Size 2816 4224 5632 Max Position Embeddings 8192 8192 8192 Num Attention Heads 16 24 32 Num Hidden Layers 13 16 ...

work page 2048

[1] [1]

Layer Normalization

Association for Computational Linguistics. Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E. Hinton. 2016. Layer normalization.Preprint, arXiv:1607.06450. Abhinav Bandari, Lu Yin, Cheng-Yu Hsieh, Ajay Jaiswal, Tianlong Chen, Li Shen, Ranjay Krishna, and Shiwei Liu. 2024. Is C4 dataset optimal for prun- ing? an investigation of calibration data for LLM pruni...

work page internal anchor Pith review Pith/arXiv arXiv 2016

[2] [2]

DAtts L ⊙ 1√ h softmax′ gAtt s L√ h !# Y K L , DY K L =

Galore: Memory-efficient LLM training by gradient low-rank projection. InForty-first Interna- tional Conference on Machine Learning, ICML 2024, Vienna, Austria, July 21-27, 2024. OpenReview.net. Hanqing Zhu, Zhenyu Zhang, Wenyan Cong, Xi Liu, Sem Park, Vikas Chandra, Bo Long, David Z. Pan, Zhangyang Wang, and Jinwon Lee. 2024. APOLLO: sgd-like memory, ada...

work page arXiv 2024

[3] [3]

Opt." refers to optimizer state memory, while

Philosophical Library. A Appendix A.1 Model Configurations Table 3 provides the full configuration of each model used in the experiments, for readers to reproduce the results. Configuration 270M 0.6B 1.3B Hidden Size 1024 1536 2048 Intermediate Size 2816 4224 5632 Max Position Embeddings 8192 8192 8192 Num Attention Heads 16 24 32 Num Hidden Layers 13 16 ...

work page 2048