Activation Compression in LLMs: Theoretical Analysis and Efficient Algorithm

Han-Bin Fang; James Kwok; Jiang-Xin Shi; Wen-Da Wei; Yang-Di Liu; Yu-Feng Li

arXiv: 2605.01255 · v1 · submitted 2026-05-02 · 💻 cs.LG


Wen-Da Wei, Han-Bin Fang, Yang-Di Liu, Jiang-Xin Shi, James Kwok, Yu-Feng Li

Pith reviewed 2026-05-09 15:00 UTC · model grok-4.3

classification 💻 cs.LG

keywords activation compression · large language models · memory efficient training · theoretical guarantees · gradient compression · convergence analysis · LLM pretraining


The pith

Activation compression is safe for linear operators in LLMs when unbiased and does not change convergence rates.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper builds a theoretical framework to assess when compressing intermediate activations during backpropagation in large language models can be done safely. It demonstrates that unbiased compression works without issues for linear operations but introduces problems for nonlinear ones. The authors derive bounds showing that gradient variance stays controlled and prove convergence guarantees under standard L-smoothness, with the rate unchanged from ordinary training. They then introduce an activation-gradient co-compression approach that reuses low-rank factors already computed for activations to also compress gradients in linear layers at zero extra cost or error. Experiments on Qwen and LLaMA models across pretraining and fine-tuning tasks confirm the method maintains accuracy while reducing memory demands.

Core claim

We develop a theoretical framework showing that activation compression is safe for linear operators when activation compression is unbiased, but problematic for nonlinear ones. We further derive gradient variance bound and establish convergence guarantees for applying activation compression to all linear operators under the standard L-smoothness assumption, showing that it does not change the convergence rate. Guided by the theory, we propose an activation-gradient co-compression method that reuses low-rank activation factors to compress linear-layer gradients without extra computation or additional gradient error.
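The linear-versus-nonlinear distinction in this claim can be checked numerically. The sketch below is an illustration under stated assumptions, not the paper's experiment: it assumes a Gaussian random projection `Q`, scaled so that E[Q Qᵀ] = I, as the unbiased compressor, and the helper names `random_projection` and `averaged_backward` are hypothetical. It averages the backward quantity over many random compressions and compares the average with the exact value, once for a linear layer (weight gradient Xᵀ G) and once for SiLU (input gradient SiLU'(X) ⊙ G).

```python
import numpy as np

def silu_grad(x):
    """Derivative of SiLU(x) = x * sigmoid(x)."""
    s = 1.0 / (1.0 + np.exp(-x))
    return s * (1.0 + x * (1.0 - s))

def random_projection(d, r, rng):
    """Gaussian sketch Q (d x r), scaled so that E[Q @ Q.T] = I_d (assumed compressor)."""
    return rng.standard_normal((d, r)) / np.sqrt(r)

def averaged_backward(n_trials=4000, n=8, d=16, m=4, r=8, seed=0):
    """Relative error of the trial-averaged backward quantity vs. the exact one."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n, d))    # activation saved for the backward pass
    G = rng.standard_normal((n, m))    # upstream gradient dL/dY for Y = X @ W
    Gs = rng.standard_normal((n, d))   # upstream gradient for SiLU(X)

    true_lin = X.T @ G                 # exact weight gradient of the linear layer
    true_silu = silu_grad(X) * Gs      # exact input gradient of SiLU

    acc_lin = np.zeros_like(true_lin)
    acc_silu = np.zeros_like(true_silu)
    for _ in range(n_trials):
        Q = random_projection(d, r, rng)
        X_hat = (X @ Q) @ Q.T          # reconstruction with E[X_hat] = X
        acc_lin += X_hat.T @ G
        acc_silu += silu_grad(X_hat) * Gs

    def rel(est, ref):
        return np.linalg.norm(est - ref) / np.linalg.norm(ref)

    return rel(acc_lin / n_trials, true_lin), rel(acc_silu / n_trials, true_silu)

err_linear, err_silu = averaged_backward()
print(f"linear layer: {err_linear:.4f}   SiLU: {err_silu:.4f}")
```

Under these assumptions the averaged linear-layer gradient approaches the exact one as the number of trials grows, while the SiLU average keeps a gap, because the expectation of a nonlinear function of the reconstruction is generally not that function of the expectation.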

What carries the argument

The activation-gradient co-compression method, which reuses low-rank activation factors computed for linear layers to also compress the corresponding gradients without added computation or error.
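A minimal sketch of how such reuse could look, assuming a random-projection compressor along the feature dimension (the paper's actual factorization is not given on this page, and `compress_activation` and `cocompressed_weight_grad` are hypothetical names). For a linear layer Y = XW with upstream gradient G, the gradient computed from the reconstructed activation (XQ)Qᵀ factors exactly as Q · [(XQ)ᵀ G], so the stored factor XQ and the projection Q already give a low-rank form of the weight gradient.

```python
import numpy as np

def compress_activation(X, r, rng):
    """Project X (n x d) onto r random directions; return (X @ Q, Q)."""
    d = X.shape[1]
    Q = rng.standard_normal((d, r)) / np.sqrt(r)  # assumed sketch with E[Q @ Q.T] = I
    return X @ Q, Q

def cocompressed_weight_grad(XQ, Q, G):
    """Low-rank factors (A, B) of the weight gradient, with A @ B == X_hat.T @ G."""
    A = Q            # d x r, already available from the forward pass
    B = XQ.T @ G     # r x m, computed from the stored factor instead of a dense X
    return A, B

rng = np.random.default_rng(0)
n, d, m, r = 128, 64, 32, 8
X = rng.standard_normal((n, d))
G = rng.standard_normal((n, m))

XQ, Q = compress_activation(X, r, rng)
A, B = cocompressed_weight_grad(XQ, Q, G)

# Gradient obtained by first reconstructing the activation, for comparison.
dense = (XQ @ Q.T).T @ G
max_gap = np.max(np.abs(A @ B - dense))

stored_factored = A.size + B.size   # d*r + r*m numbers
stored_dense = d * m                # full d x m gradient
print(max_gap, stored_factored, stored_dense)
```

In this sketch the factored gradient matches the reconstructed-activation gradient up to floating-point rounding, which is one way to read the "no additional gradient error" claim: any error comes from the activation compression itself, not from the reuse.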

If this is right

Activation compression can be applied to every linear operator in the model without altering the convergence rate.
Reusing low-rank factors for gradients adds no extra error beyond the original activation compression.
The approach reduces memory use for both activations and gradients in a single pass.
The method achieves competitive accuracy on pretraining and multiple fine-tuning benchmarks for Qwen and LLaMA models.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The linear-versus-nonlinear distinction could guide selective compression strategies in architectures that mix both types of layers.
Memory savings from this method may enable larger batch sizes or longer context lengths during training on the same hardware.
The framework might extend to other memory bottlenecks, such as compressing optimizer states that depend on linear-layer outputs.

Load-bearing premise

The loss function is L-smooth and activation compression for linear operators introduces no bias.
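
For reference, the premise can be written out. The following is the textbook form of L-smoothness and the standard nonconvex SGD bound for an unbiased gradient estimator with bounded variance. It is a sketch of the kind of statement involved, not the paper's theorem, and the symbols $f$, $g_t$, $\eta$, $\sigma^2$ and $f^\star$ are notation assumed here.

$$\|\nabla f(x) - \nabla f(y)\| \le L\,\|x - y\| \quad \text{for all } x, y,$$

$$\mathbb{E}[g_t \mid x_t] = \nabla f(x_t), \qquad \mathbb{E}\big[\|g_t - \nabla f(x_t)\|^2 \mid x_t\big] \le \sigma^2 .$$

With the update $x_{t+1} = x_t - \eta\, g_t$ and a step size $\eta \le 1/L$:

$$\frac{1}{T}\sum_{t=0}^{T-1} \mathbb{E}\,\|\nabla f(x_t)\|^2 \;\le\; \frac{2\,\big(f(x_0) - f^\star\big)}{\eta\, T} + L\,\eta\,\sigma^2 .$$

In this form, unbiased compression enters only through a larger $\sigma^2$. Choosing $\eta \propto 1/\sqrt{T}$ gives an $O(1/\sqrt{T})$ rate with or without compression, which is consistent with the claim that the rate is unchanged.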

What would settle it

A simple linear-layer training run that applies the compression yet shows either a slower convergence rate or noticeably higher final loss than the uncompressed baseline.
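
A toy version of such a run can be sketched as follows. Everything here is an assumption made for illustration: a single linear layer fit by SGD on synthetic data, a Gaussian random projection as the unbiased compressor, and the hypothetical function name `train_linear`. The forward pass uses exact activations and only the activation saved for the weight gradient is compressed.

```python
import numpy as np

def train_linear(compress, steps=500, n=64, d=32, m=4, r=8, lr=0.01, seed=0):
    """SGD on a synthetic linear-regression task; returns the per-step loss."""
    rng = np.random.default_rng(seed)
    W_true = rng.standard_normal((d, m))
    W = np.zeros((d, m))
    losses = []
    for _ in range(steps):
        X = rng.standard_normal((n, d))
        R = X @ W - X @ W_true                 # residual; forward pass is exact
        losses.append(0.5 * np.mean(np.sum(R ** 2, axis=1)))
        G = R / n                              # dL/dY for Y = X @ W
        if compress:
            Q = rng.standard_normal((d, r)) / np.sqrt(r)  # E[Q @ Q.T] = I
            X_saved = (X @ Q) @ Q.T            # unbiased low-rank reconstruction
        else:
            X_saved = X
        W -= lr * (X_saved.T @ G)              # weight gradient from saved activation
    return np.array(losses)

base = train_linear(compress=False)
comp = train_linear(compress=True)
print("uncompressed: first", base[0], "last-20 mean", base[-20:].mean())
print("compressed:   first", comp[0], "last-20 mean", comp[-20:].mean())
```

Comparing the two loss curves on a log scale is the informative part. A visibly slower decay or a higher plateau for the compressed run, beyond what seed-to-seed noise explains, would count against the claim in this toy setting.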

Figures

Figures reproduced from arXiv: 2605.01255 by Han-Bin Fang, James Kwok, Jiang-Xin Shi, Wen-Da Wei, Yang-Di Liu, Yu-Feng Li.

**Figure 1.** Estimated GPU memory consumption during fine-tuning of a LLaMA3-3B model on the GSM8K dataset (batch size = 32). Adjacent text: "… and intermediate activations (Narayanan et al., 2021; Zhao et al., 2024; Wang et al., 2025b). As a result, memory consumption has become one of the primary bottlenecks in LLM training …"

**Figure 2.** Comparison between the standard low-rank activation compression mechanism and our activation-gradient co-compression method. Adjacent text: "… (i) Are the compression-induced gradients unbiased? and (ii) Will gradient errors propagate upstream to earlier operators and accumulate across layers, which may potentially lead to large gradient variance? Prior analyses (Evans & Aamodt, 2021; Chen et al., 2021) are developed mainl…"

**Figure 3.** Comparison of training loss for component-wise activation compression and our method against SFT.

**Figure 4.** Frobenius norm of the Jacobian product (x-axis: backpropagation depth; y-axis: Jacobian Frobenius norm). Adjacent text: "… against SFT at ranks 8 and 32, while the middle panel reports SiLU as a representative nonlinear operator. The figure shows that compressing activations of nonlinear components makes the loss difficult to decrease and prevents stable convergence. In contrast, compressing activations of linear operators does not impair training convergence, but only introduces minor…"

**Figure 5.** Estimated GPU memory usage under different batch sizes (y-axis: batch size). Adjacent text: "… as they compress the optimizer-state matrices and thereby reduce the cost of parameter updates. Ours incurs additional computation mainly due to the low-rank factorization of activations, leading to an approximate 10.8% increase in training time relative to the SFT. The results suggest that our methods achieve favorable time overh…"

**Figure 6.** Training loss curves of RSVD and RP across different datasets.

**Figure 7.** Pre-training perplexity on the C4 dataset for different methods.

**Figure 8.** Comparison of training loss under different compression components.

**Figure 9.** Frobenius norm of randomly sampled Jacobian matrices.

Original abstract

Training large language models (LLMs) is highly memory-intensive, as training must store not only weights and optimizer states but also intermediate activations for backpropagation. While existing memory-efficient methods largely focus on gradients and optimizer states, activation compression is less well established due to the lack of LLM-tailored theory and guarantees. In this work, we develop a theoretical framework showing that activation compression is safe for linear operators when activation compression is unbiased, but problematic for nonlinear ones. We further derive gradient variance bound and establish convergence guarantees for applying activation compression to all linear operators under the standard $L$-smoothness assumption, showing that it does not change the convergence rate. Guided by the theory, we propose an activation-gradient co-compression method that reuses low-rank activation factors to compress linear-layer gradients without extra computation or additional gradient error. We conduct extensive experiments on Qwen and LLaMA models using a pretraining benchmark and multiple fine-tuning benchmarks to validate our theory and demonstrate competitive performance of our method in both accuracy and compression efficiency. We provide our code in the supplementary material for reproducibility.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper supplies a theory that unbiased activation compression leaves convergence rates unchanged for linear layers under L-smoothness, plus a co-compression method that reuses low-rank factors for gradients.


The core claim is that activation compression stays safe on linear operators when it remains unbiased, while it causes problems on nonlinear ones. The authors derive a gradient variance bound and show that the standard L-smoothness assumption still yields the original convergence rate when compression is limited to linear layers. They then build an algorithm that reuses the low-rank factors already computed for activations to compress the corresponding gradients, claiming this adds no extra computation or gradient error. Experiments on Qwen and LLaMA models across pretraining and fine-tuning tasks report competitive accuracy at higher compression ratios, and code is released for checking the implementation.

Referee Report

1 major / 1 minor

Summary. The paper develops a theoretical framework for activation compression in LLM training, claiming that unbiased compression is safe for linear operators (but problematic for nonlinear ones), derives a gradient variance bound, and proves convergence guarantees under standard L-smoothness showing that the rate is unchanged when compressing activations in all linear operators. Guided by this, it proposes an activation-gradient co-compression algorithm that reuses low-rank factors from the activation compression to compress the corresponding gradients with no extra computation or additional gradient error. The claims are supported by experiments on Qwen and LLaMA models across pretraining and fine-tuning benchmarks, with code provided for reproducibility.

Significance. If the theoretical results and the unbiasedness claim for the co-compression operator hold, the work supplies a principled way to reduce activation memory in LLM training while preserving convergence rates, which could meaningfully improve scalability. The explicit reuse of low-rank factors for gradient compression without extra cost is a practical strength, and the provision of code supports reproducibility.

major comments (1)

[Section describing the co-compression algorithm and its theoretical justification] The convergence theorem (under L-smoothness) and gradient variance bound are derived for unbiased activation compression applied independently to linear operators. The co-compression method reuses the same low-rank factors to compress gradients and asserts “no additional gradient error,” but the manuscript does not re-derive the variance bound or prove that the joint operator remains unbiased (i.e., E[compressed gradient] equals the true gradient). This step is load-bearing for transferring the earlier guarantee to the concrete algorithm.
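
The missing step can be sketched under explicit assumptions that are not taken from the manuscript: a linear layer $Y = XW$ with upstream gradient $G = \partial \mathcal{L} / \partial Y$, and a random projection $Q$ with $\mathbb{E}[QQ^\top] = I$ drawn independently of $X$ and $G$ (the forward pass uses the exact $X$, so $G$ does not depend on $Q$).

$$\hat{X} = (XQ)\,Q^\top, \qquad \widehat{\nabla}_W = \hat{X}^\top G = Q\,\big[(XQ)^\top G\big],$$

$$\mathbb{E}\big[\widehat{\nabla}_W \mid X, G\big] = \mathbb{E}[QQ^\top]\, X^\top G = X^\top G = \nabla_W, \qquad \nabla_X = G\,W^\top .$$

Under these assumptions the factored gradient is unbiased, and because $\nabla_X$ does not involve the stored activation, the compression error of a linear layer does not propagate upstream. Whether the paper's compressor satisfies the independence and unbiasedness assumptions, for example when the factors are data-dependent, is what the comment asks the authors to show.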

minor comments (1)

[Experiments] Clarify in the experimental section the precise compression ratios, rank choices, and any data-exclusion or hyperparameter rules used in the Qwen/LLaMA runs so that the reported accuracy numbers can be directly compared to the theoretical variance bound.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the careful reading and constructive feedback. We address the major comment below.


Referee: [Section describing the co-compression algorithm and its theoretical justification] The convergence theorem (under L-smoothness) and gradient variance bound are derived for unbiased activation compression applied independently to linear operators. The co-compression method reuses the same low-rank factors to compress gradients and asserts “no additional gradient error,” but the manuscript does not re-derive the variance bound or prove that the joint operator remains unbiased (i.e., E[compressed gradient] equals the true gradient). This step is load-bearing for transferring the earlier guarantee to the concrete algorithm.

Authors: We agree that an explicit verification for the co-compression operator strengthens the presentation. In the revised manuscript we will add a short subsection proving that the joint operator remains unbiased: because the low-rank factors are obtained from the unbiased activation compression and reused identically for the corresponding linear-layer gradient, the expectation of the compressed gradient equals the true gradient with no additional bias term. Consequently the existing gradient-variance bound applies directly and the L-smoothness convergence rate is unchanged. We will also state explicitly that the reuse introduces no extra error beyond the already-accounted-for activation-compression variance.

revision: yes

Circularity Check

0 steps flagged

No significant circularity; theory rests on external L-smoothness assumption


The paper's central derivation develops a framework for unbiased activation compression on linear operators, derives a gradient variance bound, and invokes the standard L-smoothness assumption to prove that convergence rate is unchanged. No quoted equations or steps reduce a claimed prediction or guarantee to a fitted quantity or self-citation by construction. The co-compression algorithm is presented as guided by the theory with an explicit claim of no additional error, but the provided text contains no self-referential reduction where the bound is applied to the joint operator without independent verification. This is a normal non-finding for a theory paper anchored in standard assumptions.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Central claims rest on the L-smoothness assumption from optimization theory and the unbiasedness condition for linear operators; no free parameters or invented entities are mentioned in the abstract.

axioms (1)

domain assumption: standard L-smoothness assumption
Invoked to establish that compression does not change the convergence rate.



Reference graph

Works this paper leans on

12 extracted references · 4 canonical work pages · 2 internal anchors

[1] Evans, R. D. and Aamodt, T. M. AC-GC: Lossy activation compression with guaranteed convergence. In Advances in Neural Information Processing Systems 34: Annual Conference on Neural Information Processing Systems 2021, pp. 27434–27448, 2021.
[2] Grattafiori, A., Dubey, A., Jauhri, A., Pandey, A., Kadian, A., Al-Dahle, A., Letman, A., Mathur, A., Schelten, A., Vaughan, A., et al. The Llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024.
[3] Lester, B., Al-Rfou, R., and Constant, N. The power of scale for parameter-efficient prompt tuning. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pp. 3045–3059, 2021.
[4] Li, B., Chen, J., and Zhu, J. Memory efficient optimizers with 4-bit states. In Advances in Neural Information Processing Systems 36 (NeurIPS 2023), 2023.
[5] Muhamed, A., Li, O., Woodruff, D. P., Diab, M. T., and Smith, V. GRASS: Compute efficient low-memory LLM training with structured sparse gradients. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pp. 14978–15003, 2024.
[6] Shamshoum, Y., Hodos, N., Sieradzki, Y., and Schuster, A. CompAct: Compressed activations for memory-efficient LLM training. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 1511–1524, 2025.
[7] Vu, M. H., Garpebring, A., Nyholm, T., and Löfstedt, T. Compressing the activation maps in deep convolutional neural networks and its regularizing effect. Transactions on Machine Learning Research, 2024.
[8] Wang, A., Singh, A., Michael, J., Hill, F., Levy, O., and Bowman, S. R. GLUE: A multi-task benchmark and analysis platform for natural language understanding. In Proceedings of the 2018 EMNLP Workshop BlackboxNLP, pp. 353–355, 2018.
[9] Yang, A., Li, A., Yang, B., Zhang, B., Hui, B., Zheng, B., Yu, B., Gao, C., Huang, C., Lv, C., et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025.
[10] The experimental results indicate that operator activations are not strictly low-rank, which supports our theoretical explanation for why compressing the activations of linear operators has only a minor impact on model training. In contrast, prior work typically attributes the negligible accuracy degradation to the assumption that activations have very low ...
[11] ... and CoLA (Liu et al., 2025). Figure 7 presents the pre-training perplexity on the C4 dataset for the LLaMA3-1B model. Lower perplexity indicates better language modeling performance. We observe that our method outperforms CoLA during the first 3,000 training steps, but is later surpassed by CoLA. Galore performs strongly in the pre-training setting, altho...
[12] exactness. Plot residue; recoverable axis label: "Training Step" (ticks 2k, 4k, 6k, 8k).
