ASVD: Activation-aware Singular Value Decomposition for Compressing Large Language Models
Pith reviewed 2026-05-20 13:43 UTC · model grok-4.3
The pith
Transforming weight matrices to absorb activation outliers enables accurate low-rank compression of LLMs by 10-30% without any retraining.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By transforming the weight matrix based on the activation distribution, the outliers in the activation matrix are absorbed into the transformed weight matrix, which enhances the accuracy of the subsequent singular value decomposition. An efficient iterative calibration process then optimizes the decomposition for each layer's specific sensitivity, allowing 10%-30% network compression and 50% KV cache reduction in a training-free manner.
What carries the argument
Activation-aware Singular Value Decomposition that first scales weights by activation statistics to absorb outliers before low-rank factorization, together with layer-wise iterative calibration to respect differing sensitivities.
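The scale-then-factorize step can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes NumPy, a diagonal scaling built from the mean absolute activation per input channel, and a hypothetical exponent `alpha`; the function names are invented for the example.

```python
import numpy as np

def activation_scaling(acts, alpha=0.5, eps=1e-6):
    # acts: (n_in, n_tokens) calibration activations.
    # Assumed form of the diagonal scaling: (mean |activation| per channel) ** alpha.
    return (np.abs(acts).mean(axis=1) + eps) ** alpha

def asvd_factors(weight, acts, rank, alpha=0.5):
    # weight: (n_out, n_in). Returns (left, right) with weight ~ left @ right.
    t = activation_scaling(acts, alpha)                         # diagonal of T, shape (n_in,)
    u, s, vt = np.linalg.svd(weight * t, full_matrices=False)   # SVD of W T
    left = u[:, :rank] * s[:rank]                               # U_k Sigma_k, shape (n_out, rank)
    right = vt[:rank] / t                                       # V_k^T T^{-1}, shape (rank, n_in)
    return left, right

rng = np.random.default_rng(0)
weight = rng.standard_normal((8, 16))
acts = rng.standard_normal((16, 32))
acts[3] *= 50.0                                                 # one outlier input channel

left, right = asvd_factors(weight, acts, rank=4)                # truncated factors
full_left, full_right = asvd_factors(weight, acts, rank=8)      # no truncation
```

At full rank the product `full_left @ full_right` reproduces `weight`, because the scaling and its inverse cancel; error enters only through truncation, and the scaling changes which directions the truncation keeps.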
If this is right
- LLMs become runnable on hardware with tighter memory budgets without retraining.
- KV cache size in long-context inference drops by half at no accuracy cost.
- Post-training compression becomes practical for many existing models.
- Layer-specific tuning avoids uniform rank choices that hurt sensitive layers.
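To make the compression figures concrete, the arithmetic below relates a target parameter ratio to the rank a single layer can keep. It assumes compression is counted as the parameters of two factors, k(m + n), against the dense m × n matrix; the paper's exact accounting may differ, and the function name and matrix size are illustrative.

```python
def max_rank_for_keep_ratio(m, n, keep_ratio):
    # Largest rank k such that the two factors, with k * (m + n) parameters,
    # fit within keep_ratio * m * n parameters.
    return int(keep_ratio * m * n // (m + n))

m = n = 4096                      # illustrative square projection size
break_even = max_rank_for_keep_ratio(m, n, 1.0)
ranks = {keep: max_rank_for_keep_ratio(m, n, keep) for keep in (0.9, 0.8, 0.7)}
```

Under this accounting a 4096 × 4096 matrix saves nothing until its rank falls below 2048, so even a 10% reduction already discards more than half of the singular directions. That is one reason a uniform rank across layers can be costly for sensitive layers.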
Where Pith is reading between the lines
- This approach might extend to other decomposition methods like QR or CUR if the same scaling step is applied first.
- Further savings could come from combining ASVD with quantization on the remaining factors.
- Testing on very long sequences would reveal whether the reduced KV cache preserves coherence over extended contexts.
Load-bearing premise
That scaling the weights according to activation statistics absorbs the outliers enough to make the low-rank approximation significantly more accurate, and that the iterative calibration finds good per-layer settings without needing extra validation data or introducing compounding errors.
What would settle it
Running the compressed model on a standard benchmark such as MMLU or GSM8K and observing an accuracy drop larger than a few percent relative to the original model, or failing to reach the stated compression ratios while remaining training-free.
Original abstract
In this paper, we introduce a new post-training compression paradigm for Large Language Models (LLMs) to facilitate their wider adoption. We delve into LLM weight low-rank decomposition, and find that the challenges of this task stem from (1) the distribution variance in the LLM activations and (2) the sensitivity difference among various kinds of layers. To address these issues, we propose a training-free approach called Activation-aware Singular Value Decomposition (ASVD). Specifically, ASVD manages activation outliers by transforming the weight matrix based on the activation distribution. This transformation allows the outliers in the activation matrix to be absorbed into the transformed weight matrix, thereby enhancing decomposition accuracy. Additionally, we propose an efficient iterative calibration process to optimize layer-specific decomposition by addressing the varying sensitivity of different LLM layers. In this way, ASVD can compress a network by 10%-30%. Based on the success of the low-rank decomposition of projection matrices in the self-attention module, we further introduce ASVD to compress the KV cache. By reducing the channel dimension of KV activations, memory requirements for KV cache can be largely reduced. ASVD can further achieve 50% KV cache reductions without performance drop in a training-free manner.
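The KV cache idea in the abstract can be illustrated with a short sketch. It assumes a key projection already factorized as `left @ right` (random placeholders here, not factors from a real model) and caches the lower-dimensional intermediate instead of the full keys; all names and sizes are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_head, rank, tokens = 64, 64, 32, 10

# Hypothetical low-rank factors of a key projection: W_k ~ left @ right.
left = rng.standard_normal((d_head, rank))
right = rng.standard_normal((rank, d_model))
hidden = rng.standard_normal((d_model, tokens))

cache = right @ hidden            # stored per token: rank channels instead of d_head
keys = left @ cache               # keys rebuilt on demand, shape (d_head, tokens)

cache_ratio = cache.size / (d_head * tokens)   # fraction of the dense key cache
```

With `rank` set to half of `d_head`, the cached tensor is half the size of the dense key cache, and the rebuilt keys equal those produced by the factorized projection. Whether accuracy survives depends on how well the factorization approximates the original projection, which this sketch does not test.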
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces Activation-aware Singular Value Decomposition (ASVD), a training-free post-training method for compressing LLMs. It identifies challenges from activation distribution variance and layer sensitivity differences, then transforms weight matrices based on activation statistics to absorb outliers before performing SVD, followed by an iterative calibration process to select layer-specific ranks. The approach is claimed to enable 10-30% network compression and, by extending the low-rank idea to self-attention projection matrices, 50% KV cache reduction without performance drop.
Significance. If the central claims are supported by rigorous experiments, the work would offer a practical, training-free compression technique that could facilitate wider deployment of LLMs on hardware with limited memory, complementing existing quantization and pruning approaches.
major comments (2)
- [§3.2] The core transformation step (weight matrix modified according to activation distribution to absorb outliers) is presented as improving SVD accuracy, but the manuscript provides no invariance proof, error bound, or demonstration that the subsequent compensation during inference exactly preserves the original forward pass on real activations. This is load-bearing for the claim that the method reduces effective approximation error without introducing bias.
- [§3.3] The iterative calibration process for optimizing per-layer decomposition is described at a high level as addressing varying layer sensitivity, yet no details are given on the objective function, convergence guarantees, or safeguards against systematic mismatch between the transformed space and the original activation distribution. This directly affects whether the training-free claim holds without hidden validation requirements.
minor comments (2)
- The abstract states compression ranges (10%-30%, 50% KV cache) but would be strengthened by explicit reference to the quantitative metrics (e.g., perplexity delta or zero-shot accuracy) reported in the experiments section.
- Notation for the activation-aware transformation and the calibration loop should be introduced with explicit equations and variable definitions to improve readability.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. We address each major comment point by point below and agree that additional formal analysis and implementation details will improve clarity. We will incorporate the suggested revisions in the next version of the paper.
Point-by-point responses
- Referee: [§3.2] The core transformation step (weight matrix modified according to activation distribution to absorb outliers) is presented as improving SVD accuracy, but the manuscript provides no invariance proof, error bound, or demonstration that the subsequent compensation during inference exactly preserves the original forward pass on real activations. This is load-bearing for the claim that the method reduces effective approximation error without introducing bias.
  Authors: We appreciate the referee highlighting the need for a rigorous justification of the transformation. The transformation uses an invertible matrix T derived from activation statistics (typically a diagonal scaling matrix based on per-channel activation magnitudes or norms) such that the original matrix product is preserved exactly: if W' = W T and the corresponding activation is scaled as A' = T^{-1} A, then W A = W' A' holds identically. The SVD is performed on the transformed W', and at inference the inverse compensation T^{-1} is applied alongside the low-rank factors, so the layer is exactly equivalent to the original before truncation and the only error introduced is the low-rank approximation error. We acknowledge that the current manuscript presents this at a conceptual level without an explicit invariance proof, error bound derivation, or side-by-side demonstration on real activations. In the revised manuscript we will add a new subsection in §3.2 containing (i) the formal proof of exact pre-approximation equivalence, (ii) a first-order error bound relating the SVD truncation error in the transformed space to the original space, and (iii) empirical verification showing that the compensated reconstruction introduces no systematic bias on held-out activations.
  Revision: yes
- Referee: [§3.3] The iterative calibration process for optimizing per-layer decomposition is described at a high level as addressing varying layer sensitivity, yet no details are given on the objective function, convergence guarantees, or safeguards against systematic mismatch between the transformed space and the original activation distribution. This directly affects whether the training-free claim holds without hidden validation requirements.
  Authors: We thank the referee for noting the lack of algorithmic detail. The iterative calibration allocates per-layer ranks under a global compression budget by repeatedly evaluating layer sensitivity, defined as the increase in perplexity (or downstream task loss) on a small calibration set when the rank of that layer is reduced. The objective is to minimize the sum of these sensitivity-weighted errors subject to the total parameter budget; at each iteration the layer with the lowest sensitivity-per-parameter is selected for rank reduction until the budget is met or a performance threshold is reached. Convergence is monitored empirically by tracking the change in calibration perplexity between iterations and stopping when it falls below a small epsilon. Mismatch between the transformed space and the original activation distribution is mitigated by re-computing activation statistics on the same calibration samples after each rank change. We agree that these elements are described only at a high level. In the revision we will expand §3.3 with the exact objective function, pseudocode of the iterative procedure, empirical convergence plots, and explicit safeguards (including re-calibration of activation statistics after each adjustment). The procedure remains strictly training-free: no gradients are computed and no model parameters are updated; only rank selection is performed using a fixed calibration set, which is standard practice for post-training compression methods.
  Revision: yes
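The greedy allocation described in the second response can be sketched as below. This is an illustration under stated assumptions, not the authors' procedure: `sensitivity` is a hypothetical callback standing in for a calibration-set measurement, the toy sensitivity model is invented, and parameters are counted as rank × (n_out + n_in) per layer.

```python
def allocate_ranks(shapes, sensitivity, budget, step=1):
    # shapes: {layer_name: (n_out, n_in)}
    # sensitivity(name, rank): hypothetical callback returning the calibration-set
    # loss increase when `name` is truncated to `rank` (no gradients involved).
    ranks = {name: min(shape) for name, shape in shapes.items()}

    def total_params():
        return sum(ranks[name] * sum(shapes[name]) for name in ranks)

    def cost_per_param(name):
        extra_loss = sensitivity(name, ranks[name] - step) - sensitivity(name, ranks[name])
        return extra_loss / (step * sum(shapes[name]))

    while total_params() > budget:
        candidates = [name for name in ranks if ranks[name] - step >= 1]
        if not candidates:
            break                      # budget unreachable while keeping rank >= 1
        cheapest = min(candidates, key=cost_per_param)
        ranks[cheapest] -= step
    return ranks

# Toy stand-in for measured sensitivity: layer "a" is 10x more sensitive than "b".
scale = {"a": 10.0, "b": 1.0}
shapes = {"a": (8, 8), "b": (8, 8)}
ranks = allocate_ranks(shapes, lambda name, r: scale[name] / r, budget=160)
```

In this toy setup every reduction falls on the less sensitive layer, which ends at a much lower rank than the sensitive one. A real run would replace the lambda with perplexity measurements on the calibration set, which is where the referee's concern about hidden validation requirements applies.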
Circularity Check
No significant circularity; method is a self-contained heuristic with independent empirical claims
full rationale
The ASVD approach defines a weight transformation derived from activation statistics to absorb outliers, applies standard SVD to the transformed matrix, and uses an iterative calibration loop to select per-layer ranks based on sensitivity. These operations are specified procedurally without defining any output metric (such as compression ratio or performance) in terms of the inputs by construction. The reported 10-30% compression and 50% KV-cache reduction are presented as experimental outcomes rather than algebraic identities or fitted predictions that reduce to the calibration data. No self-citation chains, uniqueness theorems, or ansatz smuggling appear in the derivation; the central claims remain falsifiable against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (2)
- standard math: Truncated singular value decomposition provides an optimal rank-k approximation under the Frobenius norm (Eckart-Young theorem).
- domain assumption: Activation distributions can be estimated from a small calibration set and used to guide weight transformation.
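The first axiom can be checked numerically. The sketch below, assuming NumPy and a random matrix chosen only for illustration, confirms that the Frobenius error of a rank-k truncation equals the root of the summed squares of the discarded singular values.

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.standard_normal((6, 5))
u, s, vt = np.linalg.svd(w, full_matrices=False)

k = 2
w_k = (u[:, :k] * s[:k]) @ vt[:k]           # rank-k truncation of w
err = np.linalg.norm(w - w_k)               # Frobenius norm for 2-D input
tail = np.sqrt(np.sum(s[k:] ** 2))          # energy in the discarded singular values
```

This optimality concerns the weight matrix itself, not the layer output on real activations, which is the gap the activation-aware scaling is meant to address.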
Forward citations
Cited by 24 Pith papers
- ENEC: A Lossless AI Model Compression Method Enabling Fast Inference on Ascend NPUs
  ENEC delivers 3.43X higher throughput than DietGPU and 1.12X better compression ratio than nvCOMP for lossless model weight compression on Ascend NPUs, yielding up to 6.3X end-to-end inference speedup.
- Why Smaller Is Slower? Dimensional Misalignment in Compressed LLMs
  Dimensional misalignment slows compressed LLMs on GPUs; GAC uses knapsack optimization to achieve full alignment and up to 1.5x speedup on Llama-3-8B while preserving quality.
- Decomposed Trust: Privacy, Adversarial Robustness, Ethics, and Fairness in Low-Rank LLMs
  Low-rank compression preserves training-data privacy and improves adversarial robustness but weakens personal-information protection, reduces ethical behavior in zero-shot use, and harms fairness.
- SAFE-SVD: Sensitivity-Aware Fidelity-Enforcing SVD for Physics Foundation Models
  SAFE-SVD introduces a sensitivity-aware fidelity-enforcing SVD framework for compressing physics foundation models that maintains higher accuracy than standard methods at greater compression ratios.
- OSCAR: Offline Spectral Covariance-Aware Rotation for 2-bit KV Cache Quantization
  OSCAR achieves near-BF16 accuracy for 2-bit KV cache quantization by using offline spectral covariance-aware rotations aligned with attention, plus a custom deployable INT2 kernel compatible with paged serving.
- Theory-optimal Quantization Based on Flatness
  The paper introduces the Flatness metric, derives a theory-optimal quantization solution, and presents BDQ that uses bidirectional diagonal transformations to reduce outlier impact, achieving under 1% drop at W4A4 on ...
- Relative Kinetic Utility for Reasoning-Aware Structural Pruning in Large Language Models
  RKU is a curvature-aware structural pruning framework that improves LLM reasoning accuracy at 40% sparsity, reaching 13.34% on GSM8K while outperforming baselines and better preserving out-of-distribution representations.
- Different Prompts, Different Ranks: Prompt-aware Dynamic Rank Selection for SVD-based LLM Compression
  PARSE trains a prompt-aware linear router on dense-model outputs to select dynamic SVD ranks, improving accuracy up to 10% at 0.6 compression ratio on LLaMA-7B while delivering 2.5x prefill and 2.4x decode speedups.
- FlashSVD v1.5: Making Low-Rank Transformers Inference Actually Fast
  FlashSVD v1.5 delivers up to 2.55x faster autoregressive decode and 2.39x end-to-end speedup for SVD-compressed transformers by reorganizing execution paths with dense-KV decode, packed MLP kernels, and per-layer CUDA graphs.
- Bayesian Fine-tuning in Projected Subspaces
  Bayesian fine-tuning of large models can be done efficiently by projecting uncertainties into low-dimensional subspaces, yielding improved calibration and generalization while keeping computational costs low.
- Gated Subspace Inference for Transformer Acceleration
  Gated Subspace Inference accelerates transformer linear layers 3-10x via low-rank cached subspace computation and per-token gating to skip residuals while preserving output distribution to high accuracy.
- Importance-Guided Basis Selection for Low-Rank Decomposition of Large Language Models
  BSI ranks singular-vector bases for LLM low-rank compression by estimating expected task loss increase via second-order Taylor expansion of the loss and an efficient Hessian-diagonal estimator, outperforming magnitude...
- Technical Report: Activation Residual Hessian Quantization (ARHQ) for Low-Bit LLM Quantization
  ARHQ isolates error-sensitive weight directions in LLMs via truncated SVD on the scaled matrix W G_x^{1/2} from activation residuals, improving SNR and preserving performance under aggressive low-bit quantization.
- SLaB: Sparse-Lowrank-Binary Decomposition for Efficient Large Language Models
  SLaB compresses LLM weights via sparse-lowrank-binary decomposition guided by activation-aware scores, achieving up to 36% lower perplexity than prior methods at 50% compression on Llama models.
- CoSpaDi: Compressing LLMs via Calibration-Guided Sparse Dictionary Learning
  CoSpaDi introduces a training-free sparse dictionary learning framework for post-training LLM compression that optimizes functional reconstruction error via activation-derived orthonormalization and achieves improved ...
- IMPACT: Importance-Aware Activation Space Reconstruction
  IMPACT derives a closed-form low-rank activation reconstruction from an importance-weighted covariance matrix to achieve higher compression ratios than standard methods while maintaining model accuracy.
- Fast Tensorization of Neural Networks via Slice-wise Feature Distillation
  A slice-wise feature distillation framework for independent tensorization of neural network slices to achieve scalable compression with reduced fine-tuning costs.
- IO-SVD: Input-Output Whitened SVD for Adaptive-Rank LLM Compression
  IO-SVD performs SVD-based LLM compression by constructing a KL-aware double-sided whitening space and using first-order loss estimates for heterogeneous rank allocation.
- Minimal-Intervention KV Retention via Set-Conditioned Diversity
  A minimal scoring modification to TriAttention using greedy facility-location selection with V-space redundancy penalty improves KV retention at budgets 64 and 128 on distilled reasoning models under matched-memory he...
- Minimal-Intervention KV Retention via Set-Conditioned Diversity
  A one-function modification to the TriAttention retention scorer using greedy selection under a V-space redundancy penalty outperforms seven matched mechanisms on long-form math reasoning at budgets 64 and 128.
- DeInfer: Efficient Parallel Inferencing for Decomposed Large Language Models
  DeInfer adds multiple optimizations to improve parallel inference speed for decomposed LLMs while remaining compatible with existing techniques.
- Network Edge Inference for Large Language Models: Principles, Techniques, and Opportunities
  A survey synthesizing challenges, system architectures, model optimizations, deployment methods, and resource management techniques for large language model inference at the network edge.
- NeuronMLP: Efficient LLM Inference via Singular Value Decomposition Compression and Tiling on AWS Trainium
  NeuronMLP applies SVD-based compression and Trainium-specific tiling and caching to MLP layers, delivering 1.35x kernel speedup and 1.21x end-to-end inference speedup at 0.05 compression ratio versus AWS NKI baseline.
- A Survey on Efficient Inference for Large Language Models
  The paper surveys techniques to speed up and reduce the resource needs of LLM inference, organized by data-level, model-level, and system-level changes, with comparative experiments on representative methods.