ASVD: Activation-aware Singular Value Decomposition for Compressing Large Language Models

Dawei Yang; Guangyu Sun; Qiang Wu; Yan Yan; Yue Song; Yuzhang Shang; Zhihang Yuan

arxiv: 2312.05821 · v5 · pith:ISKLA3BYnew · submitted 2023-12-10 · 💻 cs.CL

Zhihang Yuan, Yuzhang Shang, Yue Song, Dawei Yang, Qiang Wu, Yan Yan, Guangyu Sun

Pith reviewed 2026-05-20 13:43 UTC · model grok-4.3

classification 💻 cs.CL

keywords LLM compression · singular value decomposition · post-training compression · low-rank approximation · KV cache compression · activation distribution · model efficiency

The pith

Transforming weight matrices to absorb activation outliers enables accurate low-rank compression of LLMs by 10-30% without any retraining.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Large language models have activations with wide value ranges, and their layers differ in how much approximation error they can tolerate. The paper shows that scaling each weight matrix by statistics of its input activations folds the extreme values into the weights themselves. Singular value decomposition of the scaled matrix then yields a better low-rank approximation, because the truncation error is no longer amplified by a few outlier activation channels. An iterative calibration procedure then adjusts the decomposition for each layer according to its measured sensitivity. Applied to both the projection matrices and the key-value caches in attention, the method shrinks the model by 10 to 30 percent and halves cache memory, with accuracy reported as nearly unchanged.

Core claim

By transforming the weight matrix based on the activation distribution, the outliers in the activation matrix are absorbed into the transformed weight matrix, which enhances the accuracy of the subsequent singular value decomposition. An efficient iterative calibration process then optimizes the decomposition for each layer's specific sensitivity, allowing 10%-30% network compression and 50% KV cache reduction in a training-free manner.
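The claim can be made concrete with a small NumPy sketch. It is an illustration under stated assumptions, not the authors' implementation: the diagonal scaling is taken to be the per-channel mean absolute activation (one plausible reading of "activation distribution"), and the names `truncated_svd` and `asvd_factors`, the synthetic data, and the outlier pattern are all hypothetical.

```python
import numpy as np

def truncated_svd(M, k):
    """Best rank-k approximation of M obtained by truncating its SVD."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return (U[:, :k] * s[:k]) @ Vt[:k]

def asvd_factors(W, X, k, eps=1e-8):
    """Activation-aware rank-k factors (L, R) such that W ~= L @ R.

    Assumption: the transform is diagonal, with one scale per input
    channel equal to the mean |activation| over the calibration tokens.
    """
    s_diag = np.abs(X).mean(axis=1) + eps          # shape (d_in,)
    # W * s_diag scales column j of W by s_diag[j], i.e. W @ diag(s_diag).
    U, s, Vt = np.linalg.svd(W * s_diag, full_matrices=False)
    L = U[:, :k] * s[:k]                           # (d_out, k)
    R = Vt[:k] / s_diag                            # (k, d_in): undoes the scaling
    return L, R

rng = np.random.default_rng(0)
d_out, d_in, n_tokens, k = 64, 64, 256, 16
W = rng.standard_normal((d_out, d_in))
X = rng.standard_normal((d_in, n_tokens))
X[:4] *= 50.0                                      # synthetic outlier channels

Y = W @ X
plain_err = np.linalg.norm(Y - truncated_svd(W, k) @ X)
L, R = asvd_factors(W, X, k)
asvd_err = np.linalg.norm(Y - L @ (R @ X))
print(f"plain SVD output error:  {plain_err:.1f}")
print(f"scaled SVD output error: {asvd_err:.1f}")
```

Both variants store the same number of parameters at rank `k`; only the matrix handed to the SVD differs. The comparison is on output error `||WX - W_k X||`, which is the quantity that matters for the forward pass, rather than on weight error alone.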

What carries the argument

Activation-aware Singular Value Decomposition that first scales weights by activation statistics to absorb outliers before low-rank factorization, together with layer-wise iterative calibration to respect differing sensitivities.

If this is right

LLMs become runnable on hardware with tighter memory budgets without retraining.
KV cache size in long-context inference drops by half at no accuracy cost.
Post-training compression becomes practical for many existing models.
Layer-specific tuning avoids uniform rank choices that hurt sensitive layers.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

This approach might extend to other decomposition methods like QR or CUR if the same scaling step is applied first.
Further savings could come from combining ASVD with quantization on the remaining factors.
Testing on very long sequences would reveal whether the reduced KV cache preserves coherence over extended contexts.

Load-bearing premise

That scaling the weights according to activation statistics absorbs the outliers enough to make the low-rank approximation significantly more accurate, and that the iterative calibration finds good per-layer settings without needing extra validation data or introducing compounding errors.

What would settle it

Running the compressed model on a standard benchmark such as MMLU or GSM8K and observing an accuracy drop larger than a few percent relative to the original model, or failing to reach the stated compression ratios while remaining training-free.

Original abstract

In this paper, we introduce a new post-training compression paradigm for Large Language Models (LLMs) to facilitate their wider adoption. We delve into LLM weight low-rank decomposition, and find that the challenges of this task stem from (1) the distribution variance in the LLM activations and (2) the sensitivity difference among various kinds of layers. To address these issues, we propose a training-free approach called Activation-aware Singular Value Decomposition (ASVD). Specifically, ASVD manages activation outliers by transforming the weight matrix based on the activation distribution. This transformation allows the outliers in the activation matrix to be absorbed into the transformed weight matrix, thereby enhancing decomposition accuracy. Additionally, we propose an efficient iterative calibration process to optimize layer-specific decomposition by addressing the varying sensitivity of different LLM layers. In this way, ASVD can compress a network by 10%-30%. Based on the success of the low-rank decomposition of projection matrices in the self-attention module, we further introduce ASVD to compress the KV cache. By reducing the channel dimension of KV activations, memory requirements for KV cache can be largely reduced. ASVD can further achieve 50% KV cache reductions without performance drop in a training-free manner.
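The KV cache idea in the abstract's last sentences can be sketched as follows. This is a toy under loud assumptions, not the paper's procedure: the key projection is constructed to be exactly low rank so the result is lossless, there is a single head, and positional encodings are ignored. The names `A`, `B`, `small_cache`, and the dimensions are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, rank, n_tokens = 128, 64, 32         # rank = d_model / 2

# Toy key projection that is exactly rank-`rank`: W_k = B @ A.
B = rng.standard_normal((d_model, rank))      # up-projection
A = rng.standard_normal((rank, d_model))      # down-projection
W_k = B @ A

X = rng.standard_normal((d_model, n_tokens))  # hidden states, one column per token
q = rng.standard_normal(d_model)              # a single query vector

full_cache = W_k @ X                          # usual cache: (d_model, n_tokens)
small_cache = A @ X                           # reduced cache: (rank, n_tokens)

scores_full = q @ full_cache                  # q^T k_t for every cached token
scores_small = (B.T @ q) @ small_cache        # fold B into the query instead

cache_ratio = small_cache.size / full_cache.size
print(cache_ratio, np.allclose(scores_full, scores_small))
```

With a real model the projection is only approximately low rank, so the scores would match up to the truncation error of the decomposition rather than exactly.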

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

ASVD's activation-based weight transform plus per-layer calibration gives a practical post-training SVD tweak for LLMs and KV cache, but the error reduction claim needs tighter evidence.

The main thing to know is that this paper describes a training-free way to improve SVD compression of LLM weights: first transform the weights according to activation statistics so that outliers are folded into them, then run an iterative per-layer calibration to choose ranks, with an extension that halves KV cache size. The activation-aware step and the calibration loop are the concrete additions over plain low-rank work. The KV cache application is a direct follow-on that targets a real inference bottleneck.

The paper does a clear job naming the two practical problems: activation variance that breaks standard SVD, and the fact that layers differ in how much approximation they can tolerate. The transformation idea is simple and directly motivated by those issues, and the calibration sounds lightweight enough to run without much overhead. Extending the same logic to KV activations makes sense once the weight part works.

The soft spot sits in the transformation itself. Turning W into W' = W T (with T computed from the activations A), decomposing, and compensating at inference time only helps if the net effect lowers error on actual data without adding bias or requiring extra per-token fixes. If the compensation is just a mean or diagonal adjustment, the whole thing risks reducing to a reweighting heuristic whose advantage over existing activation-aware baselines is not automatic. The abstract states 10-30% compression and 50% KV reduction with no drop, yet supplies no numbers, ablations, or error bounds, so the size of the actual gain is still open. I would want to see the full tables and checks on whether calibration stays stable across datasets.

This is aimed at people who need post-training size and memory reductions for LLMs without retraining. Someone running inference under tight hardware limits would find the KV cache part worth testing if the numbers hold. I would send it for peer review. The problem is real, the method is specific, and the gaps are fixable with more experimental detail rather than fundamental.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces Activation-aware Singular Value Decomposition (ASVD), a training-free post-training method for compressing LLMs. It identifies challenges from activation distribution variance and layer sensitivity differences, then transforms weight matrices based on activation statistics to absorb outliers before performing SVD, followed by an iterative calibration process to select layer-specific ranks. The approach is claimed to enable 10-30% network compression and, by extending the low-rank idea to self-attention projection matrices, 50% KV cache reduction without performance drop.
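The compression percentages in the summary translate into ranks by simple counting: an m x n matrix stored as an (m x k) factor times a (k x n) factor holds k(m + n) values instead of mn. The helper names and the 4096 x 4096 shape below are hypothetical and chosen only for illustration.

```python
def low_rank_param_ratio(m, n, k):
    """Fraction of parameters kept when an m x n matrix is stored as
    an (m x k) factor times a (k x n) factor."""
    return k * (m + n) / (m * n)

def max_rank_for_ratio(m, n, keep_ratio):
    """Largest rank whose two factors fit within keep_ratio * m * n values."""
    return int(keep_ratio * m * n // (m + n))

m = n = 4096                                   # hypothetical square projection
for removed in (0.10, 0.20, 0.30):
    k = max_rank_for_ratio(m, n, 1.0 - removed)
    print(removed, k, round(low_rank_param_ratio(m, n, k), 4))
```

For a square matrix the factored form breaks even at k = n / 2, so any parameter saving at all requires discarding more than half of the singular directions. That is one reason the quality of the retained directions matters so much in this setting.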

Significance. If the central claims are supported by rigorous experiments, the work would offer a practical, training-free compression technique that could facilitate wider deployment of LLMs on hardware with limited memory, complementing existing quantization and pruning approaches.

major comments (2)

[§3.2] The core transformation step (weight matrix modified according to activation distribution to absorb outliers) is presented as improving SVD accuracy, but the manuscript provides no invariance proof, error bound, or demonstration that the subsequent compensation during inference exactly preserves the original forward pass on real activations. This is load-bearing for the claim that the method reduces effective approximation error without introducing bias.
[§3.3] The iterative calibration process for optimizing per-layer decomposition is described at a high level as addressing varying layer sensitivity, yet no details are given on the objective function, convergence guarantees, or safeguards against systematic mismatch between the transformed space and the original activation distribution. This directly affects whether the training-free claim holds without hidden validation requirements.

minor comments (2)

The abstract states compression ranges (10%-30%, 50% KV cache) but would be strengthened by explicit reference to the quantitative metrics (e.g., perplexity delta or zero-shot accuracy) reported in the experiments section.
Notation for the activation-aware transformation and the calibration loop should be introduced with explicit equations and variable definitions to improve readability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment point by point below and agree that additional formal analysis and implementation details will improve clarity. We will incorporate the suggested revisions in the next version of the paper.

Referee: [§3.2] The core transformation step (weight matrix modified according to activation distribution to absorb outliers) is presented as improving SVD accuracy, but the manuscript provides no invariance proof, error bound, or demonstration that the subsequent compensation during inference exactly preserves the original forward pass on real activations. This is load-bearing for the claim that the method reduces effective approximation error without introducing bias.

Authors: We appreciate the referee highlighting the need for a rigorous justification of the transformation. The transformation is constructed so that there exists an invertible mapping T derived from activation statistics (typically a diagonal scaling matrix based on per-channel activation magnitudes or norms) such that the original matrix product is preserved exactly: if W' = W T and the corresponding activation is scaled as A' = T^{-1} A, then W A = W' A' holds identically. The SVD is performed on the transformed W', and at inference the low-rank factors are used with the inverse compensation applied to maintain equivalence before the approximation error is introduced. We acknowledge that the current manuscript presents this at a conceptual level without an explicit invariance proof, error bound derivation, or side-by-side demonstration on real activations. In the revised manuscript we will add a new subsection in §3.2 containing (i) the formal proof of exact pre-approximation equivalence, (ii) a first-order error bound relating the SVD truncation error in the transformed space to the original space, and (iii) empirical verification showing that the compensated reconstruction introduces no systematic bias on held-out activations. This revision will be made. revision: yes
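
The equivalence invoked in this response, and what it implies once the rank is truncated, can be written out in the response's own notation (W, T, A). Here [M]_k denotes the best rank-k approximation of M, a symbol introduced only for this sketch. The final inequality is the standard submultiplicative bound for the Frobenius and spectral norms; it is not the first-order bound the authors promise, which the page does not show.

```latex
\begin{align*}
W A &= W \,(T T^{-1})\, A = (W T)\,(T^{-1} A) = W' A' , \\
\widehat{W} &= [\,W T\,]_k \, T^{-1} , \\
W A - \widehat{W} A &= \bigl( W T - [\,W T\,]_k \bigr)\, T^{-1} A , \\
\lVert W A - \widehat{W} A \rVert_F
  &\le \lVert W T - [\,W T\,]_k \rVert_F \; \lVert T^{-1} A \rVert_2 .
\end{align*}
```

The first line holds for any invertible T, so the transform alone changes nothing. All of the error comes from the truncation in the second line, and the bound shows why a T that keeps the rescaled activations T^{-1} A well conditioned is desirable.
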
Referee: [§3.3] The iterative calibration process for optimizing per-layer decomposition is described at a high level as addressing varying layer sensitivity, yet no details are given on the objective function, convergence guarantees, or safeguards against systematic mismatch between the transformed space and the original activation distribution. This directly affects whether the training-free claim holds without hidden validation requirements.

Authors: We thank the referee for noting the lack of algorithmic detail. The iterative calibration allocates per-layer ranks under a global compression budget by repeatedly evaluating layer sensitivity, defined as the increase in perplexity (or downstream task loss) on a small calibration set when the rank of that layer is reduced. The objective is to minimize the sum of these sensitivity-weighted errors subject to the total parameter budget; at each iteration the layer with the lowest sensitivity-per-parameter is selected for rank reduction until the budget is met or a performance threshold is reached. Convergence is monitored empirically by tracking the change in calibration perplexity between iterations and stopping when it falls below a small epsilon. Because the transformation already aligns the activation statistics, mismatch is mitigated by re-computing activation statistics after each rank change on the same calibration samples. We agree that these elements are described only at a high level. In the revision we will expand §3.3 with the exact objective function, pseudocode of the iterative procedure, empirical convergence plots, and explicit safeguards (including re-calibration of activation statistics after each adjustment). The procedure remains strictly training-free: no gradients are computed and no model parameters are updated; only rank selection is performed using a fixed calibration set, which is standard practice for post-training compression methods. revision: yes
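
The greedy procedure described in this response can be sketched in a few lines. This is a reading of the response, not the paper's algorithm: `allocate_ranks`, the `step` size, and the toy `sensitivity` function (standing in for a perplexity measurement on a calibration set) are all hypothetical.

```python
def allocate_ranks(shapes, sensitivity, budget, step=8):
    """Greedy per-layer rank allocation under a global parameter budget.

    shapes:      {layer: (m, n)} weight shapes
    sensitivity: callable (layer, rank) -> calibration loss at that rank
    budget:      maximum total number of parameters in the low-rank factors
    """
    ranks = {name: min(m, n) for name, (m, n) in shapes.items()}

    def total_params():
        return sum(ranks[name] * (m + n) for name, (m, n) in shapes.items())

    while total_params() > budget:
        best, best_cost = None, None
        for name, (m, n) in shapes.items():
            if ranks[name] <= step:
                continue                       # never drop a layer to rank 0
            # Loss increase per parameter saved if this layer loses `step` ranks.
            delta = (sensitivity(name, ranks[name] - step)
                     - sensitivity(name, ranks[name]))
            cost = delta / (step * (m + n))
            if best_cost is None or cost < best_cost:
                best, best_cost = name, cost
        if best is None:
            break                              # nothing left to shrink
        ranks[best] -= step
    return ranks

# Toy sensitivity: loss grows as rank shrinks, faster for the "fragile" layer.
fragility = {"q_proj": 1.0, "up_proj": 10.0}
shapes = {"q_proj": (64, 64), "up_proj": (64, 64)}

def toy_sensitivity(name, rank):
    return fragility[name] / rank

ranks = allocate_ranks(shapes, toy_sensitivity, budget=8192)
print(ranks)
```

In this toy the robust layer gives up most of its rank before the fragile one is touched, which is the behavior the response describes. In a real setting each `sensitivity` call would be a forward pass over calibration data, so the number of evaluations is the main cost.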

Circularity Check

0 steps flagged

No significant circularity; method is a self-contained heuristic with independent empirical claims

full rationale

The ASVD approach defines a weight transformation derived from activation statistics to absorb outliers, applies standard SVD to the transformed matrix, and uses an iterative calibration loop to select per-layer ranks based on sensitivity. These operations are specified procedurally without defining any output metric (such as compression ratio or performance) in terms of the inputs by construction. The reported 10-30% compression and 50% KV-cache reduction are presented as experimental outcomes rather than algebraic identities or fitted predictions that reduce to the calibration data. No self-citation chains, uniqueness theorems, or ansatz smuggling appear in the derivation; the central claims remain falsifiable against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The approach rests on standard linear-algebra properties of SVD and the empirical premise that activation outliers can be absorbed by a linear transformation of weights; no new entities or fitted constants are introduced in the abstract.

axioms (2)

standard math: Singular value decomposition provides an optimal low-rank approximation under the Frobenius norm.
    Invoked implicitly when using SVD for weight compression.
domain assumption: Activation distributions can be estimated from a small calibration set and used to guide weight transformation.
    Central to the outlier-absorption step described in the abstract.
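
The first axiom is the Eckart-Young theorem, and it is easy to check numerically. The sketch below uses an arbitrary random matrix; the sizes and variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
M = rng.standard_normal((40, 30))
k = 10

U, s, Vt = np.linalg.svd(M, full_matrices=False)
M_k = (U[:, :k] * s[:k]) @ Vt[:k]             # rank-k truncation

trunc_err = np.linalg.norm(M - M_k)           # Frobenius norm by default
tail_energy = np.sqrt(np.sum(s[k:] ** 2))     # sqrt of the discarded sigma_i^2

# An arbitrary rank-k competitor should not beat the truncation.
P = rng.standard_normal((40, k)) @ rng.standard_normal((k, 30))
print(np.isclose(trunc_err, tail_energy), trunc_err <= np.linalg.norm(M - P))
```

The optimality is with respect to the weight matrix alone. It says nothing about the error after multiplying by activations, which is the gap the activation-aware transform is meant to close.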

pith-pipeline@v0.9.0 · 5758 in / 1285 out tokens · 39921 ms · 2026-05-20T13:43:41.410430+00:00 · methodology

Forward citations

Cited by 24 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

ENEC: A Lossless AI Model Compression Method Enabling Fast Inference on Ascend NPUs
cs.AR 2026-03 unverdicted novelty 7.0

ENEC delivers 3.43X higher throughput than DietGPU and 1.12X better compression ratio than nvCOMP for lossless model weight compression on Ascend NPUs, yielding up to 6.3X end-to-end inference speedup.
Why Smaller Is Slower? Dimensional Misalignment in Compressed LLMs
cs.DC 2026-03 unverdicted novelty 7.0

Dimensional misalignment slows compressed LLMs on GPUs; GAC uses knapsack optimization to achieve full alignment and up to 1.5x speedup on Llama-3-8B while preserving quality.
Decomposed Trust: Privacy, Adversarial Robustness, Ethics, and Fairness in Low-Rank LLMs
cs.LG 2025-11 unverdicted novelty 7.0

Low-rank compression preserves training-data privacy and improves adversarial robustness but weakens personal-information protection, reduces ethical behavior in zero-shot use, and harms fairness.
SAFE-SVD: Sensitivity-Aware Fidelity-Enforcing SVD for Physics Foundation Models
cs.LG 2026-05 unverdicted novelty 6.0

SAFE-SVD introduces a sensitivity-aware fidelity-enforcing SVD framework for compressing physics foundation models that maintains higher accuracy than standard methods at greater compression ratios.
OSCAR: Offline Spectral Covariance-Aware Rotation for 2-bit KV Cache Quantization
cs.LG 2026-05 unverdicted novelty 6.0

OSCAR achieves near-BF16 accuracy for 2-bit KV cache quantization by using offline spectral covariance-aware rotations aligned with attention, plus a custom deployable INT2 kernel compatible with paged serving.
Theory-optimal Quantization Based on Flatness
cs.LG 2026-05 unverdicted novelty 6.0

The paper introduces the Flatness metric, derives a theory-optimal quantization solution, and presents BDQ that uses bidirectional diagonal transformations to reduce outlier impact, achieving under 1% drop at W4A4 on ...
Relative Kinetic Utility for Reasoning-Aware Structural Pruning in Large Language Models
cs.LG 2026-05 unverdicted novelty 6.0

RKU is a curvature-aware structural pruning framework that improves LLM reasoning accuracy at 40% sparsity, reaching 13.34% on GSM8K while outperforming baselines and better preserving out-of-distribution representations.
Different Prompts, Different Ranks: Prompt-aware Dynamic Rank Selection for SVD-based LLM Compression
cs.LG 2026-05 unverdicted novelty 6.0

PARSE trains a prompt-aware linear router on dense-model outputs to select dynamic SVD ranks, improving accuracy up to 10% at 0.6 compression ratio on LLaMA-7B while delivering 2.5x prefill and 2.4x decode speedups.
FlashSVD v1.5: Making Low-Rank Transformers Inference Actually Fast
cs.LG 2026-05 unverdicted novelty 6.0

FlashSVD v1.5 delivers up to 2.55x faster autoregressive decode and 2.39x end-to-end speedup for SVD-compressed transformers by reorganizing execution paths with dense-KV decode, packed MLP kernels, and per-layer CUDA graphs.
Bayesian Fine-tuning in Projected Subspaces
cs.LG 2026-05 unverdicted novelty 6.0

Bayesian fine-tuning of large models can be done efficiently by projecting uncertainties into low-dimensional subspaces, yielding improved calibration and generalization while keeping computational costs low.
Gated Subspace Inference for Transformer Acceleration
cs.LG 2026-05 unverdicted novelty 6.0

Gated Subspace Inference accelerates transformer linear layers 3-10x via low-rank cached subspace computation and per-token gating to skip residuals while preserving output distribution to high accuracy.
Importance-Guided Basis Selection for Low-Rank Decomposition of Large Language Models
cs.LG 2026-05 unverdicted novelty 6.0

BSI ranks singular-vector bases for LLM low-rank compression by estimating expected task loss increase via second-order Taylor expansion of the loss and an efficient Hessian-diagonal estimator, outperforming magnitude...
Technical Report: Activation Residual Hessian Quantization (ARHQ) for Low-Bit LLM Quantization
cs.LG 2026-04 unverdicted novelty 6.0

ARHQ isolates error-sensitive weight directions in LLMs via truncated SVD on the scaled matrix W G_x^{1/2} from activation residuals, improving SNR and preserving performance under aggressive low-bit quantization.
SLaB: Sparse-Lowrank-Binary Decomposition for Efficient Large Language Models
cs.LG 2026-04 unverdicted novelty 6.0

SLaB compresses LLM weights via sparse-lowrank-binary decomposition guided by activation-aware scores, achieving up to 36% lower perplexity than prior methods at 50% compression on Llama models.
CoSpaDi: Compressing LLMs via Calibration-Guided Sparse Dictionary Learning
cs.CL 2025-09 conditional novelty 6.0

CoSpaDi introduces a training-free sparse dictionary learning framework for post-training LLM compression that optimizes functional reconstruction error via activation-derived orthonormalization and achieves improved ...
IMPACT: Importance-Aware Activation Space Reconstruction
cs.LG 2025-07 unverdicted novelty 6.0

IMPACT derives a closed-form low-rank activation reconstruction from an importance-weighted covariance matrix to achieve higher compression ratios than standard methods while maintaining model accuracy.
Fast Tensorization of Neural Networks via Slice-wise Feature Distillation
cs.LG 2026-05 unverdicted novelty 5.0

A slice-wise feature distillation framework for independent tensorization of neural network slices to achieve scalable compression with reduced fine-tuning costs.
IO-SVD: Input-Output Whitened SVD for Adaptive-Rank LLM Compression
cs.LG 2026-05 unverdicted novelty 5.0

IO-SVD performs SVD-based LLM compression by constructing a KL-aware double-sided whitening space and using first-order loss estimates for heterogeneous rank allocation.
Minimal-Intervention KV Retention via Set-Conditioned Diversity
cs.LG 2026-05 conditional novelty 5.0

A minimal scoring modification to TriAttention using greedy facility-location selection with V-space redundancy penalty improves KV retention at budgets 64 and 128 on distilled reasoning models under matched-memory he...
Minimal-Intervention KV Retention via Set-Conditioned Diversity
cs.LG 2026-05 unverdicted novelty 5.0

A one-function modification to the TriAttention retention scorer using greedy selection under a V-space redundancy penalty outperforms seven matched mechanisms on long-form math reasoning at budgets 64 and 128.
DeInfer: Efficient Parallel Inferencing for Decomposed Large Language Models
cs.CL 2026-04 unverdicted novelty 4.0

DeInfer adds multiple optimizations to improve parallel inference speed for decomposed LLMs while remaining compatible with existing techniques.
Network Edge Inference for Large Language Models: Principles, Techniques, and Opportunities
cs.DC 2026-04 unverdicted novelty 3.0

A survey synthesizing challenges, system architectures, model optimizations, deployment methods, and resource management techniques for large language model inference at the network edge.
NeuronMLP: Efficient LLM Inference via Singular Value Decomposition Compression and Tiling on AWS Trainium
cs.CL 2025-10 unverdicted novelty 3.0

NeuronMLP applies SVD-based compression and Trainium-specific tiling and caching to MLP layers, delivering 1.35x kernel speedup and 1.21x end-to-end inference speedup at 0.05 compression ratio versus AWS NKI baseline.
A Survey on Efficient Inference for Large Language Models
cs.CL 2024-04 accept novelty 3.0

The paper surveys techniques to speed up and reduce the resource needs of LLM inference, organized by data-level, model-level, and system-level changes, with comparative experiments on representative methods.

Reference graph

Works this paper leans on

28 extracted references · 28 canonical work pages · cited by 23 Pith papers · 14 internal anchors

[1] Rishabh Agarwal, Nino Vieillard, Piotr Stanczyk, Sabela Ramos, Matthieu Geist, and Olivier Bachem. Gkd: Generalized knowledge distillation for auto-regressive sequence models. arXiv preprint arXiv:2306.13649.

[2] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901.

[3] Chi-Chih Chang, Wei-Cheng Lin, Chien-Yu Lin, Chong-Yan Chen, Yu-Fang Hu, Pei-Shuo Wang, Ning-Chi Huang, Luis Ceze, and Kai-Chiang Wu. Palu: Compressing kv-cache with low-rank projection. arXiv preprint arXiv:2407.21118.

[4] Yu Cheng, Duo Wang, Pan Zhou, and Tao Zhang. A survey of model compression and acceleration for deep neural networks. arXiv preprint arXiv:1710.09282, 2017.

[5] Tim Dettmers, Mike Lewis, Younes Belkada, and Luke Zettlemoyer. LLM.int8(): 8-bit matrix multiplication for transformers at scale. arXiv preprint arXiv:2208.07339.

[6] Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, and Luke Zettlemoyer. Qlora: Efficient finetuning of quantized llms. arXiv preprint arXiv:2305.14314.

[7] Elias Frantar, Saleh Ashkboos, Torsten Hoefler, and Dan Alistarh. Gptq: Accurate post-training quantization for generative pre-trained transformers. arXiv preprint arXiv:2210.17323.

[8] Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding. arXiv preprint arXiv:2009.03300.

[9] Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531.

[10] Coleman Hooper, Sehoon Kim, Hiva Mohammadzadeh, Michael W Mahoney, Yakun Sophia Shao, Kurt Keutzer, and Amir Gholami. Kvquant: Towards 10 million context length llm inference with kv cache quantization. arXiv preprint arXiv:2401.18079.

[11] Yani Ioannou, Duncan Robertson, Jamie Shotton, Roberto Cipolla, and Antonio Criminisi. Training cnns with low-rank filters for efficient image classification. arXiv preprint arXiv:1511.06744.

[12] Max Jaderberg, Andrea Vedaldi, and Andrew Zisserman. Speeding up convolutional neural networks with low rank expansions. arXiv preprint arXiv:1405.3866.

[13] Sehoon Kim, Coleman Hooper, Amir Gholami, Zhen Dong, Xiuyu Li, Sheng Shen, Michael W Mahoney, and Kurt Keutzer. Squeezellm: Dense-and-sparse quantization. arXiv preprint arXiv:2306.07629.

[14] Yong-Deok Kim, Eunhyeok Park, Sungjoo Yoo, Taelim Choi, Lu Yang, and Dongjun Shin. Compression of deep convolutional neural networks for fast and low power mobile applications. arXiv preprint arXiv:1511.06530.

[15] Vadim Lebedev, Yaroslav Ganin, Maksim Rakhuba, Ivan Oseledets, and Victor Lempitsky. Speeding-up convolutional neural networks using fine-tuned cp-decomposition. arXiv preprint arXiv:1412.6553.

[16] Ji Lin, Jiaming Tang, Haotian Tang, Shang Yang, Xingyu Dang, and Song Han. Awq: Activation-aware weight quantization for llm compression and acceleration. arXiv preprint arXiv:2306.00978.

[17] Zirui Liu, Jiayi Yuan, Hongye Jin, Shaochen Zhong, Zhaozhuo Xu, Vladimir Braverman, Beidi Chen, and Xia Hu. Kivi: A tuning-free asymmetric 2bit quantization for kv cache. arXiv preprint arXiv:2402.02750.

[18] Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher. Pointer sentinel mixture models. arXiv preprint arXiv:1609.07843.

[19] Marcin Moczulski, Misha Denil, Jeremy Appleyard, and Nando de Freitas. Acdc: A structured efficient linear layer. arXiv preprint arXiv:1511.05946.

[20] Tara N Sainath, Brian Kingsbury, Vikas Sindhwani, Ebru Arisoy, and Bhuvana Ramabhadran. Low-rank matrix factorization for deep neural network training with high-dimensional output targets. In 2013 IEEE international conference on acoustics, speech and signal processing, pp. 6655–6659. IEEE.

[21] Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023a. Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikol...

[22] Xin Wang, Yu Zheng, Zhongwei Wan, and Mi Zhang. Svd-llm: Truncation-aware singular value decomposition for large language model compression. arXiv preprint arXiv:2403.07378.

[23] Mingxue Xu, Yao Lei Xu, and Danilo P Mandic. Tensorgpt: Efficient compression of the embedding layer in llms based on the tensor-train decomposition. arXiv preprint arXiv:2307.00526.

[24] Zhihang Yuan, Lin Niu, Jiawei Liu, Wenyu Liu, Xinggang Wang, Yuzhang Shang, Guangyu Sun, Qiang Wu, Jiaxiang Wu, and Bingzhe Wu. Rptq: Reorder-based post-training quantization for large language models. arXiv preprint arXiv:2304.01089.

[25] Zhihang Yuan, Yuzhang Shang, Yang Zhou, Zhen Dong, Chenhao Xue, Bingzhe Wu, Zhikai Li, Qingyi Gu, Yong Jae Lee, Yan Yan, et al. Llm inference unveiled: Survey and roofline model insights. arXiv preprint arXiv:2402.16363.

[26] Xunyu Zhu, Jian Li, Yong Liu, Can Ma, and Weiping Wang. A survey on model compression for large language models. arXiv preprint arXiv:2308.07633.

[27] It doesn't need calibration dataset. A.5 DIFFERENCE WITH TENSORGPT. In the context of LLM compression via decomposition, the most related work is the concurrent TensorGPT Xu et al. [2023], Zhu et al. [2023], in which the embedding layer of LLMs is compressed through Tensor-Train Decomposition (TTD) Oseledets

[28] in order to store large embeddings in a low-rank tensor format, with much fewer parameters. However, there are several differences between those two methods: (1) Unlike TensorGPT which focuses solely on the token embedding matrix, ASVD aims to compress the entire weight spectrum of LLMs. This holistic approach addresses a more critical aspect of LLM compre...

[1]

GKD: Generalized knowledge distillation for auto-regressive sequence models

Rishabh Agarwal, Nino Vieillard, Piotr Stanczyk, Sabela Ramos, Matthieu Geist, and Olivier Bachem. GKD: Generalized knowledge distillation for auto-regressive sequence models. arXiv preprint arXiv:2306.13649, 2023.

[2]

Language models are few-shot learners

Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020.

[3]

Palu: Compressing KV-cache with low-rank projection

Chi-Chih Chang, Wei-Cheng Lin, Chien-Yu Lin, Chong-Yan Chen, Yu-Fang Hu, Pei-Shuo Wang, Ning-Chi Huang, Luis Ceze, and Kai-Chiang Wu. Palu: Compressing KV-cache with low-rank projection. arXiv preprint arXiv:2407.21118, 2024.

[4]

A survey of model compression and acceleration for deep neural networks

Yu Cheng, Duo Wang, Pan Zhou, and Tao Zhang. A survey of model compression and acceleration for deep neural networks. arXiv preprint arXiv:1710.09282, 2017.

[5]

Tim Dettmers, Mike Lewis, Younes Belkada, and Luke Zettlemoyer. LLM.int8(): 8-bit matrix multiplication for transformers at scale. arXiv preprint arXiv:2208.07339, 2022.

[6]

QLoRA: Efficient Finetuning of Quantized LLMs

Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, and Luke Zettlemoyer. QLoRA: Efficient finetuning of quantized LLMs. arXiv preprint arXiv:2305.14314, 2023.

[7]

GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers

Elias Frantar, Saleh Ashkboos, Torsten Hoefler, and Dan Alistarh. GPTQ: Accurate post-training quantization for generative pre-trained transformers. arXiv preprint arXiv:2210.17323, 2022.

[8]

Measuring Massive Multitask Language Understanding

Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding. arXiv preprint arXiv:2009.03300, 2020.

[9]

Distilling the Knowledge in a Neural Network

Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015.

[10]

KVQuant: Towards 10 million context length LLM inference with KV cache quantization

Coleman Hooper, Sehoon Kim, Hiva Mohammadzadeh, Michael W Mahoney, Yakun Sophia Shao, Kurt Keutzer, and Amir Gholami. KVQuant: Towards 10 million context length LLM inference with KV cache quantization. arXiv preprint arXiv:2401.18079, 2024.

[11]

Training CNNs with Low-Rank Filters for Efficient Image Classification

Yani Ioannou, Duncan Robertson, Jamie Shotton, Roberto Cipolla, and Antonio Criminisi. Training CNNs with low-rank filters for efficient image classification. arXiv preprint arXiv:1511.06744, 2015.

[12]

Speeding up Convolutional Neural Networks with Low Rank Expansions

Max Jaderberg, Andrea Vedaldi, and Andrew Zisserman. Speeding up convolutional neural networks with low rank expansions. arXiv preprint arXiv:1405.3866, 2014.

[13]

SqueezeLLM: Dense-and-sparse quantization

Sehoon Kim, Coleman Hooper, Amir Gholami, Zhen Dong, Xiuyu Li, Sheng Shen, Michael W Mahoney, and Kurt Keutzer. SqueezeLLM: Dense-and-sparse quantization. arXiv preprint arXiv:2306.07629, 2023.

[14]

Compression of Deep Convolutional Neural Networks for Fast and Low Power Mobile Applications

Yong-Deok Kim, Eunhyeok Park, Sungjoo Yoo, Taelim Choi, Lu Yang, and Dongjun Shin. Compression of deep convolutional neural networks for fast and low power mobile applications. arXiv preprint arXiv:1511.06530, 2015.

[15]

Speeding-up Convolutional Neural Networks Using Fine-tuned CP-Decomposition

Vadim Lebedev, Yaroslav Ganin, Maksim Rakhuba, Ivan Oseledets, and Victor Lempitsky. Speeding-up convolutional neural networks using fine-tuned CP-decomposition. arXiv preprint arXiv:1412.6553, 2014.

[16]

AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration

Ji Lin, Jiaming Tang, Haotian Tang, Shang Yang, Xingyu Dang, and Song Han. AWQ: Activation-aware weight quantization for LLM compression and acceleration. arXiv preprint arXiv:2306.00978, 2023.

[17]

KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache

Zirui Liu, Jiayi Yuan, Hongye Jin, Shaochen Zhong, Zhaozhuo Xu, Vladimir Braverman, Beidi Chen, and Xia Hu. KIVI: A tuning-free asymmetric 2bit quantization for KV cache. arXiv preprint arXiv:2402.02750, 2024.

[18]

Pointer Sentinel Mixture Models

Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher. Pointer sentinel mixture models. arXiv preprint arXiv:1609.07843, 2016.

[19]

ACDC: A Structured Efficient Linear Layer

Marcin Moczulski, Misha Denil, Jeremy Appleyard, and Nando de Freitas. ACDC: A structured efficient linear layer. arXiv preprint arXiv:1511.05946, 2015.

[20]

Low-rank matrix factorization for deep neural network training with high-dimensional output targets

Tara N Sainath, Brian Kingsbury, Vikas Sindhwani, Ebru Arisoy, and Bhuvana Ramabhadran. Low-rank matrix factorization for deep neural network training with high-dimensional output targets. In 2013 IEEE international conference on acoustics, speech and signal processing, pp. 6655–6659. IEEE, 2013.

[21]

LLaMA: Open and Efficient Foundation Language Models

Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023a. Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikol...
