
arxiv: 2312.05821 · v5 · pith:ISKLA3BYnew · submitted 2023-12-10 · 💻 cs.CL

ASVD: Activation-aware Singular Value Decomposition for Compressing Large Language Models

Pith reviewed 2026-05-20 13:43 UTC · model grok-4.3

classification 💻 cs.CL
keywords LLM compression · singular value decomposition · post-training compression · low-rank approximation · KV cache compression · activation distribution · model efficiency

The pith

Transforming weight matrices to absorb activation outliers enables accurate low-rank compression of LLMs by 10-30% without any retraining.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Large language models have activations with wide value ranges and layers that differ in how much approximation error they can tolerate. The paper shows that scaling each weight matrix by the statistics of its input activations folds the extreme values into the weights themselves. Singular value decomposition of the scaled matrix then gives a better low-rank approximation of the layer's output, because the truncation now prioritizes the input channels that carry the largest activations. An iterative procedure adjusts the rank or scaling for each layer according to its measured sensitivity. When this process is applied to both the projection matrices and the key-value caches in attention, the model shrinks by 10 to 30 percent and the cache memory halves while accuracy stays nearly the same.

Core claim

By transforming the weight matrix based on the activation distribution, the outliers in the activation matrix are absorbed into the transformed weight matrix, which enhances the accuracy of the subsequent singular value decomposition. An efficient iterative calibration process then optimizes the decomposition for each layer's specific sensitivity, allowing 10%-30% network compression and 50% KV cache reduction in a training-free manner.

What carries the argument

Activation-aware Singular Value Decomposition that first scales weights by activation statistics to absorb outliers before low-rank factorization, together with layer-wise iterative calibration to respect differing sensitivities.
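
The scale-then-factorize idea can be sketched in NumPy as follows. This is a minimal illustration under assumptions that are not taken from the paper: the scaling is a diagonal matrix built from the mean absolute activation per input channel, raised to a hypothetical exponent `alpha`; weights act as `y = W @ x`; and the function name `asvd_factorize` is made up for this example. With `alpha=0` the scaling is the identity, so the same function doubles as a plain truncated-SVD baseline.

```python
import numpy as np

def asvd_factorize(W, X, rank, alpha=0.5, eps=1e-8):
    """Return (A, B) with A @ B approximating W via activation-scaled SVD.

    W: (out_features, in_features) weight, applied as y = W @ x.
    X: (in_features, n_tokens) calibration activations.
    alpha, eps: illustrative hyperparameters (assumptions, not from the paper).
    """
    # Per-input-channel scale from the mean absolute activation.
    s = (np.abs(X).mean(axis=1) + eps) ** alpha
    W_scaled = W * s                      # W @ diag(s): column i scaled by s[i]
    U, sigma, Vt = np.linalg.svd(W_scaled, full_matrices=False)
    U, sigma, Vt = U[:, :rank], sigma[:rank], Vt[:rank]
    root = np.sqrt(sigma)
    A = U * root                          # (out_features, rank)
    B = (root[:, None] * Vt) / s          # (rank, in_features); folds diag(s)^-1 back in
    return A, B

def output_error(W, A, B, X):
    """Frobenius norm of the layer-output error on activations X."""
    return np.linalg.norm(W @ X - A @ (B @ X))

# Synthetic data: 4 of 64 input channels carry 50x larger activations.
rng = np.random.default_rng(0)
W = rng.standard_normal((64, 64))
channel_scale = np.ones(64)
channel_scale[:4] = 50.0
X = rng.standard_normal((64, 512)) * channel_scale[:, None]

A0, B0 = asvd_factorize(W, X, rank=32, alpha=0.0)   # plain truncated SVD
A1, B1 = asvd_factorize(W, X, rank=32, alpha=0.5)   # activation-scaled
err_plain = output_error(W, A0, B0, X)
err_asvd = output_error(W, A1, B1, X)
print(f"plain SVD output error:  {err_plain:.1f}")
print(f"scaled SVD output error: {err_asvd:.1f}")
```

On this synthetic data the scaled factorization gives the lower output error at equal rank, because the outlier channels dominate the plain-SVD error. Real layers need not behave the same way.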

If this is right

  • LLMs become runnable on hardware with tighter memory budgets without retraining.
  • KV cache size in long-context inference drops by half at no accuracy cost.
  • Post-training compression becomes practical for many existing models.
  • Layer-specific tuning avoids uniform rank choices that hurt sensitive layers.
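
To make the compression percentages concrete, here is a small worked calculation. It relies on standard low-rank bookkeeping rather than a figure from the paper: an m×n weight factored at rank r stores r(m+n) parameters instead of mn. The 4096×4096 layer shape and the helper name `max_rank_for_compression` are illustrative assumptions.

```python
def max_rank_for_compression(m, n, removed_fraction):
    """Largest rank r such that r * (m + n) <= (1 - removed_fraction) * m * n."""
    budget = (1.0 - removed_fraction) * m * n
    return int(budget // (m + n))

m = n = 4096  # hypothetical square projection matrix
for removed in (0.0, 0.1, 0.2, 0.3):
    r = max_rank_for_compression(m, n, removed)
    print(f"remove {removed:.0%} of parameters -> rank <= {r}")
```

For a square 4096×4096 matrix the break-even rank is 2048, so any saving requires discarding more than half of the singular directions. This is why the quality of the truncation matters so much.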

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • This approach might extend to other decomposition methods like QR or CUR if the same scaling step is applied first.
  • Further savings could come from combining ASVD with quantization on the remaining factors.
  • Testing on very long sequences would reveal whether the reduced KV cache preserves coherence over extended contexts.

Load-bearing premise

That scaling the weights according to activation statistics absorbs the outliers enough to make the low-rank approximation significantly more accurate, and that the iterative calibration finds good per-layer settings without needing extra validation data or introducing compounding errors.

What would settle it

Running the compressed model on a standard benchmark such as MMLU or GSM8K and observing an accuracy drop larger than a few percent relative to the original model, or failing to reach the stated compression ratios while remaining training-free.

Original abstract

In this paper, we introduce a new post-training compression paradigm for Large Language Models (LLMs) to facilitate their wider adoption. We delve into LLM weight low-rank decomposition, and find that the challenges of this task stem from (1) the distribution variance in the LLM activations and (2) the sensitivity difference among various kinds of layers. To address these issues, we propose a training-free approach called Activation-aware Singular Value Decomposition (ASVD). Specifically, ASVD manages activation outliers by transforming the weight matrix based on the activation distribution. This transformation allows the outliers in the activation matrix to be absorbed into the transformed weight matrix, thereby enhancing decomposition accuracy. Additionally, we propose an efficient iterative calibration process to optimize layer-specific decomposition by addressing the varying sensitivity of different LLM layers. In this way, ASVD can compress a network by 10%-30%. Based on the success of the low-rank decomposition of projection matrices in the self-attention module, we further introduce ASVD to compress the KV cache. By reducing the channel dimension of KV activations, memory requirements for KV cache can be largely reduced. ASVD can further achieve 50% KV cache reductions without performance drop in a training-free manner.
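
One way the "reduced channel dimension" for the KV cache could work is sketched below. The sketch makes several assumptions that are not confirmed by the abstract: the key projection is exactly low-rank (`W_k = A @ B`), no positional rotation is applied between projection and caching, and all names and sizes are hypothetical. The cache stores the rank-sized intermediate `X @ A` instead of the full keys, and the second factor `B` is applied on the query side.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_k, rank, n_tokens = 64, 64, 32, 10  # illustrative sizes

# Hypothetical low-rank key projection, exact by construction for clarity.
A = rng.standard_normal((d_model, rank))
B = rng.standard_normal((rank, d_k))
W_k = A @ B

X = rng.standard_normal((n_tokens, d_model))   # hidden states of past tokens
q = rng.standard_normal(d_k)                   # query vector of the current token

full_cache = X @ W_k       # (n_tokens, d_k): what a standard KV cache stores
small_cache = X @ A        # (n_tokens, rank): reduced-channel cache

scores_full = full_cache @ q
scores_small = small_cache @ (B @ q)           # B is applied to the query instead

print("cache size ratio:", small_cache.size / full_cache.size)
print("max score difference:", np.abs(scores_full - scores_small).max())
```

With `rank = d_k / 2` the cached tensor is half the size while the attention scores match up to floating-point error. In a real model the projection is only approximately low-rank, so the scores would differ by the truncation error.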

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces Activation-aware Singular Value Decomposition (ASVD), a training-free post-training method for compressing LLMs. It identifies challenges from activation distribution variance and layer sensitivity differences, then transforms weight matrices based on activation statistics to absorb outliers before performing SVD, followed by an iterative calibration process to select layer-specific ranks. The approach is claimed to enable 10-30% network compression and, by extending the low-rank idea to self-attention projection matrices, 50% KV cache reduction without performance drop.

Significance. If the central claims are supported by rigorous experiments, the work would offer a practical, training-free compression technique that could facilitate wider deployment of LLMs on hardware with limited memory, complementing existing quantization and pruning approaches.

major comments (2)
  1. [§3.2] The core transformation step (weight matrix modified according to activation distribution to absorb outliers) is presented as improving SVD accuracy, but the manuscript provides no invariance proof, error bound, or demonstration that the subsequent compensation during inference exactly preserves the original forward pass on real activations. This is load-bearing for the claim that the method reduces effective approximation error without introducing bias.
  2. [§3.3] The iterative calibration process for optimizing per-layer decomposition is described at a high level as addressing varying layer sensitivity, yet no details are given on the objective function, convergence guarantees, or safeguards against systematic mismatch between the transformed space and the original activation distribution. This directly affects whether the training-free claim holds without hidden validation requirements.
minor comments (2)
  1. The abstract states compression ranges (10%-30%, 50% KV cache) but would be strengthened by explicit reference to the quantitative metrics (e.g., perplexity delta or zero-shot accuracy) reported in the experiments section.
  2. Notation for the activation-aware transformation and the calibration loop should be introduced with explicit equations and variable definitions to improve readability.
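
Major comment 1 asks whether the scaling step preserves the forward pass before truncation. A small numerical check of that property can be sketched as follows, assuming (none of this is confirmed by the manuscript) a diagonal scaling T built from mean absolute activations, with W' = W T and A' = T^{-1} A. The sketch also evaluates one simple bound on the output error after rank-k truncation, σ_{k+1}(W') · ||A'||_F. This is a generic operator-norm bound from linear algebra, not a result from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)
W = rng.standard_normal((32, 48))                       # (out, in) weight
A = rng.standard_normal((48, 200)) * rng.uniform(0.1, 20.0, size=(48, 1))

t = np.abs(A).mean(axis=1)     # assumed diagonal of T (one scale per input channel)
W_t = W * t                    # W' = W T
A_t = A / t[:, None]           # A' = T^{-1} A

# Before truncation the product is unchanged up to floating-point rounding.
exact_gap = np.abs(W @ A - W_t @ A_t).max()

# After rank-k truncation of W', the output error is bounded by
# sigma_{k+1}(W') * ||A'||_F, since ||E A'||_F <= ||E||_2 * ||A'||_F.
k = 16
U, sigma, Vt = np.linalg.svd(W_t, full_matrices=False)
W_t_k = (U[:, :k] * sigma[:k]) @ Vt[:k]
out_err = np.linalg.norm(W @ A - W_t_k @ A_t)
bound = sigma[k] * np.linalg.norm(A_t)

print("pre-truncation gap:", exact_gap)
print("output error:", out_err, "<= bound:", bound)
```

The bound is usually loose, because it assumes every activation column aligns with the worst discarded direction. It does show, however, that the error is controlled by quantities in the transformed space only.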

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment point by point below and agree that additional formal analysis and implementation details will improve clarity. We will incorporate the suggested revisions in the next version of the paper.

Point-by-point responses
  1. Referee: [§3.2] The core transformation step (weight matrix modified according to activation distribution to absorb outliers) is presented as improving SVD accuracy, but the manuscript provides no invariance proof, error bound, or demonstration that the subsequent compensation during inference exactly preserves the original forward pass on real activations. This is load-bearing for the claim that the method reduces effective approximation error without introducing bias.

    Authors: We appreciate the referee highlighting the need for a rigorous justification of the transformation. The transformation uses an invertible mapping T derived from activation statistics (typically a diagonal scaling matrix based on per-channel activation magnitudes or norms), chosen so that the original matrix product is preserved exactly: if W' = W T and the corresponding activation is scaled as A' = T^{-1} A, then W A = W' A' holds identically. The SVD is performed on the transformed W', and at inference T^{-1} is folded into the low-rank factors, so the only deviation from the original forward pass is the truncation error. We acknowledge that the current manuscript presents this at a conceptual level without an explicit invariance proof, error bound derivation, or side-by-side demonstration on real activations. In the revised manuscript we will add a new subsection in §3.2 containing (i) the formal proof of exact pre-approximation equivalence, (ii) a first-order error bound relating the SVD truncation error in the transformed space to the original space, and (iii) empirical verification showing that the compensated reconstruction introduces no systematic bias on held-out activations.
    revision: yes

  2. Referee: [§3.3] The iterative calibration process for optimizing per-layer decomposition is described at a high level as addressing varying layer sensitivity, yet no details are given on the objective function, convergence guarantees, or safeguards against systematic mismatch between the transformed space and the original activation distribution. This directly affects whether the training-free claim holds without hidden validation requirements.

    Authors: We thank the referee for noting the lack of algorithmic detail. The iterative calibration allocates per-layer ranks under a global compression budget by repeatedly evaluating layer sensitivity, defined as the increase in perplexity (or downstream task loss) on a small calibration set when the rank of that layer is reduced. The objective is to minimize the sum of these sensitivity-weighted errors subject to the total parameter budget; at each iteration the layer with the lowest sensitivity-per-parameter is selected for rank reduction until the budget is met or a performance threshold is reached. Convergence is monitored empirically by tracking the change in calibration perplexity between iterations and stopping when it falls below a small epsilon. Because the transformation already aligns the activation statistics, mismatch is mitigated by re-computing activation statistics after each rank change on the same calibration samples. We agree that these elements are described only at a high level. In the revision we will expand §3.3 with the exact objective function, pseudocode of the iterative procedure, empirical convergence plots, and explicit safeguards (including re-calibration of activation statistics after each adjustment). The procedure remains strictly training-free: no gradients are computed and no model parameters are updated; only rank selection is performed using a fixed calibration set, which is standard practice for post-training compression methods.
    revision: yes
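
The greedy allocation described in the second response can be sketched as below. This is a toy illustration, not the authors' procedure: the rebuttal defines sensitivity as the perplexity increase on a calibration set, whereas this sketch substitutes a cheap proxy (the squared singular values that a rank reduction would discard) so that it runs without a model. The function name, the `step` parameter, and the two toy layers are all assumptions.

```python
import numpy as np

def greedy_rank_allocation(singular_values, shapes, param_budget, step=1):
    """Lower per-layer ranks greedily until the factored size fits the budget.

    singular_values: list of 1-D arrays in descending order, one per layer.
    shapes: list of (m, n) weight shapes; rank r costs r * (m + n) parameters.
    """
    ranks = [len(s) for s in singular_values]

    def total_params():
        return sum(r * (m + n) for r, (m, n) in zip(ranks, shapes))

    while total_params() > param_budget:
        best, best_cost = None, np.inf
        for i, (s, (m, n)) in enumerate(zip(singular_values, shapes)):
            if ranks[i] <= step:
                continue  # never reduce a layer to rank zero
            # Proxy sensitivity: energy lost by dropping the last `step` components.
            dropped = float(np.sum(s[ranks[i] - step:ranks[i]] ** 2))
            cost = dropped / (step * (m + n))  # sensitivity per parameter saved
            if cost < best_cost:
                best, best_cost = i, cost
        if best is None:
            raise ValueError("budget cannot be met with the given layers")
        ranks[best] -= step
    return ranks

# Two toy 8x8 layers: one with a flat spectrum, one that decays quickly.
spectra = [
    np.array([10.0, 9.0, 8.0, 7.0, 6.0, 5.0, 4.0, 3.0]),
    np.array([10.0, 1.0, 1e-1, 1e-2, 1e-3, 1e-4, 1e-5, 1e-6]),
]
shapes = [(8, 8), (8, 8)]
ranks = greedy_rank_allocation(spectra, shapes, param_budget=160)
print(ranks)
```

In this toy case every reduction is taken from the fast-decaying layer, which illustrates why a uniform rank across layers can be wasteful. Replacing the proxy with a measured perplexity change would make each step far more expensive, and the rebuttal does not say how that cost is managed.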

Circularity Check

0 steps flagged

No significant circularity; method is a self-contained heuristic with independent empirical claims

Full rationale

The ASVD approach defines a weight transformation derived from activation statistics to absorb outliers, applies standard SVD to the transformed matrix, and uses an iterative calibration loop to select per-layer ranks based on sensitivity. These operations are specified procedurally without defining any output metric (such as compression ratio or performance) in terms of the inputs by construction. The reported 10-30% compression and 50% KV-cache reduction are presented as experimental outcomes rather than algebraic identities or fitted predictions that reduce to the calibration data. No self-citation chains, uniqueness theorems, or ansatz smuggling appear in the derivation; the central claims remain falsifiable against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The approach rests on standard linear-algebra properties of SVD and the empirical premise that activation outliers can be absorbed by a linear transformation of weights; no new entities or fitted constants are introduced in the abstract.

axioms (2)
  • [standard math] Truncated singular value decomposition provides an optimal low-rank approximation under the Frobenius norm.
    Invoked implicitly when using SVD for weight compression.
  • [domain assumption] Activation distributions can be estimated from a small calibration set and used to guide weight transformation.
    Central to the outlier-absorption step described in the abstract.
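
For reference, the first axiom is the Eckart-Young-Mirsky theorem. It is stated here in notation assumed for this note rather than taken from the paper: W is an m×n matrix with singular values σ_1 ≥ σ_2 ≥ ... and singular vectors u_i, v_i, and W_k is its rank-k truncation.

```latex
W = \sum_{i=1}^{\min(m,n)} \sigma_i\, u_i v_i^{\top},
\qquad
W_k = \sum_{i=1}^{k} \sigma_i\, u_i v_i^{\top},
\qquad
\min_{\operatorname{rank}(B) \le k} \lVert W - B \rVert_F
  = \lVert W - W_k \rVert_F
  = \Bigl( \sum_{i > k} \sigma_i^{2} \Bigr)^{1/2}.
```

The guarantee concerns error in the weights themselves, not in the layer output W X. That gap is what the activation-aware scaling is meant to address.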



Forward citations

Cited by 20 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. ENEC: A Lossless AI Model Compression Method Enabling Fast Inference on Ascend NPUs

    cs.AR 2026-03 unverdicted novelty 7.0

    ENEC delivers 3.43X higher throughput than DietGPU and 1.12X better compression ratio than nvCOMP for lossless model weight compression on Ascend NPUs, yielding up to 6.3X end-to-end inference speedup.

  2. Why Smaller Is Slower? Dimensional Misalignment in Compressed LLMs

    cs.DC 2026-03 unverdicted novelty 7.0

    Dimensional misalignment slows compressed LLMs on GPUs; GAC uses knapsack optimization to achieve full alignment and up to 1.5x speedup on Llama-3-8B while preserving quality.

  3. Decomposed Trust: Privacy, Adversarial Robustness, Ethics, and Fairness in Low-Rank LLMs

    cs.LG 2025-11 unverdicted novelty 7.0

    Low-rank compression preserves training-data privacy and improves adversarial robustness but weakens personal-information protection, reduces ethical behavior in zero-shot use, and harms fairness.

  4. SAFE-SVD: Sensitivity-Aware Fidelity-Enforcing SVD for Physics Foundation Models

    cs.LG 2026-05 unverdicted novelty 6.0

    SAFE-SVD introduces a sensitivity-aware fidelity-enforcing SVD framework for compressing physics foundation models that maintains higher accuracy than standard methods at greater compression ratios.

  5. OSCAR: Offline Spectral Covariance-Aware Rotation for 2-bit KV Cache Quantization

    cs.LG 2026-05 unverdicted novelty 6.0

    OSCAR achieves near-BF16 accuracy for 2-bit KV cache quantization by using offline spectral covariance-aware rotations aligned with attention, plus a custom deployable INT2 kernel compatible with paged serving.

  6. Relative Kinetic Utility for Reasoning-Aware Structural Pruning in Large Language Models

    cs.LG 2026-05 unverdicted novelty 6.0

    RKU is a curvature-aware structural pruning framework that improves LLM reasoning accuracy at 40% sparsity, reaching 13.34% on GSM8K while outperforming baselines and better preserving out-of-distribution representations.

  7. Different Prompts, Different Ranks: Prompt-aware Dynamic Rank Selection for SVD-based LLM Compression

    cs.LG 2026-05 unverdicted novelty 6.0

    PARSE trains a prompt-aware linear router on dense-model outputs to select dynamic SVD ranks, improving accuracy up to 10% at 0.6 compression ratio on LLaMA-7B while delivering 2.5x prefill and 2.4x decode speedups.

  8. FlashSVD v1.5: Making Low-Rank Transformers Inference Actually Fast

    cs.LG 2026-05 unverdicted novelty 6.0

    FlashSVD v1.5 delivers up to 2.55x faster autoregressive decode and 2.39x end-to-end speedup for SVD-compressed transformers by reorganizing execution paths with dense-KV decode, packed MLP kernels, and per-layer CUDA graphs.

  9. Bayesian Fine-tuning in Projected Subspaces

    cs.LG 2026-05 unverdicted novelty 6.0

    Bayesian fine-tuning of large models can be done efficiently by projecting uncertainties into low-dimensional subspaces, yielding improved calibration and generalization while keeping computational costs low.

  10. Gated Subspace Inference for Transformer Acceleration

    cs.LG 2026-05 unverdicted novelty 6.0

    Gated Subspace Inference accelerates transformer linear layers 3-10x via low-rank cached subspace computation and per-token gating to skip residuals while preserving output distribution to high accuracy.

  11. Importance-Guided Basis Selection for Low-Rank Decomposition of Large Language Models

    cs.LG 2026-05 unverdicted novelty 6.0

    BSI ranks singular-vector bases for LLM low-rank compression by estimating expected task loss increase via second-order Taylor expansion of the loss and an efficient Hessian-diagonal estimator, outperforming magnitude...

  12. Technical Report: Activation Residual Hessian Quantization (ARHQ) for Low-Bit LLM Quantization

    cs.LG 2026-04 unverdicted novelty 6.0

    ARHQ isolates error-sensitive weight directions in LLMs via truncated SVD on the scaled matrix W G_x^{1/2} from activation residuals, improving SNR and preserving performance under aggressive low-bit quantization.

  13. SLaB: Sparse-Lowrank-Binary Decomposition for Efficient Large Language Models

    cs.LG 2026-04 unverdicted novelty 6.0

    SLaB compresses LLM weights via sparse-lowrank-binary decomposition guided by activation-aware scores, achieving up to 36% lower perplexity than prior methods at 50% compression on Llama models.

  14. CoSpaDi: Compressing LLMs via Calibration-Guided Sparse Dictionary Learning

    cs.CL 2025-09 conditional novelty 6.0

    CoSpaDi introduces a training-free sparse dictionary learning framework for post-training LLM compression that optimizes functional reconstruction error via activation-derived orthonormalization and achieves improved ...

  15. IMPACT: Importance-Aware Activation Space Reconstruction

    cs.LG 2025-07 unverdicted novelty 6.0

    IMPACT derives a closed-form low-rank activation reconstruction from an importance-weighted covariance matrix to achieve higher compression ratios than standard methods while maintaining model accuracy.

  16. Fast Tensorization of Neural Networks via Slice-wise Feature Distillation

    cs.LG 2026-05 unverdicted novelty 5.0

    A slice-wise feature distillation framework for independent tensorization of neural network slices to achieve scalable compression with reduced fine-tuning costs.

  17. DeInfer: Efficient Parallel Inferencing for Decomposed Large Language Models

    cs.CL 2026-04 unverdicted novelty 4.0

    DeInfer adds multiple optimizations to improve parallel inference speed for decomposed LLMs while remaining compatible with existing techniques.

  18. Network Edge Inference for Large Language Models: Principles, Techniques, and Opportunities

    cs.DC 2026-04 unverdicted novelty 3.0

    A survey synthesizing challenges, system architectures, model optimizations, deployment methods, and resource management techniques for large language model inference at the network edge.

  19. NeuronMLP: Efficient LLM Inference via Singular Value Decomposition Compression and Tiling on AWS Trainium

    cs.CL 2025-10 unverdicted novelty 3.0

    NeuronMLP applies SVD-based compression and Trainium-specific tiling and caching to MLP layers, delivering 1.35x kernel speedup and 1.21x end-to-end inference speedup at 0.05 compression ratio versus AWS NKI baseline.

  20. A Survey on Efficient Inference for Large Language Models

    cs.CL 2024-04 accept novelty 3.0

    The paper surveys techniques to speed up and reduce the resource needs of LLM inference, organized by data-level, model-level, and system-level changes, with comparative experiments on representative methods.
