IMPACT: Importance-Aware Activation Space Reconstruction

Daniel Agyei Asante; Ernie Chang; Md Mokarram Chowdhury; Yang Li

arxiv: 2507.03828 · v4 · submitted 2025-07-04 · 💻 cs.LG · stat.ML

Pith reviewed 2026-05-19 05:28 UTC · model grok-4.3

keywords: LLM compression, activation reconstruction, low-rank approximation, importance weighting, gradient importance, model efficiency

The pith

IMPACT reconstructs LLM activations using a gradient-weighted covariance matrix to achieve low-rank compression that better preserves accuracy.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that LLM activations show a clearer low-rank structure than weights, so compression should target activation reconstruction error instead of weight error. It further claims that activation dimensions are not equal in their effect on performance, so uniform treatment wastes potential accuracy. IMPACT turns this into an optimization that folds in gradient-derived importance scores and solves for the best low-rank bases in closed form via a weighted covariance matrix. If correct, this produces compression that directly protects task accuracy rather than just minimizing reconstruction error.

Core claim

IMPACT formulates compression as an optimization problem that integrates activation structure with gradient-based importance, deriving a closed-form solution where reconstruction bases arise from an importance-weighted activation covariance matrix. This yields low-rank compression explicitly optimized for accuracy preservation.

What carries the argument

importance-weighted activation covariance matrix, from which the optimal low-rank reconstruction bases are computed in closed form
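
As a rough illustration of the idea, the sketch below builds a weighted covariance of the form C = Cov(y) ⊙ M and takes its leading eigenvectors as the reconstruction basis. The importance matrix used here, M = GᵀG / n built from per-sample gradients, is an assumption made for illustration and not the paper's stated definition; the function names `importance_weighted_bases` and `factorize_linear` are likewise hypothetical.

```python
import numpy as np


def importance_weighted_bases(Y, G, k):
    """Top-k eigenvectors of C = Cov(Y) ⊙ M (⊙ is the element-wise product).

    Y: (n, d) activations collected on a calibration set.
    G: (n, d) gradients of the loss with respect to those activations.
    M = G.T @ G / n is an ASSUMED importance matrix, chosen for illustration.
    """
    n = Y.shape[0]
    Yc = Y - Y.mean(axis=0, keepdims=True)   # center the activations
    cov = Yc.T @ Yc / n                      # (d, d) activation covariance
    M = G.T @ G / n                          # (d, d) assumed importance matrix
    C = cov * M                              # Hadamard product; PSD because both factors are PSD
    eigvals, eigvecs = np.linalg.eigh(C)     # eigh returns eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:k]    # indices of the k largest eigenvalues
    return eigvecs[:, order], eigvals[order]


def factorize_linear(W, U):
    """Replace W (d_out, d_in) by the pair (U, U.T @ W), so that W ≈ U @ (U.T @ W)."""
    return U, U.T @ W
```

With `U` of shape `(d_out, k)`, the layer `y = W x` becomes two smaller matrix products, `U @ ((U.T @ W) @ x)`, which is where the size reduction would come from.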

If this is right

Up to 55.4 percent greater size reduction is possible while accuracy stays comparable to or better than baselines.
The closed-form solution removes the need for iterative solvers during compression.
Compression decisions are tied directly to measured effects on model outputs via the importance weights.
The approach works across multiple models and tasks in the reported experiments.
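
To make the link between rank and model size concrete, here is a small arithmetic sketch for a single linear layer factorized into two rank-k matrices. The layer dimensions are illustrative assumptions, and the helper names are hypothetical; none of this reproduces the paper's reported figures.

```python
def low_rank_params(d_out, d_in, k):
    """Parameters in a rank-k factorization: a (d_out, k) matrix plus a (k, d_in) matrix."""
    return k * (d_out + d_in)


def compression_ratio(d_out, d_in, k):
    """Fraction of the dense layer's parameters removed by the rank-k factorization."""
    return 1.0 - low_rank_params(d_out, d_in, k) / (d_out * d_in)


def break_even_rank(d_out, d_in):
    """Largest k for which the factorized layer is strictly smaller than the dense one."""
    return (d_out * d_in - 1) // (d_out + d_in)


# Illustrative 4096 x 4096 layer (an assumed size, not taken from the paper).
print(compression_ratio(4096, 4096, 1024))  # 0.5
print(break_even_rank(4096, 4096))          # 2047
```

The break-even rank shows why a low-rank method only saves space when the retained rank is well below half the layer width for square layers.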

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same weighting idea could be tried inside quantization or pruning pipelines to improve their accuracy-size trade-offs.
Calibration set design becomes critical; using only a narrow slice of data might lock in importance scores that miss rare but high-impact patterns.
The method might transfer to other sequence models where activation statistics are similarly low-rank but importance varies.
Testing whether the derived bases remain stable when the model is later fine-tuned would check long-term usefulness.

Load-bearing premise

Gradient importance scores computed on a calibration set continue to reflect each activation dimension's true contribution to performance on all future inputs and tasks.

What would settle it

Running the compressed models on held-out tasks or data distributions far from the calibration set and finding larger accuracy drops than standard low-rank baselines would show the importance weighting does not generalize.
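
A cheaper first check than full held-out evaluation, offered here as an editorial sketch rather than anything the paper describes, is to compute bases from two different calibration sets and measure how much the resulting subspaces agree. The function name `subspace_overlap` is hypothetical, and the inputs are assumed to be orthonormal bases of equal rank.

```python
import numpy as np


def subspace_overlap(U_a, U_b):
    """Mean squared cosine of the principal angles between two subspaces.

    U_a, U_b: (d, k) matrices with orthonormal columns.
    Returns 1.0 when the subspaces coincide and 0.0 when they are orthogonal.
    """
    # Singular values of U_a.T @ U_b are the cosines of the principal angles.
    cosines = np.linalg.svd(U_a.T @ U_b, compute_uv=False)
    return float(np.mean(cosines ** 2))
```

A low overlap between bases computed on, say, a code-heavy and a math-heavy calibration set would suggest the importance weights are sensitive to calibration data, which is the failure mode described above. High overlap would not by itself prove that accuracy generalizes.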

Figures

Figures reproduced from arXiv: 2507.03828 by Daniel Agyei Asante, Ernie Chang, Md Mokarram Chowdhury, Yang Li.

**Figure 2.** Pass@1 accuracy and model size of Llama 2-7B compressed with various low-rank algorithms on the ...

**Figure 3.** Pass@1 accuracy and model size of Llama 2-13B compressed with various low-rank algorithms on the ...

**Figure 4.** Pass@1 accuracy and model size of CodeLlama-7B compressed with various low-rank algorithms on the ...

**Figure 5.** Pass@1 accuracy and model size of CodeLlama-13B compressed with various low-rank algorithms on the ...

**Figure 6.** Pass@1 accuracy and model size of Llama 2-7B models compressed using quantization alone, as well as in ...

**Figure 7.** Throughput and memory consumption of com...

**Figure 8.** Pass@1 accuracy and model size of Llama 2-13B models compressed using quantization alone, as well as in ...

Original abstract

Large language models (LLMs) achieve strong performance across diverse domains but remain difficult to deploy in resource-constrained environments due to their size. Low-rank compression is a common remedy, typically minimizing weight reconstruction error under the assumption that weights are low-rank. However, this assumption often does not hold in LLMs. In contrast, LLM activations exhibit a more pronounced low-rank structure, motivating approaches that minimize activation reconstruction error. This shift alone, however, is not sufficient: different activation dimensions contribute unequally to model performance, and treating them uniformly can lead to accuracy loss. We introduce IMPACT, an importance-aware activation reconstruction framework that links compression to its effect on model performance. IMPACT formulates compression as an optimization problem that integrates activation structure with gradient-based importance, deriving a closed-form solution where reconstruction bases arise from an importance-weighted activation covariance matrix. This yields low-rank compression explicitly optimized for accuracy preservation. Experiments across multiple models and tasks demonstrate that IMPACT achieves up to 55.4% greater model size reduction while maintaining accuracy comparable to or better than state-of-the-art baselines.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

IMPACT derives a closed-form low-rank basis from gradient-weighted activation covariance and reports stronger compression than plain activation methods, but the weights' reliability across inputs is untested.

The one thing to know is that this paper takes standard low-rank activation compression and weights the covariance matrix by gradient importance scores to get a closed-form solution for the reconstruction bases. That produces low-rank factors explicitly tied to preserving accuracy rather than just minimizing reconstruction error in the usual sense. They show this on several LLMs and tasks, claiming up to 55% more size reduction at comparable accuracy to baselines. The math step itself is straightforward once you accept the weighting, and the empirical numbers are the part that stands out as usable right away.

What they do well is recognize that not all activation dimensions matter equally and then derive the bases from that weighted matrix instead of treating everything uniformly. The experiments back the claim with concrete compression gains across models.

The soft spot is the dependence on gradient importance computed from a calibration set. If those scores do not track actual downstream contribution when inputs or tasks change, the closed-form solution optimizes the wrong thing and the accuracy preservation does not follow. The abstract and setup give no details on calibration set size, diversity, or any shift tests, so that assumption carries the result.

This paper is aimed at people doing practical activation compression for LLM inference under tight memory or latency limits. A reader who already works with low-rank methods or edge deployment would get direct value from the numbers and the simple implementation path. It deserves a serious referee because the idea is testable, the derivation is explicit, and the experiments are broad enough to check. I would send it to peer review with a request for more on how the importance scores hold up outside the calibration data.

Referee Report

2 major / 2 minor

Summary. The paper introduces IMPACT, a framework for low-rank compression of large language models via importance-aware activation space reconstruction. It formulates compression as an optimization problem that combines activation structure with gradient-based importance scores, deriving a closed-form solution in which the reconstruction bases are obtained from an importance-weighted activation covariance matrix. This is claimed to yield compression explicitly optimized for accuracy preservation. Experiments across multiple models and tasks report up to 55.4% greater model size reduction while maintaining accuracy comparable to or better than state-of-the-art baselines.

Significance. If the closed-form derivation is correct and the gradient-based importance weights generalize reliably beyond the calibration set, the approach would represent a meaningful advance over uniform activation or weight reconstruction methods by directly linking compression to downstream performance. The explicit optimization for accuracy preservation and the reported empirical gains in compression ratio could have practical value for efficient LLM deployment. The strength lies in the attempt to move beyond heuristic low-rank assumptions toward a performance-aware objective.

major comments (2)

[Experimental evaluation and importance score computation] The central claim that the importance-weighted covariance produces bases that explicitly preserve accuracy rests on the assumption that gradient-based importance scores computed on a calibration set reliably proxy each activation dimension's contribution to final task performance. The manuscript provides no details on calibration set size, diversity, or validation against distribution shift (e.g., in the experimental section or ablation studies), leaving open the possibility that the scores are brittle and the derived solution optimizes a mis-specified objective.
[Formulation and closed-form derivation] The derivation of the closed-form solution (integrating activation covariance with gradient importance) must be shown to avoid circularity, since the importance weights themselves derive from model gradients. Without explicit steps demonstrating that the weighting is independent of the evaluation data used for final accuracy reporting, the optimization risks reducing to self-referential fitting rather than an independent prediction of accuracy preservation.

minor comments (2)

[Method] Clarify the precise definition of the importance weighting function and how it is normalized before incorporation into the covariance matrix.
[Experiments] Include ablation studies isolating the contribution of the importance weighting versus standard activation reconstruction error minimization.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments. We address each major comment below and indicate the revisions planned for the next version of the manuscript.

Referee: [Experimental evaluation and importance score computation] The central claim that the importance-weighted covariance produces bases that explicitly preserve accuracy rests on the assumption that gradient-based importance scores computed on a calibration set reliably proxy each activation dimension's contribution to final task performance. The manuscript provides no details on calibration set size, diversity, or validation against distribution shift (e.g., in the experimental section or ablation studies), leaving open the possibility that the scores are brittle and the derived solution optimizes a mis-specified objective.

Authors: We agree that the manuscript would benefit from explicit documentation of the calibration procedure. In the revised version we will add a dedicated paragraph in the experimental section specifying the calibration set size, its task and domain composition, and new ablation results that evaluate importance-score stability under distribution shifts between calibration and test data.
Revision: yes

Referee: [Formulation and closed-form derivation] The derivation of the closed-form solution (integrating activation covariance with gradient importance) must be shown to avoid circularity, since the importance weights themselves derive from model gradients. Without explicit steps demonstrating that the weighting is independent of the evaluation data used for final accuracy reporting, the optimization risks reducing to self-referential fitting rather than an independent prediction of accuracy preservation.

Authors: The importance weights are obtained from gradients on a calibration set that is disjoint from all evaluation sets used for final accuracy reporting. The closed-form derivation in Section 3 operates solely on this calibration-derived weighted covariance. We will expand the derivation subsection with an explicit enumeration of the data-flow steps, clearly separating the calibration phase from the held-out evaluation phase, to remove any ambiguity regarding independence.
Revision: yes

Circularity Check

0 steps flagged

No significant circularity in derivation chain

The paper formulates compression as an explicit optimization problem that incorporates activation covariance structure together with separately computed gradient-based importance weights, then derives the closed-form reconstruction bases as the principal components of the resulting importance-weighted matrix. This is a direct algebraic solution to the stated objective rather than a reduction of the claimed result to its own inputs by construction. No self-definitional loop, fitted parameter renamed as prediction, or load-bearing self-citation is present in the abstract or described method. The importance scores function as an independent input derived from gradients on a calibration set, and the accuracy-preservation claim rests on the optimization itself rather than tautological equivalence. The derivation remains self-contained with independent mathematical content.
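
One way to see how a Hadamard-weighted covariance can admit a closed-form basis is sketched below. It is an editorial reading under stated assumptions, not the paper's derivation: the importance matrix is taken to be M = E[g gᵀ] for an activation gradient g, the activation y is centered, and g is treated as independent of y. The symbols z, g, and U_k are introduced here for illustration.

```latex
\begin{aligned}
(g \odot y)(g \odot y)^{\top} &= (g g^{\top}) \odot (y y^{\top}), \\
\mathbb{E}\!\left[ z z^{\top} \right]
  &= \mathbb{E}[g g^{\top}] \odot \mathbb{E}[y y^{\top}]
   = \operatorname{Cov}(y) \odot M = C,
  \qquad z = g \odot y, \\
U_k^{\star} &= \arg\min_{U^{\top} U = I_k}
  \mathbb{E}\,\bigl\| z - U U^{\top} z \bigr\|_2^{2}
  \;\Longrightarrow\;
  U_k^{\star} = \text{top-}k \text{ eigenvectors of } C .
\end{aligned}
```

The first line is an exact identity; the second relies on the independence assumption; the third is the standard principal-component result applied to the reweighted activation z. Under this reading the weights enter only through calibration-set statistics, which is consistent with the finding that the derivation is not circular.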

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on the low-rank structure of activations and the validity of gradient-derived importance as a proxy for performance contribution. No explicit free parameters or invented entities are named in the abstract, but the importance computation implicitly depends on calibration data choice.

axioms (2)

domain assumption: LLM activations exhibit a more pronounced low-rank structure than weights.
Stated directly in the abstract as motivation for shifting from weight to activation reconstruction.

domain assumption: Gradient-based importance scores reliably indicate each activation dimension's contribution to model performance.
Used to weight the covariance matrix; this is the load-bearing link between compression and accuracy preservation.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: "optimal reconstruction bases are the eigenvectors of an importance-weighted activation covariance matrix C = Cov(y) ⊙ M"

IndisputableMonolith/Foundation/AlphaCoordinateFixation.lean · J_uniquely_calibrated_via_higher_derivative
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: "closed-form solution where reconstruction bases arise from an importance-weighted activation covariance matrix"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

30 extracted references · 30 canonical work pages · 6 internal anchors

[1] Anish Acharya, Rahul Goel, Angeliki Metallinou, and Inderjit Dhillon. Online Embedding Compression for Text Classification Using Low Rank Matrix Factorization. In Thirty-Third AAAI Conference on Artificial Intelligence and Thirty-First Innovative Applications of Artificial Intelligence Conference and Ninth AAAI Symposium on Educational Advances in Artifi... 2019
[2] Yongqi An, Xu Zhao, Tao Yu, Ming Tang, and Jinqiao Wang. Fluctuation-based Adaptive Structured Pruning for Large Language Models. In AAAI Conference on Artificial Intelligence, 2024
[3] Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie Cai, Michael Terry, Quoc Le, and Charles Sutton. Program Synthesis with Large Language Models. arXiv preprint arXiv:2108.07732, 2021
[4] Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. Evaluating Large Language Models Trained on Code. arXiv preprint arXiv:2107.03374, 2021a
[5] Patrick Chen, Si Si, Yang Li, Ciprian Chelba, and Cho-Jui Hsieh. GroupReduce: Block-Wise Low-Rank Approximation for Neural Language Model Shrinking. In Advances in Neural Information Processing Systems (NeurIPS), 2018
[6] Patrick Chen, Hsiang-Fu Yu, Inderjit Dhillon, and Cho-Jui Hsieh. DRONE: Data-Aware Low-Rank Compression for Large NLP Models. In Advances in Neural Information Processing Systems (NeurIPS), 2021b
[7] Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. Training Verifiers to Solve Math Word Problems. arXiv preprint arXiv:2110.14168, 2021
[8] Emily Denton, Wojciech Zaremba, Joan Bruna, Yann LeCun, and Rob Fergus. Exploiting linear structure within convolutional networks for efficient evaluation. In International Conference on Neural Information Processing Systems (NeurIPS), 2014
[9] Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, and Luke Zettlemoyer. QLoRA: Efficient Finetuning of Quantized LLMs. In Advances in Neural Information Processing Systems (NeurIPS), 2023
[10] Gene H. Golub and Charles F. Van Loan. Matrix Computations. Johns Hopkins University Press, 1983. ISBN 978-0-8018-3010-9
[11] Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring Mathematical Problem Solving with the MATH Dataset. In Conference on Neural Information Processing Systems (NeurIPS), 2021
[12] Yen-Chang Hsu, Ting Hua, Sungen Chang, Qian Lou, Yilin Shen, and Hongxia Jin. Language Model Compression with Weighted Low-Rank Factorization. In International Conference on Learning Representations (ICLR), 2022
[13] Shaoyi Huang, Shiyang Chen, Hongwu Peng, Daniel Manu, Zhenglun Kong, Geng Yuan, Lei Yang, Shusen Wang, Hang Liu, and Caiwen Ding. HMC-TRAN: A Tensor-core Inspired Hierarchical Model Compression for Transformer-based DNNs on GPU. In Great Lakes Symposium on VLSI (GLSVLSI), 2021
[14] Max Jaderberg, Andrea Vedaldi, and Andrew Zisserman. Speeding up Convolutional Neural Networks with Low Rank Expansions. In British Machine Vision Conference (BMVC), 2014
[15] Yong-Deok Kim, Eunhyeok Park, Sungjoo Yoo, Taelim Choi, Lu Yang, and Dongjun Shin. Compression of Deep Convolutional Neural Networks for Fast and Low Power Mobile Applications. In Yoshua Bengio and Yann LeCun, editors, International Conference on Learning Representations (ICLR), 2016
[16] Hailong Li, Jaewan Choi, Yongsuk Kwon, and Jung Ho Ahn. A Hardware-Friendly Tiled Singular-Value Decomposition-Based Matrix Multiplication for Transformer-Based Models. IEEE Computer Architecture Letters (CAL), 22:169-172, 2023
[17] Chi-Heng Lin, Shangqian Gao, James Seale Smith, Abhishek Patel, Shikhar Tuli, Yilin Shen, Hongxia Jin, and Yen-Chang Hsu. MoDeGPT: Modular Decomposition for Large Language Model Compression. In International Conference on Learning Representations (ICLR), 2025
[18] Zhiyun Lu, Vikas Sindhwani, and Tara N Sainath. Learning Compact Recurrent Neural Networks. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2016
[19] Xiuqing Lv, Peng Zhang, Sunzhu Li, Guobing Gan, and Yueheng Sun. LightFormer: Light-weight Transformer Using SVD-based Weight Transfer and Parameter Sharing. In Findings of the Association for Computational Linguistics (ACL), 2023
[20] Matan Ben Noach and Yoav Goldberg. Compressing Pre-trained Language Models by Matrix Decomposition. In 1st Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 10th International Joint Conference on Natural Language Processing (AACL-IJCNLP), 2020
[21] Baptiste Roziere, Jonas Gehring, Fabian Gloeckle, Sten Sootla, Itai Gat, Xiaoqing Ellen Tan, Yossi Adi, Jingyu Liu, Romain Sauvestre, Tal Remez, et al. Code Llama: Open Foundation Models for Code. arXiv preprint arXiv:2308.12950, 2023
[22] Pratyusha Sharma, Jordan T. Ash, and Dipendra Misra. The Truth is in there: Improving Reasoning in Language Models with Layer-Selective Rank Reduction. In International Conference on Learning Representations (ICLR), 2024
[23] Cheng Tai, Tong Xiao, Yi Zhang, Xiaogang Wang, and Weinan E. Convolutional Neural Networks With Low-rank Regularization. In International Conference on Learning Representations (ICLR), 2016
[24] Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open Foundation and Fine-Tuned Chat Models. arXiv preprint arXiv:2307.09288, 2023
[25] Hongyi Wang, Saurabh Agarwal, and Dimitris Papailiopoulos. Pufferfish: Communication-efficient Models at No Extra Cost. In Conference on Machine Learning and Systems (MLSys), 2021
[26] Wei Wen, Cong Xu, Chunpeng Wu, Yandan Wang, Yiran Chen, and Hai Li. Coordinating Filters for Faster Deep Neural Networks. In IEEE International Conference on Computer Vision (ICCV), 2017
[27] Jian Xue, Jinyu Li, and Yifan Gong. Restructuring of Deep Neural Network Acoustic Models with Singular Value Decomposition. In Annual Conference of the International Speech Communication Association (INTERSPEECH), January 2013
[28] Hao Yu and Jianxin Wu. Compressing Transformers: Features Are Low-Rank, But Weights Are Not! In AAAI Conference on Artificial Intelligence, 2023
[29] Zhihang Yuan, Yuzhang Shang, Yue Song, Qiang Wu, Yan Yan, and Guangyu Sun. ASVD: Activation-aware Singular Value Decomposition for Compressing Large Language Models. arXiv preprint arXiv:2312.05821, 2023
[30] Fuzhen Zhang. The Schur Complement and Its Applications, volume 4. Springer Science & Business Media, 2006

[1] [1]

Online Embedding Compression for Text Classification Using Low Rank Matrix Factorization

Anish Acharya, Rahul Goel, Angeliki Metallinou, and Inderjit Dhillon. Online Embedding Compression for Text Classification Using Low Rank Matrix Factorization . In Thirty-Third AAAI Conference on Artificial Intelligence and Thirty-First Innovative Applications of Artificial Intelligence Conference and Ninth AAAI Symposium on Educational Advances in Artifi...

work page 2019

[2] [2]

Fluctuation-based Adaptive Structured Pruning for Large Language Models

Yongqi An, Xu Zhao, Tao Yu, Ming Tang, and Jinqiao Wang. Fluctuation-based Adaptive Structured Pruning for Large Language Models . In AAAI Conference on Artificial Intelligence, 2024

work page 2024

[3] [3]

Program Synthesis with Large Language Models

Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie Cai, Michael Terry, Quoc Le, and Charles Sutton. Program Synthesis with Large Language Models . arXiv preprint arXiv:2108.07732, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021

[4] [4]

Evaluating Large Language Models Trained on Code

Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. Evaluating Large Language Models Trained on Code . arXiv preprint arXiv:2107.03374, 2021 a

work page internal anchor Pith review Pith/arXiv arXiv 2021

[5] [5]

GroupReduce: Block-Wise Low-Rank Approximation for Neural Language Model Shrinking

Patrick Chen, Si Si, Yang Li, Ciprian Chelba, and Cho-Jui Hsieh. GroupReduce: Block-Wise Low-Rank Approximation for Neural Language Model Shrinking . In Advances in Neural Information Processing Systems (NeurIPS) , 2018

work page 2018

[6] [6]

DRONE: Data-Aware Low-Rank Compression for Large NLP Models

Patrick Chen, Hsiang-Fu Yu, Inderjit Dhillon, and Cho-Jui Hsieh. DRONE: Data-Aware Low-Rank Compression for Large NLP Models . In Advances in Neural Information Processing Systems (NeurIPS), 2021 b

work page 2021

[7] [7]

Training Verifiers to Solve Math Word Problems

Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. Training Verifiers to Solve Math Word Problems . arXiv preprint arXiv:2110.14168, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021

[8] [8]

Exploiting linear structure within convolutional networks for efficient evaluation

Emily Denton, Wojciech Zaremba, Joan Bruna, Yann LeCun, and Rob Fergus. Exploiting linear structure within convolutional networks for efficient evaluation . In International Conference on Neural Information Processing Systems (NeurIPS), 2014

work page 2014

[9] [9]

QLoRA: Efficient Finetuning of Quantized LLMs

Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, and Luke Zettlemoyer. QLoRA: Efficient Finetuning of Quantized LLMs . In Advances in neural information processing systems (NeurIPS), 2023

work page 2023

[10] [10]

Golub and Charles F

Gene H. Golub and Charles F. Van Loan. Matrix Computations . Johns Hopkins University Press , 1983. ISBN 978-0-8018-3010-9

work page 1983

[11] [11]

Measuring Mathematical Problem Solving with the MATH Dataset

Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring Mathematical Problem Solving with the MATH Dataset . In Conference on Neural Information Processing Systems (NeurIPS), 2021

work page 2021

[12] [12]

Language Model Compression with Weighted Low-Rank Factorization

Yen-Chang Hsu, Ting Hua, Sungen Chang, Qian Lou, Yilin Shen, and Hongxia Jin. Language Model Compression with Weighted Low-Rank Factorization . In International Conference on Learning Representation (ICLR), 2022

work page 2022

[13] [13]

HMC-TRAN: A Tensor-core Inspired Hierarchical Model Compression for Transformer-based DNNs on GPU

Shaoyi Huang, Shiyang Chen, Hongwu Peng, Daniel Manu, Zhenglun Kong, Geng Yuan, Lei Yang, Shusen Wang, Hang Liu, and Caiwen Ding. HMC-TRAN: A Tensor-core Inspired Hierarchical Model Compression for Transformer-based DNNs on GPU . In Great Lakes Symposium on VLSI (GLSVLSI), 2021

work page 2021

[14] [14]

Speeding up Convolutional Neural Networks with Low Rank Expansions

Max Jaderberg, Andrea Vedaldi, and Andrew Zisserman. Speeding up Convolutional Neural Networks with Low Rank Expansions . In British Machine Vision Conference (BMVC) , 2014

work page 2014

[15] [15]

Compression of Deep Convolutional Neural Networks for Fast and Low Power Mobile Applications

Yong - Deok Kim, Eunhyeok Park, Sungjoo Yoo, Taelim Choi, Lu Yang, and Dongjun Shin. Compression of Deep Convolutional Neural Networks for Fast and Low Power Mobile Applications . In Yoshua Bengio and Yann LeCun, editors, International Conference on Learning Representations (ICLR) , 2016

work page 2016

[16] [16]

A Hardware-Friendly Tiled Singular-Value Decomposition-Based Matrix Multiplication for Transformer-Based Models

Hailong Li, Jaewan Choi, Yongsuk Kwon, and Jung Ho Ahn. A Hardware-Friendly Tiled Singular-Value Decomposition-Based Matrix Multiplication for Transformer-Based Models . IEEE Computer Architecture Letters (CAL), 22: 0 169--172, 2023

work page 2023

[17] Chi-Heng Lin, Shangqian Gao, James Seale Smith, Abhishek Patel, Shikhar Tuli, Yilin Shen, Hongxia Jin, and Yen-Chang Hsu. MoDeGPT: Modular Decomposition for Large Language Model Compression. In International Conference on Learning Representations (ICLR), 2025

[18] Zhiyun Lu, Vikas Sindhwani, and Tara N Sainath. Learning Compact Recurrent Neural Networks. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2016

[19] Xiuqing Lv, Peng Zhang, Sunzhu Li, Guobing Gan, and Yueheng Sun. LightFormer: Light-weight Transformer Using SVD-based Weight Transfer and Parameter Sharing. In Findings of the Association for Computational Linguistics (ACL), 2023

[20] Matan Ben Noach and Yoav Goldberg. Compressing Pre-trained Language Models by Matrix Decomposition. In 1st Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 10th International Joint Conference on Natural Language Processing (AACL-IJCNLP), 2020

[21] Baptiste Roziere, Jonas Gehring, Fabian Gloeckle, Sten Sootla, Itai Gat, Xiaoqing Ellen Tan, Yossi Adi, Jingyu Liu, Romain Sauvestre, Tal Remez, et al. Code Llama: Open Foundation Models for Code. arXiv preprint arXiv:2308.12950, 2023

[22] Pratyusha Sharma, Jordan T. Ash, and Dipendra Misra. The Truth is in there: Improving Reasoning in Language Models with Layer-Selective Rank Reduction. In International Conference on Learning Representations (ICLR), 2024

[23] Cheng Tai, Tong Xiao, Yi Zhang, Xiaogang Wang, and Weinan E. Convolutional Neural Networks With Low-rank Regularization. In International Conference on Learning Representations (ICLR), 2016

[24] Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open Foundation and Fine-Tuned Chat Models. arXiv preprint arXiv:2307.09288, 2023

[25] Hongyi Wang, Saurabh Agarwal, and Dimitris Papailiopoulos. Pufferfish: Communication-efficient Models at No Extra Cost. In Conference on Machine Learning and Systems (MLSys), 2021

[26] Wei Wen, Cong Xu, Chunpeng Wu, Yandan Wang, Yiran Chen, and Hai Li. Coordinating Filters for Faster Deep Neural Networks. In IEEE International Conference on Computer Vision (ICCV), 2017

[27] Jian Xue, Jinyu Li, and Yifan Gong. Restructuring of Deep Neural Network Acoustic Models with Singular Value Decomposition. In Annual Conference of the International Speech Communication Association (INTERSPEECH), January 2013

[28] Hao Yu and Jianxin Wu. Compressing Transformers: Features Are Low-Rank, But Weights Are Not! In AAAI Conference on Artificial Intelligence, 2023

[29] Zhihang Yuan, Yuzhang Shang, Yue Song, Qiang Wu, Yan Yan, and Guangyu Sun. ASVD: Activation-aware Singular Value Decomposition for Compressing Large Language Models. arXiv preprint arXiv:2312.05821, 2023

[30] Fuzhen Zhang. The Schur Complement and Its Applications, volume 4. Springer Science & Business Media, 2006