CoSpaDi: Compressing LLMs via Calibration-Guided Sparse Dictionary Learning
Pith reviewed 2026-05-18 13:14 UTC · model grok-4.3
The pith
CoSpaDi replaces low-rank factorization with a sparse dictionary model that better preserves LLM accuracy at 20-40 percent compression.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Each weight matrix is expressed as the product of a dense dictionary and a column-sparse coefficient matrix, producing a union-of-subspaces representation. The factorization is obtained by minimizing functional reconstruction error of layer outputs on a calibration set; this data-aware objective is converted via activation-derived Gram orthonormalization into a conventional dictionary learning task. The resulting structured sparsity supports efficient sparse-dense computation and post-training quantization of the coefficients while allowing optional cross-layer dictionary sharing.
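A minimal NumPy sketch of the representation described above, assuming hypothetical sizes (`d1`, `d2`, `k`, `s`) and random values rather than a learned factorization. It only illustrates the shape of the model: every column of the approximated weight matrix mixes its own subset of `s` dictionary atoms, which is what makes this a union of subspaces rather than one shared subspace.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes chosen for illustration only (not taken from the paper).
d1, d2 = 64, 96   # weight matrix shape
k, s = 32, 8      # number of dictionary atoms, nonzeros per coefficient column

# Dense dictionary: d1 x k.
D = rng.standard_normal((d1, k))

# Column-sparse coefficients: k x d2 with exactly s nonzeros in every column.
S = np.zeros((k, d2))
for j in range(d2):
    support = rng.choice(k, size=s, replace=False)  # atoms used by column j
    S[support, j] = rng.standard_normal(s)

# Approximated weight matrix: each column lies in the span of its own s atoms.
W_hat = D @ S

# Stored values (ignoring index overhead): dictionary plus nonzero coefficients.
stored_values = D.size + s * d2
dense_values = d1 * d2
nonzeros_per_column = (S != 0).sum(axis=0)
```

With these toy sizes the factorization stores 2816 values against 6144 for the dense matrix; real budgets depend on how the sparse indices are encoded, which this sketch does not model.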
What carries the argument
Calibration-guided sparse dictionary learning that reformulates functional reconstruction error minimization into dictionary learning on Gram-orthonormalized transformed weights.
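The reformulation can be illustrated numerically. The sketch below assumes the functional error is the Frobenius norm of `X @ W - X @ W_hat` for calibration activations `X`, and it uses a Cholesky factor of the Gram matrix `X.T @ X` as the transform. The paper's exact orthonormalization step may differ, and all names here are hypothetical. Under that assumption, output-space error equals plain weight-space error on the transformed weights `L.T @ W`.

```python
import numpy as np

rng = np.random.default_rng(0)

n, d_in, d_out = 256, 32, 48            # hypothetical calibration and layer sizes
X = rng.standard_normal((n, d_in))      # calibration activations (rows = tokens)
W = rng.standard_normal((d_in, d_out))  # original weights
W_hat = W + 0.1 * rng.standard_normal((d_in, d_out))  # any candidate approximation

# Gram matrix of the activations and its Cholesky factor: G = L @ L.T.
G = X.T @ X
L = np.linalg.cholesky(G)

# Functional (output-space) error on the calibration set.
functional_err = np.linalg.norm(X @ W - X @ W_hat)

# Plain weight-space error on the transformed weights L.T @ W.
transformed_err = np.linalg.norm(L.T @ W - L.T @ W_hat)
```

Because the two errors agree, a dictionary fitted to `L.T @ W` with an off-the-shelf solver can be mapped back by solving against `L.T`, provided the Gram matrix is well conditioned.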
Load-bearing premise
Minimizing layer output error on a small calibration set produces a factorization whose downstream task accuracy stays close to the original model without any fine-tuning.
What would settle it
A side-by-side evaluation on Llama-7B or Qwen-7B at 30 percent compression showing equal or higher downstream accuracy and lower perplexity for an SVD baseline than for CoSpaDi would falsify the reported trade-off improvement.
Original abstract
Post-training compression of large language models (LLMs) often relies on low-rank weight approximations that represent each column of the weight matrix in a shared low-dimensional subspace. This strategy is computationally efficient but the underlying constraint can be overly rigid for heterogeneous projection weights and may incur avoidable accuracy loss. We propose CoSpaDi (Compression via Sparse Dictionary Learning), a training-free framework that replaces low-rank factorization with a structured sparse decomposition in which each weight matrix is represented as a dense dictionary multiplied by a column-sparse coefficient matrix. This yields a union-of-subspaces model: the columns of the weight matrix are represented as linear combinations of different subsets of dictionary atoms, improving expressiveness at a fixed parameter budget. CoSpaDi is calibration-guided: using a small calibration set, we optimize the factorization to minimize functional reconstruction error of layer outputs rather than weight-space error. An activation-derived Gram orthonormalization reformulates this data-aware objective into a standard dictionary learning problem on transformed weights, and we support both per-layer compression and cross-layer dictionary sharing within groups of similar projections. Across Llama and Qwen model families, CoSpaDi consistently improves the accuracy-compression and perplexity-compression trade-offs over state-of-the-art SVD-based baselines and strong structured pruning baselines at 20-40% compression ratios. The resulting structured sparsity enables sparse-dense computation and integrates with post-training quantization of the sparse coefficients. Code is accessible at https://github.com/mts-ai/CoSpaDi
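Once the objective is in standard dictionary learning form, one common building block is a sparse coding step that picks a few atoms per column. The sketch below uses orthogonal matching pursuit (OMP) purely for illustration; the abstract does not name the sparse coding routine CoSpaDi uses, and the function `omp` and all sizes are hypothetical. An orthonormal dictionary is used so the demo recovers the planted code exactly.

```python
import numpy as np

def omp(D, y, s):
    """Greedy s-sparse code of y over dictionary D (columns are atoms)."""
    residual = y.copy()
    support = []
    coef = np.zeros(0)
    for _ in range(s):
        # Pick the atom most correlated with the current residual.
        scores = np.abs(D.T @ residual)
        if support:
            scores[support] = -np.inf  # never pick the same atom twice
        support.append(int(np.argmax(scores)))
        # Refit all selected coefficients by least squares.
        coef, *_ = np.linalg.lstsq(D[:, support], y, rcond=None)
        residual = y - D[:, support] @ coef
    x = np.zeros(D.shape[1])
    x[support] = coef
    return x

rng = np.random.default_rng(0)
m, k, s = 32, 32, 4  # hypothetical sizes
D, _ = np.linalg.qr(rng.standard_normal((m, k)))  # orthonormal atoms for a clean demo

# Plant an s-sparse code and synthesize the column it represents.
x_true = np.zeros(k)
x_true[rng.choice(k, size=s, replace=False)] = rng.uniform(1.0, 2.0, size=s)
y = D @ x_true

x_hat = omp(D, y, s)
```

Applying `omp` to every column of the transformed weight matrix would give a column-sparse coefficient matrix with at most `s` nonzeros per column; a full dictionary learning loop would alternate this step with dictionary updates.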
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes CoSpaDi, a training-free framework for post-training compression of LLMs. It replaces low-rank weight approximations with a structured sparse decomposition using a dense dictionary and column-sparse coefficients, optimized to minimize functional reconstruction error of layer outputs on a small calibration set via activation-derived Gram orthonormalization. The paper claims that this union-of-subspaces model improves accuracy-compression and perplexity-compression trade-offs over SVD-based and structured pruning baselines at 20-40% compression ratios on Llama and Qwen model families.
Significance. If the empirical results hold, the approach provides a more expressive parameterization for weight compression at fixed parameter budgets, potentially reducing accuracy loss compared to rigid low-rank methods. The calibration-guided objective and support for cross-layer dictionary sharing are notable technical elements. The training-free design and compatibility with quantization are practical strengths that could influence future work in efficient LLM deployment.
major comments (2)
- Abstract: The abstract states that CoSpaDi 'consistently improves' the trade-offs but provides no quantitative numbers, error bars, details on calibration set size, dictionary size selection, or statistical significance tests. This absence makes the central empirical claim difficult to evaluate and verify from the provided text.
- Central claim (calibration-guided reconstruction): The assumption that minimizing functional reconstruction error on a small calibration set will yield a factorization whose zero-shot accuracy and perplexity remain superior to SVD/pruning baselines without fine-tuning is load-bearing. If the calibration set under-samples rare patterns or task-specific activations, the union-of-subspaces model can still incur larger effective error on downstream benchmarks than weight-space methods at identical parameter budgets. The manuscript should include ablations or analysis on calibration set size, distribution, and representativeness to support this.
minor comments (1)
- Abstract: The code repository link is provided, supporting reproducibility.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. We address each of the major comments below and indicate the revisions we will make to strengthen the paper.
Point-by-point responses
- Referee: Abstract: The abstract states that CoSpaDi 'consistently improves' the trade-offs but provides no quantitative numbers, error bars, details on calibration set size, dictionary size selection, or statistical significance tests. This absence makes the central empirical claim difficult to evaluate and verify from the provided text.
  Authors: We agree that incorporating specific quantitative details in the abstract would enhance the clarity and verifiability of our claims. In the revised manuscript, we will modify the abstract to include key performance metrics, such as the observed improvements in perplexity and zero-shot accuracy at various compression ratios. We will also specify the calibration set size used (128 samples from the C4 dataset), the method for selecting dictionary size (based on minimizing reconstruction error on the calibration set), and note that error bars and statistical details are provided in the experimental results section of the full paper.
  Revision: yes
- Referee: Central claim (calibration-guided reconstruction): The assumption that minimizing functional reconstruction error on a small calibration set will yield a factorization whose zero-shot accuracy and perplexity remain superior to SVD/pruning baselines without fine-tuning is load-bearing. If the calibration set under-samples rare patterns or task-specific activations, the union-of-subspaces model can still incur larger effective error on downstream benchmarks than weight-space methods at identical parameter budgets. The manuscript should include ablations or analysis on calibration set size, distribution, and representativeness to support this.
  Authors: This is a valid concern regarding the generalizability of the calibration-guided optimization. The current manuscript uses a fixed calibration set of 128 samples and demonstrates consistent improvements across Llama and Qwen models on standard benchmarks. To further support the robustness of this approach, we will add an ablation study in the revised version analyzing the effects of varying the calibration set size and using different data distributions (e.g., C4 versus other corpora). We will also include a discussion on the limitations of finite calibration sets and how the functional reconstruction objective helps mitigate issues with rare patterns by focusing on activation statistics. We believe these additions will address the referee's point without altering the core methodology.
  Revision: yes
Circularity Check
No significant circularity in derivation chain
The paper derives its method by first posing a data-aware objective that minimizes layer-output reconstruction error on a calibration set, then applying an activation-derived Gram orthonormalization to recast this exactly as a standard dictionary learning problem on transformed weights. This is a mathematical equivalence that enables use of existing solvers rather than a self-definitional loop or fitted input renamed as prediction. Empirical gains over SVD and structured pruning baselines at 20-40% compression are reported via direct accuracy and perplexity measurements on Llama and Qwen families; these do not reduce to the calibration inputs by construction. No load-bearing self-citations, uniqueness theorems, or ansatzes imported from prior author work appear in the described chain, leaving the central claims self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
free parameters (1)
- dictionary size and sparsity level
axioms (1)
- domain assumption: weight matrices admit a good approximation as a dense dictionary times column-sparse coefficients
Reference graph
Works this paper leans on
[1] Rohan Anil, Sebastian Borgeaud, Jiecao Chen, Aakanksha Chowdhery, Jonathan Clark, et al. PaLM 2 technical report. arXiv preprint arXiv:2305.10403.
[2] Ky Fan. On a theorem of Weyl concerning eigenvalues of linear transformations I. Proceedings of the National Academy of Sciences of the United States of America, 35(11):652–655.
[3] Elias Frantar, Saleh Ashkboos, Torsten Hoefler, and Dan Alistarh. GPTQ: Accurate post-training quantization for generative pre-trained transformers. arXiv preprint arXiv:2210.17323.
[4] Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531.
[5]
[6] Xiaoqi Jiao, Yichun Yin, Lifeng Shang, Xin Jiang, Xiao Chen, Linlin Li, Fang Wang, and Qun Liu. TinyBERT: Distilling BERT for natural language understanding. In Findings of the Association for Computational Linguistics: EMNLP 2020, pp. 4163–4174. doi: 10.18653/v1/2020.findings-emnlp.372.
[7] Junhyuck Kim, Jongho Park, Jaewoong Cho, and Dimitris Papailiopoulos. Lexico: Extreme KV cache compression via sparse coding over universal dictionaries.
[8] Yoon Kim and Alexander M. Rush. Sequence-level knowledge distillation. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1317–1327. doi: 10.18653/v1/D16-1139.
[9] Guokun Lai, Qizhe Xie, Hanxiao Liu, Yiming Yang, and Eduard Hovy. RACE: Large-scale reading comprehension dataset from examinations.
[10] Arindam Mitra, Ganesh Jawahar, Sahaj Agarwal, Hamid Palangi, Ahmed Awadallah, and Subhabrata Mukherjee. Orca 2: Teaching small language models how to reason. arXiv preprint arXiv:2311.11045.
[11] Subhabrata Mukherjee, Arindam Mitra, Ganesh Jawahar, Sahaj Agarwal, Hamid Palangi, and Ahmed Awadallah. Orca: Progressive learning from complex explanation traces of GPT-4. arXiv preprint arXiv:2306.02707.
[12] OpenAI. GPT-4 technical report. arXiv preprint arXiv:2303.08774.
[13] Guilherme Penedo, Quentin Malartic, Daniel Hesslow, Ruxandra Cojocaru, Alessandro Cappelli, Hamza Alobeidli, Baptiste Pannier, Ebtesam Almazrouei, and Julien Launay. The RefinedWeb dataset for Falcon LLM: outperforming curated corpora with web data, and web data only. arXiv preprint arXiv:2306.01116.
[14] Victor Sanh, Lysandre Debut, Julien Chaumond, and Thomas Wolf. DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. arXiv preprint arXiv:1910.01108.
[15] Mingjie Sun, Zhuang Liu, Anna Bair, and J. Zico Kolter. A simple and effective pruning approach for large language models. arXiv preprint arXiv:2306.11695.
[16] Siqi Sun, Yu Cheng, Zhe Gan, and Jingjing Liu. Patient knowledge distillation for BERT model compression. In Proceedings of EMNLP-IJCNLP 2019, pp. 4323–4332. doi: 10.18653/v1/D19-1441.
[17] Zhiqing Sun, Hongkun Yu, Xiaodan Song, Renjie Liu, Yiming Yang, and Denny Zhou. MobileBERT: A compact task-agnostic BERT for resource-limited devices. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (ACL), pp. 2158–2170. doi: 10.18653/v1/2020.acl-main.195.
[18] Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin, Edouard Grave, and Guillaume Lample. LLaMA: Open and efficient foundation language models. In International Conference on...
[19] Jingcun Wang, Yu-Guang Chen, Ing-Chao Lin, Bing Li, and Grace Li Zhang. Basis sharing: Cross-layer parameter sharing for large language model compression. arXiv preprint arXiv:2410.03765.
[20] Wenhui Wang, Hangbo Bao, Shaohan Huang, Li Dong, and Furu Wei. MiniLMv2: Multi-head self-attention relation distillation for compressing pretrained transformers. In Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, pp. 2140–2151. doi: 10.18653/v1/2021.findings-acl.188.
[21] Xin Wang, Yu Zheng, Zhongwei Wan, and Mi Zhang. SVD-LLM: Truncation-aware singular value decomposition for large language model compression. In International Conference on Learning Representations, 2025a.
     Xin Wang, Yu Zheng, Zhongwei Wan, and Mi Zhang. SVD-LLM v2: Optimizing singular value truncation for large language model...
[22] Canwen Xu, Wangchunshu Zhou, Tao Ge, Furu Wei, and Ming Zhou. BERT-of-Theseus: Compressing BERT by progressive module replacing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 7859–7869.
[23] Zhihang Yuan, Yuzhang Shang, Yue Song, Qiang Wu, Yan Yan, and Guangyu Sun. ASVD: Activation-aware singular value decomposition for compressing large language models. arXiv preprint arXiv:2312.05821.
[24] Magauiya Zhussip, Dmitriy Shopkhoev, Ammar Ali, and Stamatios Lefkimmiatis. Share your attention: Transformer weight sharing via matrix-based dictionary learning. arXiv preprint arXiv:2508.04581.
A Appendix
A.1 Derivation of the optimal pair of basis and coefficient matrices
We are seeking the optimal pair $(B^\star, C^\star)$ that minimizes the constrained problem in Eq. (3). First, we rewrite the objective $J$ in its equivalent form:
$$J = \operatorname{tr}(W^\top W) - 2\operatorname{tr}(W^\top B C) + \operatorname{tr}(C^\top B^\top B C). \tag{11}$$
Next, we consider the basis matrix $B$ as fixed and compute the gradient of the objective w....
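For readers following the excerpt, the expansion behind Eq. (11) and the step it breaks off at can be sketched as follows. This assumes that the objective in Eq. (3), which is not reproduced here, is the squared Frobenius error $J = \|W - BC\|_F^2$, and that the gradient is taken with respect to $C$. Both are assumptions rather than statements from the paper.

```latex
\begin{aligned}
J &= \|W - BC\|_F^2 = \operatorname{tr}\!\big((W - BC)^\top (W - BC)\big) \\
  &= \operatorname{tr}(W^\top W) - 2\operatorname{tr}(W^\top B C) + \operatorname{tr}(C^\top B^\top B C), \\
\nabla_C J &= -2 B^\top W + 2 B^\top B C .
\end{aligned}
```

Setting the gradient to zero gives the normal equations $B^\top B C = B^\top W$. If the constraint in Eq. (3) makes the columns of $B$ orthonormal, this reduces to $C = B^\top W$.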
A.6 Comparison with other training-free methods
We also evaluate the performance of the proposed CoSpaDi relative to other training-free methods, particularly structural pruning methods and ASVD (Yuan et al., 2023), which is another data-aware SVD-based method. The results are provided in Table
The left plot in Fig. 7 shows that average accuracy stabilizes after roughly 50 K-SVD iterations, while perplexity continues to decrease slightly before flattening out. The right plot of Fig. 7 indicates that very few power iterations are sufficient for stable convergence: performance improves sharply up to around 5 iterations, after which additional it...
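To make the role of power iterations concrete, here is a hedged sketch of how a few of them approximate the leading singular triplet of a matrix, the rank-1 fit that K-SVD-style atom updates need. The excerpt does not say where exactly the power iterations enter the paper's solver, so this placement is an assumption; `top_singular_pair` and the synthetic residual matrix `E` are hypothetical.

```python
import numpy as np

def top_singular_pair(E, n_iter=5, seed=0):
    """Approximate the leading singular triplet of E with power iterations."""
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(E.shape[1])
    v /= np.linalg.norm(v)
    for _ in range(n_iter):
        # Alternate between left and right vectors, renormalizing each time.
        u = E @ v
        u /= np.linalg.norm(u)
        v = E.T @ u
        v /= np.linalg.norm(v)
    sigma = u @ E @ v
    return u, sigma, v

rng = np.random.default_rng(1)
# Hypothetical residual matrix with a clearly dominant rank-1 component.
a = rng.standard_normal(40)
b = rng.standard_normal(30)
E = 10.0 * np.outer(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
E += 0.1 * rng.standard_normal((40, 30))

u, sigma, v = top_singular_pair(E, n_iter=5)
sigma_exact = np.linalg.svd(E, compute_uv=False)[0]
```

When the leading singular value is well separated, as in this synthetic case, five iterations already match the exact value closely. A smaller spectral gap would need more iterations.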
... for different weight types (Query, Key, Value, Up, Gate, Down, Out), grouped by compression rate and group size (1 or 2).

Weight Type  Compression Rate  Group size  Dictionary, k  Sparsity, s  k/s ratio  d1    d2
Query        20%               2           3276           1638         2          4096  8192
Key          20%               2           3276           1638         2          4096  8192
Value        20%               2           3276           1638         2          4096  8192
Up           20%               2           4776           2388         2          4096  22016
Gate         20%               2           4776           2388         2          4096  22016
Down         20%               1           2762           ...
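The recoverable rows can be checked with a short calculation. The sketch below assumes that the parameter budget counts only stored values (a `d1 x k` dictionary plus `s` nonzero coefficients for each of the `d2` columns) and that the compression rate is one minus the stored fraction. This counting rule is inferred from the numbers, not quoted from the paper, and it ignores any cost of storing sparse indices.

```python
# Recoverable rows of the table: (weight type, k, s, d1, d2), all listed at a 20% compression rate.
rows = [
    ("Query", 3276, 1638, 4096, 8192),
    ("Up",    4776, 2388, 4096, 22016),
]

def stored_fraction(k, s, d1, d2):
    """Dictionary values (d1 * k) plus s nonzeros per column, relative to a dense d1 x d2 matrix."""
    return (d1 * k + s * d2) / (d1 * d2)

fractions = {name: stored_fraction(k, s, d1, d2) for name, k, s, d1, d2 in rows}
for name, frac in fractions.items():
    print(f"{name}: stored fraction = {frac:.4f}, implied compression = {1 - frac:.4f}")
```

Under this assumption both rows land very close to a stored fraction of 0.8, which is consistent with the 20% compression rate listed in the table.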