pith. the verified trust layer for science.

arxiv: 2509.22075 · v5 · submitted 2025-09-26 · 💻 cs.CL · cs.AI

CoSpaDi: Compressing LLMs via Calibration-Guided Sparse Dictionary Learning

Pith reviewed 2026-05-18 13:14 UTC · model grok-4.3

classification 💻 cs.CL cs.AI
keywords LLM compression · sparse dictionary learning · post-training compression · structured sparsity · calibration guided · union of subspaces · low-rank approximation · Llama Qwen

The pith

CoSpaDi replaces low-rank factorization with a sparse dictionary model that better preserves LLM accuracy at 20-40 percent compression.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Large language models are often compressed after training by approximating each weight matrix with a low-rank factorization that forces every column into the same low-dimensional subspace. CoSpaDi instead decomposes each weight matrix as a dense dictionary multiplied by a column-sparse coefficient matrix, so that different columns can combine different subsets of dictionary atoms. The dictionary and coefficients are chosen by minimizing the difference between original and compressed layer outputs on a small calibration set rather than minimizing weight error directly. An activation-based Gram orthonormalization turns this objective into a standard dictionary learning problem that can be solved per layer or with shared dictionaries across similar layers. Experiments on Llama and Qwen families show improved accuracy-compression and perplexity-compression curves compared with strong SVD and structured pruning baselines.

Core claim

Each weight matrix is expressed as the product of a dense dictionary and a column-sparse coefficient matrix, producing a union-of-subspaces representation. The factorization is obtained by minimizing functional reconstruction error of layer outputs on a calibration set; this data-aware objective is converted via activation-derived Gram orthonormalization into a conventional dictionary learning task. The resulting structured sparsity supports efficient sparse-dense computation and post-training quantization of the coefficients while allowing optional cross-layer dictionary sharing.
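A toy illustration of the representation described above may help. The sketch below is not the paper's algorithm: the sizes are hypothetical, the dictionary is random, and each column's support is picked at random rather than learned. It only shows what "dense dictionary times column-sparse coefficients" means and why different columns can live in different subspaces.

```python
import numpy as np

rng = np.random.default_rng(0)
d1, d2 = 64, 96      # toy weight shape (hypothetical sizes)
k, s = 48, 8         # number of atoms and nonzeros per column (hypothetical)

W = rng.standard_normal((d1, d2))
D = rng.standard_normal((d1, k))   # dense dictionary, one atom per column

# Column-sparse coefficients: every column keeps at most s nonzero entries,
# but each column may use a different subset of atoms.
S = np.zeros((k, d2))
for j in range(d2):
    # Random support, purely for illustration (a real method would select it).
    support = rng.choice(k, size=s, replace=False)
    # Least-squares fit of column j of W on its own subset of atoms.
    coef, *_ = np.linalg.lstsq(D[:, support], W[:, j], rcond=None)
    S[support, j] = coef

W_hat = D @ S                      # union-of-subspaces approximation of W
nnz_per_column = np.count_nonzero(S, axis=0)
rel_err = np.linalg.norm(W - W_hat) / np.linalg.norm(W)
print(nnz_per_column.max(), rel_err)
```

A low-rank factorization would correspond to the special case where every column uses the same support, which is the rigidity the sparse model is meant to relax.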

What carries the argument

Calibration-guided sparse dictionary learning that reformulates functional reconstruction error minimization into dictionary learning on Gram-orthonormalized transformed weights.
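One standard way to see how an output-space objective becomes a weight-space one is through a Cholesky factor of the activation Gram matrix. The sketch below assumes the layer computes `X @ W` and that the Gram matrix is positive definite; the paper's exact transform may differ, and all names and sizes here are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d1, d2, k = 200, 16, 24, 10     # hypothetical toy sizes
X = rng.standard_normal((n, d1))   # calibration activations, one row per token
W = rng.standard_normal((d1, d2))  # original weight, layer output is X @ W
D = rng.standard_normal((d1, k))   # any candidate dictionary
S = rng.standard_normal((k, d2))   # any candidate coefficients (dense here)

G = X.T @ X                        # Gram matrix of the activations
L = np.linalg.cholesky(G)          # G = L @ L.T, needs G positive definite

# Error measured on layer outputs over the calibration set.
functional_err = np.linalg.norm(X @ W - X @ D @ S)

# The same error measured on transformed weights, with no activations involved.
W_tilde = L.T @ W
D_tilde = L.T @ D
weight_space_err = np.linalg.norm(W_tilde - D_tilde @ S)

# A dictionary learned in the transformed space maps back by a triangular solve.
D_back = np.linalg.solve(L.T, D_tilde)
print(functional_err, weight_space_err)
```

The two errors agree because the squared output error equals the trace of E^T G E for E = W - D S, so any off-the-shelf dictionary learner applied to `W_tilde` is implicitly minimizing the calibration-set output error.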

Load-bearing premise

Minimizing layer output error on a small calibration set produces a factorization whose downstream task accuracy stays close to the original model without any fine-tuning.

What would settle it

A side-by-side evaluation on Llama-7B or Qwen-7B at 30 percent compression showing equal or higher downstream accuracy and lower perplexity for an SVD baseline than for CoSpaDi would falsify the reported trade-off improvement.

Figures

Figures reproduced from arXiv: 2509.22075 by Ammar Ali, Denis Makhov, Dmitriy Shopkhoev, Magauiya Zhussip, Stamatios Lefkimmiatis.

Figure 1: Left side: weight factorization methods using low-rank decomposition. Low-rank approximation decomposes a matrix into two dense matrices of lower rank. Right side: proposed CoSpaDi. A dictionary of k atoms and a column-sparse coefficient matrix are employed. No restrictions on size of k (undercomplete: k < d1, complete: k = d1, or overcomplete: k > d1 dictionaries are possible), while sparsity is defined…
Figure 2: Dual-axis plot showing average accuracy (solid lines, left axis) and perplexity (dashed lines, right axis, logarithmic scale with inverted direction) as functions of ρ for Llama3.2-1B under three compression levels: 0.2, 0.3 and 0.4. Perplexity decreases upward due to axis inversion.
Table fragment extracted alongside Figure 2 (columns: CR, Bitwidth, Avg. Acc., PPL):
  0.1686  bFP16  0.6198  1.94E+01
  0.1843  bFP15  0.6195  1.95E+01
  0.2001  bFP14  0.6176  1.97E+01
  0.…
Figure 3: Average benchmark accuracy and WikiText perplexity for (a) LLaMA-3.2-1B and (b) …
Figure 4: Inference time for different projection layers of Llama3.2 1B for different compression …
Figure 5: Inference time for different projection layers of Llama3 8B for different compression …
Figure 6: Inference time for different projection layers of Qwen3 0.6B for different compression …
Figure 7: Average benchmark accuracy and WikiText perplexity with respect to the number of K …
Original abstract

Post-training compression of large language models (LLMs) often relies on low-rank weight approximations that represent each column of the weight matrix in a shared low-dimensional subspace. This strategy is computationally efficient but the underlying constraint can be overly rigid for heterogeneous projection weights and may incur avoidable accuracy loss. We propose CoSpaDi (Compression via Sparse Dictionary Learning), a training-free framework that replaces low-rank factorization with a structured sparse decomposition in which each weight matrix is represented as a dense dictionary multiplied by a column-sparse coefficient matrix. This yields a union-of-subspaces model: the columns of the weight matrix are represented as linear combinations of different subsets of dictionary atoms, improving expressiveness at a fixed parameter budget. CoSpaDi is calibration-guided: using a small calibration set, we optimize the factorization to minimize functional reconstruction error of layer outputs rather than weight-space error. An activation-derived Gram orthonormalization reformulates this data-aware objective into a standard dictionary learning problem on transformed weights, and we support both per-layer compression and cross-layer dictionary sharing within groups of similar projections. Across Llama and Qwen model families, CoSpaDi consistently improves the accuracy-compression and perplexity-compression trade-offs over state-of-the-art SVD-based baselines and strong structured pruning baselines at 20-40% compression ratios. The resulting structured sparsity enables sparse-dense computation and integrates with post-training quantization of the sparse coefficients. Code is accessible at https://github.com/mts-ai/CoSpaDi
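The sparse-dense computation mentioned at the end of the abstract can be pictured as follows. This is a minimal sketch under stated assumptions: the layer computes `x @ W` with `W` approximated by `D @ S`, the coefficients are stored as `s` atom indices and `s` values per output column, and all names and sizes are hypothetical. It checks numerical equivalence only and makes no claim about speed.

```python
import numpy as np

rng = np.random.default_rng(0)
d1, d2, k, s = 32, 40, 20, 4       # hypothetical toy sizes
D = rng.standard_normal((d1, k))   # dense dictionary

# Compact storage of a column-sparse S: s atom indices and s values per column.
idx = np.stack(
    [rng.choice(k, size=s, replace=False) for _ in range(d2)], axis=1
)                                   # shape (s, d2)
val = rng.standard_normal((s, d2))  # shape (s, d2)

# Dense equivalent of S, built only to verify the result.
S = np.zeros((k, d2))
S[idx, np.arange(d2)] = val

x = rng.standard_normal((5, d1))   # a batch of 5 input rows

z = x @ D                          # dense step, shape (5, k)
# Sparse step: each output column gathers its s entries of z and mixes them.
y_sparse = (z[:, idx] * val).sum(axis=1)   # shape (5, d2)

y_dense = x @ D @ S
print(np.abs(y_sparse - y_dense).max())
```

Per input row, the factored form needs about `d1 * k + s * d2` multiply-adds instead of `d1 * d2`, which is the same count that governs the parameter budget.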

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript proposes CoSpaDi, a training-free framework for post-training compression of LLMs. It replaces low-rank weight approximations with a structured sparse decomposition using a dense dictionary and column-sparse coefficients, optimized to minimize functional reconstruction error of layer outputs on a small calibration set via activation-derived Gram orthonormalization. The paper claims that this union-of-subspaces model improves accuracy-compression and perplexity-compression trade-offs over SVD-based and structured pruning baselines at 20-40% compression ratios on Llama and Qwen model families.

Significance. If the empirical results hold, the approach provides a more expressive parameterization for weight compression at fixed parameter budgets, potentially reducing accuracy loss compared to rigid low-rank methods. The calibration-guided objective and support for cross-layer dictionary sharing are notable technical elements. The training-free design and compatibility with quantization are practical strengths that could influence future work in efficient LLM deployment.

major comments (2)
  1. Abstract: The abstract states that CoSpaDi 'consistently improves' the trade-offs but provides no quantitative numbers, error bars, details on calibration set size, dictionary size selection, or statistical significance tests. This absence makes the central empirical claim difficult to evaluate and verify from the provided text.
  2. Central claim (calibration-guided reconstruction): The assumption that minimizing functional reconstruction error on a small calibration set will yield a factorization whose zero-shot accuracy and perplexity remain superior to SVD/pruning baselines without fine-tuning is load-bearing. If the calibration set under-samples rare patterns or task-specific activations, the union-of-subspaces model can still incur larger effective error on downstream benchmarks than weight-space methods at identical parameter budgets. The manuscript should include ablations or analysis on calibration set size, distribution, and representativeness to support this.
minor comments (1)
  1. Abstract: The code repository link is provided, supporting reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each of the major comments below and indicate the revisions we will make to strengthen the paper.

Point-by-point responses
  1. Referee: Abstract: The abstract states that CoSpaDi 'consistently improves' the trade-offs but provides no quantitative numbers, error bars, details on calibration set size, dictionary size selection, or statistical significance tests. This absence makes the central empirical claim difficult to evaluate and verify from the provided text.

    Authors: We agree that incorporating specific quantitative details in the abstract would enhance the clarity and verifiability of our claims. In the revised manuscript, we will modify the abstract to include key performance metrics, such as the observed improvements in perplexity and zero-shot accuracy at various compression ratios. We will also specify the calibration set size used (128 samples from the C4 dataset), the method for selecting dictionary size (based on minimizing reconstruction error on the calibration set), and note that error bars and statistical details are provided in the experimental results section of the full paper. revision: yes

  2. Referee: Central claim (calibration-guided reconstruction): The assumption that minimizing functional reconstruction error on a small calibration set will yield a factorization whose zero-shot accuracy and perplexity remain superior to SVD/pruning baselines without fine-tuning is load-bearing. If the calibration set under-samples rare patterns or task-specific activations, the union-of-subspaces model can still incur larger effective error on downstream benchmarks than weight-space methods at identical parameter budgets. The manuscript should include ablations or analysis on calibration set size, distribution, and representativeness to support this.

    Authors: This is a valid concern regarding the generalizability of the calibration-guided optimization. The current manuscript uses a fixed calibration set of 128 samples and demonstrates consistent improvements across Llama and Qwen models on standard benchmarks. To further support the robustness of this approach, we will add an ablation study in the revised version analyzing the effects of varying the calibration set size and using different data distributions (e.g., C4 versus other corpora). We will also include a discussion on the limitations of finite calibration sets and how the functional reconstruction objective helps mitigate issues with rare patterns by focusing on activation statistics. We believe these additions will address the referee's point without altering the core methodology. revision: yes

Circularity Check

0 steps flagged

No significant circularity in derivation chain

Full rationale

The paper derives its method by first posing a data-aware objective that minimizes layer-output reconstruction error on a calibration set, then applying an activation-derived Gram orthonormalization to recast this exactly as a standard dictionary learning problem on transformed weights. This is a mathematical equivalence that enables use of existing solvers rather than a self-definitional loop or fitted input renamed as prediction. Empirical gains over SVD and structured pruning baselines at 20-40% compression are reported via direct accuracy and perplexity measurements on Llama and Qwen families; these do not reduce to the calibration inputs by construction. No load-bearing self-citations, uniqueness theorems, or ansatzes imported from prior author work appear in the described chain, leaving the central claims self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on the expressiveness of the union-of-subspaces model for weight matrices and the sufficiency of a small calibration set for guiding the factorization; no new physical entities or unproven mathematical axioms are introduced beyond standard linear algebra assumptions.

free parameters (1)
  • dictionary size and sparsity level
    Hyperparameters chosen to achieve target 20-40% compression ratios; not fitted to final task metrics in the abstract description.
axioms (1)
  • domain assumption: Weight matrices admit a good approximation as a dense dictionary times column-sparse coefficients
    Invoked when replacing low-rank factorization with the proposed sparse decomposition.
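How the dictionary size k and sparsity s in the ledger above map to a compression ratio can be worked out with simple arithmetic. The function below is a sketch under one assumption that the source does not confirm: the budget counts only dictionary entries and nonzero coefficient values, ignoring storage for the nonzero positions. The sample values (d1, d2, k, s) are the Query and Up rows of the appendix table fragment that appears in the extracted reference graph.

```python
def compression_ratio(d1: int, d2: int, k: int, s: int) -> float:
    """Fraction of parameters removed when a d1 x d2 weight is stored as a
    d1 x k dictionary plus s nonzero coefficients for each of the d2 columns.
    Index storage for the nonzeros is ignored (an assumption of this sketch)."""
    original = d1 * d2
    compressed = d1 * k + s * d2
    return 1.0 - compressed / original


# Values taken from the extracted table fragment (20% rows, group size 2).
cr_query = compression_ratio(4096, 8192, 3276, 1638)
cr_up = compression_ratio(4096, 22016, 4776, 2388)
print(cr_query, cr_up)
```

Both results land within 0.001 of 0.20, which suggests, without proving, that the table's k and s were chosen by a rule of this form with k/s fixed at 2.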

pith-pipeline@v0.9.0 · 5824 in / 1348 out tokens · 54230 ms · 2026-05-18T13:14:45.608935+00:00 · methodology


Reference graph

Works this paper leans on

28 extracted references · 28 canonical work pages · 8 internal anchors

  1. [1]

    Palm 2 technical report

    Rohan Anil, Sebastian Borgeaud, Jiecao Chen, Aakanksha Chowdhery, Jonathan Clark, et al. Palm 2 technical report. arXiv preprint arXiv:2305.10403,

  2. [2]

    Ky Fan. On a theorem of Weyl concerning eigenvalues of linear transformations I. Proceedings of the National Academy of Sciences of the United States of America, 35(11):652–655,

  3. [3]

    GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers

    Elias Frantar, Saleh Ashkboos, Torsten Hoefler, and Dan Alistarh. Gptq: Accurate post-training quantization for generative pretrained transformers. arXiv preprint arXiv:2210.17323,

  4. [4]

    Distilling the Knowledge in a Neural Network

    Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531,

  5. [5]

    Lora: Low-rank adaptation of large language models

    Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. Lora: Low-rank adaptation of large language models. In International Conference on Learning Representations,

  6. [6]

    Tinybert: Distilling bert for natural language understanding

    Xiaoqi Jiao, Yichun Yin, Lifeng Shang, Xin Jiang, Xiao Chen, Linlin Li, Fang Wang, and Qun Liu. Tinybert: Distilling bert for natural language understanding. In Findings of the Association for Computational Linguistics: EMNLP 2020, pp. 4163–4174, doi: 10.18653/v1/2020.findings-emnlp.372.

  7. [7]

    Lexico: Extreme kv cache compression via sparse coding over universal dictionaries

    Junhyuck Kim, Jongho Park, Jaewoong Cho, and Dimitris Papailiopoulos. Lexico: Extreme kv cache compression via sparse coding over universal dictionaries,

  8. [8]

    Yoon Kim and Alexander M. Rush. Sequence-level knowledge distillation. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1317–1327, doi: 10.18653/v1/D16-1139.

  9. [9]

    Race: Large-scale reading comprehension dataset from examinations

    Guokun Lai, Qizhe Xie, Hanxiao Liu, Yiming Yang, and Eduard Hovy. Race: Large-scale reading comprehension dataset from examinations,

  10. [10]

    Orca 2: Teaching small language models how to reason

    Arindam Mitra, Ganesh Jawahar, Sahaj Agarwal, Hamid Palangi, Ahmed Awadallah, and Subhabrata Mukherjee. Orca 2: Teaching small language models how to reason. arXiv preprint arXiv:2311.11045,

  11. [11]

    Orca: Progressive Learning from Complex Explanation Traces of GPT-4

    Subhabrata Mukherjee, Arindam Mitra, Ganesh Jawahar, Sahaj Agarwal, Hamid Palangi, and Ahmed Awadallah. Orca: Progressive learning from complex explanation traces of gpt-4. arXiv preprint arXiv:2306.02707,

  12. [12]

    GPT-4 Technical Report

    OpenAI. Gpt-4 technical report. arXiv preprint arXiv:2303.08774,

  13. [13]

    The RefinedWeb Dataset for Falcon LLM: Outperforming Curated Corpora with Web Data, and Web Data Only

    Guilherme Penedo, Quentin Malartic, Daniel Hesslow, Ruxandra Cojocaru, Alessandro Cappelli, Hamza Alobeidli, Baptiste Pannier, Ebtesam Almazrouei, and Julien Launay. The RefinedWeb dataset for Falcon LLM: outperforming curated corpora with web data, and web data only. arXiv preprint arXiv:2306.01116,

  14. [14]

    DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter

    Victor Sanh, Lysandre Debut, Julien Chaumond, and Thomas Wolf. Distilbert: A distilled version of bert: Smaller, faster, cheaper and lighter. arXiv preprint arXiv:1910.01108,

  15. [15]

    A Simple and Effective Pruning Approach for Large Language Models

    Mingjie Sun, Zhuang Liu, Anna Bair, and J. Zico Kolter. A simple and effective pruning approach for large language models. arXiv preprint arXiv:2306.11695,

  16. [16]

    Patient knowledge distillation for BERT model compression

    Siqi Sun, Yu Cheng, Zhe Gan, and Jingjing Liu. Patient knowledge distillation for BERT model compression. In Proceedings of EMNLP-IJCNLP 2019, pp. 4323–4332, doi: 10.18653/v1/D19-1441.

  17. [17]

    Mobilebert: A compact task-agnostic bert for resource-limited devices

    Zhiqing Sun, Hongkun Yu, Xiaodan Song, Renjie Liu, Yiming Yang, and Denny Zhou. Mobilebert: A compact task-agnostic bert for resource-limited devices. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (ACL), pp. 2158–2170, doi: 10.18653/v1/2020.acl-main.195.

  18. [18]

    Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin, Edouard Grave, and Guillaume Lample. Llama: Open and efficient foundation language models. In International Conference on...

  19. [19]

    Basis sharing: Cross-layer parameter sharing for large language model compression

    Jingcun Wang, Yu-Guang Chen, Ing-Chao Lin, Bing Li, and Grace Li Zhang. Basis sharing: Cross-layer parameter sharing for large language model compression. arXiv preprint arXiv:2410.03765,

  20. [20]

    Minilmv2: Multi-head self-attention relation distillation for compressing pretrained transformers

    Wenhui Wang, Hangbo Bao, Shaohan Huang, Li Dong, and Furu Wei. Minilmv2: Multi-head self-attention relation distillation for compressing pretrained transformers. In Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, pp. 2140–2151, doi: 10.18653/v1/2021.findings-acl.188.

  21. [21]

    Svd-llm: Truncation-aware singular value decomposition for large language model compression

    Xin Wang, Yu Zheng, Zhongwei Wan, and Mi Zhang. Svd-llm: Truncation-aware singular value decomposition for large language model compression. In International Conference on Learning Representations, 2025a. Xin Wang, Yu Zheng, Zhongwei Wan, and Mi Zhang. Svd-llm v2: Optimizing singular value truncation for large language model...

  22. [22]

    Bert-of-theseus: Compressing bert by progressive module replacing

    Canwen Xu, Wangchunshu Zhou, Tao Ge, Furu Wei, and Ming Zhou. Bert-of-theseus: Compressing bert by progressive module replacing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 7859–7869,

  23. [23]

    ASVD: Activation-aware Singular Value Decomposition for Compressing Large Language Models

    Zhihang Yuan, Yuzhang Shang, Yue Song, Qiang Wu, Yan Yan, and Guangyu Sun. Asvd: Activation-aware singular value decomposition for compressing large language models. arXiv preprint arXiv:2312.05821,

  24. [24]

    Share your attention: Transformer weight sharing via matrix-based dictionary learning

    Magauiya Zhussip, Dmitriy Shopkhoev, Ammar Ali, and Stamatios Lefkimmiatis. Share your attention: Transformer weight sharing via matrix-based dictionary learning. arXiv preprint arXiv:2508.04581,

  25. [25]

    A APPENDIX. A.1 DERIVATION OF THE OPTIMAL PAIR OF BASIS AND COEFFICIENT MATRICES. We are seeking the optimal pair $(B^\star, C^\star)$ that minimizes the constrained problem in Eq. (3). First, we rewrite the objective $J$ in its equivalent form: $J = \operatorname{tr}(W^\top W) - 2\operatorname{tr}(W^\top B C) + \operatorname{tr}(C^\top B^\top B C)$. (11) Next, we consider the basis matrix $B$ as fixed and compute the gradient of the objective w....

  26. [26]

    A.6 COMPARISON WITH OTHER TRAINING-FREE METHODS. We also evaluate the performance of the proposed CoSpaDi relative to other training-free methods, particularly structural pruning ones and ASVD (Yuan et al., 2023), which is another data-aware SVD-based method. The results are provided in Table

  27. [27]

    The left plot in Fig. 7 shows that average accuracy stabilizes after roughly 50 K-SVD iterations, while perplexity continues to decrease slightly before flattening out. The right plot of Fig. 7 indicates that very few power iterations are sufficient for stable convergence: performance improves sharply up to around 5 iterations, after which additional it...

  28. [28]

    for different weight types (Query, Key, Value, Up, Gate, Down, Out), grouped by compression rate and group size (1 or 2).

    Weight Type | Compression Rate | Group size | Dictionary, k | Sparsity, s | k/s ratio | d1   | d2
    Query       | 20%              | 2          | 3276          | 1638        | 2         | 4096 | 8192
    Key         |                  |            | 3276          | 1638        | 2         | 4096 | 8192
    Value       |                  |            | 3276          | 1638        | 2         | 4096 | 8192
    Up          |                  |            | 4776          | 2388        | 2         | 4096 | 22016
    Gate        |                  |            | 4776          | 2388        | 2         | 4096 | 22016
    Down        |                  | 1          | 2762 ...