Pith · machine review for the scientific record

arxiv: 2605.08568 · v1 · submitted 2026-05-09 · 💻 cs.LG

Recognition: 2 theorem links · Lean Theorem

Different Prompts, Different Ranks: Prompt-aware Dynamic Rank Selection for SVD-based LLM Compression

Authors on Pith: no claims yet

Pith reviewed 2026-05-12 02:19 UTC · model grok-4.3

classification 💻 cs.LG
keywords: SVD compression · LLM compression · prompt-aware rank selection · dynamic rank · linear router · post-training compression · inference speedup

The pith

A linear router trained on dense outputs can pick prompt-specific SVD ranks to improve accuracy and speed in compressed LLMs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that static SVD rank truncation is limited because the optimal number of singular components varies across prompts and depends heavily on the calibration set chosen during compression. PARSE addresses this by training a linear router offline to select ranks for each prompt by matching the behavior of the full dense model on a large corpus. The router's choices exhibit patterns that repeat for similar prompts and remain stable across generation steps, so cached selections can be reused directly. When added to four existing SVD methods, the framework raises average accuracy by up to 10 percent at a 0.6 compression ratio on LLaMA-7B and delivers up to 2.5 times faster prefill and 2.4 times faster decode than static SVD execution.
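The rank-expert view underlying this can be sketched in a few lines. A weight matrix is held as rank-1 components (sigma_i, u_i, v_i); static truncation always applies a fixed prefix of them, while prompt-aware selection applies an input-specific subset. Everything below is illustrative, not the paper's code.

```python
def apply_rank_experts(experts, x, keep):
    """Apply W x using only the rank-1 experts in `keep`.

    Each expert is (sigma, u, v) with W = sum_i sigma_i * outer(u_i, v_i),
    so W x = sum_i sigma_i * u_i * (v_i . x). Static truncation fixes
    `keep` to a prefix; dynamic selection chooses it per prompt.
    """
    y = [0.0] * len(experts[0][1])
    for i in keep:
        sigma, u, v = experts[i]
        coef = sigma * sum(vj * xj for vj, xj in zip(v, x))
        y = [yi + coef * ui for yi, ui in zip(y, u)]
    return y

# Two orthogonal rank-1 experts acting on a 2-D input.
experts = [
    (2.0, [1.0, 0.0], [1.0, 0.0]),
    (0.5, [0.0, 1.0], [0.0, 1.0]),
]
x = [3.0, 4.0]
print(apply_rank_experts(experts, x, keep=[0, 1]))  # full rank: [6.0, 2.0]
print(apply_rank_experts(experts, x, keep=[0]))     # truncated: [6.0, 0.0]
```

A prompt whose signal lives in the second component loses it entirely under the prefix `keep=[0]`, which is exactly the per-prompt sensitivity the paper's Observation 1 points at.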

Core claim

PARSE decouples rank selection from any fixed calibration set by supervising a linear router against dense-model outputs on a broad corpus. This router assigns different SVD ranks to different inputs at inference time. Because rank-selection patterns are shared across semantically similar prompts and stay consistent during decoding, the chosen rank subsets can be served from a pattern cache. Expert memory aggregation and kernel fusion then keep the added overhead low. Integrated with four representative SVD pipelines, the method improves average task accuracy by up to 10 percent at a 0.6 compression ratio on LLaMA-7B while achieving up to 2.5× prefill and 2.4× decode speedup over native SVD.

What carries the argument

A linear router that maps prompt embeddings to rank choices, trained offline by supervising against dense-model outputs rather than calibration data, which selects input-specific SVD rank subsets at inference.
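A minimal sketch of such a router, with invented shapes and weights (the paper does not publish its exact parameterization here): one linear map scores every rank expert from the prompt embedding, and the top-k scores pick the subset.

```python
def linear_router(weight, bias, embedding, k):
    """Score each rank expert with a single linear layer, keep the top-k.

    weight: one weight row per expert; bias: one offset per expert.
    Returns the indices of the k highest-scoring experts, sorted.
    """
    scores = [
        sum(wj * ej for wj, ej in zip(row, embedding)) + b
        for row, b in zip(weight, bias)
    ]
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    return sorted(top)

# Toy router over 4 rank experts and a 3-D prompt embedding.
weight = [[1.0, 0.0, 0.0],
          [0.0, 1.0, 0.0],
          [0.0, 0.0, 1.0],
          [0.5, 0.5, 0.0]]
bias = [0.0, 0.0, 0.0, 0.0]
print(linear_router(weight, bias, embedding=[2.0, 1.0, 0.0], k=2))  # [0, 3]
```

Because the scoring is a single matrix-vector product, the router's inference cost is negligible next to a transformer layer, which is the property the speedup claims depend on.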

If this is right

  • Existing SVD-based compression pipelines can be upgraded by adding the router without altering their core low-rank decomposition step.
  • Each prompt receives only the singular components it needs, reducing average compute and memory traffic during both prefill and decode.
  • Rank patterns that repeat across similar prompts allow caching, so router cost becomes negligible for long generations.
  • The accuracy and speedup benefits hold when the router is combined with any of the four tested SVD methods.
  • Rank selection no longer depends on the particular calibration dataset used to build the compressed model.
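The compute saving in the second bullet follows from simple accounting: a dense m×n matvec costs mn multiply-adds, while the rank-r factored form costs r(m+n), so any prompt routed to a small subset does proportionally less work. A toy version with illustrative sizes:

```python
def matvec_flops(m, n, r=None):
    """Multiply-adds for one matvec: dense W x, or factored U_r (V_r^T x)."""
    return m * n if r is None else r * (m + n)

m = n = 4096                               # LLaMA-7B-sized projection
dense = matvec_flops(m, n)
easy_prompt = matvec_flops(m, n, r=512)    # few components suffice
hard_prompt = matvec_flops(m, n, r=1536)   # needs more of the spectrum
print(dense // easy_prompt, dense // hard_prompt)  # 4 1
```

The specific ranks here are invented; the point is only that per-prompt ranks turn a fixed cost into an average over the prompt distribution.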

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same prompt-conditioned selection principle could be tested on other low-rank or sparse compression schemes that currently use static truncation.
  • If the router generalizes across domains, it could reduce the need for repeated calibration when deploying compressed models to new tasks.
  • The observed stability of rank choices across decoding steps suggests that prompt-level decisions can be made once per sequence instead of per token.
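The once-per-sequence idea above can be sketched as a similarity-keyed pattern cache: if a new prompt embedding is close enough to a cached one (the 0.95 cosine threshold here is an invented parameter, not the paper's), its rank subset is reused and the router is skipped.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

class RankPatternCache:
    """Reuse rank subsets for prompts whose embeddings are near a cached one."""

    def __init__(self, router, threshold=0.95):
        self.router = router      # callable: embedding -> rank subset
        self.threshold = threshold
        self.entries = []         # (embedding, subset) pairs

    def select(self, embedding):
        for cached_emb, subset in self.entries:
            if cosine(embedding, cached_emb) >= self.threshold:
                return subset     # cache hit: router not queried
        subset = self.router(embedding)
        self.entries.append((embedding, subset))
        return subset

calls = []
def toy_router(emb):
    calls.append(emb)
    return [0, 2]

cache = RankPatternCache(toy_router)
cache.select([1.0, 0.0])    # miss: router queried once
cache.select([0.99, 0.01])  # hit: nearly identical prompt
print(len(calls))           # 1
```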

Load-bearing premise

A linear router trained offline on dense-model outputs from a large-scale corpus will reliably generalize to select suitable ranks for new, unseen prompts without being overly sensitive to the choice of training data.

What would settle it

Measure task accuracy on a held-out prompt set drawn from a different distribution than the router's training corpus; if the gains over static SVD vanish or reverse, the central claim does not hold.

Figures

Figures reproduced from arXiv: 2605.08568 by Grace Li Zhang, Hengyi Zhu, Shaoyi Huang, Zhendong Mi.

Figure 1: Per-prompt window perplexity on WikiText-2 for the dense and compressed LLaMA-7B. Observation 1: rank selection is sensitive to input prompts. Prior work [33, 14, 15, 32] typically evaluates compressed models by partitioning the test set into fixed-length windows and averaging perplexity (PPL) across them. Such aggregate metrics, however, mask the per-prompt behavior of the compressed model…
Figure 2: (a) Cross-dataset evaluation of SVD-compressed LLaMA-7B under different calibration…
Figure 3: Overview of the proposed framework with example of…
Figure 4: Correlation between rank subset overlap and prompt embedding cosine similarity. We reformulate each weight matrix as a mixture of independent rank experts and introduce an offline linear router that selects a prompt-aware subset for each input, addressing the static rank truncation in Observation 1 that discards components critical for specific prompts. As established in Section 2, applying a weight matri…
Figure 5: Rank overlap between the subset selected at prefilling and those selected at each subsequent decoding step. Rank reuse for decoding: at each decoding step, querying the router introduces an additional forward pass through fθ before every matrix computation, accumulating latency overhead across all layers and steps…
Figure 6: (a) Memory aggregation reduces scattered expert…
Figure 7: Prefill latency (ms) and decode latency (ms) of each token of native SVD, Dense (PyTorch), …
Figure 8: Prefill latency (ms) and decode latency (ms) of each token of…
Figure 9: Per-window perplexity on WikiText-2 for the dense…
Original abstract

Large language models (LLMs) have rapidly grown in scale, creating substantial memory and computational costs that hinder efficient deployment. Singular value decomposition (SVD) has emerged as an effective post-training compression technique, but existing SVD-based methods rely on static rank truncation, applying a fixed prefix of singular components to all inputs regardless of their diversity. We identify two limitations of this static design: the optimal rank varies across individual prompts, and the selected rank is sensitive to the choice of calibration set, leading to suboptimal performance across diverse inputs. To address these challenges, we propose PARSE, a post-training framework for Prompt-Aware Rank Selection as Experts in SVD-compressed LLMs. PARSE trains a linear router offline to perform prompt-aware rank selection, decoupling it from calibration information by supervising the router against dense-model outputs on a large-scale corpus. We further observe that rank-selection patterns are shared across semantically similar prompts and remain stable across decoding steps, allowing appropriate rank subsets to be served directly from a pattern cache at inference. Complemented by expert memory aggregation and kernel fusion for system-level efficiency, PARSE is orthogonal to existing SVD-based pipelines and consistently improves both model quality and inference efficiency. Integrated with four representative SVD-based methods, PARSE improves average task accuracy by up to 10% at a compression ratio of 0.6 on LLaMA-7B, and achieves up to 2.5× prefill and 2.4× decode speedup over native SVD execution.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper proposes PARSE, a post-training framework for prompt-aware dynamic rank selection in SVD-compressed LLMs. It identifies that static rank truncation is suboptimal because optimal ranks vary across prompts and are sensitive to calibration sets. PARSE trains a linear router offline, supervised on dense-model outputs from a large-scale corpus, to select per-layer rank subsets for each prompt. It exploits observed stability of rank patterns across semantically similar prompts and decoding steps via a pattern cache, plus expert memory aggregation and kernel fusion. When integrated with four SVD baselines, it reports up to 10% average task accuracy gain at 0.6 compression on LLaMA-7B and up to 2.5× prefill / 2.4× decode speedups over native SVD.

Significance. If the linear router generalizes reliably, the work offers a lightweight, calibration-decoupled way to improve existing SVD pipelines without retraining the base model or introducing non-linear overhead. The orthogonality claim and the use of offline dense supervision are positive features; reproducible speedups from caching and fusion would be practically useful for deployment.

major comments (3)
  1. [§3] §3 (router training): the headline accuracy claim (up to 10% at ratio 0.6) rests on the linear router generalizing from a fixed large-scale corpus to unseen prompts. No quantitative evidence is provided that rank-selection patterns are linearly separable (e.g., no separability metrics, no comparison to non-linear routers, no sensitivity analysis to corpus choice). If the mapping is corpus-dependent or requires non-linear decision boundaries, the reported gains become an artifact of the particular training distribution rather than a general property.
  2. [§4] §4 (experiments): the integration results with four SVD methods lack error bars, multiple random seeds, or statistical significance tests. Without these, it is impossible to determine whether the 10% average accuracy lift is robust or driven by post-hoc choices of prompts, tasks, or calibration data.
  3. [§4.3] §4.3 (generalization): the claim that rank patterns are shared across semantically similar prompts and stable across decoding steps is stated but not supported by any clustering, similarity, or stability metrics. A quantitative validation (e.g., intra-cluster variance of selected ranks or cross-prompt transfer accuracy) is needed to justify the pattern-cache design.
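The check major comment 2 asks for is cheap to specify: a paired test over per-task accuracies with and without the router. A stdlib-only sketch of the paired t-statistic (the same quantity scipy.stats.ttest_rel reports), on invented numbers:

```python
import math

def paired_t(xs, ys):
    """Paired t-statistic for matched samples, e.g. per-task accuracy
    with the router (xs) versus static SVD (ys)."""
    d = [x - y for x, y in zip(xs, ys)]
    n = len(d)
    mean = sum(d) / n
    var = sum((di - mean) ** 2 for di in d) / (n - 1)  # sample variance
    return mean / math.sqrt(var / n)

# Hypothetical per-task accuracies across five tasks.
with_router = [0.62, 0.58, 0.71, 0.66, 0.60]
static_svd  = [0.55, 0.54, 0.65, 0.61, 0.57]
print(round(paired_t(with_router, static_svd), 2))  # 7.07
```

A large |t| across seeds and tasks (with the matching p-value) is what would separate a robust 10% lift from noise.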
minor comments (2)
  1. [§3.2] Notation for the router input features and the exact supervision loss (cross-entropy on dense logits?) should be defined explicitly with equations.
  2. [Abstract, §4] The abstract and experiments should clarify the precise compression ratio definition (parameter count, FLOPs, or memory) and report the effective rank distribution chosen by the router.
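One concrete definition minor comment 2 could adopt: under parameter-count accounting, a rank-r factorization of an m×n matrix stores r(m+n) values, giving ratio r(m+n)/(mn), and the rank that hits a target ratio follows directly. This is standard low-rank bookkeeping, not a formula quoted from the paper.

```python
def param_ratio(m, n, r):
    """Parameters kept after rank-r factorization, as a fraction of m*n."""
    return r * (m + n) / (m * n)

def rank_for_ratio(m, n, ratio):
    """Largest rank whose factorization stays within the target ratio."""
    return int(ratio * m * n / (m + n))

# A LLaMA-7B-sized 4096 x 4096 projection at the paper's 0.6 ratio.
m = n = 4096
r = rank_for_ratio(m, n, 0.6)
print(r, round(param_ratio(m, n, r), 3))  # 1228 0.6
```

FLOPs-based and memory-based definitions give different ranks for the same nominal ratio, which is exactly why the referee wants the definition pinned down.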

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment below and commit to revisions that strengthen the paper without altering its core claims.

Point-by-point responses
  1. Referee: [§3] the headline accuracy claim (up to 10% at ratio 0.6) rests on the linear router generalizing from a fixed large-scale corpus to unseen prompts. No quantitative evidence is provided that rank-selection patterns are linearly separable (e.g., no separability metrics, no comparison to non-linear routers, no sensitivity analysis to corpus choice).

    Authors: We agree that explicit evidence of linear separability would strengthen the justification for our router design. In the revised manuscript we will add a dedicated analysis subsection that reports (i) a linear separability metric (ratio of between-class to within-class scatter on rank-label embeddings), (ii) a direct comparison of the linear router against a small two-layer MLP on the same supervision data, and (iii) sensitivity results obtained by training on random 50% and 25% subsets of the corpus and evaluating on held-out prompts. These additions will demonstrate that the observed gains are not artifacts of the particular training distribution. revision: yes

  2. Referee: [§4] the integration results with four SVD methods lack error bars, multiple random seeds, or statistical significance tests.

    Authors: We acknowledge that variability measures are necessary to establish robustness. We will re-execute the main accuracy and speedup experiments across three independent random seeds, report mean and standard deviation for all task accuracies, and include paired t-tests (or Wilcoxon signed-rank tests where normality assumptions fail) for the reported improvements over the four SVD baselines. These statistics will be added to Tables 2–4 and the corresponding text in §4. revision: yes

  3. Referee: [§4.3] the claim that rank patterns are shared across semantically similar prompts and stable across decoding steps is stated but not supported by any clustering, similarity, or stability metrics.

    Authors: We recognize that quantitative validation is required to support the pattern-cache design. In the revision we will insert (i) intra-cluster variance of selected ranks when prompts are clustered by sentence-BERT embeddings, (ii) cross-prompt transfer accuracy when a router trained on one cluster is evaluated on another, and (iii) per-layer rank-change frequency and average stability score across decoding steps on long sequences. These metrics will be presented in a new paragraph in §4.3 together with the existing qualitative observations. revision: yes
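The intra-cluster metric promised in response 3 can be phrased as mean pairwise Jaccard similarity of the rank subsets selected within one embedding cluster; values near 1 justify serving a single cached subset per cluster. Data and clustering below are invented for illustration.

```python
def jaccard(a, b):
    """Overlap of two rank subsets: |intersection| / |union|."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

def intra_cluster_agreement(clusters):
    """Mean pairwise Jaccard similarity of rank subsets inside each cluster.

    clusters: list of clusters; each cluster is a list of rank subsets
    chosen for semantically similar prompts. 1.0 = identical subsets.
    """
    sims = []
    for subsets in clusters:
        for i in range(len(subsets)):
            for j in range(i + 1, len(subsets)):
                sims.append(jaccard(subsets[i], subsets[j]))
    return sum(sims) / len(sims)

# One tight cluster: three similar prompts, nearly identical subsets.
tight = [[[0, 1, 2], [0, 1, 2], [0, 1, 3]]]
print(round(intra_cluster_agreement(tight), 2))  # 0.67
```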

Circularity Check

0 steps flagged

Minor self-citation present but not load-bearing; router supervision remains externally grounded

full rationale

The paper's central mechanism trains a linear router offline by supervising it directly against dense-model outputs on an external large-scale corpus, decoupling rank selection from any calibration set used in SVD truncation. This provides independent grounding outside the compressed model itself. No equations or steps reduce by construction to fitted inputs renamed as predictions, no uniqueness theorems are imported from the same authors, and no ansatz is smuggled via self-citation. The abstract explicitly states the supervision source and orthogonality to existing SVD pipelines, making the derivation self-contained against external benchmarks. A score of 2 accounts for the possibility of routine self-citations in the full text that do not carry the core claim.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Based solely on the abstract, the central claim rests on the learnability of rank patterns by a linear router and their stability across similar prompts and decoding steps; no explicit free parameters, axioms, or invented entities are detailed beyond the router training process.

pith-pipeline@v0.9.0 · 5606 in / 1154 out tokens · 31499 ms · 2026-05-12T02:19:10.748477+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.


Reference graph

Works this paper leans on

66 extracted references · 66 canonical work pages · 9 internal anchors

  1. [1]

    Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shirong Ma, Xiao Bi, Xiaokang Zhang, Xingkai Yu, Yu Wu, Z. F. Wu, Zhibin Gou, Zhihong Shao, Zhuoshu Li, Ziyi Gao, Aixin Liu, Bing Xue, Bingxuan Wang, Bochao Wu, Bei Feng, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chong Ruan, Damai Dai, Deli Chen, Dongjie Ji, ...

  2. [2]

    Open-reasoner-zero: An open source approach to scaling up reinforcement learning on the base model, 2025

    Jingcheng Hu, Yinmin Zhang, Qi Han, Daxin Jiang, Xiangyu Zhang, and Heung-Yeung Shum. Open-reasoner-zero: An open source approach to scaling up reinforcement learning on the base model, 2025

  3. [3]

Scaling laws for neural language models, 2020

    Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B. Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models, 2020

  4. [4]

Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwi...

  5. [5]

    Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, Parker Schuh, Kensen Shi, Sasha Tsvyashchenko, Joshua Maynez, Abhishek Rao, Parker Barnes, Yi Tay, Noam Shazeer, Vinodkumar Prabhakaran, Emily Reif, Nan Du, Ben Hutchinson, Reiner Pope, James Bradb...

  6. [6]

    LLaMA: Open and Efficient Foundation Language Models

Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023

  7. [7]

    Qwen3 technical report, 2025

    An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, Chujie Zheng, Dayiheng Liu, Fan Zhou, Fei Huang, Feng Hu, Hao Ge, Haoran Wei, Huan Lin, Jialong Tang, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jing Zhou, Jingren Zhou, Junyang Lin, Kai Dang, Keqin Bao, Kexin Yang, ...

  8. [8]

Flexgen: High-throughput generative inference of large language models with a single gpu, 2023

    Ying Sheng, Lianmin Zheng, Binhang Yuan, Zhuohan Li, Max Ryabinin, Daniel Y. Fu, Zhiqiang Xie, Beidi Chen, Clark Barrett, Joseph E. Gonzalez, Percy Liang, Christopher Ré, Ion Stoica, and Ce Zhang. Flexgen: High-throughput generative inference of large language models with a single gpu, 2023

  9. [9]

    Dipsvd: Dual-importance protected svd for efficient llm compression

Xuan Ding, Rui Sun, Yunjian Zhang, Xiu Yan, Yueqi Zhou, Kaihao Huang, Suzhong Fu, Chuanlong Xie, and Yao Zhu. Dipsvd: Dual-importance protected svd for efficient llm compression. arXiv preprint arXiv:2506.20353, 2025

  10. [10]

    A survey on efficient inference for large language models, 2024

    Zixuan Zhou, Xuefei Ning, Ke Hong, Tianyu Fu, Jiaming Xu, Shiyao Li, Yuming Lou, Luning Wang, Zhihang Yuan, Xiuhong Li, Shengen Yan, Guohao Dai, Xiao-Ping Zhang, Yuhan Dong, and Yu Wang. A survey on efficient inference for large language models, 2024

  11. [11]

    Model compression and efficient inference for large language models: A survey, 2024

    Wenxiao Wang, Wei Chen, Yicong Luo, Yongliu Long, Zhengkai Lin, Liye Zhang, Binbin Lin, Deng Cai, and Xiaofei He. Model compression and efficient inference for large language models: A survey, 2024

  12. [12]

    Enhancing energy efficiency in ai: A multi-faceted analysis across time series, semantic ai and deep learning domains

Lejla Begic Fazlic, Berkay Cetkin, Achim Guldner, Matthias Dziubany, Julian Heinen, Stefan Naumann, and Guido Dartmann. Enhancing energy efficiency in ai: A multi-faceted analysis across time series, semantic ai and deep learning domains. In Environmental Informatics, pages 237–256. Springer, 2024

  13. [13]

The carbon footprint of machine learning training will plateau, then shrink. Computer, 55(7):18–28, 2022

    David Patterson, Joseph Gonzalez, Urs Hölzle, Quoc Le, Chen Liang, Lluis-Miquel Munguia, Daniel Rothchild, David R So, Maud Texier, and Jeff Dean. The carbon footprint of machine learning training will plateau, then shrink. Computer, 55(7):18–28, 2022

  14. [14]

Dobi-svd: Differentiable svd for llm compression and some new perspectives. arXiv preprint arXiv:2502.02723, 2025

    Qinsi Wang, Jinghan Ke, Masayoshi Tomizuka, Yiran Chen, Kurt Keutzer, and Chenfeng Xu. Dobi-svd: Differentiable svd for llm compression and some new perspectives. arXiv preprint arXiv:2502.02723, 2025

  15. [15]

Basis sharing: Cross-layer parameter sharing for large language model compression. arXiv preprint arXiv:2410.03765, 2024

    Jingcun Wang, Yu-Guang Chen, Ing-Chao Lin, Bing Li, and Grace Li Zhang. Basis sharing: Cross-layer parameter sharing for large language model compression. arXiv preprint arXiv:2410.03765, 2024

  16. [16]

    GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers

Elias Frantar, Saleh Ashkboos, Torsten Hoefler, and Dan Alistarh. Gptq: Accurate post-training quantization for generative pre-trained transformers. arXiv preprint arXiv:2210.17323, 2022

  17. [17]

Awq: Activation-aware weight quantization for on-device llm compression and acceleration. Proceedings of Machine Learning and Systems, 6:87–100, 2024

    Ji Lin, Jiaming Tang, Haotian Tang, Shang Yang, Wei-Ming Chen, Wei-Chen Wang, Guangxuan Xiao, Xingyu Dang, Chuang Gan, and Song Han. Awq: Activation-aware weight quantization for on-device llm compression and acceleration. Proceedings of Machine Learning and Systems, 6:87–100, 2024

  18. [18]

Billm: Pushing the limit of post-training quantization for llms. arXiv preprint arXiv:2402.04291, 2024

    Wei Huang, Yangdong Liu, Haotong Qin, Ying Li, Shiming Zhang, Xianglong Liu, Michele Magno, and Xiaojuan Qi. Billm: Pushing the limit of post-training quantization for llms. arXiv preprint arXiv:2402.04291, 2024

  19. [19]

Llm-pruner: On the structural pruning of large language models. Advances in Neural Information Processing Systems, 36:21702–21720, 2023

    Xinyin Ma, Gongfan Fang, and Xinchao Wang. Llm-pruner: On the structural pruning of large language models. Advances in Neural Information Processing Systems, 36:21702–21720, 2023

  20. [20]

    Sparsegpt: Massive language models can be accurately pruned in one-shot

Elias Frantar and Dan Alistarh. Sparsegpt: Massive language models can be accurately pruned in one-shot. In International Conference on Machine Learning, pages 10323–10337. PMLR, 2023

  21. [21]

A simple and effective pruning approach for large language models. arXiv preprint arXiv:2306.11695, 2023

    Mingjie Sun, Zhuang Liu, Anna Bair, and J Zico Kolter. A simple and effective pruning approach for large language models. arXiv preprint arXiv:2306.11695, 2023

  22. [22]

    Minillm: Knowledge distillation of large language models

Yuxian Gu, Li Dong, Furu Wei, and Minlie Huang. Minillm: Knowledge distillation of large language models. In The Twelfth International Conference on Learning Representations, 2024

  23. [23]

Survey on knowledge distillation for large language models: methods, evaluation, and application. ACM Transactions on Intelligent Systems and Technology, 16(6):1–27, 2025

    Chuanpeng Yang, Yao Zhu, Wang Lu, Yidong Wang, Qian Chen, Chenlong Gao, Bingjie Yan, and Yiqiang Chen. Survey on knowledge distillation for large language models: methods, evaluation, and application. ACM Transactions on Intelligent Systems and Technology, 16(6):1–27, 2025

  24. [24]

Qlora: Efficient finetuning of quantized llms. Advances in Neural Information Processing Systems, 36:10088–10115, 2023

    Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, and Luke Zettlemoyer. Qlora: Efficient finetuning of quantized llms. Advances in Neural Information Processing Systems, 36:10088–10115, 2023

  25. [25]

Polycystic kidney disease. Nature Reviews Disease Primers, 4(1):50, 2018

    Carsten Bergmann, Lisa M Guay-Woodford, Peter C Harris, Shigeo Horie, Dorien JM Peters, and Vicente E Torres. Polycystic kidney disease. Nature Reviews Disease Primers, 4(1):50, 2018

  26. [26]

Adabert: Task-adaptive bert compression with differentiable neural architecture search. arXiv preprint arXiv:2001.04246, 2020

    Daoyuan Chen, Yaliang Li, Minghui Qiu, Zhen Wang, Bofang Li, Bolin Ding, Hongbo Deng, Jun Huang, Wei Lin, and Jingren Zhou. Adabert: Task-adaptive bert compression with differentiable neural architecture search. arXiv preprint arXiv:2001.04246, 2020

  27. [27]

Asvd: Activation-aware singular value decomposition for compressing large language models

    Zhihang Yuan, Yuzhang Shang, Yue Song, Dawei Yang, Qiang Wu, Yan Yan, and Guangyu Sun. Asvd: Activation-aware singular value decomposition for compressing large language models. arXiv preprint arXiv:2312.05821, 2023

  28. [28]

Svd-llm: Truncation-aware singular value decomposition for large language model compression

    Xin Wang, Yu Zheng, Zhongwei Wan, and Mi Zhang. Svd-llm: Truncation-aware singular value decomposition for large language model compression. arXiv preprint arXiv:2403.07378, 2024

  29. [29]

Adasvd: Adaptive singular value decomposition for large language models. arXiv preprint arXiv:2502.01403, 2025

    Zhiteng Li, Mingyuan Xia, Jingyuan Zhang, Zheng Hui, Haotong Qin, Linghe Kong, Yulun Zhang, and Xiaokang Yang. Adasvd: Adaptive singular value decomposition for large language models. arXiv preprint arXiv:2502.01403, 2025

  30. [30]

    The approximation of one matrix by another of lower rank

    Carl Eckart and Gale Young. The approximation of one matrix by another of lower rank. Psychometrika, 1(3):211–218, 1936

  31. [31]

Matrix analysis and applied linear algebra

    Carl D Meyer. Matrix Analysis and Applied Linear Algebra. SIAM, 2023

  32. [32]

Saes-svd: Self-adaptive suppression of accumulated and local errors for svd-based llm compression. arXiv preprint arXiv:2602.03051, 2026

    Xing Hu, Dawei Yang, Yuan Cheng, Zhixuan Chen, and Zukang Xu. Saes-svd: Self-adaptive suppression of accumulated and local errors for svd-based llm compression. arXiv preprint arXiv:2602.03051, 2026

  33. [33]

    Svd-llm v2: Optimizing singular value truncation for large language model compression

Xin Wang, Samiul Alam, Zhongwei Wan, Hui Shen, and Mi Zhang. Svd-llm v2: Optimizing singular value truncation for large language model compression. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 4287–4296, 2025

  34. [34]

    SNIP: Single-shot Network Pruning based on Connection Sensitivity

Namhoon Lee, Thalaiyasingam Ajanthan, and Philip HS Torr. Snip: Single-shot network pruning based on connection sensitivity. arXiv preprint arXiv:1810.02340, 2018

  35. [35]

Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity. Journal of Machine Learning Research, 23(120):1–39, 2022

    William Fedus, Barret Zoph, and Noam Shazeer. Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity. Journal of Machine Learning Research, 23(120):1–39, 2022

  36. [36]

Layer-wise dynamic rank for compressing large language models. arXiv preprint arXiv:2509.25622, 2025

    Zhendong Mi, Bian Sun, Grace Li Zhang, and Shaoyi Huang. Layer-wise dynamic rank for compressing large language models. arXiv preprint arXiv:2509.25622, 2025

  37. [37]

    Gqa: Training generalized multi-query transformer models from multi-head checkpoints

Joshua Ainslie, James Lee-Thorp, Michiel De Jong, Yury Zemlyanskiy, Federico Lebrón, and Sumit Sanghai. Gqa: Training generalized multi-query transformer models from multi-head checkpoints. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 4895–4901, 2023

  38. [38]

Binyuan Hui, Jian Yang, Zeyu Cui, Jiaxi Yang, Dayiheng Liu, Lei Zhang, Tianyu Liu, Jiajun Zhang, Bowen Yu, Keming Lu, et al. Qwen2.5-coder technical report. arXiv preprint arXiv:2409.12186, 2024

  39. [39]

    Pointer sentinel mixture models, 2016

    Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher. Pointer sentinel mixture models, 2016

  40. [40]

Building a large annotated corpus of english: The penn treebank. Computational Linguistics, 19(2):313–330, 1993

    Mitch Marcus, Beatrice Santorini, and Mary Ann Marcinkiewicz. Building a large annotated corpus of english: The penn treebank. Computational Linguistics, 19(2):313–330, 1993

  41. [41]

Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21(140):1–67, 2020

    Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21(140):1–67, 2020

  42. [42]

    Can a suit of armor conduct electricity? a new dataset for open book question answering

Todor Mihaylov, Peter Clark, Tushar Khot, and Ashish Sabharwal. Can a suit of armor conduct electricity? a new dataset for open book question answering. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2381–2391, 2018

  43. [43]

    Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge

Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. Think you have solved question answering? try arc, the ai2 reasoning challenge. arXiv preprint arXiv:1803.05457, 2018

  44. [44]

Winogrande: An adversarial winograd schema challenge at scale. Communications of the ACM, 64(9):99–106, 2021

    Keisuke Sakaguchi, Ronan Le Bras, Chandra Bhagavatula, and Yejin Choi. Winogrande: An adversarial winograd schema challenge at scale. Communications of the ACM, 64(9):99–106, 2021

  45. [45]

Hellaswag: Can a machine really finish your sentence? In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4791–4800, 2019

    Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. Hellaswag: Can a machine really finish your sentence? In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4791–4800, 2019

  46. [46]

Piqa: Reasoning about physical commonsense in natural language

    Yonatan Bisk, Rowan Zellers, Jianfeng Gao, Yejin Choi, et al. Piqa: Reasoning about physical commonsense in natural language. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 7432–7439, 2020

  47. [47]

    Mathqa: Towards interpretable math word problem solving with operation-based formalisms

    Aida Amini, Saadia Gabriel, Shanchuan Lin, Rik Koncel-Kedziorski, Yejin Choi, and Hannaneh Hajishirzi. Mathqa: Towards interpretable math word problem solving with operation-based formalisms. InProceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and ...

  48. [48]

    Leo Gao, Jonathan Tow, Baber Abbasi, Stella Biderman, Sid Black, Anthony DiPofi, Charles Foster, Laurence Golding, Jeffrey Hsu, Alain Le Noac'h, Haonan Li, Kyle McDonell, Niklas Muennighoff, Chris Ociepa, Jason Phang, Laria Reynolds, Hailey Schoelkopf, Aviya Skowron, Lintang Sutawika, Eric Tang, Anish Thite, Ben Wang, Kevin Wang, and Andy Zou. The language model evaluation harness, 07 2024

  49. [49]

    Yen-Chang Hsu, Ting Hua, Sungen Chang, Qian Lou, Yilin Shen, and Hongxia Jin. Language model compression with weighted low-rank factorization. arXiv preprint arXiv:2207.00112, 2022

  50. [50]

    Xunyu Zhu, Jian Li, Yong Liu, Can Ma, and Weiping Wang. A survey on model compression for large language models. Transactions of the Association for Computational Linguistics, 12:1556–1577, 2024

  51. [51]

    Canwen Xu and Julian McAuley. A survey on model compression and acceleration for pretrained language models. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 37, pages 10566–10575, 2023

  52. [52]

    Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015

  53. [53]

    Yuxian Gu, Li Dong, Furu Wei, and Minlie Huang. MiniLLM: Knowledge distillation of large language models. arXiv preprint arXiv:2306.08543, 2023

  54. [54]

    Xiaohan Xu, Ming Li, Chongyang Tao, Tao Shen, Reynold Cheng, Jinyang Li, Can Xu, Dacheng Tao, and Tianyi Zhou. A survey on knowledge distillation of large language models. arXiv preprint arXiv:2402.13116, 2024

  55. [55]

    Geonhwa Jeong, Po-An Tsai, Abhimanyu R Bambhaniya, Stephen W Keckler, and Tushar Krishna. Enabling unstructured sparse acceleration on structured sparse accelerators. Proceedings of Machine Learning and Systems, 7, 2025

  56. [56]

    Shail Dave, Riyadh Baghdadi, Tony Nowatzki, Sasikanth Avancha, Aviral Shrivastava, and Baoxin Li. Hardware acceleration of sparse and irregular tensor computations of ML models: A survey and insights. Proceedings of the IEEE, 109(10):1706–1752, 2021

  57. [57]

    Saleh Ashkboos, Maximilian L Croci, Marcelo Gennari do Nascimento, Torsten Hoefler, and James Hensman. SliceGPT: Compress large language models by deleting rows and columns. arXiv preprint arXiv:2401.15024, 2024

  58. [58]

    Jialong Guo, Xinghao Chen, Yehui Tang, and Yunhe Wang. SlimLLM: Accurate structured pruning for large language models. arXiv preprint arXiv:2505.22689, 2025

  59. [59]

    Guangxuan Xiao, Ji Lin, Mickael Seznec, Hao Wu, Julien Demouth, and Song Han. SmoothQuant: Accurate and efficient post-training quantization for large language models. In International Conference on Machine Learning, pages 38087–38099. PMLR, 2023

  60. [60]

    Zhewei Yao, Reza Yazdani Aminabadi, Minjia Zhang, Xiaoxia Wu, Conglong Li, and Yuxiong He. ZeroQuant: Efficient and affordable post-training quantization for large-scale transformers. Advances in Neural Information Processing Systems, 35:27168–27183, 2022

  61. [61]

    Zhen Li, Yupeng Su, Runming Yang, Congkai Xie, Zheng Wang, Zhongwei Xie, Ngai Wong, and Hongxia Yang. Quantization meets reasoning: Exploring LLM low-bit quantization degradation for mathematical reasoning. arXiv preprint arXiv:2501.03035, 2025

  62. [62]

    Dmitry Lepikhin, HyoukJoong Lee, Yuanzhong Xu, Dehao Chen, Orhan Firat, Yanping Huang, Maxim Krikun, Noam Shazeer, and Zhifeng Chen. GShard: Scaling giant models with conditional computation and automatic sharding. arXiv preprint arXiv:2006.16668, 2020

  63. [63]

    Weilin Cai, Juyong Jiang, Fan Wang, Jing Tang, Sunghun Kim, and Jiayi Huang. A survey on mixture of experts. Authorea Preprints, 2024

  64. [64]

    Albert Q Jiang, Alexandre Sablayrolles, Antoine Roux, Arthur Mensch, Blanche Savary, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Emma Bou Hanna, Florian Bressand, et al. Mixtral of Experts. arXiv preprint arXiv:2401.04088, 2024

  65. [65]

    Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, et al. DeepSeek-V3 technical report. arXiv preprint arXiv:2412.19437, 2024

  66. [66]

    Zhengyan Zhang, Yankai Lin, Zhiyuan Liu, Peng Li, Maosong Sun, and Jie Zhou. MoEfication: Transformer feed-forward layers are mixtures of experts. In Findings of the Association for Computational Linguistics: ACL 2022, pages 877–890, 2022

A Related Works

Large Language Model Compression. Large language model compression has been widely studied to reduce...