Pith · machine review for the scientific record

arxiv: 2605.08568 · v1 · submitted 2026-05-09 · 💻 cs.LG

Recognition: 2 theorem links · Lean Theorem

Different Prompts, Different Ranks: Prompt-aware Dynamic Rank Selection for SVD-based LLM Compression

Authors on Pith: no claims yet

Pith reviewed 2026-05-12 02:19 UTC · model grok-4.3

classification 💻 cs.LG
keywords: SVD compression · LLM compression · prompt-aware rank selection · dynamic rank · linear router · post-training compression · inference speedup

The pith

A linear router trained on dense outputs can pick prompt-specific SVD ranks to improve accuracy and speed in compressed LLMs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that static SVD rank truncation is limited because the optimal number of singular components varies across prompts and depends heavily on the calibration set chosen during compression. PARSE addresses this by training a linear router offline to select ranks for each prompt by matching the behavior of the full dense model on a large corpus. The router's choices exhibit patterns that repeat for similar prompts and remain stable across generation steps, so cached selections can be reused directly. When added to four existing SVD methods, the framework raises average accuracy by up to 10 percent at a 0.6 compression ratio on LLaMA-7B and delivers up to 2.5 times faster prefill and 2.4 times faster decode than static SVD execution.
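The rank-expert view underlying this can be sketched in a few lines. A weight matrix is held as rank-1 components (sigma_i, u_i, v_i); static truncation always applies a fixed prefix of them, while prompt-aware selection applies an input-specific subset. Everything below is illustrative, not the paper's code.

```python
def apply_rank_experts(experts, x, keep):
    """Apply W x using only the rank-1 experts in `keep`.

    Each expert is (sigma, u, v) with W = sum_i sigma_i * outer(u_i, v_i),
    so W x = sum_i sigma_i * u_i * (v_i . x). Static truncation fixes
    `keep` to a prefix; dynamic selection chooses it per prompt.
    """
    y = [0.0] * len(experts[0][1])
    for i in keep:
        sigma, u, v = experts[i]
        coef = sigma * sum(vj * xj for vj, xj in zip(v, x))
        y = [yi + coef * ui for yi, ui in zip(y, u)]
    return y

# Two orthogonal rank-1 experts acting on a 2-D input.
experts = [
    (2.0, [1.0, 0.0], [1.0, 0.0]),
    (0.5, [0.0, 1.0], [0.0, 1.0]),
]
x = [3.0, 4.0]
print(apply_rank_experts(experts, x, keep=[0, 1]))  # full rank: [6.0, 2.0]
print(apply_rank_experts(experts, x, keep=[0]))     # truncated: [6.0, 0.0]
```

A prompt whose signal lives in the second component loses it entirely under the prefix `keep=[0]`, which is exactly the per-prompt sensitivity the paper's Observation 1 points at.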

Core claim

PARSE decouples rank selection from any fixed calibration set by supervising a linear router against dense-model outputs on a broad corpus. This router assigns different SVD ranks to different inputs at inference time. Because rank-selection patterns are shared across semantically similar prompts and stay consistent during decoding, the chosen rank subsets can be served from a pattern cache. Expert memory aggregation and kernel fusion then keep the added overhead low. Integrated with four representative SVD pipelines, the method improves average task accuracy by up to 10 percent at a 0.6 compression ratio on LLaMA-7B while achieving up to 2.5× prefill and 2.4× decode speedup over native SVD.

What carries the argument

A linear router that maps prompt embeddings to rank choices, trained offline by supervising against dense-model outputs rather than calibration data, which selects input-specific SVD rank subsets at inference.
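A minimal sketch of such a router, with invented shapes and weights (the paper does not publish its exact parameterization here): one linear map scores every rank expert from the prompt embedding, and the top-k scores pick the subset.

```python
def linear_router(weight, bias, embedding, k):
    """Score each rank expert with a single linear layer, keep the top-k.

    weight: one weight row per expert; bias: one offset per expert.
    Returns the indices of the k highest-scoring experts, sorted.
    """
    scores = [
        sum(wj * ej for wj, ej in zip(row, embedding)) + b
        for row, b in zip(weight, bias)
    ]
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    return sorted(top)

# Toy router over 4 rank experts and a 3-D prompt embedding.
weight = [[1.0, 0.0, 0.0],
          [0.0, 1.0, 0.0],
          [0.0, 0.0, 1.0],
          [0.5, 0.5, 0.0]]
bias = [0.0, 0.0, 0.0, 0.0]
print(linear_router(weight, bias, embedding=[2.0, 1.0, 0.0], k=2))  # [0, 3]
```

Because the scoring is a single matrix-vector product, the router's inference cost is negligible next to a transformer layer, which is the property the speedup claims depend on.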

If this is right

  • Existing SVD-based compression pipelines can be upgraded by adding the router without altering their core low-rank decomposition step.
  • Each prompt receives only the singular components it needs, reducing average compute and memory traffic during both prefill and decode.
  • Rank patterns that repeat across similar prompts allow caching, so router cost becomes negligible for long generations.
  • The accuracy and speedup benefits hold when the router is combined with any of the four tested SVD methods.
  • Rank selection no longer depends on the particular calibration dataset used to build the compressed model.
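The compute saving in the second bullet follows from simple accounting: a dense m×n matvec costs mn multiply-adds, while the rank-r factored form costs r(m+n), so any prompt routed to a small subset does proportionally less work. A toy version with illustrative sizes:

```python
def matvec_flops(m, n, r=None):
    """Multiply-adds for one matvec: dense W x, or factored U_r (V_r^T x)."""
    return m * n if r is None else r * (m + n)

m = n = 4096                               # LLaMA-7B-sized projection
dense = matvec_flops(m, n)
easy_prompt = matvec_flops(m, n, r=512)    # few components suffice
hard_prompt = matvec_flops(m, n, r=1536)   # needs more of the spectrum
print(dense // easy_prompt, dense // hard_prompt)  # 4 1
```

The specific ranks here are invented; the point is only that per-prompt ranks turn a fixed cost into an average over the prompt distribution.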

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same prompt-conditioned selection principle could be tested on other low-rank or sparse compression schemes that currently use static truncation.
  • If the router generalizes across domains, it could reduce the need for repeated calibration when deploying compressed models to new tasks.
  • The observed stability of rank choices across decoding steps suggests that prompt-level decisions can be made once per sequence instead of per token.
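The once-per-sequence idea above can be sketched as a similarity-keyed pattern cache: if a new prompt embedding is close enough to a cached one (the 0.95 cosine threshold here is an invented parameter, not the paper's), its rank subset is reused and the router is skipped.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

class RankPatternCache:
    """Reuse rank subsets for prompts whose embeddings are near a cached one."""

    def __init__(self, router, threshold=0.95):
        self.router = router      # callable: embedding -> rank subset
        self.threshold = threshold
        self.entries = []         # (embedding, subset) pairs

    def select(self, embedding):
        for cached_emb, subset in self.entries:
            if cosine(embedding, cached_emb) >= self.threshold:
                return subset     # cache hit: router not queried
        subset = self.router(embedding)
        self.entries.append((embedding, subset))
        return subset

calls = []
def toy_router(emb):
    calls.append(emb)
    return [0, 2]

cache = RankPatternCache(toy_router)
cache.select([1.0, 0.0])    # miss: router queried once
cache.select([0.99, 0.01])  # hit: nearly identical prompt
print(len(calls))           # 1
```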

Load-bearing premise

A linear router trained offline on dense-model outputs from a large-scale corpus will reliably generalize to select suitable ranks for new, unseen prompts without being overly sensitive to the choice of training data.

What would settle it

Measure task accuracy on a held-out prompt set drawn from a different distribution than the router's training corpus; if the gains over static SVD vanish or reverse, the central claim does not hold.

Figures

Figures reproduced from arXiv: 2605.08568 by Grace Li Zhang, Hengyi Zhu, Shaoyi Huang, Zhendong Mi.

Figure 1: Per-prompt window perplexity on WikiText-2 for the dense and compressed LLaMA-7B. Observation 1: rank selection is sensitive to input prompts. Prior work [33, 14, 15, 32] typically evaluates compressed models by partitioning the test set into fixed-length windows and averaging perplexity (PPL) across them. Such aggregate metrics, however, mask the per-prompt behavior of the compressed model…
Figure 2: (a) Cross-dataset evaluation of SVD-compressed LLaMA-7B under different calibration…
Figure 3: Overview of the proposed framework with example of…
Figure 4: Correlation between rank subset overlap and prompt embedding cosine similarity. We reformulate each weight matrix as a mixture of independent rank experts and introduce an offline linear router that selects a prompt-aware subset for each input, addressing the static rank truncation in Observation 1 that discards components critical for specific prompts. As established in Section 2, applying a weight matri…
Figure 5: Rank overlap between the subset selected at prefilling and those selected at each subsequent decoding step. Rank reuse for decoding: at each decoding step, querying the router introduces an additional forward pass through fθ before every matrix computation, accumulating latency overhead across all layers and steps…
Figure 6: (a) Memory aggregation reduces scattered expert…
Figure 7: Prefill latency (ms) and decode latency (ms) of each token of native SVD, Dense (PyTorch), …
Figure 8: Prefill latency (ms) and decode latency (ms) of each token of…
Figure 9: Per-window perplexity on WikiText-2 for the dense…
Original abstract

Large language models (LLMs) have rapidly grown in scale, creating substantial memory and computational costs that hinder efficient deployment. Singular value decomposition (SVD) has emerged as an effective post-training compression technique, but existing SVD-based methods rely on static rank truncation, applying a fixed prefix of singular components to all inputs regardless of their diversity. We identify two limitations of this static design: the optimal rank varies across individual prompts, and the selected rank is sensitive to the choice of calibration set, leading to suboptimal performance across diverse inputs. To address these challenges, we propose PARSE, a post-training framework for Prompt-Aware Rank Selection as Experts in SVD-compressed LLMs. PARSE trains a linear router offline to perform prompt-aware rank selection, decoupling it from calibration information by supervising the router against dense-model outputs on a large-scale corpus. We further observe that rank-selection patterns are shared across semantically similar prompts and remain stable across decoding steps, allowing appropriate rank subsets to be served directly from a pattern cache at inference. Complemented by expert memory aggregation and kernel fusion for system-level efficiency, PARSE is orthogonal to existing SVD-based pipelines and consistently improves both model quality and inference efficiency. Integrated with four representative SVD-based methods, PARSE improves average task accuracy by up to 10% at a compression ratio of 0.6 on LLaMA-7B, and achieves up to 2.5× prefill and 2.4× decode speedup over native SVD execution.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper proposes PARSE, a post-training framework for prompt-aware dynamic rank selection in SVD-compressed LLMs. It identifies that static rank truncation is suboptimal because optimal ranks vary across prompts and are sensitive to calibration sets. PARSE trains a linear router offline, supervised on dense-model outputs from a large-scale corpus, to select per-layer rank subsets for each prompt. It exploits observed stability of rank patterns across semantically similar prompts and decoding steps via a pattern cache, plus expert memory aggregation and kernel fusion. When integrated with four SVD baselines, it reports up to 10% average task accuracy gain at 0.6 compression on LLaMA-7B and up to 2.5× prefill / 2.4× decode speedups over native SVD.

Significance. If the linear router generalizes reliably, the work offers a lightweight, calibration-decoupled way to improve existing SVD pipelines without retraining the base model or introducing non-linear overhead. The orthogonality claim and the use of offline dense supervision are positive features; reproducible speedups from caching and fusion would be practically useful for deployment.

major comments (3)
  1. [§3] §3 (router training): the headline accuracy claim (up to 10% at ratio 0.6) rests on the linear router generalizing from a fixed large-scale corpus to unseen prompts. No quantitative evidence is provided that rank-selection patterns are linearly separable (e.g., no separability metrics, no comparison to non-linear routers, no sensitivity analysis to corpus choice). If the mapping is corpus-dependent or requires non-linear decision boundaries, the reported gains become an artifact of the particular training distribution rather than a general property.
  2. [§4] §4 (experiments): the integration results with four SVD methods lack error bars, multiple random seeds, or statistical significance tests. Without these, it is impossible to determine whether the 10% average accuracy lift is robust or driven by post-hoc choices of prompts, tasks, or calibration data.
  3. [§4.3] §4.3 (generalization): the claim that rank patterns are shared across semantically similar prompts and stable across decoding steps is stated but not supported by any clustering, similarity, or stability metrics. A quantitative validation (e.g., intra-cluster variance of selected ranks or cross-prompt transfer accuracy) is needed to justify the pattern-cache design.
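The check major comment 2 asks for is cheap to specify: a paired test over per-task accuracies with and without the router. A stdlib-only sketch of the paired t-statistic (the same quantity scipy.stats.ttest_rel reports), on invented numbers:

```python
import math

def paired_t(xs, ys):
    """Paired t-statistic for matched samples, e.g. per-task accuracy
    with the router (xs) versus static SVD (ys)."""
    d = [x - y for x, y in zip(xs, ys)]
    n = len(d)
    mean = sum(d) / n
    var = sum((di - mean) ** 2 for di in d) / (n - 1)  # sample variance
    return mean / math.sqrt(var / n)

# Hypothetical per-task accuracies across five tasks.
with_router = [0.62, 0.58, 0.71, 0.66, 0.60]
static_svd  = [0.55, 0.54, 0.65, 0.61, 0.57]
print(round(paired_t(with_router, static_svd), 2))  # 7.07
```

A large |t| across seeds and tasks (with the matching p-value) is what would separate a robust 10% lift from noise.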
minor comments (2)
  1. [§3.2] Notation for the router input features and the exact supervision loss (cross-entropy on dense logits?) should be defined explicitly with equations.
  2. [Abstract, §4] The abstract and experiments should clarify the precise compression ratio definition (parameter count, FLOPs, or memory) and report the effective rank distribution chosen by the router.
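One concrete definition minor comment 2 could adopt: under parameter-count accounting, a rank-r factorization of an m×n matrix stores r(m+n) values, giving ratio r(m+n)/(mn), and the rank that hits a target ratio follows directly. This is standard low-rank bookkeeping, not a formula quoted from the paper.

```python
def param_ratio(m, n, r):
    """Parameters kept after rank-r factorization, as a fraction of m*n."""
    return r * (m + n) / (m * n)

def rank_for_ratio(m, n, ratio):
    """Largest rank whose factorization stays within the target ratio."""
    return int(ratio * m * n / (m + n))

# A LLaMA-7B-sized 4096 x 4096 projection at the paper's 0.6 ratio.
m = n = 4096
r = rank_for_ratio(m, n, 0.6)
print(r, round(param_ratio(m, n, r), 3))  # 1228 0.6
```

FLOPs-based and memory-based definitions give different ranks for the same nominal ratio, which is exactly why the referee wants the definition pinned down.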

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment below and commit to revisions that strengthen the paper without altering its core claims.

Point-by-point responses
  1. Referee: [§3] the headline accuracy claim (up to 10% at ratio 0.6) rests on the linear router generalizing from a fixed large-scale corpus to unseen prompts. No quantitative evidence is provided that rank-selection patterns are linearly separable (e.g., no separability metrics, no comparison to non-linear routers, no sensitivity analysis to corpus choice).

    Authors: We agree that explicit evidence of linear separability would strengthen the justification for our router design. In the revised manuscript we will add a dedicated analysis subsection that reports (i) a linear separability metric (ratio of between-class to within-class scatter on rank-label embeddings), (ii) a direct comparison of the linear router against a small two-layer MLP on the same supervision data, and (iii) sensitivity results obtained by training on random 50% and 25% subsets of the corpus and evaluating on held-out prompts. These additions will demonstrate that the observed gains are not artifacts of the particular training distribution. revision: yes

  2. Referee: [§4] the integration results with four SVD methods lack error bars, multiple random seeds, or statistical significance tests.

    Authors: We acknowledge that variability measures are necessary to establish robustness. We will re-execute the main accuracy and speedup experiments across three independent random seeds, report mean and standard deviation for all task accuracies, and include paired t-tests (or Wilcoxon signed-rank tests where normality assumptions fail) for the reported improvements over the four SVD baselines. These statistics will be added to Tables 2–4 and the corresponding text in §4. revision: yes

  3. Referee: [§4.3] the claim that rank patterns are shared across semantically similar prompts and stable across decoding steps is stated but not supported by any clustering, similarity, or stability metrics.

    Authors: We recognize that quantitative validation is required to support the pattern-cache design. In the revision we will insert (i) intra-cluster variance of selected ranks when prompts are clustered by sentence-BERT embeddings, (ii) cross-prompt transfer accuracy when a router trained on one cluster is evaluated on another, and (iii) per-layer rank-change frequency and average stability score across decoding steps on long sequences. These metrics will be presented in a new paragraph in §4.3 together with the existing qualitative observations. revision: yes
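The intra-cluster metric promised in response 3 can be phrased as mean pairwise Jaccard similarity of the rank subsets selected within one embedding cluster; values near 1 justify serving a single cached subset per cluster. Data and clustering below are invented for illustration.

```python
def jaccard(a, b):
    """Overlap of two rank subsets: |intersection| / |union|."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

def intra_cluster_agreement(clusters):
    """Mean pairwise Jaccard similarity of rank subsets inside each cluster.

    clusters: list of clusters; each cluster is a list of rank subsets
    chosen for semantically similar prompts. 1.0 = identical subsets.
    """
    sims = []
    for subsets in clusters:
        for i in range(len(subsets)):
            for j in range(i + 1, len(subsets)):
                sims.append(jaccard(subsets[i], subsets[j]))
    return sum(sims) / len(sims)

# One tight cluster: three similar prompts, nearly identical subsets.
tight = [[[0, 1, 2], [0, 1, 2], [0, 1, 3]]]
print(round(intra_cluster_agreement(tight), 2))  # 0.67
```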

Circularity Check

0 steps flagged

Minor self-citation present but not load-bearing; router supervision remains externally grounded

full rationale

The paper's central mechanism trains a linear router offline by supervising it directly against dense-model outputs on an external large-scale corpus, decoupling rank selection from any calibration set used in SVD truncation. This provides independent grounding outside the compressed model itself. No equations or steps reduce by construction to fitted inputs renamed as predictions, no uniqueness theorems are imported from the same authors, and no ansatz is smuggled via self-citation. The abstract explicitly states the supervision source and orthogonality to existing SVD pipelines, making the derivation self-contained against external benchmarks. A score of 2 accounts for the possibility of routine self-citations in the full text that do not carry the core claim.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Based solely on the abstract, the central claim rests on the learnability of rank patterns by a linear router and their stability across similar prompts and decoding steps; no explicit free parameters, axioms, or invented entities are detailed beyond the router training process.

pith-pipeline@v0.9.0 · 5606 in / 1154 out tokens · 31499 ms · 2026-05-12T02:19:10.748477+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.


Reference graph

Works this paper leans on

66 extracted references · 66 canonical work pages · 9 internal anchors

  1. [1]

    Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shirong Ma, Xiao Bi, Xiaokang Zhang, Xingkai Yu, Yu Wu, Z. F. Wu, Zhibin Gou, Zhihong Shao, Zhuoshu Li, Ziyi Gao, Aixin Liu, Bing Xue, Bingxuan Wang, Bochao Wu, Bei Feng, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chong Ruan, Damai Dai, Deli Chen, Dongjie Ji, ...

  2. [2]

    Open-reasoner-zero: An open source approach to scaling up reinforcement learning on the base model, 2025

    Jingcheng Hu, Yinmin Zhang, Qi Han, Daxin Jiang, Xiangyu Zhang, and Heung-Yeung Shum. Open-reasoner-zero: An open source approach to scaling up reinforcement learning on the base model, 2025

  3. [3]

Scaling laws for neural language models, 2020

    Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B. Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models, 2020

  4. [4]

Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwi...

  5. [5]

    Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, Parker Schuh, Kensen Shi, Sasha Tsvyashchenko, Joshua Maynez, Abhishek Rao, Parker Barnes, Yi Tay, Noam Shazeer, Vinodkumar Prabhakaran, Emily Reif, Nan Du, Ben Hutchinson, Reiner Pope, James Bradb...

  6. [6]

    LLaMA: Open and Efficient Foundation Language Models

Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023

  7. [7]

    Qwen3 technical report, 2025

    An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, Chujie Zheng, Dayiheng Liu, Fan Zhou, Fei Huang, Feng Hu, Hao Ge, Haoran Wei, Huan Lin, Jialong Tang, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jing Zhou, Jingren Zhou, Junyang Lin, Kai Dang, Keqin Bao, Kexin Yang, ...

  8. [8]

Flexgen: High-throughput generative inference of large language models with a single gpu, 2023

    Ying Sheng, Lianmin Zheng, Binhang Yuan, Zhuohan Li, Max Ryabinin, Daniel Y. Fu, Zhiqiang Xie, Beidi Chen, Clark Barrett, Joseph E. Gonzalez, Percy Liang, Christopher Ré, Ion Stoica, and Ce Zhang. Flexgen: High-throughput generative inference of large language models with a single gpu, 2023

  9. [9]

    Dipsvd: Dual-importance protected svd for efficient llm compression

Xuan Ding, Rui Sun, Yunjian Zhang, Xiu Yan, Yueqi Zhou, Kaihao Huang, Suzhong Fu, Chuanlong Xie, and Yao Zhu. Dipsvd: Dual-importance protected svd for efficient llm compression. arXiv preprint arXiv:2506.20353, 2025

  10. [10]

    A survey on efficient inference for large language models, 2024

    Zixuan Zhou, Xuefei Ning, Ke Hong, Tianyu Fu, Jiaming Xu, Shiyao Li, Yuming Lou, Luning Wang, Zhihang Yuan, Xiuhong Li, Shengen Yan, Guohao Dai, Xiao-Ping Zhang, Yuhan Dong, and Yu Wang. A survey on efficient inference for large language models, 2024

  11. [11]

    Model compression and efficient inference for large language models: A survey, 2024

    Wenxiao Wang, Wei Chen, Yicong Luo, Yongliu Long, Zhengkai Lin, Liye Zhang, Binbin Lin, Deng Cai, and Xiaofei He. Model compression and efficient inference for large language models: A survey, 2024

  12. [12]

    Enhancing energy efficiency in ai: A multi-faceted analysis across time series, semantic ai and deep learning domains

Lejla Begic Fazlic, Berkay Cetkin, Achim Guldner, Matthias Dziubany, Julian Heinen, Stefan Naumann, and Guido Dartmann. Enhancing energy efficiency in ai: A multi-faceted analysis across time series, semantic ai and deep learning domains. In Environmental Informatics, pages 237–256. Springer, 2024

  13. [13]

The carbon footprint of machine learning training will plateau, then shrink. Computer, 55(7):18–28, 2022

    David Patterson, Joseph Gonzalez, Urs Hölzle, Quoc Le, Chen Liang, Lluis-Miquel Munguia, Daniel Rothchild, David R So, Maud Texier, and Jeff Dean. The carbon footprint of machine learning training will plateau, then shrink. Computer, 55(7):18–28, 2022

  14. [14]

Dobi-svd: Differentiable svd for llm compression and some new perspectives. arXiv preprint arXiv:2502.02723, 2025

    Qinsi Wang, Jinghan Ke, Masayoshi Tomizuka, Yiran Chen, Kurt Keutzer, and Chenfeng Xu. Dobi-svd: Differentiable svd for llm compression and some new perspectives. arXiv preprint arXiv:2502.02723, 2025

  15. [15]

Basis sharing: Cross-layer parameter sharing for large language model compression. arXiv preprint arXiv:2410.03765, 2024

    Jingcun Wang, Yu-Guang Chen, Ing-Chao Lin, Bing Li, and Grace Li Zhang. Basis sharing: Cross-layer parameter sharing for large language model compression. arXiv preprint arXiv:2410.03765, 2024

  16. [16]

    GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers

Elias Frantar, Saleh Ashkboos, Torsten Hoefler, and Dan Alistarh. Gptq: Accurate post-training quantization for generative pre-trained transformers. arXiv preprint arXiv:2210.17323, 2022

  17. [17]

Awq: Activation-aware weight quantization for on-device llm compression and acceleration. Proceedings of Machine Learning and Systems, 6:87–100, 2024

    Ji Lin, Jiaming Tang, Haotian Tang, Shang Yang, Wei-Ming Chen, Wei-Chen Wang, Guangxuan Xiao, Xingyu Dang, Chuang Gan, and Song Han. Awq: Activation-aware weight quantization for on-device llm compression and acceleration. Proceedings of Machine Learning and Systems, 6:87–100, 2024

  18. [18]

Billm: Pushing the limit of post-training quantization for llms. arXiv preprint arXiv:2402.04291, 2024

    Wei Huang, Yangdong Liu, Haotong Qin, Ying Li, Shiming Zhang, Xianglong Liu, Michele Magno, and Xiaojuan Qi. Billm: Pushing the limit of post-training quantization for llms. arXiv preprint arXiv:2402.04291, 2024

  19. [19]

Llm-pruner: On the structural pruning of large language models. Advances in Neural Information Processing Systems, 36:21702–21720, 2023

    Xinyin Ma, Gongfan Fang, and Xinchao Wang. Llm-pruner: On the structural pruning of large language models. Advances in Neural Information Processing Systems, 36:21702–21720, 2023

  20. [20]

    Sparsegpt: Massive language models can be accurately pruned in one-shot

Elias Frantar and Dan Alistarh. Sparsegpt: Massive language models can be accurately pruned in one-shot. In International Conference on Machine Learning, pages 10323–10337. PMLR, 2023

  21. [21]

A simple and effective pruning approach for large language models. arXiv preprint arXiv:2306.11695, 2023

    Mingjie Sun, Zhuang Liu, Anna Bair, and J Zico Kolter. A simple and effective pruning approach for large language models. arXiv preprint arXiv:2306.11695, 2023

  22. [22]

    Minillm: Knowledge distillation of large language models

Yuxian Gu, Li Dong, Furu Wei, and Minlie Huang. Minillm: Knowledge distillation of large language models. In The Twelfth International Conference on Learning Representations, 2024

  23. [23]

Survey on knowledge distillation for large language models: methods, evaluation, and application. ACM Transactions on Intelligent Systems and Technology, 16(6):1–27, 2025

    Chuanpeng Yang, Yao Zhu, Wang Lu, Yidong Wang, Qian Chen, Chenlong Gao, Bingjie Yan, and Yiqiang Chen. Survey on knowledge distillation for large language models: methods, evaluation, and application. ACM Transactions on Intelligent Systems and Technology, 16(6):1–27, 2025

  24. [24]

Qlora: Efficient finetuning of quantized llms. Advances in Neural Information Processing Systems, 36:10088–10115, 2023

    Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, and Luke Zettlemoyer. Qlora: Efficient finetuning of quantized llms. Advances in Neural Information Processing Systems, 36:10088–10115, 2023

  25. [25]

Polycystic kidney disease. Nature Reviews Disease Primers, 4(1):50, 2018

    Carsten Bergmann, Lisa M Guay-Woodford, Peter C Harris, Shigeo Horie, Dorien JM Peters, and Vicente E Torres. Polycystic kidney disease. Nature Reviews Disease Primers, 4(1):50, 2018

  26. [26]

Adabert: Task-adaptive bert compression with differentiable neural architecture search. arXiv preprint arXiv:2001.04246, 2020

    Daoyuan Chen, Yaliang Li, Minghui Qiu, Zhen Wang, Bofang Li, Bolin Ding, Hongbo Deng, Jun Huang, Wei Lin, and Jingren Zhou. Adabert: Task-adaptive bert compression with differentiable neural architecture search. arXiv preprint arXiv:2001.04246, 2020

  27. [27]

Asvd: Activation-aware singular value decomposition for compressing large language models

    Zhihang Yuan, Yuzhang Shang, Yue Song, Dawei Yang, Qiang Wu, Yan Yan, and Guangyu Sun. Asvd: Activation-aware singular value decomposition for compressing large language models. arXiv preprint arXiv:2312.05821, 2023

  28. [28]

Svd-llm: Truncation-aware singular value decomposition for large language model compression

    Xin Wang, Yu Zheng, Zhongwei Wan, and Mi Zhang. Svd-llm: Truncation-aware singular value decomposition for large language model compression. arXiv preprint arXiv:2403.07378, 2024

  29. [29]

Adasvd: Adaptive singular value decomposition for large language models. arXiv preprint arXiv:2502.01403, 2025

    Zhiteng Li, Mingyuan Xia, Jingyuan Zhang, Zheng Hui, Haotong Qin, Linghe Kong, Yulun Zhang, and Xiaokang Yang. Adasvd: Adaptive singular value decomposition for large language models. arXiv preprint arXiv:2502.01403, 2025

  30. [30]

    The approximation of one matrix by another of lower rank

    Carl Eckart and Gale Young. The approximation of one matrix by another of lower rank. Psychometrika, 1(3):211–218, 1936

  31. [31]

Matrix analysis and applied linear algebra

    Carl D Meyer. Matrix Analysis and Applied Linear Algebra. SIAM, 2023

  32. [32]

Saes-svd: Self-adaptive suppression of accumulated and local errors for svd-based llm compression. arXiv preprint arXiv:2602.03051, 2026

    Xing Hu, Dawei Yang, Yuan Cheng, Zhixuan Chen, and Zukang Xu. Saes-svd: Self-adaptive suppression of accumulated and local errors for svd-based llm compression. arXiv preprint arXiv:2602.03051, 2026

  33. [33]

    Svd-llm v2: Optimizing singular value truncation for large language model compression

Xin Wang, Samiul Alam, Zhongwei Wan, Hui Shen, and Mi Zhang. Svd-llm v2: Optimizing singular value truncation for large language model compression. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 4287–4296, 2025

  34. [34]

    SNIP: Single-shot Network Pruning based on Connection Sensitivity

Namhoon Lee, Thalaiyasingam Ajanthan, and Philip HS Torr. Snip: Single-shot network pruning based on connection sensitivity. arXiv preprint arXiv:1810.02340, 2018

  35. [35]

Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity. Journal of Machine Learning Research, 23(120):1–39, 2022

    William Fedus, Barret Zoph, and Noam Shazeer. Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity. Journal of Machine Learning Research, 23(120):1–39, 2022

  36. [36]

Layer-wise dynamic rank for compressing large language models. arXiv preprint arXiv:2509.25622, 2025

    Zhendong Mi, Bian Sun, Grace Li Zhang, and Shaoyi Huang. Layer-wise dynamic rank for compressing large language models. arXiv preprint arXiv:2509.25622, 2025

  37. [37]

    Gqa: Training generalized multi-query transformer models from multi-head checkpoints

Joshua Ainslie, James Lee-Thorp, Michiel De Jong, Yury Zemlyanskiy, Federico Lebrón, and Sumit Sanghai. Gqa: Training generalized multi-query transformer models from multi-head checkpoints. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 4895–4901, 2023

  38. [38]

Binyuan Hui, Jian Yang, Zeyu Cui, Jiaxi Yang, Dayiheng Liu, Lei Zhang, Tianyu Liu, Jiajun Zhang, Bowen Yu, Keming Lu, et al. Qwen2.5-coder technical report. arXiv preprint arXiv:2409.12186, 2024

  39. [39]

    Pointer sentinel mixture models, 2016

    Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher. Pointer sentinel mixture models, 2016

  40. [40]

Building a large annotated corpus of english: The penn treebank. Computational Linguistics, 19(2):313–330, 1993

    Mitch Marcus, Beatrice Santorini, and Mary Ann Marcinkiewicz. Building a large annotated corpus of english: The penn treebank. Computational Linguistics, 19(2):313–330, 1993

  41. [41]

Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21(140):1–67, 2020

    Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21(140):1–67, 2020

  42. [42]

    Can a suit of armor conduct electricity? a new dataset for open book question answering

Todor Mihaylov, Peter Clark, Tushar Khot, and Ashish Sabharwal. Can a suit of armor conduct electricity? a new dataset for open book question answering. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2381–2391, 2018

  43. [43]

    Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge

Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. Think you have solved question answering? try arc, the ai2 reasoning challenge. arXiv preprint arXiv:1803.05457, 2018

  44. [44]

Winogrande: An adversarial winograd schema challenge at scale. Communications of the ACM, 64(9):99–106, 2021

    Keisuke Sakaguchi, Ronan Le Bras, Chandra Bhagavatula, and Yejin Choi. Winogrande: An adversarial winograd schema challenge at scale. Communications of the ACM, 64(9):99–106, 2021

  45. [45]

Hellaswag: Can a machine really finish your sentence? In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4791–4800, 2019

    Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. Hellaswag: Can a machine really finish your sentence? In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4791–4800, 2019

  46. [46]

Piqa: Reasoning about physical commonsense in natural language

    Yonatan Bisk, Rowan Zellers, Jianfeng Gao, Yejin Choi, et al. Piqa: Reasoning about physical commonsense in natural language. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 7432–7439, 2020

  47. [47]

    Mathqa: Towards interpretable math word problem solving with operation-based formalisms

    Aida Amini, Saadia Gabriel, Shanchuan Lin, Rik Koncel-Kedziorski, Yejin Choi, and Hannaneh Hajishirzi. Mathqa: Towards interpretable math word problem solving with operation-based formalisms. InProceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and ...

  48. [48]

    Leo Gao, Jonathan Tow, Baber Abbasi, Stella Biderman, Sid Black, Anthony DiPofi, Charles Foster, Laurence Golding, Jeffrey Hsu, Alain Le Noac'h, Haonan Li, Kyle McDonell, Niklas Muennighoff, Chris Ociepa, Jason Phang, Laria Reynolds, Hailey Schoelkopf, Aviya Skowron, Lintang Sutawika, Eric Tang, Anish Thite, Ben Wang, Kevin Wang, and Andy Zou. The language model evaluation harness, 07 2024

  49. [49]

    Yen-Chang Hsu, Ting Hua, Sungen Chang, Qian Lou, Yilin Shen, and Hongxia Jin. Language model compression with weighted low-rank factorization. arXiv preprint arXiv:2207.00112, 2022

  50. [50]

    Xunyu Zhu, Jian Li, Yong Liu, Can Ma, and Weiping Wang. A survey on model compression for large language models. Transactions of the Association for Computational Linguistics, 12:1556–1577, 2024

  51. [51]

    Canwen Xu and Julian McAuley. A survey on model compression and acceleration for pretrained language models. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 37, pages 10566–10575, 2023

  52. [52]

    Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015

  53. [53]

    Yuxian Gu, Li Dong, Furu Wei, and Minlie Huang. MiniLLM: Knowledge distillation of large language models. arXiv preprint arXiv:2306.08543, 2023

  54. [54]

    Xiaohan Xu, Ming Li, Chongyang Tao, Tao Shen, Reynold Cheng, Jinyang Li, Can Xu, Dacheng Tao, and Tianyi Zhou. A survey on knowledge distillation of large language models. arXiv preprint arXiv:2402.13116, 2024

  55. [55]

    Geonhwa Jeong, Po-An Tsai, Abhimanyu R Bambhaniya, Stephen W Keckler, and Tushar Krishna. Enabling unstructured sparse acceleration on structured sparse accelerators. Proceedings of Machine Learning and Systems, 7, 2025

  56. [56]

    Shail Dave, Riyadh Baghdadi, Tony Nowatzki, Sasikanth Avancha, Aviral Shrivastava, and Baoxin Li. Hardware acceleration of sparse and irregular tensor computations of ML models: A survey and insights. Proceedings of the IEEE, 109(10):1706–1752, 2021

  57. [57]

    Saleh Ashkboos, Maximilian L Croci, Marcelo Gennari do Nascimento, Torsten Hoefler, and James Hensman. SliceGPT: Compress large language models by deleting rows and columns. arXiv preprint arXiv:2401.15024, 2024

  58. [58]

    Jialong Guo, Xinghao Chen, Yehui Tang, and Yunhe Wang. SlimLLM: Accurate structured pruning for large language models. arXiv preprint arXiv:2505.22689, 2025

  59. [59]

    Guangxuan Xiao, Ji Lin, Mickael Seznec, Hao Wu, Julien Demouth, and Song Han. SmoothQuant: Accurate and efficient post-training quantization for large language models. In International Conference on Machine Learning, pages 38087–38099. PMLR, 2023

  60. [60]

    Zhewei Yao, Reza Yazdani Aminabadi, Minjia Zhang, Xiaoxia Wu, Conglong Li, and Yuxiong He. ZeroQuant: Efficient and affordable post-training quantization for large-scale transformers. Advances in Neural Information Processing Systems, 35:27168–27183, 2022

  61. [61]

    Zhen Li, Yupeng Su, Runming Yang, Congkai Xie, Zheng Wang, Zhongwei Xie, Ngai Wong, and Hongxia Yang. Quantization meets reasoning: Exploring LLM low-bit quantization degradation for mathematical reasoning. arXiv preprint arXiv:2501.03035, 2025

  62. [62]

    Dmitry Lepikhin, HyoukJoong Lee, Yuanzhong Xu, Dehao Chen, Orhan Firat, Yanping Huang, Maxim Krikun, Noam Shazeer, and Zhifeng Chen. GShard: Scaling giant models with conditional computation and automatic sharding. arXiv preprint arXiv:2006.16668, 2020

  63. [63]

    Weilin Cai, Juyong Jiang, Fan Wang, Jing Tang, Sunghun Kim, and Jiayi Huang. A survey on mixture of experts. Authorea Preprints, 2024

  64. [64]

    Albert Q Jiang, Alexandre Sablayrolles, Antoine Roux, Arthur Mensch, Blanche Savary, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Emma Bou Hanna, Florian Bressand, et al. Mixtral of Experts. arXiv preprint arXiv:2401.04088, 2024

  65. [65]

    Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, et al. DeepSeek-V3 technical report. arXiv preprint arXiv:2412.19437, 2024

  66. [66]

    Zhengyan Zhang, Yankai Lin, Zhiyuan Liu, Peng Li, Maosong Sun, and Jie Zhou. MoEfication: Transformer feed-forward layers are mixtures of experts. In Findings of the Association for Computational Linguistics: ACL 2022, pages 877–890, 2022

A Related Works

Large Language Model Compression. Large language model compression has been widely studied to reduce...