IO-SVD: Input-Output Whitened SVD for Adaptive-Rank LLM Compression

Ali Abbasi; Chayne Thrash; Hamed Pirsiavash; Haoran Qin; Soheil Kolouri

arxiv: 2605.15626 · v1 · pith:2CURW7NInew · submitted 2026-05-15 · 💻 cs.LG

Pith reviewed 2026-05-20 21:10 UTC · model grok-4.3

classification 💻 cs.LG

keywords LLM compression · SVD · low-rank approximation · post-training compression · adaptive rank allocation · KL divergence · model quantization

The pith

IO-SVD compresses LLMs by whitening both input activations and output prediction sensitivity to limit accuracy loss.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Large language models remain expensive to store and run, so post-training methods that shrink their weight matrices without retraining are valuable. IO-SVD creates a whitening space that incorporates both the statistics of typical inputs and a measure of how weight changes affect the model's final token predictions. The output measure comes from expanding the KL divergence to second order over the most probable tokens. A separate step then assigns different compression levels to different singular components by estimating their individual contribution to loss under a fixed total budget. If the approach works, models can be made substantially smaller while still producing nearly the same answers on downstream tasks and running faster at inference time.

Core claim

IO-SVD forms a KL-aware double-sided whitening space for model weights. Using a second-order expansion of the KL loss over the top-K token probabilities, it constructs an output-side metric that captures predictive sensitivity, while input whitening captures activation statistics. It further introduces an efficient heterogeneous rank-allocation strategy that scores whitened singular components using first-order calibration loss estimates and prunes the least sensitive components under a global budget. The same sensitivity estimates also guide loss-aware remapping when combining the low-rank factors with 8-bit quantization.
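
As a rough illustration of the double-sided whitening idea, the NumPy sketch below whitens a weight matrix on both sides, truncates its SVD, and maps the result back. It is not the authors' implementation. The function names are hypothetical, `R` is an input second-moment matrix built from random activations, and `C` is a random positive-definite stand-in for the KL-derived output metric, whose construction this page does not spell out. The objective follows the weighted Frobenius form quoted further down this page as Eq. 4.

```python
import numpy as np

def sym_sqrt_and_inv(M, eps=1e-8):
    """Symmetric square root and inverse square root of an SPD matrix."""
    vals, vecs = np.linalg.eigh(M)
    vals = np.clip(vals, eps, None)          # guard against tiny eigenvalues
    sqrt = (vecs * np.sqrt(vals)) @ vecs.T
    inv_sqrt = (vecs / np.sqrt(vals)) @ vecs.T
    return sqrt, inv_sqrt

def whitened_low_rank(W, C, R, rank):
    """Rank-limited W_hat minimizing ||C^{1/2} (W - W_hat) R^{1/2}||_F."""
    C_half, C_inv_half = sym_sqrt_and_inv(C)
    R_half, R_inv_half = sym_sqrt_and_inv(R)
    B = C_half @ W @ R_half                  # whitened matrix
    U, s, Vt = np.linalg.svd(B, full_matrices=False)
    B_k = (U[:, :rank] * s[:rank]) @ Vt[:rank]
    return C_inv_half @ B_k @ R_inv_half     # map back to weight space

def weighted_error(W, W_hat, C, R):
    """Half the squared weighted Frobenius error between W and W_hat."""
    C_half, _ = sym_sqrt_and_inv(C)
    R_half, _ = sym_sqrt_and_inv(R)
    return 0.5 * np.linalg.norm(C_half @ (W - W_hat) @ R_half, "fro") ** 2

rng = np.random.default_rng(0)
d_out, d_in, rank = 12, 16, 4
W = rng.standard_normal((d_out, d_in))

# Toy activations with uneven per-channel scale, so whitening matters.
X = rng.standard_normal((d_in, 200)) * np.linspace(0.1, 3.0, d_in)[:, None]
R = X @ X.T / X.shape[1]

# Stand-in output metric (assumption): any SPD matrix works for the sketch.
G = rng.standard_normal((d_out, 200)) * np.linspace(0.1, 3.0, d_out)[:, None]
C = G @ G.T / G.shape[1]

W_io = whitened_low_rank(W, C, R, rank)

# Plain truncated SVD of W at the same rank, for comparison.
U, s, Vt = np.linalg.svd(W, full_matrices=False)
W_plain = (U[:, :rank] * s[:rank]) @ Vt[:rank]

err_io = weighted_error(W, W_io, C, R)
err_plain = weighted_error(W, W_plain, C, R)
print(f"weighted error, whitened SVD: {err_io:.4f}")
print(f"weighted error, plain SVD:    {err_plain:.4f}")
```

By the Eckart-Young theorem applied in the whitened space, the whitened solution can never have a larger weighted error than plain truncation at the same rank. Whether that weighted error tracks real model loss is a separate question that depends on how well `C` and `R` are estimated.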

What carries the argument

The KL-aware double-sided whitening space that combines input activation statistics with an output metric derived from second-order KL expansion over top token probabilities.

If this is right

Models retain higher task performance at the same compression ratio compared with input-only whitening.
Inference speed increases because the resulting low-rank matrices require fewer operations during forward passes.
Hybrid low-rank plus quantization achieves better quality by using the predicted loss change to decide which low-rank factor rows to quantize to 8 bits.
The same construction applies to both pure language models and vision-language models with only minor changes to the calibration data.
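
The hybrid low-rank plus quantization point above can be pictured with a small greedy selection, loosely following the steps listed in the Figure 2 caption. This is a sketch under stated assumptions, not the paper's procedure: `remap_rows` and `quantize_row_int8` are hypothetical names, the per-row symmetric int8 scheme is an assumption, the gradient is a random placeholder for a calibration-loss gradient, and a plain byte budget stands in for the caption's Crem target.

```python
import numpy as np

def quantize_row_int8(row):
    """Symmetric per-row int8 round trip; returns the dequantized row."""
    scale = np.max(np.abs(row)) / 127.0
    if scale == 0.0:
        return row.copy()
    q = np.clip(np.round(row / scale), -127, 127)
    return q * scale

def remap_rows(factor, grad, byte_budget):
    """Move the least loss-sensitive rows to int8 until the budget is met."""
    n_rows, n_cols = factor.shape
    dequant = np.stack([quantize_row_int8(r) for r in factor])
    # First-order loss change per row: |<grad_row, quantization_error_row>|.
    scores = np.abs(np.sum(grad * (dequant - factor), axis=1))

    fp16_row_bytes = 2 * n_cols
    int8_row_bytes = n_cols + 2          # int8 payload plus one fp16 scale
    total = n_rows * fp16_row_bytes      # start with every row in fp16
    use_int8 = np.zeros(n_rows, dtype=bool)
    for idx in np.argsort(scores):       # lowest predicted loss change first
        if total <= byte_budget:
            break
        use_int8[idx] = True
        total -= fp16_row_bytes - int8_row_bytes
    return use_int8, scores, total

rng = np.random.default_rng(0)
factor = rng.standard_normal((10, 64))   # toy low-rank factor
grad = rng.standard_normal((10, 64))     # placeholder calibration gradient
use_int8, scores, total_bytes = remap_rows(factor, grad, byte_budget=960)
print(int(use_int8.sum()), "rows in int8,", total_bytes, "bytes")
```

Rows left unmarked stay in fp16. If the budget cannot be reached even with every row in int8, the function returns with all rows marked and a total that still exceeds the budget, which a caller would need to check.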

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The whitening construction could be tested with other divergence measures or with full-sequence losses instead of top-K tokens.
The global budget for rank allocation might be replaced by per-layer hardware constraints to meet specific latency targets.
Sensitivity scores computed once could be reused to guide further compression steps such as pruning or distillation.

Load-bearing premise

The second-order expansion of the KL loss over the top-K token probabilities accurately captures predictive sensitivity for the output-side metric.
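
One way to get a feel for this premise is to compare the exact KL divergence with its quadratic approximation on a toy logit vector. The sketch below uses the form quoted later on this page as Eq. 2, with H = Diag(p) − p pᵀ. The vocabulary size, perturbation direction, and scales are arbitrary choices for illustration, and the top-K restriction used by the paper is not modeled here.

```python
import numpy as np

def softmax(z):
    z = z - z.max()                      # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

def kl(p, q):
    """KL(p || q) for strictly positive probability vectors."""
    return float(np.sum(p * (np.log(p) - np.log(q))))

def quadratic_kl(p, dz):
    """Second-order estimate 0.5 * dz^T (Diag(p) - p p^T) dz."""
    H = np.diag(p) - np.outer(p, p)
    return 0.5 * float(dz @ H @ dz)

rng = np.random.default_rng(0)
z = rng.standard_normal(8)               # toy logits
p = softmax(z)
direction = rng.standard_normal(8)       # arbitrary perturbation direction

rows = []
for scale in (0.01, 0.1, 1.0, 3.0):
    dz = scale * direction
    exact = kl(p, softmax(z + dz))
    approx = quadratic_kl(p, dz)
    rel_err = abs(exact - approx) / exact
    rows.append((scale, exact, approx, rel_err))
    print(f"scale={scale:<5} exact={exact:.6f} approx={approx:.6f} rel_err={rel_err:.3f}")
```

For small logit shifts the two quantities should agree closely. How quickly they drift apart at larger shifts is exactly the question the premise turns on.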

What would settle it

Apply the chosen ranks and factors to a held-out calibration set, then measure the actual change in KL divergence or downstream accuracy and check whether it matches the first-order and second-order estimates used to select the components.
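
A toy version of this check is sketched below: it compares the first-order estimate of the loss change from dropping each singular component with the measured change. Everything here is an illustrative assumption, including the quadratic regression loss standing in for a calibration loss, the random data, and the use of a plain (unwhitened) SVD. Real experiments would use the model's calibration loss and the paper's whitened components.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, n = 6, 8, 256
W = rng.standard_normal((d_out, d_in))
X = rng.standard_normal((d_in, n))
# Toy targets from a perturbed weight, so the gradient at W is non-zero.
Y = (W + 0.3 * rng.standard_normal((d_out, d_in))) @ X

def loss(W_):
    """Mean over samples of half the squared output error."""
    return 0.5 * np.mean(np.sum((W_ @ X - Y) ** 2, axis=0))

G = (W @ X - Y) @ X.T / n                 # gradient of loss with respect to W

U, s, Vt = np.linalg.svd(W, full_matrices=False)
predicted, actual = [], []
for i in range(len(s)):
    D = -s[i] * np.outer(U[:, i], Vt[i])  # weight change from dropping component i
    predicted.append(float(np.sum(G * D)))        # first-order estimate <G, D>
    actual.append(float(loss(W + D) - loss(W)))   # measured change
predicted = np.array(predicted)
actual = np.array(actual)
gap = actual - predicted                  # what the first-order estimate misses

for i in range(len(s)):
    print(f"component {i}: predicted={predicted[i]:+.4f} actual={actual[i]:+.4f}")
```

For this quadratic toy loss the gap is exactly the second-order term, 0.5 · σᵢ² · mean((vᵢᵀx)²), so it grows with the size of the dropped singular value. In a real model the gap also contains higher-order terms, which is why the measurement is worth doing.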

Figures

Figures reproduced from arXiv: 2605.15626 by Ali Abbasi, Chayne Thrash, Hamed Pirsiavash, Haoran Qin, Soheil Kolouri.

**Figure 1.** Overview of IO-SVD. (a) Comparison of whitening strategies: standard SVD reconstructs the weight directly, one-sided whitening incorporates only input activation statistics, and double-sided whitening incorporates both input statistics and output-side sensitivity before SVD. (b) Heterogeneous rank allocation. For each whitened matrix B, singular components are sorted by singular-value magnitude, and the sm…

**Figure 2.** Loss-aware remapping: (a) SVD-truncate each weight to rank k; (b) score factor rows by first-order calibration-loss change under int8 quantization; (c) greedily keep low-score rows in int8 until meeting Crem; (d) assign the remaining rows to fp16.

§3.2 Adaptive rank allocation. The SVD solution above assumes fixed per-layer ranks. Under a global compression budget, we instead allocate ranks by estimating the…

**Figure 3.** Top-K ablation for output-side KL curvature. Normalized perplexity on wiki2, C4, PTB

**Figure 4.** Throughput vs. peak memory on LLaMA-2-7B (batch 64, seq 1024+1024).

Peak GPU memory. The dense baseline consumes 77.6 GB, dominated by a 64.0 GB KV cache and a 12.6 GB weight tensor. IO-SVD without cache optimization shrinks the weight footprint to 5.4 GB but leaves the KV cache untouched at 64.0 GB, giving a 70.4 GB peak. Adding V-cache compression reduces the cache to 38.8 GB and yields a 50.3 GB peak,…
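
The 64.0 GB KV-cache and 12.6 GB weight figures for the dense baseline can be sanity-checked with back-of-the-envelope arithmetic. The sketch below assumes fp16 storage for both cache and weights, the public LLaMA-2-7B configuration (32 layers, 32 attention heads, head dimension 128), and a parameter count of roughly 6.74 billion. Under these assumptions the numbers match if "GB" in the caption is read as binary gigabytes (GiB).

```python
# LLaMA-2-7B architecture constants (public model configuration).
n_layers, n_heads, head_dim = 32, 32, 128
batch, seq_len = 64, 1024 + 1024      # prompt + generated tokens, as in the caption
bytes_per_value = 2                   # assumption: fp16 cache entries and weights

# Keys and values: one vector of n_heads * head_dim per token, layer, and sequence.
kv_bytes = 2 * n_layers * n_heads * head_dim * seq_len * batch * bytes_per_value
kv_gib = kv_bytes / 2**30

approx_params = 6.74e9                # assumption: approximate parameter count
weights_gib = approx_params * bytes_per_value / 2**30

print(f"KV cache: {kv_gib:.1f} GiB")
print(f"Weights:  {weights_gib:.1f} GiB")
```

At this batch size and sequence length the cache is about five times larger than the weights, which is consistent with the caption's observation that weight compression alone moves the peak only modestly.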

Original abstract

Large language models deliver strong performance across language and reasoning tasks, but their storage and compute costs remain major barriers to deployment in resource-constrained and latency-sensitive settings. SVD-based post-training compression offers a hardware-agnostic way to reduce model size and improve inference efficiency through low-rank factorization. However, existing methods often rely on input-only whitening spaces, homogeneous rank allocation, or loss-agnostic allocation heuristics, limiting their ability to preserve model quality under aggressive compression. We propose Input-Output Whitened SVD (IO-SVD), a post-training compression method that forms a KL-aware double-sided whitening space for model weights. Using a second-order expansion of the KL loss over the top-K token probabilities, IO-SVD constructs an output-side metric that captures predictive sensitivity, while input whitening captures activation statistics. We further introduce an efficient heterogeneous rank-allocation strategy that scores whitened singular components using first-order calibration loss estimates and prunes the least sensitive components under a global budget. Inspired by prior work that combines SVD truncation with quantization, we improve hybrid SVD-quantization compression through loss-aware remapping, which selects low-rank factor rows for 8-bit quantization based on the predicted loss change incurred by quantizing them. Extensive experiments across diverse LLM and VLM families, together with inference-time analysis, show that IO-SVD compresses LLMs with minimal performance degradation while delivering practical inference speedups. Code is available at https://github.com/mint-vu/IO-SVD.git

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

IO-SVD adds output-side KL sensitivity to SVD compression and heterogeneous rank allocation, but the second-order approximation is the part that needs the most scrutiny under real compression ratios.

IO-SVD builds a double-sided whitening space for SVD-based LLM compression: input side uses activation statistics as usual, while the output side uses a second-order Taylor expansion of KL divergence over top-K token probabilities to score predictive sensitivity. It then scores whitened components with first-order calibration loss estimates and prunes under a global rank budget, plus a loss-aware remapping step when mixing with 8-bit quantization. That combination of output metric and heterogeneous allocation is the concrete step beyond prior input-only or uniform-rank SVD work. The experiments run across several LLM and VLM families and include inference timing, which is useful for anyone who actually ships compressed models. Code release helps too.

The soft spot is exactly the one the stress-test flags. The second-order KL expansion assumes the low-rank perturbation stays small enough that higher-order terms can be ignored. At the compression ratios the paper targets, singular-value truncation can shift output distributions more than that, so the sensitivity scores used for both whitening and pruning may not track actual loss as cleanly as claimed. Without seeing the full derivation, error bounds, or ablations that compare the approximation to measured loss change, it is hard to know how much this affects the final numbers. The central claim of minimal degradation therefore rests on the experiments holding up rather than on a proven bound.

This paper is for people who work on practical post-training compression and want a hardware-agnostic SVD baseline that tries to stay loss-aware. A reader already following SVD-quantization hybrids would get the most out of the specific metric and allocation details. It deserves a serious referee because the idea is a direct, testable extension of existing techniques and the empirical scope is wide enough to produce useful feedback even if the approximation needs tightening.

Referee Report

2 major / 2 minor

Summary. The paper introduces IO-SVD, a post-training SVD-based compression technique for LLMs and VLMs. It constructs a double-sided whitening space by combining input-side activation statistics with an output-side metric derived from a second-order Taylor expansion of the KL divergence over top-K token probabilities. This metric guides heterogeneous rank allocation via first-order calibration loss estimates under a global budget, and the method is extended to loss-aware remapping for hybrid SVD-quantization. Experiments across model families report minimal degradation alongside inference speedups, with code released.

Significance. If the core approximations and empirical results hold, IO-SVD would offer a hardware-agnostic, loss-aware approach to adaptive-rank compression that improves upon input-only or homogeneous baselines. The public code and cross-family evaluation strengthen reproducibility and practical utility for deployment.

major comments (2)

[IO-SVD construction] IO-SVD construction section: the second-order expansion of KL divergence over top-K probabilities is used to define the output-side whitening metric and sensitivity scores for rank allocation. No explicit bound or empirical check is provided showing that higher-order terms remain negligible under the target compression ratios, where large singular-value truncation can produce non-local output shifts. This approximation is load-bearing for the central claim of minimal degradation.
[rank allocation] Heterogeneous rank-allocation paragraph: first-order calibration loss estimates are computed in the whitened space to prune components. If these estimates reuse the same calibration data or whitening transform that defines the metric, the procedure risks circularity; an explicit statement of data separation or a reduction showing the estimates are independent of the fitted parameters is needed to support the allocation strategy.

minor comments (2)

[Abstract] Abstract: the phrase 'minimal performance degradation' is repeated without quantitative qualifiers; adding a brief range of reported perplexity or accuracy drops would improve precision.
Notation: the input and output whitening matrices are introduced without an explicit equation linking them to the final low-rank factors; a single displayed equation would clarify the double-sided construction.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on the IO-SVD method, particularly regarding the validity of the second-order KL approximation and the independence of the rank-allocation estimates. We address each major comment below and will revise the manuscript to incorporate additional validation and clarifications.

Referee: [IO-SVD construction] IO-SVD construction section: the second-order expansion of KL divergence over top-K probabilities is used to define the output-side whitening metric and sensitivity scores for rank allocation. No explicit bound or empirical check is provided showing that higher-order terms remain negligible under the target compression ratios, where large singular-value truncation can produce non-local output shifts. This approximation is load-bearing for the central claim of minimal degradation.

Authors: We acknowledge the importance of validating the second-order Taylor expansion of the KL divergence. In the revised manuscript, we will add an empirical section that compares the approximated output-side metric against the exact KL divergence computed on a held-out calibration set for compression ratios ranging from 2x to 4x. We will also include a brief analysis referencing approximation bounds from loss landscape literature to discuss when higher-order terms remain small, thereby supporting the claim of minimal degradation under the evaluated settings. revision: yes
Referee: [rank allocation] Heterogeneous rank-allocation paragraph: first-order calibration loss estimates are computed in the whitened space to prune components. If these estimates reuse the same calibration data or whitening transform that defines the metric, the procedure risks circularity; an explicit statement of data separation or a reduction showing the estimates are independent of the fitted parameters is needed to support the allocation strategy.

Authors: We agree that explicit separation is necessary to avoid any appearance of circularity. The input whitening transform is computed solely from activation statistics on a first calibration subset, while the first-order loss estimates for heterogeneous rank allocation are performed on a disjoint second calibration subset that does not influence the whitening matrix. In the revision, we will add an explicit statement of this data separation protocol along with a short empirical check confirming that the sensitivity scores remain stable when the whitening transform is held fixed from the first subset. revision: yes

Circularity Check

0 steps flagged

No significant circularity in derivation chain

The paper constructs the output-side metric via an explicit second-order Taylor expansion of KL divergence over top-K probabilities (abstract and IO-SVD construction), which is an independent approximation rather than a self-definition or fitted input renamed as prediction. Heterogeneous rank allocation scores components using first-order calibration loss estimates on separate calibration data, a standard post-training technique that does not reduce by construction to the whitening space or target performance metrics. No load-bearing self-citations, uniqueness theorems imported from authors, or ansatzes smuggled via prior work are present in the core steps; the hybrid SVD-quantization remapping similarly relies on predicted loss change computed from the same expansion without circular re-use of fitted values. The derivation remains self-contained against external benchmarks and does not equate outputs to inputs by definition.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The method rests on the assumption that a second-order KL expansion provides a reliable sensitivity metric and that first-order loss estimates suffice for pruning decisions under a global budget.

free parameters (1)

global rank budget
Controls total compression ratio; chosen to meet target size while minimizing predicted loss.
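
To make the role of this budget concrete, the sketch below pools component scores across layers and drops the least sensitive ones until a parameter budget is met. It is a simplified illustration, not the paper's allocation rule: `allocate_ranks` is a hypothetical name, the scores are made-up numbers standing in for first-order loss estimates, and each kept component of an m×n layer is assumed to cost m + n parameters.

```python
import numpy as np

def allocate_ranks(layer_shapes, layer_scores, param_budget):
    """Greedy global pruning. Returns a keep-mask per layer and the final cost."""
    keep = [np.ones(len(scores), dtype=bool) for scores in layer_scores]
    cost = [m + n for (m, n) in layer_shapes]     # params per kept component
    total = sum(c * len(scores) for c, scores in zip(cost, layer_scores))

    # Pool every component from every layer and sort by sensitivity score.
    candidates = sorted(
        (float(score), layer, comp)
        for layer, scores in enumerate(layer_scores)
        for comp, score in enumerate(scores)
    )
    for _, layer, comp in candidates:             # least sensitive first
        if total <= param_budget:
            break
        keep[layer][comp] = False
        total -= cost[layer]
    return keep, total

layer_shapes = [(8, 8), (8, 16)]
layer_scores = [
    np.array([5.0, 0.3, 0.2, 0.1]),   # layer 0: one dominant component
    np.array([4.0, 2.0, 1.0, 0.05]),  # layer 1: sensitivity spread over three
]
keep, total_params = allocate_ranks(layer_shapes, layer_scores, param_budget=100)
ranks = [int(mask.sum()) for mask in keep]
print("ranks per layer:", ranks, "| parameters used:", total_params)
```

With these toy scores the two layers end up at different ranks under one shared budget, which is the heterogeneous behavior the ledger entry refers to. A per-parameter normalization of the scores would be a natural variation when layer shapes differ a lot.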

axioms (1)

domain assumption: Second-order Taylor expansion of KL divergence over top-K token probabilities approximates output sensitivity
Invoked to construct the output-side whitening metric without full loss computation.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

Paper passage: "Using a second-order expansion of the KL loss over the top-K token probabilities, IO-SVD constructs an output-side metric..." $\Delta J_\ell \approx \tfrac{1}{2}\,\lVert C_\ell^{1/2}(W_\ell - \hat{W}_\ell)\,R_\ell^{1/2} \rVert_F^2$ (abstract, §3.1, Eq. 4)

IndisputableMonolith/Foundation/AlphaCoordinateFixation.lean · costAlphaLog_high_calibrated_iff · unclear

Paper passage: $\mathrm{KL}\big(p_t \,\|\, \mathrm{softmax}(z_t + \delta z_t)\big) = \tfrac{1}{2}\,\delta z_t^\top H_t\,\delta z_t + O(\lVert \delta z_t \rVert^3)$, with $H_t = \mathrm{Diag}(p_t) - p_t p_t^\top$ (Eq. 2)

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.


Yilong Zhao, Chien-Yu Lin, Kan Zhu, Zihao Ye, Lequn Chen, Size Zheng, Luis Ceze, Arvind Krishnamurthy, Tianqi Chen, and Baris Kasikci. Atom: Low-bit quantization for efficient and accurate llm serving.Proceedings of Machine Learning and Systems, 6:196–209, 2024. 12 IO-SVD: Input-Output Whitened SVD for Adaptive-Rank LLM CompressionA PREPRINT Appendix A Ad...

work page 2024

[1] Ali Abbasi, Chayne Thrash, Haoran Qin, Shansita Sharma, Sepehr Seifi, and Soheil Kolouri. Zero sum svd: Balancing loss sensitivity for low rank llm compression, 2026. URL https://arxiv.org/abs/2602.02848

[2] Rishabh Agarwal, Nino Vieillard, Yongchao Zhou, Piotr Stanczyk, Sabela Ramos Garea, Matthieu Geist, and Olivier Bachem. On-policy distillation of language models: Learning from self-generated mistakes. In The Twelfth International Conference on Learning Representations, 2024

[3] Aida Amini, Saadia Gabriel, Shanchuan Lin, Rik Koncel-Kedziorski, Yejin Choi, and Hannaneh Hajishirzi. MathQA: Towards interpretable math word problem solving with operation-based formalisms. In Jill Burstein, Christy Doran, and Thamar Solorio, editors, Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Li... doi: 10.18653/v1/n19-1245

[4] Yonatan Bisk, Rowan Zellers, Ronan Le Bras, Jianfeng Gao, and Yejin Choi. PIQA: Reasoning about physical commonsense in natural language. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 7432–7439, 2020. doi: 10.1609/aaai.v34i05.6239

[5] Jerry Chee, Yaohui Cai, Volodymyr Kuleshov, and Christopher De Sa. QuIP: 2-bit quantization of large language models with guarantees. In Thirty-seventh Conference on Neural Information Processing Systems, 2023. URL https://openreview.net/forum?id=xrk9g5vcXR

[6] Viktoriia Chekalina, Daniil Moskovskiy, Daria Cherniuk, Maxim Kurkin, Andrey Kuznetsov, and Evgeny Frolov. Generalized fisher-weighted svd: Scalable kronecker-factored fisher approximation for compressing large language models, 2025. URL https://arxiv.org/abs/2505.17974

[7] Hongrong Cheng, Miao Zhang, and Javen Qinfeng Shi. A survey on deep neural network pruning: Taxonomy, comparison, analysis, and recommendations. IEEE Transactions on Pattern Analysis and Machine Intelligence, 46(12):10558–10578, 2024

[8] Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. Think you have solved question answering? Try ARC, the AI2 reasoning challenge, 2018. URL https://arxiv.org/abs/1803.05457

[9] Tim Dettmers, Mike Lewis, Younes Belkada, and Luke Zettlemoyer. Llm.int8(): 8-bit matrix multiplication for transformers at scale. In Proceedings of the 36th International Conference on Neural Information Processing Systems, NIPS ’22, Red Hook, NY, USA, 2022. Curran Associates Inc. ISBN 9781713871088

[10] Carl Eckart and Gale Young. The approximation of one matrix by another of lower rank. Psychometrika, 1(3):211–218, 1936

[11] Elias Frantar and Dan Alistarh. Sparsegpt: massive language models can be accurately pruned in one-shot. In Proceedings of the 40th International Conference on Machine Learning, ICML’23. JMLR.org, 2023

[12] Elias Frantar, Sidak Pal Singh, and Dan Alistarh. Optimal brain compression: a framework for accurate post-training quantization and pruning. In Proceedings of the 36th International Conference on Neural Information Processing Systems, NIPS ’22, Red Hook, NY, USA, 2022. Curran Associates Inc. ISBN 9781713871088

[13] Elias Frantar, Saleh Ashkboos, Torsten Hoefler, and Dan Alistarh. OPTQ: Accurate quantization for generative pre-trained transformers. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=tcbBPnfwxS

[14] Elias Frantar, Roberto L Castro, Jiale Chen, Torsten Hoefler, and Dan Alistarh. Marlin: Mixed-precision auto-regressive parallel inference on large language models. In Proceedings of the 30th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming, pages 239–251, 2025

[15] Yuxian Gu, Li Dong, Furu Wei, and Minlie Huang. Minillm: Knowledge distillation of large language models. In The Twelfth International Conference on Learning Representations, 2024

[16] Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015

[17] Yen-Chang Hsu, Ting Hua, Sungen Chang, Qian Lou, Yilin Shen, and Hongxia Jin. Language model compression with weighted low-rank factorization. In International Conference on Learning Representations, 2022. URL https://openreview.net/forum?id=uPv9Y3gmAI5

[18] Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. LoRA: Low-rank adaptation of large language models. In International Conference on Learning Representations, 2022. URL https://openreview.net/forum?id=nZeVKeeFYf9

[19] Bohao Li, Yuying Ge, Yixiao Ge, Guangzhi Wang, Rui Wang, Ruimao Zhang, and Ying Shan. Seed-bench: Benchmarking multimodal large language models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13299–13308, 2024

[20] Yuhang Li, Donghyun Lee, Ruokai Yin, and Priyadarshini Panda. Optimal brain decomposition for accurate llm low-rank approximation. arXiv preprint arXiv:2604.00821, 2026

[21] Yujun Lin, Haotian Tang, Shang Yang, Zhekai Zhang, Guangxuan Xiao, Chuang Gan, and Song Han. Qserve: W4a8kv4 quantization and system co-design for efficient llm serving. arXiv preprint arXiv:2405.04532, 2024

[22] Pan Lu, Swaroop Mishra, Tony Xia, Liang Qiu, Kai-Wei Chang, Song-Chun Zhu, Oyvind Tafjord, Peter Clark, and Ashwin Kalyan. Learn to explain: Multimodal reasoning via thought chains for science question answering. In The 36th Conference on Neural Information Processing Systems (NeurIPS), 2022

[23] Xinyin Ma, Gongfan Fang, and Xinchao Wang. LLM-Pruner: On the structural pruning of large language models. Advances in Neural Information Processing Systems, 36:21702–21720, 2023

[24] Mitchell P. Marcus, Beatrice Santorini, and Mary Ann Marcinkiewicz. Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics, 19(2):313–330, 1993. URL https://aclanthology.org/J93-2004/

[25] Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher. Pointer sentinel mixture models. In International Conference on Learning Representations, 2017. URL https://openreview.net/forum?id=Byj72udxe

[26] Todor Mihaylov, Peter Clark, Tushar Khot, and Ashish Sabharwal. Can a suit of armor conduct electricity? A new dataset for open book question answering. In Ellen Riloff, David Chiang, Julia Hockenmaier, and Jun’ichi Tsujii, editors, Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2381–2391, Brussels, Belgium, O... doi: 10.18653/v1/d18-1260

[27] Leon Mirsky. Symmetric gauge functions and unitarily invariant norms. The Quarterly Journal of Mathematics, 11(1):50–59, 1960

[28] Truong Nguyen, Phi Van Dat, Ngan Nguyen, Linh Ngo Van, Trung Le, and Thanh Hong Nguyen. Ctpd: Cross tokenizer preference distillation. arXiv preprint arXiv:2601.11865, 2026

[29] Gunho Park, Baeseong Park, Minsub Kim, Sungjae Lee, Jeonghoon Kim, Beomseok Kwon, Se Jung Kwon, Byeongwook Kim, Youngjoo Lee, and Dongsoo Lee. Lut-gemm: Quantized matrix multiplication based on luts for efficient inference in large-scale generative language models. arXiv preprint arXiv:2206.09557, 2022

[30] Wang Qinsi, Jinghan Ke, Masayoshi Tomizuka, Kurt Keutzer, and Chenfeng Xu. Dobi-SVD: Differentiable SVD for LLM compression and some new perspectives. In The Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=kws76i5XB8

[31] Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. J. Mach. Learn. Res., 21(1), January 2020. ISSN 1532-4435

[32] Jun Rao, Xv Meng, Liang Ding, Shuhan Qi, Xuebo Liu, Min Zhang, and Dacheng Tao. Parameter-efficient and student-friendly knowledge distillation. IEEE Transactions on Multimedia, 26:4230–4241, 2023

[33] Keisuke Sakaguchi, Ronan Le Bras, Chandra Bhagavatula, and Yejin Choi. Winogrande: an adversarial winograd schema challenge at scale. Commun. ACM, 64(9):99–106, August 2021. ISSN 0001-0782. doi: 10.1145/3474381. URL https://doi.org/10.1145/3474381

[34] Wenqi Shao, Mengzhao Chen, Zhaoyang Zhang, Peng Xu, Lirui Zhao, Zhiqian Li, Kaipeng Zhang, Peng Gao, Yu Qiao, and Ping Luo. Omniquant: Omnidirectionally calibrated quantization for large language models. In The Twelfth International Conference on Learning Representations

[35] Haebin Shin, Lei Ji, Xiao Liu, and Yeyun Gong. Overcoming vocabulary mismatch: Vocabulary-agnostic teacher guided language modeling. In Forty-second International Conference on Machine Learning

[36] Mingjie Sun, Zhuang Liu, Anna Bair, and J Zico Kolter. A simple and effective pruning approach for large language models. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=PxoFut3dWW

[37] Albert Tseng, Jerry Chee, Qingyao Sun, Volodymyr Kuleshov, and Christopher De Sa. Quip#: even better llm quantization with hadamard incoherence and lattice codebooks. In Proceedings of the 41st International Conference on Machine Learning, ICML’24. JMLR.org, 2024

[38] Albert Tseng, Zhaofeng Sun, and Christopher De Sa. Model-preserving adaptive rounding, 2025. URL https://arxiv.org/abs/2505.22988

[39] Haiyu Wang, Yutong Wang, Jack Jiang, and Sai Qian Zhang. Wsvd: Weighted low-rank approximation for fast and efficient execution of low-precision vision-language models. In The Fourteenth International Conference on Learning Representations,

[40] Shaowen Wang, Linxi Yu, and Jian Li. Lora-ga: Low-rank adaptation with gradient approximation. Advances in Neural Information Processing Systems, 37:54905–54931, 2024

[41] Xin Wang, Samiul Alam, Zhongwei Wan, Hui Shen, and Mi Zhang. SVD-LLM v2: Optimizing singular value truncation for large language model compression. In Luis Chiruzzo, Alan Ritter, and Lu Wang, editors, Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Vol...

[42] Xin Wang, Yu Zheng, Zhongwei Wan, and Mi Zhang. SVD-LLM: Truncation-aware singular value decomposition for large language model compression. In The Thirteenth International Conference on Learning Representations,

[43] URL https://openreview.net/forum?id=LNYIUouhdt

[44] Yutong Wang, Haiyu Wang, and Sai Qian Zhang. Qsvd: Efficient low-rank approximation for unified query-key-value weight compression in low-precision vision-language models. In The Thirty-ninth Annual Conference on Neural Information Processing Systems,

[45] Guangxuan Xiao, Ji Lin, Mickael Seznec, Hao Wu, Julien Demouth, and Song Han. Smoothquant: accurate and efficient post-training quantization for large language models. In Proceedings of the 40th International Conference on Machine Learning, ICML’23. JMLR.org, 2023

[46] Zhihang Yuan, Yuzhang Shang, Yue Song, Dawei Yang, Qiang Wu, Yan Yan, and Guangyu Sun. ASVD: Activation-aware singular value decomposition for compressing large language models, 2025. URL https://arxiv.org/abs/2312.05821

[47] Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. HellaSwag: Can a machine really finish your sentence? In Anna Korhonen, David Traum, and Lluís Màrquez, editors, Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4791–4800, Florence, Italy, July 2019. Association for Computational Linguisti... doi: 10.18653/v1/p19-1472

[48] Yilong Zhao, Chien-Yu Lin, Kan Zhu, Zihao Ye, Lequn Chen, Size Zheng, Luis Ceze, Arvind Krishnamurthy, Tianqi Chen, and Baris Kasikci. Atom: Low-bit quantization for efficient and accurate llm serving. Proceedings of Machine Learning and Systems, 6:196–209, 2024