OSCAR: Offline Spectral Covariance-Aware Rotation for 2-bit KV Cache Quantization
Pith reviewed 2026-05-20 12:12 UTC · model grok-4.3
The pith
OSCAR derives fixed rotations from offline covariance estimates to enable accurate 2-bit KV cache quantization for LLMs.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By estimating attention-aware covariance structures offline and deriving fixed rotations and clipping thresholds from them, OSCAR aligns 2-bit KV-cache quantization with the covariance structures that attention consumes. The paper offers both a theoretical justification and a fully deployable system whose INT2 attention kernel is compatible with paged KV-cache serving.
What carries the argument
The offline spectral covariance-aware rotation: fixed rotations, computed once from estimated covariance, that reduce outliers in a way aligned with attention patterns.
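As a rough illustration of what "a fixed rotation and clipping thresholds derived from offline covariance" can mean, the sketch below uses the eigenvectors of a plain second-moment matrix as the rotation and a per-channel quantile as the clip. This is an assumption made for illustration only: the paper's attention-aware covariance and its actual rotation construction are not described in this summary, and the function names and the `clip_quantile` parameter are hypothetical.

```python
import numpy as np

def fit_rotation_and_clip(calib, clip_quantile=0.99):
    """Offline step: derive a fixed rotation and per-channel clip thresholds.

    calib: (n_tokens, head_dim) calibration vectors, e.g. keys of one head.
    Assumption: a plain second-moment matrix stands in for the paper's
    attention-aware covariance, which this summary does not specify.
    """
    cov = calib.T @ calib / calib.shape[0]       # (d, d) second-moment matrix
    _, eigvecs = np.linalg.eigh(cov)             # orthonormal eigenvector columns
    rotated = calib @ eigvecs
    clip = np.quantile(np.abs(rotated), clip_quantile, axis=0)
    return eigvecs, np.maximum(clip, 1e-8)       # avoid zero-width ranges

def quantize_int2(x, rotation, clip):
    """Online step: rotate, clip, and map each value to one of 4 levels."""
    y = np.clip(x @ rotation, -clip, clip)
    step = 2.0 * clip / 3.0                      # 4 levels span [-clip, clip]
    return np.rint((y + clip) / step).astype(np.uint8)   # codes in {0, 1, 2, 3}

def dequantize_int2(codes, rotation, clip):
    """Map codes back to values and undo the orthogonal rotation."""
    step = 2.0 * clip / 3.0
    y = codes.astype(np.float64) * step - clip
    return y @ rotation.T
```

Because the rotation is orthogonal, applying the same matrix to queries and keys leaves their dot products unchanged, which is why a fixed rotation can in principle be folded into the attention computation instead of being undone per token.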
If this is right
- OSCAR reduces the BF16 accuracy gap to 3.78 points on Qwen3-4B-Thinking-2507 and 1.42 points on Qwen3-8B for INT2 KV cache.
- It scales to Qwen3-32B and GLM-4.7 (358B parameters), where it remains effectively on par with BF16.
- OSCAR remains robust on RULER-NIAH at context lengths up to 128K tokens, where naive-rotation INT2 collapses.
- KV-cache memory is reduced by approximately 8x, and throughput improves by up to 7x at large batch sizes under the same memory budget.
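The "approximately 8x" figure follows from the bit widths: BF16 stores 16 bits per value and INT2 stores 2. A small calculator makes the arithmetic explicit. The model dimensions below are hypothetical placeholders, not the configuration of any model in the paper, and the per-group scale overhead is an assumption about how an implementation might store metadata.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_tokens, bits_per_value,
                   group_size=None, scale_bits=16):
    """Bytes needed for keys plus values, with optional per-group scales."""
    n_values = 2 * n_layers * n_kv_heads * head_dim * n_tokens   # K and V
    total_bits = n_values * bits_per_value
    if group_size is not None:
        # Hypothetical metadata: one scale per group of quantized values.
        total_bits += (n_values // group_size) * scale_bits
    return total_bits / 8

# Hypothetical shape: 36 layers, 8 KV heads, head_dim 128, 32768 tokens.
shape = (36, 8, 128, 32768)
bf16 = kv_cache_bytes(*shape, bits_per_value=16)
int2 = kv_cache_bytes(*shape, bits_per_value=2)
int2_scaled = kv_cache_bytes(*shape, bits_per_value=2, group_size=128)

print(bf16 / int2)          # 8.0 with no metadata
print(bf16 / int2_scaled)   # below 8 once per-group scales are counted
```

Any stored metadata pulls the ratio slightly under 8, which is consistent with the hedged "approximately" in the claim.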
Where Pith is reading between the lines
- This method could be extended to other low-bit quantizations if similar covariance structures exist in other model components.
- Deploying OSCAR might allow serving larger models or longer contexts under fixed hardware memory limits.
- Testing the offline estimates on a wider variety of downstream tasks would strengthen the generalizability claim.
Load-bearing premise
Covariance structures estimated offline from calibration or training data will accurately represent the attention patterns encountered during actual inference on downstream tasks and long contexts.
What would settle it
A large accuracy drop on a new long-context task or model variant not represented in the calibration data would indicate that the offline estimates do not generalize.
Original abstract
INT2 KV-cache quantization is attractive for long-context LLM serving, but it remains difficult to make both accurate and deployable. Simple rotations such as Hadamard transforms reduce outliers, but still degrade at INT2 because they are not aligned with downstream attention. We propose OSCAR, an ultra-low-bit KV-cache quantization method that estimates attention-aware covariance structures offline and uses them to derive fixed rotations and clipping thresholds for quantization. In this way, it aligns KV quantization with the covariance structures that attention actually consumes. More importantly, we not only provide theoretical justification but also develop a fully deployable OSCAR system with a custom INT2 attention kernel that remains compatible with paged KV-cache serving and fused kernel pipelines, enabling seamless integration into modern LLM serving frameworks such as SGLang and vLLM. We evaluate our method on recent reasoning models with reasoning traces of up to 32K tokens across 5 tasks. On Qwen3-4B-Thinking-2507 and Qwen3-8B, OSCAR reduces the BF16 accuracy gap to 3.78 and 1.42 points, respectively, while naive rotation INT2 collapses to nearly zero. We further scale OSCAR to Qwen3-32B and GLM-4.7 (358B params), where it remains effectively on par with BF16. On the long-context RULER-NIAH benchmark at up to 128K tokens, OSCAR remains robust on both Qwen3 models, while naive rotation INT2 collapses. System-wise, OSCAR reduces KV-cache memory by approximately 8x, improves throughput by up to 7x at large batch sizes under the same memory budget, and accelerates batch-size-1 decoding by up to 3x over BF16 due to reduced memory bandwidth overhead.
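For readers unfamiliar with the baseline the abstract calls "simple rotations", the following sketch shows how an orthonormal Walsh-Hadamard transform spreads a single outlier channel across all channels while preserving the vector norm. It is a generic textbook transform written in pure Python, not the paper's kernel, and the example vector is made up.

```python
import math

def fwht(vec):
    """Orthonormal fast Walsh-Hadamard transform (length must be a power of 2)."""
    n = len(vec)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    out = list(vec)
    h = 1
    while h < n:
        # Butterfly stage: combine pairs that are h apart.
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = out[i], out[i + h]
                out[i], out[i + h] = a + b, a - b
        h *= 2
    scale = 1.0 / math.sqrt(n)                   # makes the transform orthonormal
    return [v * scale for v in out]

x = [0.1] * 8
x[3] = 8.0                                       # one outlier channel
y = fwht(x)
print(max(abs(v) for v in x), max(abs(v) for v in y))
```

The largest magnitude drops after the transform, which is what makes a 4-level grid easier to fit. The transform is data-independent, though, which is the limitation the abstract attributes to such rotations at INT2.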
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces OSCAR, a method for 2-bit KV-cache quantization that estimates attention-aware covariance structures offline from calibration data to derive fixed rotations and clipping thresholds. This is claimed to align quantization with downstream attention patterns better than naive rotations like Hadamard transforms. The work includes a custom INT2 attention kernel compatible with paged KV-cache serving in frameworks such as SGLang and vLLM. Empirical results on Qwen3 reasoning models (4B to 32B) and GLM-4.7 (358B) report reduced accuracy gaps to BF16 (e.g., 3.78 and 1.42 points on 4B/8B models), robustness on RULER-NIAH up to 128K contexts, ~8x memory reduction, and throughput gains up to 7x.
Significance. If the results hold, OSCAR could enable practical ultra-low-bit KV caching for long-context LLM serving with near-lossless accuracy, addressing a key bottleneck in memory-constrained inference. The combination of theoretical justification, deployable kernel, and scaling to 358B models plus long-context robustness adds engineering value beyond pure quantization techniques.
major comments (2)
- [§3 and abstract] The central claim that offline covariance estimation produces rotations 'aligned with the covariance structures that attention actually consumes' is load-bearing for all accuracy results (e.g., the 3.78/1.42-point gaps and the 128K RULER-NIAH robustness). No distribution-shift experiment is described that tests whether eigenvectors from calibration data match those arising under paged attention on long-context downstream tasks; if they differ, the INT2 degradation reappears.
- [Experiments] The reported gains lack details on covariance estimation (window size, sample selection from training vs. calibration traces), error bars across runs, and explicit confirmation that calibration data has no overlap with evaluation tasks. These omissions make it impossible to assess whether the 3.78/1.42-point gaps are robust or sensitive to the free parameters listed in the axiom ledger.
minor comments (2)
- Figure captions and legends should explicitly label all three curves (BF16, naive rotation INT2, OSCAR) and report the exact context lengths used for each bar.
- Add a short paragraph clarifying the exact procedure for computing the offline covariance matrix (e.g., number of tokens, layer-wise vs. global estimation).
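The procedural details requested in the last comment (token counts, layer-wise vs. global estimation) map directly onto how a covariance estimate is accumulated. The sketch below is one plausible way to structure it, under the assumption that calibration activations arrive one sequence at a time; the class name and its interface are hypothetical and not taken from the paper.

```python
import numpy as np

class CovarianceAccumulator:
    """Running second-moment estimate for one slot (e.g. one layer and head)."""

    def __init__(self, dim):
        self.sum_outer = np.zeros((dim, dim))
        self.n_tokens = 0

    def update(self, batch):
        # batch: (tokens, dim) activations from one calibration sequence
        self.sum_outer += batch.T @ batch
        self.n_tokens += batch.shape[0]

    def covariance(self):
        if self.n_tokens == 0:
            raise ValueError("no calibration tokens seen")
        return self.sum_outer / self.n_tokens

# Layer-wise estimation keeps one accumulator per layer; a global estimate
# would feed every layer's activations into a single accumulator instead.
rng = np.random.default_rng(0)
per_layer = {layer: CovarianceAccumulator(dim=8) for layer in range(2)}
for layer, acc in per_layer.items():
    for _ in range(3):                           # three calibration sequences
        acc.update(rng.normal(size=(64, 8)))
print({layer: acc.n_tokens for layer, acc in per_layer.items()})
```

Reporting `n_tokens` per slot alongside the results would answer the "number of tokens" part of the comment directly.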
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive review. We address the major comments point by point below and will update the manuscript accordingly.
Point-by-point responses
- Referee: [§3 and abstract] The central claim that offline covariance estimation produces rotations 'aligned with the covariance structures that attention actually consumes' is load-bearing for all accuracy results (e.g., the 3.78/1.42-point gaps and the 128K RULER-NIAH robustness). No distribution-shift experiment is described that tests whether eigenvectors from calibration data match those arising under paged attention on long-context downstream tasks; if they differ, the INT2 degradation reappears.
  Authors: We acknowledge the value of directly testing eigenvector stability under distribution shift. Calibration traces were chosen to span diverse reasoning patterns and context lengths. The maintained accuracy on RULER-NIAH at 128K under paged serving already provides supporting evidence that the fixed rotations generalize. In revision we will add a short analysis subsection in §3 that (i) reports cosine similarity between calibration eigenvectors and those computed on held-out long-context downstream samples and (ii) includes a small-scale ablation showing that INT2 degradation remains limited even when modest shifts are introduced.
  Revision: yes
- Referee: [Experiments] The reported gains lack details on covariance estimation (window size, sample selection from training vs. calibration traces), error bars across runs, and explicit confirmation that calibration data has no overlap with evaluation tasks. These omissions make it impossible to assess whether the 3.78/1.42-point gaps are robust or sensitive to the free parameters listed in the axiom ledger.
  Authors: We agree these details are necessary for reproducibility. The revised Experiments section will explicitly state: the window size and number of calibration sequences used for covariance estimation, that all calibration traces are drawn from held-out data with zero overlap to any evaluation benchmark or task, and error bars (standard deviation over three independent runs) for the primary accuracy metrics on Qwen3-4B and 8B. We will also clarify the hyper-parameter choices referenced in the axiom ledger.
  Revision: yes
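The eigenvector comparison discussed in the first response can be sketched as follows. Instead of per-eigenvector cosine similarity, which is sign-ambiguous and unstable when eigenvalues are close, this sketch swaps in a subspace-overlap score based on principal angles. That substitution, the function names, and the synthetic data are assumptions for illustration, not the authors' planned analysis.

```python
import numpy as np

def top_k_eigvecs(samples, k):
    """Top-k eigenvectors of the second-moment matrix of (n, d) samples."""
    cov = samples.T @ samples / samples.shape[0]
    _, vecs = np.linalg.eigh(cov)                # eigenvalues in ascending order
    return vecs[:, ::-1][:, :k]                  # reorder so the largest come first

def subspace_overlap(calib, heldout, k):
    """Mean squared cosine of the principal angles between top-k eigenspaces.

    Returns 1.0 for identical subspaces and 0.0 for orthogonal ones.
    """
    u = top_k_eigvecs(calib, k)
    v = top_k_eigvecs(heldout, k)
    cosines = np.linalg.svd(u.T @ v, compute_uv=False)
    return float(np.mean(cosines ** 2))

rng = np.random.default_rng(0)
scales_a = np.array([10, 10, 1, 1, 1, 1, 1, 1], dtype=float)
calib = rng.normal(size=(2000, 8)) * scales_a          # energy in dims 0 and 1
shifted = rng.normal(size=(2000, 8)) * scales_a[::-1]  # energy in dims 6 and 7
print(subspace_overlap(calib, calib, k=2))
print(subspace_overlap(calib, shifted, k=2))
```

A score near 1 on held-out long-context samples would support the generalization claim, while a low score would flag the distribution shift the referee is concerned about.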
Circularity Check
No significant circularity detected
The paper describes an offline estimation of attention-aware covariance structures from calibration or training data to derive fixed rotations and clipping thresholds, followed by empirical evaluation on separate downstream reasoning tasks, long-context benchmarks (RULER-NIAH up to 128K), and scaled models. This separation between calibration and held-out evaluation maintains independence. No load-bearing derivation step, equation, or self-citation in the abstract or description reduces the performance claims to a fitted input renamed as prediction or to a self-referential definition by construction. The central results rest on external benchmarks rather than internal tautology.
Axiom & Free-Parameter Ledger
free parameters (1)
- Covariance estimation window and sample selection
axioms (1)
- Domain assumption: Attention mechanisms in LLMs consume covariance structures that can be reliably estimated from offline data.