
arXiv: 2605.17757 · v1 · pith:CTZZCS5Mnew · submitted 2026-05-18 · 💻 cs.LG · cs.AI · cs.DC · cs.PF

OSCAR: Offline Spectral Covariance-Aware Rotation for 2-bit KV Cache Quantization

Pith reviewed 2026-05-20 12:12 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · cs.DC · cs.PF
keywords KV cache quantization · INT2 quantization · LLM serving · covariance-aware rotation · long context inference · memory efficiency · attention mechanisms · model quantization

The pith

OSCAR derives fixed rotations from offline covariance estimates to enable accurate 2-bit KV cache quantization for LLMs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to solve the problem of severe accuracy degradation when quantizing KV caches to 2 bits for efficient long-context LLM serving. It does this by estimating covariance structures from attention patterns offline and using them to compute rotations that align quantization with what the attention mechanism actually uses. This approach, combined with a custom deployable kernel, keeps performance close to full BF16 precision on reasoning tasks. Sympathetic readers would care because it promises substantial memory savings and speedups without sacrificing model capability on challenging benchmarks.

Core claim

By estimating attention-aware covariance structures offline and deriving fixed rotations and clipping thresholds from them, OSCAR aligns 2-bit KV cache quantization with the covariance structures consumed by attention, providing both theoretical justification and a fully deployable system with an INT2 attention kernel compatible with paged KV-cache serving.

What carries the argument

The offline spectral covariance-aware rotation: fixed rotations, computed once from estimated covariance, that reduce outliers along the directions attention actually uses.
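
A minimal sketch of the rotate, clip, quantize idea follows. It assumes plain PCA on an uncentered calibration covariance as a stand-in for the paper's attention-aware estimate, whose exact form is not given on this page. The function names, the percentile clipping rule, and the synthetic data are illustrative assumptions, not OSCAR's actual procedure.

```python
import numpy as np

def fit_rotation_and_clip(calib, clip_pct=99.0):
    """Offline step (illustrative): eigenvectors of the calibration
    second-moment matrix give an orthogonal rotation; a percentile of the
    rotated magnitudes gives a per-channel clipping threshold."""
    cov = calib.T @ calib / calib.shape[0]        # (d, d), uncentered
    _, eigvecs = np.linalg.eigh(cov)              # columns are orthonormal
    rotated = calib @ eigvecs
    clip = np.percentile(np.abs(rotated), clip_pct, axis=0)
    return eigvecs, np.maximum(clip, 1e-8)        # avoid zero thresholds

def quantize_int2(x, rotation, clip):
    """Online step: rotate, clip, and round to 4 levels (codes 0..3)."""
    r = np.clip(x @ rotation, -clip, clip)
    scale = 2.0 * clip / 3.0                      # 4 levels span [-clip, clip]
    codes = np.rint((r + clip) / scale).astype(np.uint8)
    return codes, scale

def dequantize_int2(codes, scale, rotation, clip):
    """Map codes back to values and undo the rotation."""
    r_hat = codes.astype(np.float64) * scale - clip
    return r_hat @ rotation.T

# Synthetic stand-in activations with very uneven channel scales.
rng = np.random.default_rng(0)
channel_scale = np.linspace(0.1, 4.0, 16)
calib = rng.normal(size=(2048, 16)) * channel_scale
rotation, clip = fit_rotation_and_clip(calib)

x = rng.normal(size=(8, 16)) * channel_scale
codes, scale = quantize_int2(x, rotation, clip)
x_hat = dequantize_int2(codes, scale, rotation, clip)
```

The rotation and thresholds are fitted once and then reused unchanged at inference time, which is the property the load-bearing premise below depends on.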

If this is right

  • OSCAR reduces the BF16 accuracy gap to 3.78 points on Qwen3-4B-Thinking-2507 and 1.42 points on Qwen3-8B for INT2 KV cache.
  • It scales to Qwen3-32B and GLM-4.7 with 358B parameters while remaining on par with BF16.
  • OSCAR remains robust on long-context tasks up to 128K tokens on RULER-NIAH where naive rotation INT2 collapses.
  • KV-cache memory is reduced by approximately 8x with throughput improvements up to 7x at large batch sizes.
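
The roughly 8x memory figure in the last bullet is consistent with the bit widths alone, since 16 bits per BF16 value divided by 2 bits per INT2 value is 8. The sketch below works that arithmetic through for a hypothetical model shape; the layer count, head count, head dimension, and token count are made up for illustration. It ignores quantization scales and any tokens kept in BF16, which may be why the paper says "approximately".

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, num_tokens, bits_per_value):
    """Bytes for keys plus values, ignoring scales, zero-points, and any
    tokens kept in higher precision."""
    values = 2 * num_layers * num_kv_heads * head_dim * num_tokens  # K and V
    return values * bits_per_value // 8

# Hypothetical model shape, chosen only for illustration.
shape = dict(num_layers=36, num_kv_heads=8, head_dim=128, num_tokens=131072)

bf16 = kv_cache_bytes(bits_per_value=16, **shape)
int2 = kv_cache_bytes(bits_per_value=2, **shape)
ratio = bf16 / int2

print(f"BF16: {bf16 / 2**30:.2f} GiB, INT2: {int2 / 2**30:.2f} GiB, ratio: {ratio:.1f}x")
```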

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • This method could be extended to other low-bit quantizations if similar covariance structures exist in other model components.
  • Deploying OSCAR might allow serving larger models or longer contexts under fixed hardware memory limits.
  • Testing the offline estimates on a wider variety of downstream tasks would strengthen the generalizability claim.

Load-bearing premise

Covariance structures estimated offline from calibration or training data will accurately represent the attention patterns encountered during actual inference on downstream tasks and long contexts.

What would settle it

A large accuracy drop on a new long-context task or model variant not represented in the calibration data would indicate that the offline estimates do not generalize.

Figures

Figures reproduced from arXiv: 2605.17757 by Ben Athiwaratkun, Donglin Zhuang, Jisen Li, Shuaiwen Leon Song, Xiaoxia Wu, Zhongzhu Zhou, Ziyan Chen.

Figure 1: OSCAR pipeline overview. Offline, OSCAR estimates attention-aware key/value covariance rotation and shows how the resulting rotation makes KV activations more uniform: Hadamard mixing flattens raw peaks, while the OSCAR rotation separates directions that matter more or less to attention. Online, the serving path keeps sink and recent tokens in BF16 while applying the fixed rotate–clip–INT2 path to history…
Figure 2: Quantization error. OSCAR limits it at every stage (Qwen3-4B-Thinking-2507, AIME).
Figure 3: (a) Qwen3-4B-Thinking-2507; (b) Qwen3-8B. Panel title: OSCAR vs QuaRot vs Clip-only KL drift (teacher-forced, 32 decode steps). x-axis: prefill context length (4K, 8K, 16K, 32K, 64K, 128K). y-axis: KL(p_FP16 ‖ q), linear. Series: OSCAR (ours), QuaRot, Clip only. Annotation: beyond native 40K (YaRN×4). …
Figure 4: Left: Decode throughput speedup relative to BF16 at batch size …
Figure 5: Long-context serving stress test with 100k-token inputs. OSCAR’s uniform INT2 KV-cache …
Figure 6: Effect of prefix-cache hit ratio on end-to-end serving throughput (100k ISL, 1K OSL). Each …
Figure 7: Layer 16 of Qwen3-8B. From left: heatmap of …
Figure 8: (a) K MSE; (b) V MSE. Panel title: Per-layer V reconstruction error. x-axis: layer index. y-axis: V reconstruction MSE (absolute, per-element). Series: OSCAR (UKHHadPK + sink/recent/clip); HHad + sink/recent/clip; HHad only (no sink/recent/clip); Clip-only (no rotation); Baseline INT2 (no rotation, no clip). …
Figure 9: Attention-aware rotations make history-token activations easier to quantize. Hadamard mixing flattens raw activation peaks by spreading energy across channels. OSCAR goes further: the target covariance separates directions that matter more or less to attention, and the Hadamard transform then mixes each part into a more uniform range.
Figure 10: Attention-aware rotations make history-token activations easier to quantize. Hadamard mixing flattens raw activation peaks by spreading energy across channels. OSCAR goes further: the target covariance separates directions that matter more or less to attention, and the Hadamard transform then mixes each part into a more uniform range.
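
The Figure 1 caption describes an online path that keeps sink and recent tokens in BF16 and sends the remaining history tokens through the INT2 path. A minimal sketch of that partition is shown here. The function name and the window sizes (4 sink tokens, 64 recent tokens) are hypothetical; this page does not state the values the paper uses.

```python
def split_tokens(seq_len, num_sink, num_recent):
    """Return (sink, history, recent) position ranges for one sequence.

    Sink and recent positions would stay in BF16; history positions would
    go through the rotate, clip, INT2 path."""
    sink_end = min(num_sink, seq_len)
    recent_start = max(sink_end, seq_len - num_recent)
    return (
        range(0, sink_end),
        range(sink_end, recent_start),
        range(recent_start, seq_len),
    )

# Hypothetical window sizes, for illustration only.
sink, history, recent = split_tokens(seq_len=1000, num_sink=4, num_recent=64)
print(len(sink), len(history), len(recent))

# A sequence shorter than the sink window has no history to quantize.
short_sink, short_history, short_recent = split_tokens(seq_len=3, num_sink=4, num_recent=64)
```

As decoding proceeds, positions leave the recent window and become history, so in a scheme like this quantization would be applied to them at that point rather than at insertion time.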
Original abstract

INT2 KV-cache quantization is attractive for long-context LLM serving, but it remains difficult to make both accurate and deployable. Simple rotations such as Hadamard transforms reduce outliers, but still degrade at INT2 because they are not aligned with downstream attention. We propose OSCAR, an ultra-low-bit KV-cache quantization method that estimates attention-aware covariance structures offline and uses them to derive fixed rotations and clipping thresholds for quantization. In this way, it aligns KV quantization with the covariance structures that attention actually consumes. More importantly, we not only provide theoretical justification but also develop a fully deployable OSCAR system with a custom INT2 attention kernel that remains compatible with paged KV-cache serving and fused kernel pipelines, enabling seamless integration into modern LLM serving frameworks such as SGLang and vLLM. We evaluate our method on recent reasoning models with reasoning traces of up to 32K tokens across 5 tasks. On Qwen3-4B-Thinking-2507 and Qwen3-8B, OSCAR reduces the BF16 accuracy gap to 3.78 and 1.42 points, respectively, while naive rotation INT2 collapses to nearly zero. We further scale OSCAR to Qwen3-32B and GLM-4.7 (358B params), where it remains effectively on par with BF16. On the long-context RULER-NIAH benchmark at up to 128K tokens, OSCAR remains robust on both Qwen3 models, while naive rotation INT2 collapses. System-wise, OSCAR reduces KV-cache memory by approximately 8x, improves throughput by up to 7x at large batch sizes under the same memory budget, and accelerates batch-size-1 decoding by up to 3x over BF16 due to reduced memory bandwidth overhead.
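
The abstract's remark that Hadamard transforms reduce outliers can be illustrated with a small, self-contained example. The sketch below implements an orthonormal fast Walsh-Hadamard transform in plain Python and applies it to a toy vector with one outlier channel. The vector and the function name are illustrative and are not taken from the paper.

```python
import math

def fwht(vec):
    """Orthonormal fast Walsh-Hadamard transform.

    len(vec) must be a power of two. The transform preserves the L2 norm
    and is its own inverse."""
    out = list(vec)
    n = len(out)
    assert n > 0 and n & (n - 1) == 0, "length must be a power of two"
    h = 1
    while h < n:
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = out[i], out[i + h]
                out[i], out[i + h] = a + b, a - b   # butterfly step
        h *= 2
    norm = math.sqrt(n)
    return [v / norm for v in out]

x = [0.1] * 8
x[3] = 8.0                                   # one outlier channel
y = fwht(x)

peak_before = max(abs(v) for v in x)
peak_after = max(abs(v) for v in y)
energy_before = sum(v * v for v in x)
energy_after = sum(v * v for v in y)
print(peak_before, peak_after)
```

The outlier's energy is spread across all eight channels, so the peak magnitude falls while the total energy is unchanged. A lower peak means a uniform 4-level quantizer wastes less of its range, but this mixing is data-independent, which is the limitation the abstract attributes to simple rotations.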

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces OSCAR, a method for 2-bit KV-cache quantization that estimates attention-aware covariance structures offline from calibration data to derive fixed rotations and clipping thresholds. This is claimed to align quantization with downstream attention patterns better than naive rotations like Hadamard transforms. The work includes a custom INT2 attention kernel compatible with paged KV-cache serving in frameworks such as SGLang and vLLM. Empirical results on Qwen3 reasoning models (4B to 32B) and GLM-4.7 (358B) report reduced accuracy gaps to BF16 (e.g., 3.78 and 1.42 points on 4B/8B models), robustness on RULER-NIAH up to 128K contexts, ~8x memory reduction, and throughput gains up to 7x.

Significance. If the results hold, OSCAR could enable practical ultra-low-bit KV caching for long-context LLM serving with near-lossless accuracy, addressing a key bottleneck in memory-constrained inference. The combination of theoretical justification, deployable kernel, and scaling to 358B models plus long-context robustness adds engineering value beyond pure quantization techniques.
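
For readers unfamiliar with what an INT2 cache stores, the sketch below packs four 2-bit codes into each byte and unpacks them again. The low-bits-first layout is an assumption made for illustration; this page does not describe the memory layout the paper's kernel actually uses.

```python
def pack_int2(codes):
    """Pack 2-bit codes (each in 0..3) into bytes, four codes per byte,
    lowest-order bits first. The layout is an illustrative assumption."""
    assert len(codes) % 4 == 0, "pad to a multiple of four codes"
    assert all(0 <= c <= 3 for c in codes), "codes must fit in 2 bits"
    packed = bytearray()
    for i in range(0, len(codes), 4):
        c0, c1, c2, c3 = codes[i:i + 4]
        packed.append(c0 | (c1 << 2) | (c2 << 4) | (c3 << 6))
    return bytes(packed)

def unpack_int2(packed):
    """Inverse of pack_int2: recover four 2-bit codes from each byte."""
    codes = []
    for byte in packed:
        codes.extend((byte >> shift) & 0b11 for shift in (0, 2, 4, 6))
    return codes

codes = [0, 1, 2, 3, 3, 2, 1, 0]
packed = pack_int2(codes)
restored = unpack_int2(packed)
```

Eight codes occupy two bytes here, against sixteen bytes for the same eight values in BF16.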

major comments (2)
  1. [§3 and abstract] §3 (method) and abstract: The central claim that offline covariance estimation produces rotations 'aligned with the covariance structures that attention actually consumes' is load-bearing for all accuracy results (e.g., the 3.78/1.42-point gaps and 128K RULER-NIAH robustness). No distribution-shift experiment is described that tests whether eigenvectors from calibration data match those arising under paged attention on long-context downstream tasks; if they differ, the INT2 degradation reappears.
  2. [Experiments] Experiments section: The reported gains lack details on covariance estimation (window size, sample selection from training vs. calibration traces), error bars across runs, or explicit confirmation that calibration data has no overlap with evaluation tasks. These omissions make it impossible to assess whether the 3.78/1.42-point gaps are robust or sensitive to the free parameters listed in the axiom ledger.
minor comments (2)
  1. Figure captions and legends should explicitly label all three curves (BF16, naive rotation INT2, OSCAR) and report the exact context lengths used for each bar.
  2. Add a short paragraph clarifying the exact procedure for computing the offline covariance matrix (e.g., number of tokens, layer-wise vs. global estimation).

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive review. We address the major comments point by point below and will update the manuscript accordingly.

  1. Referee: [§3 and abstract] §3 (method) and abstract: The central claim that offline covariance estimation produces rotations 'aligned with the covariance structures that attention actually consumes' is load-bearing for all accuracy results (e.g., the 3.78/1.42-point gaps and 128K RULER-NIAH robustness). No distribution-shift experiment is described that tests whether eigenvectors from calibration data match those arising under paged attention on long-context downstream tasks; if they differ, the INT2 degradation reappears.

    Authors: We acknowledge the value of directly testing eigenvector stability under distribution shift. Calibration traces were chosen to span diverse reasoning patterns and context lengths. The maintained accuracy on RULER-NIAH at 128K under paged serving already provides supporting evidence that the fixed rotations generalize. In revision we will add a short analysis subsection in §3 that (i) reports cosine similarity between calibration eigenvectors and those computed on held-out long-context downstream samples and (ii) includes a small-scale ablation showing that INT2 degradation remains limited even when modest shifts are introduced. revision: yes

  2. Referee: [Experiments] Experiments section: The reported gains lack details on covariance estimation (window size, sample selection from training vs. calibration traces), error bars across runs, or explicit confirmation that calibration data has no overlap with evaluation tasks. These omissions make it impossible to assess whether the 3.78/1.42-point gaps are robust or sensitive to the free parameters listed in the axiom ledger.

    Authors: We agree these details are necessary for reproducibility. The revised Experiments section will explicitly state: the window size and number of calibration sequences used for covariance estimation, that all calibration traces are drawn from held-out data with zero overlap to any evaluation benchmark or task, and error bars (standard deviation over three independent runs) for the primary accuracy metrics on Qwen3-4B and 8B. We will also clarify the hyper-parameter choices referenced in the axiom ledger. revision: yes
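
The eigenvector comparison proposed in the first response could be sketched as follows. This is one possible formulation, not the authors' stated procedure: it compares top-k eigen-subspaces through the cosines of their principal angles rather than comparing eigenvectors one by one, because individual eigenvectors are not stable when eigenvalues are close. The synthetic data, the choice k=4, and the function names are assumptions.

```python
import numpy as np

def top_eigvecs(samples, k):
    """Top-k eigenvectors (as columns) of the uncentered sample covariance."""
    cov = samples.T @ samples / samples.shape[0]
    _, vecs = np.linalg.eigh(cov)          # eigenvalues in ascending order
    return vecs[:, -k:]

def subspace_cosines(u, v):
    """Cosines of the principal angles between two orthonormal bases.

    Values near 1 mean the two subspaces nearly coincide."""
    return np.linalg.svd(u.T @ v, compute_uv=False)

# Synthetic stand-ins: 4 high-variance channels, 12 low-variance channels.
rng = np.random.default_rng(0)
scales = np.concatenate([np.full(4, 5.0), np.full(12, 0.5)])
calib = rng.normal(size=(4096, 16)) * scales      # "calibration" activations
heldout = rng.normal(size=(4096, 16)) * scales    # "held-out" activations

cosines = subspace_cosines(top_eigvecs(calib, 4), top_eigvecs(heldout, 4))
print(cosines)
```

Both sets here are drawn from the same distribution, so the cosines come out close to 1. A genuine distribution-shift test would replace `heldout` with activations collected on long-context downstream tasks.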

Circularity Check

0 steps flagged

No significant circularity detected


The paper describes an offline estimation of attention-aware covariance structures from calibration or training data to derive fixed rotations and clipping thresholds, followed by empirical evaluation on separate downstream reasoning tasks, long-context benchmarks (RULER-NIAH up to 128K), and scaled models. This separation between calibration and held-out evaluation maintains independence. No load-bearing derivation step, equation, or self-citation in the abstract or description reduces the performance claims to a fitted input renamed as prediction or to a self-referential definition by construction. The central results rest on external benchmarks rather than internal tautology.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The method rests on the ability to estimate stable attention covariance structures offline and on the assumption that fixed rotations derived from them remain effective at inference time.

free parameters (1)
  • Covariance estimation window and sample selection
    Offline estimation of attention-aware covariance structures requires choices of data windows and samples that are fitted or selected to produce the rotations.
axioms (1)
  • domain assumption: Attention mechanisms in LLMs consume covariance structures that can be reliably estimated from offline data.
    Central to deriving the rotations and thresholds that align quantization with downstream attention.


