OSCAR: Offline Spectral Covariance-Aware Rotation for 2-bit KV Cache Quantization

Ben Athiwaratkun; Donglin Zhuang; Jisen Li; Shuaiwen Leon Song; Xiaoxia Wu; Zhongzhu Zhou; Ziyan Chen

arXiv: 2605.17757 · v1 · pith:CTZZCS5Mnew · submitted 2026-05-18 · 💻 cs.LG · cs.AI · cs.DC · cs.PF


Pith reviewed 2026-05-20 12:12 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · cs.DC · cs.PF

keywords: KV cache quantization, INT2 quantization, LLM serving, covariance-aware rotation, long context inference, memory efficiency, attention mechanisms, model quantization


The pith

OSCAR derives fixed rotations from offline covariance estimates to enable accurate 2-bit KV cache quantization for LLMs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to solve the problem of severe accuracy degradation when quantizing KV caches to 2 bits for efficient long-context LLM serving. It does this by estimating covariance structures from attention patterns offline and using them to compute rotations that align quantization with what the attention mechanism actually uses. This approach, combined with a custom deployable kernel, keeps performance close to full BF16 precision on reasoning tasks. Sympathetic readers would care because it promises substantial memory savings and speedups without sacrificing model capability on challenging benchmarks.

Core claim

By estimating attention-aware covariance structures offline and deriving fixed rotations and clipping thresholds from them, OSCAR aligns 2-bit KV cache quantization with the covariance structures consumed by attention, providing both theoretical justification and a fully deployable system with an INT2 attention kernel compatible with paged KV-cache serving.

What carries the argument

The offline spectral covariance-aware rotation: fixed rotations computed from estimated covariance, which reduce outliers in a way that is aligned with attention patterns.
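As a rough illustration of how a fixed rotation can be derived from an offline covariance estimate, the sketch below rotates into the eigenbasis of a second-moment matrix and then applies Hadamard mixing. This is an assumption made for illustration, not the paper's construction: the attention-aware weighting of the covariance is not modeled, and the names `hadamard`, `covariance_rotation`, and the synthetic calibration data are hypothetical.

```python
import numpy as np

def hadamard(n):
    """Normalized Sylvester-Hadamard matrix; n must be a power of two."""
    if n <= 0 or n & (n - 1):
        raise ValueError("n must be a power of two")
    h = np.ones((1, 1))
    while h.shape[0] < n:
        h = np.block([[h, h], [h, -h]])
    return h / np.sqrt(n)

def covariance_rotation(calib):
    """Orthogonal rotation from calibration activations of shape (tokens, head_dim).

    Assumption: eigenbasis of the uncentered second-moment matrix,
    followed by Hadamard mixing.
    """
    second_moment = calib.T @ calib / calib.shape[0]
    _, eigvecs = np.linalg.eigh(second_moment)  # orthonormal columns
    return eigvecs @ hadamard(calib.shape[1])

rng = np.random.default_rng(0)
d = 64
scales = np.ones(d)
scales[:4] = 20.0  # a few synthetic outlier channels
calib = rng.standard_normal((4096, d)) * scales

R = covariance_rotation(calib)
rotated = calib @ R

raw_energy = (calib ** 2).mean(axis=0)
rot_energy = (rotated ** 2).mean(axis=0)
print("raw channel energy max/min:    ", raw_energy.max() / raw_energy.min())
print("rotated channel energy max/min:", rot_energy.max() / rot_energy.min())
```

On the calibration data itself, the eigenbasis step decorrelates the channels and the Hadamard step then gives every rotated channel the same average energy, which is one way to read the "more uniform" activations described for the method. Whether that property carries over to unseen inference data is exactly the load-bearing premise discussed further down.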

If this is right

OSCAR reduces the BF16 accuracy gap to 3.78 points on Qwen3-4B-Thinking-2507 and 1.42 points on Qwen3-8B for INT2 KV cache.
It scales to Qwen3-32B and GLM-4.7 with 358B parameters while remaining on par with BF16.
OSCAR remains robust on long-context tasks up to 128K tokens on RULER-NIAH where naive rotation INT2 collapses.
KV-cache memory is reduced by approximately 8x with throughput improvements up to 7x at large batch sizes.
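The roughly 8x memory figure follows from the bit widths alone: 16 bits per element in BF16 against 2 bits in INT2. A minimal sketch of that arithmetic is below; the model shape is hypothetical and chosen only for illustration, and quantization metadata (scales, thresholds) as well as any tokens kept in BF16 are ignored, which is presumably why the reported saving is approximate.

```python
def kv_cache_bytes(tokens, layers, kv_heads, head_dim, bits):
    """Bytes needed for keys plus values, ignoring quantization metadata."""
    elements = 2 * tokens * layers * kv_heads * head_dim  # 2 = keys and values
    return elements * bits / 8

# Hypothetical model shape, not taken from the paper.
shape = dict(tokens=131072, layers=36, kv_heads=8, head_dim=128)

bf16_bytes = kv_cache_bytes(bits=16, **shape)
int2_bytes = kv_cache_bytes(bits=2, **shape)
ratio = bf16_bytes / int2_bytes

print(f"BF16: {bf16_bytes / 2**30:.2f} GiB")
print(f"INT2: {int2_bytes / 2**30:.2f} GiB")
print(f"ratio: {ratio:.1f}x")
```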

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

This method could be extended to other low-bit quantizations if similar covariance structures exist in other model components.
Deploying OSCAR might allow serving larger models or longer contexts under fixed hardware memory limits.
Testing the offline estimates on a wider variety of downstream tasks would strengthen the generalizability claim.

Load-bearing premise

Covariance structures estimated offline from calibration or training data will accurately represent the attention patterns encountered during actual inference on downstream tasks and long contexts.

What would settle it

A large accuracy drop on a new long-context task or model variant not represented in the calibration data would indicate that the offline estimates do not generalize.

Figures

Figures reproduced from arXiv: 2605.17757 by Ben Athiwaratkun, Donglin Zhuang, Jisen Li, Shuaiwen Leon Song, Xiaoxia Wu, Zhongzhu Zhou, Ziyan Chen.

**Figure 1.** OSCAR pipeline overview. Offline, OSCAR estimates attention-aware key/value covariance rotation and shows how the resulting rotation makes KV activations more uniform: Hadamard mixing flattens raw peaks, while the OSCAR rotation separates directions that matter more or less to attention. Online, the serving path keeps sink and recent tokens in BF16 while applying the fixed rotate–clip–INT2 path to history…

**Figure 2.** Quantization error. OSCAR limits it at every stage (Qwen3-4B-Thinking-2507, AIME).

**Figure 3.** OSCAR vs QuaRot vs Clip-only KL drift (teacher-forced, 32 decode steps). Panels: (a) Qwen3-4B-Thinking-2507, (b) Qwen3-8B. x-axis: prefill context length (4K to 128K); y-axis: KL(p_FP16 ‖ p_q), linear scale. Context lengths beyond the native 40K use YaRN×4.

**Figure 4.** Left: Decode throughput speedup relative to BF16 at batch size…

**Figure 5.** Long-context serving stress test with 100k-token inputs. OSCAR’s uniform INT2 KV-cache…

**Figure 6.** Effect of prefix-cache hit ratio on end-to-end serving throughput (100k ISL, 1K OSL). Each…

**Figure 7.** Layer 16 of Qwen3-8B. From left: heatmap of…

**Figure 8.** Per-layer reconstruction error. Panels: (a) K MSE, (b) V MSE. x-axis: layer index; y-axis: reconstruction MSE (absolute, per-element), log scale. Curves: OSCAR (UKHHadPK + sink/recent/clip); HHad + sink/recent/clip; HHad only (no sink/recent/clip); Clip-only (no rotation); Baseline INT2 (no rotation, no clip).

**Figure 9.** Attention-aware rotations make history-token activations easier to quantize. Hadamard mixing flattens raw activation peaks by spreading energy across channels. OSCAR goes further: the target covariance separates directions that matter more or less to attention, and the Hadamard transform then mixes each part into a more uniform range.

**Figure 10.** Attention-aware rotations make history-token activations easier to quantize. Hadamard mixing flattens raw activation peaks by spreading energy across channels. OSCAR goes further: the target covariance separates directions that matter more or less to attention, and the Hadamard transform then mixes each part into a more uniform range.
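Figure 3 reports KL drift between the FP16 reference distribution and the quantized one under teacher forcing. A minimal sketch of such a metric is given below, assuming the drift is the mean of KL(p_ref ‖ p_q) over decode steps; the averaging convention, the function names, and the synthetic logits are assumptions, not details taken from the paper.

```python
import numpy as np

def log_softmax(logits):
    """Numerically stable log-softmax over the last axis."""
    z = logits - logits.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

def kl_drift(ref_logits, quant_logits):
    """Mean KL(p_ref || p_q) over decode steps; inputs have shape (steps, vocab)."""
    log_p = log_softmax(ref_logits)
    log_q = log_softmax(quant_logits)
    per_step = (np.exp(log_p) * (log_p - log_q)).sum(axis=-1)
    return float(per_step.mean())

rng = np.random.default_rng(0)
ref = rng.standard_normal((32, 1000))            # 32 decode steps, toy vocabulary
noisy = ref + 0.5 * rng.standard_normal(ref.shape)  # stand-in for quantized-cache logits

print("drift vs itself:", kl_drift(ref, ref))
print("drift vs noisy: ", kl_drift(ref, noisy))
```

Teacher forcing matters here because both models are fed the same token prefix, so the metric isolates the effect of the quantized cache on the next-token distribution rather than compounding divergent generations.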

Original abstract

INT2 KV-cache quantization is attractive for long-context LLM serving, but it remains difficult to make both accurate and deployable. Simple rotations such as Hadamard transforms reduce outliers, but still degrade at INT2 because they are not aligned with downstream attention. We propose OSCAR, an ultra-low-bit KV-cache quantization method that estimates attention-aware covariance structures offline and uses them to derive fixed rotations and clipping thresholds for quantization. In this way, it aligns KV quantization with the covariance structures that attention actually consumes. More importantly, we not only provide theoretical justification but also develop a fully deployable OSCAR system with a custom INT2 attention kernel that remains compatible with paged KV-cache serving and fused kernel pipelines, enabling seamless integration into modern LLM serving frameworks such as SGLang and vLLM. We evaluate our method on recent reasoning models with reasoning traces of up to 32K tokens across 5 tasks. On Qwen3-4B-Thinking-2507 and Qwen3-8B, OSCAR reduces the BF16 accuracy gap to 3.78 and 1.42 points, respectively, while naive rotation INT2 collapses to nearly zero. We further scale OSCAR to Qwen3-32B and GLM-4.7 (358B params), where it remains effectively on par with BF16. On long-context evaluation (RULER-NIAH up to 128K), OSCAR remains robust on both Qwen3 models, while naive rotation INT2 collapses. System-wise, OSCAR reduces KV-cache memory by approximately 8x, improves throughput by up to 7x at large batch sizes under the same memory budget, and accelerates batch-size-1 decoding by up to 3x over BF16 due to reduced memory bandwidth overhead.
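The rotate, clip, and INT2 steps named in the abstract can be sketched in a few lines. The version below is a simplified stand-in under stated assumptions: a random orthogonal matrix replaces the offline-derived rotation, the clipping threshold `clip` is a hypothetical constant rather than a calibrated one, quantization is uniform with four levels, and the BF16 handling of sink and recent tokens is left out.

```python
import numpy as np

def quantize_int2(x, clip):
    """Clip to [-clip, clip] and map uniformly to the 2-bit codes 0..3."""
    step = 2.0 * clip / 3.0  # four levels span the clipped range
    codes = np.round((np.clip(x, -clip, clip) + clip) / step)
    return codes.astype(np.uint8)

def dequantize_int2(codes, clip):
    """Map 2-bit codes back to their reconstruction levels."""
    step = 2.0 * clip / 3.0
    return codes.astype(np.float32) * step - clip

def pack_int2(codes):
    """Pack four 2-bit codes per byte along the last axis (length divisible by 4)."""
    c = codes.reshape(*codes.shape[:-1], -1, 4)
    packed = c[..., 0] | (c[..., 1] << 2) | (c[..., 2] << 4) | (c[..., 3] << 6)
    return packed.astype(np.uint8)

rng = np.random.default_rng(0)
d = 64
R, _ = np.linalg.qr(rng.standard_normal((d, d)))  # stand-in orthogonal rotation
keys = rng.standard_normal((16, d)).astype(np.float32)
clip = 2.5                                        # hypothetical fixed threshold

rotated = keys @ R                                # rotate
codes = quantize_int2(rotated, clip)              # clip + quantize
packed = pack_int2(codes)                         # 4 codes per byte
keys_hat = dequantize_int2(codes, clip) @ R.T     # undo the rotation

print("packed shape:", packed.shape)
print("relative error:", np.linalg.norm(keys - keys_hat) / np.linalg.norm(keys))
```

Because the rotation and threshold are fixed ahead of time, nothing in this path depends on per-request statistics, which is the property that makes a fused serving kernel plausible. A real kernel would more likely fold the rotation into the attention computation than materialize `keys_hat`.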

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

OSCAR shows that offline attention covariance can produce rotations keeping INT2 KV cache within a few points of BF16 on Qwen reasoning models and scaling to 358B, but the calibration-to-inference match remains untested.


The key takeaway is that this paper gets 2-bit KV cache to stay usable on recent reasoning models by deriving fixed rotations from offline covariance estimates rather than generic Hadamard transforms. On Qwen3-4B and 8B it cuts the accuracy drop to 3.78 and 1.42 points while naive INT2 falls apart, and it holds parity on 32B and 358B models plus 128K RULER-NIAH. They also ship a paged-compatible INT2 kernel that plugs into vLLM and SGLang, which turns the idea into something you could actually run at scale with 8x memory savings and throughput gains up to 7x at large batches.

Referee Report

2 major / 2 minor

Summary. The paper introduces OSCAR, a method for 2-bit KV-cache quantization that estimates attention-aware covariance structures offline from calibration data to derive fixed rotations and clipping thresholds. This is claimed to align quantization with downstream attention patterns better than naive rotations like Hadamard transforms. The work includes a custom INT2 attention kernel compatible with paged KV-cache serving in frameworks such as SGLang and vLLM. Empirical results on Qwen3 reasoning models (4B to 32B) and GLM-4.7 (358B) report reduced accuracy gaps to BF16 (e.g., 3.78 and 1.42 points on 4B/8B models), robustness on RULER-NIAH up to 128K contexts, ~8x memory reduction, and throughput gains up to 7x.

Significance. If the results hold, OSCAR could enable practical ultra-low-bit KV caching for long-context LLM serving with near-lossless accuracy, addressing a key bottleneck in memory-constrained inference. The combination of theoretical justification, deployable kernel, and scaling to 358B models plus long-context robustness adds engineering value beyond pure quantization techniques.

major comments (2)

[§3 and abstract] §3 (method) and abstract: The central claim that offline covariance estimation produces rotations 'aligned with the covariance structures that attention actually consumes' is load-bearing for all accuracy results (e.g., the 3.78/1.42-point gaps and 128k RULER-NIAH robustness). No distribution-shift experiment is described that tests whether eigenvectors from calibration data match those arising under paged attention on long-context downstream tasks; if they differ, the INT2 degradation reappears.
[Experiments] Experiments section: The reported gains lack details on covariance estimation (window size, sample selection from training vs. calibration traces), error bars across runs, or explicit confirmation that calibration data has no overlap with evaluation tasks. These omissions make it impossible to assess whether the 3.78/1.42-point gaps are robust or sensitive to the free parameters listed in the axiom ledger.

minor comments (2)

Figure captions and legends should explicitly label all three curves (BF16, naive rotation INT2, OSCAR) and report the exact context lengths used for each bar.
Add a short paragraph clarifying the exact procedure for computing the offline covariance matrix (e.g., number of tokens, layer-wise vs. global estimation).
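To make the layer-wise versus global question concrete, the sketch below accumulates a running second-moment estimate per layer over calibration batches. It is a hedged illustration only: the class name `SecondMomentAccumulator`, the batch sizes, and the choice of one accumulator per layer are assumptions, and the paper's actual procedure is precisely what this comment asks the authors to document.

```python
import numpy as np

class SecondMomentAccumulator:
    """Running uncentered covariance estimate for one layer (or head)."""

    def __init__(self, dim):
        self.sum_outer = np.zeros((dim, dim))
        self.count = 0

    def update(self, acts):
        """Add a batch of activations with shape (tokens, dim)."""
        acts = np.asarray(acts, dtype=np.float64)
        self.sum_outer += acts.T @ acts
        self.count += acts.shape[0]

    def estimate(self):
        """Return the accumulated second-moment matrix."""
        return self.sum_outer / max(self.count, 1)

rng = np.random.default_rng(0)
num_layers, head_dim = 2, 8  # hypothetical sizes
accumulators = {layer: SecondMomentAccumulator(head_dim) for layer in range(num_layers)}
seen = {layer: [] for layer in range(num_layers)}

for _ in range(5):  # five calibration batches
    for layer in range(num_layers):
        acts = rng.standard_normal((32, head_dim))  # stand-in for captured K or V
        accumulators[layer].update(acts)
        seen[layer].append(acts)

print("tokens per layer:", accumulators[0].count)
```

A global estimate would pool all layers into one accumulator. Reporting the token count and the pooling choice alongside the results would address the reproducibility concern.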

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive review. We address the major comments point by point below and will update the manuscript accordingly.


Referee: [§3 and abstract] §3 (method) and abstract: The central claim that offline covariance estimation produces rotations 'aligned with the covariance structures that attention actually consumes' is load-bearing for all accuracy results (e.g., the 3.78/1.42-point gaps and 128k RULER-NIAH robustness). No distribution-shift experiment is described that tests whether eigenvectors from calibration data match those arising under paged attention on long-context downstream tasks; if they differ, the INT2 degradation reappears.

Authors: We acknowledge the value of directly testing eigenvector stability under distribution shift. Calibration traces were chosen to span diverse reasoning patterns and context lengths. The maintained accuracy on RULER-NIAH at 128K under paged serving already provides supporting evidence that the fixed rotations generalize. In revision we will add a short analysis subsection in §3 that (i) reports cosine similarity between calibration eigenvectors and those computed on held-out long-context downstream samples and (ii) includes a small-scale ablation showing that INT2 degradation remains limited even when modest shifts are introduced. revision: yes
Referee: [Experiments] Experiments section: The reported gains lack details on covariance estimation (window size, sample selection from training vs. calibration traces), error bars across runs, or explicit confirmation that calibration data has no overlap with evaluation tasks. These omissions make it impossible to assess whether the 3.78/1.42-point gaps are robust or sensitive to the free parameters listed in the axiom ledger.

Authors: We agree these details are necessary for reproducibility. The revised Experiments section will explicitly state: the window size and number of calibration sequences used for covariance estimation, that all calibration traces are drawn from held-out data with zero overlap to any evaluation benchmark or task, and error bars (standard deviation over three independent runs) for the primary accuracy metrics on Qwen3-4B and 8B. We will also clarify the hyper-parameter choices referenced in the axiom ledger. revision: yes
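The eigenvector comparison promised in the first response could take several forms. The sketch below uses a top-k subspace overlap rather than per-vector cosine similarity, since individual eigenvectors are sign-ambiguous and unstable when eigenvalues are close. This is a swapped-in variant chosen for illustration, and the function name and toy covariances are hypothetical.

```python
import numpy as np

def top_k_subspace_overlap(cov_a, cov_b, k):
    """Overlap in [0, 1] between the top-k eigenspaces of two covariance matrices.

    Computed as ||U_a^T U_b||_F^2 / k, where U_a and U_b hold the k leading
    eigenvectors; 1 means identical subspaces, 0 means orthogonal ones.
    """
    _, vec_a = np.linalg.eigh(cov_a)  # eigenvalues in ascending order
    _, vec_b = np.linalg.eigh(cov_b)
    top_a, top_b = vec_a[:, -k:], vec_b[:, -k:]
    return float(np.linalg.norm(top_a.T @ top_b, ord="fro") ** 2 / k)

k = 2
# Toy covariances: dominant directions sit on different coordinates.
cov_calib = np.diag([9.0, 8.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0])
cov_shifted = np.diag([1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 8.0, 9.0])

same = top_k_subspace_overlap(cov_calib, cov_calib, k)
disjoint = top_k_subspace_overlap(cov_calib, cov_shifted, k)
print("calibration vs itself: ", same)
print("calibration vs shifted:", disjoint)
```

An overlap near 1 between calibration and held-out long-context covariances would support the load-bearing premise, while a low overlap paired with an accuracy drop would point to calibration mismatch as the cause.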

Circularity Check

0 steps flagged

No significant circularity detected


The paper describes an offline estimation of attention-aware covariance structures from calibration or training data to derive fixed rotations and clipping thresholds, followed by empirical evaluation on separate downstream reasoning tasks, long-context benchmarks (RULER-NIAH up to 128K), and scaled models. This separation between calibration and held-out evaluation maintains independence. No load-bearing derivation step, equation, or self-citation in the abstract or description reduces the performance claims to a fitted input renamed as prediction or to a self-referential definition by construction. The central results rest on external benchmarks rather than internal tautology.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The method rests on the ability to estimate stable attention covariance structures offline and on the assumption that fixed rotations derived from them remain effective at inference time.

free parameters (1)

Covariance estimation window and sample selection
Offline estimation of attention-aware covariance structures requires choices of data windows and samples that are fitted or selected to produce the rotations.

axioms (1)

domain assumption: Attention mechanisms in LLMs consume covariance structures that can be reliably estimated from offline data.
Central to deriving the rotations and thresholds that align quantization with downstream attention.



[56]

SVD-LLM v2: Optimizing singular value truncation for large language model compression

Xin Wang, Samiul Alam, Zhongwei Wan, Hui Shen, and Mi Zhang. SVD-LLM v2: Optimizing singular value truncation for large language model compression. InProceedings of the Confer- ence of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies, 2025

work page 2025
[57]

CorDA: Context-oriented decomposition adaptation of large language models

Yibo Yang, Xiaojie Li, Zhongzhu Zhou, Shuaiwen Leon Song, Jianlong Wu, Liqiang Nie, and Bernard Ghanem. CorDA: Context-oriented decomposition adaptation of large language models. arXiv preprint arXiv:2406.05223, 2024

work page arXiv 2024
[58]

HaPPI: Efficient KV cache compression with hadamard PCA-based power iteration

Seonggon Kim, Taehyeon Kim, and Eunhyeok Park. HaPPI: Efficient KV cache compression with hadamard PCA-based power iteration. OpenReview, 2025. URL https://openreview. net/forum?id=BRDgQzdtWr. Submitted to ICLR 2026

work page 2025
[59]

CARE: Covariance-aware and rank- enhanced decomposition for enabling multi-head latent attention

Zhongzhu Zhou, Fengxiang Bie, Ziyan Chen, Zhenyu Zhang, Yibo Yang, Junxiong Wang, Ben Athiwaratkun, Xiaoxia Wu, and Shuaiwen Leon Song. CARE: Covariance-aware and rank- enhanced decomposition for enabling multi-head latent attention. InInternational Conference on Learning Representations, 2026

work page 2026
[60]

RecalKV: Low-rank KV cache compression via head reordering and offline calibration.arXiv preprint arXiv:2505.24357, 2025

Xianglong Yan, Zhiteng Li, Tianao Zhang, Linghe Kong, Yulun Zhang, and Xiaokang Yang. RecalKV: Low-rank KV cache compression via head reordering and offline calibration.arXiv preprint arXiv:2505.24357, 2025

work page arXiv 2025
[61]

CommonKV: Com- pressing KV cache with cross-layer parameter sharing.arXiv preprint arXiv:2508.16134, 2025

Yixuan Wang, Haoyu Qiao, Lujun Li, Qingfu Zhu, and Wanxiang Che. CommonKV: Com- pressing KV cache with cross-layer parameter sharing.arXiv preprint arXiv:2508.16134, 2025

work page arXiv 2025
[62]

AWQ: Activation-aware weight quantization for on-device LLM compression and acceleration

Ji Lin, Jiaming Tang, Haotian Tang, Shang Yang, Wei-Ming Chen, Wei-Chen Wang, Guangxuan Xiao, Xingyu Dang, Chuang Gan, and Song Han. AWQ: Activation-aware weight quantization for on-device LLM compression and acceleration. InProceedings of Machine Learning and Systems, 2024

work page 2024
[63]

OmniQuant: Omnidirectionally calibrated quantiza- tion for large language models.arXiv preprint arXiv:2308.13137, 2023

Wenqi Shao, Mengzhao Chen, Zhaoyang Zhang, Peng Xu, Lirui Zhao, Zhiqian Li, Kaipeng Zhang, Peng Gao, Yu Qiao, and Ping Luo. OmniQuant: Omnidirectionally calibrated quantiza- tion for large language models.arXiv preprint arXiv:2308.13137, 2023. 15

work page arXiv 2023
[64]

ThinKV: Thought-Adaptive KV Cache Compression for Efficient Reasoning Models

Akshat Ramachandran, Marina Neseem, Charbel Sakr, Rangharajan Venkatesan, Brucek Khailany, and Tushar Krishna. ThinKV: Thought-adaptive KV cache compression for ef- ficient reasoning models.arXiv preprint arXiv:2510.01290, 2025. URL https://arxiv.org/ abs/2510.01290. 16 A Additional Details and Theoretical Analysis A.1 Hadamard Transform The Hadamard tran...

work page internal anchor Pith review Pith/arXiv arXiv 2025

[1] Zhenyu Zhang, Ying Sheng, Tianyi Zhou, Tianlong Chen, Lianmin Zheng, Ruisi Cai, Zhao Song, Yuandong Tian, Christopher Ré, Clark Barrett, Zhangyang Wang, and Beidi Chen. H2O: Heavy-hitter oracle for efficient generative inference of large language models. Advances in Neural Information Processing Systems, 2023.

[2] Akshat Sharma, Hangliang Ding, Jianping Li, Neel Dani, and Minjia Zhang. MiniKV: Pushing the limits of LLM inference via 2-bit layer-discriminative KV cache. arXiv preprint arXiv:2411.18077, 2024.

[3] Suyu Ge, Yunan Zhang, Liyuan Liu, Minjia Zhang, Jiawei Han, and Jianfeng Gao. Model tells you what to discard: Adaptive KV cache compression for LLMs. arXiv preprint arXiv:2310.01801, 2023.

[4] Coleman Hooper, Sehoon Kim, Hiva Mohammadzadeh, Michael W. Mahoney, Yakun Sophia Shao, Kurt Keutzer, and Amir Gholami. KVQuant: Towards 10 million context length LLM inference with KV cache quantization. In Advances in Neural Information Processing Systems, 2024.

[5] Yuxuan Yue, Zhihang Yuan, Haojie Duanmu, Sifan Zhou, Jianlong Wu, and Liqiang Nie. WKVQuant: Quantizing weight and key/value cache for large language models gains more. arXiv preprint arXiv:2402.12065, 2024.

[6] Haojun Xia, Xiaoxia Wu, Jisen Li, Robert Wu, Junxiong Wang, Jue Wang, Chenxi Li, Aman Singhal, Alay Dilipbhai Shah, Alpay Ariyak, Donglin Zhuang, Zhongzhu Zhou, Ben Athiwaratkun, Zhen Zheng, and Shuaiwen Leon Song. Kitty: Accurate and efficient 2-bit KV cache quantization with dynamic channel-wise precision boost. arXiv preprint arXiv:2511.18643, 2025.

[7] Zunhai Su, Zhe Chen, Wang Shen, Hanyu Wei, Linge Li, Huangqi Yu, and Kehong Yuan. RotateKV: Accurate and robust 2-bit KV cache quantization for LLMs via outlier-aware adaptive rotations. arXiv preprint arXiv:2501.16383, 2025.

[8] Haoyang LI, Yiming Li, Anxin Tian, Tianhao Tang, Zhanchao Xu, Xuejia Chen, Nicole HU, Wei Dong, Li Qing, and Lei Chen. A survey on large language model acceleration based on KV cache management. Transactions on Machine Learning Research, 2025. ISSN 2835-8856. URL https://openreview.net/forum?id=z3JZzu9EA3.

[9] Saleh Ashkboos, Amirkeivan Mohtashami, Maximilian L. Croci, Bo Li, Pashmina Cameron, Martin Jaggi, Dan Alistarh, Torsten Hoefler, and James Hensman. QuaRot: Outlier-free 4-bit inference in rotated LLMs. In Advances in Neural Information Processing Systems, volume 37, pages 100213–100240, 2024.

[10] Albert Tseng, Jerry Chee, Qingyao Sun, Volodymyr Kuleshov, and Christopher De Sa. QuIP#: Even better LLM quantization with Hadamard incoherence and lattice codebooks. Proceedings of Machine Learning Research, 235:48630, 2024.

[11] Zechun Liu, Changsheng Zhao, Igor Fedorov, Bilge Soran, Dhruv Choudhary, Raghuraman Krishnamoorthi, Vikas Chandra, Yuandong Tian, and Tijmen Blankevoort. SpinQuant: LLM quantization with learned rotations. arXiv preprint arXiv:2405.16406, 2024.

[12] Jinda Jia, Jisen Li, Zhongzhu Zhou, Jung Hwan Heo, Jue Wang, Tri Dao, Shuaiwen Leon Song, Ben Athiwaratkun, Chenfeng Xu, Tianyi Zhang, et al. SAW-INT4: System-aware 4-bit KV-cache quantization for real-world LLM serving. arXiv preprint arXiv:2604.19157, 2026.

[13] Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with PagedAttention. In Proceedings of the 29th Symposium on Operating Systems Principles, 2023.

[14] Tri Dao, Daniel Y. Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. FlashAttention: Fast and memory-efficient exact attention with IO-awareness. Advances in Neural Information Processing Systems, 35:16344–16359, 2022.

[15] Tri Dao. FlashAttention-2: Faster attention with better parallelism and work partitioning. arXiv preprint arXiv:2307.08691, 2023. URL https://arxiv.org/abs/2307.08691.

[16] Jay Shah, Ganesh Bikshandi, Ying Zhang, Vijay Thakkar, Pradeep Ramani, and Tri Dao. FlashAttention-3: Fast and accurate attention with asynchrony and low-precision. Advances in Neural Information Processing Systems, 37:68658–68685, 2024.

[17] Ted Zadouri, Markus Hoehnerbach, Jay Shah, Timmy Liu, Vijay Thakkar, and Tri Dao. FlashAttention-4: Algorithm and kernel pipelining co-design for asymmetric hardware scaling. arXiv preprint arXiv:2603.05451, 2026.

[18] Lianmin Zheng, Liangsheng Yin, Zhiqiang Xie, Chuyue Sun, Jeff Huang, Cody Hao Yu, Shiyi Cao, Christos Kozyrakis, Ion Stoica, Joseph E. Gonzalez, Clark Barrett, and Ying Sheng. SGLang: Efficient execution of structured language model programs. In Advances in Neural Information Processing Systems, 2024.

[19] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, 2017.

[20] Jerry Chee, Yaohui Cai, Volodymyr Kuleshov, and Christopher M. De Sa. QuIP: 2-bit quantization of large language models with guarantees. In Advances in Neural Information Processing Systems, 2023.

[21] James W. Cooley and John W. Tukey. An algorithm for the machine calculation of complex Fourier series. Mathematics of Computation, 19(90):297–301, 1965. doi: 10.1090/S0025-5718-1965-0178586-1.

[22] Zirui Liu, Jiayi Yuan, Hongye Jin, Shaochen Zhong, Zhaozhuo Xu, Vladimir Braverman, Beidi Chen, and Xia Hu. KIVI: A tuning-free asymmetric 2bit quantization for KV cache. arXiv preprint arXiv:2402.02750, 2024.

[23] Guangxuan Xiao, Ji Lin, Mickael Seznec, Hao Wu, Julien Demouth, and Song Han. SmoothQuant: Accurate and efficient post-training quantization for large language models. In Proceedings of the 40th International Conference on Machine Learning, 2023.

[24] Elias Frantar, Saleh Ashkboos, Torsten Hoefler, and Dan Alistarh. GPTQ: Accurate post-training quantization for generative pre-trained transformers. In International Conference on Learning Representations, 2023.

[25] Philippe Tillet, H. T. Kung, and David Cox. Triton: An intermediate language and compiler for tiled neural network computations. In Proceedings of the 3rd ACM SIGPLAN International Workshop on Machine Learning and Programming Languages, 2019.

[26] Tri Dao, Daniel Haziza, Francisco Massa, and Grigory Sizov. Flash-decoding for long-context inference. https://pytorch.org/blog/flash-decoding/, 2023. PyTorch Blog. Accessed: 2026-05-06.

[27] Qwen Team. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025.

[28] GLM Team. ChatGLM: A family of large language models from GLM-130B to GLM-4 all tools. arXiv preprint arXiv:2406.12793, 2024.

[29] Mathematical Association of America. AIME 2025: American invitational mathematics examination. https://maa.org/math-competitions/aime, 2025.

[30] David Rein, Betty Li Hou, Asa Cooper Stickland, Jackson Petty, Richard Yuanzhe Pang, Julien Dirani, Julian Michael, and Samuel R. Bowman. GPQA: A graduate-level Google-proof Q&A benchmark. Conference on Language Modeling, 2024.

[31] Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, et al. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374, 2021.

[32] Naman Jain, King Han, Alex Gu, Wen-Ding Li, Fanjia Yan, Tianjun Zhang, Sida Wang, Armando Solar-Lezama, Koushik Sen, and Ion Stoica. LiveCodeBench: Holistic and contamination free evaluation of large language models for code. arXiv preprint arXiv:2403.07974, 2024.

[33] Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the MATH dataset. NeurIPS Datasets and Benchmarks Track, 2021.

[34] Cheng-Ping Hsieh, Simeng Sun, Samuel Kriman, et al. RULER: What's the real context size of your long-context language models? arXiv preprint arXiv:2404.06654, 2024.

[35] vibhavagarwal5. TurboQuant: 2-bit KV cache compression with 4x capacity. https://github.com/vllm-project/vllm/pull/38479, 2026. vLLM pull request #38479.

[36] Amir Zandieh, Majid Daliri, Majid Hadian, and Vahab Mirrokni. TurboQuant: Online vector quantization with near-optimal distortion rate. arXiv preprint arXiv:2504.19874, 2025.

[37] Ky Fan. Maximum properties and inequalities for the eigenvalues of completely continuous operators. Proceedings of the National Academy of Sciences of the United States of America, 37(11):760–766, 1951. doi: 10.1073/pnas.37.11.760.

[38] Yuhong Li, Yingbing Huang, Bowen Yang, Bharat Venkitesh, Acyr Locatelli, Hanchen Ye, Tianle Cai, Patrick Lewis, and Deming Chen. SnapKV: LLM knows what you are looking for before generation. arXiv preprint arXiv:2404.14469, 2024.

[39] Zefan Cai et al. PyramidKV: Dynamic KV cache compression based on pyramidal information funneling. arXiv preprint arXiv:2406.02069, 2024.

[40] Guangxuan Xiao, Yuandong Tian, Beidi Chen, Song Han, and Mike Lewis. Efficient streaming language models with attention sinks. In International Conference on Learning Representations, 2024.

[41] Yefei He, Luoming Zhang, Weijia Wu, Jing Liu, Hong Zhou, and Bohan Zhuang. ZipCache: Accurate and efficient KV cache quantization with salient token identification. In Advances in Neural Information Processing Systems, 2024.

[42] Hao Kang, Qingru Zhang, Souvik Kundu, Geonhwa Jeong, Zaoxing Liu, Tushar Krishna, and Tuo Zhao. GEAR: An efficient KV cache compression recipe for near-lossless generative inference of LLM. arXiv preprint arXiv:2403.05527, 2024.

[43] Chi-Chih Chang, Wei-Cheng Lin, Chien-Yu Lin, Chong-Yan Chen, Yu-Fang Hu, Pei-Shuo Wang, Ning-Chi Huang, Luis Ceze, Mohamed S. Abdelfattah, and Kai-Chiang Wu. PALU: Compressing KV-cache with low-rank projection. arXiv preprint arXiv:2407.21118, 2024.

[44] Chi-Chih Chang, Chien-Yu Lin, Yash Akhauri, Wei-Cheng Lin, Kai-Chiang Wu, Luis Ceze, and Mohamed S. Abdelfattah. xKV: Cross-layer SVD for KV-cache compression. arXiv preprint arXiv:2503.18893, 2025.

[45] Bokai Lin, Zihao Zeng, Zipeng Xiao, Siqi Kou, Tianqi Hou, Xiaofeng Gao, Hao Zhang, and Zhijie Deng. MatryoshkaKV: Adaptive KV compression via trainable orthogonal projection. arXiv preprint arXiv:2410.14731, 2024.

[46] Haojie Duanmu, Zhihang Yuan, Xiuhong Li, Jiangfei Duan, Xingcheng Zhang, and Dahua Lin. SKVQ: Sliding-window key and value cache quantization for large language models. In Conference on Language Modeling, 2024.

[47] Tengxuan Liu, Shiyao Li, Jiayi Yang, Tianchen Zhao, Feng Zhou, Xiaohui Song, Guohao Dai, Shengen Yan, Huazhong Yang, and Yu Wang. PM-KVQ: Progressive mixed-precision KV cache quantization for long-CoT LLMs. arXiv preprint arXiv:2505.18610, 2025.

[48] Mohsen Hariri, Alan Luo, Weicong Chen, Shaochen Zhong, Tianyi Zhang, Qifan Wang, Xia Hu, Xiaotian Han, and Vipin Chaudhary. Quantize what counts: More for keys, less for values. arXiv preprint arXiv:2502.15075, 2025.

[49] Saleh Ashkboos, Mahdi Nikdan, Soroush Tabesh, Roberto L. Castro, Torsten Hoefler, and Dan Alistarh. HALO: Hadamard-assisted lower-precision optimization for LLMs. arXiv preprint arXiv:2501.02625, 2025.

[50] Seonggon Kim, Juncheol Shin, Seung-taek Woo, and Eunhyeok Park. HOT: Hadamard-based optimized training. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 4787–4796, 2025.

[51] Vage Egiazarian, Roberto L. Castro, Denis Kuznedelev, Andrei Panferov, Eldar Kurtic, Shubhra Pandit, Alexandre Noll Marques, Mark Kurtz, Saleh Ashkboos, Torsten Hoefler, and Dan Alistarh. Bridging the gap between promise and performance for microscaling FP4 quantization. In The Fourteenth International Conference on Learning Representations, 2026. URL https...

[52] Utkarsh Saxena and Kaushik Roy. KVLinC: KV cache quantization with Hadamard rotation and linear correction. arXiv preprint arXiv:2510.05373, 2025.

[53] Patrick H. Chen, Hsiang-Fu Yu, Inderjit S. Dhillon, and Cho-Jui Hsieh. DRONE: Data-aware low-rank compression for large NLP models. Advances in Neural Information Processing Systems, 34:29321–29334, 2021.

[54] Zhihang Yuan, Yuzhang Shang, Yang Song, Qiang Wu, Yan Yan, and Guangyu Sun. ASVD: Activation-aware singular value decomposition for compressing large language models. arXiv preprint arXiv:2312.05821, 2023.

[55] Xin Wang, Yu Zheng, Zhongwei Wan, and Mi Zhang. SVD-LLM: Truncation-aware singular value decomposition for large language model compression. arXiv preprint arXiv:2403.07378, 2024.

[56] Xin Wang, Samiul Alam, Zhongwei Wan, Hui Shen, and Mi Zhang. SVD-LLM v2: Optimizing singular value truncation for large language model compression. In Proceedings of the Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies, 2025.

[57] Yibo Yang, Xiaojie Li, Zhongzhu Zhou, Shuaiwen Leon Song, Jianlong Wu, Liqiang Nie, and Bernard Ghanem. CorDA: Context-oriented decomposition adaptation of large language models. arXiv preprint arXiv:2406.05223, 2024.

[58] Seonggon Kim, Taehyeon Kim, and Eunhyeok Park. HaPPI: Efficient KV cache compression with Hadamard PCA-based power iteration. OpenReview, 2025. URL https://openreview.net/forum?id=BRDgQzdtWr. Submitted to ICLR 2026.

[59] Zhongzhu Zhou, Fengxiang Bie, Ziyan Chen, Zhenyu Zhang, Yibo Yang, Junxiong Wang, Ben Athiwaratkun, Xiaoxia Wu, and Shuaiwen Leon Song. CARE: Covariance-aware and rank-enhanced decomposition for enabling multi-head latent attention. In International Conference on Learning Representations, 2026.

[60] Xianglong Yan, Zhiteng Li, Tianao Zhang, Linghe Kong, Yulun Zhang, and Xiaokang Yang. RecalKV: Low-rank KV cache compression via head reordering and offline calibration. arXiv preprint arXiv:2505.24357, 2025.

[61] Yixuan Wang, Haoyu Qiao, Lujun Li, Qingfu Zhu, and Wanxiang Che. CommonKV: Compressing KV cache with cross-layer parameter sharing. arXiv preprint arXiv:2508.16134, 2025.

[62] Ji Lin, Jiaming Tang, Haotian Tang, Shang Yang, Wei-Ming Chen, Wei-Chen Wang, Guangxuan Xiao, Xingyu Dang, Chuang Gan, and Song Han. AWQ: Activation-aware weight quantization for on-device LLM compression and acceleration. In Proceedings of Machine Learning and Systems, 2024.

[63] Wenqi Shao, Mengzhao Chen, Zhaoyang Zhang, Peng Xu, Lirui Zhao, Zhiqian Li, Kaipeng Zhang, Peng Gao, Yu Qiao, and Ping Luo. OmniQuant: Omnidirectionally calibrated quantization for large language models. arXiv preprint arXiv:2308.13137, 2023.

[64] Akshat Ramachandran, Marina Neseem, Charbel Sakr, Rangharajan Venkatesan, Brucek Khailany, and Tushar Krishna. ThinKV: Thought-adaptive KV cache compression for efficient reasoning models. arXiv preprint arXiv:2510.01290, 2025. URL https://arxiv.org/abs/2510.01290.

A Additional Details and Theoretical Analysis

A.1 Hadamard Transform

The Hadamard tran...