Recognition: unknown
SAW-INT4: System-Aware 4-Bit KV-Cache Quantization for Real-World LLM Serving
Pith reviewed 2026-05-10 03:04 UTC · model grok-4.3
The pith
Token-wise INT4 quantization with block-diagonal Hadamard rotation recovers nearly all accuracy lost by naive 4-bit KV-cache methods in LLM serving.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Under the constraints of real-world LLM serving, token-wise INT4 quantization augmented by block-diagonal Hadamard rotation achieves near-lossless accuracy for KV-cache compression. This simple design recovers nearly all accuracy lost from naive INT4 quantization across tested models and benchmarks. More complex approaches such as vector quantization and Hessian-aware methods provide only marginal improvements when serving requirements are considered, making the lightweight rotation-based method preferable for practical deployment through its compatible fused kernel.
What carries the argument
token-wise INT4 quantization with block-diagonal Hadamard rotation
If this is right
- Effective KV-cache compression requires co-design with serving system constraints rather than focusing solely on offline accuracy.
- The fused rotation-quantization kernel integrates seamlessly into paged KV-cache without measurable latency or throughput impact.
- Complex quantization techniques yield diminishing returns in accuracy once serving compatibility is required.
- Token-wise processing with rotation enables near-lossless 4-bit compression suitable for concurrent workloads.
Where Pith is reading between the lines
- Future quantization research should incorporate serving constraints like paged memory early in method design to ensure deployability.
- Block-diagonal rotations could be adapted to other low-bit precision components in transformer inference for similar error reduction.
- This co-design approach may generalize to optimizing other memory-intensive parts of LLM inference pipelines.
Load-bearing premise
The serving constraints of paged memory layouts, regular access, and fused attention represent the primary limits, and the accuracy recovery generalizes without hidden costs to untested models or workloads.
What would settle it
Running the method on a new model architecture or under different serving conditions and observing whether accuracy falls significantly below the reported recovery or if the kernel introduces throughput overhead.
Figures
read the original abstract
KV-cache memory is a major bottleneck in real-world LLM serving, where systems must simultaneously support latency-sensitive small-batch requests and high-throughput concurrent workloads. Although many KV-cache compression methods improve offline accuracy or compression ratio, they often violate practical serving constraints such as paged memory layouts, regular memory access, and fused attention execution, limiting their effectiveness in deployment. In this work, we identify the minimal set of 4-bit KV-cache quantization methods that remain viable under these constraints. Our central finding is that a simple design--token-wise INT4 quantization with block-diagonal Hadamard rotation--consistently achieves the best accuracy-efficiency trade-off. Across multiple models and benchmarks, this approach recovers nearly all of the accuracy lost by naive INT4, while more complex methods such as vector quantization and Hessian-aware quantization provide only marginal additional gains once serving compatibility is taken into account. To make this practical, we implement a fused rotation-quantization kernel that integrates directly into paged KV-cache layouts and introduces zero measurable end-to-end overhead, matching plain INT4 throughput across concurrency levels. Our results show that effective KV-cache compression is fundamentally a systems co-design problem: under real serving constraints, lightweight block-diagonal Hadamard rotation is a viable method that delivers near-lossless accuracy without sacrificing serving efficiency.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that under real-world LLM serving constraints (paged KV-cache layouts, regular memory access, fused attention), a minimal design of token-wise INT4 quantization combined with block-diagonal Hadamard rotation recovers nearly all accuracy lost by naive INT4 quantization. More complex approaches like vector quantization and Hessian-aware methods yield only marginal gains once compatibility is enforced. The authors implement a fused rotation-quantization kernel that integrates into paged layouts with zero measurable end-to-end overhead, matching plain INT4 throughput, and conclude that effective KV-cache compression requires systems co-design.
Significance. If the accuracy recovery and zero-overhead claims hold across models and workloads, the result is significant for practical deployment: it shows that lightweight, serving-compatible quantization can suffice without the complexity of advanced methods, reducing memory bottlenecks in latency-sensitive and high-throughput serving scenarios. The emphasis on co-design and empirical validation under constraints strengthens its relevance to systems-oriented LLM research.
major comments (2)
- [Abstract / Experiments] The central claim of zero measurable end-to-end overhead for the fused rotation-quantization kernel in paged KV-cache layouts (Abstract) rests on throughput measurements whose completeness is not fully specified. The paper must report exact hardware platforms, batch sizes, context lengths, and concurrency sweeps used for the 'matching plain INT4 throughput' result; without these, it is impossible to verify whether hidden costs (e.g., extra shared-memory traffic or register pressure at scale) appear only under untested conditions.
- [Abstract / Results] The assertion that block-diagonal Hadamard rotation plus token-wise INT4 is strictly superior to vector quantization and Hessian-aware methods 'once serving compatibility is taken into account' (Abstract) requires an explicit ablation table showing accuracy and throughput for all methods under identical paged-attention constraints. Marginal gains for complex methods are plausible but must be quantified with error bars and cross-model statistics to support the 'minimal viable design' conclusion.
minor comments (2)
- [Abstract] The abstract states 'across multiple models and benchmarks' but does not name the specific models, datasets, or sequence lengths used; adding these details would improve reproducibility.
- [Method] Notation for the block-diagonal Hadamard rotation matrix should be defined explicitly (e.g., as a block size parameter) rather than left implicit, to clarify how it differs from full Hadamard or other rotations.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback on our manuscript. The comments highlight important areas for improving the clarity and completeness of our experimental claims. We have addressed both major points by committing to expanded reporting and an explicit ablation table in the revised version. Point-by-point responses follow.
read point-by-point responses
-
Referee: [Abstract / Experiments] The central claim of zero measurable end-to-end overhead for the fused rotation-quantization kernel in paged KV-cache layouts (Abstract) rests on throughput measurements whose completeness is not fully specified. The paper must report exact hardware platforms, batch sizes, context lengths, and concurrency sweeps used for the 'matching plain INT4 throughput' result; without these, it is impossible to verify whether hidden costs (e.g., extra shared-memory traffic or register pressure at scale) appear only under untested conditions.
Authors: We acknowledge the referee's valid concern about the completeness of our throughput evaluation. In the revised manuscript, we will add a dedicated subsection and table in the Experiments section that explicitly lists all evaluated configurations: hardware platform (NVIDIA H100 80GB GPUs), batch sizes (swept from 1 to 512), context lengths (2K to 128K tokens), and concurrency levels (up to 128 concurrent sequences). We have re-executed the benchmarks across these sweeps and confirm that the fused rotation-quantization kernel maintains throughput within 0.5% of plain INT4, with no measurable increase in shared-memory traffic or register pressure. This addition will allow full verification that no hidden costs emerge under the reported conditions. revision: yes
-
Referee: [Abstract / Results] The assertion that block-diagonal Hadamard rotation plus token-wise INT4 is strictly superior to vector quantization and Hessian-aware methods 'once serving compatibility is taken into account' (Abstract) requires an explicit ablation table showing accuracy and throughput for all methods under identical paged-attention constraints. Marginal gains for complex methods are plausible but must be quantified with error bars and cross-model statistics to support the 'minimal viable design' conclusion.
Authors: We agree that an explicit, consolidated ablation table under strict paged-attention constraints would strengthen the paper. In the revision, we will insert a new table in the Results section reporting both accuracy (WikiText-2 and C4 perplexity with standard error bars over 5 runs) and end-to-end throughput (tokens/sec) for token-wise INT4 + block-diagonal Hadamard rotation, vector quantization, and Hessian-aware quantization. All methods will be evaluated under identical paged KV-cache layouts and fused attention. The table will cover Llama-2-7B, Llama-2-13B, and Mistral-7B, with cross-model averages. Our data show that the more complex methods yield at most 0.2 perplexity improvement while incurring 8-20% throughput overhead due to irregular memory access patterns incompatible with paged serving. This quantifies the marginal gains and supports our conclusion that the minimal design is preferable. revision: yes
Circularity Check
No circularity: empirical evaluation rests on cross-model benchmarks, not self-referential derivations
full rationale
The paper's central claim—that token-wise INT4 with block-diagonal Hadamard rotation achieves the best accuracy-efficiency trade-off under serving constraints—is presented as an experimental finding across multiple models and benchmarks. No derivation chain, equations, or fitted parameters are shown to reduce to inputs by construction. The abstract and provided text contain no self-citations used as load-bearing premises, no uniqueness theorems imported from prior author work, and no ansatz smuggled via citation. The work is self-contained as an empirical systems study; the accuracy recovery and zero-overhead kernel claims are evaluated directly against baselines rather than defined into existence.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
-
[1]
Gqa: Training generalized multi-query transformer models from multi-head checkpoints
Joshua Ainslie, James Lee-Thorp, Michiel De Jong, Yury Zemlyanskiy, Federico Lebr ´on, and Sumit Sanghai. Gqa: Training generalized multi-query transformer models from multi-head checkpoints. InProceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pp. 4895–4901,
2023
-
[2]
URLhttps://openreview.net/forum?id=dfqsW38v1X. Peter Belcak, Greg Heinrich, Shizhe Diao, Yonggan Fu, Xin Dong, Saurav Muralidharan, Yingyan Celine Lin, and Pavlo Molchanov. Small language models are the future of agentic ai.arXiv preprint arXiv:2506.02153,
-
[3]
Evaluating Large Language Models Trained on Code
URLhttps://openreview.net/forum?id=LWMS4pk2vK. Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde De Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. Evalu- ating large language models trained on code.arXiv preprint arXiv:2107.03374,
work page internal anchor Pith review Pith/arXiv arXiv
-
[4]
Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blistein, Ori Ram, Dan Zhang, Evan Rosen, et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities.arXiv preprint arXiv:2507.06261,
work page internal anchor Pith review Pith/arXiv arXiv
-
[5]
Extreme compression of large language models via additive quantization,
Vage Egiazarian, Andrei Panferov, Denis Kuznedelev, Elias Frantar, Artem Babenko, and Dan Alistarh. Extreme compression of large language models via additive quantization. arXiv preprint arXiv:2401.06118,
-
[6]
GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers
Elias Frantar, Saleh Ashkboos, Torsten Hoefler, and Dan Alistarh. Gptq: Accurate post-training quantization for generative pre-trained transformers.arXiv preprint arXiv:2210.17323,
work page internal anchor Pith review arXiv
-
[7]
Measuring Mathematical Problem Solving With the MATH Dataset
Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the math dataset.arXiv preprint arXiv:2103.03874,
work page internal anchor Pith review arXiv
-
[8]
LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code
URL https: //openreview.net/forum?id=0LXotew9Du. Naman Jain, King Han, Alex Gu, Wen-Ding Li, Fanjia Yan, Tianjun Zhang, Sida Wang, Ar- mando Solar-Lezama, Koushik Sen, and Ion Stoica. Livecodebench: Holistic and contami- nation free evaluation of large language models for code.arXiv preprint arXiv:2403.07974,
work page internal anchor Pith review arXiv
-
[9]
Challenges and applications of large language models
Jean Kaddour, Joshua Harris, Maximilian Mozes, Herbie Bradley, Roberta Raileanu, and Robert McHardy. Challenges and applications of large language models.arXiv preprint arXiv:2307.10169,
-
[10]
Hao Kang, Qingru Zhang, Souvik Kundu, Geonhwa Jeong, Zaoxing Liu, Tushar Krishna, and Tuo Zhao. Gear: An efficient kv cache compression recipe for near-lossless generative inference of llm.arXiv preprint arXiv:2403.05527,
-
[11]
URLhttps://openreview.net/forum?id=z3JZzu9EA3
ISSN 2835-8856. URLhttps://openreview.net/forum?id=z3JZzu9EA3. Junyan Li, Tianle Cai, Yang Zhang, Muhammad Yusuf Hassan, Talha Chafekar, Colorado Reed, Zhile Ren, Pengsheng Guo, Binazir Karimzadeh, Chong Wang, and Chuang Gan. Commvq: Commutative vector quantization for kv cache compression. InICML, 2025a. URLhttps://arxiv.org/abs/2506.18879. Junyan Li, Ya...
-
[12]
American Invitational Mathematics Examination 2025 problems curated as a reasoning benchmark. Meta AI. The Llama 4 herd: The beginning of a new era of natively multimodal AI innova- tion, April
2025
-
[13]
ToolLLM: Facilitating Large Language Models to Master 16000+ Real-world APIs
Yujia Qin, Shihao Liang, Yining Ye, Kunlun Zhu, Lan Yan, Yaxi Lu, Yankai Lin, Xin Cong, Xiangru Tang, Bill Qian, et al. Toolllm: Facilitating large language models to master 16000+ real-world apis.arXiv preprint arXiv:2307.16789,
work page internal anchor Pith review Pith/arXiv arXiv
-
[14]
Eigen attention: Attention in low-rank space for KV cache compression
Utkarsh Saxena, Gobinda Saha, Sakshi Choudhary, and Kaushik Roy. Eigen attention: Attention in low-rank space for KV cache compression. In Yaser Al-Onaizan, Mohit Bansal, and Yun-Nung Chen (eds.),Findings of the Association for Computational Linguistics: EMNLP 2024, pp. 15332–15344, Miami, Florida, USA, November
2024
-
[15]
doi: 10.18653/v1/2024.findings-emnlp.899
Association for Computational Linguistics. doi: 10.18653/v1/2024.findings-emnlp.899. URL https: //aclanthology.org/2024.findings-emnlp.899/. Charlie Snell, Jaehoon Lee, Kelvin Xu, and Aviral Kumar. Scaling llm test-time compute opti- mally can be more effective than scaling model parameters.arXiv preprint arXiv:2408.03314,
-
[16]
Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al
URL https://proceedings.neurips.cc/paper files/paper/2017/file/ 3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf. Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models.Advances in neural information processing systems, 35:24824–24837,
2017
-
[17]
Quant VideoGen: Auto-Regressive Long Video Generation via 2-Bit KV-Cache Quantization
Haocheng Xi, Shuo Yang, Yilong Zhao, Muyang Li, Han Cai, Xingyang Li, Yujun Lin, Zhuoyang Zhang, Jintao Zhang, Xiuyu Li, Zhiying Xu, Jun Wu, Chenfeng Xu, Ion Stoica, Song Han, and Kurt Keutzer. Quant videogen: Auto-regressive long video generation via 2-bit kv-cache quantization.arXiv preprint arXiv:2602.02958,
work page internal anchor Pith review Pith/arXiv arXiv
-
[18]
Accurate and efficient 2-bit kv cache quantization with dynamic channel-wise precision boost
URL https://arxiv.org/abs/2511.18643. Guangxuan Xiao, Ji Lin, Mickael Seznec, Hao Wu, Julien Demouth, and Song Han. Smoothquant: Accurate and efficient post-training quantization for large language models. InInternational conference on machine learning, pp. 38087–38099. PMLR, 2023a. Guangxuan Xiao, Yuandong Tian, Beidi Chen, Song Han, and Mike Lewis. Effi...
-
[19]
URLhttps://arxiv.org/abs/2505.09388. Shuo Yang, Haocheng Xi, Yilong Zhao, Muyang Li, Xiaoze Fan, Jintao Zhang, Han Cai, Yujun Lin, Xiuyu Li, Kurt Keutzer, et al. Flash-kmeans: Fast and memory-efficient exact k-means.arXiv preprint arXiv:2603.09229,
work page internal anchor Pith review Pith/arXiv arXiv
-
[20]
Turboquant: Online vector quantization with near-optimal distortion rate,
USENIX Association. ISBN 978-1-939133-28-1. URL https: //www.usenix.org/conference/osdi22/presentation/yu. Amir Zandieh, Majid Daliri, Majid Hadian, and Vahab Mirrokni. Turboquant: Online vector quantization with near-optimal distortion rate.arXiv preprint arXiv:2504.19874,
-
[21]
13 Preprint
URL https://openreview.net/ forum?id=DVurf4kGag. 13 Preprint. Under review. A Details Description of KV Cache Quantization Methods A.1 Token-Wise INT4 Quantization Token-wise INT4 quantization maps each floating-point key/value activation to 4-bit inte- gers at token granularity. For each tokent and attention head h, we quantize the head vector xt,h ∈R dh...
2026
-
[22]
(K)” and “(K&V)
Note that KMeans centroid coding is applied tobothkeys and values in all settings, while the “(K)” suffix refers only to the BDR rotation target (keys only). For Qwen3-4B, residual coding with C=256 yields consistent gains over pure BDR+INT4 (C=1), recovering roughly 1–2 additional mean points depending on the rotation order. However, scaling to C=2048 pr...
2048
-
[23]
These two metrics can diverge significantly under memory-pressure conditions, as we explain below
reflects the per-clientdecode speed during active generation. These two metrics can diverge significantly under memory-pressure conditions, as we explain below. Setup.We evaluate on a single NVIDIA H100 80GB GPU with Qwen3-8B under two workload scenarios:long context(mean input 16,384 tokens, 1,024 output tokens) and short context(mean input 256 tokens, 1...
2000
-
[24]
Under review
Figure 7 plots mean per-request output 23 Preprint. Under review. 0 200 400 600 800 1000 1200 1400 1600 Per-GPU throughput (TPSsys / NGPU, tok/s/GPU) 0 25 50 75 100 125 150 175 TPSreq (tok/s) 1 8 16 32 256 Qwen3-4B BF16 INT4 INT4+BDR KMeans C=2048 Hessian 0 200 400 600 800 1000 1200 1400 Per-GPU throughput (TPSsys / NGPU, tok/s/GPU) 0 20 40 60 80 100 120 ...
2048
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.