RateQuant: Optimal Mixed-Precision KV Cache Quantization via Rate-Distortion Theory
Pith reviewed 2026-05-11 01:04 UTC · model grok-4.3
The pith
RateQuant fits each quantizer's own distortion curve from a small calibration set, then uses reverse waterfilling to allocate bits across attention heads in the KV cache.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper shows that mixed-precision KV cache quantization fails under distortion model mismatch: each quantizer follows its own curve D(b) = alpha * beta^{-b}, with decay rate beta ranging from 3.6 to 5.3 across designs, so applying one quantizer's distortion model to another misallocates bits. RateQuant fits alpha and beta for every quantizer from a small calibration set, then applies reverse waterfilling to obtain the bit allocation that minimizes total distortion subject to an average-rate constraint. The resulting allocations improve perplexity substantially over both uniform quantization and mismatched mixed-precision baselines.
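As a sketch only (not the paper's own derivation, and with bit-widths treated as continuous rather than rounded to supported precisions), the closed-form structure of the allocation under the stated exponential model follows from the usual Lagrangian argument:

```latex
% Continuous-bit sketch of the allocation problem under D_i(b_i) = \alpha_i \beta_i^{-b_i};
% integer rounding and the paper's exact closed form are omitted.
\begin{aligned}
&\min_{b_1,\dots,b_N \ge 0} \; \sum_{i=1}^{N} \alpha_i \beta_i^{-b_i}
  \qquad \text{s.t.} \qquad \frac{1}{N}\sum_{i=1}^{N} b_i \le B, \\
&\text{stationarity of } \sum_i \alpha_i \beta_i^{-b_i} + \lambda\Bigl(\textstyle\sum_i b_i - NB\Bigr)
  \;\Longrightarrow\; \alpha_i (\ln\beta_i)\, \beta_i^{-b_i} = \lambda, \\
&\Longrightarrow\;
  b_i^{\star} = \max\!\Bigl(0,\; \frac{\ln(\alpha_i \ln\beta_i) - \ln\lambda}{\ln\beta_i}\Bigr),
  \qquad \lambda \text{ chosen so that } \tfrac{1}{N}\textstyle\sum_i b_i^{\star} = B.
\end{aligned}
```

When all beta_i coincide, the water level equalizes per-head distortion, the textbook reverse-waterfilling picture; when they differ, it equalizes the marginal quantity alpha_i (ln beta_i) beta_i^{-b_i}, so the allocation order depends on beta as well as alpha, which is one way a mismatched beta can reorder it.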
What carries the argument
Per-quantizer exponential distortion model D(b) = alpha * beta^{-b} fitted from calibration data, combined with reverse waterfilling to solve the bit-allocation optimization in closed form.
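A minimal NumPy sketch of that two-step recipe, assuming per-head distortions have already been measured at a few probe bit-widths on the calibration set; the function names, the log-space least-squares fit, and the bisection on the water level are illustrative choices rather than the paper's implementation:

```python
import numpy as np

def fit_distortion_model(bits, distortions):
    """Least-squares fit of D(b) = alpha * beta**(-b) in log space.

    bits        : probed bit-widths (e.g. [2, 3, 4, 8])
    distortions : measured distortions at those widths on the calibration set
    Returns (alpha, beta).
    """
    # log D = log(alpha) - b * log(beta)  ->  ordinary least squares on (b, log D)
    slope, intercept = np.polyfit(np.asarray(bits, float), np.log(distortions), 1)
    return float(np.exp(intercept)), float(np.exp(-slope))

def allocate_bits(alpha, beta, avg_bits, tol=1e-8):
    """Continuous reverse-waterfilling allocation for D_i(b) = alpha_i * beta_i**(-b).

    alpha, beta : per-head fitted parameters (arrays of equal length)
    avg_bits    : target average bit-width B
    Returns the per-head bit vector before rounding.
    """
    alpha, beta = np.asarray(alpha, float), np.asarray(beta, float)
    log_beta = np.log(beta)

    def bits_for(log_lam):
        # b_i = max(0, [ln(alpha_i * ln(beta_i)) - ln(lambda)] / ln(beta_i))
        return np.clip((np.log(alpha * log_beta) - log_lam) / log_beta, 0.0, None)

    # Bisection on log(lambda): a larger water level means fewer bits everywhere.
    lo, hi = -60.0, 60.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if bits_for(mid).mean() > avg_bits:
            lo = mid
        else:
            hi = mid
    return bits_for(0.5 * (lo + hi))
```

Rounding the continuous allocation to the bit-widths the kernel actually supports, and re-checking the average-rate constraint after rounding, is left out of the sketch.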
If this is right
- At 2.5 average bits on Qwen3-8B, RateQuant reduces KIVI perplexity from 49.3 to 14.9.
- The same setting improves QuaRot perplexity by 6.6 points.
- Calibration completes in 1.6 seconds on one GPU and adds zero overhead at inference time.
- Correct per-quantizer models prevent the reversal of bit-allocation order that occurs under mismatch.
Where Pith is reading between the lines
- The same calibration-plus-reverse-waterfilling pattern could apply to quantizing weights or activations where sensitivity also varies across components.
- The observed range of beta values indicates that quantizer selection itself shapes the optimal allocation beyond the choice of bit widths.
- Testing the fitted models on longer sequences or out-of-distribution prompts would check whether the calibration remains stable over extended generation.
Load-bearing premise
The exponential distortion parameters fitted on the calibration set accurately describe how each quantizer behaves on the data distribution encountered during actual inference.
What would settle it
Running RateQuant's bit allocation on a new model or dataset and finding higher perplexity than a uniform bit-width baseline of the same average rate would show the fitted models do not generalize.
Original abstract
Large language models cache all previously computed key-value (KV) pairs during generation, and this KV cache grows linearly with sequence length, making it a primary memory bottleneck for serving. Quantizing the KV cache to fewer bits reduces this cost, yet all current quantizers assign the same bit-width to every attention head, ignoring the large variation in head importance. A natural idea is to allocate more bits to important heads and fewer to the rest. We show, however, that such mixed-precision allocation has a hidden pitfall: each quantizer follows a different distortion curve D(b)=alpha*beta^{-b}, and the decay rate beta varies from 3.6 to 5.3 across quantizer designs. Applying one quantizer's distortion model to another inverts the allocation order and makes performance worse than uniform quantization. We call this failure mode distortion model mismatch and propose RateQuant to resolve it. RateQuant fits a per-quantizer distortion model from a small calibration set, then solves the resulting bit-allocation problem in closed form via reverse waterfilling from rate-distortion theory. On Qwen3-8B at 2.5 average bits, calibrated RateQuant reduces KIVI's perplexity from 49.3 to 14.9 (70% reduction) and improves QuaRot by 6.6 PPL. The entire calibration takes 1.6 s on a single GPU and adds zero overhead at inference time.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces RateQuant for mixed-precision KV cache quantization in LLMs. It fits an exponential distortion model D(b)=α β^{-b} per quantizer from a small calibration set, then derives a closed-form bit allocation via reverse waterfilling from rate-distortion theory to minimize total distortion at a target average bit rate. Experiments report large perplexity gains on Qwen3-8B (e.g., KIVI from 49.3 to 14.9 PPL at 2.5 bits) while highlighting the risk of distortion-model mismatch when β varies across quantizers.
Significance. If the fitted per-quantizer models remain accurate on inference KV distributions, the method supplies a principled, low-overhead way to allocate bits according to head-specific importance. The closed-form solution, 1.6 s calibration, and explicit treatment of the mismatch failure mode are concrete strengths that could influence practical KV quantization pipelines.
major comments (1)
- [RateQuant bit-allocation procedure] The optimality guarantee via reverse waterfilling holds only under the assumption that the fitted exponential distortion curves accurately predict behavior on the inference distribution. The manuscript demonstrates that modest β mismatch (3.6 vs 5.3) inverts the allocation and can degrade performance below uniform quantization, yet provides no quantitative check (e.g., held-out prediction error or long-sequence validation) that the calibration-set parameters remain valid for the target generation statistics.
minor comments (1)
- [Abstract] The abstract states concrete perplexity numbers; adding the number of calibration sequences and the exact calibration prompt distribution would help readers assess how representative the fit is.
Simulated Author's Rebuttal
We thank the referee for the thoughtful and constructive review. The concern regarding validation of the fitted distortion models on inference distributions is well-taken. We address it directly below and will strengthen the manuscript accordingly.
Point-by-point responses
Referee: The optimality guarantee via reverse waterfilling holds only under the assumption that the fitted exponential distortion curves accurately predict behavior on the inference distribution. The manuscript demonstrates that modest β mismatch (3.6 vs 5.3) inverts the allocation and can degrade performance below uniform quantization, yet provides no quantitative check (e.g., held-out prediction error or long-sequence validation) that the calibration-set parameters remain valid for the target generation statistics.
Authors: We agree that the closed-form optimality of reverse waterfilling is conditional on the accuracy of the per-quantizer exponential models D(b) = α β^{-b} when applied to the actual inference KV distributions. The manuscript already identifies distortion-model mismatch as a concrete failure mode and shows that fitting separate parameters per quantizer (from a calibration run that completes in 1.6 s) avoids the inversion that occurs when a single model is used. The reported 70% perplexity reduction versus KIVI on Qwen3-8B at 2.5 average bits supplies indirect evidence that the fitted curves generalize sufficiently for the evaluated tasks. Nevertheless, we concur that an explicit quantitative check would be valuable. In the revised manuscript we will add a new subsection that reports (i) the mean-squared prediction error of the fitted D(b) curves on a held-out portion of the calibration data and (ii) the same error measured on KV statistics collected during actual long-sequence generation. These results will quantify the residual mismatch risk and confirm that the calibration procedure remains reliable for the target generation statistics. Revision: yes.
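For concreteness, one hedged sketch of how the promised check could be scored (the function and the log-space metric are assumptions, not taken from the manuscript): compare the calibration-fitted curve's predictions against distortions measured on a held-out split or during long-sequence generation.

```python
import numpy as np

def heldout_log_mse(alpha, beta, bits, heldout_distortions):
    """Prediction error of a fitted curve D(b) = alpha * beta**(-b)
    against distortions measured on held-out or long-sequence data.

    bits                : probe bit-widths, shape (K,)
    heldout_distortions : measured distortions at those widths, shape (K,)
    Returns mean squared error in log-distortion space, so errors are
    comparable across heads with very different alpha scales.
    """
    predicted = np.log(alpha) - np.asarray(bits, float) * np.log(beta)
    return float(np.mean((predicted - np.log(heldout_distortions)) ** 2))
```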
Circularity Check
No significant circularity; bit allocation follows from empirical per-quantizer fits plus standard reverse waterfilling.
full rationale
The derivation fits parameters alpha and beta of the exponential distortion model D(b) = alpha * beta^{-b} on a calibration set, then applies the closed-form reverse-waterfilling solution from classical rate-distortion theory to obtain the bit vector. This is a standard optimization step once the model is given; the final allocation is data-dependent rather than tautological or self-definitional. No self-citation chain, uniqueness theorem, or ansatz smuggling is invoked in the provided text to justify the core construction, and the exponential form is presented as an observed empirical pattern rather than derived from the target result. Once the calibration assumption is granted, the approach therefore rests on standard rate-distortion machinery and can be judged against external benchmarks rather than against its own outputs.
Axiom & Free-Parameter Ledger
free parameters (1)
- alpha and beta per quantizer
axioms (2)
- domain assumption: Quantizer distortion follows the exponential form D(b) = alpha * beta^{-b}
- standard math: Reverse waterfilling yields the optimal bit allocation under the fitted distortion model
Reference graph
Works this paper leans on
- [1] Saleh Ashkboos, Amirkeivan Mohtashami, Maximilian L. Croci, Bo Li, Martin Jaggi, Dan Alistarh, Torsten Hoefler, and James Hensman. QuaRot: Outlier-free 4-bit inference in rotated LLMs. ICML, 2024.
- [2] Jiahao Chen, Fangcheng Wei, Zhuowei Liu, and Zhongqiu Peng. KVmix: Gradient-based layer importance-aware mixed-precision quantization for KV cache. arXiv preprint arXiv:2506.08018, 2025.
- [3] Yilong Chen et al. Progressive mixed-precision KV cache quantization for long-CoT LLMs. arXiv preprint arXiv:2505.18610, 2025.
- [4] Thomas M. Cover and Joy A. Thomas. Elements of Information Theory. John Wiley & Sons, 2nd edition, 2006.
- [5] Tim Dettmers, Mike Lewis, Younes Belkada, and Luke Zettlemoyer. GPT3.int8(): 8-bit matrix multiplication for transformers at scale. NeurIPS, 2022.
- [6] Zhen Dong, Zhewei Yao, Amir Gholami, Michael W. Mahoney, and Kurt Keutzer. HAWQ: Hessian aware quantization of neural networks with mixed-precision. ICCV, 2019.
- [7] Zhen Dong, Zhewei Yao, Daiyaan Arfeen, Amir Gholami, Michael W. Mahoney, and Kurt Keutzer. HAWQ-V2: Hessian aware trace-weighted quantization of neural networks. NeurIPS, 2020.
- [8] Leo Gao, Jonathan Tow, et al. A framework for few-shot language model evaluation. 2024. URL https://zenodo.org/records/10256836.
- [9] Suyu Ge, Yunan Zhang, Liyuan Liu, Minjia Zhang, Jiawei Han, and Jianfeng Gao. Model tells you what to discard: Adaptive KV cache compression for LLMs. ICLR, 2024.
- [10] Coleman Hooper, Sehoon Kim, Hiva Mohammadi, Michael W. Mahoney, Yakun Sophia Shao, Kurt Keutzer, and Amir Gholami. KVQuant: Towards 10 million context length LLM inference with KV cache quantization. NeurIPS, 2024.
- [11] Dongjin Kim et al. Quantize what counts: More for keys, less for values. arXiv preprint arXiv:2502.15075, 2025.
- [12] Yichi Li et al. CoKV: Optimizing KV cache allocation via cooperative game. arXiv preprint arXiv:2502.17501, 2025.
- [13] Yifei Li, Zhehao Wu, and Cheng Zhou. KVTuner: Sensitivity-aware layer-wise mixed-precision KV cache quantization. 2025.
- [14] Chengxi Liao and Zeyi Wen. Channel-aware mixed-precision quantization for efficient long-context inference. ICLR, 2026.
- [15] Ji Lin, Jiaming Tang, Haotian Tang, Shang Yang, Xingyu Dang, and Song Han. AWQ: Activation-aware weight quantization for LLM compression and acceleration. MLSys, 2024.
- [16] Zirui Liu, Jiayi Yuan, Hongye Jin, Shaochen Zhong, Zhaozhuo Xu, Vladimir Braverman, Beidi Chen, and Xia Hu. KIVI: A tuning-free asymmetric 2bit quantization for KV cache. ICML, 2024.
- [17] Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher. Pointer sentinel mixture models. ICLR, 2017.
- [18] Piotr Nawrot, Adrian Łańcucki, Marcin Chochowski, David Tarjan, and Edoardo Maria Ponti. Dynamic memory compression: Retrofitting LLMs for accelerated inference. ICML, 2024.
- [19] James G. Oxley. Matroid Theory. Oxford University Press, 2011.
- [20] Reiner Pope, Sholto Douglas, Aakanksha Chowdhery, Jacob Devlin, James Bradbury, Jonathan Heek, Kefan Xiao, Shivani Agrawal, and Jeff Dean. Efficiently scaling transformer inference. Proceedings of Machine Learning and Systems, 2023.
- [21] Albert Tseng et al. Radio: Rate-distortion optimization for large language model compression. 2025.
- [22] Elena Voita, David Talbot, Fedor Moiseev, Rico Sennrich, and Ivan Titov. Analyzing multi-head self-attention: Specialized heads do the heavy lifting, the rest can be pruned. ACL, 2019.
- [23] Xingyu Wang et al. Accurate and efficient 2-bit KV cache quantization with dynamic channel-wise precision boost. arXiv preprint arXiv:2511.18643, 2025.
- [24] Guangxuan Xiao, Ji Lin, Mickael Seznec, Hao Wu, Julien Demouth, and Song Han. SmoothQuant: Accurate and efficient post-training quantization for large language models. ICML, 2023.
- [25] Peng Yue et al. WKVQuant: Quantizing weight and key/value cache for large language models. arXiv preprint, 2024.
- [26] Amir Zandieh, Majid Daliri, Majid Hadian, and Vahab Mirrokni. TurboQuant: Online vector quantization with near-optimal distortion rate. ICLR, 2026.
- [27] Wei Zhang et al. Query-aware mixed-precision KV cache quantization for long-context reasoning. arXiv preprint arXiv:2512.19206, 2025.
- [28] Zhenyu Zhang, Ying Sheng, Tianyi Zhou, Tianlong Chen, Lianmin Zheng, Ruisi Cai, Zhao Song, Yuandong Tian, Christopher Ré, Clark Barrett, et al. H2O: Heavy-hitter oracle for efficient generative inference of large language models. NeurIPS, 2024.
- [29] Qi Zheng et al. BAQ: Efficient bit allocation quantization for large language models. arXiv preprint arXiv:2506.05664, 2025.