RateQuant: Optimal Mixed-Precision KV Cache Quantization via Rate-Distortion Theory
Pith reviewed 2026-05-11 01:04 UTC · model grok-4.3
The pith
RateQuant fits each quantizer's own distortion curve from a small calibration set, then uses reverse waterfilling to allocate bits across attention heads in the KV cache.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper shows that mixed-precision KV cache quantization fails under distortion model mismatch: each quantizer follows its own curve D(b) = alpha * beta^{-b}, with decay rate beta ranging from 3.6 to 5.3 across designs, so applying one quantizer's distortion model to another misallocates bits. RateQuant fits alpha and beta for every quantizer from a small calibration set, then applies reverse waterfilling to obtain the bit allocation that minimizes total distortion subject to an average-rate constraint. The resulting allocations improve perplexity substantially over both uniform quantization and mismatched mixed-precision baselines.
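As a sketch only (not the paper's own derivation, and with bit-widths treated as continuous rather than rounded to supported precisions), the closed-form structure of the allocation under the stated exponential model follows from the usual Lagrangian argument:

```latex
% Continuous-bit sketch of the allocation problem under D_i(b_i) = \alpha_i \beta_i^{-b_i};
% integer rounding and the paper's exact closed form are omitted.
\begin{aligned}
&\min_{b_1,\dots,b_N \ge 0} \; \sum_{i=1}^{N} \alpha_i \beta_i^{-b_i}
  \qquad \text{s.t.} \qquad \frac{1}{N}\sum_{i=1}^{N} b_i \le B, \\
&\text{stationarity of } \sum_i \alpha_i \beta_i^{-b_i} + \lambda\Bigl(\textstyle\sum_i b_i - NB\Bigr)
  \;\Longrightarrow\; \alpha_i (\ln\beta_i)\, \beta_i^{-b_i} = \lambda, \\
&\Longrightarrow\;
  b_i^{\star} = \max\!\Bigl(0,\; \frac{\ln(\alpha_i \ln\beta_i) - \ln\lambda}{\ln\beta_i}\Bigr),
  \qquad \lambda \text{ chosen so that } \tfrac{1}{N}\textstyle\sum_i b_i^{\star} = B.
\end{aligned}
```

When all beta_i coincide, the water level equalizes per-head distortion, the textbook reverse-waterfilling picture; when they differ, it equalizes the marginal quantity alpha_i (ln beta_i) beta_i^{-b_i}, so the allocation order depends on beta as well as alpha, which is one way a mismatched beta can reorder it.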
What carries the argument
Per-quantizer exponential distortion model D(b) = alpha * beta^{-b} fitted from calibration data, combined with reverse waterfilling to solve the bit-allocation optimization in closed form.
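A minimal NumPy sketch of that two-step recipe, assuming per-head distortions have already been measured at a few probe bit-widths on the calibration set; the function names, the log-space least-squares fit, and the bisection on the water level are illustrative choices rather than the paper's implementation:

```python
import numpy as np

def fit_distortion_model(bits, distortions):
    """Least-squares fit of D(b) = alpha * beta**(-b) in log space.

    bits        : probed bit-widths (e.g. [2, 3, 4, 8])
    distortions : measured distortions at those widths on the calibration set
    Returns (alpha, beta).
    """
    # log D = log(alpha) - b * log(beta)  ->  ordinary least squares on (b, log D)
    slope, intercept = np.polyfit(np.asarray(bits, float), np.log(distortions), 1)
    return float(np.exp(intercept)), float(np.exp(-slope))

def allocate_bits(alpha, beta, avg_bits, tol=1e-8):
    """Continuous reverse-waterfilling allocation for D_i(b) = alpha_i * beta_i**(-b).

    alpha, beta : per-head fitted parameters (arrays of equal length)
    avg_bits    : target average bit-width B
    Returns the per-head bit vector before rounding.
    """
    alpha, beta = np.asarray(alpha, float), np.asarray(beta, float)
    log_beta = np.log(beta)

    def bits_for(log_lam):
        # b_i = max(0, [ln(alpha_i * ln(beta_i)) - ln(lambda)] / ln(beta_i))
        return np.clip((np.log(alpha * log_beta) - log_lam) / log_beta, 0.0, None)

    # Bisection on log(lambda): a larger water level means fewer bits everywhere.
    lo, hi = -60.0, 60.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if bits_for(mid).mean() > avg_bits:
            lo = mid
        else:
            hi = mid
    return bits_for(0.5 * (lo + hi))
```

Rounding the continuous allocation to the bit-widths the kernel actually supports, and re-checking the average-rate constraint after rounding, is left out of the sketch.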
If this is right
- At 2.5 average bits on Qwen3-8B, RateQuant reduces KIVI perplexity from 49.3 to 14.9.
- The same setting improves QuaRot perplexity by 6.6 points.
- Calibration completes in 1.6 seconds on one GPU and adds zero overhead at inference time.
- Correct per-quantizer models prevent the reversal of bit-allocation order that occurs under mismatch.
Where Pith is reading between the lines
- The same calibration-plus-reverse-waterfilling pattern could apply to quantizing weights or activations where sensitivity also varies across components.
- The observed range of beta values indicates that quantizer selection itself shapes the optimal allocation beyond the choice of bit widths.
- Testing the fitted models on longer sequences or out-of-distribution prompts would check whether the calibration remains stable over extended generation.
Load-bearing premise
The exponential distortion parameters fitted on the calibration set accurately describe how each quantizer behaves on the data distribution encountered during actual inference.
What would settle it
Running RateQuant's bit allocation on a new model or dataset and finding higher perplexity than a uniform bit-width baseline of the same average rate would show the fitted models do not generalize.
Original abstract
Large language models cache all previously computed key-value (KV) pairs during generation, and this KV cache grows linearly with sequence length, making it a primary memory bottleneck for serving. Quantizing the KV cache to fewer bits reduces this cost, yet all current quantizers assign the same bit-width to every attention head, ignoring the large variation in head importance. A natural idea is to allocate more bits to important heads and fewer to the rest. We show, however, that such mixed-precision allocation has a hidden pitfall: each quantizer follows a different distortion curve D(b)=alpha*beta^{-b}, and the decay rate beta varies from 3.6 to 5.3 across quantizer designs. Applying one quantizer's distortion model to another inverts the allocation order and makes performance worse than uniform quantization. We call this failure mode distortion model mismatch and propose RateQuant to resolve it. RateQuant fits a per-quantizer distortion model from a small calibration set, then solves the resulting bit-allocation problem in closed form via reverse waterfilling from rate-distortion theory. On Qwen3-8B at 2.5 average bits, calibrated RateQuant reduces KIVI's perplexity from 49.3 to 14.9 (70% reduction) and improves QuaRot by 6.6 PPL. The entire calibration takes 1.6 s on a single GPU and adds zero overhead at inference time.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces RateQuant for mixed-precision KV cache quantization in LLMs. It fits an exponential distortion model D(b)=α β^{-b} per quantizer from a small calibration set, then derives a closed-form bit allocation via reverse waterfilling from rate-distortion theory to minimize total distortion at a target average bit rate. Experiments report large perplexity gains on Qwen3-8B (e.g., KIVI from 49.3 to 14.9 PPL at 2.5 bits) while highlighting the risk of distortion-model mismatch when β varies across quantizers.
Significance. If the fitted per-quantizer models remain accurate on inference KV distributions, the method supplies a principled, low-overhead way to allocate bits according to head-specific importance. The closed-form solution, 1.6 s calibration, and explicit treatment of the mismatch failure mode are concrete strengths that could influence practical KV quantization pipelines.
major comments (1)
- [RateQuant bit-allocation procedure] The optimality guarantee via reverse waterfilling holds only under the assumption that the fitted exponential distortion curves accurately predict behavior on the inference distribution. The manuscript demonstrates that modest β mismatch (3.6 vs 5.3) inverts the allocation and can degrade performance below uniform quantization, yet provides no quantitative check (e.g., held-out prediction error or long-sequence validation) that the calibration-set parameters remain valid for the target generation statistics.
minor comments (1)
- [Abstract] The abstract states concrete perplexity numbers; adding the number of calibration sequences and the exact calibration prompt distribution would help readers assess how representative the fit is.
Simulated Author's Rebuttal
We thank the referee for the thoughtful and constructive review. The concern regarding validation of the fitted distortion models on inference distributions is well-taken. We address it directly below and will strengthen the manuscript accordingly.
Point-by-point responses
Referee: The optimality guarantee via reverse waterfilling holds only under the assumption that the fitted exponential distortion curves accurately predict behavior on the inference distribution. The manuscript demonstrates that modest β mismatch (3.6 vs 5.3) inverts the allocation and can degrade performance below uniform quantization, yet provides no quantitative check (e.g., held-out prediction error or long-sequence validation) that the calibration-set parameters remain valid for the target generation statistics.
Authors: We agree that the closed-form optimality of reverse waterfilling is conditional on the accuracy of the per-quantizer exponential models D(b) = α β^{-b} when applied to the actual inference KV distributions. The manuscript already identifies distortion-model mismatch as a concrete failure mode and shows that fitting separate parameters per quantizer (from a calibration run that completes in 1.6 s) avoids the inversion that occurs when a single model is used. The reported 70% perplexity reduction versus KIVI on Qwen3-8B at 2.5 average bits supplies indirect evidence that the fitted curves generalize sufficiently for the evaluated tasks. Nevertheless, we concur that an explicit quantitative check would be valuable. In the revised manuscript we will add a new subsection that reports (i) the mean-squared prediction error of the fitted D(b) curves on a held-out portion of the calibration data and (ii) the same error measured on KV statistics collected during actual long-sequence generation. These results will quantify the residual mismatch risk and confirm that the calibration procedure remains reliable for the target generation statistics. Revision: yes.
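For concreteness, one hedged sketch of how the promised check could be scored (the function and the log-space metric are assumptions, not taken from the manuscript): compare the calibration-fitted curve's predictions against distortions measured on a held-out split or during long-sequence generation.

```python
import numpy as np

def heldout_log_mse(alpha, beta, bits, heldout_distortions):
    """Prediction error of a fitted curve D(b) = alpha * beta**(-b)
    against distortions measured on held-out or long-sequence data.

    bits                : probe bit-widths, shape (K,)
    heldout_distortions : measured distortions at those widths, shape (K,)
    Returns mean squared error in log-distortion space, so errors are
    comparable across heads with very different alpha scales.
    """
    predicted = np.log(alpha) - np.asarray(bits, float) * np.log(beta)
    return float(np.mean((predicted - np.log(heldout_distortions)) ** 2))
```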
Circularity Check
No significant circularity; bit allocation follows from empirical per-quantizer fits plus standard reverse waterfilling.
full rationale
The derivation fits parameters alpha and beta of the exponential distortion model D(b) = alpha * beta^{-b} on a calibration set, then applies the closed-form reverse-waterfilling solution from classical rate-distortion theory to obtain the bit vector. This is a standard optimization step once the model is given; the final allocation is data-dependent rather than tautological or self-definitional. No self-citation chain, uniqueness theorem, or ansatz smuggling is invoked in the provided text to justify the core construction, and the exponential form is presented as an observed empirical pattern rather than derived from the target result. Once the calibration assumption is granted, the approach therefore rests on standard rate-distortion machinery and can be judged against external benchmarks rather than against its own outputs.
Axiom & Free-Parameter Ledger
free parameters (1)
- alpha and beta per quantizer
axioms (2)
- domain assumption: Quantizer distortion follows the exponential form D(b) = alpha * beta^{-b}
- standard math: Reverse waterfilling yields the optimal bit allocation under the fitted distortion model
Reference graph
Works this paper leans on
- [1] Saleh Ashkboos, Amirkeivan Mohtashami, Maximilian L. Croci, Bo Li, Martin Jaggi, Dan Alistarh, Torsten Hoefler, and James Hensman. QuaRot: Outlier-free 4-bit inference in rotated LLMs. ICML, 2024.
- [2] Jiahao Chen, Fangcheng Wei, Zhuowei Liu, and Zhongqiu Peng. KVmix: Gradient-based layer importance-aware mixed-precision quantization for KV cache. arXiv preprint arXiv:2506.08018, 2025.
- [3] Yilong Chen et al. Progressive mixed-precision KV cache quantization for long-CoT LLMs. arXiv preprint arXiv:2505.18610, 2025.
- [4] Thomas M. Cover and Joy A. Thomas. Elements of Information Theory. John Wiley & Sons, 2nd edition, 2006.
- [5] Tim Dettmers, Mike Lewis, Younes Belkada, and Luke Zettlemoyer. GPT3.int8(): 8-bit matrix multiplication for transformers at scale. NeurIPS, 2022.
- [6] Zhen Dong, Zhewei Yao, Amir Gholami, Michael W. Mahoney, and Kurt Keutzer. HAWQ: Hessian aware quantization of neural networks with mixed-precision. ICCV, 2019.
- [7] Zhen Dong, Zhewei Yao, Daiyaan Arfeen, Amir Gholami, Michael W. Mahoney, and Kurt Keutzer. HAWQ-V2: Hessian aware trace-weighted quantization of neural networks. NeurIPS, 2020.
- [8] Leo Gao, Jonathan Tow, et al. A framework for few-shot language model evaluation. 2024. URL https://zenodo.org/records/10256836.
- [9] Suyu Ge, Yunan Zhang, Liyuan Liu, Minjia Zhang, Jiawei Han, and Jianfeng Gao. Model tells you what to discard: Adaptive KV cache compression for LLMs. ICLR, 2024.
- [10] Coleman Hooper, Sehoon Kim, Hiva Mohammadi, Michael W. Mahoney, Yakun Sophia Shao, Kurt Keutzer, and Amir Gholami. KVQuant: Towards 10 million context length LLM inference with KV cache quantization. NeurIPS, 2024.
- [11] Dongjin Kim et al. Quantize what counts: More for keys, less for values. arXiv preprint arXiv:2502.15075, 2025.
- [12] Yichi Li et al. CoKV: Optimizing KV cache allocation via cooperative game. arXiv preprint arXiv:2502.17501, 2025.
- [13] Yifei Li, Zhehao Wu, and Cheng Zhou. KVTuner: Sensitivity-aware layer-wise mixed-precision KV cache quantization. 2025.
- [14] Chengxi Liao and Zeyi Wen. Channel-aware mixed-precision quantization for efficient long-context inference. ICLR, 2026.
- [15] Ji Lin, Jiaming Tang, Haotian Tang, Shang Yang, Xingyu Dang, and Song Han. AWQ: Activation-aware weight quantization for LLM compression and acceleration. MLSys, 2024.
- [16] Zirui Liu, Jiayi Yuan, Hongye Jin, Shaochen Zhong, Zhaozhuo Xu, Vladimir Braverman, Beidi Chen, and Xia Hu. KIVI: A tuning-free asymmetric 2bit quantization for KV cache. ICML, 2024.
- [17] Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher. Pointer sentinel mixture models. ICLR, 2017.
- [18] Piotr Nawrot, Adrian Łańcucki, Marcin Chochowski, David Tarjan, and Edoardo Maria Ponti. Dynamic memory compression: Retrofitting LLMs for accelerated inference. ICML, 2024.
- [19] James G. Oxley. Matroid Theory. Oxford University Press, 2011.
- [20] Reiner Pope, Sholto Douglas, Aakanksha Chowdhery, Jacob Devlin, James Bradbury, Jonathan Heek, Kefan Xiao, Shivani Agrawal, and Jeff Dean. Efficiently scaling transformer inference. Proceedings of Machine Learning and Systems, 2023.
- [21] Albert Tseng et al. Radio: Rate-distortion optimization for large language model compression. 2025.
- [22] Elena Voita, David Talbot, Fedor Moiseev, Rico Sennrich, and Ivan Titov. Analyzing multi-head self-attention: Specialized heads do the heavy lifting, the rest can be pruned. ACL, 2019.
- [23] Xingyu Wang et al. Accurate and efficient 2-bit KV cache quantization with dynamic channel-wise precision boost. arXiv preprint arXiv:2511.18643, 2025.
- [24] Guangxuan Xiao, Ji Lin, Mickael Seznec, Hao Wu, Julien Demouth, and Song Han. SmoothQuant: Accurate and efficient post-training quantization for large language models. ICML, 2023.
- [25] Peng Yue et al. WKVQuant: Quantizing weight and key/value cache for large language models. arXiv preprint, 2024.
- [26] Amir Zandieh, Majid Daliri, Majid Hadian, and Vahab Mirrokni. TurboQuant: Online vector quantization with near-optimal distortion rate. ICLR, 2026.
- [27] Wei Zhang et al. Query-aware mixed-precision KV cache quantization for long-context reasoning. arXiv preprint arXiv:2512.19206, 2025.
- [28] Zhenyu Zhang, Ying Sheng, Tianyi Zhou, Tianlong Chen, Lianmin Zheng, Ruisi Cai, Zhao Song, Yuandong Tian, Christopher Ré, Clark Barrett, et al. H2O: Heavy-hitter oracle for efficient generative inference of large language models. NeurIPS, 2024.
- [29] Qi Zheng et al. BAQ: Efficient bit allocation quantization for large language models. arXiv preprint arXiv:2506.05664, 2025.