KVTuner: Sensitivity-aware layer-wise mixed-precision KV cache quantization for efficient and nearly lossless LLM inference

Xing Li, Zeyu Xing, Yiming Li, Linping Qu, Hui-Ling Zhen, Wulong Liu, Yiwu Yao, Sinno Jialin Pan, Mingxuan Yuan · 2025 · arXiv 2502.04420

4 Pith papers cite this work. Polarity classification is still being indexed.

citation-role summary

background 1

citation-polarity summary

background 1

representative citing papers

VeriCache: Turning Lossy KV Cache into Lossless LLM Inference

cs.AR · 2026-05-17 · unverdicted · novelty 6.0

VeriCache turns lossy KV cache compression into lossless LLM inference by drafting with compressed cache and verifying drafts with full cache, achieving up to 4x throughput with identical outputs.

RDKV: Rate-Distortion Bit Allocation for Joint Eviction and Quantization of the KV Cache

cs.LG · 2026-05-08 · unverdicted · novelty 6.0

RDKV derives per-token and per-channel weights from attention distortion, then uses reverse water-filling to assign bit-widths from full precision to zero after prefilling, recovering 97.81% accuracy with 2.48% cache retention on LongBench.

HeadQ: Model-Visible Distortion and Score-Space Correction for KV-Cache Quantization

cs.LG · 2026-05-05 · unverdicted · novelty 5.0 · 2 refs

HeadQ applies score-space logit corrections for keys and attention-weighted surrogates for values to KV-cache quantization, removing 84-94% of excess perplexity in 2-bit key experiments across six models.

A Simple Plug-in for Improving Eviction-Based KV Cache Compression

cs.LG · 2026-05-22 · unverdicted · novelty 4.0

VECTOR augments eviction-based KV cache compression with three-way token routing that combines importance scoring and offline regression-based reconstructability estimation to improve quality at high compression ratios.

citing papers explorer

Showing 4 of 4 citing papers.

VeriCache: Turning Lossy KV Cache into Lossless LLM Inference

cs.AR · 2026-05-17 · unverdicted · none · ref 27

RDKV: Rate-Distortion Bit Allocation for Joint Eviction and Quantization of the KV Cache

cs.LG · 2026-05-08 · unverdicted · none · ref 19

HeadQ: Model-Visible Distortion and Score-Space Correction for KV-Cache Quantization

cs.LG · 2026-05-05 · unverdicted · none · ref 17 · 2 links

A Simple Plug-in for Improving Eviction-Based KV Cache Compression

cs.LG · 2026-05-22 · unverdicted · none · ref 19
