pith. machine review for the scientific record.

arxiv: 2605.08317 · v1 · submitted 2026-05-08 · 💻 cs.LG · cs.AI

Recognition: 2 theorem links · Lean Theorem

RDKV: Rate-Distortion Bit Allocation for Joint Eviction and Quantization of the KV Cache

Authors on Pith: no claims yet

Pith reviewed 2026-05-12 01:12 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords KV cache compression · rate-distortion optimization · LLM inference · quantization · token eviction · attention distortion · bit allocation

The pith

RDKV unifies KV cache eviction and quantization as a single rate-distortion bit-allocation problem driven by attention distortion weights.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Large language models with long contexts are limited by the growing size of the key-value cache that must be reloaded from high-bandwidth memory at every decoding step. Existing work reduces this cache either by evicting tokens or by quantizing the stored keys and values, but treats the two operations separately. The paper models both as endpoints of the same bit-allocation continuum and derives a weight for each token or channel from the distortion its compression would cause inside the attention computation. These weights guide a reverse water-filling procedure that assigns every token or channel an integer bit width from full precision down to zero bits. The allocation is performed once after the prefilling stage and then held fixed, producing higher task accuracy than separate eviction or quantization baselines while using far less memory.

Core claim

RDKV casts KV cache compression as a rate-distortion problem under which eviction and quantization become the two extremes of one unified bit-allocation scheme. It computes a weight for each token or channel equal to the distortion that compressing that token or channel would induce on the attention computation, then uses reverse water-filling on those weights to assign bit widths ranging from the original precision down to complete removal. The resulting allocation is applied once after prefilling and remains constant during autoregressive decoding.
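
To pin down the quantity being weighted, here is a brute-force sketch of one reading of "the distortion that compressing a token would induce on the attention computation": the squared change in the attention output when a single cached token is left out. The function name, the leave-one-out approximation, and the single-query setting are illustrative assumptions; the paper presumably uses a cheaper closed form.

```python
import numpy as np

def token_weights(q, K, V):
    """Leave-one-out attention distortion per cached token (illustrative sketch).

    q: (d,) current query; K, V: (n, d) cached keys and values.
    """
    d = K.shape[1]
    logits = q @ K.T / np.sqrt(d)
    p = np.exp(logits - logits.max())
    p /= p.sum()                          # softmax attention probabilities
    out = p @ V                           # full attention output
    w = np.empty(len(V))
    for u in range(len(V)):               # drop token u, renormalize, compare
        p_u = np.delete(p, u) / (1.0 - p[u])
        w[u] = np.sum((out - p_u @ np.delete(V, u, axis=0)) ** 2)
    return w / w.sum()                    # normalized, as the rebuttal describes
```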

What carries the argument

Distortion-weighted reverse water-filling bit allocation, in which per-token or per-channel weights are derived from their individual contribution to attention distortion and then used to decide full-precision retention, reduced-precision quantization, or outright eviction.
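
A minimal sketch of that allocation step, assuming the textbook high-rate quantizer model ε_u(b) ≈ w_u · 2^(-2b) and a binary search for the water level θ; the paper's exact distortion model, budget convention, and rounding rule are not given here, so read this as generic reverse water-filling rather than RDKV itself.

```python
import numpy as np

def reverse_waterfill(weights, total_bits, max_bits=16):
    """Assign each token/channel an integer bit-width under a total bit budget.

    weights: positive distortion weights w_u; total_bits: budget B for this head.
    Entries whose weight falls below the water level get 0 bits, i.e. eviction.
    """
    w = np.asarray(weights, dtype=float)
    lo, hi = w.min() * 2.0 ** (-2 * max_bits), w.max()
    b = np.zeros_like(w)
    for _ in range(100):                  # bisect the water level theta
        theta = 0.5 * (lo + hi)
        b = np.clip(0.5 * np.log2(w / theta), 0.0, max_bits)
        if b.sum() > total_bits:
            lo = theta                    # over budget -> raise the water level
        else:
            hi = theta
    return np.floor(b).astype(int)        # integer widths; floor stays in budget

print(reverse_waterfill([5.0, 0.02, 1.3, 0.4], total_bits=12))
# low-weight entries fall toward 0 bits (eviction); heavy ones keep more bits
```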

If this is right

  • On LongBench the method recovers 97.81 percent of full-cache accuracy while retaining only 2.48 percent of the cache.
  • Across LongBench, RULER, and InfiniteBench it outperforms the strongest evaluated baseline by 9.1 percent on average.
  • At 128K context length it delivers 4.5 times faster decoding and 1.9 times lower peak memory than full-cache FlashAttention-2 while keeping comparable accuracy (a back-of-envelope sizing follows this list).
  • Jointly optimizing eviction and quantization within one rate-distortion framework is strictly better than treating the two operations in isolation.
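
For scale, here is what those percentages mean in bytes, assuming LLaMA-3.1-8B's published shape (32 layers, 8 KV heads under GQA, head dimension 128) and FP16 storage; the paper's 1.9x peak-memory figure also covers weights and activations, so this sizes the cache alone.

```python
# KV cache bytes per token = layers * kv_heads * head_dim * 2 (K and V) * 2 (FP16)
layers, kv_heads, head_dim, fp16_bytes = 32, 8, 128, 2
per_token = layers * kv_heads * head_dim * 2 * fp16_bytes    # 131072 B = 128 KiB
ctx = 128 * 1024
full_gib = per_token * ctx / 2**30
print(f"full KV cache at 128K: {full_gib:.1f} GiB")          # -> 16.0 GiB
print(f"at 2.48% retention:    {full_gib * 0.0248:.2f} GiB") # -> ~0.40 GiB
```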

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same distortion-to-bit mapping could be tested on other memory-bound transformer structures such as activation caches or MLP weights.
  • If attention-distortion weights remain stable across model families, the allocation rule might transfer to new architectures without retraining.
  • Periodic reallocation during very long generations could be a direct extension if static allocation begins to degrade.

Load-bearing premise

The amount of distortion that compressing a token or channel produces inside the attention computation is a sufficient proxy for how much that token or channel matters to the model's final output quality, and one allocation computed after prefilling stays close to optimal for the rest of the generation.

What would settle it

Compare end-task accuracy when the bit allocation is frozen after the initial prefilling stage versus when it is recomputed every 8 K tokens during a 128 K generation; a large sustained gap would indicate that the single static allocation is not near-optimal.
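
A runnable toy version of that probe, reusing the reverse_waterfill sketch above: synthetic per-token weights drift multiplicatively as decoding proceeds, and total modeled distortion under the frozen prefill-time allocation is compared against an allocation recomputed at fixed intervals. The drift process and the 2^(-2b) distortion model are assumptions; only the protocol shape mirrors the proposed experiment.

```python
import numpy as np
rng = np.random.default_rng(0)

n, budget, steps = 256, 4 * 256, 16             # ~4 bits per entry on average
w = rng.lognormal(0.0, 2.0, n)                  # prefill-time weights
bits_static = reverse_waterfill(w, budget)      # frozen after "prefill"

static_d = periodic_d = 0.0
for _ in range(steps):                          # each step = one realloc interval
    w = w * rng.lognormal(0.0, 0.1, n)          # weights drift during decoding
    bits_fresh = reverse_waterfill(w, budget)   # the recompute-every-interval arm
    static_d += np.sum(w * 2.0 ** (-2 * bits_static))
    periodic_d += np.sum(w * 2.0 ** (-2 * bits_fresh))

print(f"modeled distortion, static / periodic: {static_d / periodic_d:.2f}")
# ratios near 1 support the one-shot allocation; large ratios would undercut it
```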

Figures

Figures reproduced from arXiv: 2605.08317 by Hang Guo, Junkai Zhang, Luca Benini, Yawei Li.

Figure 1
Figure 1. Left: per-sequence weighted distortion ΔD = Σ_u w_u ε_u(b_u) (Eq. (6)) versus average bit-width b̄. Lines: median across sequences; shaded: IQR. Lower bound: continuous relaxation (Prop. A.3). Right: LongBench score by task category at a per-layer cache budget of 128 FP16-equivalent tokens (B_total = 128L), normalized by FullKV; per-task scores in Tab. 1. Eviction refers to Ada-SnapKV [26] in the right panel… view at source ↗
Figure 2
Figure 2. RDKV per-head bit-allocation pipeline (illustrated for 8 tokens and 8 channels). view at source ↗
Figure 3
Figure 3. TriZone cache layout for one (ℓ, h) pair. Each zone admits a uniform dequantization path; cell labels show the stored format. The pipeline of Sec. 3.2 produces a mixed-bit cache per (ℓ, h) pair. A mixed-bit allocation alone does not reduce decode cost: if the quantized entries are unpacked to FP16 before the attention kernel reads them, peak HBM usage stays unchanged and the extra dequantization pass ad… view at source ↗
Figure 4
Figure 4. Needle-in-a-Haystack [48] on LLaMA-3.1-8B-Instruct at B_total = 64L. RDKV preserves a near-uniform retrieval pattern; SnapKV and AdaKV drop accuracy in mid-depth bands. view at source ↗
Figure 5
Figure 5. Decode latency, peak memory, and latency-accuracy trade-off for LLaMA-3.1-8B-Instruct. view at source ↗
Figure 6
Figure 6. Needle-in-a-Haystack on LLaMA-3.1-8B-Instruct at … view at source ↗
Figure 7
Figure 7. Per-token V-cache bit allocation: layer 15, head 0. view at source ↗
Figure 8
Figure 8. Per-channel K-cache bit allocation: layer 15, head 0. view at source ↗
Figure 9
Figure 9. Per-token V-cache bit allocation: layer 15, head 1. view at source ↗
Figure 10
Figure 10. Per-channel K-cache bit allocation: layer 15, head 1. view at source ↗
Figure 11
Figure 11. Per-token V-cache bit allocation: layer 31, head 0. view at source ↗
Figure 12
Figure 12. Per-channel K-cache bit allocation: layer 31, head 0. view at source ↗
Figure 13
Figure 13. Per-token V-cache bit allocation: layer 31, head 1. view at source ↗
Figure 14
Figure 14. Per-channel K-cache bit allocation: layer 31, head 1. view at source ↗
read the original abstract

Large language models (LLMs) have shown strong performance across diverse tasks, but their inference with long input contexts is bottlenecked by memory size and bandwidth. The Key-Value (KV) cache size grows linearly with sequence length and needs to be re-read from off-chip high-bandwidth memory (HBM) to on-chip memory at every decoding step, resulting in memory-bound inference. Existing methods reduce the cache by either eviction or quantization, but typically treat the two in isolation. In this paper, we cast KV cache compression as a rate-distortion problem, under which eviction and quantization are two end-points of the same bit allocation scheme. This exposes the need to optimize them jointly, motivating our method, RDKV (Rate-Distortion KV cache compression). RDKV derives the weight of each token or channel from the distortion that compression induces on the attention computation. Based on these weights, it assigns each token or channel a bit-width ranging from full precision down to zero bits guided by reverse water-filling, applied once after the prefilling stage. Experiments on LongBench, RULER, and InfiniteBench show that RDKV outperforms the best evaluated baseline by 9.1% on average. On LongBench it recovers 97.81% of full-cache accuracy with only 2.48% cache retention. Compared with full-cache FlashAttention-2 decoding, it achieves 4.5x decode speedup and 1.9x peak memory reduction with 128K context length, while maintaining comparable performance.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes RDKV, which frames KV cache compression for LLMs as a rate-distortion optimization problem to jointly optimize eviction (zero-bit allocation) and quantization. Per-token and per-channel weights are derived from the distortion each induces on the attention computation; these weights then guide a single reverse water-filling bit allocation (full precision down to zero bits) performed once after the prefilling stage and held fixed for the remainder of autoregressive decoding. Experiments on LongBench, RULER, and InfiniteBench report that RDKV outperforms the strongest baseline by 9.1% on average, recovers 97.81% of full-cache accuracy at 2.48% cache retention, and delivers 4.5× decode speedup with 1.9× memory reduction at 128K context length.

Significance. If the empirical results prove robust under the static-allocation regime, the work supplies a principled unification of two previously separate compression axes and demonstrates concrete gains in long-context inference efficiency. The use of attention-distortion weights and reverse water-filling supplies an explicit, reproducible allocation rule that avoids post-hoc threshold tuning, which would be a methodological strength.

major comments (2)
  1. [Method (bit allocation)] Method section (bit-allocation paragraph): the allocation is computed once after prefilling and then frozen. Because query vectors evolve with each newly generated token and the KV cache grows, the relative distortion contribution of any cached entry changes. The manuscript supplies neither a proof that the initial allocation remains near-optimal nor an ablation that recomputes the allocation at regular intervals; the headline accuracy-recovery and speedup numbers rest on this untested premise.
  2. [Experiments] Experiments section: the abstract and results tables report precise figures (9.1% average gain, 97.81% recovery at 2.48% retention) without error bars, number of random seeds, or an explicit statement of how the distortion weights are normalized and whether any fitted constants enter the water-filling threshold. This prevents verification that the gains are attributable to the rate-distortion formulation rather than implementation choices.
minor comments (2)
  1. [Method] Notation: the manuscript should explicitly define the distortion metric (e.g., whether it is the squared error in the attention output or a scaled version) and state whether the weights are recomputed only on the prompt or also on newly generated tokens; a hedged formalization of one candidate definition follows this list.
  2. [Figures] Figure clarity: the rate-distortion curves and per-layer bit-allocation visualizations would benefit from an additional panel showing how the allocation evolves (or does not evolve) across decoding steps.
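
As an illustration of what minor comment 1 asks the authors to state, one hedged formalization consistent with the Fig. 1 caption and the rebuttal; the norm choice and the reference bit-width b_ref used to normalize the weights are assumptions, not the paper's stated definitions.

```latex
% Candidate definitions (assumed, not confirmed): squared attention-output error
% at bit-width b_u, weights normalized at a reference width, aggregate as Eq. (6).
\varepsilon_u(b_u) = \left\| \mathrm{Attn}(Q, K, V)
  - \mathrm{Attn}(Q, K, V)\big|_{u\ \text{stored at}\ b_u\ \text{bits}} \right\|_2^2,
\qquad
w_u = \frac{\varepsilon_u(b_{\mathrm{ref}})}{\sum_{u'} \varepsilon_{u'}(b_{\mathrm{ref}})},
\qquad
\Delta D = \sum_u w_u \, \varepsilon_u(b_u).
```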

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which help improve the clarity and rigor of our work. We address each major point below and will revise the manuscript to incorporate additional experiments and clarifications.

read point-by-point responses
  1. Referee: Method section (bit-allocation paragraph): the allocation is computed once after prefilling and then frozen. Because query vectors evolve with each newly generated token and the KV cache grows, the relative distortion contribution of any cached entry changes. The manuscript supplies neither a proof that the initial allocation remains near-optimal nor an ablation that recomputes the allocation at regular intervals; the headline accuracy-recovery and speedup numbers rest on this untested premise.

    Authors: We acknowledge that the static allocation after prefilling is a deliberate design choice to avoid the computational overhead of repeated rate-distortion optimization during decoding. While a formal proof of near-optimality under evolving queries is not provided (and would be challenging given the non-convex nature of the problem), the empirical results on LongBench, RULER, and InfiniteBench support its effectiveness. To directly address the concern, we will add an ablation study in the revised manuscript that recomputes the bit allocation at fixed intervals (e.g., every 2K tokens) and reports the resulting accuracy and latency trade-offs compared to the static version. This will empirically test the stability of the initial allocation. revision: yes

  2. Referee: Experiments section: the abstract and results tables report precise figures (9.1% average gain, 97.81% recovery at 2.48% retention) without error bars, number of random seeds, or an explicit statement of how the distortion weights are normalized and whether any fitted constants enter the water-filling threshold. This prevents verification that the gains are attributable to the rate-distortion formulation rather than implementation choices.

    Authors: We agree that reproducibility details are essential. In the revised manuscript, we will report results with error bars computed over 3 random seeds, explicitly state the seed count, and add a dedicated paragraph in the method section clarifying the weight computation. The per-token and per-channel weights are derived directly from the L2 distortion each induces on the attention output (normalized by the sum of all weights to form a probability distribution for water-filling); no additional fitted constants are used, and the water-filling threshold is set solely by the target retention ratio. These details will also be added to the experiments section and appendix. revision: yes

Circularity Check

0 steps flagged

No significant circularity in the derivation chain.

full rationale

The paper derives per-token and per-channel weights directly from the distortion each element induces on the attention computation (an external, model-grounded quantity computed from the forward pass). These weights then feed a standard reverse water-filling procedure drawn from rate-distortion theory to produce bit allocations. No equations or steps reduce the claimed result to a fitted parameter, self-definition, or load-bearing self-citation; the allocation rule is applied once post-prefill as an explicit modeling choice rather than a tautology. The static-allocation assumption affects long-term optimality but does not create circularity in the derivation itself. The method remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review yields no explicit free parameters, axioms, or invented entities. The method implicitly relies on two unstated assumptions: that attention distortion is a faithful importance metric, and that a one-time allocation suffices; neither is enumerated or justified in the provided text.

pith-pipeline@v0.9.0 · 5589 in / 1309 out tokens · 47633 ms · 2026-05-12T01:12:45.573348+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

51 extracted references · 51 canonical work pages · 10 internal anchors

  1. [1]

    Language models are few-shot learners.Advances in Neural Information Processing Systems, 33:1877–1901, 2020

    Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners.Advances in Neural Information Processing Systems, 33:1877–1901, 2020

  2. [2]

    GPT-4 Technical Report

    Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. GPT-4 technical report.arXiv preprint arXiv:2303.08774, 2023

  3. [3]

    DeepSeek-V3 Technical Report

    Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, et al. DeepSeek-v3 technical report.arXiv preprint arXiv:2412.19437, 2024

  4. [4]

    Retrieval-augmented generation for knowledge-intensive NLP tasks.Advances in Neural Information Processing Systems, 33:9459–9474, 2020

    Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, et al. Retrieval-augmented generation for knowledge-intensive NLP tasks.Advances in Neural Information Processing Systems, 33:9459–9474, 2020

  5. [5]

    Efficient memory management for large language model serving with PagedAttention

    Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with PagedAttention. InProceedings of the 29th Symposium on Operating Systems Principles, pages 611–626, 2023

  6. [6]

    FlashAttention: Fast and memory-efficient exact attention with IO-awareness.Advances in Neural Information Processing Systems, 35:16344–16359, 2022

    Tri Dao, Dan Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. FlashAttention: Fast and memory-efficient exact attention with IO-awareness.Advances in Neural Information Processing Systems, 35:16344–16359, 2022

  7. [7]

    Efficiently scaling transformer inference

    Reiner Pope, Sholto Douglas, Aakanksha Chowdhery, Jacob Devlin, James Bradbury, Jonathan Heek, Kefan Xiao, Shivani Agrawal, and Jeff Dean. Efficiently scaling transformer inference. Proceedings of Machine Learning and Systems, 5:606–624, 2023

  8. [8]

    A survey on large language model acceleration based on KV cache management

    Haoyang Li, Yiming Li, Anxin Tian, Tianhao Tang, Zhanchao Xu, Xuejia Chen, Nicole Hu, Wei Dong, Qing Li, and Lei Chen. A survey on large language model acceleration based on KV cache management.arXiv preprint arXiv:2412.19442, 2024

  9. [9]

    KV cache compression for inference efficiency in LLMs: A review

    Yanyu Liu, Jingying Fu, Sixiang Liu, Yitian Zou, Shouhua Zhang, and Jiehan Zhou. KV cache compression for inference efficiency in LLMs: A review. InProceedings of the 4th International Conference on Artificial Intelligence and Intelligent Information Processing, pages 207–212, 2025

  10. [10]

    H2O: Heavy-hitter oracle for efficient generative inference of large language models.Advances in Neural Information Processing Systems, 36:34661–34710, 2023

    Zhenyu Zhang, Ying Sheng, Tianyi Zhou, Tianlong Chen, Lianmin Zheng, Ruisi Cai, Zhao Song, Yuandong Tian, Christopher Ré, Clark Barrett, et al. H2O: Heavy-hitter oracle for efficient generative inference of large language models.Advances in Neural Information Processing Systems, 36:34661–34710, 2023

  11. [11]

    SnapKV: LLM knows what you are looking for before generation.Advances in Neural Information Processing Systems, 37:22947–22970, 2024

    Yuhong Li, Yingbing Huang, Bowen Yang, Bharat Venkitesh, Acyr Locatelli, Hanchen Ye, Tianle Cai, Patrick Lewis, and Deming Chen. SnapKV: LLM knows what you are looking for before generation.Advances in Neural Information Processing Systems, 37:22947–22970, 2024

  12. [12]

    ThinK: Thinner key cache by query-driven pruning

    Yuhui Xu, Zhanming Jie, Hanze Dong, Lei Wang, Xudong Lu, Aojun Zhou, Amrita Saha, Caiming Xiong, and Doyen Sahoo. ThinK: Thinner key cache by query-driven pruning.arXiv preprint arXiv:2407.21018, 2024

  13. [13]

    KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache

    Zirui Liu, Jiayi Yuan, Hongye Jin, Shaochen Zhong, Zhaozhuo Xu, Vladimir Braverman, Beidi Chen, and Xia Hu. KIVI: A tuning-free asymmetric 2bit quantization for KV cache. arXiv preprint arXiv:2402.02750, 2024

  14. [14]

    ZipCache: Accurate and efficient KV cache quantization with salient token identification.Advances in Neural Information Processing Systems, 37:68287–68307, 2024

    Yefei He, Luoming Zhang, Weijia Wu, Jing Liu, Hong Zhou, and Bohan Zhuang. ZipCache: Accurate and efficient KV cache quantization with salient token identification.Advances in Neural Information Processing Systems, 37:68287–68307, 2024

  15. [15]

    More tokens, lower precision: Towards the optimal token-precision trade-off in KV cache compression. arXiv preprint arXiv:2412.12706, 2024

    Jiebin Zhang, Dawei Zhu, Yifan Song, Wenhao Wu, Chuqiao Kuang, Xiaoguang Li, Lifeng Shang, Qun Liu, and Sujian Li. More tokens, lower precision: Towards the optimal token-precision trade-off in KV cache compression. arXiv preprint arXiv:2412.12706, 2024

  16. [16]

    ARKV: Adaptive and resource-efficient KV cache management under limited memory budget for long-context inference in LLMs. arXiv preprint arXiv:2603.08727, 2026

    Jianlong Lei and Shashikant Ilager. ARKV: Adaptive and resource-efficient KV cache management under limited memory budget for long-context inference in LLMs. arXiv preprint arXiv:2603.08727, 2026

  17. [17]

    HqeKV: Towards hybrid quantization and eviction for KV cache in long-context LLM inference

    Anonymous. HqeKV: Towards hybrid quantization and eviction for KV cache in long-context LLM inference. InSubmitted to ACL Rolling Review - January 2026, 2026. under review

  18. [18]

    KVQuant: Towards 10 million context length LLM inference with KV cache quantization.Advances in Neural Information Processing Systems, 37:1270– 1303, 2024

    Coleman Hooper, Sehoon Kim, Hiva Mohammadzadeh, Michael W Mahoney, Yakun S Shao, Kurt Keutzer, and Amir Gholami. KVQuant: Towards 10 million context length LLM inference with KV cache quantization.Advances in Neural Information Processing Systems, 37:1270– 1303, 2024

  19. [19]

    KVTuner: Sensitivity-aware layer-wise mixed-precision KV cache quantization for efficient and nearly lossless LLM inference.arXiv preprint arXiv:2502.04420, 2025

    Xing Li, Zeyu Xing, Yiming Li, Linping Qu, Hui-Ling Zhen, Wulong Liu, Yiwu Yao, Sinno Jialin Pan, and Mingxuan Yuan. KVTuner: Sensitivity-aware layer-wise mixed-precision KV cache quantization for efficient and nearly lossless LLM inference.arXiv preprint arXiv:2502.04420, 2025

  20. [20]

    QAQ: Quality adaptive quantization for LLM KV cache

    Shichen Dong, Wen Cheng, Jiayu Qin, and Wei Wang. QAQ: Quality adaptive quantization for LLM KV cache.arXiv preprint arXiv:2403.04643, 2024

  21. [21]

    No token left behind: Reliable KV cache compression via importance-aware mixed precision quantization

    June Yong Yang, Byeongwook Kim, Jeongin Bae, Beomseok Kwon, Gunho Park, Eunho Yang, Se Jung Kwon, and Dongsoo Lee. No token left behind: Reliable KV cache compression via importance-aware mixed precision quantization. arXiv preprint arXiv:2402.18096, 2024

  22. [22]

    Cache me if you must: Adaptive key-value quantization for large language models

    Alina Shutova, Vladimir Malinovskii, Vage Egiazarian, Denis Kuznedelev, Denis Mazur, Nikita Surkov, Ivan Ermakov, and Dan Alistarh. Cache me if you must: Adaptive key-value quantization for large language models.arXiv preprint arXiv:2501.19392, 2025

  23. [23]

    Efficient Streaming Language Models with Attention Sinks

    Guangxuan Xiao, Yuandong Tian, Beidi Chen, Song Han, and Mike Lewis. Efficient streaming language models with attention sinks.arXiv preprint arXiv:2309.17453, 2023

  24. [24]

    Scissorhands: Exploiting the persistence of importance hypothesis for LLM KV cache compression at test time

    Zichang Liu, Aditya Desai, Fangshuo Liao, Weitao Wang, Victor Xie, Zhaozhuo Xu, Anastasios Kyrillidis, and Anshumali Shrivastava. Scissorhands: Exploiting the persistence of importance hypothesis for LLM KV cache compression at test time. Advances in Neural Information Processing Systems, 36:52342–52364, 2023

  25. [25]

    NACL: A general and effective KV cache eviction framework for LLM at inference time

    Yilong Chen, Guoxia Wang, Junyuan Shang, Shiyao Cui, Zhenyu Zhang, Tingwen Liu, Shuohuan Wang, Yu Sun, Dianhai Yu, and Hua Wu. NACL: A general and effective KV cache eviction framework for LLM at inference time. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 7913–7926, 2024

  26. [26]

    Ada-KV: Optimizing KV cache eviction by adaptive budget allocation for efficient LLM inference

    Yuan Feng, Junlin Lv, Yukun Cao, Xike Xie, and S Kevin Zhou. Ada-KV: Optimizing KV cache eviction by adaptive budget allocation for efficient LLM inference.arXiv preprint arXiv:2407.11550, 2024

  27. [27]

    PyramidKV: Dynamic KV Cache Compression based on Pyramidal Information Funneling

    Zefan Cai, Yichi Zhang, Bofei Gao, Yuliang Liu, Yucheng Li, Tianyu Liu, Keming Lu, Wayne Xiong, Yue Dong, Junjie Hu, et al. PyramidKV: Dynamic KV cache compression based on pyramidal information funneling.arXiv preprint arXiv:2406.02069, 2024

  28. [28]

    Cake: Cascading and adaptive kv cache eviction with layer preferences.arXiv preprint arXiv:2503.12491, 2025

    Ziran Qin, Yuchen Cao, Mingbao Lin, Wen Hu, Shixuan Fan, Ke Cheng, Weiyao Lin, and Jianguo Li. Cake: Cascading and adaptive kv cache eviction with layer preferences.arXiv preprint arXiv:2503.12491, 2025

  29. [29]

    Expected attention: KV cache compression by estimating attention from future queries distribution. arXiv preprint arXiv:2510.00636, 2025

    Alessio Devoto, Maximilian Jeblick, and Simon Jégou. Expected attention: KV cache compression by estimating attention from future queries distribution. arXiv preprint arXiv:2510.00636, 2025

  30. [30]

    Identify critical KV cache in LLM inference from an output perturbation perspective.arXiv preprint arXiv:2502.03805, 2025

    Yuan Feng, Junlin Lv, Yukun Cao, Xike Xie, and S Kevin Zhou. Identify critical KV cache in LLM inference from an output perturbation perspective.arXiv preprint arXiv:2502.03805, 2025

  31. [31]

    OBCache: Optimal brain KV cache pruning for efficient long-context LLM inference.arXiv preprint arXiv:2510.07651, 2025

    Yuzhe Gu, Xiyu Liang, Jiaojiao Zhao, and Enmao Diao. OBCache: Optimal brain KV cache pruning for efficient long-context LLM inference.arXiv preprint arXiv:2510.07651, 2025

  32. [32]

    Optimal brain damage.Advances in Neural Information Processing Systems, 2, 1989

    Yann LeCun, John Denker, and Sara Solla. Optimal brain damage.Advances in Neural Information Processing Systems, 2, 1989

  33. [33]

    Optimal brain restoration for joint quantization and sparsification of llms.arXiv preprint arXiv:2509.11177, 2025

    Hang Guo, Yawei Li, and Luca Benini. Optimal brain restoration for joint quantization and sparsification of llms.arXiv preprint arXiv:2509.11177, 2025

  34. [34]

    Caote: Kv cache selection for LLMs via attention output error-based token eviction.arXiv preprint arXiv:2504.14051, 2025

    Raghavv Goel, Junyoung Park, Mukul Gagrani, Dalton Jones, Matthew Morse, Harper Langston, Mingu Lee, and Chris Lott. Caote: Kv cache selection for LLMs via attention output error-based token eviction.arXiv preprint arXiv:2504.14051, 2025

  35. [35]

    Accurate kv cache eviction via anchor direction projection for efficient llm inference

    Zijie Geng, Jie Wang, Ziqi Liu, Feng Ju, Yiming Li, Xing Li, Mingxuan Yuan, Jianye Hao, Defu Lian, Enhong Chen, et al. Accurate kv cache eviction via anchor direction projection for efficient llm inference. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025

  36. [36]

    ReST-KV: Robust KV cache eviction with layer-wise output reconstruction and spatial-temporal smoothing

    Yongqi An, Chang Lu, Kuan Zhu, Tao Yu, Chaoyang Zhao, Hong Wu, Ming Tang, and Jinqiao Wang. ReST-KV: Robust KV cache eviction with layer-wise output reconstruction and spatial-temporal smoothing. In The Fourteenth International Conference on Learning Representations, 2026

  37. [37]

    Attention is all you need.Advances in Neural Information Processing Systems, 30, 2017

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need.Advances in Neural Information Processing Systems, 30, 2017

  38. [38]

    GQA: Training generalized multi-query transformer models from multi-head checkpoints

    Joshua Ainslie, James Lee-Thorp, Michiel De Jong, Yury Zemlyanskiy, Federico Lebrón, and Sumit Sanghai. GQA: Training generalized multi-query transformer models from multi-head checkpoints. InProceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 4895–4901, 2023

  39. [39]

    Spectra of quantized signals.The Bell System Technical Journal, 27(3): 446–472, 1948

    William Ralph Bennett. Spectra of quantized signals.The Bell System Technical Journal, 27(3): 446–472, 1948

  40. [40]

    Coding theorems for a discrete source with a fidelity criterion

    Claude E Shannon et al. Coding theorems for a discrete source with a fidelity criterion.IRE Nat. Conv. Rec, 4(142-163):1, 1959

  41. [41]

    Elements of Information Theory

    Thomas M. Cover and Joy A. Thomas.Elements of Information Theory 2nd Edition (Wiley Series in Telecommunications and Signal Processing). Wiley-Interscience, July 2006. ISBN 0471241954

  42. [42]

    The Llama 3 Herd of Models

    Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. The Llama 3 herd of models.arXiv preprint arXiv:2407.21783, 2024

  43. [43]

    Mistral 7B

    Albert Q. Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, Lélio Renard Lavaud, Marie-Anne Lachaux, Pierre Stock, Teven Le Scao, Thibaut Lavril, Thomas Wang, Timothée Lacroix, and William El Sayed. Mistral 7B, 2023

  44. [44]

    Qwen3 Technical Report

    An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report.arXiv preprint arXiv:2505.09388, 2025

  45. [45]

    Llama 2: Open Foundation and Fine-Tuned Chat Models

    Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023

  46. [46]

    Qwen2.5 technical report, 2025

    Qwen, :, An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, Huan Lin, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jingren Zhou, Junyang Lin, Kai Dang, Keming Lu, Keqin Bao, Kexin Yang, Le Yu, Mei Li, Mingfeng Xue, Pei Zhang, Qin Zhu, Rui Men, Runji Lin, Tianhao Li,...

  47. [47]

    LongBench: A bilingual, multitask benchmark for long context understanding

    Yushi Bai, Xin Lv, Jiajie Zhang, Hongchang Lyu, Jiankai Tang, Zhidian Huang, Zhengxiao Du, Xiao Liu, Aohan Zeng, Lei Hou, et al. LongBench: A bilingual, multitask benchmark for long context understanding. InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 3119–3137, 2024

  48. [48]

    Needle in a haystack - pressure testing LLMs, 2023

    Gregory Kamradt. Needle in a haystack - pressure testing LLMs, 2023

  49. [49]

    RULER: What's the Real Context Size of Your Long-Context Language Models?

    Cheng-Ping Hsieh, Simeng Sun, Samuel Kriman, Shantanu Acharya, Dima Rekesh, Fei Jia, Yang Zhang, and Boris Ginsburg. RULER: What’s the real context size of your long-context language models?arXiv preprint arXiv:2404.06654, 2024

  50. [50]

    ∞ bench: Extending long context evaluation beyond 100K tokens

    Xinrong Zhang, Yingfa Chen, Shengding Hu, Zihang Xu, Junhao Chen, Moo Hao, Xu Han, Zhen Thai, Shuo Wang, Zhiyuan Liu, et al. ∞ bench: Extending long context evaluation beyond 100K tokens. InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 15262–15277, 2024

  51. [51]

    FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning

    Tri Dao. FlashAttention-2: Faster attention with better parallelism and work partitioning. arXiv preprint arXiv:2307.08691, 2023