pith · machine review for the scientific record

arxiv: 2604.25975 · v1 · submitted 2026-04-28 · 💻 cs.LG · cs.AI · cs.IT · math.IT

Recognition: unknown

Rethinking KV Cache Eviction via a Unified Information-Theoretic Objective

Authors on Pith: no claims yet

Pith reviewed 2026-05-07 16:37 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · cs.IT · math.IT
keywords KV cache eviction · information bottleneck · mutual information · linear-Gaussian surrogate · leverage scores · long-context inference · capacity maximization · CapKV

The pith

KV cache eviction can be reframed as selecting the subset that maximizes retained predictive information under a linear-Gaussian attention model.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper establishes that KV cache eviction in large language models amounts to keeping the past token representations that carry the most mutual information about the model's future outputs. By modeling attention with a linear-Gaussian surrogate and applying the Information Bottleneck principle, the authors obtain a closed-form objective that directly measures the information capacity of any chosen cache subset. The formulation shows that many existing heuristic eviction rules are in practice trying to approximate this same capacity-maximizing goal. The authors then propose CapKV, which replaces heuristics with a leverage-score-based log-determinant method that explicitly targets the highest-capacity tokens.

Core claim

Under a linear-Gaussian surrogate of attention, the authors derive a closed-form mutual information objective that characterizes the effective information capacity of a retained KV cache subset. This formulation reveals that a wide range of existing eviction strategies can be interpreted as different approximations of the same capacity-maximization principle. Guided by this insight, they introduce CapKV, a capacity-aware eviction method that directly targets information preservation via a log-determinant approximation using statistical leverage scores.
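The paper's actual CapKV procedure is not spelled out on this page; as a hedged sketch of the general idea, a leverage-score-driven, log-determinant-style selection might look like the following toy implementation (all names and the ridge/noise parameters are illustrative assumptions, not the authors' code):

```python
import numpy as np

def leverage_scores(K, ridge=1e-6):
    """Ridge leverage score of each cached key vector (row of K)."""
    G = K.T @ K + ridge * np.eye(K.shape[1])          # d x d regularized Gram
    return np.einsum("ij,jk,ik->i", K, np.linalg.inv(G), K)

def logdet_capacity(K_S, noise=1e-3):
    """log det(I + K_S K_S^T / noise): a log-determinant capacity proxy."""
    m = K_S.shape[0]
    return np.linalg.slogdet(np.eye(m) + K_S @ K_S.T / noise)[1]

rng = np.random.default_rng(0)
K = rng.normal(size=(128, 16))          # toy cache: 128 tokens, head dim 16
budget = 32
keep = np.argsort(leverage_scores(K))[-budget:]      # top leverage-score tokens
rand = rng.choice(128, size=budget, replace=False)   # random eviction baseline
print(logdet_capacity(K[keep]) > logdet_capacity(K[rand]))  # high-leverage subset should retain more capacity
```

In this toy, high-leverage keys span diverse, high-energy directions, which is what the log-determinant proxy rewards; CapKV's actual estimator may differ in how it regularizes, scales, and batches the computation.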

What carries the argument

The closed-form mutual information objective derived from the linear-Gaussian attention surrogate, which serves as a unified measure of the predictive information capacity retained by any KV cache subset.
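The exact closed form is not reproduced on this page, but for a linear-Gaussian observation model the mutual information between the query and the retained outputs has a standard log-determinant expression; a sketch under assumed notation (K_S for the retained key matrix, q ∼ N(µ_Q, Λ_Q) as in Figure 2, isotropic noise):

```latex
% Surrogate channel: y_S = K_S q + n, with q ~ N(\mu_Q, \Lambda_Q) and n ~ N(0, \sigma^2 I).
% Standard Gaussian-channel mutual information; the notation here is assumed, not the paper's.
I(q;\, y_S) \;=\; \tfrac{1}{2} \log\det\!\left( I + \tfrac{1}{\sigma^2}\, K_S \Lambda_Q K_S^{\top} \right)
```

Maximizing such an objective over retained subsets of a fixed budget is a log-determinant (D-optimal-design-style) selection problem, which is where leverage-score approximations naturally enter.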

If this is right

  • Existing heuristic eviction policies succeed to the extent they approximate the same information-capacity objective.
  • Directly optimizing the capacity objective via leverage scores yields a selection rule that preserves more predictive signal than prior heuristics.
  • The resulting method improves the memory-quality trade-off on long-context generation tasks across multiple models and benchmarks.
  • Future cache-management designs can start from the capacity-maximization principle rather than trial-and-error rules.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the surrogate holds, then any new eviction heuristic can be evaluated by how closely its ranking matches the mutual-information ordering.
  • The same capacity view might be used to decide when to recompute rather than evict entries during very long generations.
  • The leverage-score approximation could be replaced by other tractable estimators if the Gaussian assumption is relaxed in future work.

Load-bearing premise

The linear-Gaussian surrogate provides a valid model for the information-theoretic properties of attention inside actual large language models.

What would settle it

Computing the closed-form objective on real attention activations from a transformer layer and checking whether the subset it ranks highest actually produces higher next-token prediction accuracy than a lower-ranked subset on held-out continuations.
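That check is expensive on a real transformer, but the shape of the experiment can be sketched on synthetic data: plant a few dominant ("heavy-hitter") tokens, score random retained subsets with a log-determinant capacity proxy, and compare how well the best- and worst-scoring subsets reproduce the full attention output. Everything below, from the proxy's noise term to the heavy-hitter construction, is an illustrative assumption:

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def attn_out(Q, K, V, idx=None):
    """Attention output when only the token subset idx is retained."""
    if idx is not None:
        K, V = K[idx], V[idx]
    return softmax(Q @ K.T / np.sqrt(K.shape[1])) @ V

def capacity(K_S, noise=1e-3):
    """Log-determinant capacity proxy of a retained key subset."""
    m = K_S.shape[0]
    return np.linalg.slogdet(np.eye(m) + K_S @ K_S.T / noise)[1]

rng = np.random.default_rng(1)
n, d, m = 256, 32, 64
K = rng.normal(size=(n, d))
K[:8] *= 4.0                      # a few heavy-hitter tokens dominate attention
V = rng.normal(size=(n, d))
Q = rng.normal(size=(100, d))     # stand-in for held-out future queries
full = attn_out(Q, K, V)

subsets = sorted((rng.choice(n, m, replace=False) for _ in range(50)),
                 key=lambda s: capacity(K[s]))
worst, best = subsets[0], subsets[-1]
err = lambda s: np.linalg.norm(attn_out(Q, K, V, s) - full)
print(err(best), err(worst))      # lower error expected for the high-capacity subset
```

On real activations the same comparison would replace the synthetic Q, K, V with recorded queries, keys, and values from a transformer layer, and the reconstruction error with held-out next-token accuracy.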

Figures

Figures reproduced from arXiv: 2604.25975 by Chenwei Tang, Jiaming Yang, Jiancheng Lv, Liangli Zhen.

Figure 1. A unified information-theoretic view of KV cache eviction. The horizontal axis reflects structural diversity of retained KV entries, while the vertical axis reflects query relevance, measuring alignment with likely future queries. KV cache eviction can be viewed as maximizing an information-capacity objective, with existing methods corresponding to different approximations. KV cache with context length cr…
Figure 2. Information-theoretic perspective of KV cache eviction. We model cached key–value pairs as communication channels mapping future queries to attention outputs. Query as source: future queries are characterized as stochastic signals sampled from q ∼ N(µ_Q, Λ_Q). KV cache channels: each cached entry induces an effective channel; the vector direction signifies its contribution to the output space, while colo…
Figure 3. Capacity preservation and capacity–performance correlation on Qasper. Top: retained capacity under increasing cache compression ratios for different KV cache eviction methods, shown for three simplified capacity proxies. Bottom: scatter plots of retained capacity versus task performance across eviction methods and compression ratios, with Spearman rank correlation coefficients and corresponding p-values anno…
Figure 4. Nemotron7B reasoning-model experiments on AIME25; "nc" stands for no compression.
Figure 5. Runtime comparison of KV cache eviction methods under varying context lengths and compression ratios on Qwen3-8B.
Figure 6. Capacity–performance correlation analysis on 2WIKIMQA (top two rows), MULTIFIELDQA (middle two rows), and PASSAGE-RETRIEVAL (bottom two rows). For each dataset, the top row shows the retained KV cache capacity under increasing compression ratios for different eviction methods, evaluated using three simplified capacity proxies (K, U, and KU-Capacity). The bottom row plots downstream task performance against…
Original abstract

Key-value (KV) caching is essential for large language model inference, yet its memory overhead poses a critical bottleneck for long-context generation. Existing eviction policies predominantly rely on empirical heuristics, lacking a rigorous theoretical foundation. This work rethinks KV cache eviction through the lens of the Information Bottleneck principle. Under a linear-Gaussian surrogate of attention, we derive a closed-form mutual information objective that characterizes the effective information capacity of a retained KV cache subset. This formulation reveals that a wide range of existing eviction strategies can be interpreted as different approximations of the same capacity-maximization principle. Guided by this insight, we introduce CapKV, a capacity-aware eviction method that directly targets information preservation via a log-determinant approximation using statistical leverage scores. This approach replaces heuristic selection with a theoretically grounded mechanism that preserves the maximum predictive signal. Extensive experiments across multiple models and long-context benchmarks show that CapKV consistently outperforms prior methods, achieving a better trade-off between memory efficiency and generational fidelity.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper rethinks KV cache eviction for long-context LLM inference through the Information Bottleneck principle. Under a linear-Gaussian surrogate model of attention, it derives a closed-form mutual information objective characterizing the information capacity of retained KV subsets. This unifies a range of existing eviction heuristics as approximations to capacity maximization and motivates CapKV, a new method using log-determinant leverage scores for direct information preservation. Experiments across models and benchmarks report that CapKV improves the memory-fidelity trade-off over prior approaches.

Significance. If the linear-Gaussian surrogate is shown to preserve the relevant predictive mutual information quantities from real softmax attention, the work supplies a principled theoretical foundation for KV eviction. The unification of heuristics and the capacity-aware CapKV could replace ad-hoc policies with a reproducible, information-theoretic mechanism, with direct implications for efficient long-context generation.

major comments (2)
  1. [Theoretical derivation (linear-Gaussian surrogate and MI objective)] The linear-Gaussian surrogate (derivation in the main theoretical section) is load-bearing for both the closed-form MI objective and the claim that existing policies approximate capacity maximization. Real attention is query-dependent, uses softmax normalization, and exhibits heavy-tailed behavior absent from the surrogate; without explicit validation that the surrogate induces similar token-importance orderings or capacity estimates as deployed transformers, the unification and CapKV superiority rest on an untested modeling assumption rather than properties of actual LLMs.
  2. [Experiments and results] Table of experimental results: while CapKV is reported to outperform baselines, the paper does not report whether the surrogate parameters are estimated from data or fixed, nor does it include an ablation measuring how closely the surrogate's leverage-score ordering matches empirical attention scores on the same inputs. This gap directly affects whether the observed gains can be attributed to the information-theoretic principle.
minor comments (2)
  1. [Notation and equations] Notation for the mutual information objective and log-determinant approximation could be cross-referenced more explicitly to the relevant equations to improve readability.
  2. [Abstract] The abstract states 'extensive experiments across multiple models and long-context benchmarks' but does not name the specific models or benchmarks; adding these details would help readers assess scope.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback. The comments highlight important aspects of the surrogate model's validity and experimental reporting that we will address in the revision. Below we respond point by point to the major comments.

Point-by-point responses
  1. Referee: [Theoretical derivation (linear-Gaussian surrogate and MI objective)] The linear-Gaussian surrogate (derivation in the main theoretical section) is load-bearing for both the closed-form MI objective and the claim that existing policies approximate capacity maximization. Real attention is query-dependent, uses softmax normalization, and exhibits heavy-tailed behavior absent from the surrogate; without explicit validation that the surrogate induces similar token-importance orderings or capacity estimates as deployed transformers, the unification and CapKV superiority rest on an untested modeling assumption rather than properties of actual LLMs.

    Authors: We agree that the linear-Gaussian surrogate is a simplifying assumption whose fidelity to real softmax attention merits explicit checking. The unification of eviction policies and the derivation of the capacity objective are formally valid under this surrogate, as stated in the theoretical section; we do not claim exactness for deployed transformers. To strengthen the link to practice, we will add a new subsection that compares token-importance rankings produced by the surrogate's leverage scores against empirical attention scores on representative long-context inputs from the evaluation benchmarks. This analysis will quantify the alignment (or divergence) in ordering and will be used to qualify the scope of the unification claim. revision: yes

  2. Referee: [Experiments and results] Table of experimental results: while CapKV is reported to outperform baselines, the paper does not report whether the surrogate parameters are estimated from data or fixed, nor does it include an ablation measuring how closely the surrogate's leverage-score ordering matches empirical attention scores on the same inputs. This gap directly affects whether the observed gains can be attributed to the information-theoretic principle.

    Authors: We will clarify in the revised experimental section that the surrogate parameters (means and covariances of the linear-Gaussian model) are fixed according to the theoretical derivation rather than estimated from data. We will also add the requested ablation: for a subset of evaluation sequences we will report the Spearman rank correlation between the surrogate leverage scores and the actual attention weights, together with the resulting performance difference when eviction is performed using the empirical attention scores instead of the surrogate scores. These additions will make the attribution of gains to the information-theoretic objective more transparent. revision: yes
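The proposed ablation statistic is easy to compute without extra dependencies; a minimal Spearman rank-correlation helper (assuming continuous, tie-free scores, which leverage scores and raw attention weights effectively are):

```python
import numpy as np

def rankdata(a):
    """Ranks 0..n-1; ties broken arbitrarily (fine for continuous scores)."""
    r = np.empty(len(a))
    r[np.argsort(a)] = np.arange(len(a))
    return r

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the rank vectors."""
    rx, ry = rankdata(x), rankdata(y)
    rx, ry = rx - rx.mean(), ry - ry.mean()
    return float(rx @ ry / np.sqrt((rx @ rx) * (ry @ ry)))

# A monotone transform preserves ranks, so rho = 1 regardless of nonlinearity.
x = np.array([0.1, 0.5, 0.2, 0.9, 0.4])
print(spearman(x, x ** 3))   # 1.0 (cubing preserves order for positive values)
```

Libraries such as SciPy provide the same statistic with tie handling and p-values; the hand-rolled version above is only meant to show that the rebuttal's ablation adds negligible implementation cost.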

Circularity Check

0 steps flagged

Derivation from IB principle under linear-Gaussian surrogate is independent

Full rationale

The paper derives a closed-form mutual information objective directly from the Information Bottleneck principle applied to a linear-Gaussian surrogate model of attention. This is a first-principles theoretical step with no fitting to data or renaming of empirical patterns. Existing eviction policies are interpreted as approximations only after the derivation is complete, and CapKV is proposed from the resulting log-determinant leverage scores. No load-bearing equation reduces by construction to its inputs, and the central claim does not rely on self-citations or fitted parameters presented as predictions. Empirical results function solely as validation.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The paper's claims rest on the Information Bottleneck principle and the linear-Gaussian surrogate assumption for attention, which allow the derivation but require validation in real models.

axioms (1)
  • domain assumption The attention mechanism can be accurately approximated by a linear-Gaussian model for the purpose of information-capacity calculation.
    This surrogate enables the closed-form derivation of the mutual information objective.

pith-pipeline@v0.9.0 · 5478 in / 1319 out tokens · 89933 ms · 2026-05-07T16:37:20.561692+00:00 · methodology

