pith. machine review for the scientific record.

arxiv: 2605.08234 · v1 · submitted 2026-05-07 · 💻 cs.LG · cs.AI

Recognition: no theorem link

When Does Value-Aware KV Eviction Help? A Fixed-Contract Diagnostic for Non-Monotone Cache Compression

Authors on Pith: no claims yet

Pith reviewed 2026-05-12 01:42 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords KV cache compression · value-aware eviction · long-context LLM inference · fixed-contract diagnostic · cache eviction probe · LongBench evaluation · non-monotone cache compression

The pith

Value-aware KV eviction improves cache compression only when it recovers decode-side evidence first, then ranks output value, and preserves coupled evidence.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a fixed-contract diagnostic that keeps a KV selector's setup unchanged while altering one decision slot at a time to isolate why eviction succeeds or fails. The diagnostic uses a probe that adds a block's attention mass to the estimated change in final output when that block is removed. Across LongBench tasks with three models and two cache budgets, this probe aligns with positive performance margins in 72.6 percent of helpful cases and only 32.4 percent of non-helpful cases. The resulting ordering requires selectors to recover evidence needed for future decoding, rank its effect on the output, and avoid breaking related evidence when fitting the cache into a small budget. Task accuracy alone cannot reveal these separate failure modes, so the diagnostic explains when value-aware methods actually reduce memory cost without hurting accuracy.
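To make the probe concrete, here is a minimal sketch in the spirit of that description. The abstract only says the probe adds a block's attention mass to the estimated output change from removing the block; the function names, the logit-difference estimator, and the weighting term alpha below are illustrative assumptions, not the paper's implementation.

    import numpy as np

    def block_attention_mass(attn, block_slice):
        # Total attention mass that decode-side queries place on one prefill block.
        # attn: [num_queries, num_keys] attention weights aggregated over heads/layers (assumed layout).
        return attn[:, block_slice].sum()

    def estimated_output_change(logits_full, logits_without_block):
        # Estimated effect of removing the block on the final output, approximated
        # here as the L1 change in next-token logits; the paper's exact estimator
        # is not specified in the abstract.
        return np.abs(logits_full - logits_without_block).sum()

    def value_probe(attn, block_slice, logits_full, logits_without_block, alpha=1.0):
        # Hypothetical value-ranking probe: attention mass plus (weighted) estimated
        # output change from block removal.
        return (block_attention_mass(attn, block_slice)
                + alpha * estimated_output_change(logits_full, logits_without_block))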

Core claim

A selector can fail by missing needed evidence, scoring tokens that do not change the output, or breaking related evidence when compressing the cache. The fixed-contract probe, which combines attention mass with the estimated output change from block removal, is positive on 72.6 percent of positive-margin cells and 32.4 percent of nonpositive-margin cells on LongBench. NeedleBench M-RT at 32k and a RULER 8k check confirm that the probe's support holds under branched retrieval. A 264-cell sign evaluation separates support recovery and output-value ranking from boundary leverage effects. The resulting order is to recover decode-side evidence, rank its output value, and preserve coupled evidence during projection.
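The two percentages read as sign-agreement rates between the probe and the per-cell performance margin. A sketch of how such rates could be tabulated follows; treating each model-budget-task combination as a cell and thresholding at zero are assumptions drawn from the abstract's wording, not a documented protocol.

    import numpy as np

    def sign_agreement(probe_scores, margins):
        # Rate of positive probe values among positive-margin cells, and among
        # nonpositive-margin cells (the 72.6% / 32.4% style of summary).
        probe_scores = np.asarray(probe_scores, dtype=float)
        margins = np.asarray(margins, dtype=float)
        pos = margins > 0
        rate_pos = (probe_scores[pos] > 0).mean() if pos.any() else float("nan")
        rate_nonpos = (probe_scores[~pos] > 0).mean() if (~pos).any() else float("nan")
        return rate_pos, rate_nonpos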

What carries the argument

The fixed-contract diagnostic, which holds the selector setup fixed and changes one decision slot at a time, together with a value probe that merges block attention mass and estimated output change from removal.
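Read literally, "holds the selector setup fixed and changes one decision slot at a time" is a one-factor ablation over named slots. The sketch below shows that shape; the slot names (support recovery, value ranking, projection), the contract fields, and the evaluate callback are assumptions for illustration, not the paper's interface.

    from dataclasses import dataclass, replace
    from typing import Callable, Dict

    @dataclass(frozen=True)
    class SelectorContract:
        # Everything held fixed across runs (assumed fields).
        cache_budget: float
        observation_window: int

    @dataclass(frozen=True)
    class DecisionSlots:
        # The decision slots swapped one at a time (names assumed).
        support_recovery: str   # how decode-side evidence is recovered
        value_ranking: str      # how retained blocks are scored
        projection: str         # how scores are fit into the cache budget

    def fixed_contract_diagnostic(contract: SelectorContract,
                                  baseline: DecisionSlots,
                                  alternatives: Dict[str, str],
                                  evaluate: Callable[[SelectorContract, DecisionSlots], float]) -> Dict[str, float]:
        # Swap exactly one slot per run, keep the contract and every other slot at
        # its baseline value, and report the margin against the baseline run.
        base_score = evaluate(contract, baseline)
        margins = {}
        for slot_name, alt_value in alternatives.items():
            variant = replace(baseline, **{slot_name: alt_value})  # keys must be DecisionSlots field names
            margins[slot_name] = evaluate(contract, variant) - base_score
        return margins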

If this is right

  • The probe aligns with positive margins on 72.6 percent of helpful cells and 32.4 percent of non-helpful cells across three models and two budgets.
  • The probe maintains support under branched retrieval on NeedleBench M-RT at 32k and RULER 8k.
  • A 264-cell sign evaluation isolates support recovery, output-value ranking, and leverage effects near cache boundaries.
  • Selectors must follow the sequence of evidence recovery, then value ranking, then coupled-evidence preservation.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same isolation approach could be used to diagnose failures in other KV compression methods such as quantization or merging.
  • If the output-change estimate remains reliable at scale, it could support dynamic cache policies that adapt eviction mid-generation.
  • Hybrid selectors might be built by composing separate modules for each step in the identified order rather than learning a single scoring function.
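The last extension above invites a sketch of what a composed selector could look like under the recover-rank-preserve ordering; every interface here is invented for illustration and is not something the paper specifies.

    from typing import Callable, List, Sequence, Set

    # Hypothetical stage interfaces for a composed selector (not from the paper).
    RecoverFn = Callable[[Sequence[int]], Set[int]]     # blocks needed for future decoding
    RankFn = Callable[[Set[int]], List[int]]            # blocks ordered by estimated output value
    PreserveFn = Callable[[List[int], int], List[int]]  # fit to budget without splitting coupled evidence

    def composed_selector(blocks: Sequence[int], budget: int,
                          recover: RecoverFn, rank: RankFn, preserve: PreserveFn) -> List[int]:
        # Apply the three stages in the order the diagnostic identifies:
        # recover decode-side evidence, rank its output value, preserve coupled evidence.
        support = recover(blocks)
        ranked = rank(support)
        return preserve(ranked, budget)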

Load-bearing premise

The estimated output change from removing a block accurately captures its true value to future decoding without confounding the fixed-contract isolation.

What would settle it

A case on LongBench where the probe's estimated output change from block removal does not match the actual output difference observed when that block is evicted during real decoding.
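A minimal sketch of that settling test, assuming logit-level outputs are comparable across the two conditions: estimate the output change with the probe, then evict the block for real during decoding and compare. The distance function and tolerance are placeholders, not the paper's criteria.

    import numpy as np

    def probe_vs_real_eviction(est_output_change, logits_with_block, logits_after_real_eviction, rel_tol=0.25):
        # Flag a case where the probe's estimate disagrees with the change actually
        # observed under real eviction; a systematic pattern of such flags on
        # LongBench would be the counterexample described above.
        actual_change = np.abs(logits_with_block - logits_after_real_eviction).sum()
        disagreement = abs(est_output_change - actual_change)
        return disagreement > rel_tol * max(actual_change, 1e-8)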

Figures

Figures reproduced from arXiv: 2605.08234 by Da Chang, Fanqi Kong, Haozhe Liang, Huaxiao Yin, Li Hu, Ruijie Zhang, Yu Li.

Figure 1. Compression is not monotone. A compressed cache can underperform, match, or exceed …
Figure 2. Stage II value-channel re-ranking shifts retained mass across prefill blocks (Qwen3-8B, HotpotQA, …).
Figure 3. Stage II prediction and value-channel control under a fixed observation-window contract. Panels …
Figure 4. Stage III separability under block projection. (a) Strict-block lattice slack across the 96-cell grid, …
Figure 5. Separates the residual mean from the variance after count debiasing. The mean correction bends mass toward the prompt tail, while the variance penalty appears mainly in the same tail region. Panel (a): post-debias mean ratio by position (Llama-3.1-8B, Qwen3-8B, Mistral-7B-v0.3); panel (b): …
Figure 6. Effective head participation on the 8k NIAH retrieval check.
Figure 7. Layer-scope sensitivity of the Module I access-support measurement. NDCG@5% measures rank …
Figure 8. Stage I access gaps predict count-debias repair.
Figure 9. Boundary-heavy budget trajectory heatmap.
Original abstract

Long-context LLM inference is bottlenecked by the memory and bandwidth cost of reading large KV caches during decoding. KV compression reduces this cost by keeping only part of the cache, but task accuracy alone does not identify why a selector succeeds or fails. A selector can fail at three steps: it may miss the evidence future decoding needs, give high scores to tokens that do not affect the output, or break related evidence when fitting scores into a small cache. We introduce a fixed-contract diagnostic that holds the selector's setup fixed and changes one decision slot at a time. For value ranking, the probe combines a block's attention mass with the estimated output change from removing it. On LongBench across three models and two budgets, the probe is positive on 72.6% of positive-margin cells and 32.4% of nonpositive-margin cells. NeedleBench M-RT at 32k and a RULER 8k check probe support closure under branched retrieval, and a 264-cell sign evaluation separates support recovery and output-value ranking from leverage effects near the boundary. The resulting order is to recover decode-side evidence, rank its output value, and preserve coupled evidence during projection.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces a fixed-contract diagnostic for evaluating KV cache compression selectors in long-context LLMs. It identifies three potential failure points in selectors: missing necessary evidence, assigning high scores to low-impact tokens, and disrupting coupled evidence when compressing. The diagnostic holds the selector's contract fixed while varying one decision slot at a time. For value ranking, the probe integrates attention mass with the estimated change in output from removing a block. Empirical evaluation on LongBench across three models and two budgets shows the probe is positive on 72.6% of positive-margin cells and 32.4% of nonpositive-margin cells. Additional checks on NeedleBench and RULER support the findings, leading to the ordering: recover decode-side evidence, rank its output value, and preserve coupled evidence during projection.

Significance. If the diagnostic's isolation procedure holds, this work offers a valuable tool for dissecting why certain KV eviction methods succeed or fail, moving beyond aggregate accuracy metrics. It could inform the design of more robust cache compression strategies for efficient LLM inference, particularly in identifying when value-aware approaches provide benefits. The fixed-contract approach and multi-benchmark validation are strengths.

major comments (2)
  1. [Value-ranking probe and LongBench evaluation] The value-ranking probe (abstract and experimental results) combines attention mass with estimated output change from block removal under fixed-contract isolation. The removal simulation may introduce confounding interactions such as altered attention patterns or new token dependencies not present in the original cache, potentially misclassifying the true marginal value. This directly affects the reliability of the reported 72.6% vs 32.4% separation on LongBench and the derived ordering.
  2. [LongBench results] LongBench results (across three models and two budgets): the percentages 72.6% and 32.4% are presented without error bars, confidence intervals, details on data exclusion rules, or full experimental protocol. This makes it difficult to assess statistical significance and robustness of the central empirical claim.
minor comments (2)
  1. [Abstract] The abstract references 'a 264-cell sign evaluation' without elaboration; adding a brief description or pointer to the relevant section would improve clarity.
  2. [Experimental sections] Consider including exact model names, context lengths, and budget sizes (e.g., in a table) when summarizing the LongBench, NeedleBench, and RULER results for improved reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on the fixed-contract diagnostic and its empirical validation. We address each major comment below, clarifying the design choices in the value-ranking probe and committing to improved statistical reporting for the LongBench results.

read point-by-point responses
  1. Referee: [Value-ranking probe and LongBench evaluation] The value-ranking probe (abstract and experimental results) combines attention mass with estimated output change from block removal under fixed-contract isolation. The removal simulation may introduce confounding interactions such as altered attention patterns or new token dependencies not present in the original cache, potentially misclassifying the true marginal value. This directly affects the reliability of the reported 72.6% vs 32.4% separation on LongBench and the derived ordering.

    Authors: The fixed-contract diagnostic deliberately isolates one decision slot while holding the selector's overall contract (cache size, eviction policy for remaining tokens) fixed, which is intended to reduce the scope of secondary interactions compared to full re-encoding. The probe further combines attention mass with a direct estimate of output logit change under this isolation to approximate marginal value. We agree that residual confounds from attention redistribution cannot be entirely eliminated in simulation. In revision we will add an explicit limitations paragraph discussing this point, together with the supporting evidence from the NeedleBench M-RT and RULER checks that the ordering remains consistent under branched retrieval. We do not claim the probe is an oracle, only that it yields a useful diagnostic separation (72.6 % positive-margin vs 32.4 % non-positive-margin cells) that is corroborated across benchmarks. revision: partial

  2. Referee: [LongBench results] LongBench results (across three models and two budgets): the percentages 72.6% and 32.4% are presented without error bars, confidence intervals, details on data exclusion rules, or full experimental protocol. This makes it difficult to assess statistical significance and robustness of the central empirical claim.

    Authors: We accept this criticism. The revised manuscript will report bootstrap confidence intervals for both percentages, state the exact cell-inclusion criteria (minimum 50 cells per model-budget pair), and move the complete experimental protocol—including model versions, random seeds, and token-removal simulation details—into a new appendix section. These additions will allow direct assessment of the robustness of the reported separation. revision: yes
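For the promised confidence intervals, a standard percentile bootstrap over per-cell indicators would suffice; the resample count, interval level, and cell encoding below are conventional choices, not details taken from the paper or the rebuttal.

    import numpy as np

    def bootstrap_ci(cell_indicators, n_resamples=10_000, level=0.95, seed=0):
        # Percentile-bootstrap CI for a proportion such as "probe positive on
        # positive-margin cells"; cell_indicators holds a 0/1 value per cell.
        rng = np.random.default_rng(seed)
        x = np.asarray(cell_indicators, dtype=float)
        stats = np.array([rng.choice(x, size=x.size, replace=True).mean()
                          for _ in range(n_resamples)])
        lo, hi = np.quantile(stats, [(1 - level) / 2, 1 - (1 - level) / 2])
        return x.mean(), (lo, hi)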

Circularity Check

0 steps flagged

No significant circularity; claims are direct empirical measurements

full rationale

The paper presents a fixed-contract diagnostic consisting of controlled experiments that hold the selector setup fixed while altering one decision slot at a time. The reported probe results (positive on 72.6% of positive-margin cells and 32.4% of nonpositive-margin cells on LongBench, plus NeedleBench and RULER checks) are direct observations from these held-fixed runs across models and budgets. No equations, derivations, fitted parameters renamed as predictions, or self-citation chains appear in the abstract or description. The central ordering (recover evidence, rank value, preserve coupled evidence) follows from the sign evaluations rather than reducing to inputs by construction. This is self-contained empirical work with no load-bearing self-referential steps.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axiom · 0 invented entities

The work rests on standard assumptions of transformer KV caching and attention-based importance; no new physical entities or ad-hoc constants are introduced beyond the diagnostic construction itself.

free parameters (2)
  • cache budget sizes
    Two specific budgets chosen for the LongBench experiments; values not derived from first principles.
  • model selection
    Three unnamed models used; choice affects generalizability.
axioms (1)
  • domain assumption: Holding the selector setup fixed while varying one decision slot isolates the contribution of that slot.
    Core premise of the fixed-contract diagnostic stated in the abstract.

pith-pipeline@v0.9.0 · 5532 in / 1368 out tokens · 54738 ms · 2026-05-12T01:42:50.793562+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

55 extracted references · 55 canonical work pages · 9 internal anchors

  1. [1] Yushi Bai, Xin Lv, Jiajie Zhang, Hongchang Lyu, Jiankai Tang, Zhidian Huang, Zhengxiao Du, Xiao Liu, Aohan Zeng, Lei Hou, et al. LongBench: A bilingual, multitask benchmark for long context understanding. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 3119–3137, 2024.
  2. [2] Z. Cai, Y. Zhang, B. Gao, Y. Liu, T. Liu, K. Lu, W. Xiong, Y. Dong, B. Chang, J. Hu, and W. Xiao. PyramidKV: Dynamic KV cache compression based on pyramidal information funneling. In Proceedings of the Second Conference on Language Modeling (COLM), 2025.
  3. [3] Z. Cai, W. Xiao, H. Sun, C. Luo, Y. Zhang, K. Wan, Y. Li, Y. Zhou, L.-W. Chang, J. Gu, Z. Dong, A. Anandkumar, A. Asi, and J. Hu. R-KV: Redundancy-aware KV cache compression for reasoning models. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025. URL https://openreview.net/forum?id=2jwAjomEDB.
  4. [4] Y. Gu, Z. Jiang, J. Jin, K. Guo, Z. Zhang, and X. Xu. AhaKV: Adaptive Holistic Attention-Driven KV Cache Eviction for Efficient Inference of Large Language Models. arXiv preprint arXiv:2506.03762, 2025.
  5. [5] A. Chen, R. Geh, A. Grover, G. Van den Broeck, and D. Israel. The Pitfalls of KV Cache Compression. arXiv preprint arXiv:2510.00231, 2025.
  6. [6] H. Chen, X. Liu, Y. Liu, J. Jiang, B. He, and X. Liu. FleetOpt: Analytical fleet provisioning for LLM inference with compress-and-route as implementation mechanism. arXiv preprint arXiv:2603.16514, 2026.
  7. [7] K. Clark, U. Khandelwal, O. Levy, and C. D. Manning. What does BERT look at? An analysis of BERT's attention. In Proceedings of the 2019 ACL Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pages 276–286, 2019.
  8. [8] Y. An, C. Lu, K. Zhu, T. Yu, C. Zhao, H. Wu, M. Tang, and J. Wang. ReST-KV: Robust KV cache eviction with layer-wise output reconstruction and spatial-temporal smoothing. In The Fourteenth International Conference on Learning Representations, 2026.
  9. [9] Y. Bai, Q. Dong, T. Jiang, X. Lv, Z. Du, A. Zeng, J. Tang, and J. Li. IndexCache: Accelerating Sparse Attention via Cross-Layer Index Reuse. arXiv preprint arXiv:2603.12201, 2026.
  10. [10] A. Bocharnikov, I. Ermakov, D. Kuznedelev, V. Zhdanovskiy, and Y. Yershov. KV Cache Offloading for Context-Intensive Tasks. arXiv preprint arXiv:2604.08426, 2026.
  11. [11] K. Team, G. Chen, Y. Zhang, J. Su, W. Xu, S. Pan, Y. Wang, Y. Wang, G. Chen, B. Yin, Y. Chen, J. Yan, M. Wei, et al. Attention residuals. arXiv preprint arXiv:2603.15031, 2026.
  12. [12] J. Yuan, H. Gao, D. Dai, J. Luo, L. Zhao, Z. Zhang, Z. Xie, Y. Wei, L. Wang, Z. Xiao, Y. Wang, C. Ruan, M. Zhang, W. Liang, and W. Zeng. Native sparse attention: Hardware-aligned and natively trainable sparse attention. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 23078–23097, 2025.
  13. [13] DeepSeek-AI. DeepSeek-V4: Towards highly efficient million-token context intelligence. Technical report, 2026. URL https://huggingface.co/deepseek-ai/DeepSeek-V4-Pro/resolve/main/DeepSeek_V4.pdf.
  14. [14] A. Devoto, M. Jeblick, and S. Jégou. Expected attention: KV cache compression by estimating attention from future queries distribution. arXiv preprint arXiv:2510.00636, 2025.
  15. [15] Z. Dong, P. Liu, J. Li, Z. Chen, H. Peng, S. Wang, and W. X. Zhao. ForesightKV: Optimizing KV cache eviction for reasoning models by learning long-term contribution. arXiv preprint arXiv:2602.03203, 2026.
  16. [16] R. Goel, J. Park, M. Gagrani, D. Jones, M. Morse, H. Langston, M. Lee, and C. Lott. CAOTE: KV cache selection for LLMs via attention output error-based token eviction. arXiv preprint arXiv:2504.14051, 2025.
  17. [17] Y. Feng, J. Lv, Y. Cao, X. Xie, and S. K. Zhou. Ada-KV: Optimizing KV cache eviction by adaptive budget allocation for efficient LLM inference. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025. URL https://openreview.net/forum?id=tcisuhGsQZ.
  18. [18] Y. Feng, J. Lv, Y. Cao, X. Xie, and S. K. Zhou. Identify critical KV cache in LLM inference from an output perturbation perspective. arXiv preprint arXiv:2502.03805, 2025.
  19. [19] S. Ge, Y. Zhang, L. Liu, M. Zhang, J. Han, and J. Gao. Model tells you what to discard: Adaptive KV cache compression for LLMs. In International Conference on Learning Representations, 2024.
  20. [20] A. Gu and T. Dao. Mamba: Linear-Time Sequence Modeling with Selective State Spaces. In First Conference on Language Modeling, 2024. URL https://openreview.net/forum?id=tEYskw1VY2.
  21. [21] X. Li, X. Jin, and L. Zhang. GraphKV: Breaking the static selection paradigm with graph-based KV cache eviction. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 21899–21909, Suzhou, China, 2025. Association for Computational Linguistics.
  22. [22] Coleman Richard Charles Hooper, S. Kim, H. Mohammadzadeh, M. W. Mahoney, S. Shao, K. Keutzer, and A. Gholami. KVQuant: Towards 10 million context length LLM inference with KV cache quantization. In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024. URL https://openreview.net/forum?id=0LXotew9Du.
  23. [23] K. Huang, H. Meng, J. Wu, J. Lu, C. Ma, Z. Chen, X. Wang, B. Ding, J. Wu, X. Wang, X. He, G. Wang, and J. Zhou. Beyond Magnitude: Leveraging Direction of RLVR Updates for LLM Reasoning. In The Fourteenth International Conference on Learning Representations, 2026. URL https://openreview.net/forum?id=r6Pw3RiMYL.
  24. [24] Cheng-Ping Hsieh, Simeng Sun, Samuel Kriman, Shantanu Acharya, Dima Rekesh, Fei Jia, Yang Zhang, and Boris Ginsburg. RULER: What's the real context size of your long-context language models? arXiv preprint arXiv:2404.06654, 2024. COLM 2024. URL https://arxiv.org/abs/2404.06654.
  25. [25] Mo Li, Songyang Zhang, Taolin Zhang, Haodong Duan, Yunxin Liu, and Kai Chen. NeedleBench: Evaluating LLM retrieval and reasoning across varying information densities. Transactions on Machine Learning Research, 2025. URL https://mlanthology.org/tmlr/2025/li2025tmlr-needlebench/.
  26. [26] Y. Li, Y. Huang, B. Yang, B. Venkitesh, A. Locatelli, H. Ye, T. Cai, P. Lewis, and D. Chen. SnapKV: LLM knows what you are looking for before generation. In Advances in Neural Information Processing Systems, volume 37, pages 22947–22970, 2024. Curran Associates, Inc.
  27. [27] Z. Liu, J. Yuan, H. Jin, S. Zhong, Z. Xu, V. Braverman, B. Chen, and X. Hu. KIVI: A tuning-free asymmetric 2bit quantization for KV cache. In Proceedings of the 41st International Conference on Machine Learning, volume 235 of Proceedings of Machine Learning Research, pages 32332–32344. PMLR, 2024.
  28. [28] P. Liu, J. Liu, X. Qiu, and X. Huang. Beyond Attention Magnitude: Leveraging Inter-layer Rank Consistency for Efficient Vision-Language-Action Models. arXiv preprint arXiv:2603.24941, 2026.
  29. [29] L. Lu, K. Qiu, J. Zhou, J. Kai, H. Zhang, H. Wang, J. Leng, Z. He, and Z. Lin. One size does not fit all: Token-wise adaptive compression for KV cache. arXiv preprint arXiv:2603.04411, 2026.
  30. [30] P. Nawrot, A. Łańcucki, M. Chochowski, D. Tarjan, and E. M. Ponti. Dynamic Memory Compression: Retrofitting LLMs for Accelerated Inference. In Forty-first International Conference on Machine Learning, 2024. URL https://openreview.net/forum?id=tDRYrAkOB7.
  32. [32] M. Oren, M. Hassid, N. Yarden, Y. Adi, and R. Schwartz. Transformers are multi-state RNNs. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 18724–18741. Association for Computational Linguistics, 2024.
  33. [33] H. Tang, Y. Lin, J. Lin, Q. Han, D. Ke, S. Hong, Y. Yao, and G. Wang. RazorAttention: Efficient KV cache compression through retrieval heads. In The Thirteenth International Conference on Learning Representations, 2025.
  34. [34] R. Taniguchi, Y. Dong, M. Onizuka, and C. Xiao. Adaptive Layer Selection for Layer-Wise Token Pruning in LLM Inference. arXiv preprint arXiv:2601.07667, 2026.
  35. [35] C.-C. Chang, W.-C. Lin, C.-Y. Lin, C.-Y. Chen, Y.-F. Hu, P.-S. Wang, N.-C. Huang, L. Ceze, M. S. Abdelfattah, and K.-C. Wu. Palu: KV-cache compression with low-rank projection. In The Thirteenth International Conference on Learning Representations, 2025.
  36. [36] J. Y. Yang, B. Kim, J. Bae, B. Kwon, G. Park, E. Yang, S. J. Kwon, and D. Lee. No token left behind: Reliable KV cache compression via importance-aware mixed precision quantization. arXiv preprint arXiv:2402.18096, 2024.
  37. [37] J.-H. Kim, J. Kim, S. Kwon, J. W. Lee, S. Yun, and H. O. Song. KVzip: Query-agnostic KV cache compression with context reconstruction. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025.
  38. [38] Z. Qin, Y. Cao, M. Lin, W. Hu, S. Fan, K. Cheng, W. Lin, and J. Li. CAKE: Cascading and adaptive KV cache eviction with layer preferences. In The Thirteenth International Conference on Learning Representations, 2025.
  39. [39] J. Lei and S. Ilager. ARKV: Adaptive and Resource-Efficient KV Cache Management under Limited Memory Budget for Long-Context Inference in LLMs. arXiv preprint arXiv:2603.08727, 2026.
  40. [40] A. Yang, A. Li, B. Yang, B. Zhang, B. Hui, B. Zheng, B. Yu, C. Gao, C. Huang, C. Lv, et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025.
  41. [41] A. Sood, T. Sharma, and V. Agrawal. More Than a Quick Glance: Overcoming the Greedy Bias in KV-Cache Compression. arXiv preprint arXiv:2602.02199, 2026.
  42. [42] Z. Su, H. Zhang, W. Wu, Y. Zhang, Y. Liu, H. Xiao, Q. Yang, Y. Sun, R. Yang, C. Zhang, K. Fan, W. Ye, J. Xiong, H. Shen, C. Tao, T. Wu, Z. Wan, Y. Qian, Y. Xie, and N. Wong. Attention sink in transformers: A survey on utilization, interpretation, and mitigation. arXiv preprint arXiv:2604.10098, 2026.
  43. [43] J. Singh and D. Hakkani-Tür. Do LLMs Encode Functional Importance of Reasoning Tokens? arXiv preprint arXiv:2601.03066, 2026.
  44. [44] A. Grattafiori, A. Dubey, A. Jauhri, A. Pandey, A. Kadian, A. Al-Dahle, A. Letman, A. Mathur, A. Schelten, A. Vaughan, et al. The Llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024.
  45. [45] A. Q. Jiang et al. Mistral 7B. arXiv preprint arXiv:2310.06825, 2023.
  46. [46] Z. Tian, Y. Su, J. Li, and M. Zhang. Where Matters More Than What: Decoding-aligned KV Cache Compression via Position-aware Pseudo Queries. arXiv preprint arXiv:2603.11564, 2026.
  47. [47] G. Xiao, Y. Tian, B. Chen, S. Han, and M. Lewis. Efficient Streaming Language Models with Attention Sinks. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=NG7sS51zVF.
  48. [48] Z. Guo, H. Kamigaito, and T. Watanabe. Attention score is not all you need for token importance indicator in KV cache reduction: Value also matters. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 21158–21166, 2024.
  49. [49] K. Zhao, W. Yuan, Y. Lin, L. Ruan, X. Lu, D.-P. Fan, M.-M. Cheng, and D. Zeng. Attention Debiasing for Token Pruning in Vision Language Models. arXiv preprint arXiv:2508.17807, 2025.
  50. [50] Y. Feng, H. Guo, J. Lv, S. K. Zhou, and X. Xie. DefensiveKV: Taming the Fragility of KV Cache Eviction in LLM Inference. In The Fourteenth International Conference on Learning Representations, 2026. URL https://openreview.net/forum?id=nJgS06sX3O.
  52. [52] J. Ahn, I. Seong, A. Kedia, J. Kim, H. Jang, K. Lee, and Y. Jeon. LookaheadKV: Fast and accurate KV cache eviction by glimpsing into the future without generation. In The Fourteenth International Conference on Learning Representations, 2026. URL https://openreview.net/forum?id=RVLMGPXt2i.
  53. [53] Y. Wang, S. Ji, Y. Liu, Y. Xu, Y. Xu, Q. Zhu, and W. Che. Lookahead Q-Cache: Achieving More Consistent KV Cache Eviction via Pseudo Query. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 34146–34162, 2025.
  54. [54] Z. Zhang, Y. Sheng, T. Zhou, T. Chen, L. Zheng, R. Cai, Z. Song, Y. Tian, C. Re, C. Barrett, Z. Wang, and B. Chen. H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models. In Thirty-seventh Conference on Neural Information Processing Systems, 2023. URL https://openreview.net/forum?id=RkRrPp7GKO.
