Recognition: no theorem link
Self-Pruned Key-Value Attention: Learning When to Write by Predicting Future Utility
Pith reviewed 2026-05-15 05:30 UTC · model grok-4.3
The pith
A lightweight utility predictor scores each key-value pair and decides whether to retain it in the cache, achieving dynamic 3- to 10-fold compression.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Self-Pruned Key-Value Attention trains a lightweight utility predictor jointly with the language model to forecast the future utility of each key-value pair. Pairs whose score exceeds a threshold are written to the long-term cache and participate in global attention; all others are dropped after the local window. Training uses only next-token prediction loss, and the resulting mechanism produces input-dependent sparsification that typically shrinks the KV cache by a factor of 3 to 10 while preserving model performance.
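To make the write rule concrete, here is a minimal decode-time sketch of such gating in PyTorch; the scorer architecture, window size, and threshold (`tau`) are illustrative assumptions rather than the paper's configuration.

```python
# Minimal sketch of the SP-KV write rule: recent pairs stay in a local window,
# and a pair that ages out is promoted to the global cache only if a lightweight
# scorer rates it above a threshold. Not the authors' implementation.
import torch

torch.manual_seed(0)
d_head, window, tau = 64, 4, 0.5
utility_predictor = torch.nn.Sequential(          # toy one-layer scorer with output in (0, 1)
    torch.nn.Linear(2 * d_head, 1), torch.nn.Sigmoid()
)

local_cache, global_cache = [], []                # recent window vs. long-term KV store

def write_kv(k: torch.Tensor, v: torch.Tensor) -> None:
    """Append (k, v) to the local window; when a pair ages out, keep it
    globally only if its predicted utility exceeds tau."""
    local_cache.append((k, v))
    if len(local_cache) > window:
        old_k, old_v = local_cache.pop(0)
        score = utility_predictor(torch.cat([old_k, old_v]))
        if score.item() > tau:                    # hard keep/drop decision
            global_cache.append((old_k, old_v))

for _ in range(16):                               # simulate 16 decoding steps
    write_kv(torch.randn(d_head), torch.randn(d_head))

print(f"kept {len(global_cache)} of {16 - window} aged-out pairs; "
      f"{len(local_cache)} pairs remain in the local window")
```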
What carries the argument
The lightweight utility predictor that scores each key-value pair for future utility and gates whether it is written to the long-term cache.
Load-bearing premise
A small predictor trained only on next-token loss can accurately identify which past key-value pairs the model will need later without introducing errors or overhead.
What would settle it
On long held-out sequences, compare model accuracy with the predictor enabled against accuracy with every key-value pair retained; a clear accuracy drop on inputs where the predictor prunes pairs that later receive high attention would falsify the claim.
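One way this test could be operationalized is sketched below with synthetic per-sequence arrays; the function, its thresholds (`delta_tol`, `corr_tol`), and the input format are assumptions for illustration, not the paper's protocol.

```python
# Hedged sketch of the falsification test: pruning is suspect if the loss gap
# versus the full cache is large, or if the gap grows with the fraction of
# later-high-attention pairs that were pruned.
import numpy as np

def pruning_looks_harmless(loss_full, loss_pruned, high_attn_pruned_frac,
                           delta_tol=0.05, corr_tol=0.3):
    gap = np.asarray(loss_pruned) - np.asarray(loss_full)    # per-sequence loss gap
    corr = np.corrcoef(gap, high_attn_pruned_frac)[0, 1]     # gap vs. pruning of useful pairs
    return gap.mean() < delta_tol and corr < corr_tol

# toy stand-in numbers, only to show the shape of the check
rng = np.random.default_rng(0)
loss_full = rng.normal(2.0, 0.1, size=200)
loss_pruned = loss_full + rng.normal(0.01, 0.02, size=200)
frac_pruned_useful = rng.uniform(0.0, 0.2, size=200)
print(pruning_looks_harmless(loss_full, loss_pruned, frac_pruned_useful))
```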
read the original abstract
Under modern test-time compute and agentic paradigms, language models process ever-longer sequences. Efficient text generation with transformer architectures is increasingly constrained by the Key-Value cache memory footprint and bandwidth. To address this limitation, we introduce Self-Pruned Key-Value Attention (SP-KV), a mechanism designed to predict future KV utility in order to reduce the size of the long-term KV cache. This strategy operates at a fine granularity: a lightweight utility predictor scores each key-value pair, and while recent KVs are always available via a local window, older pairs are written in the cache and used in global attention only if their predicted utility surpasses a given threshold. The LLM and the utility predictor are trained jointly end-to-end exclusively through next-token prediction loss, and are adapted from pretrained LLM checkpoints. Rather than enforcing a fixed compression ratio, SP-KV performs dynamic sparsification: the mechanism adapts to the input and typically reduces the KV cache size by a factor of $3$ to $10\times$, longer sequences often being more compressible. This leads to vast improvements in memory usage and decoding speed, with little to no degradation of validation loss nor performance on a broad set of downstream tasks. Beyond serving as an effective KV-cache reduction mechanism, our method reveals structured layer- and head-specific sparsity patterns that we can use to guide the design of hybrid local-global attention architectures.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces Self-Pruned Key-Value Attention (SP-KV), in which a lightweight utility predictor is trained jointly with a pretrained LLM using only next-token prediction loss. For each KV pair the predictor outputs a score; recent tokens are retained in a local window while older tokens are written to the global KV cache only if their score exceeds a fixed threshold. The method claims dynamic, input-dependent compression of 3–10× (higher for longer sequences) with negligible degradation in validation loss or downstream-task performance, and additionally reports structured layer- and head-specific sparsity patterns.
Significance. If the empirical claims hold under rigorous verification, the work directly mitigates the dominant memory and bandwidth bottleneck for long-context inference. The absence of auxiliary losses, the fully dynamic (rather than fixed-ratio) pruning, and the emergent sparsity observations are all strengths that could inform both practical deployment and the design of hybrid local-global attention layers.
major comments (3)
- [§3.2] §3.2 (Utility Predictor and Discrete Decision): The paper does not specify the exact gradient estimator used for the non-differentiable keep/drop threshold (straight-through, Gumbel-softmax, etc.) nor any temperature schedule or gradient-norm monitoring. Because the central claim rests on the predictor learning reliable long-horizon utility from the indirect next-token loss alone, this detail is load-bearing; without it the reported stability of joint training cannot be assessed.
- [Table 2, §4.2] Table 2 and §4.2 (Compression and Perplexity Results): No per-run standard deviations or confidence intervals are provided for the 3–10× compression ratios or the “little to no degradation” perplexity deltas. Given that compression is input-dependent, the absence of variance measures makes it impossible to judge whether the claimed negligible performance impact is statistically reliable across seeds and domains.
- [§5.3] §5.3 (Sparsity Patterns): The claim that longer sequences are systematically more compressible is supported only by aggregate statistics; a per-sequence-length plot or regression of compression ratio versus length is missing. This quantitative relationship is central to the “dynamic sparsification” narrative and should be shown explicitly.
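The relationship this comment asks for can be shown with an ordinary least-squares fit of compression ratio against sequence length; the sketch below uses synthetic stand-in data, so the fitted slope and R² are not the paper's values.

```python
# Per-sequence regression of compression ratio on sequence length (synthetic data).
import numpy as np

rng = np.random.default_rng(0)
length = rng.uniform(1_000, 32_000, size=500)                    # tokens per sequence
ratio = 3.0 + 1.25e-4 * length + rng.normal(0.0, 0.5, size=500)  # stand-in compression ratios

slope, intercept = np.polyfit(length, ratio, deg=1)              # OLS fit
pred = slope * length + intercept
r2 = 1.0 - np.sum((ratio - pred) ** 2) / np.sum((ratio - ratio.mean()) ** 2)
print(f"slope = {slope:.3g} per token, R^2 = {r2:.2f}")
```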
minor comments (3)
- [Figure 1] Figure 1 caption and legend: the distinction between “local window” and “global cache” tokens is visually ambiguous; add explicit markers or a small diagram inset.
- [Related Work] Related-work section: citations to prior KV-eviction methods (H2O, StreamingLLM, etc.) are present but their quantitative comparison tables are not referenced in the experimental section; cross-referencing would improve clarity.
- [§3.1] Notation: the utility threshold is introduced as a hyper-parameter yet never given a symbol; consistent notation (e.g., τ) would aid readability.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. The comments highlight important areas for clarification and additional evidence. We address each major comment below and have revised the manuscript to incorporate the requested details and visualizations.
read point-by-point responses
-
Referee: [§3.2] §3.2 (Utility Predictor and Discrete Decision): The paper does not specify the exact gradient estimator used for the non-differentiable keep/drop threshold (straight-through, Gumbel-softmax, etc.) nor any temperature schedule or gradient-norm monitoring. Because the central claim rests on the predictor learning reliable long-horizon utility from the indirect next-token loss alone, this detail is load-bearing; without it the reported stability of joint training cannot be assessed.
Authors: We employed the straight-through estimator to back-propagate through the discrete keep/drop threshold, with a fixed temperature of 1.0 and no explicit gradient-norm monitoring beyond standard AdamW clipping. This choice was made to maintain training stability without introducing auxiliary losses. We have expanded §3.2 with these implementation details, including pseudocode for the forward and backward passes; a minimal sketch of such a gate appears after these point-by-point responses. revision: yes
-
Referee: [Table 2, §4.2] Table 2 and §4.2 (Compression and Perplexity Results): No per-run standard deviations or confidence intervals are provided for the 3–10× compression ratios or the “little to no degradation” perplexity deltas. Given that compression is input-dependent, the absence of variance measures makes it impossible to judge whether the claimed negligible performance impact is statistically reliable across seeds and domains.
Authors: We agree that reporting variance is essential for assessing reliability. We have re-run the experiments across 5 random seeds and added per-run standard deviations to all compression ratios in Table 2 as well as 95% confidence intervals for the perplexity deltas in §4.2. The updated results confirm that the observed degradation remains within the reported intervals across seeds. revision: yes
-
Referee: [§5.3] §5.3 (Sparsity Patterns): The claim that longer sequences are systematically more compressible is supported only by aggregate statistics; a per-sequence-length plot or regression of compression ratio versus length is missing. This quantitative relationship is central to the “dynamic sparsification” narrative and should be shown explicitly.
Authors: We have added a new figure (Figure 7) in §5.3 that plots compression ratio against sequence length for individual examples, together with a linear regression fit (slope = 0.012, R² = 0.78). The plot explicitly demonstrates the positive correlation between length and compressibility, supporting the dynamic sparsification claim with per-sequence evidence. revision: yes
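For readers unfamiliar with the estimator named in the first response, the sketch below shows a generic straight-through keep/drop gate with a fixed temperature of 1.0: the forward pass applies the hard threshold, while the backward pass uses the gradient of the sigmoid surrogate. It is an illustration under those assumptions, not the authors' implementation.

```python
# Straight-through keep/drop gate: hard decision forward, soft gradient backward.
import torch

def st_gate(utility_logit: torch.Tensor, tau: float = 0.0,
            temperature: float = 1.0) -> torch.Tensor:
    soft = torch.sigmoid(utility_logit / temperature)   # differentiable surrogate
    hard = (utility_logit > tau).float()                # discrete keep/drop decision
    return hard + (soft - soft.detach())                # value of `hard`, gradient of `soft`

logits = torch.randn(8, requires_grad=True)
gate = st_gate(logits)
gate.sum().backward()                                   # gradients reach the logits via `soft`
print(gate.detach(), logits.grad)
```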
Circularity Check
No significant circularity in SP-KV derivation chain
full rationale
The paper trains a lightweight utility predictor jointly with the LLM using only the standard next-token prediction loss. Dynamic sparsification and the observed 3-10x KV cache reduction emerge as training outcomes rather than being imposed by definition, fitted directly to compression targets, or reduced to self-referential equations. No load-bearing self-citations, uniqueness theorems imported from prior author work, or ansatzes smuggled via citation are present in the described method. The central mechanism (scoring KV pairs for future utility via a hard threshold on predicted scores) remains an independent learned component whose effectiveness is not tautological with its inputs.
Axiom & Free-Parameter Ledger
free parameters (1)
- utility threshold
axioms (1)
- domain assumption: Future utility of a KV pair can be reliably estimated by a lightweight predictor trained solely on next-token prediction loss.
invented entities (1)
- lightweight utility predictor (no independent evidence)
Reference graph
Works this paper leans on
- [1] Ye, Zihao; Zheng, Lianmin; Chen, Tianqi; Ceze, Luis. Flash
- [2] Shah, Jay; Bikshandi, Ganesh; Zhang, Ying; Thakkar, Vijay; Ramani, Pradeep; Dao, Tri. FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-precision. 2024.
- [3] GLU Variants Improve Transformer. arXiv preprint arXiv:2002.05202, 2020.
- [4] Training with quantization noise for extreme model compression. arXiv preprint arXiv:2004.07320, 2020.
- [5] Dropout: a simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 2014.
- [6] Large memory layers with product keys. Advances in Neural Information Processing Systems, 2019.
- [7] Reducing transformer depth on demand with structured dropout. arXiv preprint arXiv:1909.11556, 2019.
- [8] Rectifier nonlinearities improve neural network acoustic models. Proc. ICML, 2013.
- [9] Deep sparse rectifier neural networks. Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, 2011.
- [10] Kwiatkowski, Tom; Palomaki, Jennimaria; Redfield, Olivia; Collins, Michael; Parikh, Ankur; Alberti, Chris; Epstein, Danielle; Polosukhin, Illia; Devlin, Jacob; Lee, Kenton; Toutanova, Kristina; Jones, Llion; Kelcey, Matthew; Chang, Ming-Wei; Dai, Andrew M.; Uszkoreit, Jakob; Le, Quoc; Petrov, Slav. Natural Questions: A Benchmark for Question Answering Research. Transactions of the Association for Computational Linguistics, 2019. https://aclanthology.org/Q19-1026/
- [11] HellaSwag: Can a Machine Really Finish Your Sentence? Annual Meeting of the Association for Computational Linguistics, 2019.
- [12] RACE: Large-scale ReAding Comprehension Dataset From Examinations. Conference on Empirical Methods in Natural Language Processing, 2017.
- [13] Evaluating Large Language Models Trained on Code. 2021.
- [14] Liu, Jiawei; Xia, Chunqiu Steven; Wang, Yuyao; Zhang, Lingming. Is Your Code Generated by ChatGPT Really Correct? Rigorous Evaluation of Large Language Models for Code Generation. 2023.
- [15]
- [16] PIQA: Reasoning about Physical Commonsense in Natural Language. Proceedings of the AAAI Conference on Artificial Intelligence, 2020. doi:10.1609/aaai.v34i05.6239
- [17] Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge. arXiv, 2018.
- [18] WinoGrande: An Adversarial Winograd Schema Challenge at Scale. Proceedings of the AAAI Conference on Artificial Intelligence, 2020. doi:10.1609/aaai.v34i05.6399
- [19] Mihaylov, Todor; Clark, Peter; Khot, Tushar; Sabharwal, Ashish. Can a Suit of Armor Conduct Electricity? A New Dataset for Open Book Question Answering. Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, 2018. doi:10.18653/v1/D18-1260
- [20]
- [21]
- [22] Inference-time sparse attention with asymmetric indexing. 2025.
- [23] Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention. 2020.
- [24] A theoretical analysis of feature pooling in visual recognition. Proceedings of the 27th International Conference on Machine Learning (ICML-10), 2010.
- [25] Learning mid-level features for recognition. 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2010.
- [26] Native sparse attention: Hardware-aligned and natively trainable sparse attention. arXiv preprint arXiv:2502.11089, 2025.
- [27] Deep residual learning for image recognition. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016.
- [28] Murray, Naila; Perronnin, Florent. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
- [29] Particular object retrieval with integral max-pooling of CNN activations. arXiv preprint arXiv:1511.05879, 2015.
- [30] Rae, Jack W.; Potapenko, Anna; Jayakumar, Siddhant M.; et al. Compressive transformers for long-range sequence modelling. arXiv preprint arXiv:1911.05507, 2019.
- [31] Not all memories are created equal: Learning to forget by expiring. International Conference on Machine Learning, 2021.
- [32] Efficient Streaming Language Models with Attention Sinks. 2024.
- [33] DeepSeek-V3.2: Pushing the Frontier of Open Large Language Models. 2025.
- [34] Quest: Query-Aware Sparsity for Efficient Long-Context LLM Inference. 2024.
- [35] Landmark Attention: Random-Access Infinite Context Length for Transformers. 2023.
- [36] RetrievalAttention: Accelerating Long-Context LLM Inference via Vector Retrieval. 2024.
- [37]
- [38] InfLLM: Training-Free Long-Context Extrapolation for LLMs with an Efficient Context Memory. 2024.
- [39] CacheBlend: Fast Large Language Model Serving for RAG with Cached Knowledge Fusion. 2025.
- [40] Leave No Context Behind: Efficient Infinite Context Transformers with Infini-attention. 2024.
- [41] Keyformer: KV Cache Reduction through Key Tokens Selection for Efficient Generative Inference. 2024.
- [42] Model Tells You What to Discard: Adaptive KV Cache Compression for LLMs. 2024.
- [43] MiniCache: KV Cache Compression in Depth Dimension for Large Language Models. 2024.
- [44] Model Tells You Where to Merge: Adaptive KV Cache Merging for LLMs on Long-Context Tasks. 2024.
- [45] Expected Attention: KV Cache Compression by Estimating Attention from Future Queries Distribution. 2025.
- [46]
- [47]
- [48] DeepSeek LLM: Scaling Open-Source Language Models with Longtermism. 2024.
- [49] What is Wrong with Perplexity for Long-context Language Modeling? 2025.
- [50] Lost in the Middle: How Language Models Use Long Contexts. 2023.
- [51] CWM: An Open-Weights LLM for Research on Code Generation with World Models. 2025.
- [52] Command A: An Enterprise-Ready Large Language Model. 2025.
- [53] Attention is All you Need. Neural Information Processing Systems, 2017.
- [54] Brown, Tom B.; Mann, Benjamin; Ryder, Nick; Subbiah, Melanie; Kaplan, Jared; Dhariwal, Prafulla; Neelakantan, Arvind; Shyam, Pranav; Sastry, Girish; Askell, Amanda; et al. Language Models are Few-Shot Learners. arXiv preprint arXiv:2005.14165, 2020.
- [55] Chowdhery, Aakanksha; Narang, Sharan; Devlin, Jacob; et al. PaLM: Scaling Language Modeling with Pathways. arXiv preprint arXiv:2204.02311, 2022.
- [56] Estimating or Propagating Gradients Through Stochastic Neurons for Conditional Computation. arXiv, 2013.
- [57]
- [58] Inference-Time Hyper-Scaling with KV Cache Compression. 2025.
- [59] GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints. arXiv preprint arXiv:2305.13245, 2023.
- [60] Longformer: The Long-Document Transformer. arXiv preprint arXiv:2004.05150, 2020.
- [61] DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model. arXiv preprint arXiv:2405.04434, 2024.
- [62] Gemma 2: Improving Open Language Models at a Practical Size. arXiv preprint arXiv:2408.00118, 2024.
- [63] Gated Delta Networks: Improving Mamba2 with Delta Rule. International Conference on Learning Representations, 2025.
- [64] H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models. Advances in Neural Information Processing Systems, 2023.
- [65] KVzap: Fast, Adaptive, and Faithful KV Cache Pruning. arXiv preprint arXiv:2601.07891, 2026.
- [66] Scaling Test-time Compute for LLM Agents. arXiv preprint arXiv:2506.12928, 2025.
- [67] LoongRL: Reinforcement Learning for Advanced Reasoning over Long Contexts. arXiv preprint arXiv:2510.19363, 2025.
- [68] LongR: Unleashing Long-Context Reasoning via Reinforcement Learning with Dense Utility Rewards. arXiv preprint arXiv:2602.05758, 2026.
- [69] A Survey on Large Language Model Acceleration based on KV Cache Management. arXiv preprint arXiv:2412.19442, 2024.
- [70] Scissorhands: Exploiting the Persistence of Importance Hypothesis for LLM KV Cache Compression at Test Time. arXiv preprint arXiv:2305.17118, 2023.
- [71] SnapKV: LLM Knows What You are Looking for Before Generation. arXiv preprint arXiv:2404.14469, 2024.
- [72] PyramidKV: Dynamic KV Cache Compression based on Pyramidal Information Funneling. arXiv preprint arXiv:2406.02069, 2024.
- [73] Ada-KV: Optimizing KV Cache Eviction by Adaptive Budget Allocation for Efficient LLM Inference. arXiv preprint arXiv:2407.11550, 2024.
- [74] RazorAttention: Efficient KV Cache Compression Through Retrieval Heads. arXiv preprint arXiv:2407.15891, 2024.
- [75] KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache. arXiv preprint arXiv:2402.02750, 2024.
- [76] KVzip: Query-Agnostic KV Cache Compression with Context Reconstruction. arXiv preprint arXiv:2505.23416, 2025.
- [77] Fast KVzip: Efficient and Accurate LLM Inference with Gated KV Eviction. arXiv preprint arXiv:2601.17668, 2026.
- [78] Learning to Evict from Key-Value Cache. arXiv preprint arXiv:2602.10238, 2026.
- [79] DataComp-LM: In search of the next generation of training sets for language models. 2025.
- [80] RULER: What's the Real Context Size of Your Long-Context Language Models? Findings of the Association for Computational Linguistics: EMNLP 2024, 2024.