{"paper":{"title":"Self-Pruned Key-Value Attention: Learning When to Write by Predicting Future Utility","license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","headline":"A lightweight utility predictor scores each key-value pair and decides whether to retain it in the cache, achieving dynamic 3- to 10-fold compression.","cross_cats":["cs.CL"],"primary_cat":"cs.LG","authors_text":"2), (2) MICS, CentraleSup\\'elec), Gergely Szilvasy (1), Herv\\'e J\\'egou (1) ((1) Meta FAIR, Lo\\\"ic Cabannes (1), Manuel Faysse (1, Maria Lomeli (1), Matthijs Douze (1), Pierre-Emmanuel Mazar\\'e (1), Wen-tau Yih (1)","submitted_at":"2026-05-13T18:58:16Z","abstract_excerpt":"Under modern test-time compute and agentic paradigms, language models process ever-longer sequences. Efficient text generation with transformer architectures is increasingly constrained by the Key-Value cache memory footprint and bandwidth. To address this limitation, we introduce Self-Pruned Key-Value Attention (SP-KV), a mechanism designed to predict future KV utility in order to reduce the size of the long-term KV cache. This strategy operates at a fine granularity: a lightweight utility predictor scores each key-value pair, and while recent KVs are always available via a local window, olde"},"claims":{"count":4,"items":[{"kind":"strongest_claim","text":"SP-KV performs dynamic sparsification: the mechanism adapts to the input and typically reduces the KV cache size by a factor of 3 to 10×, longer sequences often being more compressible. 
This leads to vast improvements in memory usage and decoding speed, with little to no degradation of validation loss nor performance on a broad set of downstream tasks.","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"A lightweight utility predictor trained jointly with the LLM using only next-token prediction loss can accurately forecast which KV pairs will be needed in the future without introducing meaningful errors or extra overhead.","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"SP-KV trains a utility predictor jointly with the LLM to dynamically prune low-utility KV cache entries, achieving 3-10x memory reduction during generation with negligible performance loss.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"},{"kind":"headline","text":"A lightweight utility predictor scores each key-value pair and decides whether to retain it in the cache, achieving dynamic 3- to 10-fold compression.","source":"verdict.pith_extraction.headline","status":"machine_extracted","claim_id":"C4","attestation":"unclaimed"}],"snapshot_sha256":"5895a554d93dd7bf39b8d210740754960c0a04a10cc3f84556cccd2eb438b39c"},"source":{"id":"2605.14037","kind":"arxiv","version":1},"verdict":{"id":"5262c577-e767-40ec-b20a-999eaf5d1f80","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-15T05:30:41.115983Z","strongest_claim":"SP-KV performs dynamic sparsification: the mechanism adapts to the input and typically reduces the KV cache size by a factor of 3 to 10×, longer sequences often being more compressible. 
This leads to vast improvements in memory usage and decoding speed, with little to no degradation of validation loss nor performance on a broad set of downstream tasks.","one_line_summary":"SP-KV trains a utility predictor jointly with the LLM to dynamically prune low-utility KV cache entries, achieving 3-10x memory reduction during generation with negligible performance loss.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"A lightweight utility predictor trained jointly with the LLM using only next-token prediction loss can accurately forecast which KV pairs will be needed in the future without introducing meaningful errors or extra overhead.","pith_extraction_headline":"A lightweight utility predictor scores each key-value pair and decides whether to retain it in the cache, achieving dynamic 3- to 10-fold compression."},"references":{"count":82,"sample":[{"doi":"","year":null,"title":"Ye, Zihao and Zheng, Lianmin and Chen, Tianqi and Ceze, Luis , journal=. Flash","work_id":"dcd54ce3-994e-4006-a2dd-b0bb17b4e97f","ref_index":1,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":null,"title":"Shah, Jay and Bikshandi, Ganesh and Zhang, Ying and Thakkar, Vijay and Ramani, Pradeep and Dao, Tri , journal=. 
Flash","work_id":"4fefbff5-13c2-47ba-b816-692314bdaf59","ref_index":2,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2002,"title":"GLU Variants Improve Transformer","work_id":"17d0763c-1016-41ab-a478-478e890765eb","ref_index":3,"cited_arxiv_id":"2002.05202","is_internal_anchor":true},{"doi":"","year":2004,"title":"Training with quantization noise for extreme fixed-point compression","work_id":"1169a968-440c-4209-8438-fffcfb77faf4","ref_index":4,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2014,"title":"The journal of machine learning research , volume=","work_id":"b25e43d4-a73c-404a-b1ac-ebff0cbe4930","ref_index":5,"cited_arxiv_id":"","is_internal_anchor":false}],"resolved_work":82,"snapshot_sha256":"f281cf959d3987eaf2f86b9331c8cddfb687b5727297dfb1fc8d6cc383cd3b51","internal_anchors":13},"formal_canon":{"evidence_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"}