{"record_type":"pith_number_record","schema_url":"https://pith.science/schemas/pith-number/v1.json","pith_number":"pith:2026:BRGDBQMQYHH2KATHMSVTRIFOYH","short_pith_number":"pith:BRGDBQMQ","schema_version":"1.0","canonical_sha256":"0c4c30c190c1cfa5026764ab38a0aec1e0fa12bc255c071aec4bc633005d0e53","source":{"kind":"arxiv","id":"2605.14037","version":1},"attestation_state":"computed","paper":{"title":"Self-Pruned Key-Value Attention: Learning When to Write by Predicting Future Utility","license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","headline":"A lightweight utility predictor scores each key-value pair and decides whether to retain it in the cache, achieving dynamic 3- to 10-fold compression.","cross_cats":["cs.CL"],"primary_cat":"cs.LG","authors_text":"2), (2) MICS, CentraleSup\\'elec), Gergely Szilvasy (1), Herv\\'e J\\'egou (1) ((1) Meta FAIR, Lo\\\"ic Cabannes (1), Manuel Faysse (1, Maria Lomeli (1), Matthijs Douze (1), Pierre-Emmanuel Mazar\\'e (1), Wen-tau Yih (1)","submitted_at":"2026-05-13T18:58:16Z","abstract_excerpt":"Under modern test-time compute and agentic paradigms, language models process ever-longer sequences. Efficient text generation with transformer architectures is increasingly constrained by the Key-Value cache memory footprint and bandwidth. To address this limitation, we introduce Self-Pruned Key-Value Attention (SP-KV), a mechanism designed to predict future KV utility in order to reduce the size of the long-term KV cache. This strategy operates at a fine granularity: a lightweight utility predictor scores each key-value pair, and while recent KVs are always available via a local window, olde"},"verification_status":{"content_addressed":true,"pith_receipt":true,"author_attested":false,"weak_author_claims":0,"strong_author_claims":0,"externally_anchored":false,"storage_verified":false,"citation_signatures":0,"replication_records":0,"graph_snapshot":true,"references_resolved":true,"formal_links_present":false},"canonical_record":{"source":{"id":"2605.14037","kind":"arxiv","version":1},"metadata":{"license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","primary_cat":"cs.LG","submitted_at":"2026-05-13T18:58:16Z","cross_cats_sorted":["cs.CL"],"title_canon_sha256":"b63cf5e4b12fd45755a0039792d5d4283a0d78127de450b0c7ff9c7ae68cb99e","abstract_canon_sha256":"f08ff89cbdbe68f8636c04cdccf0794d76aeda45bd585a9030864a16073293ac"},"schema_version":"1.0"},"receipt":{"kind":"pith_receipt","key_id":"pith-v1-2026-05","algorithm":"ed25519","signed_at":"2026-05-17T23:39:12.781670Z","signature_b64":"eRR0THsEeFluc1d9wGr/dXUL7ZcmTRKKrrSBtTdstzYZ6gp205xWDehu29BMkfNeFIIuRLoDQmuaW76ygf0YCA==","signed_message":"canonical_sha256_bytes","builder_version":"pith-number-builder-2026-05-17-v1","receipt_version":"0.3","canonical_sha256":"0c4c30c190c1cfa5026764ab38a0aec1e0fa12bc255c071aec4bc633005d0e53","last_reissued_at":"2026-05-17T23:39:12.781106Z","signature_status":"signed_v1","first_computed_at":"2026-05-17T23:39:12.781106Z","public_key_fingerprint":"8d4b5ee74e4693bcd1df2446408b0d54"},"graph_snapshot":{"paper":{"title":"Self-Pruned Key-Value Attention: Learning When to Write by Predicting Future Utility","license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","headline":"A lightweight utility predictor scores each key-value pair and decides whether to retain it in the cache, achieving dynamic 3- to 10-fold compression.","cross_cats":["cs.CL"],"primary_cat":"cs.LG","authors_text":"2), (2) MICS, CentraleSup\\'elec), Gergely Szilvasy (1), Herv\\'e J\\'egou (1) ((1) Meta FAIR, Lo\\\"ic Cabannes (1), Manuel Faysse (1, Maria Lomeli (1), Matthijs Douze (1), Pierre-Emmanuel Mazar\\'e (1), Wen-tau Yih (1)","submitted_at":"2026-05-13T18:58:16Z","abstract_excerpt":"Under modern test-time compute and agentic paradigms, language models process ever-longer sequences. Efficient text generation with transformer architectures is increasingly constrained by the Key-Value cache memory footprint and bandwidth. To address this limitation, we introduce Self-Pruned Key-Value Attention (SP-KV), a mechanism designed to predict future KV utility in order to reduce the size of the long-term KV cache. This strategy operates at a fine granularity: a lightweight utility predictor scores each key-value pair, and while recent KVs are always available via a local window, olde"},"claims":{"count":4,"items":[{"kind":"strongest_claim","text":"SP-KV performs dynamic sparsification: the mechanism adapts to the input and typically reduces the KV cache size by a factor of 3 to 10×, longer sequences often being more compressible. This leads to vast improvements in memory usage and decoding speed, with little to no degradation of validation loss nor performance on a broad set of downstream tasks.","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"A lightweight utility predictor trained jointly with the LLM using only next-token prediction loss can accurately forecast which KV pairs will be needed in the future without introducing meaningful errors or extra overhead.","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"SP-KV trains a utility predictor jointly with the LLM to dynamically prune low-utility KV cache entries, achieving 3-10x memory reduction during generation with negligible performance loss.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"},{"kind":"headline","text":"A lightweight utility predictor scores each key-value pair and decides whether to retain it in the cache, achieving dynamic 3- to 10-fold compression.","source":"verdict.pith_extraction.headline","status":"machine_extracted","claim_id":"C4","attestation":"unclaimed"}],"snapshot_sha256":"5895a554d93dd7bf39b8d210740754960c0a04a10cc3f84556cccd2eb438b39c"},"source":{"id":"2605.14037","kind":"arxiv","version":1},"verdict":{"id":"5262c577-e767-40ec-b20a-999eaf5d1f80","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-15T05:30:41.115983Z","strongest_claim":"SP-KV performs dynamic sparsification: the mechanism adapts to the input and typically reduces the KV cache size by a factor of 3 to 10×, longer sequences often being more compressible. This leads to vast improvements in memory usage and decoding speed, with little to no degradation of validation loss nor performance on a broad set of downstream tasks.","one_line_summary":"SP-KV trains a utility predictor jointly with the LLM to dynamically prune low-utility KV cache entries, achieving 3-10x memory reduction during generation with negligible performance loss.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"A lightweight utility predictor trained jointly with the LLM using only next-token prediction loss can accurately forecast which KV pairs will be needed in the future without introducing meaningful errors or extra overhead.","pith_extraction_headline":"A lightweight utility predictor scores each key-value pair and decides whether to retain it in the cache, achieving dynamic 3- to 10-fold compression."},"references":{"count":82,"sample":[{"doi":"","year":null,"title":"Ye, Zihao and Zheng, Lianmin and Chen, Tianqi and Ceze, Luis , journal=. Flash","work_id":"dcd54ce3-994e-4006-a2dd-b0bb17b4e97f","ref_index":1,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":null,"title":"Shah, Jay and Bikshandi, Ganesh and Zhang, Ying and Thakkar, Vijay and Ramani, Pradeep and Dao, Tri , journal=. Flash","work_id":"4fefbff5-13c2-47ba-b816-692314bdaf59","ref_index":2,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2002,"title":"GLU Variants Improve Transformer","work_id":"17d0763c-1016-41ab-a478-478e890765eb","ref_index":3,"cited_arxiv_id":"2002.05202","is_internal_anchor":true},{"doi":"","year":2004,"title":"Training with quantization noise for extreme ﬁxed-point compression","work_id":"1169a968-440c-4209-8438-fffcfb77faf4","ref_index":4,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2014,"title":"The journal of machine learning research , volume=","work_id":"b25e43d4-a73c-404a-b1ac-ebff0cbe4930","ref_index":5,"cited_arxiv_id":"","is_internal_anchor":false}],"resolved_work":82,"snapshot_sha256":"f281cf959d3987eaf2f86b9331c8cddfb687b5727297dfb1fc8d6cc383cd3b51","internal_anchors":13},"formal_canon":{"evidence_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"},"aliases":[{"alias_kind":"arxiv","alias_value":"2605.14037","created_at":"2026-05-17T23:39:12.781215+00:00"},{"alias_kind":"arxiv_version","alias_value":"2605.14037v1","created_at":"2026-05-17T23:39:12.781215+00:00"},{"alias_kind":"doi","alias_value":"10.48550/arxiv.2605.14037","created_at":"2026-05-17T23:39:12.781215+00:00"},{"alias_kind":"pith_short_12","alias_value":"BRGDBQMQYHH2","created_at":"2026-05-18T12:33:37.589309+00:00"},{"alias_kind":"pith_short_16","alias_value":"BRGDBQMQYHH2KATH","created_at":"2026-05-18T12:33:37.589309+00:00"},{"alias_kind":"pith_short_8","alias_value":"BRGDBQMQ","created_at":"2026-05-18T12:33:37.589309+00:00"}],"events":[],"event_summary":{},"paper_claims":[],"inbound_citations":{"count":0,"internal_anchor_count":0,"sample":[]},"formal_canon":{"evidence_count":0,"sample":[],"anchors":[]},"links":{"html":"https://pith.science/pith/BRGDBQMQYHH2KATHMSVTRIFOYH","json":"https://pith.science/pith/BRGDBQMQYHH2KATHMSVTRIFOYH.json","graph_json":"https://pith.science/api/pith-number/BRGDBQMQYHH2KATHMSVTRIFOYH/graph.json","events_json":"https://pith.science/api/pith-number/BRGDBQMQYHH2KATHMSVTRIFOYH/events.json","paper":"https://pith.science/paper/BRGDBQMQ"},"agent_actions":{"view_html":"https://pith.science/pith/BRGDBQMQYHH2KATHMSVTRIFOYH","download_json":"https://pith.science/pith/BRGDBQMQYHH2KATHMSVTRIFOYH.json","view_paper":"https://pith.science/paper/BRGDBQMQ","resolve_alias":"https://pith.science/api/pith-number/resolve?arxiv=2605.14037&json=true","fetch_graph":"https://pith.science/api/pith-number/BRGDBQMQYHH2KATHMSVTRIFOYH/graph.json","fetch_events":"https://pith.science/api/pith-number/BRGDBQMQYHH2KATHMSVTRIFOYH/events.json","actions":{"anchor_timestamp":"https://pith.science/pith/BRGDBQMQYHH2KATHMSVTRIFOYH/action/timestamp_anchor","attest_storage":"https://pith.science/pith/BRGDBQMQYHH2KATHMSVTRIFOYH/action/storage_attestation","attest_author":"https://pith.science/pith/BRGDBQMQYHH2KATHMSVTRIFOYH/action/author_attestation","sign_citation":"https://pith.science/pith/BRGDBQMQYHH2KATHMSVTRIFOYH/action/citation_signature","submit_replication":"https://pith.science/pith/BRGDBQMQYHH2KATHMSVTRIFOYH/action/replication_record"}},"created_at":"2026-05-17T23:39:12.781215+00:00","updated_at":"2026-05-17T23:39:12.781215+00:00"}