{"record_type":"pith_number_record","schema_url":"https://pith.science/schemas/pith-number/v1.json","pith_number":"pith:2023:EM2A7KL7DVBG3VHQBSRAVTUMG5","short_pith_number":"pith:EM2A7KL7","schema_version":"1.0","canonical_sha256":"23340fa97f1d426dd4f00ca20ace8c37532352596ea1ea91a591c8ed76947c51","source":{"kind":"arxiv","id":"2310.01801","version":4},"attestation_state":"computed","paper":{"title":"Model Tells You What to Discard: Adaptive KV Cache Compression for LLMs","license":"http://creativecommons.org/licenses/by/4.0/","headline":"LLMs can cut KV cache memory by profiling attention heads once and evicting tokens selectively per head type.","cross_cats":[],"primary_cat":"cs.CL","authors_text":"Jianfeng Gao, Jiawei Han, Liyuan Liu, Minjia Zhang, Suyu Ge, Yunan Zhang","submitted_at":"2023-10-03T05:17:08Z","abstract_excerpt":"In this study, we introduce adaptive KV cache compression, a plug-and-play method that reduces the memory footprint of generative inference for Large Language Models (LLMs). Different from the conventional KV cache that retains key and value vectors for all context tokens, we conduct targeted profiling to discern the intrinsic structure of attention modules. Based on the recognized structure, we then construct the KV cache in an adaptive manner: evicting long-range contexts on attention heads emphasizing local contexts, discarding non-special tokens on attention heads centered on special token"},"verification_status":{"content_addressed":true,"pith_receipt":true,"author_attested":false,"weak_author_claims":0,"strong_author_claims":0,"externally_anchored":false,"storage_verified":false,"citation_signatures":0,"replication_records":0,"graph_snapshot":true,"references_resolved":true,"formal_links_present":true},"canonical_record":{"source":{"id":"2310.01801","kind":"arxiv","version":4},"metadata":{"license":"http://creativecommons.org/licenses/by/4.0/","primary_cat":"cs.CL","submitted_at":"2023-10-03T05:17:08Z","cross_cats_sorted":[],"title_canon_sha256":"67dcb3a627ccfd1e7e1ed9f026f773a287848ff13092ed123185a24c13fd96f3","abstract_canon_sha256":"0e193d84667f6bfe1d3dd00a5dcff1a21a7121bcb148bb9dae7ceb420b7365e4"},"schema_version":"1.0"},"receipt":{"kind":"pith_receipt","key_id":"pith-v1-2026-05","algorithm":"ed25519","signed_at":"2026-05-17T23:38:14.272943Z","signature_b64":"1bxxoaXwe1Mwqf+CPlzVeM2HKq3bowiy42jP8jddzjH1XhDjcpDRbRoOw2kepLZfcSa+df6SbReEnb1IsrQMBg==","signed_message":"canonical_sha256_bytes","builder_version":"pith-number-builder-2026-05-17-v1","receipt_version":"0.3","canonical_sha256":"23340fa97f1d426dd4f00ca20ace8c37532352596ea1ea91a591c8ed76947c51","last_reissued_at":"2026-05-17T23:38:14.272361Z","signature_status":"signed_v1","first_computed_at":"2026-05-17T23:38:14.272361Z","public_key_fingerprint":"8d4b5ee74e4693bcd1df2446408b0d54"},"graph_snapshot":{"paper":{"title":"Model Tells You What to Discard: Adaptive KV Cache Compression for LLMs","license":"http://creativecommons.org/licenses/by/4.0/","headline":"LLMs can cut KV cache memory by profiling attention heads once and evicting tokens selectively per head type.","cross_cats":[],"primary_cat":"cs.CL","authors_text":"Jianfeng Gao, Jiawei Han, Liyuan Liu, Minjia Zhang, Suyu Ge, Yunan Zhang","submitted_at":"2023-10-03T05:17:08Z","abstract_excerpt":"In this study, we introduce adaptive KV cache compression, a plug-and-play method that reduces the memory footprint of generative inference for Large Language Models (LLMs). Different from the conventional KV cache that retains key and value vectors for all context tokens, we conduct targeted profiling to discern the intrinsic structure of attention modules. Based on the recognized structure, we then construct the KV cache in an adaptive manner: evicting long-range contexts on attention heads emphasizing local contexts, discarding non-special tokens on attention heads centered on special token"},"claims":{"count":4,"items":[{"kind":"strongest_claim","text":"we conduct targeted profiling to discern the intrinsic structure of attention modules. Based on the recognized structure, we then construct the KV cache in an adaptive manner: evicting long-range contexts on attention heads emphasizing local contexts, discarding non-special tokens on attention heads centered on special tokens, and only employing the standard KV cache for attention heads that broadly attend to all tokens. Moreover, with the lightweight attention profiling used to guide the construction of the adaptive KV cache, FastGen can be deployed without resource-intensive fine-tuning or re-training. In our experiments across various tasks, FastGen demonstrates substantial reduction on GPU memory consumption with negligible generation quality loss.","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"That the attention-head structures identified by a single lightweight profiling pass remain stable and sufficient to guide token eviction across diverse generation tasks and contexts without materially degrading output quality or requiring any model updates.","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"FastGen adaptively compresses LLM KV caches via lightweight attention profiling: evicting long-range contexts on local heads, non-special tokens on special-token heads, and retaining full caches on broad-attention heads, yielding substantial memory savings with negligible quality loss.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"},{"kind":"headline","text":"LLMs can cut KV cache memory by profiling attention heads once and evicting tokens selectively per head type.","source":"verdict.pith_extraction.headline","status":"machine_extracted","claim_id":"C4","attestation":"unclaimed"}],"snapshot_sha256":"6ae7930f67e9daa039255d1e86d06154a593ea28dd20a97521548c155b260ab8"},"source":{"id":"2310.01801","kind":"arxiv","version":4},"verdict":{"id":"354e0abe-3f5d-4c5a-9813-4e1b8d71d6e4","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-17T11:06:14.885468Z","strongest_claim":"we conduct targeted profiling to discern the intrinsic structure of attention modules. Based on the recognized structure, we then construct the KV cache in an adaptive manner: evicting long-range contexts on attention heads emphasizing local contexts, discarding non-special tokens on attention heads centered on special tokens, and only employing the standard KV cache for attention heads that broadly attend to all tokens. Moreover, with the lightweight attention profiling used to guide the construction of the adaptive KV cache, FastGen can be deployed without resource-intensive fine-tuning or re-training. In our experiments across various tasks, FastGen demonstrates substantial reduction on GPU memory consumption with negligible generation quality loss.","one_line_summary":"FastGen adaptively compresses LLM KV caches via lightweight attention profiling: evicting long-range contexts on local heads, non-special tokens on special-token heads, and retaining full caches on broad-attention heads, yielding substantial memory savings with negligible quality loss.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"That the attention-head structures identified by a single lightweight profiling pass remain stable and sufficient to guide token eviction across diverse generation tasks and contexts without materially degrading output quality or requiring any model updates.","pith_extraction_headline":"LLMs can cut KV cache memory by profiling attention heads once and evicting tokens selectively per head type."},"references":{"count":85,"sample":[{"doi":"","year":2017,"title":"2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) , year=","work_id":"b4b7109a-a8e3-49f5-8c2a-d4245838468b","ref_index":1,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2019,"title":"2019 , journal =","work_id":"41c879fa-f0bb-4f50-b5be-c2b0d1b21ffa","ref_index":2,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2020,"title":"Tom B. Brown and Benjamin Mann and Nick Ryder and Melanie Subbiah and Jared Kaplan and Prafulla Dhariwal and Arvind Neelakantan and Pranav Shyam and Girish Sastry and Amanda Askell and Sandhini Agarwa","work_id":"d2f65ea3-f1ff-4ac1-9647-f11876755e31","ref_index":5,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":null,"title":"SC22: International Conference for High Performance Computing, Networking, Storage and Analysis , year=","work_id":"d8d392d0-72d7-4c83-b302-9d2e830cf8d8","ref_index":8,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":null,"title":"International Conference on Machine Learning , year=","work_id":"dad3ccf2-8849-4fbf-ab94-d710a8f18169","ref_index":9,"cited_arxiv_id":"","is_internal_anchor":false}],"resolved_work":85,"snapshot_sha256":"0ffa89237fb670f8bc441ecec5ef75412a025a175638a720026a498e4716b10f","internal_anchors":18},"formal_canon":{"evidence_count":2,"snapshot_sha256":"b24d1e411a837688e7fe815d189cbc6c833d073526ad00ae35b19b3dba4d8236"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"},"aliases":[{"alias_kind":"arxiv","alias_value":"2310.01801","created_at":"2026-05-17T23:38:14.272476+00:00"},{"alias_kind":"arxiv_version","alias_value":"2310.01801v4","created_at":"2026-05-17T23:38:14.272476+00:00"},{"alias_kind":"doi","alias_value":"10.48550/arxiv.2310.01801","created_at":"2026-05-17T23:38:14.272476+00:00"},{"alias_kind":"pith_short_12","alias_value":"EM2A7KL7DVBG","created_at":"2026-05-18T12:33:33.725879+00:00"},{"alias_kind":"pith_short_16","alias_value":"EM2A7KL7DVBG3VHQ","created_at":"2026-05-18T12:33:33.725879+00:00"},{"alias_kind":"pith_short_8","alias_value":"EM2A7KL7","created_at":"2026-05-18T12:33:33.725879+00:00"}],"events":[],"event_summary":{},"paper_claims":[],"inbound_citations":{"count":32,"internal_anchor_count":32,"sample":[{"citing_arxiv_id":"2502.01941","citing_title":"Semantic Integrity Matters: Benchmarking and Preserving High-Density Reasoning in KV Cache Compression","ref_index":26,"is_internal_anchor":true},{"citing_arxiv_id":"2505.05772","citing_title":"Sparse Attention Remapping with Clustering for Efficient LLM Decoding on PIM","ref_index":39,"is_internal_anchor":true},{"citing_arxiv_id":"2605.21603","citing_title":"DynaFlow: Transparent and Flexible Intra-Device Parallelism via Programmable Operator Scheduling","ref_index":3,"is_internal_anchor":true},{"citing_arxiv_id":"2605.22106","citing_title":"ArborKV: Structure-Aware KV Cache Management for Scaling Tree-based LLM Reasoning","ref_index":8,"is_internal_anchor":true},{"citing_arxiv_id":"2605.22337","citing_title":"Meta-Soft: Leveraging Composable Meta-Tokens for Context-Preserving KV Cache Compression","ref_index":9,"is_internal_anchor":true},{"citing_arxiv_id":"2605.20600","citing_title":"Head-Aware Key-Value Compression for Efficient Autoregressive Image Generation","ref_index":11,"is_internal_anchor":true},{"citing_arxiv_id":"2605.16360","citing_title":"ProxyKV: Cross-Model Proxy Pruning for Efficient Long-Context LLM Inference","ref_index":3,"is_internal_anchor":true},{"citing_arxiv_id":"2605.17757","citing_title":"OSCAR: Offline Spectral Covariance-Aware Rotation for 2-bit KV Cache Quantization","ref_index":3,"is_internal_anchor":true},{"citing_arxiv_id":"2605.18753","citing_title":"DashAttention: Differentiable and Adaptive Sparse Hierarchical Attention","ref_index":45,"is_internal_anchor":true},{"citing_arxiv_id":"2605.19660","citing_title":"OScaR: The Occam's Razor for Extreme KV Cache Quantization in LLMs and Beyond","ref_index":13,"is_internal_anchor":true},{"citing_arxiv_id":"2506.17310","citing_title":"PaceLLM: Brain-Inspired Large Language Models for Long-Context Understanding","ref_index":9,"is_internal_anchor":true},{"citing_arxiv_id":"2507.21433","citing_title":"ReasonCache: Accelerating Large Reasoning Model Serving through KV Cache Sharing","ref_index":8,"is_internal_anchor":true},{"citing_arxiv_id":"2508.16703","citing_title":"ShadowNPU: System and Algorithm Co-design for NPU-Centric On-Device LLM Inference","ref_index":21,"is_internal_anchor":true},{"citing_arxiv_id":"2510.09883","citing_title":"DELTA: Dynamic Layer-Aware Token Attention for Efficient Long-Context Reasoning","ref_index":8,"is_internal_anchor":true},{"citing_arxiv_id":"2407.11550","citing_title":"Ada-KV: Optimizing KV Cache Eviction by Adaptive Budget Allocation for Efficient LLM Inference","ref_index":7,"is_internal_anchor":true},{"citing_arxiv_id":"2504.15965","citing_title":"From Human Memory to AI Memory: A Survey on Memory Mechanisms in the Era of LLMs","ref_index":117,"is_internal_anchor":true},{"citing_arxiv_id":"2502.11089","citing_title":"Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention","ref_index":76,"is_internal_anchor":true},{"citing_arxiv_id":"2410.10781","citing_title":"When Attention Sink Emerges in Language Models: An Empirical View","ref_index":17,"is_internal_anchor":true},{"citing_arxiv_id":"2602.01203","citing_title":"Attention Sink Forges Native MoE in Attention Layers: Sink-Aware Training to Address Head Collapse","ref_index":14,"is_internal_anchor":true},{"citing_arxiv_id":"2502.13189","citing_title":"MoBA: Mixture of Block Attention for Long-Context LLMs","ref_index":26,"is_internal_anchor":true},{"citing_arxiv_id":"2410.17247","citing_title":"PyramidDrop: Accelerating Your Large Vision-Language Models via Pyramid Visual Redundancy Reduction","ref_index":15,"is_internal_anchor":true},{"citing_arxiv_id":"2604.22782","citing_title":"Stochastic KV Routing: Enabling Adaptive Depth-Wise Cache Sharing","ref_index":9,"is_internal_anchor":true},{"citing_arxiv_id":"2604.03143","citing_title":"TokenDance: Scaling Multi-Agent LLM Serving via Collective KV Cache Sharing","ref_index":7,"is_internal_anchor":true},{"citing_arxiv_id":"2404.14469","citing_title":"SnapKV: LLM Knows What You are Looking for Before Generation","ref_index":8,"is_internal_anchor":true},{"citing_arxiv_id":"2406.02069","citing_title":"PyramidKV: Dynamic KV Cache Compression based on Pyramidal Information Funneling","ref_index":9,"is_internal_anchor":true}]},"formal_canon":{"evidence_count":2,"sample":[],"anchors":[]},"links":{"html":"https://pith.science/pith/EM2A7KL7DVBG3VHQBSRAVTUMG5","json":"https://pith.science/pith/EM2A7KL7DVBG3VHQBSRAVTUMG5.json","graph_json":"https://pith.science/api/pith-number/EM2A7KL7DVBG3VHQBSRAVTUMG5/graph.json","events_json":"https://pith.science/api/pith-number/EM2A7KL7DVBG3VHQBSRAVTUMG5/events.json","paper":"https://pith.science/paper/EM2A7KL7"},"agent_actions":{"view_html":"https://pith.science/pith/EM2A7KL7DVBG3VHQBSRAVTUMG5","download_json":"https://pith.science/pith/EM2A7KL7DVBG3VHQBSRAVTUMG5.json","view_paper":"https://pith.science/paper/EM2A7KL7","resolve_alias":"https://pith.science/api/pith-number/resolve?arxiv=2310.01801&json=true","fetch_graph":"https://pith.science/api/pith-number/EM2A7KL7DVBG3VHQBSRAVTUMG5/graph.json","fetch_events":"https://pith.science/api/pith-number/EM2A7KL7DVBG3VHQBSRAVTUMG5/events.json","actions":{"anchor_timestamp":"https://pith.science/pith/EM2A7KL7DVBG3VHQBSRAVTUMG5/action/timestamp_anchor","attest_storage":"https://pith.science/pith/EM2A7KL7DVBG3VHQBSRAVTUMG5/action/storage_attestation","attest_author":"https://pith.science/pith/EM2A7KL7DVBG3VHQBSRAVTUMG5/action/author_attestation","sign_citation":"https://pith.science/pith/EM2A7KL7DVBG3VHQBSRAVTUMG5/action/citation_signature","submit_replication":"https://pith.science/pith/EM2A7KL7DVBG3VHQBSRAVTUMG5/action/replication_record"}},"created_at":"2026-05-17T23:38:14.272476+00:00","updated_at":"2026-05-17T23:38:14.272476+00:00"}