{"record_type":"pith_number_record","schema_url":"https://pith.science/schemas/pith-number/v1.json","pith_number":"pith:2023:MH2XVHQOPBSABRG55I2UUI33LR","short_pith_number":"pith:MH2XVHQO","schema_version":"1.0","canonical_sha256":"61f57a9e0e786400c4ddea354a237b5c66af12f0fdd177f5b2363818b7189051","source":{"kind":"arxiv","id":"2306.14048","version":3},"attestation_state":"computed","paper":{"title":"H$_2$O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models","license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","headline":"Heavy-hitter tokens that dominate attention let LLMs run with a much smaller KV cache and up to 29 times higher throughput.","cross_cats":[],"primary_cat":"cs.LG","authors_text":"Beidi Chen, Christopher R\\'e, Clark Barrett, Lianmin Zheng, Ruisi Cai, Tianlong Chen, Tianyi Zhou, Ying Sheng, Yuandong Tian, Zhangyang Wang, Zhao Song, Zhenyu Zhang","submitted_at":"2023-06-24T20:11:14Z","abstract_excerpt":"Large Language Models (LLMs), despite their recent impressive accomplishments, are notably cost-prohibitive to deploy, particularly for applications involving long-content generation, such as dialogue systems and story writing. Often, a large amount of transient state information, referred to as the KV cache, is stored in GPU memory in addition to model parameters, scaling linearly with the sequence length and batch size. In this paper, we introduce a novel approach for implementing the KV cache which significantly reduces its memory footprint. Our approach is based on the noteworthy observati"},"verification_status":{"content_addressed":true,"pith_receipt":true,"author_attested":false,"weak_author_claims":0,"strong_author_claims":0,"externally_anchored":false,"storage_verified":false,"citation_signatures":0,"replication_records":0,"graph_snapshot":true,"references_resolved":true,"formal_links_present":true},"canonical_record":{"source":{"id":"2306.14048","kind":"arxiv","version":3},"metadata":{"license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","primary_cat":"cs.LG","submitted_at":"2023-06-24T20:11:14Z","cross_cats_sorted":[],"title_canon_sha256":"a04537268fce239cf30f5f553c8505a3533aa36fd8781487ff50496f711acc78","abstract_canon_sha256":"1cf0b946cfa0d0e8f3a51b87e5d5d9de557ac26c2712fe9dc6ab49bee852cdfe"},"schema_version":"1.0"},"receipt":{"kind":"pith_receipt","key_id":"pith-v1-2026-05","algorithm":"ed25519","signed_at":"2026-05-17T23:38:13.469236Z","signature_b64":"EY/p6rbcEE8y6sP3cx2UTMhrsf1eEJjRlmSzfk+Vgd41PGyeerTu9rUZ61YHPCbc4oM2ra5nS3N5fJen9rvdCw==","signed_message":"canonical_sha256_bytes","builder_version":"pith-number-builder-2026-05-17-v1","receipt_version":"0.3","canonical_sha256":"61f57a9e0e786400c4ddea354a237b5c66af12f0fdd177f5b2363818b7189051","last_reissued_at":"2026-05-17T23:38:13.468578Z","signature_status":"signed_v1","first_computed_at":"2026-05-17T23:38:13.468578Z","public_key_fingerprint":"8d4b5ee74e4693bcd1df2446408b0d54"},"graph_snapshot":{"paper":{"title":"H$_2$O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models","license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","headline":"Heavy-hitter tokens that dominate attention let LLMs run with a much smaller KV cache and up to 29 times higher throughput.","cross_cats":[],"primary_cat":"cs.LG","authors_text":"Beidi Chen, Christopher R\\'e, Clark Barrett, Lianmin Zheng, Ruisi Cai, Tianlong Chen, Tianyi Zhou, Ying Sheng, Yuandong Tian, Zhangyang Wang, Zhao Song, Zhenyu Zhang","submitted_at":"2023-06-24T20:11:14Z","abstract_excerpt":"Large Language Models (LLMs), despite their recent impressive accomplishments, are notably cost-prohibitive to deploy, particularly for applications involving long-content generation, such as dialogue systems and story writing. Often, a large amount of transient state information, referred to as the KV cache, is stored in GPU memory in addition to model parameters, scaling linearly with the sequence length and batch size. In this paper, we introduce a novel approach for implementing the KV cache which significantly reduces its memory footprint. Our approach is based on the noteworthy observati"},"claims":{"count":4,"items":[{"kind":"strongest_claim","text":"Our implementation of H₂O with 20% heavy hitters improves the throughput over three leading inference systems DeepSpeed Zero-Inference, Hugging Face Accelerate, and FlexGen by up to 29×, 29×, and 3× on OPT-6.7B and OPT-30B.","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"The emergence of heavy hitters is natural and strongly correlates with frequent co-occurrence of tokens, and removing them results in significant performance degradation (abstract observation that must hold for the eviction policy to remain accurate).","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"H2O evicts non-heavy-hitter tokens from the KV cache using a dynamic submodular policy, retaining recent and frequent-co-occurrence tokens to reduce memory while preserving accuracy.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"},{"kind":"headline","text":"Heavy-hitter tokens that dominate attention let LLMs run with a much smaller KV cache and up to 29 times higher throughput.","source":"verdict.pith_extraction.headline","status":"machine_extracted","claim_id":"C4","attestation":"unclaimed"}],"snapshot_sha256":"9f7e6c31f0628991785558050bba43e690746c3410b4bd6badf234d945d2be38"},"source":{"id":"2306.14048","kind":"arxiv","version":3},"verdict":{"id":"c43764df-7317-4df7-94b0-b88601c38156","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-17T17:53:42.848761Z","strongest_claim":"Our implementation of H₂O with 20% heavy hitters improves the throughput over three leading inference systems DeepSpeed Zero-Inference, Hugging Face Accelerate, and FlexGen by up to 29×, 29×, and 3× on OPT-6.7B and OPT-30B.","one_line_summary":"H2O evicts non-heavy-hitter tokens from the KV cache using a dynamic submodular policy, retaining recent and frequent-co-occurrence tokens to reduce memory while preserving accuracy.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"The emergence of heavy hitters is natural and strongly correlates with frequent co-occurrence of tokens, and removing them results in significant performance degradation (abstract observation that must hold for the eviction policy to remain accurate).","pith_extraction_headline":"Heavy-hitter tokens that dominate attention let LLMs run with a much smaller KV cache and up to 29 times higher throughput."},"references":{"count":145,"sample":[{"doi":"","year":2022,"title":"LaMDA: Language Models for Dialog Applications","work_id":"1b66d0a5-f6ae-4332-8025-c662dc64b238","ref_index":1,"cited_arxiv_id":"2201.08239","is_internal_anchor":true},{"doi":"","year":2022,"title":"Wordcraft: story writing with large language models","work_id":"bb152c56-5310-43f0-ba5b-6ef51b9ed164","ref_index":2,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2022,"title":"Emergent Abilities of Large Language Models","work_id":"6ea3375b-837c-4640-a175-be7525aa3c6d","ref_index":3,"cited_arxiv_id":"2206.07682","is_internal_anchor":true},{"doi":"","year":2023,"title":"Benchmarking Large Language Models for News Summarization","work_id":"1fd145fc-38ee-4d6a-b71b-fadec1a7b54b","ref_index":4,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2022,"title":"Efficiently scaling transformer inference","work_id":"89acfce1-19be-43c5-b74d-cdfe66fa10d8","ref_index":5,"cited_arxiv_id":"","is_internal_anchor":false}],"resolved_work":145,"snapshot_sha256":"35ad91d0f9eeabcf04c80d9849c3c50da9591e7779c052db808df401e685a93c","internal_anchors":32},"formal_canon":{"evidence_count":3,"snapshot_sha256":"4104fc4c6342cb5620ec681137622a31e7c9a714aefc7e2bb8e189329378a380"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"},"aliases":[{"alias_kind":"arxiv","alias_value":"2306.14048","created_at":"2026-05-17T23:38:13.468675+00:00"},{"alias_kind":"arxiv_version","alias_value":"2306.14048v3","created_at":"2026-05-17T23:38:13.468675+00:00"},{"alias_kind":"doi","alias_value":"10.48550/arxiv.2306.14048","created_at":"2026-05-17T23:38:13.468675+00:00"},{"alias_kind":"pith_short_12","alias_value":"MH2XVHQOPBSA","created_at":"2026-05-18T12:33:37.589309+00:00"},{"alias_kind":"pith_short_16","alias_value":"MH2XVHQOPBSABRG5","created_at":"2026-05-18T12:33:37.589309+00:00"},{"alias_kind":"pith_short_8","alias_value":"MH2XVHQO","created_at":"2026-05-18T12:33:37.589309+00:00"}],"events":[],"event_summary":{},"paper_claims":[],"inbound_citations":{"count":18,"internal_anchor_count":18,"sample":[{"citing_arxiv_id":"2511.03092","citing_title":"SnapStream: Efficient Long Sequence Decoding on Dataflow Accelerators","ref_index":29,"is_internal_anchor":true},{"citing_arxiv_id":"2510.09608","citing_title":"StreamingVLM: Real-Time Understanding for Infinite Video Streams","ref_index":12,"is_internal_anchor":true},{"citing_arxiv_id":"2310.01801","citing_title":"Model Tells You What to Discard: Adaptive KV Cache Compression for LLMs","ref_index":101,"is_internal_anchor":true},{"citing_arxiv_id":"2401.05459","citing_title":"Personal LLM Agents: Insights and Survey about the Capability, Efficiency and Security","ref_index":267,"is_internal_anchor":true},{"citing_arxiv_id":"2605.08913","citing_title":"Non-Monotonic Latency in Apple MPS Decoding: KV Cache Interactions and Execution Regimes","ref_index":12,"is_internal_anchor":true},{"citing_arxiv_id":"2401.10774","citing_title":"Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads","ref_index":79,"is_internal_anchor":true},{"citing_arxiv_id":"2605.11733","citing_title":"Position: LLM Inference Should Be Evaluated as Energy-to-Token Production","ref_index":54,"is_internal_anchor":true},{"citing_arxiv_id":"2402.02750","citing_title":"KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache","ref_index":21,"is_internal_anchor":true},{"citing_arxiv_id":"2605.09735","citing_title":"KV-RM: Regularizing KV-Cache Movement for Static-Graph LLM Serving","ref_index":46,"is_internal_anchor":true},{"citing_arxiv_id":"2605.08913","citing_title":"Non-Monotonic Latency in Apple MPS Decoding: KV Cache Interactions and Execution Regimes","ref_index":12,"is_internal_anchor":true},{"citing_arxiv_id":"2605.06554","citing_title":"Long Context Pre-Training with Lighthouse Attention","ref_index":39,"is_internal_anchor":true},{"citing_arxiv_id":"2605.07363","citing_title":"MISA: Mixture of Indexer Sparse Attention for Long-Context LLM Inference","ref_index":33,"is_internal_anchor":true},{"citing_arxiv_id":"2605.06763","citing_title":"Sparse Attention as a Range Searching Problem: Towards an Inference-Efficient Index for KV Cache","ref_index":55,"is_internal_anchor":true},{"citing_arxiv_id":"2604.17935","citing_title":"How Much Cache Does Reasoning Need? Depth-Cache Tradeoffs in KV-Compressed Transformers","ref_index":18,"is_internal_anchor":true},{"citing_arxiv_id":"2605.05219","citing_title":"Sparse Prefix Caching for Hybrid and Recurrent LLM Serving","ref_index":37,"is_internal_anchor":true},{"citing_arxiv_id":"2604.16864","citing_title":"HieraSparse: Hierarchical Semi-Structured Sparse KV Attention","ref_index":18,"is_internal_anchor":true},{"citing_arxiv_id":"2604.21335","citing_title":"Sub-Token Routing in LoRA for Adaptation and Query-Aware KV Compression","ref_index":20,"is_internal_anchor":true},{"citing_arxiv_id":"2605.02568","citing_title":"StreamIndex: Memory-Bounded Compressed Sparse Attention via Streaming Top-k","ref_index":36,"is_internal_anchor":true}]},"formal_canon":{"evidence_count":3,"sample":[],"anchors":[]},"links":{"html":"https://pith.science/pith/MH2XVHQOPBSABRG55I2UUI33LR","json":"https://pith.science/pith/MH2XVHQOPBSABRG55I2UUI33LR.json","graph_json":"https://pith.science/api/pith-number/MH2XVHQOPBSABRG55I2UUI33LR/graph.json","events_json":"https://pith.science/api/pith-number/MH2XVHQOPBSABRG55I2UUI33LR/events.json","paper":"https://pith.science/paper/MH2XVHQO"},"agent_actions":{"view_html":"https://pith.science/pith/MH2XVHQOPBSABRG55I2UUI33LR","download_json":"https://pith.science/pith/MH2XVHQOPBSABRG55I2UUI33LR.json","view_paper":"https://pith.science/paper/MH2XVHQO","resolve_alias":"https://pith.science/api/pith-number/resolve?arxiv=2306.14048&json=true","fetch_graph":"https://pith.science/api/pith-number/MH2XVHQOPBSABRG55I2UUI33LR/graph.json","fetch_events":"https://pith.science/api/pith-number/MH2XVHQOPBSABRG55I2UUI33LR/events.json","actions":{"anchor_timestamp":"https://pith.science/pith/MH2XVHQOPBSABRG55I2UUI33LR/action/timestamp_anchor","attest_storage":"https://pith.science/pith/MH2XVHQOPBSABRG55I2UUI33LR/action/storage_attestation","attest_author":"https://pith.science/pith/MH2XVHQOPBSABRG55I2UUI33LR/action/author_attestation","sign_citation":"https://pith.science/pith/MH2XVHQOPBSABRG55I2UUI33LR/action/citation_signature","submit_replication":"https://pith.science/pith/MH2XVHQOPBSABRG55I2UUI33LR/action/replication_record"}},"created_at":"2026-05-17T23:38:13.468675+00:00","updated_at":"2026-05-17T23:38:13.468675+00:00"}