{"paper":{"title":"H$_2$O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models","license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","headline":"Heavy-hitter tokens that dominate attention let LLMs run with a much smaller KV cache and up to 29 times higher throughput.","cross_cats":[],"primary_cat":"cs.LG","authors_text":"Beidi Chen, Christopher R\\'e, Clark Barrett, Lianmin Zheng, Ruisi Cai, Tianlong Chen, Tianyi Zhou, Ying Sheng, Yuandong Tian, Zhangyang Wang, Zhao Song, Zhenyu Zhang","submitted_at":"2023-06-24T20:11:14Z","abstract_excerpt":"Large Language Models (LLMs), despite their recent impressive accomplishments, are notably cost-prohibitive to deploy, particularly for applications involving long-content generation, such as dialogue systems and story writing. Often, a large amount of transient state information, referred to as the KV cache, is stored in GPU memory in addition to model parameters, scaling linearly with the sequence length and batch size. In this paper, we introduce a novel approach for implementing the KV cache which significantly reduces its memory footprint. Our approach is based on the noteworthy observati"},"claims":{"count":4,"items":[{"kind":"strongest_claim","text":"Our implementation of H₂O with 20% heavy hitters improves the throughput over three leading inference systems DeepSpeed Zero-Inference, Hugging Face Accelerate, and FlexGen by up to 29×, 29×, and 3× on OPT-6.7B and OPT-30B.","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"The emergence of heavy hitters is natural and strongly correlates with frequent co-occurrence of tokens, and removing them results in significant performance degradation (abstract observation that must hold for the eviction policy to remain accurate).","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"H2O evicts non-heavy-hitter tokens from the KV cache using a dynamic submodular policy, retaining recent and frequent-co-occurrence tokens to reduce memory while preserving accuracy.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"},{"kind":"headline","text":"Heavy-hitter tokens that dominate attention let LLMs run with a much smaller KV cache and up to 29 times higher throughput.","source":"verdict.pith_extraction.headline","status":"machine_extracted","claim_id":"C4","attestation":"unclaimed"}],"snapshot_sha256":"9f7e6c31f0628991785558050bba43e690746c3410b4bd6badf234d945d2be38"},"source":{"id":"2306.14048","kind":"arxiv","version":3},"verdict":{"id":"c43764df-7317-4df7-94b0-b88601c38156","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-17T17:53:42.848761Z","strongest_claim":"Our implementation of H₂O with 20% heavy hitters improves the throughput over three leading inference systems DeepSpeed Zero-Inference, Hugging Face Accelerate, and FlexGen by up to 29×, 29×, and 3× on OPT-6.7B and OPT-30B.","one_line_summary":"H2O evicts non-heavy-hitter tokens from the KV cache using a dynamic submodular policy, retaining recent and frequent-co-occurrence tokens to reduce memory while preserving accuracy.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"The emergence of heavy hitters is natural and strongly correlates with frequent co-occurrence of tokens, and removing them results in significant performance degradation (abstract observation that must hold for the eviction policy to remain accurate).","pith_extraction_headline":"Heavy-hitter tokens that dominate attention let LLMs run with a much smaller KV cache and up to 29 times higher throughput."},"references":{"count":145,"sample":[{"doi":"","year":2022,"title":"LaMDA: Language Models for Dialog Applications","work_id":"1b66d0a5-f6ae-4332-8025-c662dc64b238","ref_index":1,"cited_arxiv_id":"2201.08239","is_internal_anchor":true},{"doi":"","year":2022,"title":"Wordcraft: story writing with large language models","work_id":"bb152c56-5310-43f0-ba5b-6ef51b9ed164","ref_index":2,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2022,"title":"Emergent Abilities of Large Language Models","work_id":"6ea3375b-837c-4640-a175-be7525aa3c6d","ref_index":3,"cited_arxiv_id":"2206.07682","is_internal_anchor":true},{"doi":"","year":2023,"title":"Benchmarking Large Language Models for News Summarization","work_id":"1fd145fc-38ee-4d6a-b71b-fadec1a7b54b","ref_index":4,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2022,"title":"Efficiently scaling transformer inference","work_id":"89acfce1-19be-43c5-b74d-cdfe66fa10d8","ref_index":5,"cited_arxiv_id":"","is_internal_anchor":false}],"resolved_work":145,"snapshot_sha256":"35ad91d0f9eeabcf04c80d9849c3c50da9591e7779c052db808df401e685a93c","internal_anchors":32},"formal_canon":{"evidence_count":3,"snapshot_sha256":"4104fc4c6342cb5620ec681137622a31e7c9a714aefc7e2bb8e189329378a380"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"}