{"record_type":"pith_number_record","schema_url":"https://pith.science/schemas/pith-number/v1.json","pith_number":"pith:2026:Z4MBO6YKC3ON2JYN5WVCJJK75X","short_pith_number":"pith:Z4MBO6YK","schema_version":"1.0","canonical_sha256":"cf18177b0a16dcdd270dedaa24a55fede162bbe452ca8530d2a411ac612a2100","source":{"kind":"arxiv","id":"2604.08426","version":4},"attestation_state":"computed","paper":{"title":"KV Cache Offloading for Context-Intensive Tasks","license":"http://creativecommons.org/licenses/by/4.0/","headline":"KV-cache offloading causes major accuracy losses on tasks that require pulling lots of details from long inputs, but a simpler alternative recovers performance across models.","cross_cats":["cs.AI","cs.CL"],"primary_cat":"cs.LG","authors_text":"Andrey Bocharnikov, Denis Kuznedelev, Ivan Ermakov, Vyacheslav Zhdanovskiy, Yegor Yershov","submitted_at":"2026-04-09T16:30:44Z","abstract_excerpt":"With the growing demand for long-context LLMs across a wide range of applications, the key-value (KV) cache has become a critical bottleneck for both latency and memory usage. Recently, KV-cache offloading has emerged as a promising approach to reduce memory footprint and inference latency while preserving accuracy. Prior evaluations have largely focused on tasks that do not require extracting large amounts of information from the context. 
In this work, we study KV-cache offloading on context-intensive tasks: problems where the solution requires looking up a lot of information from the input p"},"verification_status":{"content_addressed":true,"pith_receipt":true,"author_attested":false,"weak_author_claims":0,"strong_author_claims":0,"externally_anchored":false,"storage_verified":false,"citation_signatures":0,"replication_records":0,"graph_snapshot":true,"references_resolved":true,"formal_links_present":false},"canonical_record":{"source":{"id":"2604.08426","kind":"arxiv","version":4},"metadata":{"license":"http://creativecommons.org/licenses/by/4.0/","primary_cat":"cs.LG","submitted_at":"2026-04-09T16:30:44Z","cross_cats_sorted":["cs.AI","cs.CL"],"title_canon_sha256":"a95638f90930d2a3d264e375f1f001c33755d2ca4274281faf5cafcd6bac51d3","abstract_canon_sha256":"87d9201feef55e531e4c7771d4acb9404df4e293b23a2de9748b705a17c2adf4"},"schema_version":"1.0"},"receipt":{"kind":"pith_receipt","key_id":"pith-v1-2026-05","algorithm":"ed25519","signed_at":"2026-05-20T00:01:41.124377Z","signature_b64":"NxkDoz2DLPdcnTvyYkumGIT2ckQYX5lif3cGnqZrgOPPExA+cjQdp5pyBHZ2R0FOS50p9fWXIGSlSdun1PzgCQ==","signed_message":"canonical_sha256_bytes","builder_version":"pith-number-builder-2026-05-17-v1","receipt_version":"0.3","canonical_sha256":"cf18177b0a16dcdd270dedaa24a55fede162bbe452ca8530d2a411ac612a2100","last_reissued_at":"2026-05-20T00:01:41.123824Z","signature_status":"signed_v1","first_computed_at":"2026-05-20T00:01:41.123824Z","public_key_fingerprint":"8d4b5ee74e4693bcd1df2446408b0d54"},"graph_snapshot":{"paper":{"title":"KV Cache Offloading for Context-Intensive Tasks","license":"http://creativecommons.org/licenses/by/4.0/","headline":"KV-cache offloading causes major accuracy losses on tasks that require pulling lots of details from long inputs, but a simpler alternative recovers performance across models.","cross_cats":["cs.AI","cs.CL"],"primary_cat":"cs.LG","authors_text":"Andrey Bocharnikov, Denis Kuznedelev, Ivan 
Ermakov, Vyacheslav Zhdanovskiy, Yegor Yershov","submitted_at":"2026-04-09T16:30:44Z","abstract_excerpt":"With the growing demand for long-context LLMs across a wide range of applications, the key-value (KV) cache has become a critical bottleneck for both latency and memory usage. Recently, KV-cache offloading has emerged as a promising approach to reduce memory footprint and inference latency while preserving accuracy. Prior evaluations have largely focused on tasks that do not require extracting large amounts of information from the context. In this work, we study KV-cache offloading on context-intensive tasks: problems where the solution requires looking up a lot of information from the input p"},"claims":{"count":4,"items":[{"kind":"strongest_claim","text":"Existing KV-cache offloading techniques produce significant performance degradation on context-intensive tasks; a simpler alternative strategy significantly improves accuracy across multiple LLM families and benchmarks.","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"The assumption that the observed accuracy drops are caused primarily by low-rank key projections and unreliable landmarks rather than by other implementation details of the offloading systems or by the specific choice of evaluation prompts and metrics.","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"KV offloading degrades accuracy on context-intensive tasks due to low-rank key projections and unreliable landmarks; a simpler alternative improves results across models and benchmarks.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"},{"kind":"headline","text":"KV-cache offloading causes major accuracy losses on tasks that require pulling lots of details from long inputs, but a simpler alternative 
recovers performance across models.","source":"verdict.pith_extraction.headline","status":"machine_extracted","claim_id":"C4","attestation":"unclaimed"}],"snapshot_sha256":"edcacc817497d64001d4095e82392c415c565593ae70e886bdec8e0aadb57d6b"},"source":{"id":"2604.08426","kind":"arxiv","version":4},"verdict":{"id":"aa39596d-ce9d-401c-a220-d791b219afa3","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-19T16:37:28.609807Z","strongest_claim":"Existing KV-cache offloading techniques produce significant performance degradation on context-intensive tasks; a simpler alternative strategy significantly improves accuracy across multiple LLM families and benchmarks.","one_line_summary":"KV offloading degrades accuracy on context-intensive tasks due to low-rank key projections and unreliable landmarks; a simpler alternative improves results across models and benchmarks.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"The assumption that the observed accuracy drops are caused primarily by low-rank key projections and unreliable landmarks rather than by other implementation details of the offloading systems or by the specific choice of evaluation prompts and metrics.","pith_extraction_headline":"KV-cache offloading causes major accuracy losses on tasks that require pulling lots of details from long inputs, but a simpler alternative recovers performance across models."},"integrity":{"clean":true,"summary":{"advisory":0,"critical":0,"by_detector":{},"informational":0},"endpoint":"/pith/2604.08426/integrity.json","findings":[],"available":true,"detectors_run":[],"snapshot_sha256":"c28c3603d3b5d939e8dc4c7e95fa8dfce3d595e45f758748cecf8e644a296938"},"references":{"count":66,"sample":[{"doi":"","year":2022,"title":"R. Y . Aminabadi, S. Rajbhandari, M. Zhang, A. A. Awan, C. Li, D. Li, E. Zheng, J. Rasley, S. Smith, O. Ruwase, and Y . He. 
Deepspeed inference: Enabling efficient inference of transformer models at","work_id":"aa03a3df-a912-44fb-8afd-0a62ea51e7ed","ref_index":1,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2026,"title":"S. Ananthanarayanan and A. Sengupta. Understanding the physics of key-value cache compression for LLMs through attention dynamics. arXiv preprint arXiv:2603.01426, 2026","work_id":"ed7782c0-1d6f-4734-acd4-3005c48cee88","ref_index":2,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2026,"title":"S. Ananthanarayanan, A. Sengupta, and T. Chakraborty. Understanding the physics of key-value cache compression for llms through attention dynamics, 2026","work_id":"97a1505e-e443-48ce-81a6-ea11a111a1ec","ref_index":3,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2024,"title":"S. Ashkboos, A. Mohtashami, M. L. Croci, B. Li, P. Cameron, M. Jaggi, D. Alistarh, T. Hoefler, and J. Hensman. Quarot: Outlier-free 4-bit inference in rotated llms. Advances in Neural Information Proce","work_id":"5b998afa-9885-4090-b2de-067d0c30016f","ref_index":4,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2024,"title":"Y. Bai, X. Lv, J. Zhang, H. Lyu, J. Tang, Z. Huang, Z. Du, X. Liu, A. Zeng, L. Hou, Y. Dong, J. Tang, and J. Li. Longbench: A bilingual, multitask benchmark for long context understanding. 
InProceed","work_id":"b9bb4ee1-7d33-4626-af7b-c290cdbd7c2c","ref_index":5,"cited_arxiv_id":"","is_internal_anchor":false}],"resolved_work":66,"snapshot_sha256":"f45e28349a4e76d39e5f11285f218764b22c90c06ec7528f1272f25f6a158770","internal_anchors":5},"formal_canon":{"evidence_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"},"aliases":[{"alias_kind":"arxiv","alias_value":"2604.08426","created_at":"2026-05-20T00:01:41.123911+00:00"},{"alias_kind":"arxiv_version","alias_value":"2604.08426v4","created_at":"2026-05-20T00:01:41.123911+00:00"},{"alias_kind":"doi","alias_value":"10.48550/arxiv.2604.08426","created_at":"2026-05-20T00:01:41.123911+00:00"},{"alias_kind":"pith_short_12","alias_value":"Z4MBO6YKC3ON","created_at":"2026-05-20T00:01:41.123911+00:00"},{"alias_kind":"pith_short_16","alias_value":"Z4MBO6YKC3ON2JYN","created_at":"2026-05-20T00:01:41.123911+00:00"},{"alias_kind":"pith_short_8","alias_value":"Z4MBO6YK","created_at":"2026-05-20T00:01:41.123911+00:00"}],"events":[],"event_summary":{},"paper_claims":[],"inbound_citations":{"count":1,"internal_anchor_count":1,"sample":[{"citing_arxiv_id":"2605.08234","citing_title":"When Does Value-Aware KV Eviction Help? 
A Fixed-Contract Diagnostic for Non-Monotone Cache Compression","ref_index":10,"is_internal_anchor":true}]},"formal_canon":{"evidence_count":0,"sample":[],"anchors":[]},"links":{"html":"https://pith.science/pith/Z4MBO6YKC3ON2JYN5WVCJJK75X","json":"https://pith.science/pith/Z4MBO6YKC3ON2JYN5WVCJJK75X.json","graph_json":"https://pith.science/api/pith-number/Z4MBO6YKC3ON2JYN5WVCJJK75X/graph.json","events_json":"https://pith.science/api/pith-number/Z4MBO6YKC3ON2JYN5WVCJJK75X/events.json","paper":"https://pith.science/paper/Z4MBO6YK"},"agent_actions":{"view_html":"https://pith.science/pith/Z4MBO6YKC3ON2JYN5WVCJJK75X","download_json":"https://pith.science/pith/Z4MBO6YKC3ON2JYN5WVCJJK75X.json","view_paper":"https://pith.science/paper/Z4MBO6YK","resolve_alias":"https://pith.science/api/pith-number/resolve?arxiv=2604.08426&json=true","fetch_graph":"https://pith.science/api/pith-number/Z4MBO6YKC3ON2JYN5WVCJJK75X/graph.json","fetch_events":"https://pith.science/api/pith-number/Z4MBO6YKC3ON2JYN5WVCJJK75X/events.json","actions":{"anchor_timestamp":"https://pith.science/pith/Z4MBO6YKC3ON2JYN5WVCJJK75X/action/timestamp_anchor","attest_storage":"https://pith.science/pith/Z4MBO6YKC3ON2JYN5WVCJJK75X/action/storage_attestation","attest_author":"https://pith.science/pith/Z4MBO6YKC3ON2JYN5WVCJJK75X/action/author_attestation","sign_citation":"https://pith.science/pith/Z4MBO6YKC3ON2JYN5WVCJJK75X/action/citation_signature","submit_replication":"https://pith.science/pith/Z4MBO6YKC3ON2JYN5WVCJJK75X/action/replication_record"}},"created_at":"2026-05-20T00:01:41.123911+00:00","updated_at":"2026-05-20T00:01:41.123911+00:00"}