{"paper":{"title":"CoRDS: Coreset-based Representative and Diverse Selection for Streaming Video Understanding","license":"http://creativecommons.org/licenses/by/4.0/","headline":"Treating KV-cache compression as coreset selection improves streaming video understanding under fixed memory budgets.","cross_cats":[],"primary_cat":"cs.CV","authors_text":"Ailar Mahdizadeh, Leonid Sigal, Muchen Li, Puria Azadi, Xiangteng He","submitted_at":"2026-05-14T03:22:30Z","abstract_excerpt":"Streaming video understanding with large vision-language models (VLMs) requires a compact memory that can support future reasoning over an ever-growing visual history. A common solution is to compress the key-value (KV) cache, but existing streaming methods typically rely on local token-wise heuristics, such as recency, temporal redundancy, or saliency, which do not explicitly optimize whether the retained cache is representative of the accumulated history. We propose to view KV-cache compression as a coreset selection problem: rather than scoring tokens independently for retention, we select "},"claims":{"count":4,"items":[{"kind":"strongest_claim","text":"Across four open-source VLMs and five long-video and streaming-video benchmarks, our method improves over heuristic streaming compression baselines under a fixed cache budget.","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"That a small coreset chosen by joint KV coverage and orthogonality will preserve the information needed for arbitrary future reasoning queries over the full history.","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"CoRDS selects a compact KV-cache subset via joint-space coreset coverage and log-det diversity to outperform token-wise heuristics on long-video VLM benchmarks.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"},{"kind":"headline","text":"Treating KV-cache compression as coreset selection improves streaming video understanding under fixed memory budgets.","source":"verdict.pith_extraction.headline","status":"machine_extracted","claim_id":"C4","attestation":"unclaimed"}],"snapshot_sha256":"d3c1de20f15168de3249dfe5176e72b4d1eda4bd635eb4d9c36fff85a1725ba3"},"source":{"id":"2605.14310","kind":"arxiv","version":1},"verdict":{"id":"4a664be5-ccb1-4be9-a9c7-1ea9c582dcb5","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-15T02:24:37.272302Z","strongest_claim":"Across four open-source VLMs and five long-video and streaming-video benchmarks, our method improves over heuristic streaming compression baselines under a fixed cache budget.","one_line_summary":"CoRDS selects a compact KV-cache subset via joint-space coreset coverage and log-det diversity to outperform token-wise heuristics on long-video VLM benchmarks.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"That a small coreset chosen by joint KV coverage and orthogonality will preserve the information needed for arbitrary future reasoning queries over the full history.","pith_extraction_headline":"Treating KV-cache compression as coreset selection improves streaming video understanding under fixed memory budgets."},"references":{"count":32,"sample":[{"doi":"","year":2024,"title":"Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution","work_id":"8abcfe4f-e0fb-44b7-9123-448fac95f90a","ref_index":2,"cited_arxiv_id":"2409.12191","is_internal_anchor":true},{"doi":"","year":2025,"title":"Long context transfer from language to vision.Transactions on Machine Learning Research, 2025","work_id":"62705864-363c-4f27-84b3-0f90e70f58f2","ref_index":3,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2025,"title":"Infinipot-v: Memory-constrained kv cache compression for streaming video understanding","work_id":"fd86124e-903d-4c8e-a169-f9b97c80b4b6","ref_index":4,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2024,"title":"Streaming long video understanding with large language models","work_id":"6a0c4f9b-d588-49db-8fe1-5d2652cc34f5","ref_index":5,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2025,"title":"Kim, Bilge Soran, Raghuraman Krishnamoorthi, Mohamed Elhoseiny, and Vikas Chandra","work_id":"e47963a0-7038-4696-99ee-74c58cd4e5de","ref_index":6,"cited_arxiv_id":"","is_internal_anchor":false}],"resolved_work":32,"snapshot_sha256":"e47305bb4df4f97304fbc7b8522295edd67bd1fc08b7cc0c177e4d4cab054d3f","internal_anchors":7},"formal_canon":{"evidence_count":2,"snapshot_sha256":"b1cc6b3b748e09ee0735fa762b4d42b27c47d6e4fc29ccb0e729e336af5b5524"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"}