{"paper":{"title":"AdaFocus: Adaptive Relevance-Diversity Sampling with Zero-Cache Look-back for Efficient Long Video Understanding","license":"http://creativecommons.org/licenses/by/4.0/","headline":"AdaFocus improves long-video accuracy while cutting visual tokens by about 33 times through adaptive preview sampling and on-demand disk retrieval.","cross_cats":["cs.AI"],"primary_cat":"cs.CV","authors_text":"Haoxuan Yu, Ning Qin, Xiao Yang, Yingzhe Ma, Zixin Li","submitted_at":"2026-05-13T03:40:21Z","abstract_excerpt":"Long video understanding is heavily bottlenecked by a rigid one-shot paradigm: existing methods either densely encode videos at prohibitive memory and latency costs, or aggressively compress them into sparse frame sets that irreversibly discard fine-grained evidence needed for downstream reasoning. Consequently, current models struggle to simultaneously balance temporal coverage, visual details, and computational efficiency.\n  We propose AdaFocus, an efficient framework that rethinks long-video understanding as progressive evidence acquisition rather than one-pass encoding. AdaFocus relies on "},"claims":{"count":4,"items":[{"kind":"strongest_claim","text":"AdaFocus delivers a substantially better efficiency-accuracy trade-off than strong baselines. 
Compared with conventional dense encoding, AdaFocus achieves improved task performance (e.g., +2.59 accuracy on VideoMME, +8.39 mIoU on Charades-STA over single-pass inference) while reducing visual token consumption by ~33x and eliminating the need for in-memory frame pre-caching through its zero-cache disk retrieval design.","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"The uncertainty-triggered refinement mechanism can reliably identify when and which high-resolution evidence is needed from the initial low-cost preview, without missing critical details that would require exhaustive preloading.","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"AdaFocus achieves better accuracy on long-video benchmarks with roughly 33 times fewer visual tokens by combining query-aware adaptive sampling and zero-cache disk-based refinement.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"},{"kind":"headline","text":"AdaFocus improves long-video accuracy while cutting visual tokens by about 33 times through adaptive preview sampling and on-demand disk retrieval.","source":"verdict.pith_extraction.headline","status":"machine_extracted","claim_id":"C4","attestation":"unclaimed"}],"snapshot_sha256":"149ae7405348385853575b2a38e602ea91834d89db54625b86404a9a30957ce3"},"source":{"id":"2605.12954","kind":"arxiv","version":1},"verdict":{"id":"2d444176-e987-4148-9205-cafd94888b09","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-14T19:39:53.060195Z","strongest_claim":"AdaFocus delivers a substantially better efficiency-accuracy trade-off than strong baselines. 
Compared with conventional dense encoding, AdaFocus achieves improved task performance (e.g., +2.59 accuracy on VideoMME, +8.39 mIoU on Charades-STA over single-pass inference) while reducing visual token consumption by ~33x and eliminating the need for in-memory frame pre-caching through its zero-cache disk retrieval design.","one_line_summary":"AdaFocus achieves better accuracy on long-video benchmarks with roughly 33 times fewer visual tokens by combining query-aware adaptive sampling and zero-cache disk-based refinement.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"The uncertainty-triggered refinement mechanism can reliably identify when and which high-resolution evidence is needed from the initial low-cost preview, without missing critical details that would require exhaustive preloading.","pith_extraction_headline":"AdaFocus improves long-video accuracy while cutting visual tokens by about 33 times through adaptive preview sampling and on-demand disk retrieval."},"references":{"count":33,"sample":[{"doi":"","year":2017,"title":"Lisa Anne Hendricks, Oliver Wang, Eli Shechtman, Josef Sivic, Trevor Darrell, and Bryan Russell. 2017. Localizing moments in video with natural language. In Proceedings of the IEEE international confe","work_id":"a050ad1f-bc2f-410c-a773-37094bb7af2b","ref_index":1,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2025,"title":"Qwen3-VL Technical Report","work_id":"1fe243aa-e3c0-4da6-b391-4cbcfc88d5c0","ref_index":2,"cited_arxiv_id":"2511.21631","is_internal_anchor":true},{"doi":"","year":2024,"title":"VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs","work_id":"ccfc3f89-c510-45f1-8a35-ed1a56c0ae5c","ref_index":3,"cited_arxiv_id":"2406.07476","is_internal_anchor":true},{"doi":"","year":2022,"title":"Tri Dao, Dan Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. 2022. 
FlashAttention: Fast and memory-efficient exact attention with IO-awareness. Advances in Neural Information Processing Systems 35 ","work_id":"15271a55-5c79-4cdd-b2cb-9ad910540658","ref_index":4,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2025,"title":"Video-R1: Reinforcing Video Reasoning in MLLMs","work_id":"0ce88332-564c-4361-8e2a-3850eb1ace9c","ref_index":5,"cited_arxiv_id":"2503.21776","is_internal_anchor":true}],"resolved_work":33,"snapshot_sha256":"55ce5452aad1530c14d7dea2c266a8ec2d187431d7a16fce547b8554ca71bd6e","internal_anchors":9},"formal_canon":{"evidence_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"}