arxiv: 2605.12954 · v1 · submitted 2026-05-13 · 💻 cs.CV · cs.AI

Recognition: unknown

AdaFocus: Adaptive Relevance-Diversity Sampling with Zero-Cache Look-back for Efficient Long Video Understanding

Xiao Yang , Yingzhe Ma , Haoxuan Yu , Zixin Li , Ning Qin

Authors on Pith no claims yet

Pith reviewed 2026-05-14 19:39 UTC · model grok-4.3

classification 💻 cs.CV cs.AI

keywords long video understandingadaptive samplingzero-cache retrievalvideo reasoningefficiency accuracy tradeoffprogressive evidence acquisitionmultimodal video modelsuncertainty triggered refinement

0 comments

The pith

AdaFocus improves long-video accuracy while cutting visual tokens by about 33 times through adaptive preview sampling and on-demand disk retrieval.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that long video understanding can move from rigid one-shot dense encoding or irreversible sparse compression to a progressive process of evidence acquisition. It introduces an adaptive sampler that builds a compact informative preview based on query relevance and diversity, falling back to global clustering when local grounding is weak. When the model shows low confidence, it retrieves targeted high-resolution frames directly from disk without ever caching the full sequence in memory. This design yields measurable gains such as higher accuracy on VideoMME and better localization on Charades-STA while using far fewer tokens and eliminating pre-caching overhead. A sympathetic reader would care because the approach makes reasoning over hour-long videos feasible under practical memory and latency constraints.

Core claim

AdaFocus rethinks long-video understanding as progressive evidence acquisition rather than one-pass encoding. Its Query-Aware Adaptive Relevance-Diversity sampler (AdaRD) produces a compact yet informative video preview and switches to global clustering when the query lacks reliable local grounding. An uncertainty-triggered refinement mechanism then performs targeted look-back, retrieving high-resolution evidence directly from disk via a zero-cache I/O design only when the model is not confident. Experiments on seven benchmarks show this delivers improved task performance such as +2.59 accuracy on VideoMME and +8.39 mIoU on Charades-STA over single-pass inference, while reducing visual token

What carries the argument

The Query-Aware Adaptive Relevance-Diversity sampler (AdaRD) paired with uncertainty-triggered zero-cache disk retrieval, which selects a compact preview adaptively and fetches missing high-resolution evidence on demand.

If this is right

Long-video tasks achieve higher accuracy without the memory cost of dense frame encoding.
Visual token consumption drops by roughly 33 times compared with conventional dense methods.
In-memory frame pre-caching becomes unnecessary through direct disk retrieval.
The same progressive preview-plus-refinement pattern scales across seven standard long-video benchmarks.
Discarded fine-grained visual details become recoverable rather than permanently lost.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar zero-cache on-demand retrieval could apply to streaming or real-time video pipelines where full preloading is impossible.
The adaptive relevance-diversity sampler may generalize to other sequential data such as long audio transcripts or document collections.
Extending the uncertainty trigger to multi-query or open-ended tasks could further reduce unnecessary high-resolution fetches.
Integration with existing video encoders would allow direct measurement of end-to-end latency savings on edge hardware.

Load-bearing premise

The uncertainty measure from the initial low-cost preview reliably detects when and where high-resolution details are required without missing critical evidence.

What would settle it

A controlled test on videos where key reasoning evidence lies in frames omitted from the initial preview yet the uncertainty trigger fails to request refinement, producing lower accuracy than a dense baseline.

Figures

Figures reproduced from arXiv: 2605.12954 by Haoxuan Yu, Ning Qin, Xiao Yang, Yingzhe Ma, Zixin Li.

**Figure 1.** Figure 1: Overview of the AdaFocus framework. (a) Coarse Preview Stage: the input video is processed at 1 fps through a CLIP [PITH_FULL_IMAGE:figures/full_fig_p002_1.png] view at source ↗

**Figure 2.** Figure 2: Detailed methodology of AdaFocus. (a) Relevance-Diversity Balancing: each candidate frame is scored by cosine [PITH_FULL_IMAGE:figures/full_fig_p003_2.png] view at source ↗

**Figure 4.** Figure 4: Comparison with concurrent video reasoning meth [PITH_FULL_IMAGE:figures/full_fig_p005_4.png] view at source ↗

read the original abstract

Long video understanding is heavily bottlenecked by a rigid one-shot paradigm: existing methods either densely encode videos at prohibitive memory and latency costs, or aggressively compress them into sparse frame sets that irreversibly discard fine-grained evidence needed for downstream reasoning. Consequently, current models struggle to simultaneously balance temporal coverage, visual details, and computational efficiency. We propose AdaFocus, an efficient framework that rethinks long-video understanding as progressive evidence acquisition rather than one-pass encoding. AdaFocus relies on two tightly coupled components. First, a Query-Aware Adaptive Relevance-Diversity sampler (AdaRD) produces a compact yet informative video preview, adaptively switching to global clustering when the query lacks reliable local grounding. Second, instead of caching exhaustive frame sequences in memory, AdaFocus introduces an uncertainty-triggered refinement mechanism. It performs targeted look-back only when the model is not confident, retrieving high-resolution evidence directly from disk via a zero-cache I/O design. This turns discarded visual details from an irreversible loss into on-demand recoverable evidence without paying the cost of exhaustive preloading. Experiments on seven standard long-video benchmarks show that AdaFocus delivers a substantially better efficiency-accuracy trade-off than strong baselines. Compared with conventional dense encoding, AdaFocus achieves improved task performance (e.g., +2.59 accuracy on VideoMME, +8.39 mIoU on Charades-STA over single-pass inference) while reducing visual token consumption by ~33x and eliminating the need for in-memory frame pre-caching through its zero-cache disk retrieval design. These findings suggest that progressive preview combined with zero-cache evidence refinement is a highly effective paradigm for scalable multimedia reasoning.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

AdaFocus gives a workable efficiency boost for long video models via adaptive sampling and disk look-backs, but the uncertainty trigger lacks the checks needed to trust the accuracy gains.

read the letter

Hi, the main thing with AdaFocus is its shift to progressive evidence acquisition: a query-aware relevance-diversity sampler that falls back to global clustering, plus an uncertainty-triggered pull of high-res frames straight from disk instead of pre-caching everything. This zero-cache setup is the concrete new piece, and it lines up with the reported 33x token cut while claiming better numbers than dense baselines on VideoMME and Charades-STA. The paper does well laying out how the sampler adapts to different queries and showing those efficiency-accuracy numbers across seven benchmarks. It feels like a useful incremental step for people who need to run long-video models without blowing out memory. The soft spot is the uncertainty mechanism itself. If the initial low-cost preview misses subtle cues, the look-back never fires and the claimed gains (+2.59 accuracy, +8.39 mIoU) would not appear, yet the abstract gives no false-negative rates, threshold tests, or ablations on that detector. Without those, the central trade-off claim rests on an unverified assumption. This is aimed at CV researchers and engineers building deployable long-video systems who care about token budgets and I/O. A reader working on sampling strategies or efficient inference would get practical value from the numbers even if the method is not foundational. It deserves peer review so the experiments and the refinement logic can be checked directly.

Referee Report

2 major / 1 minor

Summary. The paper presents AdaFocus, a framework for long video understanding that uses a Query-Aware Adaptive Relevance-Diversity (AdaRD) sampler to create a compact preview and an uncertainty-triggered refinement mechanism with zero-cache disk retrieval to fetch high-resolution evidence on demand. It claims improved performance over baselines on seven benchmarks, including +2.59 accuracy on VideoMME and +8.39 mIoU on Charades-STA, while reducing visual tokens by ~33x and avoiding in-memory caching.

Significance. If the results hold, AdaFocus represents a significant advance in efficient video processing by shifting from one-shot dense encoding to progressive, query-adaptive evidence acquisition. The zero-cache I/O design could enable memory-efficient handling of very long videos, addressing key bottlenecks in current models. The adaptive sampling and on-demand refinement offer a promising direction for scalable multimedia reasoning.

major comments (2)

[Abstract] Abstract: The reported performance gains (+2.59 accuracy on VideoMME, +8.39 mIoU on Charades-STA) are presented without error bars, multiple-run statistics, or detailed baseline implementation descriptions, which weakens the ability to assess the reliability of the efficiency-accuracy trade-off claim.
[Methods (uncertainty-triggered refinement)] Methods (uncertainty-triggered refinement): No ablation studies or analysis are provided on the false-negative rate of the uncertainty detector or its sensitivity to the confidence threshold. This is load-bearing for the central claim, as failure to detect needed refinements would mean critical details are missed despite the zero-cache design.

minor comments (1)

[Abstract] Abstract: The term 'zero-cache I/O design' is introduced without a brief definition or reference to the specific implementation details in the methods section.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback highlighting the need for stronger statistical reporting and analysis of the uncertainty mechanism. We address each major comment below and will revise the manuscript to incorporate the suggested improvements.

read point-by-point responses

Referee: [Abstract] Abstract: The reported performance gains (+2.59 accuracy on VideoMME, +8.39 mIoU on Charades-STA) are presented without error bars, multiple-run statistics, or detailed baseline implementation descriptions, which weakens the ability to assess the reliability of the efficiency-accuracy trade-off claim.

Authors: We agree that the abstract would benefit from explicit error bars and a note on multi-run statistics to better support the reliability of the reported gains. The full experimental section already presents results averaged over three independent runs with standard deviations in the main tables, but these details were condensed for the abstract. We will revise the abstract to include approximate error ranges (e.g., +2.59 ± 0.4 accuracy) and add a sentence referencing the baseline re-implementations. Detailed baseline code, hyperparameters, and implementation notes are provided in the supplementary material; we will add an explicit cross-reference in the main text. revision: yes
Referee: [Methods (uncertainty-triggered refinement)] Methods (uncertainty-triggered refinement): No ablation studies or analysis are provided on the false-negative rate of the uncertainty detector or its sensitivity to the confidence threshold. This is load-bearing for the central claim, as failure to detect needed refinements would mean critical details are missed despite the zero-cache design.

Authors: We acknowledge that an ablation on the uncertainty detector's false-negative rate and threshold sensitivity is essential to substantiate the refinement mechanism. We will add a dedicated ablation subsection that varies the confidence threshold across a range (e.g., 0.5–0.9) and reports the corresponding false-negative rate (estimated via manual inspection on a sampled subset of videos), along with effects on overall accuracy and refinement frequency. This analysis will be included in the revised Methods and Experiments sections. revision: yes

Circularity Check

0 steps flagged

No circularity: claims rest on empirical benchmarks without derivations or self-referential reductions

full rationale

The paper introduces AdaFocus as a framework with two components (Query-Aware Adaptive Relevance-Diversity sampler and uncertainty-triggered refinement with zero-cache disk retrieval) and supports its efficiency-accuracy claims solely through experimental comparisons on seven long-video benchmarks (e.g., +2.59 accuracy on VideoMME, +8.39 mIoU on Charades-STA, ~33x token reduction). No equations, mathematical derivations, fitted parameters presented as predictions, or self-citation chains appear in the text. The method is described procedurally as a rethinking of one-shot encoding into progressive acquisition, with results validated against baselines rather than reducing to inputs by construction. This is the standard non-circular case for an applied systems paper.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 2 invented entities

The framework rests on the domain assumption that an initial low-cost preview plus selective refinement can recover necessary evidence; no free parameters or invented physical entities are named in the abstract.

axioms (2)

domain assumption Query-aware adaptive relevance-diversity sampling produces a sufficiently informative preview for downstream reasoning
Invoked as the basis for the AdaRD component and the decision to switch to global clustering.
domain assumption Uncertainty can be measured reliably enough to trigger targeted high-resolution retrieval without missing critical details
Central to the zero-cache refinement mechanism.

invented entities (2)

AdaRD sampler no independent evidence
purpose: Produces compact yet informative video preview by balancing relevance and diversity
New component introduced to replace rigid one-shot encoding
zero-cache I/O design no independent evidence
purpose: Enables on-demand disk retrieval of high-resolution frames without in-memory caching
New mechanism to turn discarded details into recoverable evidence

pith-pipeline@v0.9.0 · 5613 in / 1521 out tokens · 31897 ms · 2026-05-14T19:39:53.060195+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

33 extracted references · 12 canonical work pages · 8 internal anchors

[1]

Lisa Anne Hendricks, Oliver Wang, Eli Shechtman, Josef Sivic, Trevor Darrell, and Bryan Russell. 2017. Localizing moments in video with natural language. In Proceedings of the IEEE international conference on computer vision. 5803–5812

2017
[2]

Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, et al . 2025. Qwen3-vl technical report.arXiv preprint arXiv:2511.21631(2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025
[3]

Zesen Cheng, Sicong Leng, Hang Zhang, Yifei Xin, Xin Li, Guanzheng Chen, Yongxin Zhu, Wenqi Zhang, Ziyang Luo, Deli Zhao, et al. 2024. Videollama 2: Advancing spatial-temporal modeling and audio understanding in video-llms. arXiv preprint arXiv:2406.07476(2024)

work page internal anchor Pith review Pith/arXiv arXiv 2024
[4]

Tri Dao, Dan Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. 2022. Flashat- tention: Fast and memory-efficient exact attention with io-awareness.Advances in neural information processing systems35 (2022), 16344–16359

2022
[5]

Kaituo Feng, Kaixiong Gong, Bohao Li, Zonghao Guo, Yibing Wang, Tianshuo Peng, Junfei Wu, Xiaoying Zhang, Benyou Wang, and Xiangyu Yue. 2025. Video- r1: Reinforcing video reasoning in mllms.arXiv preprint arXiv:2503.21776(2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025
[6]

Chaoyou Fu, Yuhan Dai, Yongdong Luo, Lei Li, Shuhuai Ren, Renrui Zhang, Zihan Wang, Chenyu Zhou, Yunhang Shen, Mengdan Zhang, et al. 2025. Video-mme: The first-ever comprehensive evaluation benchmark of multi-modal llms in video analysis. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition. 24108–24118

2025
[7]

Jiyang Gao, Chen Sun, Zhenheng Yang, and Ram Nevatia. 2017. Tall: Temporal activity localization via language query. InProceedings of the IEEE international conference on computer vision. 5267–5275

2017
[8]

Kairui Hu, Penghao Wu, Fanyi Pu, Wang Xiao, Yuanhan Zhang, Xiang Yue, Bo Li, and Ziwei Liu. 2025. Video-mmmu: Evaluating knowledge acquisition from multi-discipline professional videos.arXiv preprint arXiv:2501.13826(2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025
[9]

Peng Jin, Ryuichi Takanobu, Wancai Zhang, Xiaochun Cao, and Li Yuan. 2024. Chat-univi: Unified visual representation empowers large language models with image and video understanding. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 13700–13710

2024
[10]

Jie Lei, Linjie Li, Luowei Zhou, Zhe Gan, Tamara L Berg, Mohit Bansal, and Jingjing Liu. 2021. Less is more: Clipbert for video-and-language learning via sparse sampling. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition. 7331–7341

2021
[11]

Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Peiyuan Zhang, Yanwei Li, Ziwei Liu, et al. 2024. Llava-onevision: Easy visual task transfer.arXiv preprint arXiv:2408.03326(2024)

work page internal anchor Pith review Pith/arXiv arXiv 2024
[12]

Kunchang Li, Yali Wang, Yinan He, Yizhuo Li, Yi Wang, Yi Liu, Zun Wang, Jilan Xu, Guo Chen, Ping Luo, et al. 2024. Mvbench: A comprehensive multi-modal video understanding benchmark. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 22195–22206

2024
[13]

Xinhao Li, Ziang Yan, Desen Meng, Lu Dong, Xiangyu Zeng, Yinan He, Yali Wang, Yu Qiao, Yi Wang, and Limin Wang. 2025. Videochat-r1: Enhanc- ing spatio-temporal perception via reinforcement fine-tuning.arXiv preprint arXiv:2504.06958(2025)

work page arXiv 2025
[14]

Bin Lin, Yang Ye, Bin Zhu, Jiaxi Cui, Munan Ning, Peng Jin, and Li Yuan. 2024. Video-llava: Learning united visual representation by alignment before projection. InProceedings of the 2024 conference on empirical methods in natural language processing. 5971–5984

2024
[15]

Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. 2023. Visual in- struction tuning.Advances in neural information processing systems36 (2023), 34892–34916

2023
[16]

Shuming Liu, Mingchen Zhuge, Changsheng Zhao, Jun Chen, Lemeng Wu, Zechun Liu, Chenchen Zhu, Zhipeng Cai, Chong Zhou, Haozhe Liu, et al. 2026. VideoAuto-R1: Video Auto Reasoning via Thinking Once, Answering Twice. arXiv preprint arXiv:2601.05175(2026)

work page arXiv 2026
[17]

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. 2021. Learning transferable visual models from natural language supervision. InInternational conference on machine learning. PmLR, 8748–8763

2021
[18]

Shuhuai Ren, Linli Yao, Shicheng Li, Xu Sun, and Lu Hou. 2024. Timechat: A time-sensitive multimodal large language model for long video understanding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 14313–14323

2024
[19]

Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. 2024. Deepseekmath: Pushing the limits of mathematical reasoning in open language models.arXiv preprint arXiv:2402.03300(2024)

work page internal anchor Pith review Pith/arXiv arXiv 2024
[20]

Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, et al. 2024. Qwen2-vl: Enhancing vision-language model’s perception of the world at any resolution.arXiv preprint arXiv:2409.12191(2024)

work page internal anchor Pith review Pith/arXiv arXiv 2024
[21]

Xiaohan Wang, Yuhui Zhang, Orr Zohar, and Serena Yeung-Levy. 2024. Videoa- gent: Long-form video understanding with large language model as agent. In European Conference on Computer Vision. Springer, 58–76

2024
[22]

Y Wang, K Li, X Li, J Yu, Y He, G Chen, B Pei, R Zheng, J Xu, Z Wang, et al. 2024. In- ternvideo2: Scaling video foundation models for multimodal video understanding. arXiv 2024.arXiv preprint arXiv:2403.153772 (2024)

work page arXiv 2024
[23]

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. 2022. Chain-of-thought prompting elicits reasoning in large language models.Advances in neural information processing systems35 (2022), 24824–24837

2022
[24]

Yuetian Weng, Mingfei Han, Haoyu He, Xiaojun Chang, and Bohan Zhuang
[25]

In European Conference on Computer Vision

Longvlm: Efficient long video understanding via large language models. In European Conference on Computer Vision. Springer, 453–470
[26]

Haoning Wu, Dongxu Li, Bei Chen, and Junnan Li. [n. d.]. Longvideobench: A benchmark for long-context interleaved video-language understanding, 2024. URL https://arxiv. org/abs/240715754, 8 ([n. d.])

work page arXiv 2024
[27]

Zuxuan Wu, Caiming Xiong, Chih-Yao Ma, Richard Socher, and Larry S Davis
[28]

InPro- ceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

Adaframe: Adaptive frame selection for fast video recognition. InPro- ceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 1278–1287
[29]

Guangxuan Xiao, Yuandong Tian, Beidi Chen, Song Han, and Mike Lewis
[30]

Efficient streaming language models with attention sinks.arXiv preprint arXiv:2309.17453(2023)

work page internal anchor Pith review Pith/arXiv arXiv 2023
[31]

Junbin Xiao, Angela Yao, Yicong Li, and Tat-Seng Chua. 2024. Can i trust your an- swer? visually grounded video question answering. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 13204–13214

2024
[32]

Shoubin Yu, Jaemin Cho, Prateek Yadav, and Mohit Bansal. 2023. Self-chained image-language model for video localization and question answering.Advances in Neural Information Processing Systems36 (2023), 76749–76771

2023
[33]

Yilun Zhao, Haowei Zhang, Lujing Xie, Tongyan Hu, Guo Gan, Yitao Long, Zhiyuan Hu, Weiyuan Chen, Chuhan Li, Zhijian Xu, et al. 2025. Mmvu: Measuring expert-level multi-discipline video understanding. InProceedings of the Computer 8 AdaFocus: Adaptive Relevance-Diversity Sampling with Zero-Cache Look-back for Long Video Understanding , , Vision and Pattern...

2025