Adaptive Greedy Frame Selection for Long Video Understanding
Pith reviewed 2026-05-15 08:17 UTC · model grok-4.3
The pith
A submodular greedy selector that balances question relevance and video coverage picks frames that raise accuracy on long-video question answering.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper makes two linked claims. First, a 1-FPS candidate pool capped at 1000 frames, embedded once with SigLIP for question relevance and once with DINOv2 for semantic similarity, can be filtered by a greedy algorithm that maximizes a normalized monotone submodular function consisting of a modular relevance term plus a facility-location coverage term. Second, routing each query to one of four preset weightings via a text-only classifier yields higher answer accuracy on the MLVU benchmark than uniform sampling or prior relevance-driven baselines, with the largest margins appearing under small frame budgets.
What carries the argument
A normalized, monotone, submodular objective that adds a modular relevance score from SigLIP embeddings to a facility-location coverage score from DINOv2 embeddings. Because the objective has these properties, the standard greedy algorithm applies and carries its (1-1/e) approximation guarantee.
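This page does not reproduce the paper's equation, so the following is only a sketch of one standard form such an objective can take. Every symbol is an assumption introduced here for illustration: V is the candidate pool, S the selected subset, k the frame budget, r_i a nonnegative SigLIP-based relevance score for frame i, s_ij a nonnegative DINOv2-based similarity between frames i and j, and lambda the relevance-coverage weight. The paper's actual normalization may differ.

```latex
% Assumed form, not quoted from the paper. Convention: \max over the empty set is 0.
F_\lambda(S) \;=\;
  \lambda \,\frac{\sum_{i \in S} r_i}{\sum_{i \in V} r_i}
  \;+\; (1-\lambda)\,
  \frac{\sum_{j \in V} \max_{i \in S} s_{ij}}{\sum_{j \in V} \max_{i \in V} s_{ij}},
  \qquad S \subseteq V,\ |S| \le k,\ \lambda \in [0,1].

% Normalized:  F_\lambda(\emptyset) = 0.
% Monotone:    r_i \ge 0 and s_{ij} \ge 0, so adding a frame never lowers either term.
% Submodular:  the first term is modular; the second is a facility-location function;
%              a nonnegative combination of submodular functions is submodular.
% Greedy guarantee (Nemhauser, Wolsey, Fisher, 1978):
F_\lambda(S_{\mathrm{greedy}}) \;\ge\; \Bigl(1 - \tfrac{1}{e}\Bigr)\,
  \max_{|S| \le k} F_\lambda(S).
```

In this form both terms lie in [0, 1], so lambda directly sets the trade-off: lambda = 1 reduces to pure top-k relevance, and lambda = 0 to pure coverage.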
If this is right
- Accuracy improves consistently over uniform sampling and a strong recent baseline on MLVU across multiple frame budgets.
- Gains are largest when the allowed number of frames is small.
- A text-only classifier can route each query to an appropriate preset weighting without additional training.
- The submodular property supplies a (1-1/e) guarantee that the greedy choice is near-optimal for the chosen objective.
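The greedy step behind these points can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the paper's implementation: the function name `greedy_select`, the argument names, and the particular normalization are hypothetical, and both inputs are assumed to be nonnegative (for example, cosine similarities clipped at zero).

```python
import numpy as np

def greedy_select(relevance, similarity, budget, weight=0.5):
    """Greedily pick `budget` frames for an assumed relevance-plus-coverage objective.

    relevance:  shape (n,), nonnegative per-frame relevance scores (assumption).
    similarity: shape (n, n), nonnegative; similarity[i, j] says how well
                frame i represents frame j (assumption).
    weight:     hypothetical trade-off in [0, 1]; 1.0 means relevance only.
    """
    relevance = np.asarray(relevance, dtype=float)
    similarity = np.asarray(similarity, dtype=float)
    n = relevance.shape[0]

    # Scale each term so that selecting every frame would score 1.
    rel = relevance / max(relevance.sum(), 1e-12)
    cov_total = max(similarity.max(axis=0).sum(), 1e-12)

    cover = np.zeros(n)               # best similarity to each frame j so far
    chosen = np.zeros(n, dtype=bool)  # mask of already selected frames
    selected = []
    for _ in range(min(budget, n)):
        # Marginal coverage gain of each candidate i: sum_j max(s_ij - cover_j, 0).
        cov_gain = np.maximum(similarity - cover[None, :], 0.0).sum(axis=1) / cov_total
        gain = weight * rel + (1.0 - weight) * cov_gain
        gain[chosen] = -np.inf        # never pick the same frame twice
        best = int(np.argmax(gain))
        selected.append(best)
        chosen[best] = True
        cover = np.maximum(cover, similarity[best])
    return selected
```

On a toy pool where frames 0 and 1 are near-duplicates with high relevance, a pure-relevance weight keeps both duplicates, while a coverage-leaning weight replaces the second one with a visually distinct frame. That is the collapse-onto-duplicates behavior the abstract attributes to purely relevance-driven selection.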
Where Pith is reading between the lines
- The same pool-and-greedy structure could be applied to streaming video by maintaining a rolling candidate window.
- Audio or motion embeddings could be added to the coverage term to handle questions that depend on sound or action rather than appearance.
- The method might reduce overall token usage in deployed VLM systems without retraining the underlying model.
Load-bearing premise
The SigLIP and DINOv2 embeddings must reliably encode question relevance and semantic representativeness across the range of question types that appear in long videos, and the four preset strategies must be sufficient to handle the needed relevance-coverage trade-offs without per-query retuning.
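To make the routing premise concrete, here is a toy sketch of text-only routing to four presets. Everything in it is hypothetical: the preset names, the weights, and the keyword rules are placeholders invented for illustration, and a simple keyword matcher stands in for the paper's classifier, whose design is not described on this page.

```python
# Hypothetical presets: name -> relevance weight (coverage gets 1 - weight).
# These values are illustrative only and are not taken from the paper.
PRESETS = {
    "relevance_heavy": 0.8,
    "balanced": 0.5,
    "coverage_heavy": 0.2,
    "coverage_only": 0.0,
}

def route_question(question: str) -> str:
    """Map a question to a preset name using only its text (toy keyword rules)."""
    q = question.lower()
    # Ordering questions need evidence spread across the video.
    if any(k in q for k in ("order", "sequence", "before", "after")):
        return "coverage_heavy"
    # Counting questions need both the target event and its repetitions.
    if any(k in q for k in ("how many", "count", "number of times")):
        return "balanced"
    # Holistic questions depend on the video as a whole.
    if any(k in q for k in ("main topic", "summar", "overall")):
        return "coverage_only"
    # Default: a localized detail, so favor frames matching the question.
    return "relevance_heavy"

def preset_weight(question: str) -> float:
    """Return the hypothetical relevance weight for a question."""
    return PRESETS[route_question(question)]
```

The premise is at risk exactly where a sketch like this breaks down: a question that fits none of the four buckets falls through to a default weighting that may be wrong for it.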
What would settle it
A new long-video QA dataset whose questions fall outside the four preset categories, or whose visual content is poorly separated by the chosen embeddings, on which the greedy selections yield no accuracy gain, or a loss, relative to uniform sampling at the same frame budget.
Original abstract
Large vision-language models (VLMs) are increasingly applied to long-video question answering, yet inference is often bottlenecked by the number of input frames and resulting visual tokens. Naive sparse sampling can miss decisive moments, while purely relevance-driven selection frequently collapses onto near-duplicate frames and sacrifices coverage of temporally distant evidence. We propose a question-adaptive greedy frame selection method that jointly optimizes query relevance and semantic representativeness under a fixed frame budget. Our approach constructs a 1 FPS candidate pool (capped at 1000) with exact timestamp alignment, embeds candidates in two complementary spaces (SigLIP for question relevance and DINOv2 for semantic similarity), and selects frames by greedily maximizing a weighted sum of a modular relevance term and a facility-location coverage term. This objective is normalized, monotone, and submodular, yielding a standard (1-1/e) greedy approximation guarantee. To account for question-dependent trade-offs between relevance and coverage, we introduce four preset strategies and a lightweight text-only question-type classifier that routes each query to its best-performing preset. Experiments on MLVU show consistent accuracy gains over uniform sampling and a strong recent baseline across frame budgets, with the largest improvements under tight budgets.
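The candidate-pool step in the abstract can be illustrated with a short sketch. The abstract does not say how the cap of 1000 is enforced, so the uniform thinning below is an assumption, as are the function name `candidate_timestamps` and its parameters.

```python
def candidate_timestamps(duration_s, fps=1.0, max_frames=1000):
    """Return candidate frame timestamps in seconds.

    Samples the video on a regular grid at `fps`. If that yields more than
    `max_frames` candidates, the grid is thinned uniformly (an assumed
    strategy). Each kept timestamp is still an exact grid point, so it maps
    back to a precise position in the video.
    """
    if duration_s <= 0 or fps <= 0:
        return []
    n = max(int(duration_s * fps), 1)  # grid points at 0, 1/fps, 2/fps, ...
    if n <= max_frames:
        indices = range(n)
    else:
        # Integer arithmetic keeps the kept indices strictly increasing.
        indices = [(i * n) // max_frames for i in range(max_frames)]
    return [i / fps for i in indices]
```

For a 90-second clip this returns 90 timestamps one second apart, and for a two-hour video it returns exactly 1000 timestamps spread over the full duration.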
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes an adaptive greedy frame selection method for long-video VLM question answering. It constructs a 1 FPS candidate pool (capped at 1000), embeds frames via SigLIP (for modular relevance) and DINOv2 (for facility-location coverage), and greedily maximizes a normalized weighted sum objective that is monotone and submodular, yielding the standard (1-1/e) approximation guarantee. A lightweight text-only classifier routes each query to one of four preset relevance-coverage weightings. Experiments on MLVU report consistent accuracy gains over uniform sampling and a recent baseline, largest under tight frame budgets.
Significance. If the gains are robust, the work offers a practical, theoretically grounded solution to the frame-budget bottleneck in long-video VLMs by combining complementary embeddings with submodular optimization and lightweight adaptation. The explicit use of the greedy guarantee and the separation of the classifier from the visual embeddings are notable strengths that could influence efficient inference pipelines.
major comments (3)
- [Abstract / Method] Abstract and Method section: the claim that the objective is 'normalized, monotone, and submodular' is asserted without an explicit equation for the weighted sum or a short derivation/reference showing why the facility-location term preserves submodularity under the chosen normalization; this is load-bearing for the (1-1/e) guarantee.
- [Experiments] Experiments section: MLVU accuracy improvements are summarized at high level only, with no error bars, statistical tests, or per-budget ablation tables; without these it is impossible to determine whether the reported gains are statistically reliable or driven primarily by the adaptive classifier versus the base greedy selection.
- [Method / Experiments] Method / Experiments: the central assumption that SigLIP cosine similarity and DINOv2 features produce scores aligned with answer utility across diverse question types (fine-grained actions, temporal ordering, rare objects) is not supported by qualitative frame-selection examples or failure-case analysis; if this alignment fails, the greedy selections remain suboptimal despite the theoretical guarantee.
minor comments (2)
- [Abstract] Abstract: the phrase 'exact timestamp alignment' for the candidate pool should be expanded with a brief description of how frame indices map to video time in the full method.
- [Experiments] Overall: a small table showing accuracy for each of the four preset strategies (with and without the classifier) would strengthen the justification for the adaptive routing component.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address each major comment below and will revise the manuscript accordingly to improve rigor and clarity.
Point-by-point responses
- Referee: [Abstract / Method] Abstract and Method section: the claim that the objective is 'normalized, monotone, and submodular' is asserted without an explicit equation for the weighted sum or a short derivation/reference showing why the facility-location term preserves submodularity under the chosen normalization; this is load-bearing for the (1-1/e) guarantee.
  Authors: We agree that an explicit formulation is essential. In the revised manuscript we will add the precise objective equation in the Method section (a normalized weighted sum of the modular SigLIP relevance term and the DINOv2 facility-location coverage term) together with a short derivation or standard reference establishing that the coverage term is monotone submodular and that the chosen normalization preserves these properties for the combined objective. This will directly support the (1-1/e) guarantee.
  Revision: yes
- Referee: [Experiments] Experiments section: MLVU accuracy improvements are summarized at high level only, with no error bars, statistical tests, or per-budget ablation tables; without these it is impossible to determine whether the reported gains are statistically reliable or driven primarily by the adaptive classifier versus the base greedy selection.
  Authors: We acknowledge the current lack of detailed statistics. The revision will include error bars from multiple random seeds, paired statistical significance tests across methods, and expanded per-budget ablation tables that separately report the base greedy selector and the full adaptive-classifier version. These additions will allow readers to assess both reliability and the relative contribution of each component.
  Revision: yes
- Referee: [Method / Experiments] Method / Experiments: the central assumption that SigLIP cosine similarity and DINOv2 features produce scores aligned with answer utility across diverse question types (fine-grained actions, temporal ordering, rare objects) is not supported by qualitative frame-selection examples or failure-case analysis; if this alignment fails, the greedy selections remain suboptimal despite the theoretical guarantee.
  Authors: We agree that qualitative evidence would strengthen the empirical grounding. The revised version will add representative frame-selection visualizations for different MLVU question categories and a short failure-case analysis section that discusses instances where the selected frames align with or deviate from answer utility. This will provide concrete support for the practical utility of the embeddings.
  Revision: yes
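The second response promises paired significance tests. One common way to run such a comparison is a paired bootstrap over questions, sketched below. This is an illustrative choice and not necessarily the test the authors will use; the function name, arguments, and defaults are hypothetical.

```python
import numpy as np

def paired_bootstrap(correct_a, correct_b, n_resamples=2000, seed=0):
    """Paired bootstrap for the accuracy difference between two methods.

    correct_a, correct_b: per-question 0/1 correctness for methods A and B,
    aligned so that index q refers to the same question in both.
    Returns (mean_diff, ci_low, ci_high), with a 95% percentile interval
    on accuracy(A) - accuracy(B).
    """
    a = np.asarray(correct_a, dtype=float)
    b = np.asarray(correct_b, dtype=float)
    if a.shape != b.shape or a.ndim != 1 or a.size == 0:
        raise ValueError("inputs must be non-empty 1-D arrays of equal length")

    rng = np.random.default_rng(seed)
    n = a.size
    # Resample question indices with replacement. Using the same indices
    # for both methods is what keeps the comparison paired.
    idx = rng.integers(0, n, size=(n_resamples, n))
    diffs = (a[idx] - b[idx]).mean(axis=1)
    ci_low, ci_high = np.percentile(diffs, [2.5, 97.5])
    return float((a - b).mean()), float(ci_low), float(ci_high)
```

An interval that excludes zero at a given frame budget suggests the gain is not a resampling artifact. Running the comparison separately for the base greedy selector and for the full adaptive version would address the referee's question about which component drives the improvement.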
Circularity Check
No circularity: standard greedy algorithm on explicitly submodular objective with independent pre-trained components
Full rationale
The derivation chain consists of constructing a candidate pool, embedding in fixed pre-trained spaces (SigLIP, DINOv2), defining a normalized monotone submodular objective, and applying the known greedy algorithm whose (1-1/e) guarantee is external. The four presets and text-only classifier are trained separately on question text and do not reduce to fitted parameters from the reported video results. No self-definitional equations, no predictions that are inputs by construction, and no load-bearing self-citations appear in the provided derivation. The central claim remains independent of the experimental outcomes.
Axiom & Free-Parameter Ledger
free parameters (1)
- relevance-coverage weight
axioms (2)
- [standard math] The combined relevance-plus-coverage objective is monotone and submodular.
- [domain assumption] SigLIP and DINOv2 embeddings provide faithful measures of question relevance and semantic diversity.
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/ArithmeticFromLogic.lean: reality_from_one_distinction (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: "This objective is normalized, monotone, and submodular, yielding a standard (1-1/e) greedy approximation guarantee."
- IndisputableMonolith/Cost/FunctionalEquation.lean: washburn_uniqueness_aczel (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: "We propose a question-adaptive greedy frame selection method that jointly optimizes query relevance and semantic representativeness"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.