arXiv: 2603.20180 · v2 · submitted 2026-03-20 · cs.CV · cs.AI · cs.CL

Adaptive Greedy Frame Selection for Long Video Understanding

Pith reviewed 2026-05-15 08:17 UTC · model grok-4.3

classification: cs.CV, cs.AI, cs.CL
keywords: frame selection, long video understanding, greedy algorithm, submodular optimization, vision-language models, video question answering, MLVU benchmark

The pith

A submodular greedy selector that balances question relevance and video coverage picks frames that raise accuracy on long-video question answering.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Vision-language models for long-video question answering face strict limits on the number of input frames they can process. Uniform sampling often skips key moments while relevance-only selection tends to pick near-duplicates and lose broad coverage. The paper demonstrates that building a one-frame-per-second candidate pool, embedding frames in two spaces, and greedily maximizing a weighted combination of relevance and facility-location coverage produces better selections. The objective is proven monotone and submodular, so the greedy step comes with a standard approximation guarantee. A lightweight classifier routes each query to one of four preset weightings, and experiments on MLVU show gains over baselines that grow larger when the frame budget is tight.
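The candidate-pool step can be sketched as follows. This is a minimal illustration, not the paper's code: the function name `candidate_timestamps` is hypothetical, and the rule used when a video exceeds the cap (evenly spaced subsampling) is an assumption, since this page does not say how the cap is enforced.

```python
def candidate_timestamps(duration_s: float, fps: float = 1.0, cap: int = 1000) -> list[float]:
    """Return timestamps in seconds for a fixed-rate candidate pool with a size cap."""
    if duration_s <= 0:
        return []
    # One candidate every 1/fps seconds, starting at t = 0.
    n = max(int(duration_s * fps), 1)
    times = [i / fps for i in range(n)]
    if n <= cap:
        return times
    # Assumption: over the cap, keep `cap` evenly spaced candidates.
    step = n / cap
    return [times[int(i * step)] for i in range(cap)]
```

Keeping the timestamp alongside each candidate is what lets the selected frames be handed to the model in temporal order later on.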

Core claim

The paper establishes that a 1-FPS candidate pool capped at 1000 frames, embedded once with SigLIP for question relevance and once with DINOv2 for semantic similarity, can be filtered by a greedy algorithm that maximizes a normalized monotone submodular function: a modular relevance term plus a facility-location coverage term. It also reports that routing each query to one of four preset weightings via a text-only classifier yields higher answer accuracy on the MLVU benchmark than uniform sampling or prior relevance-driven baselines, with the largest margins appearing under small frame budgets.
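A minimal sketch of greedy selection under a relevance-plus-coverage objective, assuming precomputed scores. The function name `greedy_select`, the single weight `lam`, and the toy numbers are illustrative assumptions, not values or code from the paper.

```python
import numpy as np

def greedy_select(relevance, similarity, budget, lam=0.5):
    """Greedily maximize
    lam * sum_{i in S} relevance[i] + (1 - lam) * sum_j max_{i in S} similarity[j, i].
    Assumes non-negative scores. Returns selected indices in temporal order."""
    relevance = np.asarray(relevance, dtype=float)
    similarity = np.asarray(similarity, dtype=float)
    n = relevance.shape[0]
    best_cover = np.zeros(n)            # best similarity of each candidate to the selected set
    chosen = np.zeros(n, dtype=bool)
    for _ in range(min(budget, n)):
        # Coverage gain of candidate i: how much it raises each row's best similarity.
        cover_gain = np.maximum(similarity - best_cover[:, None], 0.0).sum(axis=0)
        gains = lam * relevance + (1.0 - lam) * cover_gain
        gains[chosen] = -np.inf         # never pick the same frame twice
        i = int(np.argmax(gains))
        chosen[i] = True
        best_cover = np.maximum(best_cover, similarity[:, i])
    return np.flatnonzero(chosen).tolist()

# Toy pool: frames 0 and 1 are near-duplicates with high relevance.
rel = [0.9, 0.85, 0.1, 0.2]
sim = [[1.0, 0.95, 0.1, 0.1],
       [0.95, 1.0, 0.1, 0.1],
       [0.1, 0.1, 1.0, 0.2],
       [0.1, 0.1, 0.2, 1.0]]
print(greedy_select(rel, sim, budget=2, lam=1.0))  # relevance only: [0, 1]
print(greedy_select(rel, sim, budget=2, lam=0.5))  # with coverage: [0, 3]
```

In the toy pool, relevance alone picks the two near-duplicates, while adding the coverage term replaces the second duplicate with a frame from a different part of the video.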

What carries the argument

A normalized monotone submodular objective that adds a modular relevance score from SigLIP embeddings to a facility-location coverage score from DINOv2 embeddings, so that the standard greedy algorithm applies with its (1-1/e) approximation guarantee.
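This page does not reproduce the paper's equation, so the following is only one form consistent with the description. All symbols are assumptions introduced here: $V$ is the candidate pool, $r_i \ge 0$ the SigLIP-based relevance of frame $i$, $s_{ij} \ge 0$ the DINOv2-based similarity between frames $i$ and $j$, $\lambda \in [0, 1]$ the preset weight, and $k$ the frame budget.

```latex
F_\lambda(S) = \lambda \sum_{i \in S} r_i + (1 - \lambda) \sum_{j \in V} \max_{i \in S} s_{ij},
\qquad F_\lambda(\emptyset) = 0,
\qquad S^\star \in \arg\max_{S \subseteq V,\ |S| \le k} F_\lambda(S).
```

The first sum is modular (each frame contributes independently) and the second is the facility-location term (each candidate is credited with its best match in the selection). "Normalized" is read here in the usual sense $F_\lambda(\emptyset) = 0$; the paper may additionally rescale the two score types.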

If this is right

  • Accuracy improves consistently over uniform sampling and a strong recent baseline on MLVU across multiple frame budgets.
  • Gains are largest when the allowed number of frames is small.
  • A text-only classifier can route each query to an appropriate preset weighting without additional training.
  • The submodular property supplies a (1-1/e) guarantee that the greedy choice is near-optimal for the chosen objective.
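The guarantee in the last bullet can be checked by brute force on a small instance. The sketch below assumes the standard weighted form of the objective (modular relevance plus facility-location coverage with non-negative scores) and uses random toy data, not anything from the paper.

```python
import itertools
import math
import numpy as np

def objective(S, rel, sim, lam):
    """lam * sum of relevance over S + (1 - lam) * facility-location coverage of S."""
    if not S:
        return 0.0
    idx = list(S)
    return lam * rel[idx].sum() + (1 - lam) * sim[:, idx].max(axis=1).sum()

def greedy(rel, sim, k, lam):
    """Add, k times, the element that yields the largest objective value."""
    S = []
    for _ in range(k):
        rest = [i for i in range(len(rel)) if i not in S]
        S.append(max(rest, key=lambda i: objective(S + [i], rel, sim, lam)))
    return S

rng = np.random.default_rng(0)
n, k, lam = 8, 3, 0.5
rel = rng.random(n)                                  # toy relevance scores in [0, 1)
emb = rng.random((n, 4))                             # non-negative toy embeddings
emb /= np.linalg.norm(emb, axis=1, keepdims=True)
sim = emb @ emb.T                                    # cosine similarities, all >= 0 here

greedy_value = objective(greedy(rel, sim, k, lam), rel, sim, lam)
optimal_value = max(objective(c, rel, sim, lam)
                    for c in itertools.combinations(range(n), k))
ratio = greedy_value / optimal_value                 # theory: at least 1 - 1/e
print(f"greedy/optimal = {ratio:.3f}, bound = {1 - 1 / math.e:.3f}")
```

The bound is a worst case over all monotone submodular objectives. It says nothing about whether the objective itself tracks answer accuracy, which is an empirical question.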

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same pool-and-greedy structure could be applied to streaming video by maintaining a rolling candidate window.
  • Audio or motion embeddings could be added to the coverage term to handle questions that depend on sound or action rather than appearance.
  • The method might reduce overall token usage in deployed VLM systems without retraining the underlying model.

Load-bearing premise

The SigLIP and DINOv2 embeddings must reliably encode question relevance and semantic representativeness across the range of question types that appear in long videos, and the four preset strategies must be sufficient to handle the needed relevance-coverage trade-offs without per-query retuning.

What would settle it

On a new long-video QA dataset whose questions fall outside the four preset categories or whose visual content is poorly separated by the chosen embeddings, the greedy selections produce no accuracy gain or a loss relative to uniform sampling at the same frame budget.

Figures

Figures reproduced from arXiv: 2603.20180 by Fengqing Zhu, Joseph Huang, Xiaoyu Ji, Yichi Zhang, Yuning Huang.

Figure 1: Question-type classifier used for adaptive strategy routing.
Figure 2: Main result: average accuracy vs. selected frame count.

Original abstract

Large vision-language models (VLMs) are increasingly applied to long-video question answering, yet inference is often bottlenecked by the number of input frames and resulting visual tokens. Naive sparse sampling can miss decisive moments, while purely relevance-driven selection frequently collapses onto near-duplicate frames and sacrifices coverage of temporally distant evidence. We propose a question-adaptive greedy frame selection method that jointly optimizes query relevance and semantic representativeness under a fixed frame budget. Our approach constructs a 1 FPS candidate pool (capped at 1000) with exact timestamp alignment, embeds candidates in two complementary spaces (SigLIP for question relevance and DINOv2 for semantic similarity), and selects frames by greedily maximizing a weighted sum of a modular relevance term and a facility-location coverage term. This objective is normalized, monotone, and submodular, yielding a standard (1-1/e) greedy approximation guarantee. To account for question-dependent trade-offs between relevance and coverage, we introduce four preset strategies and a lightweight text-only question-type classifier that routes each query to its best-performing preset. Experiments on MLVU show consistent accuracy gains over uniform sampling and a strong recent baseline across frame budgets, with the largest improvements under tight budgets.
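How the two embedding spaces might turn into scores can be sketched with NumPy. Plain arrays stand in for SigLIP and DINOv2 outputs (no model is loaded), and the min-max rescaling and clipping are assumptions made here to keep both score types non-negative; the paper's exact normalization is not given on this page.

```python
import numpy as np

def l2_normalize(x):
    """Scale vectors along the last axis to unit length."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def relevance_scores(question_emb, frame_embs):
    """Cosine similarity of each frame to the question, min-max rescaled to [0, 1]."""
    cos = l2_normalize(frame_embs) @ l2_normalize(question_emb)
    span = cos.max() - cos.min()
    return (cos - cos.min()) / span if span > 0 else np.zeros_like(cos)

def similarity_matrix(frame_embs):
    """Pairwise cosine similarity between frames, clipped to [0, 1]."""
    z = l2_normalize(frame_embs)
    return np.clip(z @ z.T, 0.0, 1.0)
```

`relevance_scores` would take text and image embeddings from the same joint space, while `similarity_matrix` would take image-only features; the outputs have shapes `(n,)` and `(n, n)` for a pool of `n` candidates.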

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper proposes an adaptive greedy frame selection method for long-video VLM question answering. It constructs a 1 FPS candidate pool (capped at 1000), embeds frames via SigLIP (for modular relevance) and DINOv2 (for facility-location coverage), and greedily maximizes a normalized weighted sum objective that is monotone and submodular, yielding the standard (1-1/e) approximation guarantee. A lightweight text-only classifier routes each query to one of four preset relevance-coverage weightings. Experiments on MLVU report consistent accuracy gains over uniform sampling and a recent baseline, largest under tight frame budgets.

Significance. If the gains are robust, the work offers a practical, theoretically grounded solution to the frame-budget bottleneck in long-video VLMs by combining complementary embeddings with submodular optimization and lightweight adaptation. The explicit use of the greedy guarantee and the separation of the classifier from the visual embeddings are notable strengths that could influence efficient inference pipelines.

major comments (3)
  1. [Abstract / Method] Abstract and Method section: the claim that the objective is 'normalized, monotone, and submodular' is asserted without an explicit equation for the weighted sum or a short derivation/reference showing why the facility-location term preserves submodularity under the chosen normalization; this is load-bearing for the (1-1/e) guarantee.
  2. [Experiments] Experiments section: MLVU accuracy improvements are summarized at high level only, with no error bars, statistical tests, or per-budget ablation tables; without these it is impossible to determine whether the reported gains are statistically reliable or driven primarily by the adaptive classifier versus the base greedy selection.
  3. [Method / Experiments] Method / Experiments: the central assumption that SigLIP cosine similarity and DINOv2 features produce scores aligned with answer utility across diverse question types (fine-grained actions, temporal ordering, rare objects) is not supported by qualitative frame-selection examples or failure-case analysis; if this alignment fails, the greedy selections remain suboptimal despite the theoretical guarantee.
minor comments (2)
  1. [Abstract] Abstract: the phrase 'exact timestamp alignment' for the candidate pool should be expanded with a brief description of how frame indices map to video time in the full method.
  2. [Experiments] Overall: a small table showing accuracy for each of the four preset strategies (with and without the classifier) would strengthen the justification for the adaptive routing component.
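For reference, the argument requested in major comment 1 is short when the objective has the standard weighted form. The notation below is an assumption, not taken from the paper: $V$ is the candidate pool, $r_i \ge 0$ a relevance score, $s_{ij} \ge 0$ a pairwise similarity, $\lambda \in [0, 1]$ the weight, and $v \notin S$ a frame being added.

```latex
\begin{aligned}
F(S) &= \lambda \sum_{i \in S} r_i + (1 - \lambda) \sum_{j \in V} m_j(S),
  \qquad m_j(S) = \max_{i \in S} s_{ij}, \quad m_j(\emptyset) = 0, \\
\Delta(v \mid S) &= F(S \cup \{v\}) - F(S)
  = \lambda r_v + (1 - \lambda) \sum_{j \in V} \max\bigl(0,\ s_{vj} - m_j(S)\bigr), \\
S \subseteq T &\;\Rightarrow\; m_j(S) \le m_j(T)
  \;\Rightarrow\; \Delta(v \mid T) \le \Delta(v \mid S) \quad \text{(submodular)}, \\
r_v \ge 0 &\;\Rightarrow\; \Delta(v \mid S) \ge 0 \quad \text{(monotone)},
  \qquad F(\emptyset) = 0 \quad \text{(normalized)}.
\end{aligned}
```

Rescaling either score type by a positive constant, or shifting it so that it stays non-negative, leaves each step intact. A normalization that depends on the selected set $S$ would need a separate argument.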

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below and will revise the manuscript accordingly to improve rigor and clarity.

Point-by-point responses
  1. Referee: [Abstract / Method] Abstract and Method section: the claim that the objective is 'normalized, monotone, and submodular' is asserted without an explicit equation for the weighted sum or a short derivation/reference showing why the facility-location term preserves submodularity under the chosen normalization; this is load-bearing for the (1-1/e) guarantee.

    Authors: We agree that an explicit formulation is essential. In the revised manuscript we will add the precise objective equation in the Method section (a normalized weighted sum of the modular SigLIP relevance term and the DINOv2 facility-location coverage term) together with a short derivation or standard reference establishing that the coverage term is monotone submodular and that the chosen normalization preserves these properties for the combined objective. This will directly support the (1-1/e) guarantee. revision: yes

  2. Referee: [Experiments] Experiments section: MLVU accuracy improvements are summarized at high level only, with no error bars, statistical tests, or per-budget ablation tables; without these it is impossible to determine whether the reported gains are statistically reliable or driven primarily by the adaptive classifier versus the base greedy selection.

    Authors: We acknowledge the current lack of detailed statistics. The revision will include error bars from multiple random seeds, paired statistical significance tests across methods, and expanded per-budget ablation tables that separately report the base greedy selector and the full adaptive-classifier version. These additions will allow readers to assess both reliability and the relative contribution of each component. revision: yes

  3. Referee: [Method / Experiments] Method / Experiments: the central assumption that SigLIP cosine similarity and DINOv2 features produce scores aligned with answer utility across diverse question types (fine-grained actions, temporal ordering, rare objects) is not supported by qualitative frame-selection examples or failure-case analysis; if this alignment fails, the greedy selections remain suboptimal despite the theoretical guarantee.

    Authors: We agree that qualitative evidence would strengthen the empirical grounding. The revised version will add representative frame-selection visualizations for different MLVU question categories and a short failure-case analysis section that discusses instances where the selected frames align with or deviate from answer utility. This will provide concrete support for the practical utility of the embeddings. revision: yes
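One way to produce the paired comparison promised in response 2 is a bootstrap over questions. This is a generic sketch, not the authors' protocol: the function name `paired_bootstrap` is hypothetical, and the inputs are assumed to be per-question 0/1 correctness vectors for two methods scored on the same questions at the same frame budget.

```python
import numpy as np

def paired_bootstrap(correct_a, correct_b, n_resamples=10_000, seed=0):
    """Bootstrap the accuracy difference (A - B) over a shared question set.

    Returns (observed difference, (2.5th percentile, 97.5th percentile))."""
    a = np.asarray(correct_a, dtype=float)
    b = np.asarray(correct_b, dtype=float)
    if a.shape != b.shape:
        raise ValueError("both methods must be scored on the same questions")
    rng = np.random.default_rng(seed)
    # Resample question indices with replacement; the same indices are used
    # for both methods so the pairing is preserved.
    idx = rng.integers(0, len(a), size=(n_resamples, len(a)))
    diffs = (a[idx] - b[idx]).mean(axis=1)
    low, high = np.percentile(diffs, [2.5, 97.5])
    return float(a.mean() - b.mean()), (float(low), float(high))
```

An interval that excludes zero at a given budget would indicate that the gain is unlikely to be resampling noise. Running it separately for the base greedy selector and for the classifier-routed version would address the attribution question in the same comment.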

Circularity Check

0 steps flagged

No circularity: standard greedy algorithm on explicitly submodular objective with independent pre-trained components

The derivation chain consists of constructing a candidate pool, embedding in fixed pre-trained spaces (SigLIP, DINOv2), defining a normalized monotone submodular objective, and applying the known greedy algorithm whose (1-1/e) guarantee is external. The four presets and text-only classifier are trained separately on question text and do not reduce to fitted parameters from the reported video results. No self-definitional equations, no predictions that are inputs by construction, and no load-bearing self-citations appear in the provided derivation. The central claim remains independent of the experimental outcomes.

Axiom & Free-Parameter Ledger

1 free parameter · 2 axioms · 0 invented entities

The method rests on the assumption that the chosen embeddings capture the required signals and that the objective remains submodular after weighting; no new entities are introduced.

free parameters (1)
  • relevance-coverage weight
    The weighted sum requires a balance parameter that is set via one of four presets chosen by the classifier.
axioms (2)
  • [standard math] The combined relevance-plus-coverage objective is monotone and submodular
    Invoked to obtain the (1-1/e) greedy approximation guarantee.
  • [domain assumption] SigLIP and DINOv2 embeddings provide faithful measures of question relevance and semantic diversity
    Used to construct the relevance and coverage terms without further justification in the abstract.
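The single free parameter in the ledger can be pictured as a lookup from question type to weight. Everything in the sketch below is hypothetical: the preset names, the weight values, and the keyword rules are placeholders, and the keyword matcher merely stands in for the paper's text-only classifier, whose design is not described on this page.

```python
# Hypothetical presets: relevance weight lam; coverage weight is 1 - lam.
PRESETS = {
    "relevance_heavy": 0.9,
    "balanced": 0.5,
    "coverage_heavy": 0.2,
    "coverage_only": 0.0,
}

# Hypothetical keyword rules, checked in order; first match wins.
RULES = [
    (("how many", "count", "order", "sequence"), "coverage_heavy"),
    (("summar", "main topic", "overall"), "coverage_only"),
    (("what color", "who", "where", "which"), "relevance_heavy"),
]

def route(question: str) -> tuple[str, float]:
    """Map a question to a (preset name, relevance weight) pair using text only."""
    q = question.lower()
    for keywords, preset in RULES:
        if any(k in q for k in keywords):
            return preset, PRESETS[preset]
    return "balanced", PRESETS["balanced"]
```

Because routing looks only at the question text, it can run before any frame is decoded, and the chosen weight is then fixed for the whole greedy pass.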


