QCA: Query- and Content-Aware Keyframe Selection for Long Video Understanding

Baiyang Song; Hui Li; Jie Li; Jun Peng; Rongrong Ji; Yiyi Zhou; Yonghong Tian

arxiv: 2607.00983 · v1 · pith:SYJBMWVZnew · submitted 2026-07-01 · 💻 cs.CV

Jun Peng, Baiyang Song, Jie Li, Hui Li, Yiyi Zhou, Rongrong Ji, Yonghong Tian

Pith reviewed 2026-07-02 14:07 UTC · model grok-4.3

classification 💻 cs.CV

keywords query-aware keyframe selection · long video understanding · temporal redundancy · Video-LLMs · parameter-free method · content deviation · information contribution estimation

The pith

QCA selects compact query-relevant keyframes from long videos by scoring segments on relevance and deviation, then picking diverse additions within budget, needing no training.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Long videos suffer from temporal redundancy that wastes computation when only a few frames matter for a given query. QCA splits the video into segments and scores each one's contribution by combining how well it matches the query with how much new content it adds. It assigns keyframe slots per segment and picks an anchor frame plus others that stay relevant while increasing variety. The whole process runs without training or fine-tuning and slots into any existing Video-LLM. On LongVideoBench this yields 67.8 percent accuracy with 128 frames, above GPT-4o's 66.7 percent using 256 frames.

Core claim

The QCA framework first partitions a long video into temporal segments, estimates each segment's information contribution through joint modeling of query relevance and content deviation, dynamically allocates a keyframe budget across segments, and within each segment anchors on the most query-relevant frame before iteratively adding frames that maximize diversity while preserving semantic relevance. This selection requires no additional training and integrates directly into Video-LLMs, producing state-of-the-art results such as 67.8 percent on LongVideoBench with only 128 frames.
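
The allocation step can be made concrete with a short sketch. The abstract does not give the scoring function, so everything below is an assumption for illustration: relevance is taken as the mean cosine similarity between frame and query embeddings, deviation as the mean distance of frames from their segment centroid, the two are mixed with a hypothetical weight alpha, and quotas are assigned by largest-remainder rounding. The name allocate_budget and all of its parameters are invented here and are not taken from the paper or its code.

```python
import numpy as np

def allocate_budget(frame_feats, query_feat, num_segments, budget, alpha=0.5):
    """Split frames into contiguous segments and give each a keyframe quota.

    frame_feats: (T, D) array of L2-normalised frame embeddings.
    query_feat:  (D,) L2-normalised query embedding.
    alpha:       hypothetical mixing weight (an assumption, not from the paper).
    Returns (segments, quotas): a list of index arrays and an integer array.
    """
    T = frame_feats.shape[0]
    if not 1 <= num_segments <= T:
        raise ValueError("num_segments must be between 1 and the frame count")
    segments = np.array_split(np.arange(T), num_segments)

    scores = []
    for idx in segments:
        feats = frame_feats[idx]
        # Assumed relevance: mean cosine similarity, mapped from [-1, 1] to [0, 1].
        relevance = (float(np.mean(feats @ query_feat)) + 1.0) / 2.0
        # Assumed deviation: mean distance of frames from the segment centroid.
        centroid = feats.mean(axis=0)
        deviation = float(np.mean(np.linalg.norm(feats - centroid, axis=1)))
        scores.append(alpha * relevance + (1.0 - alpha) * deviation)

    scores = np.asarray(scores) + 1e-8           # avoid division by zero
    shares = scores / scores.sum() * budget      # fractional quota per segment
    quotas = np.floor(shares).astype(int)

    # Largest-remainder rounding: hand leftover slots to the biggest fractions.
    leftover = int(budget - quotas.sum())
    order = np.argsort(-(shares - quotas))
    quotas[order[:leftover]] += 1

    # A segment cannot supply more frames than it contains, so the total
    # may fall below the budget when segments are very short.
    quotas = np.minimum(quotas, [len(idx) for idx in segments])
    return segments, quotas
```

A weighted sum is only one way to combine the two signals; a product or a rank-based combination would fit the abstract's description equally well.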

What carries the argument

The QCA procedure that partitions video into segments, jointly scores query relevance and content deviation to allocate budgets, and iteratively selects diverse yet query-aligned frames inside each segment.
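
The within-segment step can be sketched in the same spirit. The Figure 2 caption says that frames are added by maximum aggregate distance to the current keyframe set, drawn from a relevance-filtered candidate set; the sketch below follows that description but makes its own assumptions: Euclidean distance between embeddings, a candidate set formed by keeping a fixed fraction of the most relevant frames, and a hypothetical keep_ratio parameter. The function name select_in_segment is invented for illustration.

```python
import numpy as np

def select_in_segment(frame_feats, relevance, quota, keep_ratio=0.5):
    """Pick up to `quota` frame indices from one segment.

    frame_feats: (N, D) frame embeddings for the segment.
    relevance:   (N,) query-relevance score per frame (higher is better).
    keep_ratio:  hypothetical fraction of frames kept as candidates.
    """
    frame_feats = np.asarray(frame_feats, dtype=float)
    relevance = np.asarray(relevance, dtype=float)
    n = len(relevance)
    quota = min(quota, n)
    if quota <= 0:
        return []

    # Candidate set: the most query-relevant frames in the segment.
    n_cand = max(quota, int(np.ceil(keep_ratio * n)))
    candidates = [int(i) for i in np.argsort(-relevance)[:n_cand]]

    # Anchor on the single most relevant frame.
    selected = [candidates.pop(0)]

    while len(selected) < quota:
        cand_feats = frame_feats[candidates]      # (C, D)
        sel_feats = frame_feats[selected]         # (K, D)
        # Pairwise distances between candidates and already selected frames.
        dists = np.linalg.norm(
            cand_feats[:, None, :] - sel_feats[None, :, :], axis=2
        )
        # Add the candidate with the largest summed distance to the set.
        best = int(np.argmax(dists.sum(axis=1)))
        selected.append(candidates.pop(best))

    return sorted(selected)
```

Restricting the search to the candidate set is what keeps a visually unusual but irrelevant frame from being chosen purely for its distance to the anchor.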

If this is right

Video-LLMs can handle longer inputs at lower cost by processing only the selected keyframes.
Performance scales with query-specific allocation rather than fixed uniform sampling across all videos.
The same selection logic applies across multiple benchmarks without retraining the underlying model.
Computational load during inference drops proportionally to the reduction in processed frames.
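
The last point in the list above can be checked with simple arithmetic. The sketch below counts visual tokens only and assumes a fixed number of tokens per frame (196 is a placeholder, not a figure from the paper); it also ignores the cost of running the selector itself, so it is a first-order estimate rather than a measured speedup.

```python
def visual_token_count(num_frames, tokens_per_frame=196):
    """Visual tokens fed to the model, assuming a fixed count per frame."""
    return num_frames * tokens_per_frame

def relative_load(selected_frames, baseline_frames, tokens_per_frame=196):
    """Ratio of visual tokens after selection to the baseline token count."""
    selected = visual_token_count(selected_frames, tokens_per_frame)
    baseline = visual_token_count(baseline_frames, tokens_per_frame)
    return selected / baseline

# Example: 128 selected frames against a 256-frame baseline.
ratio = relative_load(128, 256)
```

Under this token-count model, halving the frame count halves the visual input regardless of the per-frame token figure chosen.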

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The approach could extend to live streams where segments arrive sequentially and budgets must be decided on the fly.
Similar joint relevance-deviation scoring might reduce redundancy in long audio or document sequences fed to language models.
Combining the method with existing compression techniques could push context lengths further while staying under fixed token limits.

Load-bearing premise

Joint modeling of query relevance and content deviation inside each temporal segment ranks information contribution accurately without any learned parameters or task-specific fine-tuning.

What would settle it

An evaluation on LongVideoBench showing that QCA with 128 frames does not reach or exceed the accuracy obtained by uniform sampling with the same 128 frames or by GPT-4o with 256 frames.
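
The comparison described above needs a uniform-sampling baseline at the same frame budget and a common accuracy measure. A minimal sketch of both follows, assuming multiple-choice answers compared by exact match; the function names are hypothetical and the official LongVideoBench evaluation scripts may differ in detail.

```python
import numpy as np

def uniform_indices(num_frames, budget):
    """Evenly spaced frame indices, the usual uniform-sampling baseline."""
    budget = min(budget, num_frames)
    return np.linspace(0, num_frames - 1, budget).round().astype(int).tolist()

def accuracy(predictions, answers):
    """Fraction of predictions that exactly match the reference answers."""
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must have the same length")
    correct = sum(p == a for p, a in zip(predictions, answers))
    return correct / len(answers)
```

Running the same Video-LLM once on uniform_indices(num_frames, 128) and once on the QCA-selected frames, then comparing the two accuracy values, isolates the effect of the selection rule from the effect of the underlying model.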

Figures

Figures reproduced from arXiv: 2607.00983 by Baiyang Song, Hui Li, Jie Li, Jun Peng, Rongrong Ji, Yiyi Zhou, Yonghong Tian.

**Figure 1.** Uniform sampling may overlook semantically critical moments due to the temporal redundancy in videos, whereas our query- and content-aware selection prioritizes informative and diverse frames, enabling more reliable long video understanding under limited frame budgets.

**Figure 2.** The proposed QCA consists of: (a) Inter-Segment Keyframe Allocation, where the frame budget for each segment is dynamically determined by considering semantic alignment and visual content deviation; (b) Intra-Segment Keyframe Selection, which iteratively adds the frame with the maximum aggregate distance to the current keyframe set Ks from a relevance-filtered candidate set Cs.

**Figure 3.** Qualitative comparison of our QCA (left) and Uniform Sampling (right). The green boxes indicate the critical frames successfully selected by our method, which contain the specific evidence required to answer the questions (e.g., the passing dog, the white beard). In contrast, Uniform Sampling fails to capture these informative moments due to its rigid temporal intervals, leading to incorrect answers.

**Figure 4.** For the same video, QCA selects different keyframes based on different queries, while ensuring video coverage to prevent missing key information.

**Figure 5.** An example of our QCA keyframe selection. The upper right visualizes the match score between each frame and query, while the bottom right shows the changes in frame content at the feature level. The red dot indicates the selected keyframe, and the green dot indicates the average ITM score or average frame feature within a segment. The shaded area represents the standard deviation within the segment.

**Figure 6.** Time cost (left), sensitivity study in α and γ (middle) and S (right).

Abstract

Video understanding is often plagued by severe temporal redundancy, where processing dense frame sequences is both semantically inefficient and computationally expensive. This challenge is further amplified when only a small subset of frames is truly relevant to the given query. In this paper, we propose a Query- and Content-Aware (QCA) keyframe selection framework that can select a compact yet information-rich set of frames from long videos. QCA first partitions the video into temporal segments and estimates the information contribution of each segment by jointly modeling query relevance and content deviation, and dynamically allocates keyframe budget to each segment. Within each segment, QCA anchors on the most query-relevant frame and iteratively incorporates additional frames to maximize diversity while maintaining high semantic relevance to the query. Crucially, our method requires no additional training and can be seamlessly integrated into existing Video-LLMs. Extensive experiments across multiple long video understanding benchmarks demonstrate that our proposed approach achieves state-of-the-art performance and has strong generalization ability. For instance, QCA achieves 67.8% on LongVideoBench using 128 frames, while GPT-4o achieves 66.7% using 256 frames. Our codes are available on GitHub (https://github.com/hktk07/QCA).

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

QCA gives a training-free keyframe picker that claims to match or beat GPT-4o on LongVideoBench with half the frames, but the core scoring rule is a heuristic whose reliability is not shown in the abstract.

The paper's main contribution is a deterministic pipeline that splits a long video into temporal segments, scores each segment by combining query relevance with content deviation, allocates a per-segment frame budget, then iteratively adds diverse frames inside the segment while staying close to the query. This is presented as plug-and-play for existing Video-LLMs with no extra training.

What it does cleanly is the practical framing: the method is parameter-free, the GitHub link is supplied, and the headline number (67.8% on LongVideoBench at 128 frames versus GPT-4o's 66.7% at 256) is the sort of deployment-relevant comparison people actually run. The combination of dynamic budget allocation and the iterative diversity step inside segments is not exactly the same as prior keyframe-selection work, although the abstract itself cites none for comparison.

The soft spot is the joint relevance-deviation scorer itself. The abstract gives no equations, no ablation on what happens when relevance and deviation conflict, and no check against an oracle or against cases where pre-trained embeddings miss task-specific semantics. Without those, the performance edge could be driven by the underlying Video-LLM rather than the selection rule. The stress-test note about the heuristic failing on diverse queries is therefore still live; the abstract supplies no counter-evidence.

This is for groups that already run long-video inference and want a lightweight, training-free filter before feeding frames to an LLM. It is not a theoretical advance and does not reorganize the field. The work is coherent on its own terms and the authors are straightforward about the no-training constraint, so it clears the bar for a serious referee even though the current evidence is limited to the abstract-level claim.

Referee Report

2 major / 0 minor

Summary. The paper proposes QCA, a parameter-free Query- and Content-Aware keyframe selection framework for long video understanding. The method partitions a video into temporal segments, estimates each segment's information contribution via joint modeling of query relevance and content deviation, dynamically allocates a keyframe budget per segment, and within segments anchors on the most query-relevant frame while iteratively adding frames to maximize diversity. It requires no training or fine-tuning and integrates with existing Video-LLMs. The central claim is state-of-the-art performance on long-video benchmarks, e.g., 67.8% on LongVideoBench using 128 frames versus GPT-4o's 66.7% using 256 frames.

Significance. If the heuristic for estimating per-segment contribution holds across queries and videos, the work would be significant for enabling more efficient long-video processing in multimodal models without learned parameters or task-specific adaptation. The open availability of code on GitHub is a positive factor for reproducibility.

major comments (2)

[Abstract] The central performance claim (67.8% on LongVideoBench with 128 frames) is presented without any description of the experimental protocol, number of evaluated videos, baseline implementations, or statistical analysis, which is load-bearing for assessing whether the reported gains follow from the proposed heuristic.
[Abstract] The joint modeling of query relevance and content deviation to estimate information contribution is described only at a high level with no explicit formulation, scoring function, or combination rule, which is load-bearing for the claim that this parameter-free procedure reliably ranks segments without learned parameters or task-specific fine-tuning.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on the abstract. We agree that both points identify areas where the abstract can be strengthened for self-containment and will revise it accordingly while preserving its brevity.

Referee: [Abstract] The central performance claim (67.8% on LongVideoBench with 128 frames) is presented without any description of the experimental protocol, number of evaluated videos, baseline implementations, or statistical analysis, which is load-bearing for assessing whether the reported gains follow from the proposed heuristic.

Authors: We acknowledge the concern. The abstract's space constraints limited inclusion of these details, but the full manuscript (Sections 4.1 and 5) specifies the LongVideoBench evaluation protocol, the standard test split, baseline implementations drawn from official releases or public code, and results averaged across runs with standard deviations where applicable. We will revise the abstract to add a concise clause such as 'evaluated on the LongVideoBench test set following official protocols with comparisons to GPT-4o using its standard implementation' to make the claim more self-contained.
Revision: yes

Referee: [Abstract] The joint modeling of query relevance and content deviation to estimate information contribution is described only at a high level with no explicit formulation, scoring function, or combination rule, which is load-bearing for the claim that this parameter-free procedure reliably ranks segments without learned parameters or task-specific fine-tuning.

Authors: We agree that an explicit high-level formulation would improve clarity in the abstract. The manuscript (Section 3) provides the full scoring functions and combination rule. We will revise the abstract to include a brief explicit description, for example 'via a parameter-free score combining query-frame embedding similarity and segment content deviation', to better support the parameter-free claim without exceeding length limits.
Revision: yes

Circularity Check

0 steps flagged

No circularity: deterministic heuristic with no fitted parameters or self-referential derivations.

The paper describes a parameter-free keyframe selection procedure that partitions videos into segments, models query relevance and content deviation to allocate budgets, and selects frames for diversity. No equations, fitted quantities, or predictions that reduce to inputs by construction are present. No self-citation chains or uniqueness theorems are invoked to justify the core method. Performance numbers are empirical benchmark results, not derived claims. The derivation chain is self-contained as a heuristic algorithm.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only abstract available; no explicit free parameters, axioms, or invented entities are stated.



[1] [1]

Advances in neural information processing systems35, 23716– 23736 (2022) 3

Alayrac, J.B., Donahue, J., Luc, P., Miech, A., Barr, I., Hasson, Y., Lenc, K., Men- sch, A., Millican, K., Reynolds, M., et al.: Flamingo: a visual language model for few-shot learning. Advances in neural information processing systems35, 23716– 23736 (2022) 3

2022

[2] [2]

arXiv preprint arXiv:2306.13176 (2023) 2, 4

Arslan, S., Tanberk, S.: Key frame extraction with attention based deep neural networks. arXiv preprint arXiv:2306.13176 (2023) 2, 4

work page arXiv 2023

[3] [3]

Qwen3-VL Technical Report

Bai, S., Cai, Y., Chen, R., Chen, K., Chen, X., Cheng, Z., Deng, L., Ding, W., Gao, C., Ge, C., Ge, W., Guo, Z., Huang, Q., Huang, J., Huang, F., Hui, B., Jiang, S., Li, Z., Li, M., Li, M., Li, K., Lin, Z., Lin, J., Liu, X., Liu, J., Liu, C., Liu, Y., Liu, D., Liu, S., Lu, D., Luo, R., Lv, C., Men, R., Meng, L., Ren, X., Ren, X., Song, S., Sun, Y., Tang, ...

work page internal anchor Pith review Pith/arXiv arXiv 2025

[4] [4]

Qwen2.5-VL Technical Report

Bai, S., Chen, K., Liu, X., Wang, J., Ge, W., Song, S., Dang, K., Wang, P., Wang, S., Tang, J., Zhong, H., Zhu, Y., Yang, M., Li, Z., Wan, J., Wang, P., Ding, W., Fu, Z., Xu, Y., Ye, J., Zhang, X., Xie, T., Cheng, Z., Zhang, H., Yang, Z., Xu, H., Lin, J.: Qwen2.5-vl technical report. arXiv preprint arXiv:2502.13923 (2025) 9

work page internal anchor Pith review Pith/arXiv arXiv 2025

[5] [5]

Token Merging: Your ViT But Faster

Bolya, D., Fu, C.Y., Dai, X., Zhang, P., Feichtenhofer, C., Hoffman, J.: Token merging: Your vit but faster. arXiv preprint arXiv:2210.09461 (2022) 4

work page internal anchor Pith review Pith/arXiv arXiv 2022

[6] [6]

Advances in neural information processing systems33, 1877–1901 (2020) 1

Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Nee- lakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in neural information processing systems33, 1877–1901 (2020) 1

1901

[7] [7]

In: European Conference on Computer Vision

Chen, L., Zhao, H., Liu, T., Bai, S., Lin, J., Zhou, C., Chang, B.: An image is worth 1/2 tokens after layer 2: Plug-and-play inference acceleration for large vision-language models. In: European Conference on Computer Vision. pp. 19–35. Springer (2024) 4

2024

[8] [8]

Towards Effective Long Video Understanding of Multimodal Large Language Models via One-shot Clip Retrieval

Chen, T., Ju, S., Wu, Q., Fang, C., Zhang, K., Peng, J., Li, H., Zhou, Y., Ji, R.: Towards effective and efficient long video understanding of multimodal large language models via one-shot clip retrieval. arXiv preprint arXiv:2512.08410 (2025) 8, 10

work page internal anchor Pith review Pith/arXiv arXiv 2025

[9] Chen, W., Zeng, Y., Luo, Y., Xie, T., Lin, L., Ji, J., Zhang, Y., Zheng, X.: Wavelet-based frame selection by detecting semantic boundary for long video understanding. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 24052–24061 (2026)

[10] Dong, W., Zhang, Z., Song, C., Tan, T.: Identifying the key frames: An attention-aware sampling method for action recognition. Pattern Recognition 130, 108797 (2022)

[11] Dubey, A., Jauhri, A., Pandey, A., Kadian, A., Al-Dahle, A., Letman, A., Mathur, A., Schelten, A., Yang, A., Fan, A., et al.: The Llama 3 herd of models. arXiv preprint arXiv:2407.21783 (2024)

[12] Fu, C., Dai, Y., Luo, Y., Li, L., Ren, S., Zhang, R., Wang, Z., Zhou, C., Shen, Y., Zhang, M., et al.: Video-MME: The first-ever comprehensive evaluation benchmark of multi-modal LLMs in video analysis. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 24108–24118 (2025)

[13] Hu, K., Gao, F., Nie, X., Zhou, P., Tran, S., Neiman, T., Wang, L., Shah, M., Hamid, R., Yin, B., et al.: M-LLM based video frame selection for efficient video understanding. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 13702–13712 (2025)

[14] Huang, D.A., Radhakrishnan, S., Yu, Z., Kautz, J.: FRAG: Frame selection augmented generation for long video and long document understanding. arXiv preprint arXiv:2504.17447 (2025)

[15] Hurst, A., Lerer, A., Goucher, A.P., Perelman, A., Ramesh, A., Clark, A., Ostrow, A., Welihinda, A., Hayes, A., Radford, A., et al.: GPT-4o system card. arXiv preprint arXiv:2410.21276 (2024)

[16] Ju, S., Song, B., Chen, T., Zhang, J., Wu, Q., Chang, C., Wang, H., Zhou, Y., Ji, R.: ForestPrune: High-ratio visual token compression for video multimodal large language models via spatial-temporal forest modeling. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 8326–8336 (2026)

[17] Li, B., Zhang, Y., Guo, D., Zhang, R., Li, F., Zhang, H., Zhang, K., Zhang, P., Li, Y., Liu, Z., et al.: LLaVA-OneVision: Easy visual task transfer. arXiv preprint arXiv:2408.03326 (2024)

[18] Li, J., Li, D., Savarese, S., Hoi, S.: BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In: International Conference on Machine Learning. pp. 19730–19742. PMLR (2023)

[19] Li, K., He, Y., Wang, Y., Li, Y., Wang, W., Luo, P., Wang, Y., Wang, L., Qiao, Y.: VideoChat: Chat-centric video understanding. Science China Information Sciences 68(10), 200102 (2025)

[20] Liang, H., Li, J., Bai, T., Huang, X., Sun, L., Wang, Z., He, C., Cui, B., Chen, C., Zhang, W.: KeyVideoLLM: Towards large-scale video keyframe selection. arXiv preprint arXiv:2407.03104 (2024)

[21] Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36, 34892–34916 (2023)

[22] Liu, S., Zhao, C., Xu, T., Ghanem, B.: BOLT: Boost large vision-language model without training for long-form video understanding. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 3318–3327 (2025)

[23] Liu, Z., Zhu, L., Shi, B., Zhang, Z., Lou, Y., Yang, S., Xi, H., Cao, S., Gu, Y., Li, D., et al.: NVILA: Efficient frontier visual language models. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 4122–4134 (2025)

[24] Soldan, M., Pardo, A., Alcázar, J.L., Caba, F., Zhao, C., Giancola, S., Ghanem, B.: MAD: A scalable dataset for language grounding in videos from movie audio descriptions. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 5026–5035 (2022)

[25] Song, B., Peng, J., Zhang, Y., Chen, G., Yang, F., Guo, J.: KTV: Keyframes and key tokens selection for efficient training-free video LLMs. In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 40, pp. 9060–9068 (2026). https://doi.org/10.1609/aaai.v40i11.378624

[26] Sun, G., Singhal, A., Uzkent, B., Shah, M., Chen, C., Kessler, G.: From frames to clips: Efficient key clip selection for long-form video understanding. arXiv preprint arXiv:2510.02262 (2025)

[27] Tang, X., Qiu, J., Xie, L., Tian, Y., Jiao, J., Ye, Q.: Adaptive keyframe sampling for long video understanding. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 29118–29128 (2025)

[28] Team, G., Georgiev, P., Lei, V.I., Burnell, R., Bai, L., Gulati, A., Tanzer, G., Vincent, D., Pan, Z., Wang, S., et al.: Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context. arXiv preprint arXiv:2403.05530 (2024)

[29] Team, K., Du, A., Yin, B., Xing, B., Qu, B., Wang, B., Chen, C., Zhang, C., Du, C., Wei, C., Wang, C., Zhang, D., Du, D., Wang, D., Yuan, E., Lu, E., Li, F., Sung, F., Wei, G., Lai, G., Zhu, H., Ding, H., Hu, H., Yang, H., Zhang, H., Wu, H., Yao, H., Lu, H., Wang, H., Gao, H., Zheng, H., Li, J., Su, J., Wang, J., Deng, J., Qiu, J., Xie, J., Wang, J., Liu,...

2025

[30] Wang, W., He, Z., Hong, W., Cheng, Y., Zhang, X., Qi, J., Ding, M., Gu, X., Huang, S., Xu, B., et al.: LVBench: An extreme long video understanding benchmark. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 22958–22967 (2025)

[31] Wang, W., Gao, Z., Gu, L., Pu, H., Cui, L., Wei, X., Liu, Z., Jing, L., Ye, S., Shao, J., et al.: InternVL3.5: Advancing open-source multimodal models in versatility, reasoning, and efficiency. arXiv preprint arXiv:2508.18265 (2025)

[32] Wu, H., Li, D., Chen, B., Li, J.: LongVideoBench: A benchmark for long-context interleaved video-language understanding. Advances in Neural Information Processing Systems 37, 28828–28857 (2024)

[33] Wu, W., Zhao, Y., Li, Z., Li, J., Zhou, H., Shou, M.Z., Bai, X.: A large cross-modal video retrieval dataset with reading comprehension. Pattern Recognition 157, 110818 (2025)

[34] Xu, Z., Zhang, J., Wang, Q., Liu, Y.: E-VRAG: Enhancing long video understanding with resource-efficient retrieval augmented generation. arXiv preprint arXiv:2508.01546 (2025)

[35] Yang, A., Li, A., Yang, B., Zhang, B., Hui, B., Zheng, B., Yu, B., Gao, C., Huang, C., Lv, C., et al.: Qwen3 technical report. arXiv preprint arXiv:2505.09388 (2025)

[36] Ye, J., Xu, H., Liu, H., Hu, A., Yan, M., Qian, Q., Zhang, J., Huang, F., Zhou, J.: mPLUG-Owl3: Towards long image-sequence understanding in multi-modal large language models. arXiv preprint arXiv:2408.04840 (2024)

[37] Zhang, H., Li, X., Bing, L.: Video-LLaMA: An instruction-tuned audio-visual language model for video understanding. arXiv preprint arXiv:2306.02858 (2023)

[38] Zhang, S., Yang, J., Yin, J., Luo, Z., Luan, J.: Q-Frame: Query-aware frame selection and multi-resolution adaptation for Video-LLMs. arXiv preprint arXiv:2506.22139 (2025)

[39] Zhang, Y., Zhao, Z., Chen, Z., Ding, Z., Yang, X., Sun, Y.: Beyond training: Dynamic token merging for zero-shot video understanding. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 22046–22055 (2025)

[40] Zhang, Y., Wu, J., Li, W., Li, B., Ma, Z., Liu, Z., Li, C.: Video instruction tuning with synthetic data. arXiv preprint arXiv:2410.02713 (2024)

[41] Zhou, J., Shu, Y., Zhao, B., Wu, B., Liang, Z., Xiao, S., Qin, M., Yang, X., Xiong, Y., Zhang, B., et al.: MLVU: Benchmarking multi-task long video understanding. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 13691–13701 (2025)

[42] Zhu, Z., Xu, H., Luo, Y., Liu, Y., Sarkar, K., Yang, Z., You, Y.: FOCUS: Efficient keyframe selection for long video understanding. arXiv preprint arXiv:2510.27280 (2025)

[43] Zohar, O., Wang, X., Dubois, Y., Mehta, N., Xiao, T., Hansen-Estruch, P., Yu, L., Wang, X., Juefei-Xu, F., Zhang, N., et al.: Apollo: An exploration of video understanding in large multimodal models. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 18891–18901 (2025)