Towards Effective Long Video Understanding of Multimodal Large Language Models via One-shot Clip Retrieval

2025 · cs.CV · arXiv 2512.08410

1 Pith paper cites this work. Polarity classification is still indexing.

abstract

Due to excessive memory overhead, most Multimodal Large Language Models (MLLMs) can only process videos with a limited number of frames. In this paper, we propose an effective and efficient paradigm to remedy this shortcoming, termed One-shot video-Clip based Retrieval-Augmented Generation (OneClip-RAG). Compared with existing video RAG methods, OneClip-RAG makes full use of the merits of video clips for augmented video understanding in terms of both knowledge integrity and semantic coherence. It is also equipped with a novel query-guided video chunking algorithm that unifies clip chunking and cross-modal retrieval in one processing step, avoiding redundant computation. To improve instruction following, we further propose a new dataset called SynLongVideo and design a progressive training regime for OneClip-RAG. OneClip-RAG is plugged into three recent MLLMs and validated on a set of long-video benchmarks. Experimental results show not only clear performance gains of OneClip-RAG over the base MLLMs, e.g., boosting Qwen3-VL 8B to the level of GPT-5 on MLVU, but also its superior efficiency in handling long videos, e.g., enabling LLaVA-Video to understand up to an hour of video in less than 1.2 minutes on a single 4090 GPU.

representative citing papers

QCA: Query- and Content-Aware Keyframe Selection for Long Video Understanding

cs.CV · 2026-07-01 · unverdicted · novelty 5.0

QCA selects compact, query-relevant keyframes from long videos via segment-wise budget allocation and diversity-aware addition, achieving higher accuracy than GPT-4o on LongVideoBench with half the frames.

citing papers explorer

Showing 1 of 1 citing paper after filters.

QCA: Query- and Content-Aware Keyframe Selection for Long Video Understanding
cs.CV · 2026-07-01 · unverdicted · none · ref 8 · internal anchor
