pith. machine review for the scientific record.

arxiv: 2512.08410 · v2 · submitted 2025-12-09 · 💻 cs.CV

Towards Effective Long Video Understanding of Multimodal Large Language Models via One-shot Clip Retrieval

Pith reviewed 2026-05-17 00:05 UTC · model grok-4.3

classification 💻 cs.CV
keywords long video understanding · multimodal large language models · retrieval-augmented generation · video clip retrieval · query-guided chunking · SynLongVideo dataset · MLLMs efficiency

The pith

OneClip-RAG lets MLLMs understand long videos by retrieving the relevant video clips in one shot while preserving semantic coherence and knowledge integrity.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces OneClip-RAG as a retrieval-augmented approach that retrieves video clips in one shot to let multimodal large language models handle videos far longer than their native frame limits allow. It pairs this with a query-guided chunking algorithm that performs segmentation and cross-modal retrieval together in a single step, avoiding extra computation while aiming to keep both complete information and coherent meaning across the whole video. The authors also create the SynLongVideo dataset and a progressive training schedule to strengthen instruction following. When plugged into existing models, the method delivers measurable gains on long-video benchmarks alongside much lower processing times. A sympathetic reader would care because the approach directly targets the memory bottleneck that currently restricts these models to short clips.

Core claim

OneClip-RAG is an effective and efficient paradigm for long video understanding in MLLMs. It makes full use of video clips for augmented understanding in terms of both knowledge integrity and semantic coherence, and it is equipped with a novel query-guided video chunking algorithm that unifies clip chunking and cross-modal retrieval in one processing step. The approach is further supported by the SynLongVideo dataset and progressive training to improve instruction following, and it is validated through integration with three recent MLLMs on long-video benchmarks.

What carries the argument

One-shot video-Clip based Retrieval-Augmented Generation (OneClip-RAG) with its query-guided video chunking algorithm, which unifies chunking and retrieval to preserve coherence.
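The paper's actual algorithm is not reproduced on this page, so the following is only an illustrative sketch of how chunking and retrieval could share a single pass. It assumes per-frame embeddings and a query embedding already exist in a shared space (for example from a CLIP-style encoder), scores each frame by cosine similarity, and picks the contiguous span whose summed similarity above a threshold is largest, using a maximum-subarray scan. The names `cosine` and `select_clip` and the `threshold` value are hypothetical and not taken from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def select_clip(frame_embs, query_emb, threshold=0.5):
    """Pick one contiguous clip in a single pass over the frames.

    Each frame contributes (similarity - threshold); the clip is the span
    with the largest positive total. Returns (start, end) inclusive frame
    indices, or None if no frame scores above the threshold.
    This is an assumed stand-in, not the authors' method.
    """
    best_sum, best_span = 0.0, None
    run_sum, run_start = 0.0, 0
    for i, emb in enumerate(frame_embs):
        gain = cosine(emb, query_emb) - threshold
        if run_sum <= 0.0:
            # A non-positive running total cannot help: start a new chunk here.
            run_sum, run_start = gain, i
        else:
            run_sum += gain
        if run_sum > best_sum:
            best_sum, best_span = run_sum, (run_start, i)
    return best_span


# Toy 2-D embeddings: frames 1 and 2 point roughly along the query direction.
frames = [[0.0, 1.0], [1.0, 0.0], [1.0, 0.1], [0.0, 1.0], [0.0, 1.0]]
query = [1.0, 0.0]
clip = select_clip(frames, query)
```

In this sketch the clip boundaries fall out of the same scan that ranks relevance, which is one way to read the "one processing step" claim; a real system would also need to cap clip length to fit the model's frame budget.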

If this is right

  • Boosts Qwen3-VL 8B performance to the level of GPT-5 on the MLVU benchmark.
  • Enables LLaVA-Video to process up to an hour of video in less than 1.2 minutes on a single 4090 GPU.
  • Delivers superior efficiency compared with prior video RAG methods by avoiding redundant computations.
  • Improves instruction following through the SynLongVideo dataset and progressive training regime.
  • Applies directly to multiple recent MLLMs with consistent gains on long-video tasks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The unified chunking-retrieval step could extend to other long-sequence modalities such as audio tracks or multi-page documents.
  • If retrieval errors remain low, the method may reduce the need for full-video processing in real-time video analysis applications.
  • Scaling tests on videos exceeding one hour would show whether additional retrieval rounds become necessary.

Load-bearing premise

That one-shot clip retrieval combined with query-guided chunking preserves full knowledge integrity and semantic coherence across the entire long video without omitting critical information or introducing retrieval errors.

What would settle it

A long-video question-answering benchmark in which key events or details required for correct answers lie outside the single retrieved clip, causing the augmented model to produce incorrect responses.
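One way such a benchmark could be scored is to measure how much of the annotated evidence for each question actually falls inside the retrieved clip. The sketch below assumes each question comes with non-overlapping evidence intervals in seconds and that the retriever returns a single `(start, end)` clip; the function name `evidence_coverage` and the example timestamps are hypothetical.

```python
def evidence_coverage(clip, evidence):
    """Fraction of total annotated evidence time that lies inside the clip.

    clip: (start, end) in seconds for the single retrieved clip.
    evidence: list of (start, end) intervals, assumed non-overlapping.
    """
    clip_start, clip_end = clip
    total = sum(end - start for start, end in evidence)
    if total <= 0:
        return 0.0
    covered = sum(
        max(0.0, min(end, clip_end) - max(start, clip_start))
        for start, end in evidence
    )
    return covered / total


# Illustrative case: half of the evidence sits far outside the retrieved clip.
partial = evidence_coverage((100.0, 200.0), [(120.0, 150.0), (900.0, 930.0)])
```

Questions with coverage below 1.0 are the ones where a one-shot clip cannot contain everything needed, so accuracy on that subset would speak directly to the premise above.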

Figures

Figures reproduced from arXiv: 2512.08410 by Chenxin Fang, Hui Li, Jun Peng, Kun Zhang, Qiong Wu, Rongrong Ji, Shaobo Ju, Tao Chen, Yiyi Zhou.

Figure 1: Comparisons between existing video RAG strategies and ...
Figure 2: Overview of OneClip-RAG. (a) As a plug-and-play design, OneClip-RAG first performs clip chunking based on the given video ...
Figure 3: Statistical overview of the proposed SynLongVideo dataset. SynLongVideo aims to improve the instruction-following capability ...
Figure 4: Efficiency and performance comparison between ...
Figure 5: Visualized comparisons between our OneClip-RAG and other Video-RAG methods. The green letters are ground-truth answers, ...

Original abstract

Due to excessive memory overhead, most Multimodal Large Language Models (MLLMs) can only process videos of limited frames. In this paper, we propose an effective and efficient paradigm to remedy this shortcoming, termed One-shot video-Clip based Retrieval-Augmented Generation (OneClip-RAG). Compared with existing video RAG methods, OneClip-RAG makes full use of the merits of video clips for augmented video understanding in terms of both knowledge integrity and semantic coherence. Besides, it is also equipped with a novel query-guided video chunking algorithm that can unify clip chunking and cross-modal retrieval in one processing step, avoiding redundant computations. To improve instruction following, we further propose a new dataset called SynLongVideo and design a progressive training regime for OneClip-RAG. OneClip-RAG is plugged into three recent MLLMs and validated on a set of long-video benchmarks. Experimental results not only show the obvious performance gains by OneClip-RAG over MLLMs, e.g., boosting Qwen3-VL 8B to the level of GPT-5 on MLVU, but also show its superior efficiency in handling long videos. e.g., enabling LLaVA-Video understand up to an hour of videos in less than 1.2 minutes on a single 4090 GPU.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces OneClip-RAG, a retrieval-augmented generation framework for long-video understanding in MLLMs. It relies on one-shot video-clip retrieval combined with a query-guided chunking algorithm that unifies chunking and cross-modal retrieval in a single step, aiming to preserve knowledge integrity and semantic coherence while reducing memory and compute costs. The approach is augmented by the SynLongVideo dataset and progressive training, then plugged into three existing MLLMs and evaluated on long-video benchmarks, with reported gains such as elevating Qwen3-VL 8B performance to GPT-5 levels on MLVU and enabling hour-long video processing in under 1.2 minutes on a single 4090 GPU.

Significance. If the empirical claims hold under rigorous controls, the work would offer a practical route to scaling MLLM video understanding beyond short clips without prohibitive memory overhead. The emphasis on clip-level retrieval for coherence and the unified query-guided chunking step represent a targeted engineering contribution that could influence future video RAG designs, particularly for open-source models.

major comments (3)
  1. [Abstract and §4 (Experimental Setup)] The central performance claims (e.g., Qwen3-VL 8B reaching GPT-5 level on MLVU) are presented without specification of exact baselines, data splits, statistical tests, error bars, or multiple-run averages. This omission makes it impossible to assess whether the reported gains are robust or attributable to post-hoc choices, directly undermining evaluation of the method's effectiveness.
  2. [§3.2 (Query-Guided Video Chunking)] The algorithm is asserted to preserve full knowledge integrity and semantic coherence by using query relevance as the selection criterion, yet no quantitative measurement of event recall, information-loss rate, or narrative coherence (e.g., via human evaluation or proxy metrics on temporally distant setup/background events) is reported. This leaves the core assumption (that query-aligned chunks suffice for complete video understanding) unfalsified and load-bearing for the integrity claim.
  3. [§4.3 (Efficiency Evaluation)] The efficiency result (LLaVA-Video processing up to one hour of video in <1.2 minutes on a 4090) lacks comparison against standard frame-sampling or existing video RAG baselines under identical hardware and video-length conditions, and does not clarify the precise frame rate or token budget used, rendering the superiority claim difficult to interpret or reproduce.
minor comments (2)
  1. [Abstract] The efficiency sentence in the abstract has two grammatical slips: a period instead of a comma before 'e.g.', and a missing 'to' in 'enabling LLaVA-Video understand'.
  2. [§3.2] Notation for the unified chunking-retrieval step could be clarified with a short pseudocode block or equation to make the 'one processing step' claim more precise.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We have revised the manuscript to address all major comments by adding the requested experimental details, quantitative evaluations, and comparative analyses.

point-by-point responses
  1. Referee: [Abstract and §4 (Experimental Setup)] The central performance claims (e.g., Qwen3-VL 8B reaching GPT-5 level on MLVU) are presented without specification of exact baselines, data splits, statistical tests, error bars, or multiple-run averages. This omission makes it impossible to assess whether the reported gains are robust or attributable to post-hoc choices, directly undermining evaluation of the method's effectiveness.

    Authors: We agree that the original presentation lacked sufficient detail for assessing robustness. In the revised manuscript, we have expanded the abstract and §4 to specify all baselines (vanilla MLLMs and prior RAG methods), exact data splits for each benchmark, averages over three independent runs with standard deviation error bars, and paired t-test results for statistical significance.
    revision: yes

  2. Referee: [§3.2 (Query-Guided Video Chunking)] The algorithm is asserted to preserve full knowledge integrity and semantic coherence by using query relevance as the selection criterion, yet no quantitative measurement of event recall, information-loss rate, or narrative coherence (e.g., via human evaluation or proxy metrics on temporally distant setup/background events) is reported. This leaves the core assumption (that query-aligned chunks suffice for complete video understanding) unfalsified and load-bearing for the integrity claim.

    Authors: The referee is correct that direct quantitative validation of knowledge preservation was missing. While end-to-end gains offer indirect support, the revised §3.2 now includes event recall and information-loss metrics computed on annotated video subsets, plus a small-scale human evaluation of narrative coherence for temporally distant events.
    revision: yes

  3. Referee: [§4.3 (Efficiency Evaluation)] The efficiency result (LLaVA-Video processing up to one hour of video in <1.2 minutes on a 4090) lacks comparison against standard frame-sampling or existing video RAG baselines under identical hardware and video-length conditions, and does not clarify the precise frame rate or token budget used, rendering the superiority claim difficult to interpret or reproduce.

    Authors: We agree that comparative baselines and implementation details are essential for interpretability. The revised §4.3 now reports efficiency results against standard frame-sampling and existing video RAG methods under identical hardware and video lengths, and explicitly states the 1 FPS sampling rate and token budget used.
    revision: yes
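To make the efficiency figure concrete, the arithmetic below works out what end-to-end rate the headline claim implies. It takes the 1 FPS sampling rate from the simulated response above (not independently confirmed from the paper) and treats 1.2 minutes as the wall-clock upper bound for a one-hour video; the helper name `required_throughput` is hypothetical, and whether the reported time includes video decoding is not stated here.

```python
def required_throughput(video_seconds, sample_fps, wall_clock_seconds):
    """Return (sampled frame count, frames per second of wall-clock time)."""
    frames = video_seconds * sample_fps
    return frames, frames / wall_clock_seconds


# One hour of video, 1 frame sampled per second, finished within 1.2 min = 72 s.
frames, fps_needed = required_throughput(3600, 1.0, 72.0)
```

Under these assumptions the pipeline has to get through 3600 sampled frames at no less than 50 frames per second of wall-clock time, which is the kind of number a like-for-like baseline comparison would need to report.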

Circularity Check

0 steps flagged

No significant circularity; method is an independent engineering contribution

full rationale

The paper introduces OneClip-RAG as a new retrieval-augmented paradigm with a query-guided chunking algorithm and the SynLongVideo dataset for training. No equations, fitted parameters, or derivations are present that reduce to inputs by construction. Performance claims rest on external benchmarks and efficiency measurements rather than self-referential definitions or load-bearing self-citations. The central components (one-shot clip retrieval and unified chunking) are presented as novel additions with independent validation, making the work self-contained against external evaluation.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Based solely on the abstract, the paper introduces no explicitly stated free parameters, mathematical axioms, or new invented entities. The central claim rests on the empirical effectiveness of the retrieval-augmented pipeline and the assumption that clip-level retrieval suffices for long-video coherence.

pith-pipeline@v0.9.0 · 5558 in / 1220 out tokens · 41394 ms · 2026-05-17T00:05:35.370541+00:00 · methodology

