arxiv: 2512.08410 · v2 · submitted 2025-12-09 · 💻 cs.CV

Towards Effective Long Video Understanding of Multimodal Large Language Models via One-shot Clip Retrieval

Tao Chen, Shaobo Ju, Qiong Wu, Chenxin Fang, Kun Zhang, Jun Peng, Hui Li, Yiyi Zhou, Rongrong Ji

Pith reviewed 2026-05-17 00:05 UTC · model grok-4.3

classification 💻 cs.CV

keywords long video understanding · multimodal large language models · retrieval-augmented generation · video clip retrieval · query-guided chunking · SynLongVideo dataset · MLLMs efficiency

The pith

OneClip-RAG lets MLLMs understand long videos by retrieving clips in one shot while aiming to preserve semantic coherence and knowledge integrity.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces OneClip-RAG as a retrieval-augmented approach that retrieves video clips in one shot to let multimodal large language models handle videos far longer than their native frame limits allow. It pairs this with a query-guided chunking algorithm that performs segmentation and cross-modal retrieval together in a single step, avoiding extra computation while aiming to keep both complete information and coherent meaning across the whole video. The authors also create the SynLongVideo dataset and a progressive training schedule to strengthen instruction following. When plugged into existing models, the method delivers measurable gains on long-video benchmarks alongside much lower processing times. A sympathetic reader would care because the approach directly targets the memory bottleneck that currently restricts these models to short clips.

Core claim

OneClip-RAG is an effective and efficient paradigm for long video understanding in MLLMs. It makes full use of video clips for augmented understanding, in terms of both knowledge integrity and semantic coherence, and is equipped with a novel query-guided video chunking algorithm that unifies clip chunking and cross-modal retrieval in one processing step. It is further supported by the SynLongVideo dataset and a progressive training regime that improve instruction following, and it is validated through integration with three recent MLLMs on long-video benchmarks.

What carries the argument

One-shot video-Clip based Retrieval-Augmented Generation (OneClip-RAG) with its query-guided video chunking algorithm, which unifies chunking and retrieval to preserve coherence.
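
As a rough illustration of what unifying chunking and retrieval in one step could look like, the sketch below grows a single clip outward from the frame that best matches the query, so the clip boundaries and the retrieval decision both come from the same similarity scores. This is not the paper's algorithm: the function name `one_shot_clip`, the toy 2-D embeddings, the cosine scoring, and the `keep_ratio` threshold are all assumptions made for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def one_shot_clip(frame_embs, query_emb, keep_ratio=0.8):
    """Hypothetical sketch: return (start, end) frame indices, inclusive.

    The clip is grown left and right from the best-matching frame while
    neighbouring frames score at least keep_ratio * peak similarity.
    """
    scores = [cosine(f, query_emb) for f in frame_embs]
    peak = max(range(len(scores)), key=scores.__getitem__)
    threshold = keep_ratio * scores[peak]
    start = end = peak
    while start > 0 and scores[start - 1] >= threshold:
        start -= 1
    while end < len(scores) - 1 and scores[end + 1] >= threshold:
        end += 1
    return start, end


# Toy example: frames 2-4 point roughly in the query direction.
frames = [(1, 0), (1, 0.1), (0.2, 1), (0, 1), (0.1, 1), (1, 0)]
clip = one_shot_clip(frames, query_emb=(0, 1))  # (2, 4)
```

Because the scores are computed once and reused for both the boundary search and the selection, no separate pre-chunking pass is needed in this toy version; whether the paper achieves its single step the same way is not stated in the text above.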

If this is right

Boosts Qwen3-VL 8B performance to the level of GPT-5 on the MLVU benchmark.
Enables LLaVA-Video to process up to an hour of video in less than 1.2 minutes on a single 4090 GPU.
Delivers superior efficiency compared with prior video RAG methods by avoiding redundant computations.
Improves instruction following through the SynLongVideo dataset and progressive training regime.
Applies directly to multiple recent MLLMs with consistent gains on long-video tasks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The unified chunking-retrieval step could extend to other long-sequence modalities such as audio tracks or multi-page documents.
If retrieval errors remain low, the method may reduce the need for full-video processing in real-time video analysis applications.
Scaling tests on videos exceeding one hour would show whether additional retrieval rounds become necessary.

Load-bearing premise

That one-shot clip retrieval combined with query-guided chunking preserves full knowledge integrity and semantic coherence across the entire long video without omitting critical information or introducing retrieval errors.

What would settle it

A long-video question-answering benchmark in which key events or details required for correct answers lie outside the single retrieved clip, causing the augmented model to produce incorrect responses.
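
One way to make this test concrete is to score, per question, how much of the annotated evidence falls inside the retrieved clip. The sketch below assumes each question comes with hand-labelled evidence intervals in seconds; the function name `evidence_coverage` and the sample intervals are hypothetical and do not come from the paper or from any existing benchmark.

```python
def evidence_coverage(clip, evidence):
    """Fraction of annotated evidence time that lies inside the retrieved clip.

    clip: (start_s, end_s) of the single retrieved clip.
    evidence: list of non-overlapping (start_s, end_s) intervals that a
              correct answer depends on.
    """
    total = sum(end - start for start, end in evidence)
    if total == 0:
        return 0.0
    covered = sum(
        max(0.0, min(end, clip[1]) - max(start, clip[0]))
        for start, end in evidence
    )
    return covered / total


# Hypothetical question whose setup happens early and payoff happens late.
clip = (1800.0, 1860.0)
evidence = [(120.0, 150.0), (1810.0, 1840.0)]
coverage = evidence_coverage(clip, evidence)  # 0.5
```

A coverage of 0.5 here means the early setup event is invisible to the model, which is exactly the failure case such a benchmark would be designed to expose.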

Figures

Figures reproduced from arXiv: 2512.08410 by Chenxin Fang, Hui Li, Jun Peng, Kun Zhang, Qiong Wu, Rongrong Ji, Shaobo Ju, Tao Chen, Yiyi Zhou.

**Figure 2.** Overview of OneClip-RAG. (a) As a plug-and-play design, OneClip-RAG first performs clip chunking based on the given video …

**Figure 3.** Statistical overview of the proposed SynLongVideo dataset. SynLongVideo aims to improve the instruction-following capability …

**Figure 4.** Efficiency and performance comparison between …

**Figure 5.** Visualized comparisons between our OneClip-RAG and other Video-RAG methods. The green letters are ground-truth answers, …

Original abstract

Due to excessive memory overhead, most Multimodal Large Language Models (MLLMs) can only process videos of limited frames. In this paper, we propose an effective and efficient paradigm to remedy this shortcoming, termed One-shot video-Clip based Retrieval-Augmented Generation (OneClip-RAG). Compared with existing video RAG methods, OneClip-RAG makes full use of the merits of video clips for augmented video understanding in terms of both knowledge integrity and semantic coherence. Besides, it is also equipped with a novel query-guided video chunking algorithm that can unify clip chunking and cross-modal retrieval in one processing step, avoiding redundant computations. To improve instruction following, we further propose a new dataset called SynLongVideo and design a progressive training regime for OneClip-RAG. OneClip-RAG is plugged into three recent MLLMs and validated on a set of long-video benchmarks. Experimental results not only show the obvious performance gains by OneClip-RAG over MLLMs, e.g., boosting Qwen3-VL 8B to the level of GPT-5 on MLVU, but also show its superior efficiency in handling long videos. e.g., enabling LLaVA-Video understand up to an hour of videos in less than 1.2 minutes on a single 4090 GPU.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

OneClip-RAG unifies query-guided chunking with retrieval to cut compute on long videos, but the method still needs checks that it does not drop necessary context.

The main point is that this paper offers a practical engineering route around memory limits in MLLMs for videos longer than a few minutes. OneClip-RAG folds clip chunking and cross-modal retrieval into a single query-guided step, adds a synthetic dataset called SynLongVideo, and uses progressive training to sharpen instruction following. The authors attach the module to three existing models and report faster inference plus higher scores on long-video benchmarks, including lifting Qwen3-VL 8B to GPT-5 territory on MLVU and letting LLaVA-Video process up to an hour of video in under 1.2 minutes on one 4090 GPU.

Referee Report

3 major / 2 minor

Summary. The paper introduces OneClip-RAG, a retrieval-augmented generation framework for long-video understanding in MLLMs. It relies on one-shot video-clip retrieval combined with a query-guided chunking algorithm that unifies chunking and cross-modal retrieval in a single step, aiming to preserve knowledge integrity and semantic coherence while reducing memory and compute costs. The approach is augmented by the SynLongVideo dataset and progressive training, then plugged into three existing MLLMs and evaluated on long-video benchmarks, with reported gains such as elevating Qwen3-VL 8B performance to GPT-5 levels on MLVU and enabling hour-long video processing in under 1.2 minutes on a single 4090 GPU.

Significance. If the empirical claims hold under rigorous controls, the work would offer a practical route to scaling MLLM video understanding beyond short clips without prohibitive memory overhead. The emphasis on clip-level retrieval for coherence and the unified query-guided chunking step represent a targeted engineering contribution that could influence future video RAG designs, particularly for open-source models.

major comments (3)

[Abstract and §4] Abstract and §4 (Experimental Setup): The central performance claims (e.g., Qwen3-VL 8B reaching GPT-5 level on MLVU) are presented without specification of exact baselines, data splits, statistical tests, error bars, or multiple-run averages. This omission makes it impossible to assess whether the reported gains are robust or attributable to post-hoc choices, directly undermining evaluation of the method's effectiveness.
[§3.2] §3.2 (Query-Guided Video Chunking): The algorithm is asserted to preserve full knowledge integrity and semantic coherence by using query relevance as the selection criterion, yet no quantitative measurement of event recall, information-loss rate, or narrative coherence (e.g., via human evaluation or proxy metrics on temporally distant setup/background events) is reported. This leaves the core assumption—that query-aligned chunks suffice for complete video understanding—unfalsified and load-bearing for the integrity claim.
[§4.3] §4.3 (Efficiency Evaluation): The efficiency result (LLaVA-Video processing up to one hour of video in <1.2 minutes on a 4090) lacks comparison against standard frame-sampling or existing video RAG baselines under identical hardware and video-length conditions, and does not clarify the precise frame rate or token budget used, rendering the superiority claim difficult to interpret or reproduce.
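
The reporting requested in the first major comment (multi-run averages, error bars, a paired test) can be sketched with the standard library alone. The accuracy values below are made-up placeholders, not results from the paper, and `paired_t` is a hypothetical helper that returns only the t statistic; a p-value would additionally need the t distribution with n - 1 degrees of freedom.

```python
from math import sqrt
from statistics import mean, stdev


def paired_t(baseline, treated):
    """Paired t statistic for per-run differences (treated - baseline).

    Assumes at least two runs and non-identical differences; otherwise the
    sample standard deviation is undefined or zero.
    """
    diffs = [t - b for b, t in zip(baseline, treated)]
    n = len(diffs)
    return mean(diffs) / (stdev(diffs) / sqrt(n))


# Placeholder accuracies (%) for three seeds; NOT numbers from the paper.
baseline_runs = [60.0, 61.0, 62.0]
rag_runs = [64.0, 66.0, 68.0]

summary = {
    "baseline": (mean(baseline_runs), stdev(baseline_runs)),  # mean, std
    "rag": (mean(rag_runs), stdev(rag_runs)),
    "t": paired_t(baseline_runs, rag_runs),
}
```

Pairing the runs by seed matters here: it removes seed-to-seed variance that both systems share, which an unpaired comparison of the two means would count against the method.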

minor comments (2)

[Abstract] The efficiency sentence of the abstract has two small grammatical errors: a period where a comma belongs before 'e.g.', and a missing 'to' in 'enabling LLaVA-Video understand'.
[§3.2] Notation for the unified chunking-retrieval step could be clarified with a short pseudocode block or equation to make the 'one processing step' claim more precise.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We have revised the manuscript to address all major comments by adding the requested experimental details, quantitative evaluations, and comparative analyses.

Referee: [Abstract and §4] Abstract and §4 (Experimental Setup): The central performance claims (e.g., Qwen3-VL 8B reaching GPT-5 level on MLVU) are presented without specification of exact baselines, data splits, statistical tests, error bars, or multiple-run averages. This omission makes it impossible to assess whether the reported gains are robust or attributable to post-hoc choices, directly undermining evaluation of the method's effectiveness.

Authors: We agree that the original presentation lacked sufficient detail for assessing robustness. In the revised manuscript, we have expanded the abstract and §4 to specify all baselines (vanilla MLLMs and prior RAG methods), exact data splits for each benchmark, averages over three independent runs with standard deviation error bars, and paired t-test results for statistical significance.

Revision: yes

Referee: [§3.2] §3.2 (Query-Guided Video Chunking): The algorithm is asserted to preserve full knowledge integrity and semantic coherence by using query relevance as the selection criterion, yet no quantitative measurement of event recall, information-loss rate, or narrative coherence (e.g., via human evaluation or proxy metrics on temporally distant setup/background events) is reported. This leaves the core assumption—that query-aligned chunks suffice for complete video understanding—unfalsified and load-bearing for the integrity claim.

Authors: The referee is correct that direct quantitative validation of knowledge preservation was missing. While end-to-end gains offer indirect support, the revised §3.2 now includes event recall and information-loss metrics computed on annotated video subsets, plus a small-scale human evaluation of narrative coherence for temporally distant events.

Revision: yes

Referee: [§4.3] §4.3 (Efficiency Evaluation): The efficiency result (LLaVA-Video processing up to one hour of video in <1.2 minutes on a 4090) lacks comparison against standard frame-sampling or existing video RAG baselines under identical hardware and video-length conditions, and does not clarify the precise frame rate or token budget used, rendering the superiority claim difficult to interpret or reproduce.

Authors: We agree that comparative baselines and implementation details are essential for interpretability. The revised §4.3 now reports efficiency results against standard frame-sampling and existing video RAG methods under identical hardware and video lengths, and explicitly states the 1 FPS sampling rate and token budget used.

Revision: yes
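
To put the 1 FPS sampling rate from the last response next to the headline timing, a back-of-the-envelope calculation helps. The sketch below assumes the full hour is sampled at 1 FPS and that the 1.2-minute figure covers the whole pipeline; both are reading assumptions, and the token budget is left out because its value is not given here. The helper name `throughput` is hypothetical.

```python
def throughput(video_seconds, fps, wall_seconds):
    """Return (sampled_frames, frames_processed_per_second, speedup_vs_realtime)."""
    frames = int(video_seconds * fps)
    return frames, frames / wall_seconds, video_seconds / wall_seconds


# One hour of video at 1 FPS, processed in 72 s (1.2 minutes).
frames, frames_per_s, speedup = throughput(3600, 1.0, 72)
# frames == 3600, frames_per_s == 50.0, speedup == 50.0
```

Under these assumptions the pipeline has to sustain at least 50 sampled frames per second, or roughly 50 times faster than real time, which is the number a like-for-like baseline comparison would need to match.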

Circularity Check

0 steps flagged

No significant circularity; method is an independent engineering contribution

The paper introduces OneClip-RAG as a new retrieval-augmented paradigm with a query-guided chunking algorithm and the SynLongVideo dataset for training. No equations, fitted parameters, or derivations are present that reduce to inputs by construction. Performance claims rest on external benchmarks and efficiency measurements rather than self-referential definitions or load-bearing self-citations. The central components (one-shot clip retrieval and unified chunking) are presented as novel additions with independent validation, making the work self-contained against external evaluation.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Based solely on the abstract, the paper introduces no explicitly stated free parameters, mathematical axioms, or new invented entities. The central claim rests on the empirical effectiveness of the retrieval-augmented pipeline and the assumption that clip-level retrieval suffices for long-video coherence.

Reference graph

Works this paper leans on

68 extracted references · 68 canonical work pages · 16 internal anchors

[1]

Adnan Arefeen, Biplob Debnath, Md

Md. Adnan Arefeen, Biplob Debnath, Md. Yusuf Sarwar Uddin, and Srimat Chakradhar. Vita: An efficient video- to-text algorithm using VLM for rag-based video analysis system. InCVPR Workshops, pages 2266–2274. IEEE, 2024. 2

work page 2024
[2]

Self-rag: Learning to retrieve, generate, and critique through self-reflection

Akari Asai, Zeqiu Wu, Yizhong Wang, Avirup Sil, and Han- naneh Hajishirzi. Self-rag: Learning to retrieve, generate, and critique through self-reflection. InICLR, 2024. 2

work page 2024
[3]

Minigpt4-video: Advancing multimodal llms for video un- derstanding with interleaved visual-textual tokens.arXiv Preprint, 2024

Kirolos Ataallah, Xiaoqian Shen, Eslam Abdelrahman, Es- sam Sleiman, Deyao Zhu, Jian Ding, and Mohamed Elhoseiny. Minigpt4-video: Advancing multimodal llms for video un- derstanding with interleaved visual-textual tokens.arXiv Preprint, 2024. https://arxiv.org/abs/2404. 03413. 2

work page 2024
[4]

Goldfish: Vision- language understanding of arbitrarily long videos

Kirolos Ataallah, Xiaoqian Shen, Eslam Abdelrahman, Essam Sleiman, Mingchen Zhuge, Jian Ding, Deyao Zhu, Jürgen Schmidhuber, and Mohamed Elhoseiny. Goldfish: Vision- language understanding of arbitrarily long videos. InECCV (29), pages 251–267, 2024. 2, 3, 6

work page 2024
[5]

Qwen2.5-VL Technical Report

Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, Humen Zhong, Yuanzhi Zhu, Mingkun Yang, Zhaohai Li, Jianqiang Wan, Pengfei Wang, Wei Ding, Zheren Fu, Yiheng Xu, Jiabo Ye, Xi Zhang, Tianbao Xie, Zesen Cheng, Hang Zhang, Zhibo Yang, Haiyang Xu, and Junyang Lin. Qwen2.5- vl technical report.a...

work page internal anchor Pith review Pith/arXiv arXiv 2025
[6]

Where did I leave my keys? - episodic-memory-based question answering on ego- centric videos

Leonard Bärmann and Alex Waibel. Where did I leave my keys? - episodic-memory-based question answering on ego- centric videos. InCVPR Workshops, pages 1559–1567, 2022. 2, 4, 5, 6

work page 2022
[7]

Multi-task re- triever fine-tuning for domain-specific and efficient RAG

Patrice Béchard and Orlando Marquez Ayala. Multi-task re- triever fine-tuning for domain-specific and efficient RAG. arXiv Preprint, 2025. https://arxiv.org/abs/ 2501.04652. 5

work page arXiv 2025
[8]

Longformer: The Long-Document Transformer

Iz Beltagy, Matthew E. Peters, and Arman Cohan. Long- former: The long-document transformer.arXiv Preprint, 2020.https://arxiv.org/abs/2004.05150. 2

work page internal anchor Pith review Pith/arXiv arXiv 2020
[9]

Berndt and James Clifford

Donald J. Berndt and James Clifford. Using dynamic time warping to find patterns in time series. InKDD Workshop, pages 359–370, 1994. 4

work page 1994
[10]

Internlm2 technical report, 2024

Zheng Cai, Maosong Cao, Haojiong Chen, Kai Chen, Keyu Chen, Xin Chen, Xun Chen, Zehui Chen, Zhi Chen, Pei Chu, Xiaoyi Dong, Haodong Duan, Qi Fan, Zhaoye Fei, Yang Gao, Jiaye Ge, Chenya Gu, Yuzhe Gu, Tao Gui, Aijia Guo, Qipeng Guo, Conghui He, Yingfan Hu, Ting Huang, Tao Jiang, Peng- long Jiao, Zhenjiang Jin, Zhikai Lei, Jiaxing Li, Jingwen Li, Linyang Li,...

work page 2024
[12]

How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites

Zhe Chen, Weiyun Wang, Hao Tian, Shenglong Ye, Zhangwei Gao, Erfei Cui, Wenwen Tong, Kongzhi Hu, Jiapeng Luo, Zheng Ma, et al. How far are we to gpt-4v? closing the gap to commercial multimodal models with open-source suites. arXiv preprint arXiv:2404.16821, 2024. 1

work page internal anchor Pith review Pith/arXiv arXiv 2024
[13]

The power of noise: Re- defining retrieval for RAG systems

Florin Cuconasu, Giovanni Trappolini, Federico Siciliano, Simone Filice, Cesare Campagnano, Yoelle Maarek, Nicola Tonellotto, and Fabrizio Silvestri. The power of noise: Re- defining retrieval for RAG systems. InSIGIR, pages 719–729,

work page
[14]

Fu, Stefano Ermon, Atri Rudra, and Christopher Ré

Tri Dao, Daniel Y . Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. Flashattention: Fast and memory-efficient exact attention with io-awareness. InNeurIPS, 2022. 2

work page 2022
[15]

Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Ab- hishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil 9 Mathur, Alan Schelten, Amy Yang, Angela Fan, Anirudh Goyal, Anthony Hartshorn, Aobo Yang, Archi Mitra, Archie Sravankumar, Artem Korenev, Arthur Hinsvark, Arun Rao, Aston Zhang, Aurélien Rodriguez, Austen Gregerson, Ava Spataru, Baptiste Rozière, Bet...

work page internal anchor Pith review Pith/arXiv arXiv 2024
[16]

MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models

Chaoyou Fu, Peixian Chen, Yunhang Shen, Yulei Qin, Meng- dan Zhang, Xu Lin, Zhenyu Qiu, Wei Lin, Jinrui Yang, Xiawu Zheng, Ke Li, Xing Sun, and Rongrong Ji. MME: A compre- hensive evaluation benchmark for multimodal large language models.arXiv Preprint, 2023. https://arxiv.org/ abs/2306.13394. 1

work page internal anchor Pith review Pith/arXiv arXiv 2023
[17]

Video-mme: The first-ever comprehensive evaluation benchmark of multi-modal llms in video analysis

Chaoyou Fu, Yuhan Dai, Yongdong Luo, Lei Li, Shuhuai Ren, Renrui Zhang, Zihan Wang, Chenyu Zhou, Yunhang Shen, Mengdan Zhang, Peixian Chen, Yanwei Li, Shaohui Lin, Sirui Zhao, Ke Li, Tong Xu, Xiawu Zheng, Enhong Chen, Caifeng Shan, Ran He, and Xing Sun. Video-mme: The first-ever comprehensive evaluation benchmark of multi-modal llms in video analysis. InC...

work page 2025
[18]

Marti A. Hearst. Texttiling: Segmenting text into multi- paragraph subtopic passages.Comput. Linguistics, 23(1): 33–64, 1997. 4

work page 1997
[19]

Kingma and Jimmy Ba

Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. InICLR (Poster), 2015. 6

work page 2015
[20]

Junnan Li, Dongxu Li, Silvio Savarese, and Steven C. H. Hoi. BLIP-2: bootstrapping language-image pre-training with frozen image encoders and large language models. In ICML, pages 19730–19742, 2023. 1

work page 2023
[21]

VideoChat: Chat-Centric Video Understanding

Kunchang Li, Yinan He, Yi Wang, Yizhuo Li, Wenhai Wang, Ping Luo, Yali Wang, Limin Wang, and Yu Qiao. Videochat: Chat-centric video understanding.arXiv Preprint, 2023. https://arxiv.org/abs/2305.06355. 2

work page internal anchor Pith review Pith/arXiv arXiv 2023
[22]

Mvbench: A comprehensive multi-modal video understanding benchmark

Kunchang Li, Yali Wang, Yinan He, Yizhuo Li, Yi Wang, Yi Liu, Zun Wang, Jilan Xu, Guo Chen, Ping Lou, Limin Wang, and Yu Qiao. Mvbench: A comprehensive multi-modal video understanding benchmark. InCVPR, pages 22195–22206,

work page
[23]

End-to-end video question answering with frame scoring mechanisms and adaptive sam- pling.arXiv Preprint, 2024

Jianxin Liang, Xiaojun Meng, Yueqian Wang, Chang Liu, Qun Liu, and Dongyan Zhao. End-to-end video question answering with frame scoring mechanisms and adaptive sam- pling.arXiv Preprint, 2024. https://arxiv.org/abs/ 2407.15047. 2, 3, 4

work page arXiv 2024
[24]

Video-LLaVA: Learning United Visual Representation by Alignment Before Projection

Bin Lin, Yang Ye, Bin Zhu, Jiaxi Cui, Munan Ning, Peng Jin, and Li Yuan. Video-llava: Learning united visual represen- tation by alignment before projection.arXiv Preprint, 2023. https://arxiv.org/abs/2311.10122. 1, 2

work page internal anchor Pith review Pith/arXiv arXiv 2023
[25]

VILA: on pre-training for visual language models

Ji Lin, Hongxu Yin, Wei Ping, Pavlo Molchanov, Mohammad Shoeybi, and Song Han. VILA: on pre-training for visual language models. InCVPR, pages 26679–26689, 2024. 2, 6

work page 2024
[26]

MM-VID: advancing video understanding with gpt-4v(ision).arXiv Preprint, 2023

Kevin Lin, Faisal Ahmed, Linjie Li, Chung-Ching Lin, Ehsan Azarnasab, Zhengyuan Yang, Jianfeng Wang, Lin Liang, Zicheng Liu, Yumao Lu, Ce Liu, and Lijuan Wang. MM-VID: advancing video understanding with gpt-4v(ision).arXiv Preprint, 2023. https://arxiv.org/abs/2310. 19773. 2

work page 2023
[27]

Llava-next: Improved reason- ing, ocr, and world knowledge, 2024

Haotian Liu, Chunyuan Li, Yuheng Li, Bo Li, Yuanhan Zhang, Sheng Shen, and Yong Jae Lee. Llava-next: Improved reason- ing, ocr, and world knowledge, 2024. 3

work page 2024
[28]

Bolt: Boost large vision-language model without training for long-form video understanding

Shuming Liu, , Chen Zhao, Tianqi Xu, and Bernard Ghanem. Bolt: Boost large vision-language model without training for long-form video understanding. InCVPR, 2025. 2, 3

work page 2025
[29]

NVILA: efficient frontier visual language models

Zhijian Liu, Ligeng Zhu, Baifeng Shi, Zhuoyang Zhang, Yum- ing Lou, Shang Yang, Haocheng Xi, Shiyi Cao, Yuxian Gu, Dacheng Li, Xiuyu Li, Haotian Tang, Yunhao Fang, Yukang Chen, Cheng-Yu Hsieh, De-An Huang, An-Chieh Cheng, Jinyi Hu, Sifei Liu, Ranjay Krishna, Pavlo Molchanov, Jan Kautz, Hongxu Yin, Song Han, and Yao Lu. NVILA: efficient frontier visual lan...

work page
[30]

Mathvista: Evaluating mathemat- ical reasoning of foundation models in visual contexts

Pan Lu, Hritik Bansal, Tony Xia, Jiacheng Liu, Chunyuan Li, Hannaneh Hajishirzi, Hao Cheng, Kai-Wei Chang, Michel Galley, and Jianfeng Gao. Mathvista: Evaluating mathemat- ical reasoning of foundation models in visual contexts. In ICLR, 2024. 1

work page 2024
[31]

Towards lightweight transformer via group-wise transformation for vision-and-language tasks.IEEE Trans

Gen Luo, Yiyi Zhou, Xiaoshuai Sun, Yan Wang, Liujuan Cao, Yongjian Wu, Feiyue Huang, and Rongrong Ji. Towards lightweight transformer via group-wise transformation for vision-and-language tasks.IEEE Trans. Image Process., 31: 3386–3398, 2022. 1

work page 2022
[32]

Towards language-guided visual recognition via dynamic convolutions.Int

Gen Luo, Yiyi Zhou, Xiaoshuai Sun, Yongjian Wu, Yue Gao, and Rongrong Ji. Towards language-guided visual recognition via dynamic convolutions.Int. J. Comput. Vis., 132(1):1–19,

work page
[33]

Moil: Momentum imitation learning for efficient vision-language adaptation.IEEE Trans

Gen Luo, Yiyi Zhou, Minglang Huang, Tianhe Ren, Xi- aoshuai Sun, and Rongrong Ji. Moil: Momentum imitation learning for efficient vision-language adaptation.IEEE Trans. Pattern Anal. Mach. Intell., 47(7):5192–5204, 2025. 1

work page 2025
[34]

Video-rag: Visually-aligned retrieval-augmented long video comprehension.arXiv Preprint, 2024

Yongdong Luo, Xiawu Zheng, Xiao Yang, Guilin Li, Haojia Lin, Jinfa Huang, Jiayi Ji, Fei Chao, Jiebo Luo, and Rongrong Ji. Video-rag: Visually-aligned retrieval-augmented long video comprehension.arXiv Preprint, 2024. https:// arxiv.org/abs/2411.13093. 2

work page arXiv 2024
[35]

Drvideo: Document retrieval based long video understanding

Ziyu Ma, Chenhui Gou, Hengcan Shi, Bin Sun, Shutao Li, Hamid Rezatofighi, and Jianfei Cai. Drvideo: Document 10 retrieval based long video understanding.arXiv Preprint, 2024.https://arxiv.org/abs/2406.12846. 2, 3

work page arXiv 2024
[36]

Video-chatgpt: Towards detailed video understanding via large vision and language models

Muhammad Maaz, Hanoona Abdul Rasheed, Salman Khan, and Fahad Khan. Video-chatgpt: Towards detailed video understanding via large vision and language models. InACL, pages 12585–12602, 2024. 1, 2, 3, 6

work page 2024
[37]

Bassl: Boundary-aware self-supervised learning for video scene seg- mentation

Jonghwan Mun, Minchul Shin, Gunsoo Han, Sangho Lee, Seongsu Ha, Joonseok Lee, and Eun-Sol Kim. Bassl: Boundary-aware self-supervised learning for video scene seg- mentation. InACCV, pages 485–501, 2022. 4

work page 2022
[38]

Learning transferable visual models from natural language supervision

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision. InICML, pages 8748–8763, 2021. 2, 5, 6, 8

work page 2021
[39]

Flashattention-3: Fast and accurate attention with asynchrony and low-precision.arXiv Preprint, 2024

Jay Shah, Ganesh Bikshandi, Ying Zhang, Vijay Thakkar, Pradeep Ramani, and Tri Dao. Flashattention-3: Fast and accurate attention with asynchrony and low-precision.arXiv Preprint, 2024. https://arxiv.org/abs/2407. 08608. 2

work page 2024
[40]

LongVU: Spatiotemporal Adaptive Compression for Long Video-Language Understanding

Xiaoqian Shen, Yunyang Xiong, Changsheng Zhao, Lemeng Wu, Jun Chen, Chenchen Zhu, Zechun Liu, Fanyi Xiao, Bal- akrishnan Varadarajan, Florian Bordes, Zhuang Liu, Hu Xu, Hyunwoo J. Kim, Bilge Soran, Raghuraman Krishnamoorthi, Mohamed Elhoseiny, and Vikas Chandra. Longvu: Spa- tiotemporal adaptive compression for long video-language understanding.arXiv Prep...

work page internal anchor Pith review Pith/arXiv arXiv 2024
[41]

REPLUG: retrieval-augmented black-box language models

Weijia Shi, Sewon Min, Michihiro Yasunaga, Minjoon Seo, Richard James, Mike Lewis, Luke Zettlemoyer, and Wen- tau Yih. REPLUG: retrieval-augmented black-box language models. InNAACL-HLT, pages 8371–8384, 2024. 2

work page 2024
[42]

Video- xl: Extra-long vision language model for hour-scale video understanding

Yan Shu, Zheng Liu, Peitian Zhang, Minghao Qin, Junjie Zhou, Zhengyang Liang, Tiejun Huang, and Bo Zhao. Video- xl: Extra-long vision language model for hour-scale video understanding. InCVPR, pages 26160–26169, 2025. 6

work page 2025
[43]

Moviechat: From dense token to sparse memory for long video understanding

Enxin Song, Wenhao Chai, Guanhong Wang, Yucheng Zhang, Haoyang Zhou, Feiyang Wu, Haozhe Chi, Xun Guo, Tian Ye, Yanting Zhang, Yan Lu, Jenq-Neng Hwang, and Gaoang Wang. Moviechat: From dense token to sparse memory for long video understanding. InCVPR, pages 18221–18232,

work page
[44]

Videonsa: Native sparse attention scales video understanding

Enxin Song, Wenhao Chai, Shusheng Yang, Ethan Armand, Xiaojun Shan, Haiyang Xu, Jianwen Xie, and Zhuowen Tu. Videonsa: Native sparse attention scales video understanding. arXiv preprint arXiv:2510.02295, 2025. 6

work page arXiv 2025
[45]

Adaptive keyframe sampling for long video understanding.arXiv Preprint, 2025

Xi Tang, Jihao Qiu, Lingxi Xie, Yunjie Tian, Jianbin Jiao, and Qixiang Ye. Adaptive keyframe sampling for long video understanding.arXiv Preprint, 2025. https://arxiv. org/abs/2502.21271. 3, 6

work page arXiv 2025
[46]

Weakly supervised gaussian contrastive grounding with large multimodal models for video question answering

Haibo Wang, Chenghang Lai, Yixuan Sun, and Weifeng Ge. Weakly supervised gaussian contrastive grounding with large multimodal models for video question answering. InACM Multimedia, pages 5289–5298, 2024. 3

work page 2024
[47]

Dynamic-vlm: Simple dynamic visual token compression for videollm.arXiv Preprint, 2024

Han Wang, Yuxiang Nie, Yongjie Ye, Guanyu Deng, Yanjie Wang, Shuai Li, Haiyang Yu, Jinghui Lu, and Can Huang. Dynamic-vlm: Simple dynamic visual token compression for videollm.arXiv Preprint, 2024. https://arxiv.org/ abs/2412.09530. 6

work page arXiv 2024
[48]

Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution

Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Yang Fan, Kai Dang, Mengfei Du, Xuancheng Ren, Rui Men, Dayiheng Liu, Chang Zhou, Jingren Zhou, and Junyang Lin. Qwen2-vl: Enhancing vision-language model’s perception of the world at any resolution.arXiv Preprint, 2024. https: //arxiv.org...

work page internal anchor Pith review Pith/arXiv arXiv 2024
[49]

InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency

Weiyun Wang, Zhangwei Gao, Lixin Gu, Hengjun Pu, Long Cui, Xingguang Wei, Zhaoyang Liu, Linglin Jing, Sheng- long Ye, Jie Shao, Zhaokai Wang, Zhe Chen, Hongjie Zhang, Ganlin Yang, Haomin Wang, Qi Wei, Jinhui Yin, Wenhao Li, Erfei Cui, Guanzhou Chen, Zichen Ding, Changyao Tian, Zhenyu Wu, JingJing Xie, Zehao Li, Bowen Yang, Yuchen Duan, Xuehui Wang, Zhi Ho...

work page internal anchor Pith review Pith/arXiv arXiv 2025
[50]

Videollamb: Long-context video understanding with recurrent memory bridges.arXiv Preprint, 2024

Yuxuan Wang, Cihang Xie, Yang Liu, and Zilong Zheng. Videollamb: Long-context video understanding with recurrent memory bridges.arXiv Preprint, 2024. https://arxiv. org/abs/2409.01071. 3, 4, 8

work page arXiv 2024
[51]

LongVideoBench: A Benchmark for Long-context Interleaved Video-Language Understanding

Haoning Wu, Dongxu Li, Bei Chen, and Junnan Li. Longvideobench: A benchmark for long-context interleaved video-language understanding.arXiv Preprint, 2024. https: //arxiv.org/abs/2407.15754. 2, 6

work page internal anchor Pith review arXiv 2024
[52]

Next-qa: Next phase of question-answering to explaining temporal actions

Junbin Xiao, Xindi Shang, Angela Yao, and Tat-Seng Chua. Next-qa: Next phase of question-answering to explaining temporal actions. InCVPR, pages 9777–9786, 2021. 5, 6

work page 2021
[53]

Chunk, align, select: A simple long-sequence processing method for transformers

Jiawen Xie, Pengyu Cheng, Xiao Liang, Yong Dai, and Nan Du. Chunk, align, select: A simple long-sequence processing method for transformers. InACL, pages 13500–13519, 2024. 2

work page 2024
[54]

PLLaVA : Parameter-free LLaVA Extension from Images to Videos for Video Dense Captioning

Lin Xu, Yilin Zhao, Daquan Zhou, Zhijie Lin, See-Kiong Ng, and Jiashi Feng. Pllava : Parameter-free llava extension from images to videos for video dense captioning.arXiv Preprint, 2024.https://arxiv.org/abs/2404.16994. 1, 2

work page internal anchor Pith review Pith/arXiv arXiv 2024
[55]

Corrective Retrieval Augmented Generation

Shi-Qi Yan, Jia-Chen Gu, Yun Zhu, and Zhen-Hua Ling. Corrective retrieval augmented generation. arXiv Preprint, 2024. https://arxiv.org/abs/2401.15884. 2

[56]

Qwen2 Technical Report

An Yang, Baosong Yang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Zhou, Chengpeng Li, Chengyuan Li, Dayiheng Liu, Fei Huang, Guanting Dong, Haoran Wei, Huan Lin, Jialong Tang, Jialin Wang, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Ma, Jin Xu, Jingren Zhou, Jinze Bai, Jinzheng He, Junyang Lin, Kai Dang, Keming Lu, Keqin Chen, Kexin Yang, Mei Li, Mingfe...

[57]

mplug-owl3: Towards long image-sequence understanding in multi-modal large language models

Jiabo Ye, Haiyang Xu, Haowei Liu, Anwen Hu, Ming Yan, Qi Qian, Ji Zhang, Fei Huang, and Jingren Zhou. mplug-owl3: Towards long image-sequence understanding in multi-modal large language models. In ICLR, 2025. 6

[58]

Sigmoid loss for language image pre-training

Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, and Lucas Beyer. Sigmoid loss for language image pre-training. In ICCV, pages 11941–11952, 2023. 2, 5, 6

[59]

Video-llama: An instruction-tuned audio-visual language model for video understanding

Hang Zhang, Xin Li, and Lidong Bing. Video-llama: An instruction-tuned audio-visual language model for video understanding. In EMNLP, pages 543–553, 2023. 1

[60]

Omagent: A multi-modal agent framework for complex video understanding with task divide-and-conquer

Lu Zhang, Tiancheng Zhao, Heting Ying, Yibo Ma, and Kyusong Lee. Omagent: A multi-modal agent framework for complex video understanding with task divide-and-conquer. In EMNLP, pages 10031–10045, 2024. 2, 3

[61]

Long Context Transfer from Language to Vision

Peiyuan Zhang, Kaichen Zhang, Bo Li, Guangtao Zeng, Jingkang Yang, Yuanhan Zhang, Ziyue Wang, Haoran Tan, Chunyuan Li, and Ziwei Liu. Long context transfer from language to vision. arXiv Preprint, 2024. https://arxiv.org/abs/2406.16852. 2, 6

[62]

Rag4itops: A supervised fine-tunable and comprehensive RAG framework for IT operations and maintenance

Tianyang Zhang, Zhuoxuan Jiang, Shengguang Bai, Tianrui Zhang, Lin Lin, Yang Liu, and Jiawei Ren. Rag4itops: A supervised fine-tunable and comprehensive RAG framework for IT operations and maintenance. In EMNLP (Industry Track), pages 738–754, 2024. 5

[63]

Llava-next: A strong zero-shot video understanding model

Yuanhan Zhang, Bo Li, Haotian Liu, Yong Jae Lee, Liangke Gui, Di Fu, Jiashi Feng, Ziwei Liu, and Chunyuan Li. Llava-next: A strong zero-shot video understanding model, 2024. 2, 6

[64]

Llava-video: Video instruction tuning with synthetic data

Yuanhan Zhang, Jinming Wu, Wei Li, Bo Li, Zejun Ma, Ziwei Liu, and Chunyuan Li. Llava-video: Video instruction tuning with synthetic data. Trans. Mach. Learn. Res., 2025, 2025. 2, 6

[65]

Judging llm-as-a-judge with mt-bench and chatbot arena

Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric P. Xing, Hao Zhang, Joseph E. Gonzalez, and Ion Stoica. Judging llm-as-a-judge with mt-bench and chatbot arena. In NeurIPS, 2023. 1, 2

[66]

MLVU: Benchmarking Multi-task Long Video Understanding

Junjie Zhou, Yan Shu, Bo Zhao, Boya Wu, Shitao Xiao, Xi Yang, Yongping Xiong, Bo Zhang, Tiejun Huang, and Zheng Liu. MLVU: A comprehensive benchmark for multi-task long video understanding. arXiv Preprint, 2024. https://arxiv.org/abs/2406.04264. 2, 6

[67]

TRAR: routing the attention spans in transformer for visual question answering

Yiyi Zhou, Tianhe Ren, Chaoyang Zhu, Xiaoshuai Sun, Jianzhuang Liu, Xinghao Ding, Mingliang Xu, and Rongrong Ji. TRAR: routing the attention spans in transformer for visual question answering. In ICCV, pages 2054–2064, 2021. 1

[68]

Plenty is plague: Fine-grained learning for visual question answering

Yiyi Zhou, Rongrong Ji, Xiaoshuai Sun, Jinsong Su, Deyu Meng, Yue Gao, and Chunhua Shen. Plenty is plague: Fine-grained learning for visual question answering. IEEE Trans. Pattern Anal. Mach. Intell., 44(2):697–709, 2022. 1

[69]

Minigpt-4: Enhancing vision-language understanding with advanced large language models

Deyao Zhu, Jun Chen, Xiaoqian Shen, Xiang Li, and Mohamed Elhoseiny. Minigpt-4: Enhancing vision-language understanding with advanced large language models. In ICLR. OpenReview.net, 2024. 1
