Towards Effective Long Video Understanding of Multimodal Large Language Models via One-shot Clip Retrieval
Pith reviewed 2026-05-17 00:05 UTC · model grok-4.3
The pith
OneClip-RAG lets MLLMs understand long videos by retrieving one-shot clips while preserving semantic coherence and knowledge integrity.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
OneClip-RAG is an effective and efficient paradigm for long video understanding in MLLMs that makes full use of video clips for augmented understanding in terms of both knowledge integrity and semantic coherence, equipped with a novel query-guided video chunking algorithm that unifies clip chunking and cross-modal retrieval in one processing step, further supported by the SynLongVideo dataset and progressive training to improve instruction following, and validated through integration with three recent MLLMs on long-video benchmarks.
What carries the argument
One-shot video-Clip based Retrieval-Augmented Generation (OneClip-RAG) with its query-guided video chunking algorithm, which unifies chunking and retrieval to preserve coherence.
If this is right
- Boosts Qwen3-VL 8B performance to the level of GPT-5 on the MLVU benchmark.
- Enables LLaVA-Video to process up to an hour of video in less than 1.2 minutes on a single 4090 GPU.
- Delivers superior efficiency compared with prior video RAG methods by avoiding redundant computations.
- Improves instruction following through the SynLongVideo dataset and progressive training regime.
- Applies directly to multiple recent MLLMs with consistent gains on long-video tasks.
Where Pith is reading between the lines
- The unified chunking-retrieval step could extend to other long-sequence modalities such as audio tracks or multi-page documents.
- If retrieval errors remain low, the method may reduce the need for full-video processing in real-time video analysis applications.
- Scaling tests on videos exceeding one hour would show whether additional retrieval rounds become necessary.
Load-bearing premise
That one-shot clip retrieval combined with query-guided chunking preserves full knowledge integrity and semantic coherence across the entire long video without omitting critical information or introducing retrieval errors.
What would settle it
A long-video question-answering benchmark in which key events or details required for correct answers lie outside the single retrieved clip, causing the augmented model to produce incorrect responses.
Figures
read the original abstract
Due to excessive memory overhead, most Multimodal Large Language Models (MLLMs) can only process videos of limited frames. In this paper, we propose an effective and efficient paradigm to remedy this shortcoming, termed One-shot video-Clip based Retrieval-Augmented Generation (OneClip-RAG). Compared with existing video RAG methods, OneClip-RAG makes full use of the merits of video clips for augmented video understanding in terms of both knowledge integrity and semantic coherence. Besides, it is also equipped with a novel query-guided video chunking algorithm that can unify clip chunking and cross-modal retrieval in one processing step, avoiding redundant computations. To improve instruction following, we further propose a new dataset called SynLongVideo and design a progressive training regime for OneClip-RAG. OneClip-RAG is plugged into three recent MLLMs and validated on a set of long-video benchmarks. Experimental results not only show the obvious performance gains by OneClip-RAG over MLLMs, e.g., boosting Qwen3-VL 8B to the level of GPT-5 on MLVU, but also show its superior efficiency in handling long videos. e.g., enabling LLaVA-Video understand up to an hour of videos in less than 1.2 minutes on a single 4090 GPU.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces OneClip-RAG, a retrieval-augmented generation framework for long-video understanding in MLLMs. It relies on one-shot video-clip retrieval combined with a query-guided chunking algorithm that unifies chunking and cross-modal retrieval in a single step, aiming to preserve knowledge integrity and semantic coherence while reducing memory and compute costs. The approach is augmented by the SynLongVideo dataset and progressive training, then plugged into three existing MLLMs and evaluated on long-video benchmarks, with reported gains such as elevating Qwen3-VL 8B performance to GPT-5 levels on MLVU and enabling hour-long video processing in under 1.2 minutes on a single 4090 GPU.
Significance. If the empirical claims hold under rigorous controls, the work would offer a practical route to scaling MLLM video understanding beyond short clips without prohibitive memory overhead. The emphasis on clip-level retrieval for coherence and the unified query-guided chunking step represent a targeted engineering contribution that could influence future video RAG designs, particularly for open-source models.
major comments (3)
- [Abstract and §4] Abstract and §4 (Experimental Setup): The central performance claims (e.g., Qwen3-VL 8B reaching GPT-5 level on MLVU) are presented without specification of exact baselines, data splits, statistical tests, error bars, or multiple-run averages. This omission makes it impossible to assess whether the reported gains are robust or attributable to post-hoc choices, directly undermining evaluation of the method's effectiveness.
- [§3.2] §3.2 (Query-Guided Video Chunking): The algorithm is asserted to preserve full knowledge integrity and semantic coherence by using query relevance as the selection criterion, yet no quantitative measurement of event recall, information-loss rate, or narrative coherence (e.g., via human evaluation or proxy metrics on temporally distant setup/background events) is reported. This leaves the core assumption—that query-aligned chunks suffice for complete video understanding—unfalsified and load-bearing for the integrity claim.
- [§4.3] §4.3 (Efficiency Evaluation): The efficiency result (LLaVA-Video processing up to one hour of video in <1.2 minutes on a 4090) lacks comparison against standard frame-sampling or existing video RAG baselines under identical hardware and video-length conditions, and does not clarify the precise frame rate or token budget used, rendering the superiority claim difficult to interpret or reproduce.
minor comments (2)
- [Abstract] The abstract contains a minor grammatical inconsistency in the efficiency sentence (missing comma before 'e.g.').
- [§3.2] Notation for the unified chunking-retrieval step could be clarified with a short pseudocode block or equation to make the 'one processing step' claim more precise.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We have revised the manuscript to address all major comments by adding the requested experimental details, quantitative evaluations, and comparative analyses.
read point-by-point responses
-
Referee: [Abstract and §4] Abstract and §4 (Experimental Setup): The central performance claims (e.g., Qwen3-VL 8B reaching GPT-5 level on MLVU) are presented without specification of exact baselines, data splits, statistical tests, error bars, or multiple-run averages. This omission makes it impossible to assess whether the reported gains are robust or attributable to post-hoc choices, directly undermining evaluation of the method's effectiveness.
Authors: We agree that the original presentation lacked sufficient detail for assessing robustness. In the revised manuscript, we have expanded the abstract and §4 to specify all baselines (vanilla MLLMs and prior RAG methods), exact data splits for each benchmark, averages over three independent runs with standard deviation error bars, and paired t-test results for statistical significance. revision: yes
-
Referee: [§3.2] §3.2 (Query-Guided Video Chunking): The algorithm is asserted to preserve full knowledge integrity and semantic coherence by using query relevance as the selection criterion, yet no quantitative measurement of event recall, information-loss rate, or narrative coherence (e.g., via human evaluation or proxy metrics on temporally distant setup/background events) is reported. This leaves the core assumption—that query-aligned chunks suffice for complete video understanding—unfalsified and load-bearing for the integrity claim.
Authors: The referee is correct that direct quantitative validation of knowledge preservation was missing. While end-to-end gains offer indirect support, the revised §3.2 now includes event recall and information-loss metrics computed on annotated video subsets, plus a small-scale human evaluation of narrative coherence for temporally distant events. revision: yes
-
Referee: [§4.3] §4.3 (Efficiency Evaluation): The efficiency result (LLaVA-Video processing up to one hour of video in <1.2 minutes on a 4090) lacks comparison against standard frame-sampling or existing video RAG baselines under identical hardware and video-length conditions, and does not clarify the precise frame rate or token budget used, rendering the superiority claim difficult to interpret or reproduce.
Authors: We agree that comparative baselines and implementation details are essential for interpretability. The revised §4.3 now reports efficiency results against standard frame-sampling and existing video RAG methods under identical hardware and video lengths, and explicitly states the 1 FPS sampling rate and token budget used. revision: yes
Circularity Check
No significant circularity; method is an independent engineering contribution
full rationale
The paper introduces OneClip-RAG as a new retrieval-augmented paradigm with a query-guided chunking algorithm and the SynLongVideo dataset for training. No equations, fitted parameters, or derivations are present that reduce to inputs by construction. Performance claims rest on external benchmarks and efficiency measurements rather than self-referential definitions or load-bearing self-citations. The central components (one-shot clip retrieval and unified chunking) are presented as novel additions with independent validation, making the work self-contained against external evaluation.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
-
[1]
Adnan Arefeen, Biplob Debnath, Md
Md. Adnan Arefeen, Biplob Debnath, Md. Yusuf Sarwar Uddin, and Srimat Chakradhar. Vita: An efficient video- to-text algorithm using VLM for rag-based video analysis system. InCVPR Workshops, pages 2266–2274. IEEE, 2024. 2
work page 2024
-
[2]
Self-rag: Learning to retrieve, generate, and critique through self-reflection
Akari Asai, Zeqiu Wu, Yizhong Wang, Avirup Sil, and Han- naneh Hajishirzi. Self-rag: Learning to retrieve, generate, and critique through self-reflection. InICLR, 2024. 2
work page 2024
-
[3]
Kirolos Ataallah, Xiaoqian Shen, Eslam Abdelrahman, Es- sam Sleiman, Deyao Zhu, Jian Ding, and Mohamed Elhoseiny. Minigpt4-video: Advancing multimodal llms for video un- derstanding with interleaved visual-textual tokens.arXiv Preprint, 2024. https://arxiv.org/abs/2404. 03413. 2
work page 2024
-
[4]
Goldfish: Vision- language understanding of arbitrarily long videos
Kirolos Ataallah, Xiaoqian Shen, Eslam Abdelrahman, Essam Sleiman, Mingchen Zhuge, Jian Ding, Deyao Zhu, Jürgen Schmidhuber, and Mohamed Elhoseiny. Goldfish: Vision- language understanding of arbitrarily long videos. InECCV (29), pages 251–267, 2024. 2, 3, 6
work page 2024
-
[5]
Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, Humen Zhong, Yuanzhi Zhu, Mingkun Yang, Zhaohai Li, Jianqiang Wan, Pengfei Wang, Wei Ding, Zheren Fu, Yiheng Xu, Jiabo Ye, Xi Zhang, Tianbao Xie, Zesen Cheng, Hang Zhang, Zhibo Yang, Haiyang Xu, and Junyang Lin. Qwen2.5- vl technical report.a...
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[6]
Where did I leave my keys? - episodic-memory-based question answering on ego- centric videos
Leonard Bärmann and Alex Waibel. Where did I leave my keys? - episodic-memory-based question answering on ego- centric videos. InCVPR Workshops, pages 1559–1567, 2022. 2, 4, 5, 6
work page 2022
-
[7]
Multi-task re- triever fine-tuning for domain-specific and efficient RAG
Patrice Béchard and Orlando Marquez Ayala. Multi-task re- triever fine-tuning for domain-specific and efficient RAG. arXiv Preprint, 2025. https://arxiv.org/abs/ 2501.04652. 5
-
[8]
Longformer: The Long-Document Transformer
Iz Beltagy, Matthew E. Peters, and Arman Cohan. Long- former: The long-document transformer.arXiv Preprint, 2020.https://arxiv.org/abs/2004.05150. 2
work page internal anchor Pith review Pith/arXiv arXiv 2020
-
[9]
Donald J. Berndt and James Clifford. Using dynamic time warping to find patterns in time series. InKDD Workshop, pages 359–370, 1994. 4
work page 1994
-
[10]
Internlm2 technical report, 2024
Zheng Cai, Maosong Cao, Haojiong Chen, Kai Chen, Keyu Chen, Xin Chen, Xun Chen, Zehui Chen, Zhi Chen, Pei Chu, Xiaoyi Dong, Haodong Duan, Qi Fan, Zhaoye Fei, Yang Gao, Jiaye Ge, Chenya Gu, Yuzhe Gu, Tao Gui, Aijia Guo, Qipeng Guo, Conghui He, Yingfan Hu, Ting Huang, Tao Jiang, Peng- long Jiao, Zhenjiang Jin, Zhikai Lei, Jiaxing Li, Jingwen Li, Linyang Li,...
work page 2024
-
[12]
How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites
Zhe Chen, Weiyun Wang, Hao Tian, Shenglong Ye, Zhangwei Gao, Erfei Cui, Wenwen Tong, Kongzhi Hu, Jiapeng Luo, Zheng Ma, et al. How far are we to gpt-4v? closing the gap to commercial multimodal models with open-source suites. arXiv preprint arXiv:2404.16821, 2024. 1
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[13]
The power of noise: Re- defining retrieval for RAG systems
Florin Cuconasu, Giovanni Trappolini, Federico Siciliano, Simone Filice, Cesare Campagnano, Yoelle Maarek, Nicola Tonellotto, and Fabrizio Silvestri. The power of noise: Re- defining retrieval for RAG systems. InSIGIR, pages 719–729,
-
[14]
Fu, Stefano Ermon, Atri Rudra, and Christopher Ré
Tri Dao, Daniel Y . Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. Flashattention: Fast and memory-efficient exact attention with io-awareness. InNeurIPS, 2022. 2
work page 2022
-
[15]
Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Ab- hishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil 9 Mathur, Alan Schelten, Amy Yang, Angela Fan, Anirudh Goyal, Anthony Hartshorn, Aobo Yang, Archi Mitra, Archie Sravankumar, Artem Korenev, Arthur Hinsvark, Arun Rao, Aston Zhang, Aurélien Rodriguez, Austen Gregerson, Ava Spataru, Baptiste Rozière, Bet...
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[16]
MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models
Chaoyou Fu, Peixian Chen, Yunhang Shen, Yulei Qin, Meng- dan Zhang, Xu Lin, Zhenyu Qiu, Wei Lin, Jinrui Yang, Xiawu Zheng, Ke Li, Xing Sun, and Rongrong Ji. MME: A compre- hensive evaluation benchmark for multimodal large language models.arXiv Preprint, 2023. https://arxiv.org/ abs/2306.13394. 1
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[17]
Video-mme: The first-ever comprehensive evaluation benchmark of multi-modal llms in video analysis
Chaoyou Fu, Yuhan Dai, Yongdong Luo, Lei Li, Shuhuai Ren, Renrui Zhang, Zihan Wang, Chenyu Zhou, Yunhang Shen, Mengdan Zhang, Peixian Chen, Yanwei Li, Shaohui Lin, Sirui Zhao, Ke Li, Tong Xu, Xiawu Zheng, Enhong Chen, Caifeng Shan, Ran He, and Xing Sun. Video-mme: The first-ever comprehensive evaluation benchmark of multi-modal llms in video analysis. InC...
work page 2025
-
[18]
Marti A. Hearst. Texttiling: Segmenting text into multi- paragraph subtopic passages.Comput. Linguistics, 23(1): 33–64, 1997. 4
work page 1997
-
[19]
Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. InICLR (Poster), 2015. 6
work page 2015
-
[20]
Junnan Li, Dongxu Li, Silvio Savarese, and Steven C. H. Hoi. BLIP-2: bootstrapping language-image pre-training with frozen image encoders and large language models. In ICML, pages 19730–19742, 2023. 1
work page 2023
-
[21]
VideoChat: Chat-Centric Video Understanding
Kunchang Li, Yinan He, Yi Wang, Yizhuo Li, Wenhai Wang, Ping Luo, Yali Wang, Limin Wang, and Yu Qiao. Videochat: Chat-centric video understanding.arXiv Preprint, 2023. https://arxiv.org/abs/2305.06355. 2
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[22]
Mvbench: A comprehensive multi-modal video understanding benchmark
Kunchang Li, Yali Wang, Yinan He, Yizhuo Li, Yi Wang, Yi Liu, Zun Wang, Jilan Xu, Guo Chen, Ping Lou, Limin Wang, and Yu Qiao. Mvbench: A comprehensive multi-modal video understanding benchmark. InCVPR, pages 22195–22206,
-
[23]
Jianxin Liang, Xiaojun Meng, Yueqian Wang, Chang Liu, Qun Liu, and Dongyan Zhao. End-to-end video question answering with frame scoring mechanisms and adaptive sam- pling.arXiv Preprint, 2024. https://arxiv.org/abs/ 2407.15047. 2, 3, 4
-
[24]
Video-LLaVA: Learning United Visual Representation by Alignment Before Projection
Bin Lin, Yang Ye, Bin Zhu, Jiaxi Cui, Munan Ning, Peng Jin, and Li Yuan. Video-llava: Learning united visual represen- tation by alignment before projection.arXiv Preprint, 2023. https://arxiv.org/abs/2311.10122. 1, 2
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[25]
VILA: on pre-training for visual language models
Ji Lin, Hongxu Yin, Wei Ping, Pavlo Molchanov, Mohammad Shoeybi, and Song Han. VILA: on pre-training for visual language models. InCVPR, pages 26679–26689, 2024. 2, 6
work page 2024
-
[26]
MM-VID: advancing video understanding with gpt-4v(ision).arXiv Preprint, 2023
Kevin Lin, Faisal Ahmed, Linjie Li, Chung-Ching Lin, Ehsan Azarnasab, Zhengyuan Yang, Jianfeng Wang, Lin Liang, Zicheng Liu, Yumao Lu, Ce Liu, and Lijuan Wang. MM-VID: advancing video understanding with gpt-4v(ision).arXiv Preprint, 2023. https://arxiv.org/abs/2310. 19773. 2
work page 2023
-
[27]
Llava-next: Improved reason- ing, ocr, and world knowledge, 2024
Haotian Liu, Chunyuan Li, Yuheng Li, Bo Li, Yuanhan Zhang, Sheng Shen, and Yong Jae Lee. Llava-next: Improved reason- ing, ocr, and world knowledge, 2024. 3
work page 2024
-
[28]
Bolt: Boost large vision-language model without training for long-form video understanding
Shuming Liu, , Chen Zhao, Tianqi Xu, and Bernard Ghanem. Bolt: Boost large vision-language model without training for long-form video understanding. InCVPR, 2025. 2, 3
work page 2025
-
[29]
NVILA: efficient frontier visual language models
Zhijian Liu, Ligeng Zhu, Baifeng Shi, Zhuoyang Zhang, Yum- ing Lou, Shang Yang, Haocheng Xi, Shiyi Cao, Yuxian Gu, Dacheng Li, Xiuyu Li, Haotian Tang, Yunhao Fang, Yukang Chen, Cheng-Yu Hsieh, De-An Huang, An-Chieh Cheng, Jinyi Hu, Sifei Liu, Ranjay Krishna, Pavlo Molchanov, Jan Kautz, Hongxu Yin, Song Han, and Yao Lu. NVILA: efficient frontier visual lan...
-
[30]
Mathvista: Evaluating mathemat- ical reasoning of foundation models in visual contexts
Pan Lu, Hritik Bansal, Tony Xia, Jiacheng Liu, Chunyuan Li, Hannaneh Hajishirzi, Hao Cheng, Kai-Wei Chang, Michel Galley, and Jianfeng Gao. Mathvista: Evaluating mathemat- ical reasoning of foundation models in visual contexts. In ICLR, 2024. 1
work page 2024
-
[31]
Gen Luo, Yiyi Zhou, Xiaoshuai Sun, Yan Wang, Liujuan Cao, Yongjian Wu, Feiyue Huang, and Rongrong Ji. Towards lightweight transformer via group-wise transformation for vision-and-language tasks.IEEE Trans. Image Process., 31: 3386–3398, 2022. 1
work page 2022
-
[32]
Towards language-guided visual recognition via dynamic convolutions.Int
Gen Luo, Yiyi Zhou, Xiaoshuai Sun, Yongjian Wu, Yue Gao, and Rongrong Ji. Towards language-guided visual recognition via dynamic convolutions.Int. J. Comput. Vis., 132(1):1–19,
-
[33]
Moil: Momentum imitation learning for efficient vision-language adaptation.IEEE Trans
Gen Luo, Yiyi Zhou, Minglang Huang, Tianhe Ren, Xi- aoshuai Sun, and Rongrong Ji. Moil: Momentum imitation learning for efficient vision-language adaptation.IEEE Trans. Pattern Anal. Mach. Intell., 47(7):5192–5204, 2025. 1
work page 2025
-
[34]
Video-rag: Visually-aligned retrieval-augmented long video comprehension.arXiv Preprint, 2024
Yongdong Luo, Xiawu Zheng, Xiao Yang, Guilin Li, Haojia Lin, Jinfa Huang, Jiayi Ji, Fei Chao, Jiebo Luo, and Rongrong Ji. Video-rag: Visually-aligned retrieval-augmented long video comprehension.arXiv Preprint, 2024. https:// arxiv.org/abs/2411.13093. 2
-
[35]
Ziyu Ma, Chenhui Gou, Hengcan Shi, Bin Sun, Shutao Li, Hamid Rezatofighi, and Jianfei Cai. Drvideo: Document 10 retrieval based long video understanding.arXiv Preprint, 2024.https://arxiv.org/abs/2406.12846. 2, 3
-
[36]
Video-chatgpt: Towards detailed video understanding via large vision and language models
Muhammad Maaz, Hanoona Abdul Rasheed, Salman Khan, and Fahad Khan. Video-chatgpt: Towards detailed video understanding via large vision and language models. InACL, pages 12585–12602, 2024. 1, 2, 3, 6
work page 2024
-
[37]
Bassl: Boundary-aware self-supervised learning for video scene seg- mentation
Jonghwan Mun, Minchul Shin, Gunsoo Han, Sangho Lee, Seongsu Ha, Joonseok Lee, and Eun-Sol Kim. Bassl: Boundary-aware self-supervised learning for video scene seg- mentation. InACCV, pages 485–501, 2022. 4
work page 2022
-
[38]
Learning transferable visual models from natural language supervision
Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision. InICML, pages 8748–8763, 2021. 2, 5, 6, 8
work page 2021
-
[39]
Flashattention-3: Fast and accurate attention with asynchrony and low-precision.arXiv Preprint, 2024
Jay Shah, Ganesh Bikshandi, Ying Zhang, Vijay Thakkar, Pradeep Ramani, and Tri Dao. Flashattention-3: Fast and accurate attention with asynchrony and low-precision.arXiv Preprint, 2024. https://arxiv.org/abs/2407. 08608. 2
work page 2024
-
[40]
LongVU: Spatiotemporal Adaptive Compression for Long Video-Language Understanding
Xiaoqian Shen, Yunyang Xiong, Changsheng Zhao, Lemeng Wu, Jun Chen, Chenchen Zhu, Zechun Liu, Fanyi Xiao, Bal- akrishnan Varadarajan, Florian Bordes, Zhuang Liu, Hu Xu, Hyunwoo J. Kim, Bilge Soran, Raghuraman Krishnamoorthi, Mohamed Elhoseiny, and Vikas Chandra. Longvu: Spa- tiotemporal adaptive compression for long video-language understanding.arXiv Prep...
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[41]
REPLUG: retrieval-augmented black-box language models
Weijia Shi, Sewon Min, Michihiro Yasunaga, Minjoon Seo, Richard James, Mike Lewis, Luke Zettlemoyer, and Wen- tau Yih. REPLUG: retrieval-augmented black-box language models. InNAACL-HLT, pages 8371–8384, 2024. 2
work page 2024
-
[42]
Video- xl: Extra-long vision language model for hour-scale video understanding
Yan Shu, Zheng Liu, Peitian Zhang, Minghao Qin, Junjie Zhou, Zhengyang Liang, Tiejun Huang, and Bo Zhao. Video- xl: Extra-long vision language model for hour-scale video understanding. InCVPR, pages 26160–26169, 2025. 6
work page 2025
-
[43]
Moviechat: From dense token to sparse memory for long video understanding
Enxin Song, Wenhao Chai, Guanhong Wang, Yucheng Zhang, Haoyang Zhou, Feiyang Wu, Haozhe Chi, Xun Guo, Tian Ye, Yanting Zhang, Yan Lu, Jenq-Neng Hwang, and Gaoang Wang. Moviechat: From dense token to sparse memory for long video understanding. InCVPR, pages 18221–18232,
-
[44]
Videonsa: Native sparse attention scales video understanding
Enxin Song, Wenhao Chai, Shusheng Yang, Ethan Armand, Xiaojun Shan, Haiyang Xu, Jianwen Xie, and Zhuowen Tu. Videonsa: Native sparse attention scales video understanding. arXiv preprint arXiv:2510.02295, 2025. 6
-
[45]
Adaptive keyframe sampling for long video understanding.arXiv Preprint, 2025
Xi Tang, Jihao Qiu, Lingxi Xie, Yunjie Tian, Jianbin Jiao, and Qixiang Ye. Adaptive keyframe sampling for long video understanding.arXiv Preprint, 2025. https://arxiv. org/abs/2502.21271. 3, 6
-
[46]
Haibo Wang, Chenghang Lai, Yixuan Sun, and Weifeng Ge. Weakly supervised gaussian contrastive grounding with large multimodal models for video question answering. InACM Multimedia, pages 5289–5298, 2024. 3
work page 2024
-
[47]
Dynamic-vlm: Simple dynamic visual token compression for videollm.arXiv Preprint, 2024
Han Wang, Yuxiang Nie, Yongjie Ye, Guanyu Deng, Yanjie Wang, Shuai Li, Haiyang Yu, Jinghui Lu, and Can Huang. Dynamic-vlm: Simple dynamic visual token compression for videollm.arXiv Preprint, 2024. https://arxiv.org/ abs/2412.09530. 6
-
[48]
Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution
Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Yang Fan, Kai Dang, Mengfei Du, Xuancheng Ren, Rui Men, Dayiheng Liu, Chang Zhou, Jingren Zhou, and Junyang Lin. Qwen2-vl: Enhancing vision-language model’s perception of the world at any resolution.arXiv Preprint, 2024. https: //arxiv.org...
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[49]
InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency
Weiyun Wang, Zhangwei Gao, Lixin Gu, Hengjun Pu, Long Cui, Xingguang Wei, Zhaoyang Liu, Linglin Jing, Sheng- long Ye, Jie Shao, Zhaokai Wang, Zhe Chen, Hongjie Zhang, Ganlin Yang, Haomin Wang, Qi Wei, Jinhui Yin, Wenhao Li, Erfei Cui, Guanzhou Chen, Zichen Ding, Changyao Tian, Zhenyu Wu, JingJing Xie, Zehao Li, Bowen Yang, Yuchen Duan, Xuehui Wang, Zhi Ho...
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[50]
Videollamb: Long-context video understanding with recurrent memory bridges.arXiv Preprint, 2024
Yuxuan Wang, Cihang Xie, Yang Liu, and Zilong Zheng. Videollamb: Long-context video understanding with recurrent memory bridges.arXiv Preprint, 2024. https://arxiv. org/abs/2409.01071. 3, 4, 8
-
[51]
LongVideoBench: A Benchmark for Long-context Interleaved Video-Language Understanding
Haoning Wu, Dongxu Li, Bei Chen, and Junnan Li. Longvideobench: A benchmark for long-context interleaved video-language understanding.arXiv Preprint, 2024. https: //arxiv.org/abs/2407.15754. 2, 6
work page internal anchor Pith review arXiv 2024
-
[52]
Next-qa: Next phase of question-answering to explaining temporal actions
Junbin Xiao, Xindi Shang, Angela Yao, and Tat-Seng Chua. Next-qa: Next phase of question-answering to explaining temporal actions. InCVPR, pages 9777–9786, 2021. 5, 6
work page 2021
-
[53]
Chunk, align, select: A simple long-sequence processing method for transformers
Jiawen Xie, Pengyu Cheng, Xiao Liang, Yong Dai, and Nan Du. Chunk, align, select: A simple long-sequence processing method for transformers. InACL, pages 13500–13519, 2024. 2
work page 2024
-
[54]
PLLaVA : Parameter-free LLaVA Extension from Images to Videos for Video Dense Captioning
Lin Xu, Yilin Zhao, Daquan Zhou, Zhijie Lin, See-Kiong Ng, and Jiashi Feng. Pllava : Parameter-free llava extension from images to videos for video dense captioning.arXiv Preprint, 2024.https://arxiv.org/abs/2404.16994. 1, 2
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[55]
Corrective Retrieval Augmented Generation
Shi-Qi Yan, Jia-Chen Gu, Yun Zhu, and Zhen-Hua Ling. Corrective retrieval augmented generation.arXiv Preprint, 2024.https://arxiv.org/abs/2401.15884. 2
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[56]
An Yang, Baosong Yang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Zhou, Chengpeng Li, Chengyuan Li, Dayiheng Liu, Fei Huang, Guanting Dong, Haoran Wei, Huan Lin, Jialong Tang, Jialin Wang, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Ma, Jin Xu, Jingren Zhou, Jinze Bai, Jinzheng He, Junyang Lin, Kai Dang, Keming Lu, Keqin Chen, Kexin Yang, 11 Mei Li, Mingfe...
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[57]
mplug-owl3: Towards long image-sequence understanding in multi-modal large language models
Jiabo Ye, Haiyang Xu, Haowei Liu, Anwen Hu, Ming Yan, Qi Qian, Ji Zhang, Fei Huang, and Jingren Zhou. mplug-owl3: Towards long image-sequence understanding in multi-modal large language models. InICLR, 2025. 6
work page 2025
-
[58]
Sigmoid loss for language image pre-training
Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, and Lucas Beyer. Sigmoid loss for language image pre-training. InICCV, pages 11941–11952, 2023. 2, 5, 6
work page 2023
-
[59]
Video-llama: An instruction-tuned audio-visual language model for video un- derstanding
Hang Zhang, Xin Li, and Lidong Bing. Video-llama: An instruction-tuned audio-visual language model for video un- derstanding. InEMNLP, pages 543–553, 2023. 1
work page 2023
-
[60]
Omagent: A multi-modal agent framework for complex video understanding with task divide-and-conquer
Lu Zhang, Tiancheng Zhao, Heting Ying, Yibo Ma, and Kyu- song Lee. Omagent: A multi-modal agent framework for complex video understanding with task divide-and-conquer. InEMNLP, pages 10031–10045, 2024. 2, 3
work page 2024
-
[61]
Long Context Transfer from Language to Vision
Peiyuan Zhang, Kaichen Zhang, Bo Li, Guangtao Zeng, Jingkang Yang, Yuanhan Zhang, Ziyue Wang, Haoran Tan, Chunyuan Li, and Ziwei Liu. Long context transfer from lan- guage to vision.arXiv Preprint, 2024. https://arxiv. org/abs/2406.16852. 2, 6
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[62]
Tianyang Zhang, Zhuoxuan Jiang, Shengguang Bai, Tianrui Zhang, Lin Lin, Yang Liu, and Jiawei Ren. Rag4itops: A supervised fine-tunable and comprehensive RAG framework for IT operations and maintenance. InEMNLP (Industry Track), pages 738–754, 2024. 5
work page 2024
-
[63]
Llava- next: A strong zero-shot video understanding model, 2024
Yuanhan Zhang, Bo Li, haotian Liu, Yong jae Lee, Liangke Gui, Di Fu, Jiashi Feng, Ziwei Liu, and Chunyuan Li. Llava- next: A strong zero-shot video understanding model, 2024. 2, 6
work page 2024
-
[64]
Llava-video: Video instruction tuning with synthetic data.Trans
Yuanhan Zhang, Jinming Wu, Wei Li, Bo Li, Zejun Ma, Ziwei Liu, and Chunyuan Li. Llava-video: Video instruction tuning with synthetic data.Trans. Mach. Learn. Res., 2025, 2025. 2, 6
work page 2025
-
[65]
Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric P. Xing, Hao Zhang, Joseph E. Gonzalez, and Ion Stoica. Judging llm-as-a-judge with mt-bench and chatbot arena. InNeurIPS, 2023. 1, 2
work page 2023
-
[66]
MLVU: Benchmarking Multi-task Long Video Understanding
Junjie Zhou, Yan Shu, Bo Zhao, Boya Wu, Shitao Xiao, Xi Yang, Yongping Xiong, Bo Zhang, Tiejun Huang, and Zheng Liu. MLVU: A comprehensive benchmark for multi-task long video understanding.arXiv Preprint, 2024. https: //arxiv.org/abs/2406.04264. 2, 6
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[67]
TRAR: routing the attention spans in transformer for visual question answering
Yiyi Zhou, Tianhe Ren, Chaoyang Zhu, Xiaoshuai Sun, Jianzhuang Liu, Xinghao Ding, Mingliang Xu, and Rongrong Ji. TRAR: routing the attention spans in transformer for visual question answering. InICCV, pages 2054–2064, 2021. 1
work page 2054
-
[68]
Plenty is plague: Fine- grained learning for visual question answering.IEEE Trans
Yiyi Zhou, Rongrong Ji, Xiaoshuai Sun, Jinsong Su, Deyu Meng, Yue Gao, and Chunhua Shen. Plenty is plague: Fine- grained learning for visual question answering.IEEE Trans. Pattern Anal. Mach. Intell., 44(2):697–709, 2022. 1
work page 2022
-
[69]
Minigpt-4: Enhancing vision-language understanding with advanced large language models
Deyao Zhu, Jun Chen, Xiaoqian Shen, Xiang Li, and Mo- hamed Elhoseiny. Minigpt-4: Enhancing vision-language understanding with advanced large language models. InICLR. OpenReview.net, 2024. 1 12
work page 2024
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.