LongVideo-R1: Smart Navigation for Low-cost Long Video Understanding
Pith reviewed 2026-05-15 19:57 UTC · model grok-4.3
The pith
LongVideo-R1 uses a reasoning MLLM agent to navigate long videos by selecting only the most informative clips from high-level summaries.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
LongVideo-R1 is a reasoning-equipped MLLM agent that starts traversal from hierarchical video summaries, uses high-level visual cues to infer the most informative clip, and stops exploration immediately once it has acquired sufficient knowledge to answer the query. It is trained via a two-stage SFT-then-RL process on 33K trajectories to maximize selective and efficient navigation.
What carries the argument
The reasoning module that infers the most informative video clip from high-level visual cues in top-level summaries, enabling iterative focus refinement and early halting of exploration.
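To make that mechanism concrete, here is a minimal Python sketch of such a coarse-to-fine navigation loop with early stopping. The names agent_step, get_caption, and the decision dictionary are hypothetical stand-ins chosen for illustration, not the paper's actual interfaces.

```python
# Illustrative sketch only: agent_step and get_caption are hypothetical callables,
# not LongVideo-R1's actual interfaces.
from dataclasses import dataclass


@dataclass
class NavigationState:
    query: str
    captions: list            # captions gathered so far (starts with top-level summaries)
    clips_viewed: int = 0


def navigate(query, top_level_summaries, agent_step, get_caption, max_rounds=8):
    """Coarse-to-fine navigation: begin with high-level summaries, drill into one
    clip per round, and stop as soon as the agent judges its evidence sufficient."""
    state = NavigationState(query=query, captions=list(top_level_summaries))
    for _ in range(max_rounds):
        decision = agent_step(state)            # reasoning model proposes the next action
        if decision["action"] == "answer":      # early stop: enough knowledge acquired
            return decision["answer"], state.clips_viewed
        # Otherwise fetch the caption of the single clip the agent deems most informative.
        state.captions.append(get_caption(decision["clip_id"]))
        state.clips_viewed += 1
    # Budget exhausted: force a final answer from whatever has been gathered
    # (assumes agent_step always returns an "answer" field in this case).
    return agent_step(state)["answer"], state.clips_viewed
```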
If this is right
- Long videos can be processed at substantially lower computational cost by skipping exhaustive frame-by-frame analysis.
- Question-answering accuracy on long-video benchmarks remains competitive or superior while inference time and memory usage drop.
- The two-stage training process produces agents that learn to make selective clip choices and stop early when knowledge is sufficient.
- The same navigation pattern can be applied to new queries without retraining the core reasoning logic.
Where Pith is reading between the lines
- The same early-stopping navigation idea could transfer to long documents or audio streams where only portions matter for a given query.
- If top-level summaries omit subtle temporal cues, performance may degrade on tasks that need fine-grained details spread across the video.
- Pairing the navigation agent with stronger base multimodal models would likely widen the accuracy-efficiency gap further.
- The design illustrates a path toward multimodal agents that actively decide what to observe next based on partial understanding.
Load-bearing premise
High-level visual cues extracted from top-level summaries contain enough information for the reasoning module to reliably identify the single most informative clip without missing critical details needed for correct answers.
What would settle it
A test set of queries whose answers require details from multiple non-adjacent clips; if the agent’s chosen clip consistently omits one of those details and accuracy falls below exhaustive baselines, the navigation claim fails.
read the original abstract
This paper addresses the critical and underexplored challenge of long video understanding with low computational budgets. We propose LongVideo-R1, an active, reasoning-equipped multimodal large language model (MLLM) agent designed for efficient video context navigation, avoiding the redundancy of exhaustive search. At the core of LongVideo-R1 lies a reasoning module that leverages high-level visual cues to infer the most informative video clip for subsequent processing. During inference, the agent initiates traversal from top-level visual summaries and iteratively refines its focus, immediately halting the exploration process upon acquiring sufficient knowledge to answer the query. To facilitate training, we first extract hierarchical video captions from CGBench, a video corpus with grounding annotations, and guide GPT-5 to generate 33K high-quality chain-of-thought-with-tool trajectories. The LongVideo-R1 agent is fine-tuned upon the Qwen-3-8B model through a two-stage paradigm: supervised fine-tuning (SFT) followed by reinforcement learning (RL), where RL employs a specifically designed reward function to maximize selective and efficient clip navigation. Experiments on multiple long video benchmarks validate the effectiveness of name, which enjoys superior tradeoff between QA accuracy and efficiency. All curated data and source code are provided in the supplementary material and will be made publicly available. Code and data are available at: https://github.com/qiujihao19/LongVideo-R1
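The abstract says the RL stage employs "a specifically designed reward function to maximize selective and efficient clip navigation" but does not give its form. Purely as a hedged illustration of what such a reward could look like, and not the authors' actual design, one might combine answer correctness with a bonus for spending fewer clips of the exploration budget; the weights and the budget size below are assumptions.

```python
def navigation_reward(answer_correct: bool, clips_viewed: int,
                      max_clips: int = 8, correctness_weight: float = 1.0,
                      efficiency_weight: float = 0.2) -> float:
    """Hypothetical reward shaping for selective, efficient clip navigation.
    A correct final answer earns the base reward plus a bonus for leaving more
    of the exploration budget unused; an incorrect answer earns nothing, so the
    agent cannot game the efficiency term by stopping blindly early."""
    if not answer_correct:
        return 0.0
    efficiency = 1.0 - min(clips_viewed, max_clips) / max_clips
    return correctness_weight + efficiency_weight * efficiency
```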
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces LongVideo-R1, an active reasoning MLLM agent for efficient long-video understanding under low computational budgets. It starts from top-level visual summaries, uses a reasoning module to iteratively select the single most informative clip based on high-level cues, and halts exploration once sufficient information is obtained to answer the query. Training data consists of 33K chain-of-thought-with-tool trajectories generated by GPT-5 from hierarchical captions on CGBench; the Qwen-3-8B backbone is first SFT-tuned and then RL-tuned with a custom reward that encourages selective, efficient navigation. The central claim is that experiments on multiple long-video benchmarks demonstrate a superior accuracy-efficiency tradeoff.
Significance. If the navigation reliability and quantitative gains hold, the method could meaningfully reduce the compute cost of long-video QA by replacing exhaustive frame processing with targeted clip selection, offering a practical route toward deploying MLLMs on resource-limited hardware.
major comments (2)
- Abstract: the claim that experiments 'validate the effectiveness of LongVideo-R1' and deliver a 'superior tradeoff between QA accuracy and efficiency' is stated without any numerical results, benchmark names, baselines, or error bars. Because the central contribution is precisely this empirical tradeoff, the absence of visible supporting evidence in the abstract (and the lack of any quantitative section in the supplied text) leaves the primary claim without load-bearing support.
- Abstract / core method description: the reasoning module is asserted to 'leverage high-level visual cues to infer the most informative video clip.' No mechanism, ablation, or failure-case analysis is supplied to show that these cues remain sufficient when a query requires fine-grained visual or temporal detail absent from the top-level summary. If such cases exist, the agent may select the wrong clip or halt prematurely, directly eroding accuracy while the reported efficiency gain becomes illusory. This assumption is load-bearing for the accuracy-efficiency claim and requires concrete evidence or counter-example analysis.
minor comments (1)
- Abstract: 'validate the effectiveness of name' is a clear typographical error and should read 'validate the effectiveness of LongVideo-R1'.
Simulated Author's Rebuttal
We thank the referee for the constructive comments. We address each point below and outline planned revisions to improve clarity and support for our claims.
read point-by-point responses
-
Referee: Abstract: the claim that experiments 'validate the effectiveness of LongVideo-R1' and deliver a 'superior tradeoff between QA accuracy and efficiency' is stated without any numerical results, benchmark names, baselines, or error bars. Because the central contribution is precisely this empirical tradeoff, the absence of visible supporting evidence in the abstract (and the lack of any quantitative section in the supplied text) leaves the primary claim without load-bearing support.
Authors: We agree that the abstract should include concrete numerical support for the central accuracy-efficiency claim. The full manuscript contains quantitative results in the experiments section, including accuracy and efficiency metrics on long-video QA benchmarks with baseline comparisons. We will revise the abstract to summarize key numbers, benchmark names, and gains while preserving brevity. This directly addresses the concern without altering the underlying experiments. revision: yes
-
Referee: Abstract / core method description: the reasoning module is asserted to 'leverage high-level visual cues to infer the most informative video clip.' No mechanism, ablation, or failure-case analysis is supplied to show that these cues remain sufficient when a query requires fine-grained visual or temporal detail absent from the top-level summary. If such cases exist, the agent may select the wrong clip or halt prematurely, directly eroding accuracy while the reported efficiency gain becomes illusory. This assumption is load-bearing for the accuracy-efficiency claim and requires concrete evidence or counter-example analysis.
Authors: The manuscript details the reasoning module in Section 3, including its use of hierarchical captions for iterative clip selection and the RL reward that penalizes inefficient or premature stopping. We acknowledge that explicit ablations on cue sufficiency and failure-case analysis would strengthen the load-bearing assumption. In revision we will add a dedicated subsection with ablations on granularity levels and discussion of cases requiring fine-grained detail, showing how the stopping criterion and training mitigate errors. revision: partial
Circularity Check
No significant circularity in derivation chain
full rationale
The paper describes a training pipeline (hierarchical captions from CGBench used to generate 33K trajectories via GPT-5, followed by SFT then RL with an externally designed reward maximizing navigation efficiency) and reports empirical results on independent long-video benchmarks. No equations appear that equate any claimed prediction or performance metric to quantities fitted from the same data; no self-citations are invoked to establish uniqueness, ansatz, or load-bearing premises; and the central accuracy-efficiency tradeoff is presented as an observed experimental outcome rather than a definitional or self-referential reduction. The derivation therefore remains self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: High-level visual summaries contain sufficient cues for the reasoning module to infer the most informative video clips
- domain assumption: GPT-5 can generate high-quality chain-of-thought-with-tool trajectories from hierarchical captions
Reference graph
Works this paper leans on
- [1] Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. GPT-4 technical report. arXiv preprint arXiv:2303.08774, 2023.
- [2] Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2.5-VL technical report. arXiv preprint arXiv:2502.13923, 2025.
- [3] Guo Chen, Yicheng Liu, Yifei Huang, Yuping He, Baoqi Pei, Jilan Xu, Yali Wang, Tong Lu, and Limin Wang. CG-Bench: Clue-grounded question answering benchmark for long video understanding. arXiv preprint arXiv:2412.12075, 2024.
- [4] Joya Chen, Ziyun Zeng, Yiqi Lin, Wei Li, Zejun Ma, and Mike Zheng Shou. LiveCC: Learning video LLM with streaming speech transcription at scale. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 29083–29095, 2025.
- [5] Shimin Chen, Xiaohan Lan, Yitian Yuan, Zequn Jie, and Lin Ma. TimeMarker: A versatile video-LLM for long and short video understanding with superior temporal localization ability. arXiv preprint arXiv:2411.18211, 2024.
- [6] Zhe Chen, Weiyun Wang, Yue Cao, Yangzhou Liu, Zhangwei Gao, Erfei Cui, Jinguo Zhu, Shenglong Ye, Hao Tian, Zhaoyang Liu, et al. Expanding performance boundaries of open-source multimodal models with model, data, and test-time scaling. arXiv preprint arXiv:2412.05271, 2024.
- [7] Zhe Chen, Jiannan Wu, Wenhai Wang, Weijie Su, Guo Chen, Sen Xing, Muyan Zhong, Qinglong Zhang, Xizhou Zhu, Lewei Lu, et al. InternVL: Scaling up vision foundation models and aligning for generic visual-linguistic tasks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 24185–24198, 2024.
- [8] Rohan Choudhury, Guanglei Zhu, Sihan Liu, Koichiro Niinuma, Kris Kitani, and László Jeni. Don't look twice: Faster video transformers with run-length tokenization. Advances in Neural Information Processing Systems, 37:28127–28149, 2024.
- [9] Yue Fan, Xiaojian Ma, Rujie Wu, Yuntao Du, Jiaqi Li, Zhi Gao, and Qing Li. VideoAgent: A memory-augmented multimodal agent for video understanding. In European Conference on Computer Vision, pages 75–92, 2024.
- [10] Jiajun Fei, Dian Li, Zhidong Deng, Zekun Wang, Gang Liu, and Hui Wang. Video-CCAM: Enhancing video-language understanding with causal cross-attention masks for short and long videos. arXiv preprint arXiv:2408.14023, 2024.
- [11] Kaituo Feng, Kaixiong Gong, Bohao Li, Zonghao Guo, Yibing Wang, Tianshuo Peng, Junfei Wu, Xiaoying Zhang, Benyou Wang, and Xiangyu Yue. Video-R1: Reinforcing video reasoning in MLLMs. arXiv preprint arXiv:2503.21776, 2025.
- [12] Chaoyou Fu, Yuhan Dai, Yongdong Luo, Lei Li, Shuhuai Ren, Renrui Zhang, Zihan Wang, Chenyu Zhou, Yunhang Shen, Mengdan Zhang, et al. Video-MME: The first-ever comprehensive evaluation benchmark of multi-modal LLMs in video analysis. In Computer Vision and Pattern Recognition, pages 24108–24118, 2025.
- [13] Lishuai Gao, Yujie Zhong, Yingsen Zeng, Haoxian Tan, Dengjie Li, and Zheng Zhao. LinVT: Empower your image-level large language model to understand videos. arXiv preprint arXiv:2412.05185, 2024.
- [14] Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025.
- [15] Wenyi Hong, Wenmeng Yu, Xiaotao Gu, Guo Wang, Guobing Gan, Haomiao Tang, Jiale Cheng, Ji Qi, Junhui Ji, Lihang Pan, et al. GLM-4.1V-Thinking: Towards versatile multimodal reasoning with scalable reinforcement learning. arXiv e-prints, 2025.
- [16] Wenxuan Huang, Bohan Jia, Zijie Zhai, Shaosheng Cao, Zheyu Ye, Fei Zhao, Zhe Xu, Yao Hu, and Shaohui Lin. Vision-R1: Incentivizing reasoning capability in multimodal large language models. arXiv preprint arXiv:2503.06749, 2025.
- [17] Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, et al. GPT-4o system card. arXiv preprint arXiv:2410.21276, 2024.
- [18] Seon-Ho Lee, Jue Wang, Zhikang Zhang, David Fan, and Xinyu Li. Video token merging for long-form video understanding. arXiv preprint arXiv:2410.23782, 2024.
- [19] Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Peiyuan Zhang, Yanwei Li, Ziwei Liu, et al. LLaVA-OneVision: Easy visual task transfer. arXiv preprint arXiv:2408.03326, 2024.
- [20] Dongxu Li, Yudong Liu, Haoning Wu, Yue Wang, Zhiqi Shen, Bowen Qu, Xinyao Niu, Fan Zhou, Chengen Huang, Yanpeng Li, et al. Aria: An open multimodal native mixture-of-experts model. arXiv preprint arXiv:2410.05993, 2024.
- [21] Xirui Li, Chao Ma, Xiaokang Yang, and Ming-Hsuan Yang. VidToMe: Video token merging for zero-shot video editing. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7486–7495, 2024.
- [23] Xinhao Li, Yi Wang, Jiashuo Yu, Xiangyu Zeng, Yuhan Zhu, Haian Huang, Jianfei Gao, Kunchang Li, Yinan He, Chenting Wang, et al. VideoChat-Flash: Hierarchical compression for long-context video modeling. arXiv preprint arXiv:2501.00574, 2025.
- [24] Bin Lin, Yang Ye, Bin Zhu, Jiaxi Cui, Munan Ning, Peng Jin, and Li Yuan. Video-LLaVA: Learning united visual representation by alignment before projection. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 5971–5984, 2024.
- [25] Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, et al. DeepSeek-V3 technical report. arXiv preprint arXiv:2412.19437, 2024.
- [26] Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning, 2023.
- [27] Haotian Liu, Chunyuan Li, Yuheng Li, Bo Li, Yuanhan Zhang, Sheng Shen, and Yong Jae Lee. LLaVA-NeXT: Improved reasoning, OCR, and world knowledge, 2024.
- [28] Ye Liu, Kevin Qinghong Lin, Chang Wen Chen, and Mike Zheng Shou. VideoMind: A chain-of-LoRA agent for long video reasoning. arXiv preprint arXiv:2503.13444, 2025.
- [29] Zuyan Liu, Yuhao Dong, Ziwei Liu, Winston Hu, Jiwen Lu, and Yongming Rao. Oryx MLLM: On-demand spatial-temporal understanding at arbitrary resolution. arXiv preprint arXiv:2409.12961, 2024.
- [30] Muhammad Maaz, Hanoona Rasheed, Salman Khan, and Fahad Khan. Video-ChatGPT: Towards detailed video understanding via large vision and language models. In Annual Meeting of the Association for Computational Linguistics, pages 12585–12602, 2024.
- [31] Karttikeya Mangalam, Raiymbek Akshulakov, and Jitendra Malik. EgoSchema: A diagnostic benchmark for very long-form video language understanding. Advances in Neural Information Processing Systems, 36:46212–46244, 2023.
- [32]
- [33] Accessed: 2025-11-13.
- [34] Jihao Qiu, Yuan Zhang, Xi Tang, Lingxi Xie, Tianren Ma, Pengyu Yan, David Doermann, Qixiang Ye, and Yunjie Tian. Artemis: Towards referential understanding in complex videos. Advances in Neural Information Processing Systems, 37:114321–114347, 2024.
- [35] Xubin Ren, Lingrui Xu, Long Xia, Shuaiqiang Wang, Dawei Yin, and Chao Huang. VideoRAG: Retrieval-augmented generation with extreme long-context videos. arXiv preprint arXiv:2502.01549, 2025.
- [36] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
- [37] Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. DeepSeekMath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024.
- [38] Leqi Shen, Tianxiang Hao, Tao He, Sicheng Zhao, Yifeng Zhang, Pengzhang Liu, Yongjun Bao, and Guiguang Ding. TempMe: Video temporal token merging for efficient text-video retrieval. arXiv preprint arXiv:2409.01156, 2024.
- [39] Xiaoqian Shen, Yunyang Xiong, Changsheng Zhao, Lemeng Wu, Jun Chen, Chenchen Zhu, Zechun Liu, Fanyi Xiao, Balakrishnan Varadarajan, Florian Bordes, et al. LongVU: Spatiotemporal adaptive compression for long video-language understanding. arXiv preprint arXiv:2410.17434, 2024.
- [40] Yan Shu, Zheng Liu, Peitian Zhang, Minghao Qin, Junjie Zhou, Zhengyang Liang, Tiejun Huang, and Bo Zhao. Video-XL: Extra-long vision language model for hour-scale video understanding. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 26160–26169, 2025.
- [41] Enxin Song, Wenhao Chai, Tian Ye, Jenq-Neng Hwang, Xi Li, and Gaoang Wang. MovieChat+: Question-aware sparse memory for long video question answering. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2025.
- [42] Changli Tang, Yixuan Li, Yudong Yang, Jimin Zhuang, Guangzhi Sun, Wei Li, Zejun Ma, and Chao Zhang. video-SALMONN 2: Captioning-enhanced audio-visual large language models. arXiv preprint arXiv:2506.15220, 2025.
- [43] Xi Tang, Jihao Qiu, Lingxi Xie, Yunjie Tian, Jianbin Jiao, and Qixiang Ye. Adaptive keyframe sampling for long video understanding. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 29118–29128, 2025.
- [44] Gemini Team, Rohan Anil, Sebastian Borgeaud, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, Katie Millican, et al. Gemini: A family of highly capable multimodal models. arXiv preprint arXiv:2312.11805, 2023.
- [45] Gemini Team, Petko Georgiev, Ving Ian Lei, Ryan Burnell, Libin Bai, Anmol Gulati, Garrett Tanzer, Damien Vincent, Zhufeng Pan, Shibo Wang, et al. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context. arXiv preprint arXiv:2403.05530, 2024.
- [46] Shulin Tian, Ruiqi Wang, Hongming Guo, Penghao Wu, Yuhao Dong, Xiuying Wang, Jingkang Yang, Hao Zhang, Hongyuan Zhu, and Ziwei Liu. Ego-R1: Chain-of-tool-thought for ultra-long egocentric video reasoning. arXiv preprint arXiv:2506.13654, 2025.
- [47] Yunjie Tian, Tianren Ma, Lingxi Xie, Jihao Qiu, Xi Tang, Yuan Zhang, Jianbin Jiao, Qi Tian, and Qixiang Ye. ChatterBox: Multi-round multimodal referring and grounding. arXiv preprint arXiv:2401.13307, 2024.
- [48] Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023.
- [49] Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, et al. Qwen2-VL: Enhancing vision-language model's perception of the world at any resolution. arXiv preprint arXiv:2409.12191, 2024.
- [50] Weihan Wang, Zehai He, Wenyi Hong, Yean Cheng, Xiaohan Zhang, Ji Qi, Ming Ding, Xiaotao Gu, Shiyu Huang, Bin Xu, et al. LVBench: An extreme long video understanding benchmark. In International Conference on Computer Vision, pages 22958–22967, 2025.
- [51] Xiao Wang, Qingyi Si, Jianlong Wu, Shiyu Zhu, Li Cao, and Liqiang Nie. ReTaKe: Reducing temporal and knowledge redundancy for long video understanding. arXiv preprint arXiv:2412.20504, 2024.
- [52] Xiaohan Wang, Yuhui Zhang, Orr Zohar, and Serena Yeung-Levy. VideoAgent: Long-form video understanding with large language model as agent. In European Conference on Computer Vision, pages 58–76, 2024.
- [53] Xiao Wang, Qingyi Si, Shiyu Zhu, Jianlong Wu, Li Cao, and Liqiang Nie. AdaReTaKe: Adaptive redundancy reduction to perceive longer for video-language understanding. In Findings of the Association for Computational Linguistics: ACL 2025, pages 5417–5432, 2025.
- [54] Ziyang Wang, Shoubin Yu, Elias Stengel-Eskin, Jaehong Yoon, Feng Cheng, Gedas Bertasius, and Mohit Bansal. VideoTree: Adaptive tree-based video representation for LLM reasoning on long videos. In Computer Vision and Pattern Recognition, pages 3272–3283, 2025.
- [55] Haoning Wu, Dongxu Li, Bei Chen, and Junnan Li. LongVideoBench: A benchmark for long-context interleaved video-language understanding. Advances in Neural Information Processing Systems, 37:28828–28857, 2024.
- [56] An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025.
- [57] Zeyuan Yang, Delin Chen, Xueyang Yu, Maohao Shen, and Chuang Gan. VCA: Video curious agent for long video understanding. In International Conference on Computer Vision, pages 20168–20179, 2025.
- [58] Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, Yu Yue, Weinan Dai, Tiantian Fan, Gaohong Liu, Lingjun Liu, et al. DAPO: An open-source LLM reinforcement learning system at scale. arXiv preprint arXiv:2503.14476, 2025.
- [59] Huaying Yuan, Zheng Liu, Minghao Qin, Hongjin Qian, Yan Shu, Zhicheng Dou, Ji-Rong Wen, and Nicu Sebe. Memory-enhanced retrieval augmentation for long video understanding. arXiv preprint arXiv:2503.09149, 2025.
- [60] Yufei Zhan, Yousong Zhu, Shurong Zheng, Hongyin Zhao, Fan Yang, Ming Tang, and Jinqiao Wang. Vision-R1: Evolving human-free alignment in large vision-language models via vision-guided reinforcement learning. arXiv preprint arXiv:2503.18013, 2025.
- [61] Boqiang Zhang, Kehan Li, Zesen Cheng, Zhiqiang Hu, Yuqian Yuan, Guanzheng Chen, Sicong Leng, Yuming Jiang, Hang Zhang, Xin Li, et al. VideoLLaMA 3: Frontier multimodal foundation models for image and video understanding. arXiv preprint arXiv:2501.13106, 2025.
- [62] Ce Zhang, Yan-Bo Lin, Ziyang Wang, Mohit Bansal, and Gedas Bertasius. SiLVR: A simple language-based video reasoning framework. arXiv preprint arXiv:2505.24869, 2025.
- [63] Haoji Zhang, Yiqin Wang, Yansong Tang, Yong Liu, Jiashi Feng, Jifeng Dai, and Xiaojie Jin. Flash-VStream: Memory-based real-time understanding for long video streams. arXiv preprint arXiv:2406.08085, 2024.
- [64] Yuanhan Zhang, Jinming Wu, Wei Li, Bo Li, Zejun Ma, Ziwei Liu, and Chunyuan Li. LLaVA-Video: Video instruction tuning with synthetic data. arXiv preprint arXiv:2410.02713, 2024.
- [65] Yuzhong Zhao, Yue Liu, Junpeng Liu, Jingye Chen, Xun Wu, Yaru Hao, Tengchao Lv, Shaohan Huang, Lei Cui, Qixiang Ye, et al. Geometric-mean policy optimization. arXiv preprint arXiv:2507.20673, 2025.
- [66] Chujie Zheng, Shixuan Liu, Mingze Li, Xiong-Hui Chen, Bowen Yu, Chang Gao, Kai Dang, Yuqiong Liu, Rui Men, An Yang, et al. Group sequence policy optimization. arXiv preprint arXiv:2507.18071, 2025.
- [67] Junjie Zhou, Yan Shu, Bo Zhao, Boya Wu, Zhengyang Liang, Shitao Xiao, Minghao Qin, Xi Yang, Yongping Xiong, Bo Zhang, et al. MLVU: Benchmarking multi-task long video understanding. In Computer Vision and Pattern Recognition, pages 13691–13701, 2025.
Supplementary material fragments
The supplementary document provides details of the hierarchical video definition in A, the agent's reasoning-and-tool-usage prompt, and worked examples. The recoverable fragments are:
- High-level: the video is divided into width major segments.
- Medium-level: each High-level segment is further divided into width sub-segments.
- Low-level: each Medium-level segment is further divided into width finer sub-segments.
- The agent is asked a question about the video and initially receives only the High-level captions; its goal is to answer the question as accurately as possible.
- Reason first: before taking any action, the agent analyzes whether the captions it already holds are sufficient to answer the question.
- If sufficient: it provides the final answer inside <answer></answer> tags.
- If insufficient: it identifies which part(s) of the video might contain the needed information and calls a tool, e.g. <tool>get caption((high segment id, medium segment id, low segment id))</tool>, where each of the three IDs is an integer from 1 to width; a Medium-level caption is requested by providing (high segment id, medium segment id, ...). The prompt also names a videoqa tool and gives an example query ("what color is the person's shirt?").
- Restriction: in each reasoning round the agent may call only one tool (get caption or videoqa) once to obtain new information, and its reasoning and actions must follow an exact format with <think></think> tags for the internal reasoning.
The supplementary also shows ultra-long video examples (Figure 5) illustrating LongVideo-R1's ability to perform hierarchical search, disambiguate similar scenes across hours-long content, and jointly use high-level and fine-grained information, including cases from TV series such as Downton Abbey.
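As a purely illustrative sketch of how an orchestrator could enforce this prompt protocol (one tool call per round, early stop on an answer tag), the snippet below assumes tool names get_caption and videoqa, a model_call function, and the tag formats shown; none of this is the paper's released code.

```python
import re

# Hypothetical tag formats; the paper's exact syntax may differ.
ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)
TOOL_RE = re.compile(r"<tool>\s*(get_caption|videoqa)\((.*?)\)\s*</tool>", re.DOTALL)


def run_protocol(model_call, tools, question, high_level_captions, max_rounds=10):
    """Drive the reasoning loop: the model sees only High-level captions first,
    may issue at most one tool call per round, and stops once it emits an answer."""
    context = [f"Question: {question}", "High-level captions:", *high_level_captions]
    for _ in range(max_rounds):
        output = model_call("\n".join(context))
        answered = ANSWER_RE.search(output)
        if answered:                                   # early stop on final answer
            return answered.group(1).strip()
        call = TOOL_RE.search(output)
        if call is None:                               # protocol violation: no action taken
            context.append("Please call exactly one tool or answer.")
            continue
        name, args = call.group(1), call.group(2)
        result = tools[name](args)                     # e.g. tools["get_caption"]("(2, 3, 1)")
        context.append(f"Tool {name}({args}) returned: {result}")
    return None                                        # budget exhausted without an answer
```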