pith. machine review for the scientific record.

arxiv: 2602.20913 · v2 · submitted 2026-02-24 · 💻 cs.CV

Recognition: no theorem link

LongVideo-R1: Smart Navigation for Low-cost Long Video Understanding

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 19:57 UTC · model grok-4.3

classification 💻 cs.CV
keywords long video understanding · video navigation · multimodal agents · efficient inference · reasoning module · reinforcement learning · chain-of-thought trajectories

The pith

LongVideo-R1 uses a reasoning MLLM agent to navigate long videos by selecting only the most informative clips from high-level summaries.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces LongVideo-R1, an active multimodal large language model (MLLM) agent built to handle long video understanding under tight computational limits. Instead of processing every frame, the agent begins with top-level visual summaries and applies a reasoning module to iteratively locate the single most useful clip, halting as soon as it can answer the query. Training data consists of 33,000 synthetic chain-of-thought-with-tool trajectories generated from grounded video captions, and the model is fine-tuned from Qwen-3-8B, first through supervised learning and then through reinforcement learning that rewards efficient clip selection. Readers would care because conventional long-video models incur high costs from exhaustive processing, while this method claims to preserve question-answering performance at far lower expense.
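
The loop described above can be sketched in a few lines. This is a minimal reconstruction from the summary, not the authors' code; `reason`, `get_caption`, and `answer_from` are hypothetical stand-ins for the MLLM reasoning step and the hierarchical caption tools.

```python
# Minimal sketch of the navigation loop described above (a reconstruction from the summary,
# not the authors' implementation). Callers supply `reason`, `get_caption`, and
# `answer_from`, hypothetical stand-ins for the MLLM reasoning step and the caption tools.

def navigate_and_answer(question, high_level_captions, reason, get_caption, answer_from,
                        max_rounds=8):
    context = list(high_level_captions)        # start from top-level summaries only
    for _ in range(max_rounds):
        decision = reason(question, context)   # judge sufficiency and pick the next clip
        if decision.sufficient:                # early stop: answer as soon as evidence suffices
            return answer_from(question, context)
        # zoom into the single most promising clip and append its finer-grained caption
        context.append(get_caption(decision.high_id, decision.medium_id, decision.low_id))
    return answer_from(question, context)      # fall back once the round budget is spent
```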

Core claim

LongVideo-R1 is a reasoning-equipped MLLM agent that starts traversal from hierarchical video summaries, uses high-level visual cues to infer the most informative clip, and immediately stops exploration once sufficient knowledge is acquired to answer the query, trained via a two-stage SFT-then-RL process on 33K trajectories to maximize selective and efficient navigation.

What carries the argument

The reasoning module that infers the most informative video clip from high-level visual cues in top-level summaries, enabling iterative focus refinement and early halting of exploration.

If this is right

  • Long videos can be processed at substantially lower computational cost by skipping exhaustive frame-by-frame analysis.
  • Question-answering accuracy on long-video benchmarks remains competitive or superior while inference time and memory usage drop.
  • The two-stage training process produces agents that learn to make selective clip choices and stop early when knowledge is sufficient.
  • The same navigation pattern can be applied to new queries without retraining the core reasoning logic.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same early-stopping navigation idea could transfer to long documents or audio streams where only portions matter for a given query.
  • If top-level summaries omit subtle temporal cues, performance may degrade on tasks that need fine-grained details spread across the video.
  • Pairing the navigation agent with stronger base multimodal models would likely widen the accuracy-efficiency gap further.
  • The design illustrates a path toward multimodal agents that actively decide what to observe next based on partial understanding.

Load-bearing premise

High-level visual cues extracted from top-level summaries contain enough information for the reasoning module to reliably identify the single most informative clip without missing critical details needed for correct answers.

What would settle it

A test set of queries whose answers require details from multiple non-adjacent clips; if the agent’s chosen clip consistently omits one of those details and accuracy falls below exhaustive baselines, the navigation claim fails.
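
One way to operationalize this test is sketched below, under assumed data fields (`required_clips`, `visited_clip_ids`, and `gold_answer` are hypothetical names; the probe set itself would have to be built).

```python
# Sketch of the falsification probe: each query lists the non-adjacent clips its answer
# depends on; we check whether the agent's visited clips cover them and whether accuracy
# stays above the exhaustive baseline. Field names are assumptions, not from the paper.

def run_probe(agent, probe_set, exhaustive_accuracy):
    covered = correct = 0
    for item in probe_set:
        trace = agent.run(item.question)                           # navigation trajectory
        if set(item.required_clips) <= set(trace.visited_clip_ids):
            covered += 1
        if trace.answer == item.gold_answer:
            correct += 1
    coverage = covered / len(probe_set)
    accuracy = correct / len(probe_set)
    claim_fails = coverage < 1.0 and accuracy < exhaustive_accuracy
    return coverage, accuracy, claim_fails
```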

Figures

Figures reproduced from arXiv: 2602.20913 by Jihao Qiu, Lingxi Xie, Qi Tian, Qixiang Ye, Xinyue Huo.

Figure 1: Motivation and performance comparison. Left: For efficient understanding of long video, the algorithm shall learn to fetch and perceive information effectively, where the core abilities are: (1) judging whether collected information is sufficient for answering, and (2) if not, navigating to the next clip most likely to contain useful information. Drawing style was inspired by Ego-R1 [45]. Right: LongVideo-…

Figure 2: An illustration of generating CoTwT trajectories from …

Figure 3: LongVideo-R1 can navigate in ultra-long videos efficiently. We show an example in a long-form TV drama, …

Figure 4: An example of how LongVideo-R1 smartly navigates to the critical segment and answers the question.

Figure 5: More examples on ultra-long videos. Question: How many sticks does the protagonist put in the incense burner? <think>The question asks about the number of sticks the protagonist puts in the incense burner. This detail is likely in a scene involving lighting incense, which is mentioned in High-level Caption 2 between 635.0s and 654.2s …</think> <tool>VideoCap(2,1)</tool> <think>The medium-level caption (2,1…

Figure 6: More qualitative examples.

Figure 7: More qualitative examples. Question: What kind of transportation do the people carry on the boat? <think>The question asks about what kind of transportation people carry on the boat. The high-level captions mention boats in High-level Caption 6, but they don't specify what transportation is carried on the boat. To find this detail…</think> <tool>VideoCap(6,6)</tool> <think>The medium-level caption for (6…

Figure 8: LongVideo-R1 may sometimes be disturbed by similar information, but people can guide the model back on track with just a few …

Figure 9: LongVideo-R1 may sometimes be disturbed by similar information, but people can guide the model back on track with just a few …
read the original abstract

This paper addresses the critical and underexplored challenge of long video understanding with low computational budgets. We propose LongVideo-R1, an active, reasoning-equipped multimodal large language model (MLLM) agent designed for efficient video context navigation, avoiding the redundancy of exhaustive search. At the core of LongVideo-R1 lies a reasoning module that leverages high-level visual cues to infer the most informative video clip for subsequent processing. During inference, the agent initiates traversal from top-level visual summaries and iteratively refines its focus, immediately halting the exploration process upon acquiring sufficient knowledge to answer the query. To facilitate training, we first extract hierarchical video captions from CGBench, a video corpus with grounding annotations, and guide GPT-5 to generate 33K high-quality chain-of-thought-with-tool trajectories. The LongVideo-R1 agent is fine-tuned upon the Qwen-3-8B model through a two-stage paradigm: supervised fine-tuning (SFT) followed by reinforcement learning (RL), where RL employs a specifically designed reward function to maximize selective and efficient clip navigation. Experiments on multiple long video benchmarks validate the effectiveness of name, which enjoys superior tradeoff between QA accuracy and efficiency. All curated data and source code are provided in the supplementary material and will be made publicly available. Code and data are available at: https://github.com/qiujihao19/LongVideo-R1
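
The abstract says only that the RL stage uses "a specifically designed reward function to maximize selective and efficient clip navigation" without giving its form; a plausible shaping consistent with that description might look like the sketch below. The terms and coefficients are guesses, not the paper's reward.

```python
# Illustrative reward shaping for the RL stage. The paper states the goal (selective,
# efficient navigation) but not the exact formula; every term and coefficient here is
# an assumption for illustration only.

def trajectory_reward(answer_correct, num_tool_calls, format_ok,
                      step_cost=0.1, format_bonus=0.2):
    reward = 1.0 if answer_correct else 0.0   # accuracy term
    reward -= step_cost * num_tool_calls      # efficiency term: each clip fetch has a cost
    if format_ok:                             # well-formed think/tool/answer structure
        reward += format_bonus
    return reward
```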

Editorial analysis

A structured set of objections, weighed in public.

Referee report, simulated author's rebuttal, circularity audit, and axiom ledger. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces LongVideo-R1, an active reasoning MLLM agent for efficient long-video understanding under low computational budgets. It starts from top-level visual summaries, uses a reasoning module to iteratively select the single most informative clip based on high-level cues, and halts exploration once sufficient information is obtained to answer the query. Training data consists of 33K chain-of-thought-with-tool trajectories generated by GPT-5 from hierarchical captions on CGBench; the Qwen-3-8B backbone is first SFT-tuned and then RL-tuned with a custom reward that encourages selective, efficient navigation. The central claim is that experiments on multiple long-video benchmarks demonstrate a superior accuracy-efficiency tradeoff.

Significance. If the navigation reliability and quantitative gains hold, the method could meaningfully reduce the compute cost of long-video QA by replacing exhaustive frame processing with targeted clip selection, offering a practical route toward deploying MLLMs on resource-limited hardware.

major comments (2)
  1. Abstract: the claim that experiments 'validate the effectiveness of LongVideo-R1' and deliver a 'superior tradeoff between QA accuracy and efficiency' is stated without any numerical results, benchmark names, baselines, or error bars. Because the central contribution is precisely this empirical tradeoff, the absence of visible supporting evidence in the abstract (and the lack of any quantitative section in the supplied text) leaves the primary claim without load-bearing support.
  2. Abstract / core method description: the reasoning module is asserted to 'leverage high-level visual cues to infer the most informative video clip.' No mechanism, ablation, or failure-case analysis is supplied to show that these cues remain sufficient when a query requires fine-grained visual or temporal detail absent from the top-level summary. If such cases exist, the agent may select the wrong clip or halt prematurely, directly eroding accuracy while the reported efficiency gain becomes illusory. This assumption is load-bearing for the accuracy-efficiency claim and requires concrete evidence or counter-example analysis.
minor comments (1)
  1. Abstract: 'validate the effectiveness of name' is a clear typographical error and should read 'validate the effectiveness of LongVideo-R1'.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. We address each point below and outline planned revisions to improve clarity and support for our claims.

read point-by-point responses
  1. Referee: Abstract: the claim that experiments 'validate the effectiveness of LongVideo-R1' and deliver a 'superior tradeoff between QA accuracy and efficiency' is stated without any numerical results, benchmark names, baselines, or error bars. Because the central contribution is precisely this empirical tradeoff, the absence of visible supporting evidence in the abstract (and the lack of any quantitative section in the supplied text) leaves the primary claim without load-bearing support.

    Authors: We agree that the abstract should include concrete numerical support for the central accuracy-efficiency claim. The full manuscript contains quantitative results in the experiments section, including accuracy and efficiency metrics on long-video QA benchmarks with baseline comparisons. We will revise the abstract to summarize key numbers, benchmark names, and gains while preserving brevity. This directly addresses the concern without altering the underlying experiments. revision: yes

  2. Referee: Abstract / core method description: the reasoning module is asserted to 'leverage high-level visual cues to infer the most informative video clip.' No mechanism, ablation, or failure-case analysis is supplied to show that these cues remain sufficient when a query requires fine-grained visual or temporal detail absent from the top-level summary. If such cases exist, the agent may select the wrong clip or halt prematurely, directly eroding accuracy while the reported efficiency gain becomes illusory. This assumption is load-bearing for the accuracy-efficiency claim and requires concrete evidence or counter-example analysis.

    Authors: The manuscript details the reasoning module in Section 3, including its use of hierarchical captions for iterative clip selection and the RL reward that penalizes inefficient or premature stopping. We acknowledge that explicit ablations on cue sufficiency and failure-case analysis would strengthen the load-bearing assumption. In revision we will add a dedicated subsection with ablations on granularity levels and discussion of cases requiring fine-grained detail, showing how the stopping criterion and training mitigate errors. revision: partial

Circularity Check

0 steps flagged

No significant circularity in derivation chain

full rationale

The paper describes a training pipeline (hierarchical captions from CGBench used to generate 33K trajectories via GPT-5, followed by SFT then RL with an externally designed reward maximizing navigation efficiency) and reports empirical results on independent long-video benchmarks. No equations appear that equate any claimed prediction or performance metric to quantities fitted from the same data; no self-citations are invoked to establish uniqueness, ansatz, or load-bearing premises; and the central accuracy-efficiency tradeoff is presented as an observed experimental outcome rather than a definitional or self-referential reduction. The argument is therefore checked against external benchmarks rather than resting on a self-referential derivation.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on two domain assumptions: that MLLM reasoning over coarse visual summaries can accurately predict the most informative fine-grained clips, and that GPT-5-generated trajectories provide sufficiently high-quality supervision for the downstream RL stage. No free parameters or new invented entities are explicitly introduced.

axioms (2)
  • domain assumption: High-level visual summaries contain sufficient cues for the reasoning module to infer the most informative video clips.
    Invoked directly in the description of the iterative navigation process.
  • domain assumption: GPT-5 can generate high-quality chain-of-thought-with-tool trajectories from hierarchical captions.
    Used to create the 33K training trajectories from CGBench.
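
The second assumption is load-bearing for data quality. A rough sketch of how such a trajectory could be synthesized from a grounded QA item and hierarchical captions is given below; the paper's exact GPT-5 prompting protocol is not reproduced in this review, so the prompt, `call_gpt5`, and the filtering rule are assumptions.

```python
# Hypothetical trajectory synthesis from grounded captions (the paper's exact GPT-5
# protocol is not shown in this review; the prompt, call_gpt5, and the filter are guesses).

def synthesize_trajectory(question, gold_answer, grounded_span, high_level_captions, call_gpt5):
    prompt = (
        "You initially see only the High-level captions of a long video.\n"
        f"High-level captions: {high_level_captions}\n"
        f"Question: {question}\n"
        "Think step by step inside <think> tags, call get_caption(high, medium, low) to zoom "
        "into one clip per round, and finish with <answer>...</answer> once the evidence "
        f"suffices. The answer is grounded around {grounded_span}; the gold answer is {gold_answer}."
    )
    trace = call_gpt5(prompt)  # chain-of-thought-with-tool trajectory
    # keep only trajectories whose final answer matches the gold answer
    return trace if trace.final_answer == gold_answer else None
```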

pith-pipeline@v0.9.0 · 5558 in / 1287 out tokens · 56784 ms · 2026-05-15T19:57:56.971861+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

74 extracted references · 74 canonical work pages · 21 internal anchors

  1. [1]

    GPT-4 Technical Report

    Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. GPT-4 technical report. arXiv preprint arXiv:2303.08774, 2023.

  2. [2]

    Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2.5-VL technical report. arXiv preprint arXiv:2502.13923, 2025. 4, 6, 7

  3. [3]

    CG-Bench: Clue-grounded question answering benchmark for long video understanding

    Guo Chen, Yicheng Liu, Yifei Huang, Yuping He, Baoqi Pei, Jilan Xu, Yali Wang, Tong Lu, and Limin Wang. CG-Bench: Clue-grounded question answering benchmark for long video understanding. arXiv preprint arXiv:2412.12075, 2024.

  4. [4]

    LiveCC: Learning video LLM with streaming speech transcription at scale

    Joya Chen, Ziyun Zeng, Yiqi Lin, Wei Li, Zejun Ma, and Mike Zheng Shou. LiveCC: Learning video LLM with streaming speech transcription at scale. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 29083–29095, 2025. 7

  5. [5]

    TimeMarker: A versatile video-LLM for long and short video understanding with superior temporal localization ability

    Shimin Chen, Xiaohan Lan, Yitian Yuan, Zequn Jie, and Lin Ma. TimeMarker: A versatile video-LLM for long and short video understanding with superior temporal localization ability. arXiv preprint arXiv:2411.18211, 2024. 7

  6. [6]

    Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling

    Zhe Chen, Weiyun Wang, Yue Cao, Yangzhou Liu, Zhangwei Gao, Erfei Cui, Jinguo Zhu, Shenglong Ye, Hao Tian, Zhaoyang Liu, et al. Expanding performance boundaries of open-source multimodal models with model, data, and test-time scaling. arXiv preprint arXiv:2412.05271, 2024. 7

  7. [7]

    InternVL: Scaling up vision foundation models and aligning for generic visual-linguistic tasks

    Zhe Chen, Jiannan Wu, Wenhai Wang, Weijie Su, Guo Chen, Sen Xing, Muyan Zhong, Qinglong Zhang, Xizhou Zhu, Lewei Lu, et al. InternVL: Scaling up vision foundation models and aligning for generic visual-linguistic tasks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 24185–24198, 2024. 2

  8. [8]

    Don’t look twice: Faster video transformers with run-length tokenization

    Rohan Choudhury, Guanglei Zhu, Sihan Liu, Koichiro Niinuma, Kris Kitani, and László Jeni. Don’t look twice: Faster video transformers with run-length tokenization. Advances in Neural Information Processing Systems, 37:28127–28149, 2024.

  9. [9]

    VideoAgent: A memory-augmented multimodal agent for video understanding

    Yue Fan, Xiaojian Ma, Rujie Wu, Yuntao Du, Jiaqi Li, Zhi Gao, and Qing Li. VideoAgent: A memory-augmented multimodal agent for video understanding. In European Conference on Computer Vision, pages 75–92, 2024. 2

  10. [10]

    Video-CCAM: Enhancing video-language understanding with causal cross-attention masks for short and long videos

    Jiajun Fei, Dian Li, Zhidong Deng, Zekun Wang, Gang Liu, and Hui Wang. Video-CCAM: Enhancing video-language understanding with causal cross-attention masks for short and long videos. arXiv preprint arXiv:2408.14023, 2024. 7

  11. [11]

    Video-R1: Reinforcing Video Reasoning in MLLMs

    Kaituo Feng, Kaixiong Gong, Bohao Li, Zonghao Guo, Yibing Wang, Tianshuo Peng, Junfei Wu, Xiaoying Zhang, Benyou Wang, and Xiangyu Yue. Video-R1: Reinforcing video reasoning in MLLMs. arXiv preprint arXiv:2503.21776, 2025.

  12. [12]

    Video-mme: The first-ever comprehensive evaluation benchmark of multi-modal llms in video analysis

    Chaoyou Fu, Yuhan Dai, Yongdong Luo, Lei Li, Shuhuai Ren, Renrui Zhang, Zihan Wang, Chenyu Zhou, Yunhang Shen, Mengdan Zhang, et al. Video-MME: The first-ever comprehensive evaluation benchmark of multi-modal LLMs in video analysis. In Computer Vision and Pattern Recognition, pages 24108–24118, 2025. 2, 6, 7

  13. [13]

    LinVT: Empower your image-level large language model to understand videos

    Lishuai Gao, Yujie Zhong, Yingsen Zeng, Haoxian Tan, Dengjie Li, and Zheng Zhao. LinVT: Empower your image-level large language model to understand videos. arXiv preprint arXiv:2412.05185, 2024. 7

  14. [14]

    DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

    Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025. 2

  15. [15]

    GLM-4.1V-Thinking: Towards versatile multimodal reasoning with scalable reinforcement learning

    Wenyi Hong, Wenmeng Yu, Xiaotao Gu, Guo Wang, Guobing Gan, Haomiao Tang, Jiale Cheng, Ji Qi, Junhui Ji, Lihang Pan, et al. GLM-4.1V-Thinking: Towards versatile multimodal reasoning with scalable reinforcement learning. arXiv e-prints, 2025. 7

  16. [16]

    Vision-R1: Incentivizing Reasoning Capability in Multimodal Large Language Models

    Wenxuan Huang, Bohan Jia, Zijie Zhai, Shaosheng Cao, Zheyu Ye, Fei Zhao, Zhe Xu, Yao Hu, and Shaohui Lin. Vision-R1: Incentivizing reasoning capability in multimodal large language models. arXiv preprint arXiv:2503.06749, 2025.

  17. [17]

    GPT-4o System Card

    Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, et al. GPT-4o system card. arXiv preprint arXiv:2410.21276, 2024. 2, 7

  18. [18]

    Video token merging for long-form video understanding

    Seon-Ho Lee, Jue Wang, Zhikang Zhang, David Fan, and Xinyu Li. Video token merging for long-form video understanding. arXiv preprint arXiv:2410.23782, 2024. 2

  19. [19]

    LLaVA-OneVision: Easy Visual Task Transfer

    Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Peiyuan Zhang, Yanwei Li, Ziwei Liu, et al. LLaVA-OneVision: Easy visual task transfer. arXiv preprint arXiv:2408.03326, 2024. 7

  20. [20]

    Aria: An open multimodal native mixture-of-experts model

    Dongxu Li, Yudong Liu, Haoning Wu, Yue Wang, Zhiqi Shen, Bowen Qu, Xinyao Niu, Fan Zhou, Chengen Huang, Yanpeng Li, et al. Aria: An open multimodal native mixture-of-experts model. arXiv preprint arXiv:2410.05993, 2024. 7

  21. [21]

    Vidtome: Video token merging for zero-shot video editing

    Xirui Li, Chao Ma, Xiaokang Yang, and Ming-Hsuan Yang. VidToMe: Video token merging for zero-shot video editing. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7486–7495, 2024. 2

  22. [23]

    VideoChat-Flash: Hierarchical compression for long-context video modeling

    Xinhao Li, Yi Wang, Jiashuo Yu, Xiangyu Zeng, Yuhan Zhu, Haian Huang, Jianfei Gao, Kunchang Li, Yinan He, Chenting Wang, et al. VideoChat-Flash: Hierarchical compression for long-context video modeling. arXiv preprint arXiv:2501.00574, 2025. 3, 7

  23. [24]

    Video-LLaVA: Learning united visual representation by alignment before projection

    Bin Lin, Yang Ye, Bin Zhu, Jiaxi Cui, Munan Ning, Peng Jin, and Li Yuan. Video-LLaVA: Learning united visual representation by alignment before projection. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 5971–5984, 2024. 1, 2

  24. [25]

    DeepSeek-V3 Technical Report

    Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, et al. Deepseek-v3 technical report. arXiv preprint arXiv:2412.19437, 2024. 2

  25. [26]

    Visual instruction tuning, 2023

    Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning, 2023. 2

  26. [27]

    LLaVA-NeXT: Improved reasoning, OCR, and world knowledge

    Haotian Liu, Chunyuan Li, Yuheng Li, Bo Li, Yuanhan Zhang, Sheng Shen, and Yong Jae Lee. LLaVA-NeXT: Improved reasoning, OCR, and world knowledge, 2024. 2

  27. [28]

    VideoMind: A chain-of-LoRA agent for long video reasoning

    Ye Liu, Kevin Qinghong Lin, Chang Wen Chen, and Mike Zheng Shou. VideoMind: A chain-of-LoRA agent for long video reasoning. arXiv preprint arXiv:2503.13444, 2025.

  28. [29]

    Oryx MLLM: On-demand spatial-temporal understanding at arbitrary resolution

    Zuyan Liu, Yuhao Dong, Ziwei Liu, Winston Hu, Jiwen Lu, and Yongming Rao. Oryx MLLM: On-demand spatial-temporal understanding at arbitrary resolution. arXiv preprint arXiv:2409.12961, 2024. 7

  29. [30]

    Video-ChatGPT: Towards detailed video understanding via large vision and language models

    Muhammad Maaz, Hanoona Rasheed, Salman Khan, and Fahad Khan. Video-ChatGPT: Towards detailed video understanding via large vision and language models. In Annual Meeting of the Association for Computational Linguistics, pages 12585–12602, 2024. 1, 2

  30. [31]

    EgoSchema: A diagnostic benchmark for very long-form video language understanding

    Karttikeya Mangalam, Raiymbek Akshulakov, and Jitendra Malik. EgoSchema: A diagnostic benchmark for very long-form video language understanding. Advances in Neural Information Processing Systems, 36:46212–46244, 2023. 2

  31. [32]

    GPT-5 System Card

    OpenAI. GPT-5 system card. Technical report, OpenAI.

  32. [33]

    Accessed: 2025-11-13. 4

  33. [34]

    Artemis: Towards referential understanding in complex videos

    Jihao Qiu, Yuan Zhang, Xi Tang, Lingxi Xie, Tianren Ma, Pengyu Yan, David Doermann, Qixiang Ye, and Yunjie Tian. Artemis: Towards referential understanding in complex videos. Advances in Neural Information Processing Systems, 37:114321–114347, 2024. 2

  34. [35]

    VideoRAG: Retrieval-augmented generation with extreme long-context videos

    Xubin Ren, Lingrui Xu, Long Xia, Shuaiqiang Wang, Dawei Yin, and Chao Huang. VideoRAG: Retrieval-augmented generation with extreme long-context videos. arXiv preprint arXiv:2502.01549, 2025. 7

  35. [36]

    Proximal Policy Optimization Algorithms

    John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017. 3

  36. [37]

    DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

    Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. DeepSeekMath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024. 3, 5

  37. [38]

    TempMe: Video temporal token merging for efficient text-video retrieval

    Leqi Shen, Tianxiang Hao, Tao He, Sicheng Zhao, Yifeng Zhang, Pengzhang Liu, Yongjun Bao, and Guiguang Ding. TempMe: Video temporal token merging for efficient text-video retrieval. arXiv preprint arXiv:2409.01156, 2024. 2

  38. [39]

    LongVU: Spatiotemporal adaptive compression for long video-language understanding

    Xiaoqian Shen, Yunyang Xiong, Changsheng Zhao, Lemeng Wu, Jun Chen, Chenchen Zhu, Zechun Liu, Fanyi Xiao, Balakrishnan Varadarajan, Florian Bordes, et al. LongVU: Spatiotemporal adaptive compression for long video-language understanding. arXiv preprint arXiv:2410.17434, 2024. 2

  39. [40]

    Video-xl: Extra-long vision language model for hour-scale video understanding

    Yan Shu, Zheng Liu, Peitian Zhang, Minghao Qin, Junjie Zhou, Zhengyang Liang, Tiejun Huang, and Bo Zhao. Video-XL: Extra-long vision language model for hour-scale video understanding. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 26160–26169, 2025.

  40. [41]

    MovieChat+: Question-aware sparse memory for long video question answering

    Enxin Song, Wenhao Chai, Tian Ye, Jenq-Neng Hwang, Xi Li, and Gaoang Wang. MovieChat+: Question-aware sparse memory for long video question answering. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2025. 2

  41. [42]

    video-SALMONN 2: Captioning-enhanced audio-visual large language models

    Changli Tang, Yixuan Li, Yudong Yang, Jimin Zhuang, Guangzhi Sun, Wei Li, Zejun Ma, and Chao Zhang. video-SALMONN 2: Captioning-enhanced audio-visual large language models. arXiv preprint arXiv:2506.15220, 2025. 2

  42. [43]

    Adaptive keyframe sampling for long video understanding

    Xi Tang, Jihao Qiu, Lingxi Xie, Yunjie Tian, Jianbin Jiao, and Qixiang Ye. Adaptive keyframe sampling for long video understanding. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 29118–29128, 2025. 2, 6

  43. [44]

    Gemini: A Family of Highly Capable Multimodal Models

    Gemini Team, Rohan Anil, Sebastian Borgeaud, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, Katie Millican, et al. Gemini: A family of highly capable multimodal models. arXiv preprint arXiv:2312.11805, 2023. 2

  44. [45]

    Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context

    Gemini Team, Petko Georgiev, Ving Ian Lei, Ryan Burnell, Libin Bai, Anmol Gulati, Garrett Tanzer, Damien Vincent, Zhufeng Pan, Shibo Wang, et al. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context. arXiv preprint arXiv:2403.05530, 2024. 7

  45. [46]

    Ego-R1: Chain-of-tool-thought for ultra-long egocentric video reasoning

    Shulin Tian, Ruiqi Wang, Hongming Guo, Penghao Wu, Yuhao Dong, Xiuying Wang, Jingkang Yang, Hao Zhang, Hongyuan Zhu, and Ziwei Liu. Ego-R1: Chain-of-tool-thought for ultra-long egocentric video reasoning. arXiv preprint arXiv:2506.13654, 2025. 1, 2, 3, 4, 5, 6, 7, 8

  46. [47]

    ChatterBox: Multi-round multimodal referring and grounding

    Yunjie Tian, Tianren Ma, Lingxi Xie, Jihao Qiu, Xi Tang, Yuan Zhang, Jianbin Jiao, Qi Tian, and Qixiang Ye. ChatterBox: Multi-round multimodal referring and grounding. arXiv preprint arXiv:2401.13307, 2024. 2

  47. [48]

    LLaMA: Open and Efficient Foundation Language Models

    Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023. 2

  48. [49]

    Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution

    Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, et al. Qwen2-VL: Enhancing vision-language model’s perception of the world at any resolution. arXiv preprint arXiv:2409.12191, 2024. 2, 7

  49. [50]

    LVBench: An extreme long video understanding benchmark

    Weihan Wang, Zehai He, Wenyi Hong, Yean Cheng, Xiaohan Zhang, Ji Qi, Ming Ding, Xiaotao Gu, Shiyu Huang, Bin Xu, et al. LVBench: An extreme long video understanding benchmark. In International Conference on Computer Vision, pages 22958–22967, 2025. 1, 2, 6, 7

  50. [51]

    Retake: Reducing temporal and knowledge redundancy for long video understanding

    Xiao Wang, Qingyi Si, Jianlong Wu, Shiyu Zhu, Li Cao, and Liqiang Nie. Retake: Reducing temporal and knowledge redundancy for long video understanding. arXiv preprint arXiv:2412.20504, 2024. 7

  51. [52]

    Videoagent: Long-form video understanding with large language model as agent

    Xiaohan Wang, Yuhui Zhang, Orr Zohar, and Serena Yeung-Levy. VideoAgent: Long-form video understanding with large language model as agent. In European Conference on Computer Vision, pages 58–76, 2024. 2, 6, 7

  52. [53]

    Adaretake: Adaptive redundancy reduction to perceive longer for video-language understanding

    Xiao Wang, Qingyi Si, Shiyu Zhu, Jianlong Wu, Li Cao, and Liqiang Nie. Adaretake: Adaptive redundancy reduction to perceive longer for video-language understanding. In Findings of the Association for Computational Linguistics: ACL 2025, pages 5417–5432, 2025. 7

  53. [54]

    Videotree: Adaptive tree-based video representation for llm reasoning on long videos

    Ziyang Wang, Shoubin Yu, Elias Stengel-Eskin, Jaehong Yoon, Feng Cheng, Gedas Bertasius, and Mohit Bansal. VideoTree: Adaptive tree-based video representation for LLM reasoning on long videos. In Computer Vision and Pattern Recognition, pages 3272–3283, 2025. 2, 6, 7, 8

  54. [55]

    LongVideoBench: A benchmark for long-context interleaved video-language understanding

    Haoning Wu, Dongxu Li, Bei Chen, and Junnan Li. LongVideoBench: A benchmark for long-context interleaved video-language understanding. Advances in Neural Information Processing Systems, 37:28828–28857, 2024. 2

  55. [56]

    Qwen3 Technical Report

    An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025. 2, 6

  56. [57]

    VCA: Video curious agent for long video understanding

    Zeyuan Yang, Delin Chen, Xueyang Yu, Maohao Shen, and Chuang Gan. VCA: Video curious agent for long video understanding. In International Conference on Computer Vision, pages 20168–20179, 2025. 6, 7

  57. [58]

    DAPO: An Open-Source LLM Reinforcement Learning System at Scale

    Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, Yu Yue, Weinan Dai, Tiantian Fan, Gaohong Liu, Lingjun Liu, et al. DAPO: An open-source LLM reinforcement learning system at scale. arXiv preprint arXiv:2503.14476, 2025. 3

  58. [59]

    Memory-enhanced retrieval augmentation for long video understanding

    Huaying Yuan, Zheng Liu, Minghao Qin, Hongjin Qian, Yan Shu, Zhicheng Dou, Ji-Rong Wen, and Nicu Sebe. Memory-enhanced retrieval augmentation for long video understanding. arXiv preprint arXiv:2503.09149, 2025. 7

  59. [60]

    Vision-R1: Evolving human-free alignment in large vision-language models via vision-guided reinforcement learning

    Yufei Zhan, Yousong Zhu, Shurong Zheng, Hongyin Zhao, Fan Yang, Ming Tang, and Jinqiao Wang. Vision-R1: Evolving human-free alignment in large vision-language models via vision-guided reinforcement learning. arXiv preprint arXiv:2503.18013, 2025. 3

  60. [61]

    VideoLLaMA 3: Frontier Multimodal Foundation Models for Image and Video Understanding

    Boqiang Zhang, Kehan Li, Zesen Cheng, Zhiqiang Hu, Yuqian Yuan, Guanzheng Chen, Sicong Leng, Yuming Jiang, Hang Zhang, Xin Li, et al. VideoLLaMA 3: Frontier multimodal foundation models for image and video understanding. arXiv preprint arXiv:2501.13106, 2025. 7

  61. [62]

    SiLVR: A Simple Language-based Video Reasoning Framework

    Ce Zhang, Yan-Bo Lin, Ziyang Wang, Mohit Bansal, and Gedas Bertasius. SiLVR: A simple language-based video reasoning framework. arXiv preprint arXiv:2505.24869, 2025. 2

  62. [63]

    Flash-VStream: Memory-based real-time understanding for long video streams

    Haoji Zhang, Yiqin Wang, Yansong Tang, Yong Liu, Jiashi Feng, Jifeng Dai, and Xiaojie Jin. Flash-VStream: Memory-based real-time understanding for long video streams. arXiv preprint arXiv:2406.08085, 2024. 2

  63. [64]

    LLaVA-Video: Video Instruction Tuning With Synthetic Data

    Yuanhan Zhang, Jinming Wu, Wei Li, Bo Li, Zejun Ma, Ziwei Liu, and Chunyuan Li. Video instruction tuning with synthetic data. arXiv preprint arXiv:2410.02713, 2024. 7

  64. [65]

    Geometric-mean policy optimization

    Yuzhong Zhao, Yue Liu, Junpeng Liu, Jingye Chen, Xun Wu, Yaru Hao, Tengchao Lv, Shaohan Huang, Lei Cui, Qixiang Ye, et al. Geometric-mean policy optimization. arXiv preprint arXiv:2507.20673, 2025. 3

  65. [66]

    Group Sequence Policy Optimization

    Chujie Zheng, Shixuan Liu, Mingze Li, Xiong-Hui Chen, Bowen Yu, Chang Gao, Kai Dang, Yuqiong Liu, Rui Men, An Yang, et al. Group sequence policy optimization. arXiv preprint arXiv:2507.18071, 2025. 3

  66. [67]

    Mlvu: Benchmarking multi-task long video understanding

    Junjie Zhou, Yan Shu, Bo Zhao, Boya Wu, Zhengyang Liang, Shitao Xiao, Minghao Qin, Xi Yang, Yongping Xiong, Bo Zhang, et al. MLVU: Benchmarking multi-task long video understanding. In Computer Vision and Pattern Recognition, pages 13691–13701, 2025. 2, 6, 7

  Extracted items [68]–[75] are fragments of the paper's supplementary material rather than bibliographic references. They define the hierarchical caption scheme (the video is divided into a fixed number, width, of High-level segments; each High-level segment into width Medium-level sub-segments; and each Medium-level segment into width Low-level sub-segments) and the agent's protocol: the agent initially sees only the High-level captions, reasons about whether the information in hand suffices, answers inside ⟨answer⟩…⟨/answer⟩ tags when it does, and otherwise issues exactly one tool call per round, either ⟨tool⟩get caption((high segment id, medium segment id, low segment id))⟨/tool⟩ for a finer caption or videoqa for a targeted sub-question, with all reasoning wrapped in ⟨think⟩…⟨/think⟩ tags. The supplementary also walks through ultra-long-video examples (Figure 5), including TV series such as Downton Abbey, where the model performs hierarchical search, disambiguates similar scenes across hours-long content, and jointly uses high-level and fine-grained information.
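
A rough sketch of one reasoning round under that protocol follows. The tag names come from the supplementary prompt (rendered here in ASCII), while the parsing and tool dispatch are assumptions rather than the authors' code.

```python
import re

def run_round(model_output, get_caption):
    """One round of the supplementary protocol: answer if sufficient, otherwise one tool call."""
    answer = re.search(r"<answer>(.*?)</answer>", model_output, re.S)
    if answer:                                   # information judged sufficient: stop here
        return "answer", answer.group(1).strip()
    tool_call = re.search(r"<tool>(.*?)</tool>", model_output, re.S)
    if tool_call is None:                        # protocol violation: neither answer nor tool call
        return "error", "no tool call or answer emitted"
    # e.g. get caption((2, 1, 3)): pull out the segment ids and fetch the finer caption
    ids = [int(x) for x in re.findall(r"\d+", tool_call.group(1))]
    return "observe", get_caption(*ids)
```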