pith. machine review for the scientific record.

arxiv: 2602.20913 · v2 · submitted 2026-02-24 · 💻 cs.CV

Recognition: no theorem link

LongVideo-R1: Smart Navigation for Low-cost Long Video Understanding

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 19:57 UTC · model grok-4.3

classification 💻 cs.CV
keywords long video understanding · video navigation · multimodal agents · efficient inference · reasoning module · reinforcement learning · chain-of-thought trajectories

The pith

LongVideo-R1 uses a reasoning MLLM agent to navigate long videos by selecting only the most informative clips from high-level summaries.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces LongVideo-R1, an active multimodal large language model (MLLM) agent built to handle long video understanding under tight computational limits. Instead of processing every frame, the agent begins with top-level visual summaries and applies a reasoning module to iteratively locate the single most useful clip, halting as soon as it can answer the query. Training data consists of 33,000 synthetic chain-of-thought-with-tool trajectories generated from grounded video captions, and the model is fine-tuned from Qwen-3-8B, first through supervised learning and then through reinforcement learning that rewards efficient clip selection. Readers would care because conventional long-video models incur high costs from exhaustive processing, while this method claims to preserve question-answering performance at far lower expense.
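
The loop described above can be sketched in a few lines. This is a minimal reconstruction from the summary, not the authors' code; `reason`, `get_caption`, and `answer_from` are hypothetical stand-ins for the MLLM reasoning step and the hierarchical caption tools.

```python
# Minimal sketch of the navigation loop described above (a reconstruction from the summary,
# not the authors' implementation). Callers supply `reason`, `get_caption`, and
# `answer_from`, hypothetical stand-ins for the MLLM reasoning step and the caption tools.

def navigate_and_answer(question, high_level_captions, reason, get_caption, answer_from,
                        max_rounds=8):
    context = list(high_level_captions)        # start from top-level summaries only
    for _ in range(max_rounds):
        decision = reason(question, context)   # judge sufficiency and pick the next clip
        if decision.sufficient:                # early stop: answer as soon as evidence suffices
            return answer_from(question, context)
        # zoom into the single most promising clip and append its finer-grained caption
        context.append(get_caption(decision.high_id, decision.medium_id, decision.low_id))
    return answer_from(question, context)      # fall back once the round budget is spent
```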

Core claim

LongVideo-R1 is a reasoning-equipped MLLM agent that starts traversal from hierarchical video summaries, uses high-level visual cues to infer the most informative clip, and immediately stops exploration once sufficient knowledge is acquired to answer the query, trained via a two-stage SFT-then-RL process on 33K trajectories to maximize selective and efficient navigation.

What carries the argument

The reasoning module that infers the most informative video clip from high-level visual cues in top-level summaries, enabling iterative focus refinement and early halting of exploration.

If this is right

  • Long videos can be processed at substantially lower computational cost by skipping exhaustive frame-by-frame analysis.
  • Question-answering accuracy on long-video benchmarks remains competitive or superior while inference time and memory usage drop.
  • The two-stage training process produces agents that learn to make selective clip choices and stop early when knowledge is sufficient.
  • The same navigation pattern can be applied to new queries without retraining the core reasoning logic.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same early-stopping navigation idea could transfer to long documents or audio streams where only portions matter for a given query.
  • If top-level summaries omit subtle temporal cues, performance may degrade on tasks that need fine-grained details spread across the video.
  • Pairing the navigation agent with stronger base multimodal models would likely widen the accuracy-efficiency gap further.
  • The design illustrates a path toward multimodal agents that actively decide what to observe next based on partial understanding.

Load-bearing premise

High-level visual cues extracted from top-level summaries contain enough information for the reasoning module to reliably identify the single most informative clip without missing critical details needed for correct answers.

What would settle it

A test set of queries whose answers require details from multiple non-adjacent clips; if the agent’s chosen clip consistently omits one of those details and accuracy falls below exhaustive baselines, the navigation claim fails.
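
One way to operationalize this test is sketched below, under assumed data fields (`required_clips`, `visited_clip_ids`, and `gold_answer` are hypothetical names; the probe set itself would have to be built).

```python
# Sketch of the falsification probe: each query lists the non-adjacent clips its answer
# depends on; we check whether the agent's visited clips cover them and whether accuracy
# stays above the exhaustive baseline. Field names are assumptions, not from the paper.

def run_probe(agent, probe_set, exhaustive_accuracy):
    covered = correct = 0
    for item in probe_set:
        trace = agent.run(item.question)                           # navigation trajectory
        if set(item.required_clips) <= set(trace.visited_clip_ids):
            covered += 1
        if trace.answer == item.gold_answer:
            correct += 1
    coverage = covered / len(probe_set)
    accuracy = correct / len(probe_set)
    claim_fails = coverage < 1.0 and accuracy < exhaustive_accuracy
    return coverage, accuracy, claim_fails
```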

Figures

Figures reproduced from arXiv: 2602.20913 by Jihao Qiu, Lingxi Xie, Qi Tian, Qixiang Ye, Xinyue Huo.

Figure 1: Motivation and performance comparison. Left: For efficient understanding of long video, the algorithm shall learn to fetch and perceive information effectively, where the core abilities are: (1) judging whether collected information is sufficient for answering, and (2) if not, navigating to the next clip most likely to contain useful information. Drawing style was inspired by Ego-R1 [45]. Right: LongVideo-…

Figure 2: An illustration of generating CoTwT trajectories from …

Figure 3: LongVideo-R1 can navigate in ultra-long videos efficiently. We show an example in a long-form TV drama, …

Figure 4: An example of how LongVideo-R1 smartly navigates to the critical segment and answers the question.

Figure 5: More examples on ultra-long videos. Question: How many sticks does the protagonist put in the incense burner? <think>The question asks about the number of sticks the protagonist puts in the incense burner. This detail is likely in a scene involving lighting incense, which is mentioned in High-level Caption 2 between 635.0s and 654.2s …</think> <tool>VideoCap(2,1)</tool> <think>The medium-level caption (2,1…

Figure 6: More qualitative examples.

Figure 7: More qualitative examples. Question: What kind of transportation do the people carry on the boat? <think>The question asks about what kind of transportation people carry on the boat. The high-level captions mention boats in High-level Caption 6, but they don't specify what transportation is carried on the boat. To find this detail…</think> <tool>VideoCap(6,6)</tool> <think>The medium-level caption for (6…

Figure 8: LongVideo-R1 may sometimes be disturbed by similar information, but people can guide the model back on track with just a few …

Figure 9: LongVideo-R1 may sometimes be disturbed by similar information, but people can guide the model back on track with just a few …
read the original abstract

This paper addresses the critical and underexplored challenge of long video understanding with low computational budgets. We propose LongVideo-R1, an active, reasoning-equipped multimodal large language model (MLLM) agent designed for efficient video context navigation, avoiding the redundancy of exhaustive search. At the core of LongVideo-R1 lies a reasoning module that leverages high-level visual cues to infer the most informative video clip for subsequent processing. During inference, the agent initiates traversal from top-level visual summaries and iteratively refines its focus, immediately halting the exploration process upon acquiring sufficient knowledge to answer the query. To facilitate training, we first extract hierarchical video captions from CGBench, a video corpus with grounding annotations, and guide GPT-5 to generate 33K high-quality chain-of-thought-with-tool trajectories. The LongVideo-R1 agent is fine-tuned upon the Qwen-3-8B model through a two-stage paradigm: supervised fine-tuning (SFT) followed by reinforcement learning (RL), where RL employs a specifically designed reward function to maximize selective and efficient clip navigation. Experiments on multiple long video benchmarks validate the effectiveness of name, which enjoys superior tradeoff between QA accuracy and efficiency. All curated data and source code are provided in the supplementary material and will be made publicly available. Code and data are available at: https://github.com/qiujihao19/LongVideo-R1
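
The abstract says only that the RL stage uses "a specifically designed reward function to maximize selective and efficient clip navigation" without giving its form; a plausible shaping consistent with that description might look like the sketch below. The terms and coefficients are guesses, not the paper's reward.

```python
# Illustrative reward shaping for the RL stage. The paper states the goal (selective,
# efficient navigation) but not the exact formula; every term and coefficient here is
# an assumption for illustration only.

def trajectory_reward(answer_correct, num_tool_calls, format_ok,
                      step_cost=0.1, format_bonus=0.2):
    reward = 1.0 if answer_correct else 0.0   # accuracy term
    reward -= step_cost * num_tool_calls      # efficiency term: each clip fetch has a cost
    if format_ok:                             # well-formed think/tool/answer structure
        reward += format_bonus
    return reward
```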

Editorial analysis

A structured set of objections, weighed in public.

Referee report, simulated author's rebuttal, circularity audit, and axiom ledger. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces LongVideo-R1, an active reasoning MLLM agent for efficient long-video understanding under low computational budgets. It starts from top-level visual summaries, uses a reasoning module to iteratively select the single most informative clip based on high-level cues, and halts exploration once sufficient information is obtained to answer the query. Training data consists of 33K chain-of-thought-with-tool trajectories generated by GPT-5 from hierarchical captions on CGBench; the Qwen-3-8B backbone is first SFT-tuned and then RL-tuned with a custom reward that encourages selective, efficient navigation. The central claim is that experiments on multiple long-video benchmarks demonstrate a superior accuracy-efficiency tradeoff.

Significance. If the navigation reliability and quantitative gains hold, the method could meaningfully reduce the compute cost of long-video QA by replacing exhaustive frame processing with targeted clip selection, offering a practical route toward deploying MLLMs on resource-limited hardware.

major comments (2)
  1. Abstract: the claim that experiments 'validate the effectiveness of LongVideo-R1' and deliver a 'superior tradeoff between QA accuracy and efficiency' is stated without any numerical results, benchmark names, baselines, or error bars. Because the central contribution is precisely this empirical tradeoff, the absence of visible supporting evidence in the abstract (and the lack of any quantitative section in the supplied text) leaves the primary claim without load-bearing support.
  2. Abstract / core method description: the reasoning module is asserted to 'leverage high-level visual cues to infer the most informative video clip.' No mechanism, ablation, or failure-case analysis is supplied to show that these cues remain sufficient when a query requires fine-grained visual or temporal detail absent from the top-level summary. If such cases exist, the agent may select the wrong clip or halt prematurely, directly eroding accuracy while the reported efficiency gain becomes illusory. This assumption is load-bearing for the accuracy-efficiency claim and requires concrete evidence or counter-example analysis.
minor comments (1)
  1. Abstract: 'validate the effectiveness of name' is a clear typographical error and should read 'validate the effectiveness of LongVideo-R1'.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. We address each point below and outline planned revisions to improve clarity and support for our claims.

read point-by-point responses
  1. Referee: Abstract: the claim that experiments 'validate the effectiveness of LongVideo-R1' and deliver a 'superior tradeoff between QA accuracy and efficiency' is stated without any numerical results, benchmark names, baselines, or error bars. Because the central contribution is precisely this empirical tradeoff, the absence of visible supporting evidence in the abstract (and the lack of any quantitative section in the supplied text) leaves the primary claim without load-bearing support.

    Authors: We agree that the abstract should include concrete numerical support for the central accuracy-efficiency claim. The full manuscript contains quantitative results in the experiments section, including accuracy and efficiency metrics on long-video QA benchmarks with baseline comparisons. We will revise the abstract to summarize key numbers, benchmark names, and gains while preserving brevity. This directly addresses the concern without altering the underlying experiments. revision: yes

  2. Referee: Abstract / core method description: the reasoning module is asserted to 'leverage high-level visual cues to infer the most informative video clip.' No mechanism, ablation, or failure-case analysis is supplied to show that these cues remain sufficient when a query requires fine-grained visual or temporal detail absent from the top-level summary. If such cases exist, the agent may select the wrong clip or halt prematurely, directly eroding accuracy while the reported efficiency gain becomes illusory. This assumption is load-bearing for the accuracy-efficiency claim and requires concrete evidence or counter-example analysis.

    Authors: The manuscript details the reasoning module in Section 3, including its use of hierarchical captions for iterative clip selection and the RL reward that penalizes inefficient or premature stopping. We acknowledge that explicit ablations on cue sufficiency and failure-case analysis would strengthen the load-bearing assumption. In revision we will add a dedicated subsection with ablations on granularity levels and discussion of cases requiring fine-grained detail, showing how the stopping criterion and training mitigate errors. revision: partial

Circularity Check

0 steps flagged

No significant circularity in derivation chain

full rationale

The paper describes a training pipeline (hierarchical captions from CGBench used to generate 33K trajectories via GPT-5, followed by SFT then RL with an externally designed reward maximizing navigation efficiency) and reports empirical results on independent long-video benchmarks. No equations appear that equate any claimed prediction or performance metric to quantities fitted from the same data; no self-citations are invoked to establish uniqueness, ansatz, or load-bearing premises; and the central accuracy-efficiency tradeoff is presented as an observed experimental outcome rather than a definitional or self-referential reduction. The argument is therefore checked against external benchmarks rather than resting on a self-referential derivation.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on two domain assumptions: that MLLM reasoning over coarse visual summaries can accurately predict the most informative fine-grained clips, and that GPT-5-generated trajectories provide sufficiently high-quality supervision for the downstream RL stage. No free parameters or new invented entities are explicitly introduced.

axioms (2)
  • domain assumption: High-level visual summaries contain sufficient cues for the reasoning module to infer the most informative video clips.
    Invoked directly in the description of the iterative navigation process.
  • domain assumption: GPT-5 can generate high-quality chain-of-thought-with-tool trajectories from hierarchical captions.
    Used to create the 33K training trajectories from CGBench.
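
The second assumption is load-bearing for data quality. A rough sketch of how such a trajectory could be synthesized from a grounded QA item and hierarchical captions is given below; the paper's exact GPT-5 prompting protocol is not reproduced in this review, so the prompt, `call_gpt5`, and the filtering rule are assumptions.

```python
# Hypothetical trajectory synthesis from grounded captions (the paper's exact GPT-5
# protocol is not shown in this review; the prompt, call_gpt5, and the filter are guesses).

def synthesize_trajectory(question, gold_answer, grounded_span, high_level_captions, call_gpt5):
    prompt = (
        "You initially see only the High-level captions of a long video.\n"
        f"High-level captions: {high_level_captions}\n"
        f"Question: {question}\n"
        "Think step by step inside <think> tags, call get_caption(high, medium, low) to zoom "
        "into one clip per round, and finish with <answer>...</answer> once the evidence "
        f"suffices. The answer is grounded around {grounded_span}; the gold answer is {gold_answer}."
    )
    trace = call_gpt5(prompt)  # chain-of-thought-with-tool trajectory
    # keep only trajectories whose final answer matches the gold answer
    return trace if trace.final_answer == gold_answer else None
```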

pith-pipeline@v0.9.0 · 5558 in / 1287 out tokens · 56784 ms · 2026-05-15T19:57:56.971861+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

74 extracted references · 74 canonical work pages · 21 internal anchors

  1. [1]

    GPT-4 Technical Report

    Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. GPT-4 technical report. arXiv preprint arXiv:2303.08774, 2023.

  2. [2]

    Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2.5-VL technical report. arXiv preprint arXiv:2502.13923, 2025. 4, 6, 7

  3. [3]

    CG-Bench: Clue-grounded question answering benchmark for long video understanding

    Guo Chen, Yicheng Liu, Yifei Huang, Yuping He, Baoqi Pei, Jilan Xu, Yali Wang, Tong Lu, and Limin Wang. CG-Bench: Clue-grounded question answering benchmark for long video understanding. arXiv preprint arXiv:2412.12075, 2024.

  4. [4]

    LiveCC: Learning video LLM with streaming speech transcription at scale

    Joya Chen, Ziyun Zeng, Yiqi Lin, Wei Li, Zejun Ma, and Mike Zheng Shou. LiveCC: Learning video LLM with streaming speech transcription at scale. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 29083–29095, 2025. 7

  5. [5]

    TimeMarker: A versatile video-LLM for long and short video understanding with superior temporal localization ability

    Shimin Chen, Xiaohan Lan, Yitian Yuan, Zequn Jie, and Lin Ma. TimeMarker: A versatile video-LLM for long and short video understanding with superior temporal localization ability. arXiv preprint arXiv:2411.18211, 2024. 7

  6. [6]

    Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling

    Zhe Chen, Weiyun Wang, Yue Cao, Yangzhou Liu, Zhangwei Gao, Erfei Cui, Jinguo Zhu, Shenglong Ye, Hao Tian, Zhaoyang Liu, et al. Expanding performance boundaries of open-source multimodal models with model, data, and test-time scaling. arXiv preprint arXiv:2412.05271, 2024. 7

  7. [7]

    InternVL: Scaling up vision foundation models and aligning for generic visual-linguistic tasks

    Zhe Chen, Jiannan Wu, Wenhai Wang, Weijie Su, Guo Chen, Sen Xing, Muyan Zhong, Qinglong Zhang, Xizhou Zhu, Lewei Lu, et al. InternVL: Scaling up vision foundation models and aligning for generic visual-linguistic tasks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 24185–24198, 2024. 2

  8. [8]

    Don’t look twice: Faster video transformers with run-length tokenization

    Rohan Choudhury, Guanglei Zhu, Sihan Liu, Koichiro Niinuma, Kris Kitani, and László Jeni. Don’t look twice: Faster video transformers with run-length tokenization. Advances in Neural Information Processing Systems, 37:28127–28149, 2024.

  9. [9]

    VideoAgent: A memory-augmented multimodal agent for video understanding

    Yue Fan, Xiaojian Ma, Rujie Wu, Yuntao Du, Jiaqi Li, Zhi Gao, and Qing Li. VideoAgent: A memory-augmented multimodal agent for video understanding. In European Conference on Computer Vision, pages 75–92, 2024. 2

  10. [10]

    Video-CCAM: Enhancing video-language understanding with causal cross-attention masks for short and long videos

    Jiajun Fei, Dian Li, Zhidong Deng, Zekun Wang, Gang Liu, and Hui Wang. Video-CCAM: Enhancing video-language understanding with causal cross-attention masks for short and long videos. arXiv preprint arXiv:2408.14023, 2024. 7

  11. [11]

    Video-R1: Reinforcing Video Reasoning in MLLMs

    Kaituo Feng, Kaixiong Gong, Bohao Li, Zonghao Guo, Yibing Wang, Tianshuo Peng, Junfei Wu, Xiaoying Zhang, Benyou Wang, and Xiangyu Yue. Video-R1: Reinforcing video reasoning in MLLMs. arXiv preprint arXiv:2503.21776, 2025.

  12. [12]

    Video-mme: The first-ever comprehensive evaluation benchmark of multi-modal llms in video analysis

    Chaoyou Fu, Yuhan Dai, Yongdong Luo, Lei Li, Shuhuai Ren, Renrui Zhang, Zihan Wang, Chenyu Zhou, Yunhang Shen, Mengdan Zhang, et al. Video-MME: The first-ever comprehensive evaluation benchmark of multi-modal LLMs in video analysis. In Computer Vision and Pattern Recognition, pages 24108–24118, 2025. 2, 6, 7

  13. [13]

    LinVT: Empower your image-level large language model to understand videos

    Lishuai Gao, Yujie Zhong, Yingsen Zeng, Haoxian Tan, Dengjie Li, and Zheng Zhao. LinVT: Empower your image-level large language model to understand videos. arXiv preprint arXiv:2412.05185, 2024. 7

  14. [14]

    DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

    Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025. 2

  15. [15]

    GLM-4.1V-Thinking: Towards versatile multimodal reasoning with scalable reinforcement learning

    Wenyi Hong, Wenmeng Yu, Xiaotao Gu, Guo Wang, Guobing Gan, Haomiao Tang, Jiale Cheng, Ji Qi, Junhui Ji, Lihang Pan, et al. GLM-4.1V-Thinking: Towards versatile multimodal reasoning with scalable reinforcement learning. arXiv e-prints, 2025. 7

  16. [16]

    Vision-R1: Incentivizing Reasoning Capability in Multimodal Large Language Models

    Wenxuan Huang, Bohan Jia, Zijie Zhai, Shaosheng Cao, Zheyu Ye, Fei Zhao, Zhe Xu, Yao Hu, and Shaohui Lin. Vision-R1: Incentivizing reasoning capability in multimodal large language models. arXiv preprint arXiv:2503.06749, 2025.

  17. [17]

    GPT-4o System Card

    Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, et al. GPT-4o system card. arXiv preprint arXiv:2410.21276, 2024. 2, 7

  18. [18]

    Video token merging for long-form video understanding

    Seon-Ho Lee, Jue Wang, Zhikang Zhang, David Fan, and Xinyu Li. Video token merging for long-form video understanding. arXiv preprint arXiv:2410.23782, 2024. 2

  19. [19]

    LLaVA-OneVision: Easy Visual Task Transfer

    Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Peiyuan Zhang, Yanwei Li, Ziwei Liu, et al. LLaVA-OneVision: Easy visual task transfer. arXiv preprint arXiv:2408.03326, 2024. 7

  20. [20]

    Aria: An open multimodal native mixture-of-experts model

    Dongxu Li, Yudong Liu, Haoning Wu, Yue Wang, Zhiqi Shen, Bowen Qu, Xinyao Niu, Fan Zhou, Chengen Huang, Yanpeng Li, et al. Aria: An open multimodal native mixture-of-experts model. arXiv preprint arXiv:2410.05993, 2024. 7

  21. [21]

    Vidtome: Video token merging for zero-shot video editing

    Xirui Li, Chao Ma, Xiaokang Yang, and Ming-Hsuan Yang. VidToMe: Video token merging for zero-shot video editing. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7486–7495, 2024. 2

  22. [23]

    VideoChat-Flash: Hierarchical compression for long-context video modeling

    Xinhao Li, Yi Wang, Jiashuo Yu, Xiangyu Zeng, Yuhan Zhu, Haian Huang, Jianfei Gao, Kunchang Li, Yinan He, Chenting Wang, et al. VideoChat-Flash: Hierarchical compression for long-context video modeling. arXiv preprint arXiv:2501.00574, 2025. 3, 7

  23. [24]

    Video-LLaVA: Learning united visual representation by alignment before projection

    Bin Lin, Yang Ye, Bin Zhu, Jiaxi Cui, Munan Ning, Peng Jin, and Li Yuan. Video-LLaVA: Learning united visual representation by alignment before projection. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 5971–5984, 2024. 1, 2

  24. [25]

    DeepSeek-V3 Technical Report

    Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, et al. Deepseek-v3 technical report. arXiv preprint arXiv:2412.19437, 2024. 2

  25. [26]

    Visual instruction tuning, 2023

    Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning, 2023. 2

  26. [27]

    LLaVA-NeXT: Improved reasoning, OCR, and world knowledge

    Haotian Liu, Chunyuan Li, Yuheng Li, Bo Li, Yuanhan Zhang, Sheng Shen, and Yong Jae Lee. LLaVA-NeXT: Improved reasoning, OCR, and world knowledge, 2024. 2

  27. [28]

    VideoMind: A chain-of-LoRA agent for long video reasoning

    Ye Liu, Kevin Qinghong Lin, Chang Wen Chen, and Mike Zheng Shou. VideoMind: A chain-of-LoRA agent for long video reasoning. arXiv preprint arXiv:2503.13444, 2025.

  28. [29]

    Oryx MLLM: On-demand spatial-temporal understanding at arbitrary resolution

    Zuyan Liu, Yuhao Dong, Ziwei Liu, Winston Hu, Jiwen Lu, and Yongming Rao. Oryx MLLM: On-demand spatial-temporal understanding at arbitrary resolution. arXiv preprint arXiv:2409.12961, 2024. 7

  29. [30]

    Video-ChatGPT: Towards detailed video understanding via large vision and language models

    Muhammad Maaz, Hanoona Rasheed, Salman Khan, and Fahad Khan. Video-ChatGPT: Towards detailed video understanding via large vision and language models. In Annual Meeting of the Association for Computational Linguistics, pages 12585–12602, 2024. 1, 2

  30. [31]

    EgoSchema: A diagnostic benchmark for very long-form video language understanding

    Karttikeya Mangalam, Raiymbek Akshulakov, and Jitendra Malik. EgoSchema: A diagnostic benchmark for very long-form video language understanding. Advances in Neural Information Processing Systems, 36:46212–46244, 2023. 2

  31. [32]

    GPT-5 System Card

    OpenAI. GPT-5 system card. Technical report, OpenAI.

  32. [33]

    Accessed: 2025-11-13. 4

  33. [34]

    Artemis: Towards referential understanding in complex videos

    Jihao Qiu, Yuan Zhang, Xi Tang, Lingxi Xie, Tianren Ma, Pengyu Yan, David Doermann, Qixiang Ye, and Yunjie Tian. Artemis: Towards referential understanding in complex videos. Advances in Neural Information Processing Systems, 37:114321–114347, 2024. 2

  34. [35]

    VideoRAG: Retrieval-augmented generation with extreme long-context videos

    Xubin Ren, Lingrui Xu, Long Xia, Shuaiqiang Wang, Dawei Yin, and Chao Huang. VideoRAG: Retrieval-augmented generation with extreme long-context videos. arXiv preprint arXiv:2502.01549, 2025. 7

  35. [36]

    Proximal Policy Optimization Algorithms

    John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017. 3

  36. [37]

    DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

    Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. DeepSeekMath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024. 3, 5

  37. [38]

    TempMe: Video temporal token merging for efficient text-video retrieval

    Leqi Shen, Tianxiang Hao, Tao He, Sicheng Zhao, Yifeng Zhang, Pengzhang Liu, Yongjun Bao, and Guiguang Ding. TempMe: Video temporal token merging for efficient text-video retrieval. arXiv preprint arXiv:2409.01156, 2024. 2

  38. [39]

    LongVU: Spatiotemporal adaptive compression for long video-language understanding

    Xiaoqian Shen, Yunyang Xiong, Changsheng Zhao, Lemeng Wu, Jun Chen, Chenchen Zhu, Zechun Liu, Fanyi Xiao, Balakrishnan Varadarajan, Florian Bordes, et al. LongVU: Spatiotemporal adaptive compression for long video-language understanding. arXiv preprint arXiv:2410.17434, 2024. 2

  39. [40]

    Video-xl: Extra-long vision language model for hour-scale video understanding

    Yan Shu, Zheng Liu, Peitian Zhang, Minghao Qin, Junjie Zhou, Zhengyang Liang, Tiejun Huang, and Bo Zhao. Video-XL: Extra-long vision language model for hour-scale video understanding. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 26160–26169, 2025.

  40. [41]

    MovieChat+: Question-aware sparse memory for long video question answering

    Enxin Song, Wenhao Chai, Tian Ye, Jenq-Neng Hwang, Xi Li, and Gaoang Wang. MovieChat+: Question-aware sparse memory for long video question answering. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2025. 2

  41. [42]

    video-SALMONN 2: Captioning-enhanced audio-visual large language models

    Changli Tang, Yixuan Li, Yudong Yang, Jimin Zhuang, Guangzhi Sun, Wei Li, Zejun Ma, and Chao Zhang. video-SALMONN 2: Captioning-enhanced audio-visual large language models. arXiv preprint arXiv:2506.15220, 2025. 2

  42. [43]

    Adaptive keyframe sampling for long video understanding

    Xi Tang, Jihao Qiu, Lingxi Xie, Yunjie Tian, Jianbin Jiao, and Qixiang Ye. Adaptive keyframe sampling for long video understanding. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 29118–29128, 2025. 2, 6

  43. [44]

    Gemini: A Family of Highly Capable Multimodal Models

    Gemini Team, Rohan Anil, Sebastian Borgeaud, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, Katie Millican, et al. Gemini: A family of highly capable multimodal models. arXiv preprint arXiv:2312.11805, 2023. 2

  44. [45]

    Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context

    Gemini Team, Petko Georgiev, Ving Ian Lei, Ryan Burnell, Libin Bai, Anmol Gulati, Garrett Tanzer, Damien Vincent, Zhufeng Pan, Shibo Wang, et al. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context. arXiv preprint arXiv:2403.05530, 2024. 7

  45. [46]

    Ego-R1: Chain-of-tool-thought for ultra-long egocentric video reasoning

    Shulin Tian, Ruiqi Wang, Hongming Guo, Penghao Wu, Yuhao Dong, Xiuying Wang, Jingkang Yang, Hao Zhang, Hongyuan Zhu, and Ziwei Liu. Ego-R1: Chain-of-tool-thought for ultra-long egocentric video reasoning. arXiv preprint arXiv:2506.13654, 2025. 1, 2, 3, 4, 5, 6, 7, 8

  46. [47]

    ChatterBox: Multi-round multimodal referring and grounding

    Yunjie Tian, Tianren Ma, Lingxi Xie, Jihao Qiu, Xi Tang, Yuan Zhang, Jianbin Jiao, Qi Tian, and Qixiang Ye. ChatterBox: Multi-round multimodal referring and grounding. arXiv preprint arXiv:2401.13307, 2024. 2

  47. [48]

    LLaMA: Open and Efficient Foundation Language Models

    Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023. 2

  48. [49]

    Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution

    Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, et al. Qwen2-VL: Enhancing vision-language model’s perception of the world at any resolution. arXiv preprint arXiv:2409.12191, 2024. 2, 7

  49. [50]

    LVBench: An extreme long video understanding benchmark

    Weihan Wang, Zehai He, Wenyi Hong, Yean Cheng, Xiaohan Zhang, Ji Qi, Ming Ding, Xiaotao Gu, Shiyu Huang, Bin Xu, et al. LVBench: An extreme long video understanding benchmark. In International Conference on Computer Vision, pages 22958–22967, 2025. 1, 2, 6, 7

  50. [51]

    Retake: Reducing temporal and knowledge redundancy for long video understanding

    Xiao Wang, Qingyi Si, Jianlong Wu, Shiyu Zhu, Li Cao, and Liqiang Nie. Retake: Reducing temporal and knowledge redundancy for long video understanding. arXiv preprint arXiv:2412.20504, 2024. 7

  51. [52]

    Videoagent: Long-form video understanding with large language model as agent

    Xiaohan Wang, Yuhui Zhang, Orr Zohar, and Serena Yeung-Levy. VideoAgent: Long-form video understanding with large language model as agent. In European Conference on Computer Vision, pages 58–76, 2024. 2, 6, 7

  52. [53]

    Adaretake: Adaptive redundancy reduction to perceive longer for video-language understanding

    Xiao Wang, Qingyi Si, Shiyu Zhu, Jianlong Wu, Li Cao, and Liqiang Nie. Adaretake: Adaptive redundancy reduction to perceive longer for video-language understanding. In Findings of the Association for Computational Linguistics: ACL 2025, pages 5417–5432, 2025. 7

  53. [54]

    Videotree: Adaptive tree-based video representation for llm reasoning on long videos

    Ziyang Wang, Shoubin Yu, Elias Stengel-Eskin, Jaehong Yoon, Feng Cheng, Gedas Bertasius, and Mohit Bansal. VideoTree: Adaptive tree-based video representation for LLM reasoning on long videos. In Computer Vision and Pattern Recognition, pages 3272–3283, 2025. 2, 6, 7, 8

  54. [55]

    LongVideoBench: A benchmark for long-context interleaved video-language understanding

    Haoning Wu, Dongxu Li, Bei Chen, and Junnan Li. LongVideoBench: A benchmark for long-context interleaved video-language understanding. Advances in Neural Information Processing Systems, 37:28828–28857, 2024. 2

  55. [56]

    Qwen3 Technical Report

    An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025. 2, 6

  56. [57]

    VCA: Video curious agent for long video understanding

    Zeyuan Yang, Delin Chen, Xueyang Yu, Maohao Shen, and Chuang Gan. VCA: Video curious agent for long video understanding. In International Conference on Computer Vision, pages 20168–20179, 2025. 6, 7

  57. [58]

    DAPO: An Open-Source LLM Reinforcement Learning System at Scale

    Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, Yu Yue, Weinan Dai, Tiantian Fan, Gaohong Liu, Lingjun Liu, et al. DAPO: An open-source LLM reinforcement learning system at scale. arXiv preprint arXiv:2503.14476, 2025. 3

  58. [59]

    Memory-enhanced retrieval augmentation for long video understanding

    Huaying Yuan, Zheng Liu, Minghao Qin, Hongjin Qian, Yan Shu, Zhicheng Dou, Ji-Rong Wen, and Nicu Sebe. Memory-enhanced retrieval augmentation for long video understanding. arXiv preprint arXiv:2503.09149, 2025. 7

  59. [60]

    Vision-R1: Evolving human-free alignment in large vision-language models via vision-guided reinforcement learning

    Yufei Zhan, Yousong Zhu, Shurong Zheng, Hongyin Zhao, Fan Yang, Ming Tang, and Jinqiao Wang. Vision-R1: Evolving human-free alignment in large vision-language models via vision-guided reinforcement learning. arXiv preprint arXiv:2503.18013, 2025. 3

  60. [61]

    VideoLLaMA 3: Frontier Multimodal Foundation Models for Image and Video Understanding

    Boqiang Zhang, Kehan Li, Zesen Cheng, Zhiqiang Hu, Yuqian Yuan, Guanzheng Chen, Sicong Leng, Yuming Jiang, Hang Zhang, Xin Li, et al. VideoLLaMA 3: Frontier multimodal foundation models for image and video understanding. arXiv preprint arXiv:2501.13106, 2025. 7

  61. [62]

    SiLVR: A Simple Language-based Video Reasoning Framework

    Ce Zhang, Yan-Bo Lin, Ziyang Wang, Mohit Bansal, and Gedas Bertasius. SiLVR: A simple language-based video reasoning framework. arXiv preprint arXiv:2505.24869, 2025. 2

  62. [63]

    Flash-VStream: Memory-based real-time understanding for long video streams

    Haoji Zhang, Yiqin Wang, Yansong Tang, Yong Liu, Jiashi Feng, Jifeng Dai, and Xiaojie Jin. Flash-VStream: Memory-based real-time understanding for long video streams. arXiv preprint arXiv:2406.08085, 2024. 2

  63. [64]

    LLaVA-Video: Video Instruction Tuning With Synthetic Data

    Yuanhan Zhang, Jinming Wu, Wei Li, Bo Li, Zejun Ma, Ziwei Liu, and Chunyuan Li. Video instruction tuning with synthetic data. arXiv preprint arXiv:2410.02713, 2024. 7

  64. [65]

    Geometric-mean policy optimization

    Yuzhong Zhao, Yue Liu, Junpeng Liu, Jingye Chen, Xun Wu, Yaru Hao, Tengchao Lv, Shaohan Huang, Lei Cui, Qixiang Ye, et al. Geometric-mean policy optimization. arXiv preprint arXiv:2507.20673, 2025. 3

  65. [66]

    Group Sequence Policy Optimization

    Chujie Zheng, Shixuan Liu, Mingze Li, Xiong-Hui Chen, Bowen Yu, Chang Gao, Kai Dang, Yuqiong Liu, Rui Men, An Yang, et al. Group sequence policy optimization. arXiv preprint arXiv:2507.18071, 2025. 3

  66. [67]

    Mlvu: Benchmarking multi-task long video understanding

    Junjie Zhou, Yan Shu, Bo Zhao, Boya Wu, Zhengyang Liang, Shitao Xiao, Minghao Qin, Xi Yang, Yongping Xiong, Bo Zhang, et al. MLVU: Benchmarking multi-task long video understanding. In Computer Vision and Pattern Recognition, pages 13691–13701, 2025. 2, 6, 7

  Extracted items [68]–[75] are fragments of the paper's supplementary material rather than bibliographic references. They define the hierarchical caption scheme (the video is divided into a fixed number, width, of High-level segments; each High-level segment into width Medium-level sub-segments; and each Medium-level segment into width Low-level sub-segments) and the agent's protocol: the agent initially sees only the High-level captions, reasons about whether the information in hand suffices, answers inside ⟨answer⟩…⟨/answer⟩ tags when it does, and otherwise issues exactly one tool call per round, either ⟨tool⟩get caption((high segment id, medium segment id, low segment id))⟨/tool⟩ for a finer caption or videoqa for a targeted sub-question, with all reasoning wrapped in ⟨think⟩…⟨/think⟩ tags. The supplementary also walks through ultra-long-video examples (Figure 5), including TV series such as Downton Abbey, where the model performs hierarchical search, disambiguates similar scenes across hours-long content, and jointly uses high-level and fine-grained information.
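
A rough sketch of one reasoning round under that protocol follows. The tag names come from the supplementary prompt (rendered here in ASCII), while the parsing and tool dispatch are assumptions rather than the authors' code.

```python
import re

def run_round(model_output, get_caption):
    """One round of the supplementary protocol: answer if sufficient, otherwise one tool call."""
    answer = re.search(r"<answer>(.*?)</answer>", model_output, re.S)
    if answer:                                   # information judged sufficient: stop here
        return "answer", answer.group(1).strip()
    tool_call = re.search(r"<tool>(.*?)</tool>", model_output, re.S)
    if tool_call is None:                        # protocol violation: neither answer nor tool call
        return "error", "no tool call or answer emitted"
    # e.g. get caption((2, 1, 3)): pull out the segment ids and fetch the finer caption
    ids = [int(x) for x in re.findall(r"\d+", tool_call.group(1))]
    return "observe", get_caption(*ids)
```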