VideoTIR: Accurate Understanding for Long Videos with Efficient Tool-Integrated Reasoning

Dacheng Tao; Haotian Xu; Qi Fan; Shiyu Shen; Taifeng Chai; Weinong Wang; Wenbin Li; Xing W; Yang Gao; Zhe Gao

arxiv: 2603.25021 · v2 · submitted 2026-03-26 · 💻 cs.CV

Zhe Gao, Shiyu Shen, Taifeng Chai, Weinong Wang, Haotian Xu, Xing W, Wenbin Li, Qi Fan, Yang Gao, Dacheng Tao

Pith reviewed 2026-05-15 00:41 UTC · model grok-4.3

keywords: long video understanding · multimodal large language models · reinforcement learning · tool calling · video question answering · hallucination mitigation · policy optimization

The pith

VideoTIR trains multimodal models with reinforcement learning to call tools that isolate key segments in long videos.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes VideoTIR, a reinforcement learning approach that teaches multimodal large language models to use multi-level toolkits for selecting and focusing on relevant video parts rather than ingesting entire long sequences. This targets the token imbalance that leads to hallucinations when models process extended visual input. The method combines zero-shot RL and SFT cold starts, introduces Toolkit Action Grouped Policy Optimization for stepwise rewards and reuse of failed attempts, and relies on a sandbox to synthesize training trajectories. Experiments on three long-video question-answering benchmarks show gains in accuracy and efficiency compared with prior supervised fine-tuning methods that demand large amounts of fine-grained data.

Core claim

VideoTIR shows that reinforcement learning on sandbox-generated trajectories enables MLLMs to learn reliable tool-calling behavior for retrieving meaningful segments, images, and regions, thereby improving long-video understanding without the extensive high-quality annotation required by previous SFT-based tool-calling systems.
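The multi-turn behavior behind this claim (answer directly when possible, otherwise call a tool and feed its output into the next turn) can be illustrated with a minimal sketch. Everything here is hypothetical: `Step`, `Episode`, `run_episode`, the `clip` tool, and the turn cap are illustrative names and assumptions, not the paper's interface.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class Step:
    kind: str         # "answer" or "tool" (hypothetical action types)
    content: str      # answer text, or the name of the tool to call
    args: Tuple = ()  # tool arguments, e.g. (start_sec, end_sec)


@dataclass
class Episode:
    context: List[str] = field(default_factory=list)
    tool_calls: int = 0
    answer: str = ""


def run_episode(policy: Callable[[List[str]], Step],
                tools: Dict[str, Callable[..., str]],
                question: str,
                max_turns: int = 4) -> Episode:
    """Alternate reasoning and tool calls until an answer or the turn cap."""
    ep = Episode(context=[f"question: {question}"])
    for _ in range(max_turns):
        step = policy(ep.context)
        if step.kind == "answer":
            ep.answer = step.content
            return ep
        # The tool output is appended to the context for the next turn.
        observation = tools[step.content](*step.args)
        ep.context.append(f"{step.content}{step.args} -> {observation}")
        ep.tool_calls += 1
    return ep  # turn budget exhausted without an answer


# Toy stand-ins: a "policy" that requests one clip, then answers.
def toy_policy(context: List[str]) -> Step:
    if len(context) == 1:
        return Step("tool", "clip", (120, 150))
    return Step("answer", "B")


toy_tools = {"clip": lambda start, end: f"frames[{start}:{end}]"}
episode = run_episode(toy_policy, toy_tools, "What happens after the goal?")
```

In a real system the policy would be the MLLM and the tools would return visual tokens rather than strings; the sketch only shows the control flow.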

What carries the argument

Toolkit Action Grouped Policy Optimization (TAGPO) combined with sandbox-based trajectory synthesis, which supplies stepwise rewards and reuses failed rollouts to train efficient multi-level tool usage.
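Since the exact TAGPO formulation is not given here, the following is only a rough sketch of what a stepwise, group-relative advantage could look like. It assumes a GRPO-style normalization over all tool calls in a rollout group and averages the per-call advantages within each rollout, following the Figure 5 caption. The function name, the reward values, and the neutral treatment of rollouts without tool calls are assumptions, not the paper's definition.

```python
from statistics import mean, pstdev
from typing import List


def tool_action_advantages(group_rewards: List[List[float]],
                           eps: float = 1e-6) -> List[float]:
    """group_rewards[i][j] is the reward of the j-th tool call in rollout i.

    Returns one advantage per rollout: the mean of its per-call advantages,
    each normalized against all tool calls in the group (an assumption).
    """
    flat = [r for rollout in group_rewards for r in rollout]
    if not flat:
        return [0.0 for _ in group_rewards]
    mu, sigma = mean(flat), pstdev(flat)
    advantages = []
    for rollout in group_rewards:
        if not rollout:
            advantages.append(0.0)  # assumption: no tool call is neutral
            continue
        per_call = [(r - mu) / (sigma + eps) for r in rollout]
        advantages.append(mean(per_call))
    return advantages


# Hypothetical rewards: +1.0 for a useful call, -0.5 for a redundant one.
group = [[1.0], [1.0, -0.5], [-0.5, -0.5], []]
adv = tool_action_advantages(group)
```

With these toy rewards, the rollout with a single useful call ranks highest and the rollout with only redundant calls ranks lowest, which is the qualitative behavior a redundancy penalty is meant to produce.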

If this is right

Long-video question-answering accuracy improves because the model processes only selected segments instead of full sequences.
Computational cost drops as redundant visual tokens are avoided through learned tool calls.
The same RL setup can be applied to other toolkits that parse video at different granularities.
Models become less dependent on massive supervised datasets for tool-use behavior.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The approach may extend to streaming video by reusing the same reward structure for incremental tool decisions.
Sandbox trajectory generation could be adapted to create training data for tool use in other domains such as document or audio analysis.
If the learned policies transfer across different MLLM backbones, the method would reduce the need for per-model supervised fine-tuning.

Load-bearing premise

Reinforcement learning on trajectories produced inside a sandbox will yield stable tool-calling policies without the need for large volumes of human-curated fine-grained data.

What would settle it

A direct comparison on the same long-video QA benchmarks where VideoTIR models produce more hallucinations or more redundant tool calls than strong SFT baselines would falsify the central claim.

Figures

Figures reproduced from arXiv: 2603.25021 by Dacheng Tao, Haotian Xu, Qi Fan, Shiyu Shen, Taifeng Chai, Weinong Wang, Wenbin Li, Xing W, Yang Gao, Zhe Gao.

**Figure 1.** We propose VideoTIR, a tool-integrated reasoning framework that flexibly and hierarchically retrieves relevant video segments through endogenous tool invocation to support long-video understanding. Furthermore, to enable SFT cold start, we introduce a sandbox-based trajectory synthesis framework. We also present TAGPO to address the inefficiency in early-stage RL exploration caused by tool misuse and over…

**Figure 2.** Framework of our methods. VideoTIR handles the user's input video and question over multiple turns. When the model cannot reach an answer from the current visual information, it calls tools to retrieve the missing visual cues, which are combined with the prior context as input to the next reasoning turn.

**Figure 3.** Comparison of tool-integrated reasoning (TIR) designs for video understanding. (a) Methods such as [38, 42] adopt a paradigm in which the VLM outputs timestamps in text form for subsequent video clipping. (b) Alternatively, some methods rely on heavyweight external tools, incurring substantial interaction costs. In contrast, VideoTIR leverages the intrinsic encoding structure of the VLM to design interna…

**Figure 4.** Hierarchical Visual Toolkits containing both Global and Local Tools. When more information is needed, the textual router calls global-level browsing tools for general questions and detail-level tools for questions that require finer perception of the video.

**Figure 5.** Visualization of Tool Action Advantage. We define a reward for each tool-calling action that penalizes redundancy. The toolkit action advantage is the average of the tool advantages.

**Figure 6.** Framework of Data Synthesis. For video-text grounding datasets, we first convert them into QA datasets. Then, for easy questions, we synthesize trajectories that answer directly with no tool calls. Hard questions that the model answers incorrectly are processed through a sandbox to generate tool-calling trajectories. Finally, a large LLM is used to judge the rationality of the trajectories and we only …

**Figure 7.** Distribution and Task Range of Curated Datasets (VideoSIAM is not included). We selected 4 general tasks that potentially need tools for finer perception. We also synthesize high-quality trajectories as the SFT dataset for cold-starting the 3B model.

**Figure 8.** Training dynamics analysis. (a) At early stages, response length increases while format quality drops, indicating the model prioritizes rational tool exploration. Once response length stabilizes, format quality gradually improves, suggesting a balance between rationality and formality. (b) TAGPO accelerates valid tool learning. The valid tool reward rises significantly faster than episode-level GRPO, reduc…
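The data-synthesis flow described in the Figure 6 caption (answer easy questions directly, route hard ones through a sandbox, then filter with a judge) could be sketched roughly as follows. All names (`synthesize_trajectories`, `base_answer`, `sandbox_rollout`, `judge`) are hypothetical placeholders, and applying the judge only to sandbox trajectories is an assumption, since the caption is truncated at that point.

```python
from typing import Callable, Dict, List, Optional

Sample = Dict[str, str]
Trajectory = Dict[str, object]


def synthesize_trajectories(
    samples: List[Sample],
    base_answer: Callable[[str], str],
    sandbox_rollout: Callable[[Sample], Optional[Trajectory]],
    judge: Callable[[Trajectory], bool],
) -> List[Trajectory]:
    """Route each QA sample to a direct-answer or sandbox trajectory."""
    kept: List[Trajectory] = []
    for sample in samples:
        if base_answer(sample["question"]) == sample["answer"]:
            # Easy question: keep a direct answer with no tool calls.
            kept.append({"question": sample["question"],
                         "tool_calls": [],
                         "answer": sample["answer"]})
            continue
        # Hard question: generate a tool-calling trajectory in the sandbox,
        # then keep it only if the judge accepts it (assumed filter rule).
        traj = sandbox_rollout(sample)
        if traj is not None and judge(traj):
            kept.append(traj)
    return kept


# Toy stand-ins for the model, the sandbox, and the LLM judge.
samples = [
    {"question": "q_easy", "answer": "A"},
    {"question": "q_hard", "answer": "C"},
    {"question": "q_bad", "answer": "D"},
]


def toy_base(question: str) -> str:
    return "A"  # the toy model always answers "A"


def toy_sandbox(sample: Sample) -> Trajectory:
    calls = ["clip"] if sample["question"] == "q_hard" else ["clip"] * 6
    return {"question": sample["question"],
            "tool_calls": calls,
            "answer": sample["answer"]}


def toy_judge(traj: Trajectory) -> bool:
    return len(traj["tool_calls"]) <= 3  # reject overly long call chains


dataset = synthesize_trajectories(samples, toy_base, toy_sandbox, toy_judge)
```

The toy judge rejects the third sample because its trajectory uses six tool calls, standing in for whatever rationality criterion the actual LLM judge applies.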

Original abstract

Existing Multimodal Large Language Models (MLLMs) often suffer from hallucinations in long video understanding (LVU), primarily due to the imbalance between textual and visual tokens. Observing that MLLMs handle short visual inputs well, recent LVU works alleviate hallucinations by automatically parsing the vast visual data into manageable segments that can be effectively processed by MLLMs. SFT-based tool-calling methods can serve this purpose, but they typically require vast amounts of fine-grained, high-quality data and suffer from constrained tool-calling trajectories. We propose a novel VideoTIR that leverages Reinforcement Learning (RL) to encourage proper usage of comprehensive multi-level toolkits for efficient long video understanding. VideoTIR explores both Zero-RL and SFT cold-starting to enable MLLMs to retrieve and focus on meaningful video segments/images/regions, enhancing long video understanding both accurately and efficiently. To reduce redundant tool-calling, we propose Toolkit Action Grouped Policy Optimization (TAGPO), which enhances the efficiency of the calling process through stepwise reward assignment and reuse of failed rollouts. Additionally, we develop a sandbox-based trajectory synthesis framework to generate high-quality trajectories data. Extensive experiments on three long-video QA benchmarks demonstrate the effectiveness and efficiency of our method.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper proposes VideoTIR, an RL-based framework (exploring both Zero-RL and SFT cold-start) that trains MLLMs to use multi-level toolkits for long video understanding. It introduces Toolkit Action Grouped Policy Optimization (TAGPO) for stepwise reward assignment and reuse of failed rollouts to reduce redundant calls, plus a sandbox-based trajectory synthesis method to generate training data without relying on vast fine-grained SFT corpora. Experiments on three long-video QA benchmarks are reported to show gains in accuracy and efficiency over prior SFT tool-calling approaches.

Significance. If the central claims hold, the work would offer a data-efficient alternative to SFT for tool-integrated LVU, potentially lowering the barrier to reliable multi-level tool use in MLLMs while addressing token imbalance and hallucinations. The combination of RL with sandbox trajectories and grouped policy optimization could influence scalable video reasoning pipelines.

major comments (3)

§3.3 (TAGPO): The description of stepwise reward assignment and failed-rollout reuse is given at a high level, but the exact reward formulation (e.g., how efficiency penalties are balanced against accuracy) and the mathematical definition of the grouped policy update are not provided. Without these, it is impossible to verify whether TAGPO avoids the circularity of rewarding tool use that the sandbox already biases toward.
§3.4 (sandbox trajectory synthesis): The framework is claimed to produce high-quality, diverse trajectories that enable RL to outperform SFT data requirements. However, no quantitative characterization of trajectory diversity (e.g., coverage of video lengths, tool-call distributions, or failure modes) or ablation on sandbox curation choices is reported. This directly bears on the data-efficiency claim and generalization to the three QA benchmarks.
§4.2 (experimental results): The reported improvements over SFT baselines are presented without disclosing the total number of sandbox-generated trajectories versus the scale of fine-grained SFT data used in comparators. This omission prevents assessment of whether the RL advantage is genuine or an artifact of unequal data budgets.
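As one illustration of the diversity statistics requested for the sandbox trajectories, the sketch below computes a tool-call distribution, its entropy, and video-length bucket counts. The trajectory schema (`duration_sec`, `tool_calls`), the tool names, and the bucket edges are assumptions chosen for the example, not values from the paper.

```python
import math
from collections import Counter
from typing import Dict, List, Sequence


def tool_call_distribution(trajectories: List[Dict]) -> Dict[str, float]:
    """Fraction of all tool calls that go to each tool."""
    counts = Counter(call for t in trajectories for call in t["tool_calls"])
    total = sum(counts.values())
    return {tool: n / total for tool, n in counts.items()} if total else {}


def entropy_bits(dist: Dict[str, float]) -> float:
    """Shannon entropy of a distribution, in bits."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)


def length_bucket_counts(trajectories: List[Dict],
                         edges_sec: Sequence[int] = (60, 600, 3600)) -> Counter:
    """Count trajectories per video-duration bucket (edges are assumptions)."""
    def bucket(duration: float) -> str:
        for edge in edges_sec:
            if duration < edge:
                return f"<{edge}s"
        return f">={edges_sec[-1]}s"
    return Counter(bucket(t["duration_sec"]) for t in trajectories)


# Toy trajectories with hypothetical tool names.
toy = [
    {"duration_sec": 45, "tool_calls": ["clip", "clip"]},
    {"duration_sec": 900, "tool_calls": ["clip", "zoom"]},
    {"duration_sec": 4000, "tool_calls": []},
]
dist = tool_call_distribution(toy)
spread = entropy_bits(dist)
buckets = length_bucket_counts(toy)
```

A low entropy would indicate that trajectories lean on a single tool, and empty duration buckets would expose gaps in video-length coverage.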

minor comments (2)

§1: The abstract and §1 use “comprehensive multi-level toolkits” without an early enumeration or diagram of the exact tool hierarchy (segment, image, region levels).
Figure 2 (method overview): The figure would benefit from explicit arrows showing how TAGPO reuses failed rollouts in the training loop.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments and the recommendation for major revision. We address each major comment below and will incorporate the requested details and analyses into the revised manuscript.

Referee: §3.3 (TAGPO): The description of stepwise reward assignment and failed-rollout reuse is given at a high level, but the exact reward formulation (e.g., how efficiency penalties are balanced against accuracy) and the mathematical definition of the grouped policy update are not provided. Without these, it is impossible to verify whether TAGPO avoids the circularity of rewarding tool use that the sandbox already biases toward.

Authors: We thank the referee for this observation. In the revised manuscript we will expand §3.3 with the exact reward formulation (accuracy term plus explicit efficiency penalty) and the full mathematical definition of the grouped policy update used by TAGPO. On the circularity concern, the sandbox is used only to synthesize initial trajectories; the subsequent RL stage with TAGPO optimizes the policy to reduce redundant calls, which is reflected in the measured efficiency gains. We will add a short clarifying paragraph on this distinction. revision: yes
Referee: §3.4 (sandbox trajectory synthesis): The framework is claimed to produce high-quality, diverse trajectories that enable RL to outperform SFT data requirements. However, no quantitative characterization of trajectory diversity (e.g., coverage of video lengths, tool-call distributions, or failure modes) or ablation on sandbox curation choices is reported. This directly bears on the data-efficiency claim and generalization to the three QA benchmarks.

Authors: We agree that quantitative characterization is needed. In the revision we will add statistics on trajectory diversity (video-length coverage, tool-call distributions, and failure-mode coverage) together with an ablation study on sandbox curation choices and their effect on final performance across the three benchmarks. revision: yes
Referee: §4.2 (experimental results): The reported improvements over SFT baselines are presented without disclosing the total number of sandbox-generated trajectories versus the scale of fine-grained SFT data used in comparators. This omission prevents assessment of whether the RL advantage is genuine or an artifact of unequal data budgets.

Authors: We acknowledge the omission. In the revised §4.2 we will report the exact total number of sandbox-generated trajectories used for RL training and provide a direct comparison with the data scale of the SFT baselines, allowing readers to evaluate the data-efficiency claim. revision: yes

Circularity Check

0 steps flagged

No circularity: method proposal builds on external RL/tool ideas without self-referential reduction

The paper proposes VideoTIR as a new RL-driven framework (Zero-RL and SFT cold-start variants) plus TAGPO and sandbox trajectory synthesis for long-video tool calling. No equations, fitted parameters renamed as predictions, or self-citation chains appear in the derivation; the central claims rest on empirical results across three QA benchmarks rather than reducing by construction to the inputs. The approach extends existing MLLM and RL concepts without self-definitional loops or ansatz smuggling.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract provides no explicit free parameters, axioms, or invented entities; the method description relies on standard RL concepts and tool-calling without detailing any new fitted constants or unproven assumptions.

Reference graph

Works this paper leans on

49 extracted references · 49 canonical work pages

[1] Ahmed, I., Islam, S., Datta, P.P., Kabir, I., Chowdhury, N.U.R., Haque, A.: Qwen 2.5: A comprehensive review of the leading resource-efficient llm with potentioal to surpass all competitors. Authorea Preprints (2025)
[2] Alayrac, J.B., Donahue, J., Luc, P., Miech, A., Barr, I., Hasson, Y., Lenc, K., Mensch, A., Millican, K., Reynolds, M., et al.: Flamingo: a visual language model for few-shot learning. NeurIPS (2022)
[3] Chen, B., Yue, Z., Chen, S., Wang, Z., Liu, Y., Li, P., Wang, Y.: Lvagent: Long video understanding by multi-round dynamical collaboration of mllm agents. ArXiv (2025)
[4] Chen, Z., Wu, J., Wang, W., Su, W., Chen, G., Xing, S., Zhong, M., Zhang, Q., Zhu, X., Lu, L., et al.: Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks. In: CVPR (2024)
[5] Clark, C., Zhang, J., Ma, Z., Park, J.S., Salehi, M., Tripathi, R., Lee, S., Ren, Z., Kim, C.D., Yang, Y., Shao, V., Yang, Y., Huang, W., Gao, Z., Anderson, T., Zhang, J., Jain, J., Stoica, G., Han, W., Farhadi, A., Krishna, R.: Molmo2: Open weights and data for vision-language models with video understanding and grounding (2026)
[6] Fan, Y., Ma, X., Wu, R., Du, Y., Li, J., Gao, Z., Li, Q.: Videoagent: A memory-augmented multimodal agent for video understanding. In: ECCV (2024)
[7] Fan, Y., Ma, X., Wu, R., Du, Y., Li, J., Gao, Z., Li, Q.: Videoagent: A memory-augmented multimodal agent for video understanding. In: ECCV (2024)
[8] Feng, K., Gong, K., Li, B., Guo, Z., Wang, Y., Peng, T., Wu, J., Zhang, X., Wang, B., Yue, X.: Video-r1: Reinforcing video reasoning in mllms. ArXiv (2025)
[9] Fu, C., Dai, Y., Luo, Y., Li, L., Ren, S., Zhang, R., Wang, Z., Zhou, C., Shen, Y., Zhang, M., et al.: Video-mme: The first-ever comprehensive evaluation benchmark of multi-modal llms in video analysis. In: CVPR (2025)
[10] GLM, T., Zeng, A., Xu, B., Wang, B., Zhang, C., Yin, D., Zhang, D., Rojas, D., Feng, G., Zhao, H., et al.: Chatglm: A family of large language models from glm-130b to glm-4 all tools. ArXiv (2024)
[11] Grauman, K., Westbury, A., Byrne, E., Chavis, Z., Furnari, A., Girdhar, R., Hamburger, J., Jiang, H., Liu, M., Liu, X., et al.: Ego4d: Around the world in 3,000 hours of egocentric video. In: CVPR (2022)
[12] Hu, K., Gao, F., Nie, X., Zhou, P., Tran, S., Neiman, T., Wang, L., Shah, M., Hamid, R., Yin, B., et al.: M-llm based video frame selection for efficient video understanding. In: CVPR (2025)
[13] Jeoung, S., Huybrechts, G., Ganesh, B., Galstyan, A., Bodapati, S.: Adaptive video understanding agent: Enhancing efficiency with dynamic frame sampling and feedback-driven reasoning. ArXiv (2024)
[14] Kugo, N., Li, X., Li, Z., Gupta, A., Khatua, A., Jain, N., Patel, C., Kyuragi, Y., Ishii, Y., Tanabiki, M., et al.: Videomultiagents: A multi-agent framework for video question answering. ArXiv (2025)
[15] Lei, J., Berg, T.L., Bansal, M.: Detecting moments and highlights in videos via natural language queries. NeurIPS (2021)
[16] Li, K., Wang, Y., He, Y., Li, Y., Wang, Y., Liu, Y., Wang, Z., Xu, J., Chen, G., Luo, P., et al.: Mvbench: A comprehensive multi-modal video understanding benchmark. In: CVPR (2024)
[17] Lin, B., Ye, Y., Zhu, B., Cui, J., Ning, M., Jin, P., Yuan, L.: Video-llava: Learning united visual representation by alignment before projection. In: EMNLP (2024)
[18] Lin, J., Hua, H., Chen, M., Li, Y., Hsiao, J., Ho, C., Luo, J.: Videoxum: Cross-modal visual and textural summarization of videos. IEEE Trans. Multimed. (2023)
[19] Liu, Y., Lin, K.Q., Chen, C.W., Shou, M.Z.: Videomind: A chain-of-lora agent for long video reasoning. ArXiv (2025)
[20] Ma, Z., Gou, C., Shi, H., Sun, B., Li, S., Rezatofighi, H., Cai, J.: Drvideo: Document retrieval based long video understanding. In: CVPR (2025)
[21] Ma, Z., Gou, C., Shi, H., Sun, B., Li, S., Rezatofighi, H., Cai, J.: Drvideo: Document retrieval based long video understanding. In: CVPR (2025)
[22] Maaz, M., Rasheed, H., Khan, S., Khan, F.: Video-chatgpt: Towards detailed video understanding via large vision and language models. In: ACL (2024)
[23] Mangalam, K., Akshulakov, R., Malik, J.: Egoschema: A diagnostic benchmark for very long-form video language understanding. NeurIPS (2023)
[24] Shinn, N., Cassano, F., Gopinath, A., Narasimhan, K., Yao, S.: Reflexion: Language agents with verbal reinforcement learning. NeurIPS (2023)
[25] Shu, Y., Liu, Z., Zhang, P., Qin, M., Zhou, J., Liang, Z., Huang, T., Zhao, B.: Video-xl: Extra-long vision language model for hour-scale video understanding. In: CVPR (2025)
[26] Tong, Z., Song, Y., Wang, J., Wang, L.: Videomae: Masked autoencoders are data-efficient learners for self-supervised video pre-training. NeurIPS (2022)
[27] Wang, P., Bai, S., Tan, S., Wang, S., Fan, Z., Bai, J., Chen, K., Liu, X., Wang, J., Ge, W., et al.: Qwen2-vl: Enhancing vision-language model’s perception of the world at any resolution. ArXiv (2024)
[28] Wang, W., He, Z., Hong, W., Cheng, Y., Zhang, X., Qi, J., Ding, M., Gu, X., Huang, S., Xu, B., et al.: Lvbench: An extreme long video understanding benchmark. In: ICCV (2025)
[29] Wang, X., Zhang, Y., Zohar, O., Yeung-Levy, S.: Videoagent: Long-form video understanding with large language model as agent. In: ECCV (2024)
[30] Wang, Z., Blume, A., Li, S., Liu, G., Cho, J., Tang, Z., Bansal, M., Ji, H.: Paxion: Patching action knowledge in video-language foundation models. NeurIPS (2023)
[31] Wang, Z., Chen, B., Yue, Z., Wang, Y., Qiao, Y., Wang, L., Wang, Y.: Videochat-a1: Thinking with long videos by chain-of-shot reasoning. ArXiv (2025)
[32] Weng, Y., Han, M., He, H., Chang, X., Zhuang, B.: Longvlm: Efficient long video understanding via large language models. In: ECCV (2024)
[33] Wu, B., Yu, S., Chen, Z., Tenenbaum, J.B., Gan, C.: Star: A benchmark for situated reasoning in real-world videos. ArXiv (2024)
[34] Wu, C.Y., Feichtenhofer, C., Fan, H., He, K., Krahenbuhl, P., Girshick, R.: Long-term feature banks for detailed video understanding. In: CVPR (2019)
[35] Wu, C.Y., Krahenbuhl, P.: Towards long-form video understanding. In: CVPR (2021)
[36] Xiao, J., Shang, X., Yao, A., Chua, T.S.: Next-qa: Next phase of question-answering to explaining temporal actions. In: CVPR (2021)
[37] Xiao, J., Yao, A., Li, Y., Chua, T.S.: Can i trust your answer? visually grounded video question answering. In: CVPR (2024)
[38] Xie, Y., Chen, T., Ge, Z., Ni, L.: Video-mtr: Reinforced multi-turn reasoning for long video understanding. ArXiv (2025)
[39] Xu, Z., Zhang, J., Wang, Q., Liu, Y.: E-vrag: Enhancing long video understanding with resource-efficient retrieval augmented generation. ArXiv (2025)
[40] Xue, Z., Zheng, L., Liu, Q., Li, Y., Zheng, X., Ma, Z., An, B.: Simpletir: End-to-end reinforcement learning for multi-turn tool-integrated reasoning. ArXiv (2025)
[41] Xue, Z., Zhang, J., Xie, X., Cai, Y., Liu, Y., Li, X., Tao, D.: Omni-adavideorag: Omni-contextual adaptive retrieval-augmented for efficient long video understanding. ArXiv (2025)
[42] Yang, Z., Wang, S., Zhang, K., Wu, K., Leng, S., Zhang, Y., Li, B., Qin, C., Lu, S., Li, X., Bing, L.: Longvt: Incentivizing "thinking with long videos" via native tool calling (2025)
[43] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K.R., Cao, Y.: React: Synergizing reasoning and acting in language models. In: ICLR (2022)
[44] Yi, K., Gan, C., Li, Y., Kohli, P., Wu, J., Torralba, A., Tenenbaum, J.B.: Clevrer: Collision events for video representation and reasoning. ArXiv (2019)
[45] Zhang, H., Li, X., Bing, L.: Video-llama: An instruction-tuned audio-visual language model for video understanding. ArXiv (2023)
[46] Zhang, Y.F., Lu, X., Yin, S., Fu, C., Chen, W., Hu, X., Wen, B., Jiang, K., Liu, C., Zhang, T., et al.: Thyme: Think beyond images. ArXiv (2025)
[47] Zheng, Y., Zhang, R., Zhang, J., Ye, Y., Luo, Z., Feng, Z., Ma, Y.: Llamafactory: Unified efficient fine-tuning of 100+ language models. ArXiv (2024)
[48] Zheng, Z., Yang, M., Hong, J., Zhao, C., Xu, G., Yang, L., Shen, C., Yu, X.: Deepeyes: Incentivizing "thinking with images" via reinforcement learning. ArXiv (2025)
[49] Zhong, Y., Hu, Z.Y., Li, Y., Wang, L.: Rethinking chain-of-thought reasoning for videos. ArXiv (2025)

[1] [1]

Authorea Preprints (2025) 2, 10, 12

Ahmed, I., Islam, S., Datta, P.P., Kabir, I., Chowdhury, N.U.R., Haque, A.: Qwen 2.5: A comprehensive review of the leading resource-efficient llm with potentioal to surpass all competitors. Authorea Preprints (2025) 2, 10, 12

work page 2025

[2] [2]

NeurIPS (2022) 4

Alayrac, J.B., Donahue, J., Luc, P., Miech, A., Barr, I., Hasson, Y., Lenc, K., Mensch, A., Millican, K., Reynolds, M., et al.: Flamingo: a visual language model for few-shot learning. NeurIPS (2022) 4

work page 2022

[3] [3]

Chen, B., Yue, Z., Chen, S., Wang, Z., Liu, Y., Li, P., Wang, Y.: Lvagent: Long videounderstandingbymulti-rounddynamicalcollaborationofmllmagents.ArXiv (2025) 5

work page 2025

[4] [4]

In: CVPR (2024) 2, 12

Chen, Z., Wu, J., Wang, W., Su, W., Chen, G., Xing, S., Zhong, M., Zhang, Q., Zhu, X., Lu, L., et al.: Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks. In: CVPR (2024) 2, 12

work page 2024

[5] [5]

Clark, C., Zhang, J., Ma, Z., Park, J.S., Salehi, M., Tripathi, R., Lee, S., Ren, Z., Kim, C.D., Yang, Y., Shao, V., Yang, Y., Huang, W., Gao, Z., Anderson, T., Zhang, J., Jain, J., Stoica, G., Han, W., Farhadi, A., Krishna, R.: Molmo2: Open weights and data for vision-language models with video understanding and grounding (2026) 2

work page 2026

[6] [6]

In: ECCV (2024) 2

Fan, Y., Ma, X., Wu, R., Du, Y., Li, J., Gao, Z., Li, Q.: Videoagent: A memory- augmented multimodal agent for video understanding. In: ECCV (2024) 2

work page 2024

[7] [7]

In: ECCV (2024) 5

Fan, Y., Ma, X., Wu, R., Du, Y., Li, J., Gao, Z., Li, Q.: Videoagent: A memory- augmented multimodal agent for video understanding. In: ECCV (2024) 5

work page 2024

[8] [8]

ArXiv (2025) 12

Feng, K., Gong, K., Li, B., Guo, Z., Wang, Y., Peng, T., Wu, J., Zhang, X., Wang, B., Yue, X.: Video-r1: Reinforcing video reasoning in mllms. ArXiv (2025) 12

work page 2025

[9] [9]

In: CVPR (2025) 4

Fu, C., Dai, Y., Luo, Y., Li, L., Ren, S., Zhang, R., Wang, Z., Zhou, C., Shen, Y., Zhang, M., et al.: Video-mme: The first-ever comprehensive evaluation benchmark of multi-modal llms in video analysis. In: CVPR (2025) 4

work page 2025

[10] [10]

ArXiv (2024) 2, 3

GLM, T., Zeng, A., Xu, B., Wang, B., Zhang, C., Yin, D., Zhang, D., Rojas, D., Feng, G., Zhao, H., et al.: Chatglm: A family of large language models from glm-130b to glm-4 all tools. ArXiv (2024) 2, 3

work page 2024

[11] [11]

In: CVPR (2022) 4

Grauman, K., Westbury, A., Byrne, E., Chavis, Z., Furnari, A., Girdhar, R., Ham- burger, J., Jiang, H., Liu, M., Liu, X., et al.: Ego4d: Around the world in 3,000 hours of egocentric video. In: CVPR (2022) 4

work page 2022

[12] [12]

In: CVPR (2025) 2

Hu, K., Gao, F., Nie, X., Zhou, P., Tran, S., Neiman, T., Wang, L., Shah, M., Hamid, R., Yin, B., et al.: M-llm based video frame selection for efficient video understanding. In: CVPR (2025) 2

work page 2025

[13] [13]

ArXiv (2024) 5

Jeoung, S., Huybrechts, G., Ganesh, B., Galstyan, A., Bodapati, S.: Adaptive video understanding agent: Enhancing efficiency with dynamic frame sampling and feedback-driven reasoning. ArXiv (2024) 5

work page 2024

[14] [14]

ArXiv (2025) 5

Kugo, N., Li, X., Li, Z., Gupta, A., Khatua, A., Jain, N., Patel, C., Kyuragi, Y., Ishii, Y., Tanabiki, M., et al.: Videomultiagents: A multi-agent framework for video question answering. ArXiv (2025) 5

work page 2025

[15] [15]

NeurIPS (2021) 11 16 Z

Lei, J., Berg, T.L., Bansal, M.: Detecting moments and highlights in videos via natural language queries. NeurIPS (2021) 11 16 Z. Gao et al

work page 2021

[16] [16]

In: CVPR (2024) 4

Li, K., Wang, Y., He, Y., Li, Y., Wang, Y., Liu, Y., Wang, Z., Xu, J., Chen, G., Luo, P., et al.: Mvbench: A comprehensive multi-modal video understanding benchmark. In: CVPR (2024) 4

work page 2024

[17] [17]

In: EMNLP (2024) 4

Lin, B., Ye, Y., Zhu, B., Cui, J., Ning, M., Jin, P., Yuan, L.: Video-llava: Learning united visual representation by alignment before projection. In: EMNLP (2024) 4

work page 2024

[18] [18]

IEEE Trans

Lin, J., Hua, H., Chen, M., Li, Y., Hsiao, J., Ho, C., Luo, J.: Videoxum: Cross- modal visual and textural summarization of videos. IEEE Trans. Multimed. (2023) 11

work page 2023

[19] [19]

ArXiv (2025) 2

Liu, Y., Lin, K.Q., Chen, C.W., Shou, M.Z.: Videomind: A chain-of-lora agent for long video reasoning. ArXiv (2025) 2

work page 2025

[20] [20]

In: CVPR (2025) 2

Ma,Z.,Gou,C.,Shi,H.,Sun,B.,Li,S.,Rezatofighi,H.,Cai,J.:Drvideo:Document retrieval based long video understanding. In: CVPR (2025) 2

work page 2025

[21] [21]

In: CVPR (2025) 5

Ma,Z.,Gou,C.,Shi,H.,Sun,B.,Li,S.,Rezatofighi,H.,Cai,J.:Drvideo:Document retrieval based long video understanding. In: CVPR (2025) 5

work page 2025

[22] [22]

In: ACL (2024) 4

Maaz, M., Rasheed, H., Khan, S., Khan, F.: Video-chatgpt: Towards detailed video understanding via large vision and language models. In: ACL (2024) 4

work page 2024

[23] [23]

NeurIPS (2023) 4

Mangalam, K., Akshulakov, R., Malik, J.: Egoschema: A diagnostic benchmark for very long-form video language understanding. NeurIPS (2023) 4

work page 2023

[24] [24]

NeurIPS (2023) 5

Shinn,N.,Cassano,F.,Gopinath,A.,Narasimhan,K.,Yao,S.:Reflexion:Language agents with verbal reinforcement learning. NeurIPS (2023) 5

work page 2023

[25] [25]

In: CVPR (2025) 12

Shu, Y., Liu, Z., Zhang, P., Qin, M., Zhou, J., Liang, Z., Huang, T., Zhao, B.: Video-xl: Extra-long vision language model for hour-scale video understanding. In: CVPR (2025) 12

work page 2025

[26] [26]

NeurIPS (2022) 4

Tong, Z., Song, Y., Wang, J., Wang, L.: Videomae: Masked autoencoders are data- efficient learners for self-supervised video pre-training. NeurIPS (2022) 4

work page 2022

[27] [27]

ArXiv (2024) 2, 4, 12

Wang, P., Bai, S., Tan, S., Wang, S., Fan, Z., Bai, J., Chen, K., Liu, X., Wang, J., Ge, W., et al.: Qwen2-vl: Enhancing vision-language model’s perception of the world at any resolution. ArXiv (2024) 2, 4, 12

work page 2024

[28] [28]

In: ICCV (2025) 1

Wang,W.,He,Z.,Hong,W.,Cheng,Y.,Zhang,X.,Qi,J.,Ding,M.,Gu,X.,Huang, S., Xu, B., et al.: Lvbench: An extreme long video understanding benchmark. In: ICCV (2025) 1

[29] Wang, X., Zhang, Y., Zohar, O., Yeung-Levy, S.: Videoagent: Long-form video understanding with large language model as agent. In: ECCV (2024)

[30] Wang, Z., Blume, A., Li, S., Liu, G., Cho, J., Tang, Z., Bansal, M., Ji, H.: Paxion: Patching action knowledge in video-language foundation models. NeurIPS (2023)

[31] Wang, Z., Chen, B., Yue, Z., Wang, Y., Qiao, Y., Wang, L., Wang, Y.: Videochat-a1: Thinking with long videos by chain-of-shot reasoning. ArXiv (2025)

[32] Weng, Y., Han, M., He, H., Chang, X., Zhuang, B.: Longvlm: Efficient long video understanding via large language models. In: ECCV (2024)

[33] Wu, B., Yu, S., Chen, Z., Tenenbaum, J.B., Gan, C.: Star: A benchmark for situated reasoning in real-world videos. ArXiv (2024)

[34] Wu, C.Y., Feichtenhofer, C., Fan, H., He, K., Krahenbuhl, P., Girshick, R.: Long-term feature banks for detailed video understanding. In: CVPR (2019)

[35] Wu, C.Y., Krahenbuhl, P.: Towards long-form video understanding. In: CVPR (2021)

[36] Xiao, J., Shang, X., Yao, A., Chua, T.S.: Next-qa: Next phase of question-answering to explaining temporal actions. In: CVPR (2021)

[37] Xiao, J., Yao, A., Li, Y., Chua, T.S.: Can i trust your answer? visually grounded video question answering. In: CVPR (2024)

[38] Xie, Y., Chen, T., Ge, Z., Ni, L.: Video-mtr: Reinforced multi-turn reasoning for long video understanding. ArXiv (2025)

[39] Xu, Z., Zhang, J., Wang, Q., Liu, Y.: E-vrag: Enhancing long video understanding with resource-efficient retrieval augmented generation. ArXiv (2025)

[40] Xue, Z., Zheng, L., Liu, Q., Li, Y., Zheng, X., Ma, Z., An, B.: Simpletir: End-to-end reinforcement learning for multi-turn tool-integrated reasoning. ArXiv (2025)

[41] Xue, Z., Zhang, J., Xie, X., Cai, Y., Liu, Y., Li, X., Tao, D.: Omni-adavideorag: Omni-contextual adaptive retrieval-augmented for efficient long video understanding. ArXiv (2025)

[42] Yang, Z., Wang, S., Zhang, K., Wu, K., Leng, S., Zhang, Y., Li, B., Qin, C., Lu, S., Li, X., Bing, L.: Longvt: Incentivizing "thinking with long videos" via native tool calling (2025)

[43] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K.R., Cao, Y.: React: Synergizing reasoning and acting in language models. In: ICLR (2022)

[44] Yi, K., Gan, C., Li, Y., Kohli, P., Wu, J., Torralba, A., Tenenbaum, J.B.: Clevrer: Collision events for video representation and reasoning. ArXiv (2019)

[45] Zhang, H., Li, X., Bing, L.: Video-llama: An instruction-tuned audio-visual language model for video understanding. ArXiv (2023)

[46] Zhang, Y.F., Lu, X., Yin, S., Fu, C., Chen, W., Hu, X., Wen, B., Jiang, K., Liu, C., Zhang, T., et al.: Thyme: Think beyond images. ArXiv (2025)

[47] Zheng, Y., Zhang, R., Zhang, J., Ye, Y., Luo, Z., Feng, Z., Ma, Y.: Llamafactory: Unified efficient fine-tuning of 100+ language models. ArXiv (2024)

[48] Zheng, Z., Yang, M., Hong, J., Zhao, C., Xu, G., Yang, L., Shen, C., Yu, X.: Deepeyes: Incentivizing "thinking with images" via reinforcement learning. ArXiv (2025)

[49] Zhong, Y., Hu, Z.Y., Li, Y., Wang, L.: Rethinking chain-of-thought reasoning for videos. ArXiv (2025)
