arxiv: 2603.25021 · v2 · submitted 2026-03-26 · 💻 cs.CV

VideoTIR: Accurate Understanding for Long Videos with Efficient Tool-Integrated Reasoning

Pith reviewed 2026-05-15 00:41 UTC · model grok-4.3

classification 💻 cs.CV
keywords long video understanding · multimodal large language models · reinforcement learning · tool calling · video question answering · hallucination mitigation · policy optimization

The pith

VideoTIR trains multimodal models with reinforcement learning to call tools that isolate key segments in long videos.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes VideoTIR, a reinforcement learning approach that teaches multimodal large language models to use multi-level toolkits for selecting and focusing on relevant video parts rather than ingesting entire long sequences. This targets the imbalance between textual and visual tokens that leads to hallucinations when models process extended visual input. The method explores both Zero-RL and SFT cold-start training, introduces Toolkit Action Grouped Policy Optimization for stepwise rewards and reuse of failed rollouts, and relies on a sandbox to synthesize training trajectories. Experiments on three long-video question-answering benchmarks are reported to show gains in accuracy and efficiency over prior supervised fine-tuning methods that demand large amounts of fine-grained data.

Core claim

VideoTIR shows that reinforcement learning on sandbox-generated trajectories enables MLLMs to learn reliable tool-calling behavior for retrieving meaningful segments, images, and regions, thereby improving long-video understanding without the extensive high-quality annotation required by previous SFT-based tool-calling systems.

What carries the argument

Toolkit Action Grouped Policy Optimization (TAGPO) combined with sandbox-based trajectory synthesis, which supplies stepwise rewards and reuses failed rollouts to train efficient multi-level tool usage.
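
As a rough illustration of what step-level credit could look like, the sketch below normalizes per-tool-call rewards across a group of rollouts and then averages them per rollout, following the Figure 5 caption's statement that the toolkit action advantage is the average of the tool advantages. The reward values, the group-wide mean/std normalization, and both function names are assumptions made for illustration; this page does not give TAGPO's actual reward formulation or policy update.

```python
from statistics import mean, pstdev
from typing import List


def tool_action_advantages(group_rewards: List[List[float]], eps: float = 1e-6) -> List[List[float]]:
    """Normalize every tool-call reward against the whole rollout group.

    group_rewards[i][t] is the (hypothetical) reward of the t-th tool call in
    rollout i, e.g. positive for a useful call and negative for a redundant one.
    """
    flat = [r for rollout in group_rewards for r in rollout]
    if not flat:
        return [[] for _ in group_rewards]
    mu, sigma = mean(flat), pstdev(flat)
    return [[(r - mu) / (sigma + eps) for r in rollout] for rollout in group_rewards]


def toolkit_action_advantage(step_advantages: List[float]) -> float:
    """Average the tool-call advantages of one rollout (0.0 if it made no calls)."""
    return mean(step_advantages) if step_advantages else 0.0


# Hypothetical group: rollout 0 makes one useful and one redundant call,
# rollout 1 makes a single useful call, rollout 2 answers without tools.
group = [[1.0, -0.5], [1.0], []]
advantages = tool_action_advantages(group)
per_rollout = [toolkit_action_advantage(a) for a in advantages]
```

Under these assumptions, a rollout whose final answer is wrong still contributes step-level signal from its individual calls, which is one way failed rollouts could be reused.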

If this is right

  • Long-video question-answering accuracy improves because the model processes only selected segments instead of full sequences.
  • Computational cost drops as redundant visual tokens are avoided through learned tool calls.
  • The same RL setup can be applied to other toolkits that parse video at different granularities.
  • Models become less dependent on massive supervised datasets for tool-use behavior.
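
A back-of-the-envelope calculation makes the cost argument concrete. The sampling rate, tokens per frame, video length, and segment lengths below are hypothetical values chosen for illustration, not figures from the paper.

```python
def visual_tokens(duration_s: float, fps: float, tokens_per_frame: int) -> int:
    """Visual tokens needed to encode a clip sampled at a fixed frame rate."""
    return int(duration_s * fps) * tokens_per_frame


FPS = 1.0               # assumed sampling rate
TOKENS_PER_FRAME = 64   # assumed tokens per encoded frame

# Encoding a full one-hour video versus three tool-selected segments.
full_video = visual_tokens(3600, FPS, TOKENS_PER_FRAME)
selected = sum(visual_tokens(d, FPS, TOKENS_PER_FRAME) for d in (20, 15, 30))

print(full_video, selected)  # 230400 4160
```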

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The approach may extend to streaming video by reusing the same reward structure for incremental tool decisions.
  • Sandbox trajectory generation could be adapted to create training data for tool use in other domains such as document or audio analysis.
  • If the learned policies transfer across different MLLM backbones, the method would reduce the need for per-model supervised fine-tuning.

Load-bearing premise

Reinforcement learning on trajectories produced inside a sandbox will yield stable tool-calling policies without the need for large volumes of human-curated fine-grained data.

What would settle it

A direct comparison on the same long-video QA benchmarks where VideoTIR models produce more hallucinations or more redundant tool calls than strong SFT baselines would falsify the central claim.

Figures

Figures reproduced from arXiv: 2603.25021 by Dacheng Tao, Haotian Xu, Qi Fan, Shiyu Shen, Taifeng Chai, Weinong Wang, Wenbin Li, Xing W, Yang Gao, Zhe Gao.

Figure 1: We propose VideoTIR, a tool-integrated reasoning framework that flexibly and hierarchically retrieves relevant video segments through endogenous tool invocation to support long-video understanding. Furthermore, to enable SFT cold start, we introduce a sandbox-based trajectory synthesis framework. We also present TAGPO to address the inefficiency in early-stage RL exploration caused by tool misuse and over…
Figure 2: Framework of our methods. VideoTIR adopts a multi-turn manner to deal with the users’ input videos and questions. When the model fails to conclude an answer based on current visual information, it calls tools to perceive the absent vision clues, which are combined with the former context as the input for the next-turn reasoning.
Figure 3: Comparison of tool-integrated reasoning (TIR) designs for video understanding. (a) Methods such as [38, 42] adopt a paradigm in which the VLM outputs timestamps in text form for subsequent video clipping. (b) Alternatively, some methods rely on heavyweight external tools, incurring substantial interaction costs. In contrast, VideoTIR leverages the intrinsic encoding structure of the VLM to design interna…
Figure 4: Hierarchical Visual Toolkits containing both Global and Local Tools. When there’s a need for more information, the textual router calls global-level browsing tools for general questions and detail-level tools for questions targeting finer perception of the videos.
Figure 5: Visualization of Tool Action Advantage. We define rewards for each tool-calling action that punish redundancy. The toolkit action advantage is the average of the tool advantages.
Figure 6: Framework of Data Synthesis. For video-text grounding datasets, we first convert them into QA datasets. Then, for easy questions, we synthesize trajectories that answer directly with no tool calls. Hard questions that the model answers wrong are processed through a sandbox to generate tool-calling trajectories. Finally, a large LLM is used to judge the rationality of the trajectories and we only …
Figure 7: Distribution and Task Range of Curated Datasets (VideoSIAM is not included). We selected 4 general tasks that potentially need tools for finer perception. We also synthesize high-quality trajectories as the SFT dataset for cold-starting the 3B model.
Figure 8: Training dynamics analysis. (a) At early stages, response length increases while format quality drops, indicating the model prioritizes rational tool exploration. Once response length stabilizes, format quality gradually improves, suggesting a balance between rationality and formality. (b) TAGPO accelerates valid tool learning. The valid tool reward rises significantly faster than episode-level GRPO, reduc…
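
The multi-turn behaviour described in the Figure 2 caption can be sketched as a simple loop: the model either answers or calls a tool, and each tool observation is appended to the context for the next turn. The `policy` callable, the tool names, the string-based context, and the `max_turns` cap are hypothetical stand-ins, not the paper's interface.

```python
from typing import Callable, Dict, List, Optional, Tuple

# ("answer", text) ends the episode; ("tool", name) requests more visual evidence.
Action = Tuple[str, str]


def run_episode(
    question: str,
    policy: Callable[[List[str]], Action],
    tools: Dict[str, Callable[[List[str]], str]],
    max_turns: int = 4,
) -> Tuple[Optional[str], List[str]]:
    """Run one multi-turn episode and return (answer, accumulated context)."""
    context = [f"question: {question}"]
    for _ in range(max_turns):
        kind, payload = policy(context)
        if kind == "answer":
            return payload, context
        # Tool output joins the earlier context and feeds the next turn.
        observation = tools[payload](context)
        context.append(f"{payload}: {observation}")
    return None, context  # turn budget exhausted without an answer


def toy_policy(context: List[str]) -> Action:
    """Stand-in for the model: browse once, then answer."""
    seen_clip = any(line.startswith("browse:") for line in context)
    return ("answer", "B") if seen_clip else ("tool", "browse")


toy_tools = {"browse": lambda context: "clip 00:10-00:20"}
answer, context = run_episode("What happens after the goal?", toy_policy, toy_tools)
```

The turn cap is a design choice of this sketch: it bounds the cost of an episode in which the policy never commits to an answer.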
Original abstract

Existing Multimodal Large Language Models (MLLMs) often suffer from hallucinations in long video understanding (LVU), primarily due to the imbalance between textual and visual tokens. Observing that MLLMs handle short visual inputs well, recent LVU works alleviate hallucinations by automatically parsing the vast visual data into manageable segments that can be effectively processed by MLLMs. SFT-based tool-calling methods can serve this purpose, but they typically require vast amounts of fine-grained, high-quality data and suffer from constrained tool-calling trajectories. We propose a novel VideoTIR that leverages Reinforcement Learning (RL) to encourage proper usage of comprehensive multi-level toolkits for efficient long video understanding. VideoTIR explores both Zero-RL and SFT cold-starting to enable MLLMs to retrieve and focus on meaningful video segments/images/regions, enhancing long video understanding both accurately and efficiently. To reduce redundant tool-calling, we propose Toolkit Action Grouped Policy Optimization (TAGPO), which enhances the efficiency of the calling process through stepwise reward assignment and reuse of failed rollouts. Additionally, we develop a sandbox-based trajectory synthesis framework to generate high-quality trajectories data. Extensive experiments on three long-video QA benchmarks demonstrate the effectiveness and efficiency of our method.
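
The data-synthesis flow summarized in the Figure 6 caption can be sketched roughly as follows. The callables `base_answer`, `sandbox_rollout`, and `judge`, the `Trajectory` record, the exact-match test for an "easy" question, and the rule of keeping only judge-approved trajectories are all assumptions; the caption is truncated before it states the final filtering rule.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterable, List, Tuple


@dataclass
class Trajectory:
    question: str
    tool_calls: List[str] = field(default_factory=list)  # empty means a direct answer
    answer: str = ""


def synthesize(
    qa_pairs: Iterable[Tuple[str, str]],
    base_answer: Callable[[str], str],
    sandbox_rollout: Callable[[str], Trajectory],
    judge: Callable[[Trajectory, str], bool],
) -> List[Trajectory]:
    """Build trajectories: direct answers for easy questions, sandbox rollouts for hard ones."""
    kept = []
    for question, gold in qa_pairs:
        prediction = base_answer(question)
        if prediction == gold:
            # Easy question: the model is already right, so no tool calls are recorded.
            trajectory = Trajectory(question, [], prediction)
        else:
            # Hard question: let the sandbox produce a tool-calling trajectory.
            trajectory = sandbox_rollout(question)
        # Assumed filter: keep only trajectories the judge accepts.
        if judge(trajectory, gold):
            kept.append(trajectory)
    return kept
```

In practice `judge` would wrap a call to a large LLM; here it is left as a plain callable so the control flow can be exercised with simple stubs.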

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper proposes VideoTIR, an RL-based framework (exploring both Zero-RL and SFT cold-start) that trains MLLMs to use multi-level toolkits for long video understanding. It introduces Toolkit Action Grouped Policy Optimization (TAGPO) for stepwise reward assignment and reuse of failed rollouts to reduce redundant calls, plus a sandbox-based trajectory synthesis method to generate training data without relying on vast fine-grained SFT corpora. Experiments on three long-video QA benchmarks are reported to show gains in accuracy and efficiency over prior SFT tool-calling approaches.

Significance. If the central claims hold, the work would offer a data-efficient alternative to SFT for tool-integrated LVU, potentially lowering the barrier to reliable multi-level tool use in MLLMs while addressing token imbalance and hallucinations. The combination of RL with sandbox trajectories and grouped policy optimization could influence scalable video reasoning pipelines.

major comments (3)
  1. [§3.3] §3.3 (TAGPO): The description of stepwise reward assignment and failed-rollout reuse is given at a high level, but the exact reward formulation (e.g., how efficiency penalties are balanced against accuracy) and the mathematical definition of the grouped policy update are not provided. Without these, it is impossible to verify whether TAGPO avoids the circularity of rewarding tool use that the sandbox already biases toward.
  2. [§3.4] §3.4 (sandbox trajectory synthesis): The framework is claimed to produce high-quality, diverse trajectories that enable RL to outperform SFT data requirements. However, no quantitative characterization of trajectory diversity (e.g., coverage of video lengths, tool-call distributions, or failure modes) or ablation on sandbox curation choices is reported. This directly bears on the data-efficiency claim and generalization to the three QA benchmarks.
  3. [§4.2] §4.2 (experimental results): The reported improvements over SFT baselines are presented without disclosing the total number of sandbox-generated trajectories versus the scale of fine-grained SFT data used in comparators. This omission prevents assessment of whether the RL advantage is genuine or an artifact of unequal data budgets.
minor comments (2)
  1. [§1] The abstract and §1 use “comprehensive multi-level toolkits” without an early enumeration or diagram of the exact tool hierarchy (segment, image, region levels).
  2. [Figure 2] Figure 2 (method overview) would benefit from explicit arrows showing how TAGPO reuses failed rollouts in the training loop.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments and the recommendation for major revision. We address each major comment below and will incorporate the requested details and analyses into the revised manuscript.

point-by-point responses
  1. Referee: [§3.3] §3.3 (TAGPO): The description of stepwise reward assignment and failed-rollout reuse is given at a high level, but the exact reward formulation (e.g., how efficiency penalties are balanced against accuracy) and the mathematical definition of the grouped policy update are not provided. Without these, it is impossible to verify whether TAGPO avoids the circularity of rewarding tool use that the sandbox already biases toward.

    Authors: We thank the referee for this observation. In the revised manuscript we will expand §3.3 with the exact reward formulation (accuracy term plus explicit efficiency penalty) and the full mathematical definition of the grouped policy update used by TAGPO. On the circularity concern, the sandbox is used only to synthesize initial trajectories; the subsequent RL stage with TAGPO optimizes the policy to reduce redundant calls, which is reflected in the measured efficiency gains. We will add a short clarifying paragraph on this distinction. revision: yes

  2. Referee: [§3.4] §3.4 (sandbox trajectory synthesis): The framework is claimed to produce high-quality, diverse trajectories that enable RL to outperform SFT data requirements. However, no quantitative characterization of trajectory diversity (e.g., coverage of video lengths, tool-call distributions, or failure modes) or ablation on sandbox curation choices is reported. This directly bears on the data-efficiency claim and generalization to the three QA benchmarks.

    Authors: We agree that quantitative characterization is needed. In the revision we will add statistics on trajectory diversity (video-length coverage, tool-call distributions, and failure-mode coverage) together with an ablation study on sandbox curation choices and their effect on final performance across the three benchmarks. revision: yes

  3. Referee: [§4.2] §4.2 (experimental results): The reported improvements over SFT baselines are presented without disclosing the total number of sandbox-generated trajectories versus the scale of fine-grained SFT data used in comparators. This omission prevents assessment of whether the RL advantage is genuine or an artifact of unequal data budgets.

    Authors: We acknowledge the omission. In the revised §4.2 we will report the exact total number of sandbox-generated trajectories used for RL training and provide a direct comparison with the data scale of the SFT baselines, allowing readers to evaluate the data-efficiency claim. revision: yes

Circularity Check

0 steps flagged

No circularity: method proposal builds on external RL/tool ideas without self-referential reduction

full rationale

The paper proposes VideoTIR as a new RL-driven framework (Zero-RL and SFT cold-start variants) plus TAGPO and sandbox trajectory synthesis for long-video tool calling. No equations, fitted parameters renamed as predictions, or self-citation chains appear in the derivation; the central claims rest on empirical results across three QA benchmarks rather than reducing by construction to the inputs. The approach extends existing MLLM and RL concepts without self-definitional loops or ansatz smuggling.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract provides no explicit free parameters, axioms, or invented entities; the method description relies on standard RL concepts and tool-calling without detailing any new fitted constants or unproven assumptions.

pith-pipeline@v0.9.0 · 5546 in / 998 out tokens · 42122 ms · 2026-05-15T00:41:39.770288+00:00 · methodology


Reference graph

Works this paper leans on

49 extracted references · 49 canonical work pages

  1. [1] Ahmed, I., Islam, S., Datta, P.P., Kabir, I., Chowdhury, N.U.R., Haque, A.: Qwen 2.5: A comprehensive review of the leading resource-efficient llm with potentioal to surpass all competitors. Authorea Preprints (2025)

  2. [2] Alayrac, J.B., Donahue, J., Luc, P., Miech, A., Barr, I., Hasson, Y., Lenc, K., Mensch, A., Millican, K., Reynolds, M., et al.: Flamingo: a visual language model for few-shot learning. NeurIPS (2022)

  3. [3] Chen, B., Yue, Z., Chen, S., Wang, Z., Liu, Y., Li, P., Wang, Y.: Lvagent: Long video understanding by multi-round dynamical collaboration of mllm agents. ArXiv (2025)

  4. [4] Chen, Z., Wu, J., Wang, W., Su, W., Chen, G., Xing, S., Zhong, M., Zhang, Q., Zhu, X., Lu, L., et al.: Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks. In: CVPR (2024)

  5. [5] Clark, C., Zhang, J., Ma, Z., Park, J.S., Salehi, M., Tripathi, R., Lee, S., Ren, Z., Kim, C.D., Yang, Y., Shao, V., Yang, Y., Huang, W., Gao, Z., Anderson, T., Zhang, J., Jain, J., Stoica, G., Han, W., Farhadi, A., Krishna, R.: Molmo2: Open weights and data for vision-language models with video understanding and grounding (2026)

  6. [6] Fan, Y., Ma, X., Wu, R., Du, Y., Li, J., Gao, Z., Li, Q.: Videoagent: A memory-augmented multimodal agent for video understanding. In: ECCV (2024)

  7. [7] Fan, Y., Ma, X., Wu, R., Du, Y., Li, J., Gao, Z., Li, Q.: Videoagent: A memory-augmented multimodal agent for video understanding. In: ECCV (2024)

  8. [8] Feng, K., Gong, K., Li, B., Guo, Z., Wang, Y., Peng, T., Wu, J., Zhang, X., Wang, B., Yue, X.: Video-r1: Reinforcing video reasoning in mllms. ArXiv (2025)

  9. [9] Fu, C., Dai, Y., Luo, Y., Li, L., Ren, S., Zhang, R., Wang, Z., Zhou, C., Shen, Y., Zhang, M., et al.: Video-mme: The first-ever comprehensive evaluation benchmark of multi-modal llms in video analysis. In: CVPR (2025)

  10. [10] GLM, T., Zeng, A., Xu, B., Wang, B., Zhang, C., Yin, D., Zhang, D., Rojas, D., Feng, G., Zhao, H., et al.: Chatglm: A family of large language models from glm-130b to glm-4 all tools. ArXiv (2024)

  11. [11] Grauman, K., Westbury, A., Byrne, E., Chavis, Z., Furnari, A., Girdhar, R., Hamburger, J., Jiang, H., Liu, M., Liu, X., et al.: Ego4d: Around the world in 3,000 hours of egocentric video. In: CVPR (2022)

  12. [12] Hu, K., Gao, F., Nie, X., Zhou, P., Tran, S., Neiman, T., Wang, L., Shah, M., Hamid, R., Yin, B., et al.: M-llm based video frame selection for efficient video understanding. In: CVPR (2025)

  13. [13] Jeoung, S., Huybrechts, G., Ganesh, B., Galstyan, A., Bodapati, S.: Adaptive video understanding agent: Enhancing efficiency with dynamic frame sampling and feedback-driven reasoning. ArXiv (2024)

  14. [14] Kugo, N., Li, X., Li, Z., Gupta, A., Khatua, A., Jain, N., Patel, C., Kyuragi, Y., Ishii, Y., Tanabiki, M., et al.: Videomultiagents: A multi-agent framework for video question answering. ArXiv (2025)

  15. [15] Lei, J., Berg, T.L., Bansal, M.: Detecting moments and highlights in videos via natural language queries. NeurIPS (2021)

  16. [16] Li, K., Wang, Y., He, Y., Li, Y., Wang, Y., Liu, Y., Wang, Z., Xu, J., Chen, G., Luo, P., et al.: Mvbench: A comprehensive multi-modal video understanding benchmark. In: CVPR (2024)

  17. [17] Lin, B., Ye, Y., Zhu, B., Cui, J., Ning, M., Jin, P., Yuan, L.: Video-llava: Learning united visual representation by alignment before projection. In: EMNLP (2024)

  18. [18] Lin, J., Hua, H., Chen, M., Li, Y., Hsiao, J., Ho, C., Luo, J.: Videoxum: Cross-modal visual and textural summarization of videos. IEEE Trans. Multimed. (2023)

  19. [19] Liu, Y., Lin, K.Q., Chen, C.W., Shou, M.Z.: Videomind: A chain-of-lora agent for long video reasoning. ArXiv (2025)

  20. [20] Ma, Z., Gou, C., Shi, H., Sun, B., Li, S., Rezatofighi, H., Cai, J.: Drvideo: Document retrieval based long video understanding. In: CVPR (2025)

  21. [21] Ma, Z., Gou, C., Shi, H., Sun, B., Li, S., Rezatofighi, H., Cai, J.: Drvideo: Document retrieval based long video understanding. In: CVPR (2025)

  22. [22] Maaz, M., Rasheed, H., Khan, S., Khan, F.: Video-chatgpt: Towards detailed video understanding via large vision and language models. In: ACL (2024)

  23. [23] Mangalam, K., Akshulakov, R., Malik, J.: Egoschema: A diagnostic benchmark for very long-form video language understanding. NeurIPS (2023)

  24. [24] Shinn, N., Cassano, F., Gopinath, A., Narasimhan, K., Yao, S.: Reflexion: Language agents with verbal reinforcement learning. NeurIPS (2023)

  25. [25] Shu, Y., Liu, Z., Zhang, P., Qin, M., Zhou, J., Liang, Z., Huang, T., Zhao, B.: Video-xl: Extra-long vision language model for hour-scale video understanding. In: CVPR (2025)

  26. [26] Tong, Z., Song, Y., Wang, J., Wang, L.: Videomae: Masked autoencoders are data-efficient learners for self-supervised video pre-training. NeurIPS (2022)

  27. [27] Wang, P., Bai, S., Tan, S., Wang, S., Fan, Z., Bai, J., Chen, K., Liu, X., Wang, J., Ge, W., et al.: Qwen2-vl: Enhancing vision-language model’s perception of the world at any resolution. ArXiv (2024)

  28. [28] Wang, W., He, Z., Hong, W., Cheng, Y., Zhang, X., Qi, J., Ding, M., Gu, X., Huang, S., Xu, B., et al.: Lvbench: An extreme long video understanding benchmark. In: ICCV (2025)

  29. [29] Wang, X., Zhang, Y., Zohar, O., Yeung-Levy, S.: Videoagent: Long-form video understanding with large language model as agent. In: ECCV (2024)

  30. [30] Wang, Z., Blume, A., Li, S., Liu, G., Cho, J., Tang, Z., Bansal, M., Ji, H.: Paxion: Patching action knowledge in video-language foundation models. NeurIPS (2023)

  31. [31] Wang, Z., Chen, B., Yue, Z., Wang, Y., Qiao, Y., Wang, L., Wang, Y.: Videochat-a1: Thinking with long videos by chain-of-shot reasoning. ArXiv (2025)

  32. [32] Weng, Y., Han, M., He, H., Chang, X., Zhuang, B.: Longvlm: Efficient long video understanding via large language models. In: ECCV (2024)

  33. [33] Wu, B., Yu, S., Chen, Z., Tenenbaum, J.B., Gan, C.: Star: A benchmark for situated reasoning in real-world videos. ArXiv (2024)

  34. [34] Wu, C.Y., Feichtenhofer, C., Fan, H., He, K., Krahenbuhl, P., Girshick, R.: Long-term feature banks for detailed video understanding. In: CVPR (2019)

  35. [35] Wu, C.Y., Krahenbuhl, P.: Towards long-form video understanding. In: CVPR (2021)

  36. [36] Xiao, J., Shang, X., Yao, A., Chua, T.S.: Next-qa: Next phase of question-answering to explaining temporal actions. In: CVPR (2021)

  37. [37] Xiao, J., Yao, A., Li, Y., Chua, T.S.: Can i trust your answer? visually grounded video question answering. In: CVPR (2024)

  38. [38] Xie, Y., Chen, T., Ge, Z., Ni, L.: Video-mtr: Reinforced multi-turn reasoning for long video understanding. ArXiv (2025)

  39. [39] Xu, Z., Zhang, J., Wang, Q., Liu, Y.: E-vrag: Enhancing long video understanding with resource-efficient retrieval augmented generation. ArXiv (2025)

  40. [40] Xue, Z., Zheng, L., Liu, Q., Li, Y., Zheng, X., Ma, Z., An, B.: Simpletir: End-to-end reinforcement learning for multi-turn tool-integrated reasoning. ArXiv (2025)

  41. [41] Xue, Z., Zhang, J., Xie, X., Cai, Y., Liu, Y., Li, X., Tao, D.: Omni-adavideorag: Omni-contextual adaptive retrieval-augmented for efficient long video understanding. ArXiv (2025)

  42. [42] Yang, Z., Wang, S., Zhang, K., Wu, K., Leng, S., Zhang, Y., Li, B., Qin, C., Lu, S., Li, X., Bing, L.: Longvt: Incentivizing "thinking with long videos" via native tool calling (2025)

  43. [43] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K.R., Cao, Y.: React: Synergizing reasoning and acting in language models. In: ICLR (2022)

  44. [44] Yi, K., Gan, C., Li, Y., Kohli, P., Wu, J., Torralba, A., Tenenbaum, J.B.: Clevrer: Collision events for video representation and reasoning. ArXiv (2019)

  45. [45] Zhang, H., Li, X., Bing, L.: Video-llama: An instruction-tuned audio-visual language model for video understanding. ArXiv (2023)

  46. [46] Zhang, Y.F., Lu, X., Yin, S., Fu, C., Chen, W., Hu, X., Wen, B., Jiang, K., Liu, C., Zhang, T., et al.: Thyme: Think beyond images. ArXiv (2025)

  47. [47] Zheng, Y., Zhang, R., Zhang, J., Ye, Y., Luo, Z., Feng, Z., Ma, Y.: Llamafactory: Unified efficient fine-tuning of 100+ language models. ArXiv (2024)

  48. [48] Zheng, Z., Yang, M., Hong, J., Zhao, C., Xu, G., Yang, L., Shen, C., Yu, X.: Deepeyes: Incentivizing "thinking with images" via reinforcement learning. ArXiv (2025)

  49. [49] Zhong, Y., Hu, Z.Y., Li, Y., Wang, L.: Rethinking chain-of-thought reasoning for videos. ArXiv (2025)