pith. machine review for the scientific record.

arxiv: 2601.15724 · v2 · submitted 2026-01-22 · 💻 cs.CV · cs.AI

Recognition: no theorem link

VideoThinker: Building Agentic VideoLLMs with LLM-Guided Tool Reasoning

Authors on Pith · no claims yet

Pith reviewed 2026-05-16 12:14 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords agentic video models · tool reasoning · synthetic data · long-form video understanding · VideoLLM · temporal retrieval · spatial zoom

The pith

VideoThinker trains agentic video models on synthetic tool trajectories generated from captions, then grounded to frames.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Long videos challenge current VideoLLMs because static, uniform frame sampling loses critical timing and detail. Building agentic training data normally requires a model that already understands long videos; VideoThinker breaks this circular dependency by first turning videos into detailed captions, then using a strong language model to create multi-step tool sequences, such as temporal retrieval and spatial zoom, entirely in text. These sequences are later paired with the original frames to form training data. The resulting model learns to explore videos adaptively rather than processing them uniformly, and on long-video benchmarks it outperforms both pure language agents and existing video models.
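To make the tool vocabulary concrete, here is a minimal sketch of what temporal retrieval and spatial zoom could look like. The signatures, the toy word-overlap relevance score, and the NumPy-style frame array are illustrative assumptions, not the paper's actual API.

# Illustrative sketch only: the tool names come from the paper, but these
# signatures and the toy relevance score are assumptions, not its API.

def toy_relevance(caption: str, query: str) -> float:
    # Stand-in for a vision-language similarity model: plain word overlap.
    c, q = set(caption.lower().split()), set(query.lower().split())
    return len(c & q) / max(len(q), 1)

def temporal_retrieval(frame_captions: list[str], query: str, top_k: int = 4) -> list[int]:
    """Return indices of the top_k frames whose captions best match the query."""
    ranked = sorted(range(len(frame_captions)),
                    key=lambda i: toy_relevance(frame_captions[i], query),
                    reverse=True)
    return sorted(ranked[:top_k])  # restore temporal order before feeding the model

def spatial_zoom(frame, box: tuple[int, int, int, int]):
    """Crop a region (x0, y0, x1, y1) of an H x W x C frame array (e.g., NumPy)."""
    x0, y0, x1, y1 = box
    return frame[y0:y1, x0:x1]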

Core claim

The central claim is that tool-use trajectories generated by an agentic language model operating only on video captions can be grounded directly to the corresponding frames to produce large-scale training data, equipping a VideoLLM with dynamic retrieval and zoom capabilities without requiring the base model to possess strong long-form video comprehension in advance.

What carries the argument

The synthetic dataset creation pipeline, which converts videos to captions, simulates multi-step tool interactions in caption space, and then substitutes actual video frames for the captions to yield interleaved video-and-reasoning sequences.
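A minimal sketch of that pipeline's two stages; the step-dict trajectory format and the llm_agent planner are hypothetical stand-ins for whatever the paper actually uses.

def simulate_trajectory(captions, question, llm_agent):
    """Stage 1: a strong text-only LLM plans multi-step tool calls over
    numbered captions; each step records the tool and the caption indices
    it touched, e.g. {"tool": "temporal_retrieval", "indices": [12, 47]}."""
    context = "\n".join(f"[{i}] {c}" for i, c in enumerate(captions))
    transcript = []
    for step in llm_agent(question=question, context=context):
        transcript.append(step)
        if step["tool"] == "answer":
            break
    return transcript

def ground_to_frames(transcript, frames):
    """Stage 2: grounding by direct substitution; every caption index a step
    touched is replaced with the corresponding real frame, yielding an
    interleaved video-and-reasoning training example."""
    return [{**step, "observations": [frames[i] for i in step.get("indices", [])]}
            for step in transcript]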

If this is right

  • The model acquires the ability to retrieve specific moments and zoom into regions instead of relying on fixed frame sampling.
  • Training data for agentic video behavior can be scaled without human annotation or access to already-capable video models.
  • Multi-step reasoning emerges from the structure of the grounded tool trajectories.
  • Long-form video tasks that require localization benefit directly from the adaptive exploration learned during training.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same caption-to-trajectory-then-ground approach could be tested on audio or sensor streams where direct simulation is costly.
  • Iterating the process with the trained model itself generating new trajectories might create a self-improving loop.
  • Inference cost could decrease if the model learns to request only the frames it needs rather than processing the entire video.

Load-bearing premise

Tool trajectories created from captions alone preserve enough temporal and spatial information to remain useful once the captions are replaced by real frames.

What would settle it

Performance drop on a benchmark set of videos where the original captions miss key events that the tool trajectories were supposed to locate.
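One way to operationalize that test: partition a benchmark by whether the caption pipeline captured the queried event, then compare accuracy across the split. Here covers_event, the example schema, and the model interface are hypothetical stand-ins.

def caption_gap_split(examples, covers_event):
    """Partition benchmark items by whether the captions mention the event
    the question targets (covers_event is an assumed oracle or heuristic)."""
    covered = [ex for ex in examples if covers_event(ex["captions"], ex["event"])]
    missed = [ex for ex in examples if not covers_event(ex["captions"], ex["event"])]
    return covered, missed

def accuracy(model, examples):
    """model(video, question) -> predicted answer string."""
    if not examples:
        return float("nan")
    return sum(model(ex["video"], ex["question"]) == ex["answer"]
               for ex in examples) / len(examples)

A large gap between accuracy on the covered and missed halves would show the learned policy inherits the captions' blind spots.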

Figures

Figures reproduced from arXiv: 2601.15724 by Chenglin Li, Feng Han, Jiaqi Wang, Qianglong Chen, Ruilin Li, Xingxi Yin, Yan Gong, Yikun Wang, Yin Zhang.

Figure 1: Comparison of VideoThinker with VideoLLMs and LLM agents. VideoThinker excels at interleaved video reasoning on long …
Figure 2: VideoThinker integrates retrieval and zoom tools for multi-turn reasoning. LLMs use …
Figure 3: The prompt designed to enable VideoLLM to serve as a …
Figure 4: Confidence–accuracy relationship. The analysis is …
Figure 6: Performance of VideoThinker at varying confidence …
Figure 7: Case study: VideoThinker performs agentic tool use by …
Figure 8: Distribution of video durations in CoTs.
Figure 9: The prompt designed to enable VideoLLM to think …
Figure 10: The training script with Swift.
Figure 11: VideoThinker’s agentic tool reasoning on LVBench, testing its ability to retrieve key information from a 61-minute video …
Figure 12: VideoThinker’s agentic tool reasoning on LVBench, testing its reasoning ability using a 61-minute video (Cm73ma6Ibcs).
Figure 13: VideoThinker’s agentic tool reasoning on LVBench, testing its event-understanding ability using a 55-minute video …
Figure 14: VideoThinker’s agentic tool reasoning on LVBench, testing its event-recognition ability using a 55-minute video …
Figure 15: VideoThinker’s agentic tool reasoning on LVBench, testing its temporal grounding ability using a 64-minute video …
Figure 16: VideoThinker’s agentic tool reasoning on LVBench, testing its summarization ability using a 53-minute video (TiQBTesZUJQ).
read the original abstract

Long-form video understanding remains a fundamental challenge for current Video Large Language Models. Most existing models rely on static reasoning over uniformly sampled frames, which weakens temporal localization and leads to substantial information loss in long videos. Agentic tools such as temporal retrieval, spatial zoom, and temporal zoom offer a natural way to overcome these limitations by enabling adaptive exploration of key moments. However, constructing agentic video understanding data requires models that already possess strong long-form video comprehension, creating a circular dependency. We address this challenge with VideoThinker, an agentic Video Large Language Model trained entirely on synthetic tool interaction trajectories. Our key idea is to convert videos into rich captions and employ a powerful agentic language model to generate multi-step tool use sequences in caption space. These trajectories are subsequently grounded back to video by replacing captions with the corresponding frames, yielding a large-scale interleaved video and tool reasoning dataset without requiring any long-form understanding from the underlying model. Training on this synthetic agentic dataset equips VideoThinker with dynamic reasoning capabilities, adaptive temporal exploration, and multi-step tool use. Remarkably, VideoThinker significantly outperforms both caption-only language model agents and strong video model baselines across long-video benchmarks, demonstrating the effectiveness of tool augmented synthetic data and adaptive retrieval and zoom reasoning for long-form video understanding.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces VideoThinker, an agentic VideoLLM trained on large-scale synthetic data consisting of multi-step tool-use trajectories (temporal retrieval, spatial zoom, temporal zoom) generated by a powerful external LLM operating entirely in video caption space; these trajectories are then grounded to the original video frames by direct substitution, yielding interleaved training data that equips the model with adaptive long-form reasoning without requiring the target VideoLLM to possess strong video understanding a priori. The central claim is that this construction produces significant outperformance over both caption-only LM agents and strong video-model baselines on long-video benchmarks.

Significance. If the transfer from caption-generated trajectories to frame-grounded execution holds, the approach supplies a scalable, non-circular route to high-quality agentic supervision for video models, directly addressing the data bottleneck that has limited dynamic tool use in long-form video understanding. The explicit separation of trajectory synthesis (caption space) from visual execution (frame space) is a methodological strength that could generalize to other modalities.

major comments (3)
  1. [§3] (Method, trajectory generation): The assumption that tool-use sequences optimal in caption space remain near-optimal once captions are replaced by frames is load-bearing for the entire training pipeline, yet the manuscript provides no quantitative validation (e.g., agreement rate between caption-based and frame-based tool decisions on a held-out set) or error analysis of cases where motion timing or low-contrast details omitted by captions would change retrieval/zoom choices.
  2. [§4] (Experiments): The reported outperformance is stated without accompanying ablations that isolate the contribution of each tool type or of caption quality; without these controls it is impossible to determine whether gains derive from the adaptive policy or from incidental effects of the synthetic data construction.
  3. [§4.2] (Results tables): No error analysis or qualitative examples are supplied showing trajectories that succeed on frames versus those that fail due to caption-induced mismatches, which would directly test the weakest assumption identified in the skeptic note.
minor comments (2)
  1. Notation for the three tools (retrieval, spatial zoom, temporal zoom) is introduced without a compact summary table; adding one would improve readability.
  2. [Abstract] The abstract claims 'significant outperformance' but supplies no numerical deltas or baseline identifiers; these should be stated explicitly even in the abstract.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback. We address each major comment below and have revised the manuscript to incorporate the suggested analyses and ablations.

read point-by-point responses
  1. Referee: [§3] (Method, trajectory generation): The assumption that tool-use sequences optimal in caption space remain near-optimal once captions are replaced by frames is load-bearing for the entire training pipeline, yet the manuscript provides no quantitative validation (e.g., agreement rate between caption-based and frame-based tool decisions on a held-out set) or error analysis of cases where motion timing or low-contrast details omitted by captions would change retrieval/zoom choices.

    Authors: We agree that empirical validation of the caption-to-frame transfer is important. While our grounding method uses direct substitution of frames for captions, we will add a new subsection in the revised manuscript with quantitative agreement rates computed on a held-out set (comparing tool decisions made from captions versus from frames) and an accompanying error analysis of mismatch cases arising from omitted motion or low-contrast details (a minimal sketch of such an agreement check appears after these responses). revision: yes

  2. Referee: [§4] (Experiments): The reported outperformance is stated without accompanying ablations that isolate the contribution of each tool type or of caption quality; without these controls it is impossible to determine whether gains derive from the adaptive policy or from incidental effects of the synthetic data construction.

    Authors: We thank the referee for highlighting the need for clearer isolation of contributions. In the revised experiments section we will add ablations that (i) remove each tool type individually while keeping the others and (ii) vary caption quality (e.g., using shorter or noisier captions). These controls will clarify whether performance gains stem primarily from the adaptive multi-step policy. revision: yes

  3. Referee: [§4.2] (Results tables): No error analysis or qualitative examples are supplied showing trajectories that succeed on frames versus those that fail due to caption-induced mismatches, which would directly test the weakest assumption identified in the skeptic note.

    Authors: We agree that qualitative and error analysis directly addressing caption-frame mismatches would strengthen the paper. The revised manuscript will include a new subsection with representative qualitative examples of successful trajectories and failure cases caused by caption omissions, together with a quantitative breakdown of mismatch frequency and its effect on benchmark performance. revision: yes
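The agreement-rate check promised in response 1 is easy to state concretely. A minimal sketch, assuming trajectories are lists of step dicts carrying a "tool" name and the frame "indices" that step touched; both structures are assumptions for illustration, not the paper's format.

def step_agreement(trajs_from_captions, trajs_from_frames) -> float:
    """Fraction of aligned steps where a caption-space planner and a
    frame-space planner chose the same tool with overlapping frame targets
    (a deliberately loose notion of agreement)."""
    agree, total = 0, 0
    for t_cap, t_frm in zip(trajs_from_captions, trajs_from_frames):
        for s_cap, s_frm in zip(t_cap, t_frm):
            total += 1
            same_tool = s_cap["tool"] == s_frm["tool"]
            overlap = set(s_cap.get("indices", [])) & set(s_frm.get("indices", []))
            # Steps with no frame targets (e.g., a final answer) agree on tool alone.
            agree += int(same_tool and (bool(overlap) or not s_cap.get("indices")))
    return agree / total if total else float("nan")

A low agreement rate on held-out videos would indicate that caption-space trajectories are a noisy proxy for frame-grounded decisions, which is exactly the failure mode the referee flags.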

Circularity Check

0 steps flagged

No significant circularity in derivation chain

full rationale

The paper explicitly identifies the circular dependency problem (needing strong video comprehension to generate agentic tool trajectories) and resolves it by using an external powerful agentic LM to produce trajectories entirely in caption space before substituting real frames. This construction relies on an independent external model rather than the target VideoThinker or any self-referential loop. No equations, fitted parameters renamed as predictions, load-bearing self-citations, uniqueness theorems, or ansatzes appear in the provided derivation. The central training data pipeline is therefore self-contained and does not reduce to its own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the transferability of caption-space tool trajectories to video frames and on the assumption that the external agentic LM produces high-quality reasoning sequences without video input.

axioms (1)
  • domain assumption: Tool-use trajectories generated in caption space transfer effectively to real video frames without loss of reasoning quality or introduction of artifacts.
    This premise is required for the synthetic dataset to serve as valid training data for the video model.

pith-pipeline@v0.9.0 · 5554 in / 1285 out tokens · 42126 ms · 2026-05-16T12:14:37.461977+00:00 · methodology


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. ReTool-Video: Recursive Tool-Using Video Agents with Meta-Augmented Tool Grounding

    cs.CV · 2026-05 · unverdicted · novelty 7.0

    ReTool-Video uses a 134-tool meta-augmented library and recursive grounding to translate abstract video intents into fine-grained multimodal operations, outperforming baselines on MVBench, MLVU, and Video-MME.

Reference graph

Works this paper leans on

52 extracted references · 52 canonical work pages · cited by 1 Pith paper · 17 internal anchors
