pith. machine review for the scientific record.

arxiv: 2601.15724 · v2 · submitted 2026-01-22 · 💻 cs.CV · cs.AI

Recognition: no theorem link

VideoThinker: Building Agentic VideoLLMs with LLM-Guided Tool Reasoning

Authors on Pith · no claims yet

Pith reviewed 2026-05-16 12:14 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords agentic video models · tool reasoning · synthetic data · long-form video understanding · VideoLLM · temporal retrieval · spatial zoom

The pith

VideoThinker trains agentic video models on synthetic tool trajectories generated from captions, then grounded to frames.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Long videos challenge current VideoLLMs because static, uniform frame sampling loses critical timing and detail. Building agentic training data normally requires a model that already understands long videos; VideoThinker breaks this circular dependency by first turning videos into detailed captions, then using a strong language model to create multi-step tool sequences, such as temporal retrieval and spatial zoom, entirely in text. These sequences are later paired with the original frames to form training data. The resulting model learns to explore videos adaptively rather than processing them uniformly, and on long-video benchmarks it outperforms both pure language agents and existing video models.
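To make the tool vocabulary concrete, here is a minimal sketch of what temporal retrieval and spatial zoom could look like. The signatures, the toy word-overlap relevance score, and the NumPy-style frame array are illustrative assumptions, not the paper's actual API.

# Illustrative sketch only: the tool names come from the paper, but these
# signatures and the toy relevance score are assumptions, not its API.

def toy_relevance(caption: str, query: str) -> float:
    # Stand-in for a vision-language similarity model: plain word overlap.
    c, q = set(caption.lower().split()), set(query.lower().split())
    return len(c & q) / max(len(q), 1)

def temporal_retrieval(frame_captions: list[str], query: str, top_k: int = 4) -> list[int]:
    """Return indices of the top_k frames whose captions best match the query."""
    ranked = sorted(range(len(frame_captions)),
                    key=lambda i: toy_relevance(frame_captions[i], query),
                    reverse=True)
    return sorted(ranked[:top_k])  # restore temporal order before feeding the model

def spatial_zoom(frame, box: tuple[int, int, int, int]):
    """Crop a region (x0, y0, x1, y1) of an H x W x C frame array (e.g., NumPy)."""
    x0, y0, x1, y1 = box
    return frame[y0:y1, x0:x1]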

Core claim

The central claim is that tool-use trajectories generated by an agentic language model operating only on video captions can be grounded directly to the corresponding frames to produce large-scale training data, equipping a VideoLLM with dynamic retrieval and zoom capabilities without requiring the base model to possess strong long-form video comprehension in advance.

What carries the argument

The synthetic dataset creation pipeline, which converts videos to captions, simulates multi-step tool interactions in caption space, and then substitutes actual video frames for the captions to yield interleaved video-and-reasoning sequences.
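A minimal sketch of that pipeline's two stages; the step-dict trajectory format and the llm_agent planner are hypothetical stand-ins for whatever the paper actually uses.

def simulate_trajectory(captions, question, llm_agent):
    """Stage 1: a strong text-only LLM plans multi-step tool calls over
    numbered captions; each step records the tool and the caption indices
    it touched, e.g. {"tool": "temporal_retrieval", "indices": [12, 47]}."""
    context = "\n".join(f"[{i}] {c}" for i, c in enumerate(captions))
    transcript = []
    for step in llm_agent(question=question, context=context):
        transcript.append(step)
        if step["tool"] == "answer":
            break
    return transcript

def ground_to_frames(transcript, frames):
    """Stage 2: grounding by direct substitution; every caption index a step
    touched is replaced with the corresponding real frame, yielding an
    interleaved video-and-reasoning training example."""
    return [{**step, "observations": [frames[i] for i in step.get("indices", [])]}
            for step in transcript]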

If this is right

  • The model acquires the ability to retrieve specific moments and zoom into regions instead of relying on fixed frame sampling.
  • Training data for agentic video behavior can be scaled without human annotation or access to already-capable video models.
  • Multi-step reasoning emerges from the structure of the grounded tool trajectories.
  • Long-form video tasks that require localization benefit directly from the adaptive exploration learned during training.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same caption-to-trajectory-then-ground approach could be tested on audio or sensor streams where direct simulation is costly.
  • Iterating the process with the trained model itself generating new trajectories might create a self-improving loop.
  • Inference cost could decrease if the model learns to request only the frames it needs rather than processing the entire video.

Load-bearing premise

Tool trajectories created from captions alone preserve enough temporal and spatial information to remain useful once the captions are replaced by real frames.

What would settle it

Performance drop on a benchmark set of videos where the original captions miss key events that the tool trajectories were supposed to locate.
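One way to operationalize that test: partition a benchmark by whether the caption pipeline captured the queried event, then compare accuracy across the split. Here covers_event, the example schema, and the model interface are hypothetical stand-ins.

def caption_gap_split(examples, covers_event):
    """Partition benchmark items by whether the captions mention the event
    the question targets (covers_event is an assumed oracle or heuristic)."""
    covered = [ex for ex in examples if covers_event(ex["captions"], ex["event"])]
    missed = [ex for ex in examples if not covers_event(ex["captions"], ex["event"])]
    return covered, missed

def accuracy(model, examples):
    """model(video, question) -> predicted answer string."""
    if not examples:
        return float("nan")
    return sum(model(ex["video"], ex["question"]) == ex["answer"]
               for ex in examples) / len(examples)

A large gap between accuracy on the covered and missed halves would show the learned policy inherits the captions' blind spots.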

Figures

Figures reproduced from arXiv: 2601.15724 by Chenglin Li, Feng Han, Jiaqi Wang, Qianglong Chen, Ruilin Li, Xingxi Yin, Yan Gong, Yikun Wang, Yin Zhang.

Figure 1: Comparison of VideoThinker with VideoLLMs and LLM agents. VideoThinker excels at interleaved video reasoning on long …
Figure 2: VideoThinker integrates retrieval and zoom tools for multi-turn reasoning. LLMs use …
Figure 3: The prompt designed to enable VideoLLM to serve as a …
Figure 4: Confidence–accuracy relationship. The analysis is …
Figure 6: Performance of VideoThinker at varying confidence …
Figure 7: Case study: VideoThinker performs agentic tool use by …
Figure 8: Distribution of video durations in CoTs.
Figure 9: The prompt designed to enable VideoLLM to think …
Figure 10: The training script with Swift.
Figure 11: VideoThinker’s agentic tool reasoning on LVBench, testing its ability to retrieve key information from a 61-minute video …
Figure 12: VideoThinker’s agentic tool reasoning on LVBench, testing its reasoning ability using a 61-minute video (Cm73ma6Ibcs).
Figure 13: VideoThinker’s agentic tool reasoning on LVBench, testing its event-understanding ability using a 55-minute video …
Figure 14: VideoThinker’s agentic tool reasoning on LVBench, testing its event-recognition ability using a 55-minute video …
Figure 15: VideoThinker’s agentic tool reasoning on LVBench, testing its temporal grounding ability using a 64-minute video …
Figure 16: VideoThinker’s agentic tool reasoning on LVBench, testing its summarization ability using a 53-minute video (TiQBTesZUJQ).
read the original abstract

Long-form video understanding remains a fundamental challenge for current Video Large Language Models. Most existing models rely on static reasoning over uniformly sampled frames, which weakens temporal localization and leads to substantial information loss in long videos. Agentic tools such as temporal retrieval, spatial zoom, and temporal zoom offer a natural way to overcome these limitations by enabling adaptive exploration of key moments. However, constructing agentic video understanding data requires models that already possess strong long-form video comprehension, creating a circular dependency. We address this challenge with VideoThinker, an agentic Video Large Language Model trained entirely on synthetic tool interaction trajectories. Our key idea is to convert videos into rich captions and employ a powerful agentic language model to generate multi-step tool use sequences in caption space. These trajectories are subsequently grounded back to video by replacing captions with the corresponding frames, yielding a large-scale interleaved video and tool reasoning dataset without requiring any long-form understanding from the underlying model. Training on this synthetic agentic dataset equips VideoThinker with dynamic reasoning capabilities, adaptive temporal exploration, and multi-step tool use. Remarkably, VideoThinker significantly outperforms both caption-only language model agents and strong video model baselines across long-video benchmarks, demonstrating the effectiveness of tool augmented synthetic data and adaptive retrieval and zoom reasoning for long-form video understanding.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces VideoThinker, an agentic VideoLLM trained on large-scale synthetic data consisting of multi-step tool-use trajectories (temporal retrieval, spatial zoom, temporal zoom) generated by a powerful external LLM operating entirely in video caption space; these trajectories are then grounded to the original video frames by direct substitution, yielding interleaved training data that equips the model with adaptive long-form reasoning without requiring the target VideoLLM to possess strong video understanding a priori. The central claim is that this construction produces significant outperformance over both caption-only LM agents and strong video-model baselines on long-video benchmarks.

Significance. If the transfer from caption-generated trajectories to frame-grounded execution holds, the approach supplies a scalable, non-circular route to high-quality agentic supervision for video models, directly addressing the data bottleneck that has limited dynamic tool use in long-form video understanding. The explicit separation of trajectory synthesis (caption space) from visual execution (frame space) is a methodological strength that could generalize to other modalities.

major comments (3)
  1. [§3] (Method, trajectory generation): The assumption that tool-use sequences optimal in caption space remain near-optimal once captions are replaced by frames is load-bearing for the entire training pipeline, yet the manuscript provides no quantitative validation (e.g., agreement rate between caption-based and frame-based tool decisions on a held-out set) or error analysis of cases where motion timing or low-contrast details omitted by captions would change retrieval/zoom choices.
  2. [§4] (Experiments): The reported outperformance is stated without accompanying ablations that isolate the contribution of each tool type or of caption quality; without these controls it is impossible to determine whether gains derive from the adaptive policy or from incidental effects of the synthetic data construction.
  3. [§4.2] (Results tables): No error analysis or qualitative examples are supplied showing trajectories that succeed on frames versus those that fail due to caption-induced mismatches, which would directly test the weakest assumption identified in the skeptic note.
minor comments (2)
  1. Notation for the three tools (retrieval, spatial zoom, temporal zoom) is introduced without a compact summary table; adding one would improve readability.
  2. [Abstract] The abstract claims 'significant outperformance' but supplies no numerical deltas or baseline identifiers; these should be stated explicitly even in the abstract.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback. We address each major comment below and have revised the manuscript to incorporate the suggested analyses and ablations.

read point-by-point responses
  1. Referee: [§3] (Method, trajectory generation): The assumption that tool-use sequences optimal in caption space remain near-optimal once captions are replaced by frames is load-bearing for the entire training pipeline, yet the manuscript provides no quantitative validation (e.g., agreement rate between caption-based and frame-based tool decisions on a held-out set) or error analysis of cases where motion timing or low-contrast details omitted by captions would change retrieval/zoom choices.

    Authors: We agree that empirical validation of the caption-to-frame transfer is important. While our grounding method uses direct substitution of frames for captions, we will add a new subsection in the revised manuscript with quantitative agreement rates computed on a held-out set (comparing tool decisions made from captions versus from frames) and an accompanying error analysis of mismatch cases arising from omitted motion or low-contrast details (a minimal sketch of such an agreement check appears after these responses). revision: yes

  2. Referee: [§4] (Experiments): The reported outperformance is stated without accompanying ablations that isolate the contribution of each tool type or of caption quality; without these controls it is impossible to determine whether gains derive from the adaptive policy or from incidental effects of the synthetic data construction.

    Authors: We thank the referee for highlighting the need for clearer isolation of contributions. In the revised experiments section we will add ablations that (i) remove each tool type individually while keeping the others and (ii) vary caption quality (e.g., using shorter or noisier captions). These controls will clarify whether performance gains stem primarily from the adaptive multi-step policy. revision: yes

  3. Referee: [§4.2] (Results tables): No error analysis or qualitative examples are supplied showing trajectories that succeed on frames versus those that fail due to caption-induced mismatches, which would directly test the weakest assumption identified in the skeptic note.

    Authors: We agree that qualitative and error analysis directly addressing caption-frame mismatches would strengthen the paper. The revised manuscript will include a new subsection with representative qualitative examples of successful trajectories and failure cases caused by caption omissions, together with a quantitative breakdown of mismatch frequency and its effect on benchmark performance. revision: yes
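The agreement-rate check promised in response 1 is easy to state concretely. A minimal sketch, assuming trajectories are lists of step dicts carrying a "tool" name and the frame "indices" that step touched; both structures are assumptions for illustration, not the paper's format.

def step_agreement(trajs_from_captions, trajs_from_frames) -> float:
    """Fraction of aligned steps where a caption-space planner and a
    frame-space planner chose the same tool with overlapping frame targets
    (a deliberately loose notion of agreement)."""
    agree, total = 0, 0
    for t_cap, t_frm in zip(trajs_from_captions, trajs_from_frames):
        for s_cap, s_frm in zip(t_cap, t_frm):
            total += 1
            same_tool = s_cap["tool"] == s_frm["tool"]
            overlap = set(s_cap.get("indices", [])) & set(s_frm.get("indices", []))
            # Steps with no frame targets (e.g., a final answer) agree on tool alone.
            agree += int(same_tool and (bool(overlap) or not s_cap.get("indices")))
    return agree / total if total else float("nan")

A low agreement rate on held-out videos would indicate that caption-space trajectories are a noisy proxy for frame-grounded decisions, which is exactly the failure mode the referee flags.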

Circularity Check

0 steps flagged

No significant circularity in derivation chain

full rationale

The paper explicitly identifies the circular dependency problem (needing strong video comprehension to generate agentic tool trajectories) and resolves it by using an external powerful agentic LM to produce trajectories entirely in caption space before substituting real frames. This construction relies on an independent external model rather than the target VideoThinker or any self-referential loop. No equations, fitted parameters renamed as predictions, load-bearing self-citations, uniqueness theorems, or ansatzes appear in the provided derivation. The central training data pipeline is therefore self-contained and does not reduce to its own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the transferability of caption-space tool trajectories to video frames and on the assumption that the external agentic LM produces high-quality reasoning sequences without video input.

axioms (1)
  • domain assumption: Tool-use trajectories generated in caption space transfer effectively to real video frames without loss of reasoning quality or introduction of artifacts.
    This premise is required for the synthetic dataset to serve as valid training data for the video model.

pith-pipeline@v0.9.0 · 5554 in / 1285 out tokens · 42126 ms · 2026-05-16T12:14:37.461977+00:00 · methodology


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. ReTool-Video: Recursive Tool-Using Video Agents with Meta-Augmented Tool Grounding

    cs.CV · 2026-05 · unverdicted · novelty 7.0

    ReTool-Video uses a 134-tool meta-augmented library and recursive grounding to translate abstract video intents into fine-grained multimodal operations, outperforming baselines on MVBench, MLVU, and Video-MME.

Reference graph

Works this paper leans on

52 extracted references · 52 canonical work pages · cited by 1 Pith paper · 17 internal anchors
