VideoThinker: Building Agentic VideoLLMs with LLM-Guided Tool Reasoning
Pith reviewed 2026-05-16 12:14 UTC · model grok-4.3
The pith
VideoThinker trains an agentic video model on synthetic tool trajectories generated from captions and then grounded to frames.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that tool-use trajectories generated by an agentic language model operating only on video captions can be directly grounded to the corresponding frames to produce large-scale training data, equipping a VideoLLM with dynamic retrieval and zoom capabilities without requiring strong long-form video comprehension from the base model.
What carries the argument
The synthetic dataset creation pipeline that converts videos to captions, simulates multi-step tool interactions in caption space, and then replaces the captions with the corresponding video frames to yield interleaved video-and-reasoning sequences.
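As a minimal sketch of how that pipeline could be organized (the helper functions `caption_video`, `agentic_lm_step`, and `frames_for_call`, and all field names, are assumptions for illustration, not the paper's implementation):

```python
# Sketch of caption-space trajectory generation followed by frame grounding.
# All helpers and field names are assumptions, not the authors' code.
from dataclasses import dataclass, field


@dataclass
class ToolCall:
    name: str                 # "temporal_retrieval", "spatial_zoom", or "temporal_zoom"
    args: dict                # e.g. {"start_s": 120.0, "end_s": 135.0}
    caption_obs: str = ""     # what the agentic LM observed in caption space
    frame_obs: list = field(default_factory=list)  # filled in during grounding


def lookup_captions(captions, call):
    """Naive observation: captions whose timestamp falls in the requested window."""
    lo, hi = call.args.get("start_s", 0.0), call.args.get("end_s", float("inf"))
    return " ".join(text for t, text in captions if lo <= t <= hi)


def build_sample(video, question, caption_video, agentic_lm_step, frames_for_call):
    """Generate one interleaved training sample.

    caption_video(video)       -> list of (timestamp, caption) pairs
    agentic_lm_step(q, c, h)   -> next ToolCall, or a final answer string
    frames_for_call(video, c)  -> frames realizing a ToolCall on the real video
    """
    captions = caption_video(video)              # 1. video -> dense captions
    history = []
    while True:                                  # 2. multi-step tool use in caption space
        step = agentic_lm_step(question, captions, history)
        if isinstance(step, str):                # the LM emitted a final answer
            answer = step
            break
        step.caption_obs = lookup_captions(captions, step)
        history.append(step)
    for call in history:                         # 3. grounding: swap captions for frames
        call.frame_obs = frames_for_call(video, call)
    return history, answer                       # interleaved frames + tool reasoning
```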
If this is right
- The model acquires the ability to retrieve specific moments and zoom into regions instead of relying on fixed frame sampling (a tool interface along these lines is sketched after this list).
- Training data for agentic video behavior can be scaled without human annotation or access to already-capable video models.
- Multi-step reasoning emerges from the structure of the grounded tool trajectories.
- Long-form video tasks that require localization benefit directly from the adaptive exploration learned during training.
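One way to picture the three tools the abstract names (temporal retrieval, spatial zoom, temporal zoom) is as a small interface the trained model can invoke; the signatures and defaults below are illustrative assumptions, not the paper's actual tool schema.

```python
# Illustrative interface for the three tools; argument names and defaults are
# assumptions made for this sketch.
from typing import Any, Protocol, Sequence

Frame = Any  # stand-in for whatever frame/image type the model consumes


class VideoTools(Protocol):
    def temporal_retrieval(self, query: str, top_k: int = 4) -> Sequence[Frame]:
        """Fetch the frames whose content best matches a textual query."""

    def temporal_zoom(self, start_s: float, end_s: float, fps: float = 2.0) -> Sequence[Frame]:
        """Resample a time window more densely than the default sampling."""

    def spatial_zoom(self, frame_id: int, box: tuple[float, float, float, float]) -> Frame:
        """Crop and upscale a region (x0, y0, x1, y1) of a single frame."""
```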
Where Pith is reading between the lines
- The same caption-to-trajectory-then-ground approach could be tested on audio or sensor streams where direct simulation is costly.
- Iterating the process with the trained model itself generating new trajectories might create a self-improving loop.
- Inference cost could decrease if the model learns to request only the frames it needs rather than processing the entire video.
Load-bearing premise
Tool trajectories created from captions alone preserve enough temporal and spatial information to remain useful once the captions are replaced by real frames.
What would settle it
Performance drop on a benchmark set of videos where the original captions miss key events that the tool trajectories were supposed to locate.
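One way to operationalize this test, as a sketch: split a benchmark by whether the caption track actually covers the queried event, then compare accuracy across the two strata. The item field names and the overlap heuristic below are assumptions.

```python
# Sketch of the proposed test: stratify benchmark items by whether the original
# captions cover the queried event, then compare accuracy across the two strata.
# The item fields (prediction, answer, event_span, caption_spans) are assumed names.
def covers(caption_spans, event_span, min_overlap_s=1.0):
    """True if any captioned interval overlaps the key event by at least min_overlap_s."""
    lo, hi = event_span
    return any(min(hi, c_hi) - max(lo, c_lo) >= min_overlap_s
               for c_lo, c_hi in caption_spans)


def caption_coverage_gap(items):
    """Accuracy on caption-covered items minus accuracy on caption-missed items."""
    covered = [x for x in items if covers(x["caption_spans"], x["event_span"])]
    missed = [x for x in items if not covers(x["caption_spans"], x["event_span"])]

    def acc(xs):
        return sum(x["prediction"] == x["answer"] for x in xs) / max(len(xs), 1)

    return acc(covered) - acc(missed)  # a large positive gap would support the concern
```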
original abstract
Long-form video understanding remains a fundamental challenge for current Video Large Language Models. Most existing models rely on static reasoning over uniformly sampled frames, which weakens temporal localization and leads to substantial information loss in long videos. Agentic tools such as temporal retrieval, spatial zoom, and temporal zoom offer a natural way to overcome these limitations by enabling adaptive exploration of key moments. However, constructing agentic video understanding data requires models that already possess strong long-form video comprehension, creating a circular dependency. We address this challenge with VideoThinker, an agentic Video Large Language Model trained entirely on synthetic tool interaction trajectories. Our key idea is to convert videos into rich captions and employ a powerful agentic language model to generate multi-step tool use sequences in caption space. These trajectories are subsequently grounded back to video by replacing captions with the corresponding frames, yielding a large-scale interleaved video and tool reasoning dataset without requiring any long-form understanding from the underlying model. Training on this synthetic agentic dataset equips VideoThinker with dynamic reasoning capabilities, adaptive temporal exploration, and multi-step tool use. Remarkably, VideoThinker significantly outperforms both caption-only language model agents and strong video model baselines across long-video benchmarks, demonstrating the effectiveness of tool augmented synthetic data and adaptive retrieval and zoom reasoning for long-form video understanding.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces VideoThinker, an agentic VideoLLM trained on large-scale synthetic data consisting of multi-step tool-use trajectories (temporal retrieval, spatial zoom, temporal zoom) generated by a powerful external LLM operating entirely in video caption space; these trajectories are then grounded to the original video frames by direct substitution, yielding interleaved training data that equips the model with adaptive long-form reasoning without requiring the target VideoLLM to possess strong video understanding a priori. The central claim is that this construction produces significant outperformance over both caption-only LM agents and strong video-model baselines on long-video benchmarks.
Significance. If the transfer from caption-generated trajectories to frame-grounded execution holds, the approach supplies a scalable, non-circular route to high-quality agentic supervision for video models, directly addressing the data bottleneck that has limited dynamic tool use in long-form video understanding. The explicit separation of trajectory synthesis (caption space) from visual execution (frame space) is a methodological strength that could generalize to other modalities.
major comments (3)
- [§3] §3 (Method, trajectory generation): The assumption that tool-use sequences optimal in caption space remain near-optimal once captions are replaced by frames is load-bearing for the entire training pipeline, yet the manuscript provides no quantitative validation (e.g., agreement rate between caption-based and frame-based tool decisions on a held-out set) or error analysis of cases where motion timing or low-contrast details omitted by captions would change retrieval/zoom choices.
- [§4] §4 (Experiments): The reported outperformance is stated without accompanying ablations that isolate the contribution of each tool type or of caption quality; without these controls it is impossible to determine whether gains derive from the adaptive policy or from incidental effects of the synthetic data construction.
- [§4.2] §4.2 (Results tables): No error analysis or qualitative examples are supplied showing trajectories that succeed on frames versus those that fail due to caption-induced mismatches, which directly tests the weakest assumption identified in the skeptic note.
minor comments (2)
- Notation for the three tools (retrieval, spatial zoom, temporal zoom) is introduced without a compact summary table; adding one would improve readability.
- [Abstract] The abstract claims 'significant outperformance' but supplies no numerical deltas or baseline identifiers; these should be stated explicitly even in the abstract.
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive feedback. We address each major comment below and have revised the manuscript to incorporate the suggested analyses and ablations.
point-by-point responses
Referee: [§3] §3 (Method, trajectory generation): The assumption that tool-use sequences optimal in caption space remain near-optimal once captions are replaced by frames is load-bearing for the entire training pipeline, yet the manuscript provides no quantitative validation (e.g., agreement rate between caption-based and frame-based tool decisions on a held-out set) or error analysis of cases where motion timing or low-contrast details omitted by captions would change retrieval/zoom choices.
Authors: We agree that empirical validation of the caption-to-frame transfer is important. While our grounding method uses direct substitution of frames for captions, we will add a new subsection in the revised manuscript with quantitative agreement rates computed on a held-out set (comparing tool decisions made from captions versus from frames) and an accompanying error analysis of mismatch cases arising from omitted motion or low-contrast details. revision: yes
Referee: [§4] §4 (Experiments): The reported outperformance is stated without accompanying ablations that isolate the contribution of each tool type or of caption quality; without these controls it is impossible to determine whether gains derive from the adaptive policy or from incidental effects of the synthetic data construction.
Authors: We thank the referee for highlighting the need for clearer isolation of contributions. In the revised experiments section we will add ablations that (i) remove each tool type individually while keeping the others and (ii) vary caption quality (e.g., using shorter or noisier captions). These controls will clarify whether performance gains stem primarily from the adaptive multi-step policy. revision: yes
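As a minimal sketch of how the promised ablation grid might be enumerated, with configuration names assumed purely for illustration rather than taken from the authors' setup:

```python
# Illustrative enumeration of the ablation grid: drop one tool at a time and
# vary caption quality; names are assumptions for this sketch.
from itertools import product

TOOLS = ("temporal_retrieval", "spatial_zoom", "temporal_zoom")
CAPTION_VARIANTS = ("full", "short", "noisy")  # assumed caption-quality settings


def ablation_configs():
    """Yield (enabled_tools, caption_variant) pairs covering both ablation axes."""
    tool_sets = [TOOLS] + [tuple(t for t in TOOLS if t != dropped) for dropped in TOOLS]
    yield from product(tool_sets, CAPTION_VARIANTS)
```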
Referee: [§4.2] §4.2 (Results tables): No error analysis or qualitative examples are supplied showing trajectories that succeed on frames versus those that fail due to caption-induced mismatches, which directly tests the weakest assumption identified in the skeptic note.
Authors: We agree that qualitative and error analysis directly addressing caption-frame mismatches would strengthen the paper. The revised manuscript will include a new subsection with representative qualitative examples of successful trajectories and failure cases caused by caption omissions, together with a quantitative breakdown of mismatch frequency and its effect on benchmark performance. revision: yes
Circularity Check
No significant circularity in derivation chain
full rationale
The paper explicitly identifies the circular dependency problem (needing strong video comprehension to generate agentic tool trajectories) and resolves it by using an external powerful agentic LM to produce trajectories entirely in caption space before substituting real frames. This construction relies on an independent external model rather than the target VideoThinker or any self-referential loop. No equations, fitted parameters renamed as predictions, load-bearing self-citations, uniqueness theorems, or ansatzes appear in the provided derivation. The central training data pipeline is therefore self-contained and does not reduce to its own inputs by construction.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Tool-use trajectories generated in caption space transfer effectively to real video frames without loss of reasoning quality or introduction of artifacts.
Forward citations
Cited by 1 Pith paper
- ReTool-Video: Recursive Tool-Using Video Agents with Meta-Augmented Tool Grounding
ReTool-Video uses a 134-tool meta-augmented library and recursive grounding to translate abstract video intents into fine-grained multimodal operations, outperforming baselines on MVBench, MLVU, and Video-MME.