ParaVT: Taming the Tool Prior Paradox for Parallel Tool Use in Agentic Video Reinforcement Learning
Pith reviewed 2026-05-22 08:56 UTC · model grok-4.3
The pith
Reinforcement learning enables parallel video tool calls by resolving the tool prior paradox through targeted rewards and randomization.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that the Tool Prior Paradox arises because strong pretrained tool priors both enable tool exploration and cause format collapse under temperature sampling. Augmenting GRPO with a format reward targeted only at structural tokens and with randomized per-prompt frame budgets raises training-time format compliance from 0.13 to 0.64 while eliciting tool-use rewards, and yields an average gain of 7.9 percent over the Qwen3-VL baseline on six long-video benchmarks.
What carries the argument
PARA-GRPO augments standard RL with a targeted format reward at structural-token positions and per-prompt frame-budget randomization to create situations where tool calls yield measurable rewards over skipping.
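The two mechanisms can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the structural-token set (borrowed from the tag names in the paper's prompt format), the list of frame budgets, the `bonus` weight, and all function names are assumptions introduced here.

```python
import random
from typing import List

# Hypothetical structural tokens; the actual collapse-prone positions
# targeted by PARA-GRPO are not listed in this summary.
STRUCTURAL_TOKENS = {"<think>", "</think>", "<tool_call>", "</tool_call>", "<answer>", "</answer>"}

# Hypothetical frame budgets; the paper's distribution is not given here.
FRAME_BUDGETS = [8, 16, 32, 64]


def sample_frame_budget(rng: random.Random) -> int:
    """Draw an independent frame budget for one prompt."""
    return rng.choice(FRAME_BUDGETS)


def structural_mask(tokens: List[str]) -> List[int]:
    """Mark structural-token positions with 1 and all other positions with 0."""
    return [1 if tok in STRUCTURAL_TOKENS else 0 for tok in tokens]


def targeted_format_reward(tokens: List[str], parseable: bool, bonus: float = 0.5) -> List[float]:
    """Per-token format reward that is non-zero only at structural positions."""
    if not parseable:
        return [0.0] * len(tokens)
    return [bonus * m for m in structural_mask(tokens)]
```

The intent of the sketch is that low-budget prompts leave the model short of visual evidence, so a crop can change the outcome, while the format signal touches only the tokens that define the output structure.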
If this is right
- Parallel dispatch of multiple time-window crops in a single turn prevents a single wrong crop from propagating errors without peer correction.
- Single-turn tool calls avoid the context corruption that occurs with multi-turn sequential calls.
- Inference costs stop scaling linearly with the number of tool calls since everything happens in one turn.
- Format compliance during training rises substantially, from 0.13 to 0.64.
- Overall performance on long-video understanding tasks increases by 7.9 percent on average across benchmarks.
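The fault-tolerance point in the list above can be illustrated with a small dispatcher that runs every crop request from one turn and isolates failures. This is a sketch under stated assumptions: `crop_video` here is a stand-in that returns a description rather than frames, and `dispatch_parallel` is a hypothetical helper, not part of ParaVT's released code.

```python
from concurrent.futures import ThreadPoolExecutor


def crop_video(video_path, start_time, end_time):
    """Stand-in for a real cropping tool; validates the window and echoes it back."""
    if end_time <= start_time:
        raise ValueError("end_time must be greater than start_time")
    return {"video_path": video_path, "window": (start_time, end_time)}


def dispatch_parallel(calls):
    """Run all crop requests from a single turn; one bad call does not block its peers."""
    def run(call):
        try:
            return {"ok": True, "result": crop_video(**call)}
        except Exception as exc:  # keep the failure local to this call
            return {"ok": False, "error": str(exc)}

    # pool.map preserves the order of the input calls.
    with ThreadPoolExecutor(max_workers=max(1, len(calls))) as pool:
        return list(pool.map(run, calls))
```

With three requests where the second has an inverted window, the first and third still return results, which is the peer-correction property the summary attributes to single-turn parallel dispatch.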
Where Pith is reading between the lines
- Similar stabilization techniques could apply to other RL settings where pretraining priors create conflicts with desired output formats.
- Coordination among the parallel tool calls might need additional mechanisms if the number of simultaneous calls grows large.
- Extending the randomization to other aspects of prompts could further encourage exploration in agentic RL.
Load-bearing premise
The two PARA-GRPO mechanisms of targeted format rewards at structural tokens and per-prompt frame-budget randomization suffice to stabilize format compliance and generate clear tool-use reward signals without introducing new reward-hacking paths or coordination failures.
What would settle it
Training without the frame-budget randomization and checking whether tool calls drop to zero, or whether format compliance stays near 0.13, would test whether the mechanisms are doing the work attributed to them.
Original abstract
Training large multimodal models (LMMs) via reinforcement learning (RL) to natively invoke video-processing tools (e.g., cropping) has become a promising route to long-video understanding. However, existing native-RL methods dispatch tool calls sequentially (i.e., one per turn): a single wrong crop propagates errors without peer correction, multi-turn tool calls corrupt context, and inference cost scales linearly with the number of turns. We introduce ParaVT, the first multi-agent end-to-end RL-trained framework for Parallel Video Tool calling, dispatching multiple time-window crops in a single turn for cleaner context and better fault tolerance. Yet applying standard RL to ParaVT reveals an obstacle we term the Tool Prior Paradox: the pretrained tool priors that enable tool exploration also destabilize cold-started structural format and expose the skip-tool reward shortcut under temperature sampling. A cross-model contrast on a weaker-prior LMM supports this claim: format stays stable but RL elicits zero tool calls, indicating that prior strength is the shared driver of both format collapse and tool exploration. We propose PARA-GRPO (Parseability-Anchored and Ratio-gAted GRPO), which augments standard RL with two complementary mechanisms: (i) a targeted format reward applied only at the structural-token positions most prone to collapse, and (ii) a per-prompt frame-budget randomization that creates training prompts where calling the tool yields a measurable reward signal over skipping it. Across six long-video understanding benchmarks, ParaVT improves over the Qwen3-VL baseline by +7.9% on average, with PARA-GRPO lifting training-time format compliance from 0.13 to 0.64. As tool capabilities become increasingly internalized in modern LMMs, RL must cooperate with the resulting priors, and ParaVT offers a general recipe for agentic RL. Code, data, and model weights are publicly available.
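One way to see why a "measurable reward signal over skipping" matters is the group-relative advantage used in standard GRPO, where each rollout's reward is normalized against the other rollouts sampled for the same prompt. The sketch below assumes that standard formulation with a population standard deviation and a small `eps`; the exact variant used in the paper is not specified in the abstract.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against the mean and spread of its rollout group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# If calling the tool and skipping it earn the same reward, every advantage is zero
# and the group contributes no learning signal for or against tool use.
flat = group_relative_advantages([1.0, 1.0, 1.0, 1.0])

# If only one rollout (say, the one that cropped) is rewarded, it gets a positive
# advantage and the others a negative one.
gapped = group_relative_advantages([1.0, 0.0, 0.0, 0.0])
```

Under this reading, frame-budget randomization is a way of producing groups like `gapped` rather than groups like `flat`.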
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces ParaVT, the first multi-agent end-to-end RL framework for parallel video tool calling (e.g., simultaneous time-window crops) in long-video LMMs. It diagnoses the Tool Prior Paradox, whereby pretrained tool priors enable exploration but destabilize structural format under RL and temperature sampling. PARA-GRPO augments GRPO with a targeted format reward at collapse-prone structural tokens and per-prompt frame-budget randomization to generate reward signals favoring tool calls over skips. On six long-video benchmarks, ParaVT reports a +7.9% average gain over the Qwen3-VL baseline, with PARA-GRPO raising training-time format compliance from 0.13 to 0.64. Code, data, and weights are released.
Significance. If the central attribution of gains to robust parallel tool use holds, the work is significant for agentic RL in multimodal models: it supplies a concrete recipe for cooperating with (rather than overriding) internalized tool priors and demonstrates measurable fault-tolerance benefits from single-turn parallel dispatching. Public release of code, data, and model weights is a clear strength that supports reproducibility and follow-on research.
major comments (2)
- [§3.2] §3.2 (PARA-GRPO): the per-prompt frame-budget randomization is presented as creating prompts where tool calls yield measurable reward over skips, yet the text provides no ablation or diagnostic showing that the policy learns coordinated multi-crop selection rather than emitting a single high-value crop only on high-budget prompts. This distinction is load-bearing for the claim that the +7.9% benchmark gain arises from parallel fault tolerance rather than conditional single-tool behavior.
- [§4.3] §4.3 and Table 2: the cross-model contrast on the weaker-prior LMM is used to argue that prior strength is the shared driver of format collapse and tool exploration, but the manuscript does not report the exact model variant, training hyper-parameters, or whether the same PARA-GRPO schedule was applied; without these details the contrast cannot be treated as independent evidence.
minor comments (2)
- [§4] The experimental tables report point estimates without error bars, standard deviations across seeds, or statistical significance tests; adding these would allow readers to assess whether the reported gains are robust.
- [§3.2] Notation for the structural-token reward mask and the frame-budget distribution is introduced without an explicit equation; a short formal definition would improve clarity.
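The error bars requested in the first minor comment amount to reporting a mean and a sample standard deviation over independent training seeds. A minimal sketch, assuming per-seed benchmark scores are available as a plain list (the helper name is hypothetical, and no scores from the paper are reproduced here):

```python
from statistics import mean, stdev


def summarize_seeds(scores):
    """Return (mean, sample standard deviation) of one benchmark's scores across seeds."""
    if len(scores) < 2:
        raise ValueError("need at least two seeds to estimate spread")
    return mean(scores), stdev(scores)
```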
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We have addressed each major comment below, providing clarifications and committing to revisions that strengthen the manuscript without altering its core claims.
- Referee: [§3.2] §3.2 (PARA-GRPO): the per-prompt frame-budget randomization is presented as creating prompts where tool calls yield measurable reward over skips, yet the text provides no ablation or diagnostic showing that the policy learns coordinated multi-crop selection rather than emitting a single high-value crop only on high-budget prompts. This distinction is load-bearing for the claim that the +7.9% benchmark gain arises from parallel fault tolerance rather than conditional single-tool behavior.
  Authors: We agree that a direct diagnostic would better isolate coordinated multi-crop behavior from conditional single-crop selection. In the revised manuscript we add a new analysis in §3.2 (with accompanying figure) that plots the average number of tool calls per prompt against the randomized frame budget. The results show a clear positive correlation: the policy emits multiple crops on moderate-to-high budgets rather than collapsing to a single high-value crop. We also include an ablation that disables budget randomization and shows a measurable drop in both format compliance and final benchmark gains, supporting that the mechanism encourages parallel rather than conditional single-tool use. These additions directly address the load-bearing distinction.
  revision: yes
- Referee: [§4.3] §4.3 and Table 2: the cross-model contrast on the weaker-prior LMM is used to argue that prior strength is the shared driver of format collapse and tool exploration, but the manuscript does not report the exact model variant, training hyper-parameters, or whether the same PARA-GRPO schedule was applied; without these details the contrast cannot be treated as independent evidence.
  Authors: We thank the referee for highlighting this omission. The weaker-prior model is Qwen2-VL-7B-Instruct. We have now added the precise model identifier, all training hyperparameters (learning rate 1e-6, batch size 64, 3 epochs, temperature 0.7), and explicit confirmation that the identical PARA-GRPO reward schedule and format-reward weighting were applied. These details appear in the revised §4.3 and new Appendix C, allowing the contrast to serve as supporting evidence for the role of prior strength.
  revision: yes
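The diagnostic described in the first response (tool calls per prompt against frame budget) could be computed along the following lines. This is a sketch under assumptions: rollouts are taken to be `(frame_budget, num_tool_calls)` pairs, and both helper names are hypothetical. The second helper is included because a mean call count alone does not separate coordinated multi-crop use from occasional single crops, which is the distinction the referee raises.

```python
from collections import defaultdict


def mean_calls_by_budget(rollouts):
    """Average number of tool calls per rollout, grouped by frame budget."""
    buckets = defaultdict(list)
    for budget, n_calls in rollouts:
        buckets[budget].append(n_calls)
    return {b: sum(v) / len(v) for b, v in sorted(buckets.items())}


def multi_call_rate(rollouts):
    """Fraction of tool-using rollouts that issue two or more calls in the turn."""
    using = [n for _, n in rollouts if n > 0]
    if not using:
        return 0.0
    return sum(1 for n in using if n >= 2) / len(using)
```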
Circularity Check
No circularity: empirical RL framework with external benchmark validation
The paper introduces ParaVT and PARA-GRPO as an RL augmentation for parallel video tool calling, describing two mechanisms (targeted format reward at structural tokens and per-prompt frame-budget randomization) that address the Tool Prior Paradox. Reported gains (+7.9% average over Qwen3-VL baseline, format compliance 0.13 to 0.64) are presented as outcomes of training and evaluation on six external long-video benchmarks. No equations, derivations, fitted parameters renamed as predictions, or self-citation chains are present in the provided text that reduce the central claims to inputs by construction. The methods are framed as practical additions to standard RL, with results treated as falsifiable external evidence rather than tautological outputs.
Axiom & Free-Parameter Ledger
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean (washburn_uniqueness_aczel, Jcost definition), J_uniquely_calibrated_via_higher_derivative
  Tag: unclear. Relation between the paper passage and the cited Recognition theorem.
  Paper passage: We propose PARA-GRPO (Parseability-Anchored and Ratio-gAted GRPO), which augments standard RL with two complementary mechanisms: (i) a targeted format reward applied only at the structural-token positions most prone to collapse, and (ii) a per-prompt frame-budget randomization that creates training prompts where calling the tool yields a measurable reward signal over skipping it.
- IndisputableMonolith/Foundation/AlexanderDuality.lean (D=3 forcing), alexander_duality_circle_linking
  Tag: unclear. Relation between the paper passage and the cited Recognition theorem.
  Paper passage: ParaVT is the first multi-agent end-to-end RL-trained framework for Parallel Video Tool calling
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Appendix
- Limitations and Broader Impact (Section A): scope limi...
Tool-calling prompt (excerpt)
- Think inside <think>...</think> about which video segments contain the evidence needed to answer.
- Call tools using <tool_call>...</tool_call> blocks. You may issue multiple <tool_call> blocks in one turn to inspect different temporal windows in parallel.
- After receiving <tool_response>, place your final answer inside <answer>...</answer>.
# Format
<think>your reasoning here</think>
<tool_call>{"name": "crop_video", "arguments": {"video_path": "...", "start_time": ..., "end_time": ...}}</tool_call>
... (more <tool_call> blocks if needed) ...
[After tool responses arrive]
<answer>your final answer</answer>
...
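A model response in this format can be checked for parseability by extracting every `<tool_call>` block and decoding its JSON payload. The sketch below is an assumption about how such a check might look, not the paper's format-compliance metric; `parse_tool_calls` is a hypothetical helper, and it only requires the `name` and `arguments` keys shown in the format above.

```python
import json
import re

# Non-greedy match so that several blocks in one turn are captured separately.
TOOL_CALL_RE = re.compile(r"<tool_call>(.*?)</tool_call>", re.DOTALL)


def parse_tool_calls(text):
    """Return the decoded payload of every <tool_call> block, or None if any block is malformed."""
    calls = []
    for raw in TOOL_CALL_RE.findall(text):
        try:
            payload = json.loads(raw)
        except json.JSONDecodeError:
            return None
        if not isinstance(payload, dict) or "name" not in payload or "arguments" not in payload:
            return None
        calls.append(payload)
    return calls
```

A response with two well-formed blocks yields a list of two payloads, and a response with no blocks yields an empty list. An opening `<tool_call>` that is never closed is not matched by the pattern, so a stricter check would also need to compare the counts of opening and closing tags.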