LongVT: Incentivizing "Thinking with Long Videos" via Native Tool Calling
Pith reviewed 2026-05-22 12:22 UTC · model grok-4.3
The pith
Large multimodal models can improve long-video reasoning by using their built-in temporal grounding to crop and resample key clips as a native tool.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By exploiting LMMs' inherent temporal grounding ability as a native video cropping tool to zoom in on specific clips and resample finer-grained frames within an interleaved Multimodal Chain-of-Tool-Thought, and training via a three-stage strategy on the VideoSIAH dataset, LongVT enables thinking with long videos and outperforms existing strong baselines across four challenging long-video understanding and reasoning benchmarks.
What carries the argument
The native video cropping tool that uses the model's own temporal grounding to select relevant clips and resample them at higher frame rates inside the global-to-local reasoning loop.
If this is right
- The model learns to decide when to stop the reasoning loop once evidence is sufficient rather than continuing indefinitely.
- Training data that mixes tool-use examples with reinforcement learning helps the model avoid over-cropping or missing distant evidence.
- Releasing the VideoSIAH training and evaluation sets lets others test similar native-tool approaches on their own models.
- The same global-to-local pattern can be applied to other tasks where evidence must be located across long sequences.
Where Pith is reading between the lines
- If the native cropping works without extra modules, similar built-in grounding could be used for audio tracks or multi-camera setups without custom tool training.
- Extending the loop to include multiple rounds of cropping on the same clip might help when initial selections still lack detail.
- Measuring how often the model chooses to crop versus answering directly would show whether the tool is used efficiently.
Load-bearing premise
The model's existing temporal grounding skill can be used directly to crop videos accurately without adding new errors or requiring separate training for the cropping action itself.
What would settle it
Run the model on a long video containing a clearly defined event at a known timestamp and check whether the clips it chooses to crop and inspect actually contain that event.
Figures
read the original abstract
Large multimodal models (LMMs) have shown great potential for video reasoning with textual Chain-of-Thought. However, they remain vulnerable to hallucinations, especially when processing long-form videos where evidence is sparse and temporally dispersed. Inspired by how humans comprehend long videos - by first skimming globally and then examining relevant clips for details - we introduce LongVT, an end-to-end agentic framework that enables "Thinking with Long Videos" via interleaved Multimodal Chain-of-Tool-Thought. Specifically, we exploit LMMs' inherent temporal grounding ability as a native video cropping tool to zoom in on a specific video clip and resample finer-grained video frames. This global-to-local reasoning loop continues until answers are grounded in retrieved visual evidence. Given the scarcity of fine-grained question-answering (QA) data for the long video reasoning task, we curate and will release a data suite named VideoSIAH to facilitate both training and evaluation. Specifically, our training dataset consists of 247.9K samples for tool-integrated cold-start supervised fine-tuning, 1.6K samples for agentic reinforcement learning, and 15.4K samples for agentic reinforcement fine-tuning, respectively. Our evaluation benchmark consists of 1,280 QA pairs that are carefully curated through a semi-automatic data pipeline with human-in-the-loop validation. With a meticulously designed three-stage training strategy and extensive empirical validation, LongVT consistently outperforms existing strong baselines across four challenging long-video understanding and reasoning benchmarks. Our codes, data, and model checkpoints are publicly available at https://github.com/EvolvingLMMs-Lab/LongVT .
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces LongVT, an end-to-end agentic framework for long-video reasoning that interleaves multimodal chain-of-tool-thought with native video cropping. It exploits LMMs' pre-trained temporal grounding to select and resample relevant clips without separate tool training, using a three-stage pipeline (247.9K-sample cold-start SFT, 1.6K-sample agentic RL, 15.4K-sample RL fine-tuning) on the newly curated VideoSIAH dataset. Evaluation on a 1,280-pair held-out benchmark shows consistent gains over strong baselines on four long-video understanding and reasoning tasks, with public code, data, and checkpoints.
Significance. If the reported gains prove robust and attributable to the native-tool mechanism, the work offers a practical route to scalable long-video reasoning that avoids the overhead of training dedicated localization modules. The public release of training data, evaluation set, and model weights is a clear strength for reproducibility and follow-on research.
major comments (2)
- [§4] §4 (Experiments) and associated tables: the central outperformance claim is presented without any diagnostic metrics on the reliability of the native temporal-grounding tool (e.g., precision/recall or temporal IoU of predicted start/end times against ground-truth relevant segments). Because the method explicitly relies on this unmeasured component to produce accurate crops, the absence of such analysis leaves open the possibility that gains derive primarily from data curation or the RL stages rather than the claimed native-tool loop.
- [§3.2] §3.2 (Tool-integrated cold-start SFT) and §3.3 (Agentic RL): no ablation isolates the contribution of the cropping tool itself. Removing or replacing the native grounding step with oracle crops or a separately trained localizer would directly test whether the reported improvements require the specific assumption that LMMs already possess reliable, error-free temporal grounding.
minor comments (2)
- [Abstract] The abstract and §1 state that LongVT 'consistently outperforms existing strong baselines' yet provide no numerical deltas, baseline names, or error bars; these should be added for immediate readability.
- [§3] Notation for the interleaved reasoning loop (e.g., how tool calls are formatted and how frame resampling is performed) is described only at a high level; a concrete example or pseudocode would improve clarity.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address each major comment below and commit to revisions that directly incorporate the suggested analyses to strengthen the evidence for the native tool-calling mechanism.
read point-by-point responses
-
Referee: [§4] §4 (Experiments) and associated tables: the central outperformance claim is presented without any diagnostic metrics on the reliability of the native temporal-grounding tool (e.g., precision/recall or temporal IoU of predicted start/end times against ground-truth relevant segments). Because the method explicitly relies on this unmeasured component to produce accurate crops, the absence of such analysis leaves open the possibility that gains derive primarily from data curation or the RL stages rather than the claimed native-tool loop.
Authors: We agree that direct diagnostic metrics on the native temporal-grounding tool would provide stronger support for attributing gains to the tool-calling loop rather than data or RL alone. In the revised manuscript we will add a dedicated analysis subsection in §4 reporting precision, recall, and temporal IoU of the model’s predicted start/end times against the ground-truth relevant segments available in the VideoSIAH evaluation set. These metrics will be computed on the 1,280-pair held-out benchmark and included alongside the existing task-performance tables. revision: yes
-
Referee: [§3.2] §3.2 (Tool-integrated cold-start SFT) and §3.3 (Agentic RL): no ablation isolates the contribution of the cropping tool itself. Removing or replacing the native grounding step with oracle crops or a separately trained localizer would directly test whether the reported improvements require the specific assumption that LMMs already possess reliable, error-free temporal grounding.
Authors: We concur that an ablation isolating the native cropping tool is necessary to validate the core claim. In the revision we will add experiments that replace the native grounding step with (i) oracle ground-truth crops and (ii) a separately trained temporal localizer baseline. The resulting performance deltas on the four long-video benchmarks will be reported in §4 (new rows in Table 2 or an additional ablation table). These comparisons will quantify how much the integrated native-tool loop contributes beyond data curation and the RL stages, while acknowledging that the LMM is not assumed to be error-free—the RL phases are explicitly designed to incentivize robust tool use despite occasional grounding inaccuracies. revision: yes
Circularity Check
No circularity: empirical framework with external benchmarks and public artifacts
full rationale
The paper describes an agentic LMM framework trained in three stages on curated VideoSIAH data (247.9K SFT + RL samples) and evaluated on four benchmarks with 1,280 QA pairs. No equations, derivations, or fitted parameters are presented that reduce by construction to the inputs. The core assumption of inherent temporal grounding is invoked as a starting point for tool use and then validated through end-to-end performance gains rather than being defined in terms of the outputs. Public code, data, and checkpoints allow independent reproduction against external benchmarks, confirming the derivation chain is self-contained.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption LMMs possess inherent temporal grounding ability usable as a native cropping tool
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.leanwashburn_uniqueness_aczel unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
we exploit LMMs' inherent temporal grounding ability as a native video cropping tool... joint answer-temporal grounding reward... R(k) = R(k)acc + R(k)format + R(k)time with IoU
-
IndisputableMonolith/Foundation/RealityFromDistinction.leanreality_from_one_distinction unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
three-stage training strategy... cold-start SFT + agentic RL + RFT
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 9 Pith papers
-
ParaVT: Taming the Tool Prior Paradox for Parallel Tool Use in Agentic Video Reinforcement Learning
ParaVT is a parallel video tool-calling RL framework that resolves the Tool Prior Paradox via PARA-GRPO, delivering +7.9% average gains on six long-video benchmarks and raising format compliance from 0.13 to 0.64.
-
VISD: Enhancing Video Reasoning via Structured Self-Distillation
VISD improves VideoLLM reasoning performance and training efficiency by combining structured multi-dimensional self-distillation feedback with RL via direction-magnitude decoupling, curriculum scheduling, and EMA stab...
-
Listening with Time: Precise Temporal Awareness for Long-Form Audio Understanding
LAT-Audio introduces a global-to-local reasoning approach with TWA-CoT that outperforms prior models on temporal tasks for audio up to 30 minutes.
-
SVAgent: Storyline-Guided Long Video Understanding via Cross-Modal Multi-Agent Collaboration
SVAgent improves long video question answering by constructing storylines via multi-agent collaboration and aligning cross-modal predictions for more robust, human-like reasoning.
-
ParaVT: Taming the Tool Prior Paradox for Parallel Tool Use in Agentic Video Reinforcement Learning
ParaVT introduces the first multi-agent RL framework for parallel video tool calling in LMMs, using PARA-GRPO to resolve the Tool Prior Paradox and achieve +7.9% average improvement over Qwen3-VL baseline across six b...
-
VideoSeeker: Incentivizing Instance-level Video Understanding via Native Agentic Tool Invocation
VideoSeeker integrates agentic reasoning and visual prompts into LVLMs via automated data synthesis, cold-start supervision, and RL training, yielding +13.7% gains on instance-level video tasks over baselines includin...
-
Training Long-Context Vision-Language Models Effectively with Generalization Beyond 128K Context
Continued pre-training with balanced long-document VQA data extends a 7B LVLM to 128K context, improving long-document VQA by 7.1% and generalizing to 512K without further training.
-
VISD: Enhancing Video Reasoning via Structured Self-Distillation
VISD adds structured privileged feedback from a judge model and a direction-magnitude decoupling trick to let VideoLLMs learn token-level credit assignment while keeping RL stable, yielding higher accuracy and roughly...
-
VISD: Enhancing Video Reasoning via Structured Self-Distillation
VISD improves VideoLLM reasoning by adding multi-dimensional diagnostic self-distillation and RL decoupling, yielding higher accuracy, better grounding, and nearly 2x faster training convergence.
Reference graph
Works this paper leans on
-
[1]
Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2. 5-vl technical report.arXiv preprint arXiv:2502.13923, 2025. 3, 4, 5, 7, 8, 1, 2
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[2]
Mu Cai, Reuben Tan, Jianrui Zhang, Bocheng Zou, Kai Zhang, Feng Yao, Fangrui Zhu, Jing Gu, Yiwu Zhong, Yuzhang Shang, et al. Temporalbench: Benchmarking fine- grained temporal understanding for multimodal video mod- els.arXiv preprint arXiv:2410.10818, 2024. 2
-
[3]
Maya Cakmak and Andrea L Thomaz. Eliciting good teach- ing from humans for machine learners.Artificial Intelli- gence, 217:198–215, 2014. 2
work page 2014
-
[4]
Scaling rl to long videos.arXiv preprint arXiv:2507.07966, 2025
Yukang Chen, Wei Huang, Baifeng Shi, Qinghao Hu, Han- rong Ye, Ligeng Zhu, Zhijian Liu, Pavlo Molchanov, Jan Kautz, Xiaojuan Qi, et al. Scaling rl to long videos.arXiv preprint arXiv:2507.07966, 2025. 3, 4, 5
-
[5]
Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blis- tein, Ori Ram, Dan Zhang, Evan Rosen, et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities.arXiv preprint arXiv:2507.06261, 2025. 2, 4, 5, 1
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[6]
OpenVLThinker: Complex Vision-Language Reasoning via Iterative SFT-RL Cycles
Yihe Deng, Hritik Bansal, Fan Yin, Nanyun Peng, Wei Wang, and Kai-Wei Chang. Openvlthinker: An early ex- ploration to complex vision-language reasoning via iterative self-improvement.arXiv preprint arXiv:2503.17352, 2025. 3
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[7]
Grit: Teaching mllms to think with images
Yue Fan, Xuehai He, Diji Yang, Kaizhi Zheng, Ching-Chen Kuo, Yuting Zheng, Sravana Jyothi Narayanaraju, Xinze Guan, and Xin Eric Wang. Grit: Teaching mllms to think with images. InAdvances in Neural Information Processing Systems, 2025. 3 9
work page 2025
-
[8]
Video-R1: Reinforcing Video Reasoning in MLLMs
Kaituo Feng, Kaixiong Gong, Bohao Li, Zonghao Guo, Yibing Wang, Tianshuo Peng, Junfei Wu, Xiaoying Zhang, Benyou Wang, and Xiangyu Yue. Video-r1: Reinforcing video reasoning in mllms.arXiv preprint arXiv:2503.21776,
work page internal anchor Pith review Pith/arXiv arXiv
-
[9]
Video-mme: The first-ever comprehensive evaluation benchmark of multi-modal llms in video analysis
Chaoyou Fu, Yuhan Dai, Yongdong Luo, Lei Li, Shuhuai Ren, Renrui Zhang, Zihan Wang, Chenyu Zhou, Yunhang Shen, Mengdan Zhang, et al. Video-mme: The first-ever comprehensive evaluation benchmark of multi-modal llms in video analysis. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 24108–24118, 2025. 2, 3, 7, 8, 5
work page 2025
-
[10]
Tall: Temporal activity localization via language query
Jiyang Gao, Chen Sun, Zhenheng Yang, and Ram Nevatia. Tall: Temporal activity localization via language query. In Proceedings of the IEEE international conference on com- puter vision, pages 5267–5275, 2017. 2, 8, 9
work page 2017
-
[11]
DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning
Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning.arXiv preprint arXiv:2501.12948, 2025. 2, 3, 9
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[12]
Wenyi Hong, Wenmeng Yu, Xiaotao Gu, Guo Wang, Guob- ing Gan, Haomiao Tang, Jiale Cheng, Ji Qi, Junhui Ji, Li- hang Pan, et al. Glm-4.5v and glm-4.1v-thinking: Towards versatile multimodal reasoning with scalable reinforcement learning.arXiv preprint arXiv:2507.01006, 2025. 3, 4
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[13]
Video-MMMU: Evaluating Knowledge Acquisition from Multi-Discipline Professional Videos
Kairui Hu, Penghao Wu, Fanyi Pu, Wang Xiao, Yuanhan Zhang, Xiang Yue, Bo Li, and Ziwei Liu. Video-mmmu: Evaluating knowledge acquisition from multi-discipline pro- fessional videos.arXiv preprint arXiv:2501.13826, 2025. 2, 7, 8, 5
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[14]
Multimodal 2gpretraining for dense video cap- tioning.arXiv preprint arXiv:2011.11760, 2020
Gabriel Huang, Bo Pang, Zhenhai Zhu, Clara Rivera, and Radu Soricut. Multimodal 2gpretraining for dense video cap- tioning.arXiv preprint arXiv:2011.11760, 2020. 2
-
[15]
Vision-R1: Incentivizing Reasoning Capability in Multimodal Large Language Models
Wenxuan Huang, Bohan Jia, Zijie Zhai, Shaosheng Cao, Zheyu Ye, Fei Zhao, Zhe Xu, Yao Hu, and Shaohui Lin. Vision-r1: Incentivizing reasoning capability in multimodal large language models.arXiv preprint arXiv:2503.06749,
work page internal anchor Pith review Pith/arXiv arXiv
-
[16]
Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perel- man, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Weli- hinda, Alan Hayes, Alec Radford, et al. Gpt-4o system card. arXiv preprint arXiv:2410.21276, 2024. 7, 8
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[17]
Aaron Jaech, Adam Kalai, Adam Lerer, Adam Richard- son, Ahmed El-Kishky, Aiden Low, Alec Helyar, Aleksander Madry, Alex Beutel, Alex Carney, et al. Openai o1 system card.arXiv preprint arXiv:2412.16720, 2024. 3
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[18]
Dense-captioning events in videos
Ranjay Krishna, Kenji Hata, Frederic Ren, Li Fei-Fei, and Juan Carlos Niebles. Dense-captioning events in videos. In Proceedings of the IEEE international conference on com- puter vision, pages 706–715, 2017. 2
work page 2017
-
[19]
Efficient memory management for large language model serving with pagedattention
Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with pagedattention. InPro- ceedings of the 29th symposium on operating systems prin- ciples, pages 611–626, 2023. 5
work page 2023
-
[20]
Sicong Leng, Jing Wang, Jiaxi Li, Hao Zhang, Zhiqiang Hu, Boqiang Zhang, Yuming Jiang, Hang Zhang, Xin Li, Lidong Bing, et al. Mmr1: Enhancing multimodal reasoning with variance-aware sampling and open resources.arXiv preprint arXiv:2509.21268, 2025. 3
-
[21]
Gang Li, Jizhong Liu, Heinrich Dinkel, Yadong Niu, Junbo Zhang, and Jian Luan. Reinforcement learning outperforms supervised fine-tuning: A case study on audio question an- swering.arXiv preprint arXiv:2503.11197, 2025. 3
-
[22]
Jiaxiang Li, Siliang Zeng, Hoi-To Wai, Chenliang Li, Al- fredo Garcia, and Mingyi Hong. Getting more juice out of the sft data: Reward learning from human demonstration im- proves sft for llm alignment. InAdvances in Neural Informa- tion Processing Systems, pages 124292–124318, 2024. 9
work page 2024
-
[23]
Mvbench: A comprehensive multi-modal video understand- ing benchmark
Kunchang Li, Yali Wang, Yinan He, Yizhuo Li, Yi Wang, Yi Liu, Zun Wang, Jilan Xu, Guo Chen, Ping Luo, et al. Mvbench: A comprehensive multi-modal video understand- ing benchmark. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22195– 22206, 2024. 2
work page 2024
-
[24]
VideoChat-R1: Enhancing Spatio-Temporal Perception via Reinforcement Fine-Tuning
Xinhao Li, Ziang Yan, Desen Meng, Lu Dong, Xiangyu Zeng, Yinan He, Yali Wang, Yu Qiao, Yi Wang, and Limin Wang. Videochat-r1: Enhancing spatio-temporal perception via reinforcement fine-tuning.arXiv preprint arXiv:2504.06958, 2025. 2, 3, 6
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[25]
Im- proving llm video understanding with 16 frames per second
Yixuan Li, Changli Tang, Jimin Zhuang, Yudong Yang, Guangzhi Sun, Wei Li, Zejun Ma, and Chao Zhang. Im- proving llm video understanding with 16 frames per second. arXiv preprint arXiv:2503.13956, 2025. 2
-
[26]
TempCompass: Do Video LLMs Really Understand Videos?
Yuanxin Liu, Shicheng Li, Yi Liu, Yuxiang Wang, Shuhuai Ren, Lei Li, Sishuo Chen, Xu Sun, and Lu Hou. Tempcom- pass: Do video llms really understand videos?arXiv preprint arXiv:2403.00476, 2024. 2
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[27]
Seg-Zero: Reasoning-Chain Guided Segmentation via Cognitive Reinforcement
Yuqi Liu, Bohao Peng, Zhisheng Zhong, Zihao Yue, Fanbin Lu, Bei Yu, and Jiaya Jia. Seg-zero: Reasoning-chain guided segmentation via cognitive reinforcement.arXiv preprint arXiv:2503.06520, 2025. 3
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[28]
Visual- rft: Visual reinforcement fine-tuning
Ziyu Liu, Zeyi Sun, Yuhang Zang, Xiaoyi Dong, Yuhang Cao, Haodong Duan, Dahua Lin, and Jiaqi Wang. Visual- rft: Visual reinforcement fine-tuning. InProceedings of the IEEE international conference on computer vision, 2025. 3
work page 2025
-
[29]
Lmms engine: A simple, unified multimodal framework for pretraining and finetuning., 2025
LMMs-Lab. Lmms engine: A simple, unified multimodal framework for pretraining and finetuning., 2025. 4
work page 2025
-
[30]
Decoupled weight de- cay regularization
Ilya Loshchilov and Frank Hutter. Decoupled weight de- cay regularization. InInternational Conference on Learning Representations, 2019. 4
work page 2019
-
[31]
MM-Eureka: Exploring the Frontiers of Multimodal Reasoning with Rule-based Reinforcement Learning
Fanqing Meng, Lingxiao Du, Zongkai Liu, Zhixiang Zhou, Quanfeng Lu, Daocheng Fu, Tiancheng Han, Botian Shi, Wenhai Wang, Junjun He, et al. Mm-eureka: Exploring the frontiers of multimodal reasoning with rule-based reinforce- ment learning.arXiv preprint arXiv:2503.07365, 2025. 3
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[32]
Multi-agent tool-integrated policy optimization.arXiv preprint arXiv:2510.04678, 2025
Zhanfeng Mo, Xingxuan Li, Yuntao Chen, and Lidong Bing. Multi-agent tool-integrated policy optimization.arXiv preprint arXiv:2510.04678, 2025. 6
-
[33]
Runqi Qiao, Qiuna Tan, Peiqing Yang, Yanzi Wang, Xi- aowan Wang, Enhui Wan, Sitong Zhou, Guanting Dong, Yuchen Zeng, Yida Xu, et al. We-math 2.0: A versatile 10 mathbook system for incentivizing visual mathematical rea- soning.arXiv preprint arXiv:2508.10433, 2025. 3
-
[34]
Timechat: A time-sensitive multimodal large lan- guage model for long video understanding
Shuhuai Ren, Linli Yao, Shicheng Li, Xu Sun, and Lu Hou. Timechat: A time-sensitive multimodal large lan- guage model for long video understanding. InProceedings of the IEEE/CVF Conference on Computer Vision and Pat- tern Recognition, pages 14313–14323, 2024. 2
work page 2024
-
[35]
DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models
Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. Deepseekmath: Pushing the limits of math- ematical reasoning in open language models.arXiv preprint arXiv:2402.03300, 2024. 2, 5, 3
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[36]
VLM-R1: A Stable and Generalizable R1-style Large Vision-Language Model
Haozhan Shen, Peng Liu, Jingcheng Li, Chunxin Fang, Yibo Ma, Jiajia Liao, Qiaoli Shen, Zilun Zhang, Kangjia Zhao, Qianqian Zhang, et al. Vlm-r1: A stable and generaliz- able r1-style large vision-language model.arXiv preprint arXiv:2504.07615, 2025. 3
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[37]
Hybridflow: A flexible and efficient rlhf frame- work
Guangming Sheng, Chi Zhang, Zilingfeng Ye, Xibin Wu, Wang Zhang, Ru Zhang, Yanghua Peng, Haibin Lin, and Chuan Wu. Hybridflow: A flexible and efficient rlhf frame- work. InProceedings of the Twentieth European Conference on Computer Systems, pages 1279–1297, 2025. 4
work page 2025
-
[38]
Moviechat: From dense token to sparse memory for long video understanding
Enxin Song, Wenhao Chai, Guanhong Wang, Yucheng Zhang, Haoyang Zhou, Feiyang Wu, Haozhe Chi, Xun Guo, Tian Ye, Yanting Zhang, et al. Moviechat: From dense token to sparse memory for long video understanding. InProceed- ings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18221–18232, 2024. 2
work page 2024
-
[39]
Pixel Reasoner: Incentivizing Pixel-Space Reasoning with Curiosity-Driven Reinforcement Learning
Alex Su, Haozhe Wang, Weiming Ren, Fangzhen Lin, and Wenhu Chen. Pixel reasoner: Incentivizing pixel-space rea- soning with curiosity-driven reinforcement learning.arXiv preprint arXiv:2505.15966, 2025. 3
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[40]
Haoyuan Sun, Jiaqi Wu, Bo Xia, Yifu Luo, Yifei Zhao, Kai Qin, Xufei Lv, Tiantian Zhang, Yongzhe Chang, and Xue- qian Wang. Reinforcement fine-tuning powers reasoning ca- pability of multimodal large language models.arXiv preprint arXiv:2505.18536, 2025. 6
-
[41]
Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context
Gemini Team, Petko Georgiev, Ving Ian Lei, Ryan Burnell, Libin Bai, Anmol Gulati, Garrett Tanzer, Damien Vincent, Zhufeng Pan, Shibo Wang, et al. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of con- text.arXiv preprint arXiv:2403.05530, 2024. 7, 8
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[42]
Introducing gpt-5.https://openai
OpenAI Team. Introducing gpt-5.https://openai. com/index/introducing-gpt-5/, 2025. 2, 1
work page 2025
-
[43]
Thinking with images.https : / / openai
OpenAI Team. Thinking with images.https : / / openai . com / index / thinking - with - images/,
-
[44]
Qwen3-vl: Sharper vision, deeper thought, broader action.https : / / qwen
Qwen Team. Qwen3-vl: Sharper vision, deeper thought, broader action.https : / / qwen . ai / blog ? from = research . latest - advancements - list & id = 99f0335c4ad9ff6153e517418d48535ab6d8afef,
-
[45]
Shulin Tian, Ruiqi Wang, Hongming Guo, Penghao Wu, Yuhao Dong, Xiuying Wang, Jingkang Yang, Hao Zhang, Hongyuan Zhu, and Ziwei Liu. Ego-r1: Chain-of-tool- thought for ultra-long egocentric video reasoning.arXiv preprint arXiv:2506.13654, 2025. 3
-
[46]
Qi Wang, Yanrui Yu, Ye Yuan, Rui Mao, and Tianfei Zhou. Videorft: Incentivizing video reasoning capability in mllms via reinforced fine-tuning.arXiv preprint arXiv:2505.12434,
-
[47]
Shijian Wang, Jiarui Jin, Xingjian Wang, Linxin Song, Run- hao Fu, Hecheng Wang, Zongyuan Ge, Yuan Lu, and Xuelian Cheng. Video-thinker: Sparking” thinking with videos” via reinforcement learning.arXiv preprint arXiv:2510.23473,
-
[48]
LVBench: An Extreme Long Video Understanding Benchmark
Weihan Wang, Zehai He, Wenyi Hong, Yean Cheng, Xiao- han Zhang, Ji Qi, Xiaotao Gu, Shiyu Huang, Bin Xu, Yuxiao Dong, et al. Lvbench: An extreme long video understanding benchmark.arXiv preprint arXiv:2406.08035, 2024. 2, 3, 7, 8, 5
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[49]
Time-R1: Post-Training Large Vision Language Model for Temporal Video Grounding
Ye Wang, Ziheng Wang, Boshen Xu, Yang Du, Kejun Lin, Zihan Xiao, Zihao Yue, Jianzhong Ju, Liang Zhang, Dingyi Yang, et al. Time-r1: Post-training large vision lan- guage model for temporal video grounding.arXiv preprint arXiv:2503.13377, 2025. 2, 3, 5
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[50]
Cheng Wen, Tingwei Guo, Shuaijiang Zhao, Wei Zou, and Xiangang Li. Sari: Structured audio reasoning via curriculum-guided reinforcement learning.arXiv preprint arXiv:2504.15900, 2025. 3
-
[51]
Longvideobench: A benchmark for long-context interleaved video-language understanding
Haoning Wu, Dongxu Li, Bei Chen, and Junnan Li. Longvideobench: A benchmark for long-context interleaved video-language understanding. InAdvances in Neural Infor- mation Processing Systems, pages 28828–28857, 2024. 2, 3
work page 2024
-
[52]
Reinforcing Spatial Reasoning in Vision-Language Models with Interwoven Thinking and Visual Drawing
Junfei Wu, Jian Guan, Kaituo Feng, Qiang Liu, Shu Wu, Liang Wang, Wei Wu, and Tieniu Tan. Reinforcing spatial reasoning in vision-language models with interwoven think- ing and visual drawing.arXiv preprint arXiv:2506.09965,
work page internal anchor Pith review Pith/arXiv arXiv
-
[53]
Llava-cot: Let vision language models reason step-by-step
Guowei Xu, Peng Jin, Ziang Wu, Hao Li, Yibing Song, Lichao Sun, and Li Yuan. Llava-cot: Let vision language models reason step-by-step. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 2087– 2098, 2025. 3
work page 2087
-
[54]
Vidchapters-7m: Video chapters at scale
Antoine Yang, Arsha Nagrani, Ivan Laptev, Josef Sivic, and Cordelia Schmid. Vidchapters-7m: Video chapters at scale. Advances in Neural Information Processing Systems, 36: 49428–49444, 2023. 2
work page 2023
-
[55]
An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report.arXiv preprint arXiv:2505.09388, 2025. 6, 5
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[56]
Zhongyu Yang, Junhao Song, Siyang Song, Wei Pang, and Yingfang Yuan. Mermaid: Multi-perspective self-reflective agents with generative augmentation for emotion recogni- tion. InProceedings of the 2025 Conference on Empiri- cal Methods in Natural Language Processing, pages 24650– 24666, 2025. 3
work page 2025
-
[57]
Timeexpert: An expert-guided video llm for video temporal grounding
Zuhao Yang, Yingchen Yu, Yunqing Zhao, Shijian Lu, and Song Bai. Timeexpert: An expert-guided video llm for video temporal grounding. InProceedings of the IEEE/CVF In- ternational Conference on Computer Vision, pages 24286– 24296, 2025. 2 11
work page 2025
-
[58]
VideoLLaMA 3: Frontier Multimodal Foundation Models for Image and Video Understanding
Boqiang Zhang, Kehan Li, Zesen Cheng, Zhiqiang Hu, Yuqian Yuan, Guanzheng Chen, Sicong Leng, Yuming Jiang, Hang Zhang, Xin Li, et al. Videollama 3: Frontier multi- modal foundation models for image and video understand- ing.arXiv preprint arXiv:2501.13106, 2025. 2, 3
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[59]
Thinking with videos: Multimodal tool- augmented reinforcement learning for long video reasoning
Haoji Zhang, Xin Gu, Jiawen Li, Chixiang Ma, Sule Bai, Chubin Zhang, Bowen Zhang, Zhichao Zhou, Dongliang He, and Yansong Tang. Thinking with videos: Multimodal tool- augmented reinforcement learning for long video reasoning. arXiv preprint arXiv:2508.04416, 2025. 2, 3, 7
-
[60]
Lmms-eval: Re- ality check on the evaluation of large multimodal models
Kaichen Zhang, Bo Li, Peiyuan Zhang, Fanyi Pu, Joshua Adrian Cahyono, Kairui Hu, Shuai Liu, Yuanhan Zhang, Jingkang Yang, Chunyuan Li, et al. Lmms-eval: Re- ality check on the evaluation of large multimodal models. In Findings of the Association for Computational Linguistics: NAACL 2025, pages 881–916, 2025. 7, 5
work page 2025
-
[61]
Kaichen Zhang, Keming Wu, Zuhao Yang, Kairui Hu, Bin Wang, Ziwei Liu, Xingxuan Li, and Lidong Bing. Open- mmreasoner: Pushing the frontiers for multimodal rea- soning with an open and general recipe.arXiv preprint arXiv:2511.16334, 2025. 3
-
[62]
Sglang: Efficient execution of structured language model programs
Lianmin Zheng, Liangsheng Yin, Zhiqiang Xie, Chuyue Livia Sun, Jeff Huang, Cody Hao Yu, Shiyi Cao, Christos Kozyrakis, Ion Stoica, Joseph E Gonzalez, et al. Sglang: Efficient execution of structured language model programs. InAdvances in neural information processing systems, pages 62557–62583, 2024. 4
work page 2024
-
[63]
DeepEyes: Incentivizing "Thinking with Images" via Reinforcement Learning
Ziwei Zheng, Michael Yang, Jack Hong, Chenxiao Zhao, Guohai Xu, Le Yang, Chao Shen, and Xing Yu. Deep- eyes: Incentivizing” thinking with images” via reinforce- ment learning.arXiv preprint arXiv:2505.14362, 2025. 3
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[64]
Hao Zhong, Muzhi Zhu, Zongze Du, Zheng Huang, Canyu Zhao, Mingyu Liu, Wen Wang, Hao Chen, and Chunhua Shen. Omni-r1: Reinforcement learning for omnimodal reasoning via two-system collaboration.arXiv preprint arXiv:2505.20256, 2025. 3
-
[65]
Luowei Zhou, Chenliang Xu, and Jason Corso. Towards automatic learning of procedures from web instructional videos. InProceedings of the AAAI conference on artificial intelligence, 2018. 2 12 LongVT: Incentivizing “Thinking with Long Videos” via Native Tool Calling Supplementary Material Outline This Supplementary Material complements the main paper, prov...
work page 2018
-
[66]
LongVT Performs Human-Aligned Think- ing like Leading Proprietary LMMs The core philosophy of our proposed interleaved Multi- modal Chain-of-Tool-Thought (iMCoTT) entails a “global- to-local” thinking pattern: the model first performs a coarse skim to formulate a hypothesis, and subsequently invokes the nativecrop video()tool to inspect specific tempo- ra...
-
[67]
What Motivates VideoSIAH? Unveiling the Data Contamination in Qwen-VL Series With the rapid advancements of LMMs, model performance on various benchmarks has steadily improved. However, the “black-box” nature of training data raises a critical ques- tion:Do these improvements reflect genuine reasoning ca- pability, or are they partly due to the model memo...
-
[68]
Additional VideoSIAH Details Source Purpose Samples LLaV A-CoT [53] General Visual Reasoning 54,591 OpenVLThinker [6] Complex Reasoning 2,829 We-Math 2.0 [33] Mathematical Reasoning 602 Table 5.Detailed Statistics of Image-based CoT Data for Cold- Start SFT. Breakdown of Image-based CoT Data.As detailed in Table 5, we construct a diverse mixture of image-...
-
[69]
For a sequence of to- kensx= (x 1, x2,
Additional Methodological Details Next-Token Prediction.During SFT, we train our model by minimizing the negative log-likelihood of the target to- kens given their preceding context. For a sequence of to- kensx= (x 1, x2, . . . , xT )and a model parameterized byθ that defines conditional probabilitiesp θ(xt |x <t), the loss function is defined as L(θ) =− ...
-
[70]
Reflection Trajectory: From Verbose Self- Correction to Internalized Tool Usage We visualize the evolution of the model’s internal thought process in Figure 7 (left). Echoing the training dynam- ics observed in DeepEyes [63], the trajectory of reflection token proportion discloses a distinct three-phase evolution from exploratory correction to efficient t...
-
[71]
of Training Steps 3000 160 1600 No
Additional Implementation Details Component SFT RL RFT Optimizer AdamW [30] AdamW AdamW Learning Rate (LR) 5e-5 1e-6 5e-5 LR Scheduler cosine constant cosine Weight Decay 0.0 1e-2 0.0 No. of Training Steps 3000 160 1600 No. of Warmup Steps 300 0 160 Max Length 51200 52384 51200 Dynamic Batch Size True False True Remove Padding True True True Liger Kernel ...
-
[72]
framework. To optimize training throughput and mini- mize memory overhead, we employ an online stream pack- ing strategy on iterable datasets. Specifically, instead of padding individual sequences, we concatenate input sam- ples to fill a fixed buffer size of 51,200 tokens, thereby elim- inating redundant computation on padding tokens. Incom- ing data is ...
-
[73]
Inference Efficiency Analysis Efficiency Analysis.We present a comparative analysis of inference latency across four benchmarks in Table 7. De- spite incorporating multi-turn tool interactions, LongVT- 7B-RFT demonstrates remarkable efficiency, achieving the lowest latency on VideoMMMU (1329.8 seconds) and LVBench (1509.3 seconds), and maintaining highly ...
-
[74]
Examples Prompts and Data Examples.To enhance reproducibil- ity and transparency, we provide concrete examples of the key resources used in our experiments. Figure 8 shows the RL prompt template, while Figure 9 presents the evaluation prompts used in LLM-as-a-Judge [55] for measuring an- swer’s accuracy during RL. One representative sample from both SFT a...
-
[75]
Failure Case Analysis To further illustrate the instability of the RL-only variant discussed in Section 5.3 of the main paper, we present a rep- resentative failure case. As shown in Figure 14, the model correctly recognizes the need to invoke a tool to inspect the glass coffee table. However, after receiving the resampled video frames, it fails to integr...
-
[76]
Limitation and Future Direction While our efficiency analysis in Section 13 confirms that multi-turn tool interactions do not impose significant la- tency penalties, the memory footprint of such recursive rea- soning remains a bottleneck. The single-agent architecture of LongVT is constrained by the inherent context window of the underlying LMM: as the nu...
-
[77]
Broader Impact LongVT advances the field of long-video understanding by introducing an agentic framework capable of proactive ev- idence seeking and self-correction. By enabling LMMs to dynamically inspect and re-examine video segments, this work addresses critical reliability issues—such as hallu- cinations and temporal misalignment that hinder the de- p...
-
[78]
Ethical Considerations Advancing Reliability and Safety.LongVT is explicitly designed to enhance the reliability of video LMMs by mit- igating hallucinations through on-demand visual verifica- tion. By grounding answers in retrieved video evidence, the system reduces the likelihood of fabricating events or misinterpreting context, thereby fostering more t...
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.