pith. sign in

arxiv: 2511.20785 · v3 · pith:VXOB4GETnew · submitted 2025-11-25 · 💻 cs.CV

LongVT: Incentivizing "Thinking with Long Videos" via Native Tool Calling

Pith reviewed 2026-05-22 12:22 UTC · model grok-4.3

classification 💻 cs.CV
keywords long video understandingmultimodal chain-of-thoughtagentic reasoningvideo cropping tooltemporal groundinglarge multimodal modelsreinforcement learning
0
0 comments X

The pith

Large multimodal models can improve long-video reasoning by using their built-in temporal grounding to crop and resample key clips as a native tool.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that large multimodal models often hallucinate on long videos because relevant evidence is sparse and spread out over time. It introduces LongVT as an agentic system that lets the model first review the full video and then use its own sense of timing to select and zoom into specific clips by resampling finer frames. This global-to-local loop repeats until the answer rests on actual retrieved visual evidence rather than guesswork. To support the approach the authors create the VideoSIAH data suite with hundreds of thousands of training examples across three stages and a 1280-pair evaluation benchmark. They train the model in three stages that include supervised fine-tuning and reinforcement learning, then show consistent gains over strong baselines on four long-video benchmarks.

Core claim

By exploiting LMMs' inherent temporal grounding ability as a native video cropping tool to zoom in on specific clips and resample finer-grained frames within an interleaved Multimodal Chain-of-Tool-Thought, and training via a three-stage strategy on the VideoSIAH dataset, LongVT enables thinking with long videos and outperforms existing strong baselines across four challenging long-video understanding and reasoning benchmarks.

What carries the argument

The native video cropping tool that uses the model's own temporal grounding to select relevant clips and resample them at higher frame rates inside the global-to-local reasoning loop.

If this is right

  • The model learns to decide when to stop the reasoning loop once evidence is sufficient rather than continuing indefinitely.
  • Training data that mixes tool-use examples with reinforcement learning helps the model avoid over-cropping or missing distant evidence.
  • Releasing the VideoSIAH training and evaluation sets lets others test similar native-tool approaches on their own models.
  • The same global-to-local pattern can be applied to other tasks where evidence must be located across long sequences.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the native cropping works without extra modules, similar built-in grounding could be used for audio tracks or multi-camera setups without custom tool training.
  • Extending the loop to include multiple rounds of cropping on the same clip might help when initial selections still lack detail.
  • Measuring how often the model chooses to crop versus answering directly would show whether the tool is used efficiently.

Load-bearing premise

The model's existing temporal grounding skill can be used directly to crop videos accurately without adding new errors or requiring separate training for the cropping action itself.

What would settle it

Run the model on a long video containing a clearly defined event at a known timestamp and check whether the clips it chooses to crop and inspect actually contain that event.

Figures

Figures reproduced from arXiv: 2511.20785 by Bo Li, Chengwei Qin, Kaichen Zhang, Keming Wu, Lidong Bing, Shijian Lu, Sicong Leng, Sudong Wang, Xingxuan Li, Yifan Zhang, Zuhao Yang.

Figure 1
Figure 1. Figure 1: Interleaved Multimodal Chain-of-Tool-Thought (iMCoTT). Compared to prior text-based Chain-of-Thought (CoT) reasoning, iMCoTT in our proposed LongVT can natively perform self-reflection via calling crop video(start time, end time) tool. It proposes a time window after a global preview, proactively fetches the corresponding short clip, rethinks based on the new evidence, and determines whether to refine or a… view at source ↗
Figure 2
Figure 2. Figure 2: Data Pipeline of VideoSIAH. We construct a semi-automatic data pipeline that integrates several state-of-the-art LMMs [1, 5, 12, 43] to sequentially perform long video segmentation, video clip captioning, segment-in-a-haystack QA generation, cross-modal QA filtering, and iMCoTT generation. Icons with human silhouettes denote human-in-the-loop validation, where annotators inspect a small set of representati… view at source ↗
Figure 3
Figure 3. Figure 3: Ablations on Reward Design. The left panel shows training dynamics under different accuracy and time rewards, and the right panel shows the effect of tool-call reward on tool usage. Answer Accuracy. Let K be the number of sampled roll￾outs in a group. For the k-th rollout (k ∈ {1, . . . , K}), let aˆ (k) denote its generated answer and let a ⋆ denote the ground-truth answer. We employ LLM-as-a-Judge [55] t… view at source ↗
Figure 4
Figure 4. Figure 4: Overall Framework of LongVT. Our approach processes long-form videos in a human-like two-stage manner. Specifically, LongVT is augmented with interleaved Multimodal Chain-of-Tool-Thought (iMCoTT): first performs a global skim over sampled video frames to form a coarse hypothesis about when evidence likely occurs; then invokes a native video tool crop video(start time, end time) to resample finer-grained fr… view at source ↗
Figure 5
Figure 5. Figure 5: Comparison of Watching Strategies Proposed by Gemini 2.5 Pro [5] and GPT-5 Thinking [42]. Best viewed when zoomed in. Setting VideoMME [9] VideoMMMU [13] VideoSIAH-Eval w/o subtitle adaptation∗ comprehension perception test Qwen2.5-VL-7B-Instruct [1] Original 64.3 35.7 44.3 56.7 33.8 No Visual 40.1 27.0 38.3 39.3 12.7 Rearranged Choices 56.0 31.6 40.3 67.0 - Qwen3-VL-8B-Instruct [44] Original 69.3 40.7 60.… view at source ↗
Figure 6
Figure 6. Figure 6: Category Distribution of VideoSIAH-Eval. We present the distribution of video types (a) and question types (b), highlighting the diversity of our proposed benchmark. 0 25 50 75 100 125 150 175 Training Step 0.007 0.008 0.009 0.010 0.011 0.012 0.013 0.014 0.015 Reflection Word Proportion Reflection Words Over Training Reflection Word Proportion (Raw) Reflection Word Proportion (Smoothed) Word Cloud from Las… view at source ↗
Figure 7
Figure 7. Figure 7: Trend of Reflection-Related Words and the Corresponding Word Cloud across All Rollouts. 12. Additional Implementation Details Component SFT RL RFT Optimizer AdamW [30] AdamW AdamW Learning Rate (LR) 5e-5 1e-6 5e-5 LR Scheduler cosine constant cosine Weight Decay 0.0 1e-2 0.0 No. of Training Steps 3000 160 1600 No. of Warmup Steps 300 0 160 Max Length 51200 52384 51200 Dynamic Batch Size True False True Rem… view at source ↗
Figure 8
Figure 8. Figure 8: Prompt Template Utilized for RL. This template outlines the structural guidelines and system instructions provided to the model during the RL training phase. Below are two answers to a question. Question is [Question], [Standard Answer] is the standard answer to the question, and [Model_answer] is the answer extracted from a model's output to this question. Judge how consistent the two answers are. Scoring… view at source ↗
Figure 9
Figure 9. Figure 9: Evaluation Prompt for LLM-as-a-Judge. We present the full system instruction used to query the judge model. This prompt defines the scoring criteria and guidelines to ensure consistent evaluation of the model’s generated responses. 7 [PITH_FULL_IMAGE:figures/full_fig_p019_9.png] view at source ↗
Figure 10
Figure 10. Figure 10: Representative Data Example for SFT and RFT. The example illustrates the input format and the corresponding ground-truth response used to train the model across both fine-tuning stages. 8 [PITH_FULL_IMAGE:figures/full_fig_p020_10.png] view at source ↗
Figure 11
Figure 11. Figure 11: An Example of Single-turn Inference with Self-Correction. The model initially misidentifies the basin color as pink. However, through the reasoning process (highlighted in the “Thinking” block), it explicitly decides to double-check the frames, corrects the hallucinations, and outputs the correct answer (Blue). 9 [PITH_FULL_IMAGE:figures/full_fig_p021_11.png] view at source ↗
Figure 12
Figure 12. Figure 12: An Example of Multi-step Inference Involving Tool Interaction. In this complex query, the model initially crops an incorrect time window (297s-305s) which lacks the target visual information. Recognizing this error during the reasoning phase, it refines the parameters and calls the tool again with the correct window (344s-372s) to successfully identify the US flag. 10 [PITH_FULL_IMAGE:figures/full_fig_p0… view at source ↗
Figure 13
Figure 13. Figure 13: Qualitative Comparison between Textual CoT and Our Designed iMCoTT. The baseline textual CoT (left) relies on hallucinated memory, confidently providing an incorrect answer regarding the cars’ colors (“Black and Yellow”). In contrast, our model (right) actively engages with the video content via tool usage. Despite an initial mis-localization (90s-120s), the model explicitly detects the absence of the tar… view at source ↗
Figure 14
Figure 14. Figure 14: Failure Case of the RL-only Variant. This example demonstrates the model’s inability to maintain the logical flow after a tool interaction without prior SFT. Although the model initiates a tool call to inspect the blurred region, it fails to utilize the returned observation to answer the user’s question. Instead, it loses the conversational context and hallucinates a repetition of the general video descri… view at source ↗
read the original abstract

Large multimodal models (LMMs) have shown great potential for video reasoning with textual Chain-of-Thought. However, they remain vulnerable to hallucinations, especially when processing long-form videos where evidence is sparse and temporally dispersed. Inspired by how humans comprehend long videos - by first skimming globally and then examining relevant clips for details - we introduce LongVT, an end-to-end agentic framework that enables "Thinking with Long Videos" via interleaved Multimodal Chain-of-Tool-Thought. Specifically, we exploit LMMs' inherent temporal grounding ability as a native video cropping tool to zoom in on a specific video clip and resample finer-grained video frames. This global-to-local reasoning loop continues until answers are grounded in retrieved visual evidence. Given the scarcity of fine-grained question-answering (QA) data for the long video reasoning task, we curate and will release a data suite named VideoSIAH to facilitate both training and evaluation. Specifically, our training dataset consists of 247.9K samples for tool-integrated cold-start supervised fine-tuning, 1.6K samples for agentic reinforcement learning, and 15.4K samples for agentic reinforcement fine-tuning, respectively. Our evaluation benchmark consists of 1,280 QA pairs that are carefully curated through a semi-automatic data pipeline with human-in-the-loop validation. With a meticulously designed three-stage training strategy and extensive empirical validation, LongVT consistently outperforms existing strong baselines across four challenging long-video understanding and reasoning benchmarks. Our codes, data, and model checkpoints are publicly available at https://github.com/EvolvingLMMs-Lab/LongVT .

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces LongVT, an end-to-end agentic framework for long-video reasoning that interleaves multimodal chain-of-tool-thought with native video cropping. It exploits LMMs' pre-trained temporal grounding to select and resample relevant clips without separate tool training, using a three-stage pipeline (247.9K-sample cold-start SFT, 1.6K-sample agentic RL, 15.4K-sample RL fine-tuning) on the newly curated VideoSIAH dataset. Evaluation on a 1,280-pair held-out benchmark shows consistent gains over strong baselines on four long-video understanding and reasoning tasks, with public code, data, and checkpoints.

Significance. If the reported gains prove robust and attributable to the native-tool mechanism, the work offers a practical route to scalable long-video reasoning that avoids the overhead of training dedicated localization modules. The public release of training data, evaluation set, and model weights is a clear strength for reproducibility and follow-on research.

major comments (2)
  1. [§4] §4 (Experiments) and associated tables: the central outperformance claim is presented without any diagnostic metrics on the reliability of the native temporal-grounding tool (e.g., precision/recall or temporal IoU of predicted start/end times against ground-truth relevant segments). Because the method explicitly relies on this unmeasured component to produce accurate crops, the absence of such analysis leaves open the possibility that gains derive primarily from data curation or the RL stages rather than the claimed native-tool loop.
  2. [§3.2] §3.2 (Tool-integrated cold-start SFT) and §3.3 (Agentic RL): no ablation isolates the contribution of the cropping tool itself. Removing or replacing the native grounding step with oracle crops or a separately trained localizer would directly test whether the reported improvements require the specific assumption that LMMs already possess reliable, error-free temporal grounding.
minor comments (2)
  1. [Abstract] The abstract and §1 state that LongVT 'consistently outperforms existing strong baselines' yet provide no numerical deltas, baseline names, or error bars; these should be added for immediate readability.
  2. [§3] Notation for the interleaved reasoning loop (e.g., how tool calls are formatted and how frame resampling is performed) is described only at a high level; a concrete example or pseudocode would improve clarity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below and commit to revisions that directly incorporate the suggested analyses to strengthen the evidence for the native tool-calling mechanism.

read point-by-point responses
  1. Referee: [§4] §4 (Experiments) and associated tables: the central outperformance claim is presented without any diagnostic metrics on the reliability of the native temporal-grounding tool (e.g., precision/recall or temporal IoU of predicted start/end times against ground-truth relevant segments). Because the method explicitly relies on this unmeasured component to produce accurate crops, the absence of such analysis leaves open the possibility that gains derive primarily from data curation or the RL stages rather than the claimed native-tool loop.

    Authors: We agree that direct diagnostic metrics on the native temporal-grounding tool would provide stronger support for attributing gains to the tool-calling loop rather than data or RL alone. In the revised manuscript we will add a dedicated analysis subsection in §4 reporting precision, recall, and temporal IoU of the model’s predicted start/end times against the ground-truth relevant segments available in the VideoSIAH evaluation set. These metrics will be computed on the 1,280-pair held-out benchmark and included alongside the existing task-performance tables. revision: yes

  2. Referee: [§3.2] §3.2 (Tool-integrated cold-start SFT) and §3.3 (Agentic RL): no ablation isolates the contribution of the cropping tool itself. Removing or replacing the native grounding step with oracle crops or a separately trained localizer would directly test whether the reported improvements require the specific assumption that LMMs already possess reliable, error-free temporal grounding.

    Authors: We concur that an ablation isolating the native cropping tool is necessary to validate the core claim. In the revision we will add experiments that replace the native grounding step with (i) oracle ground-truth crops and (ii) a separately trained temporal localizer baseline. The resulting performance deltas on the four long-video benchmarks will be reported in §4 (new rows in Table 2 or an additional ablation table). These comparisons will quantify how much the integrated native-tool loop contributes beyond data curation and the RL stages, while acknowledging that the LMM is not assumed to be error-free—the RL phases are explicitly designed to incentivize robust tool use despite occasional grounding inaccuracies. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical framework with external benchmarks and public artifacts

full rationale

The paper describes an agentic LMM framework trained in three stages on curated VideoSIAH data (247.9K SFT + RL samples) and evaluated on four benchmarks with 1,280 QA pairs. No equations, derivations, or fitted parameters are presented that reduce by construction to the inputs. The core assumption of inherent temporal grounding is invoked as a starting point for tool use and then validated through end-to-end performance gains rather than being defined in terms of the outputs. Public code, data, and checkpoints allow independent reproduction against external benchmarks, confirming the derivation chain is self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The framework rests on the assumption that current LMMs already contain usable temporal grounding that can be turned into a reliable cropping tool; no new physical entities or mathematical axioms are introduced beyond standard supervised and reinforcement learning setups.

axioms (1)
  • domain assumption LMMs possess inherent temporal grounding ability usable as a native cropping tool
    Invoked in the description of the global-to-local reasoning loop.

pith-pipeline@v0.9.0 · 5861 in / 1278 out tokens · 31124 ms · 2026-05-22T12:22:16.466338+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 9 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. ParaVT: Taming the Tool Prior Paradox for Parallel Tool Use in Agentic Video Reinforcement Learning

    cs.CV 2026-05 unverdicted novelty 7.0

    ParaVT is a parallel video tool-calling RL framework that resolves the Tool Prior Paradox via PARA-GRPO, delivering +7.9% average gains on six long-video benchmarks and raising format compliance from 0.13 to 0.64.

  2. VISD: Enhancing Video Reasoning via Structured Self-Distillation

    cs.CV 2026-05 unverdicted novelty 7.0

    VISD improves VideoLLM reasoning performance and training efficiency by combining structured multi-dimensional self-distillation feedback with RL via direction-magnitude decoupling, curriculum scheduling, and EMA stab...

  3. Listening with Time: Precise Temporal Awareness for Long-Form Audio Understanding

    eess.AS 2026-04 unverdicted novelty 7.0

    LAT-Audio introduces a global-to-local reasoning approach with TWA-CoT that outperforms prior models on temporal tasks for audio up to 30 minutes.

  4. SVAgent: Storyline-Guided Long Video Understanding via Cross-Modal Multi-Agent Collaboration

    cs.CV 2026-04 unverdicted novelty 7.0

    SVAgent improves long video question answering by constructing storylines via multi-agent collaboration and aligning cross-modal predictions for more robust, human-like reasoning.

  5. ParaVT: Taming the Tool Prior Paradox for Parallel Tool Use in Agentic Video Reinforcement Learning

    cs.CV 2026-05 unverdicted novelty 6.0

    ParaVT introduces the first multi-agent RL framework for parallel video tool calling in LMMs, using PARA-GRPO to resolve the Tool Prior Paradox and achieve +7.9% average improvement over Qwen3-VL baseline across six b...

  6. VideoSeeker: Incentivizing Instance-level Video Understanding via Native Agentic Tool Invocation

    cs.CV 2026-05 unverdicted novelty 6.0

    VideoSeeker integrates agentic reasoning and visual prompts into LVLMs via automated data synthesis, cold-start supervision, and RL training, yielding +13.7% gains on instance-level video tasks over baselines includin...

  7. Training Long-Context Vision-Language Models Effectively with Generalization Beyond 128K Context

    cs.CV 2026-05 unverdicted novelty 6.0

    Continued pre-training with balanced long-document VQA data extends a 7B LVLM to 128K context, improving long-document VQA by 7.1% and generalizing to 512K without further training.

  8. VISD: Enhancing Video Reasoning via Structured Self-Distillation

    cs.CV 2026-05 unverdicted novelty 5.0

    VISD adds structured privileged feedback from a judge model and a direction-magnitude decoupling trick to let VideoLLMs learn token-level credit assignment while keeping RL stable, yielding higher accuracy and roughly...

  9. VISD: Enhancing Video Reasoning via Structured Self-Distillation

    cs.CV 2026-05 unverdicted novelty 5.0

    VISD improves VideoLLM reasoning by adding multi-dimensional diagnostic self-distillation and RL decoupling, yielding higher accuracy, better grounding, and nearly 2x faster training convergence.

Reference graph

Works this paper leans on

78 extracted references · 78 canonical work pages · cited by 6 Pith papers · 24 internal anchors

  1. [1]

    Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2. 5-vl technical report.arXiv preprint arXiv:2502.13923, 2025. 3, 4, 5, 7, 8, 1, 2

  2. [2]

    Temporalbench: Benchmarking fine- grained temporal understanding for multimodal video mod- els.arXiv preprint arXiv:2410.10818, 2024

    Mu Cai, Reuben Tan, Jianrui Zhang, Bocheng Zou, Kai Zhang, Feng Yao, Fangrui Zhu, Jing Gu, Yiwu Zhong, Yuzhang Shang, et al. Temporalbench: Benchmarking fine- grained temporal understanding for multimodal video mod- els.arXiv preprint arXiv:2410.10818, 2024. 2

  3. [3]

    Eliciting good teach- ing from humans for machine learners.Artificial Intelli- gence, 217:198–215, 2014

    Maya Cakmak and Andrea L Thomaz. Eliciting good teach- ing from humans for machine learners.Artificial Intelli- gence, 217:198–215, 2014. 2

  4. [4]

    Scaling rl to long videos.arXiv preprint arXiv:2507.07966, 2025

    Yukang Chen, Wei Huang, Baifeng Shi, Qinghao Hu, Han- rong Ye, Ligeng Zhu, Zhijian Liu, Pavlo Molchanov, Jan Kautz, Xiaojuan Qi, et al. Scaling rl to long videos.arXiv preprint arXiv:2507.07966, 2025. 3, 4, 5

  5. [5]

    Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

    Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blis- tein, Ori Ram, Dan Zhang, Evan Rosen, et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities.arXiv preprint arXiv:2507.06261, 2025. 2, 4, 5, 1

  6. [6]

    OpenVLThinker: Complex Vision-Language Reasoning via Iterative SFT-RL Cycles

    Yihe Deng, Hritik Bansal, Fan Yin, Nanyun Peng, Wei Wang, and Kai-Wei Chang. Openvlthinker: An early ex- ploration to complex vision-language reasoning via iterative self-improvement.arXiv preprint arXiv:2503.17352, 2025. 3

  7. [7]

    Grit: Teaching mllms to think with images

    Yue Fan, Xuehai He, Diji Yang, Kaizhi Zheng, Ching-Chen Kuo, Yuting Zheng, Sravana Jyothi Narayanaraju, Xinze Guan, and Xin Eric Wang. Grit: Teaching mllms to think with images. InAdvances in Neural Information Processing Systems, 2025. 3 9

  8. [8]

    Video-R1: Reinforcing Video Reasoning in MLLMs

    Kaituo Feng, Kaixiong Gong, Bohao Li, Zonghao Guo, Yibing Wang, Tianshuo Peng, Junfei Wu, Xiaoying Zhang, Benyou Wang, and Xiangyu Yue. Video-r1: Reinforcing video reasoning in mllms.arXiv preprint arXiv:2503.21776,

  9. [9]

    Video-mme: The first-ever comprehensive evaluation benchmark of multi-modal llms in video analysis

    Chaoyou Fu, Yuhan Dai, Yongdong Luo, Lei Li, Shuhuai Ren, Renrui Zhang, Zihan Wang, Chenyu Zhou, Yunhang Shen, Mengdan Zhang, et al. Video-mme: The first-ever comprehensive evaluation benchmark of multi-modal llms in video analysis. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 24108–24118, 2025. 2, 3, 7, 8, 5

  10. [10]

    Tall: Temporal activity localization via language query

    Jiyang Gao, Chen Sun, Zhenheng Yang, and Ram Nevatia. Tall: Temporal activity localization via language query. In Proceedings of the IEEE international conference on com- puter vision, pages 5267–5275, 2017. 2, 8, 9

  11. [11]

    DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

    Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning.arXiv preprint arXiv:2501.12948, 2025. 2, 3, 9

  12. [12]

    GLM-4.5V and GLM-4.1V-Thinking: Towards Versatile Multimodal Reasoning with Scalable Reinforcement Learning

    Wenyi Hong, Wenmeng Yu, Xiaotao Gu, Guo Wang, Guob- ing Gan, Haomiao Tang, Jiale Cheng, Ji Qi, Junhui Ji, Li- hang Pan, et al. Glm-4.5v and glm-4.1v-thinking: Towards versatile multimodal reasoning with scalable reinforcement learning.arXiv preprint arXiv:2507.01006, 2025. 3, 4

  13. [13]

    Video-MMMU: Evaluating Knowledge Acquisition from Multi-Discipline Professional Videos

    Kairui Hu, Penghao Wu, Fanyi Pu, Wang Xiao, Yuanhan Zhang, Xiang Yue, Bo Li, and Ziwei Liu. Video-mmmu: Evaluating knowledge acquisition from multi-discipline pro- fessional videos.arXiv preprint arXiv:2501.13826, 2025. 2, 7, 8, 5

  14. [14]

    Multimodal 2gpretraining for dense video cap- tioning.arXiv preprint arXiv:2011.11760, 2020

    Gabriel Huang, Bo Pang, Zhenhai Zhu, Clara Rivera, and Radu Soricut. Multimodal 2gpretraining for dense video cap- tioning.arXiv preprint arXiv:2011.11760, 2020. 2

  15. [15]

    Vision-R1: Incentivizing Reasoning Capability in Multimodal Large Language Models

    Wenxuan Huang, Bohan Jia, Zijie Zhai, Shaosheng Cao, Zheyu Ye, Fei Zhao, Zhe Xu, Yao Hu, and Shaohui Lin. Vision-r1: Incentivizing reasoning capability in multimodal large language models.arXiv preprint arXiv:2503.06749,

  16. [16]

    GPT-4o System Card

    Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perel- man, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Weli- hinda, Alan Hayes, Alec Radford, et al. Gpt-4o system card. arXiv preprint arXiv:2410.21276, 2024. 7, 8

  17. [17]

    OpenAI o1 System Card

    Aaron Jaech, Adam Kalai, Adam Lerer, Adam Richard- son, Ahmed El-Kishky, Aiden Low, Alec Helyar, Aleksander Madry, Alex Beutel, Alex Carney, et al. Openai o1 system card.arXiv preprint arXiv:2412.16720, 2024. 3

  18. [18]

    Dense-captioning events in videos

    Ranjay Krishna, Kenji Hata, Frederic Ren, Li Fei-Fei, and Juan Carlos Niebles. Dense-captioning events in videos. In Proceedings of the IEEE international conference on com- puter vision, pages 706–715, 2017. 2

  19. [19]

    Efficient memory management for large language model serving with pagedattention

    Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with pagedattention. InPro- ceedings of the 29th symposium on operating systems prin- ciples, pages 611–626, 2023. 5

  20. [20]

    Mmr1: Enhancing multimodal reasoning with variance-aware sampling and open resources.arXiv preprint arXiv:2509.21268, 2025

    Sicong Leng, Jing Wang, Jiaxi Li, Hao Zhang, Zhiqiang Hu, Boqiang Zhang, Yuming Jiang, Hang Zhang, Xin Li, Lidong Bing, et al. Mmr1: Enhancing multimodal reasoning with variance-aware sampling and open resources.arXiv preprint arXiv:2509.21268, 2025. 3

  21. [21]

    Reinforcement learning outperforms supervised fine-tuning: A case study on audio question an- swering.arXiv preprint arXiv:2503.11197, 2025

    Gang Li, Jizhong Liu, Heinrich Dinkel, Yadong Niu, Junbo Zhang, and Jian Luan. Reinforcement learning outperforms supervised fine-tuning: A case study on audio question an- swering.arXiv preprint arXiv:2503.11197, 2025. 3

  22. [22]

    Getting more juice out of the sft data: Reward learning from human demonstration im- proves sft for llm alignment

    Jiaxiang Li, Siliang Zeng, Hoi-To Wai, Chenliang Li, Al- fredo Garcia, and Mingyi Hong. Getting more juice out of the sft data: Reward learning from human demonstration im- proves sft for llm alignment. InAdvances in Neural Informa- tion Processing Systems, pages 124292–124318, 2024. 9

  23. [23]

    Mvbench: A comprehensive multi-modal video understand- ing benchmark

    Kunchang Li, Yali Wang, Yinan He, Yizhuo Li, Yi Wang, Yi Liu, Zun Wang, Jilan Xu, Guo Chen, Ping Luo, et al. Mvbench: A comprehensive multi-modal video understand- ing benchmark. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22195– 22206, 2024. 2

  24. [24]

    VideoChat-R1: Enhancing Spatio-Temporal Perception via Reinforcement Fine-Tuning

    Xinhao Li, Ziang Yan, Desen Meng, Lu Dong, Xiangyu Zeng, Yinan He, Yali Wang, Yu Qiao, Yi Wang, and Limin Wang. Videochat-r1: Enhancing spatio-temporal perception via reinforcement fine-tuning.arXiv preprint arXiv:2504.06958, 2025. 2, 3, 6

  25. [25]

    Im- proving llm video understanding with 16 frames per second

    Yixuan Li, Changli Tang, Jimin Zhuang, Yudong Yang, Guangzhi Sun, Wei Li, Zejun Ma, and Chao Zhang. Im- proving llm video understanding with 16 frames per second. arXiv preprint arXiv:2503.13956, 2025. 2

  26. [26]

    TempCompass: Do Video LLMs Really Understand Videos?

    Yuanxin Liu, Shicheng Li, Yi Liu, Yuxiang Wang, Shuhuai Ren, Lei Li, Sishuo Chen, Xu Sun, and Lu Hou. Tempcom- pass: Do video llms really understand videos?arXiv preprint arXiv:2403.00476, 2024. 2

  27. [27]

    Seg-Zero: Reasoning-Chain Guided Segmentation via Cognitive Reinforcement

    Yuqi Liu, Bohao Peng, Zhisheng Zhong, Zihao Yue, Fanbin Lu, Bei Yu, and Jiaya Jia. Seg-zero: Reasoning-chain guided segmentation via cognitive reinforcement.arXiv preprint arXiv:2503.06520, 2025. 3

  28. [28]

    Visual- rft: Visual reinforcement fine-tuning

    Ziyu Liu, Zeyi Sun, Yuhang Zang, Xiaoyi Dong, Yuhang Cao, Haodong Duan, Dahua Lin, and Jiaqi Wang. Visual- rft: Visual reinforcement fine-tuning. InProceedings of the IEEE international conference on computer vision, 2025. 3

  29. [29]

    Lmms engine: A simple, unified multimodal framework for pretraining and finetuning., 2025

    LMMs-Lab. Lmms engine: A simple, unified multimodal framework for pretraining and finetuning., 2025. 4

  30. [30]

    Decoupled weight de- cay regularization

    Ilya Loshchilov and Frank Hutter. Decoupled weight de- cay regularization. InInternational Conference on Learning Representations, 2019. 4

  31. [31]

    MM-Eureka: Exploring the Frontiers of Multimodal Reasoning with Rule-based Reinforcement Learning

    Fanqing Meng, Lingxiao Du, Zongkai Liu, Zhixiang Zhou, Quanfeng Lu, Daocheng Fu, Tiancheng Han, Botian Shi, Wenhai Wang, Junjun He, et al. Mm-eureka: Exploring the frontiers of multimodal reasoning with rule-based reinforce- ment learning.arXiv preprint arXiv:2503.07365, 2025. 3

  32. [32]

    Multi-agent tool-integrated policy optimization.arXiv preprint arXiv:2510.04678, 2025

    Zhanfeng Mo, Xingxuan Li, Yuntao Chen, and Lidong Bing. Multi-agent tool-integrated policy optimization.arXiv preprint arXiv:2510.04678, 2025. 6

  33. [33]

    We-math 2.0: A versatile 10 mathbook system for incentivizing visual mathematical rea- soning.arXiv preprint arXiv:2508.10433, 2025

    Runqi Qiao, Qiuna Tan, Peiqing Yang, Yanzi Wang, Xi- aowan Wang, Enhui Wan, Sitong Zhou, Guanting Dong, Yuchen Zeng, Yida Xu, et al. We-math 2.0: A versatile 10 mathbook system for incentivizing visual mathematical rea- soning.arXiv preprint arXiv:2508.10433, 2025. 3

  34. [34]

    Timechat: A time-sensitive multimodal large lan- guage model for long video understanding

    Shuhuai Ren, Linli Yao, Shicheng Li, Xu Sun, and Lu Hou. Timechat: A time-sensitive multimodal large lan- guage model for long video understanding. InProceedings of the IEEE/CVF Conference on Computer Vision and Pat- tern Recognition, pages 14313–14323, 2024. 2

  35. [35]

    DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

    Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. Deepseekmath: Pushing the limits of math- ematical reasoning in open language models.arXiv preprint arXiv:2402.03300, 2024. 2, 5, 3

  36. [36]

    VLM-R1: A Stable and Generalizable R1-style Large Vision-Language Model

    Haozhan Shen, Peng Liu, Jingcheng Li, Chunxin Fang, Yibo Ma, Jiajia Liao, Qiaoli Shen, Zilun Zhang, Kangjia Zhao, Qianqian Zhang, et al. Vlm-r1: A stable and generaliz- able r1-style large vision-language model.arXiv preprint arXiv:2504.07615, 2025. 3

  37. [37]

    Hybridflow: A flexible and efficient rlhf frame- work

    Guangming Sheng, Chi Zhang, Zilingfeng Ye, Xibin Wu, Wang Zhang, Ru Zhang, Yanghua Peng, Haibin Lin, and Chuan Wu. Hybridflow: A flexible and efficient rlhf frame- work. InProceedings of the Twentieth European Conference on Computer Systems, pages 1279–1297, 2025. 4

  38. [38]

    Moviechat: From dense token to sparse memory for long video understanding

    Enxin Song, Wenhao Chai, Guanhong Wang, Yucheng Zhang, Haoyang Zhou, Feiyang Wu, Haozhe Chi, Xun Guo, Tian Ye, Yanting Zhang, et al. Moviechat: From dense token to sparse memory for long video understanding. InProceed- ings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18221–18232, 2024. 2

  39. [39]

    Pixel Reasoner: Incentivizing Pixel-Space Reasoning with Curiosity-Driven Reinforcement Learning

    Alex Su, Haozhe Wang, Weiming Ren, Fangzhen Lin, and Wenhu Chen. Pixel reasoner: Incentivizing pixel-space rea- soning with curiosity-driven reinforcement learning.arXiv preprint arXiv:2505.15966, 2025. 3

  40. [40]

    Reinforcement fine-tuning powers reasoning ca- pability of multimodal large language models.arXiv preprint arXiv:2505.18536, 2025

    Haoyuan Sun, Jiaqi Wu, Bo Xia, Yifu Luo, Yifei Zhao, Kai Qin, Xufei Lv, Tiantian Zhang, Yongzhe Chang, and Xue- qian Wang. Reinforcement fine-tuning powers reasoning ca- pability of multimodal large language models.arXiv preprint arXiv:2505.18536, 2025. 6

  41. [41]

    Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context

    Gemini Team, Petko Georgiev, Ving Ian Lei, Ryan Burnell, Libin Bai, Anmol Gulati, Garrett Tanzer, Damien Vincent, Zhufeng Pan, Shibo Wang, et al. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of con- text.arXiv preprint arXiv:2403.05530, 2024. 7, 8

  42. [42]

    Introducing gpt-5.https://openai

    OpenAI Team. Introducing gpt-5.https://openai. com/index/introducing-gpt-5/, 2025. 2, 1

  43. [43]

    Thinking with images.https : / / openai

    OpenAI Team. Thinking with images.https : / / openai . com / index / thinking - with - images/,

  44. [44]

    Qwen3-vl: Sharper vision, deeper thought, broader action.https : / / qwen

    Qwen Team. Qwen3-vl: Sharper vision, deeper thought, broader action.https : / / qwen . ai / blog ? from = research . latest - advancements - list & id = 99f0335c4ad9ff6153e517418d48535ab6d8afef,

  45. [45]

    Ego-r1: Chain-of-tool- thought for ultra-long egocentric video reasoning.arXiv preprint arXiv:2506.13654, 2025

    Shulin Tian, Ruiqi Wang, Hongming Guo, Penghao Wu, Yuhao Dong, Xiuying Wang, Jingkang Yang, Hao Zhang, Hongyuan Zhu, and Ziwei Liu. Ego-r1: Chain-of-tool- thought for ultra-long egocentric video reasoning.arXiv preprint arXiv:2506.13654, 2025. 3

  46. [46]

    Videorft: Incentivizing video reasoning capability in mllms via reinforced fine-tuning.arXiv preprint arXiv:2505.12434,

    Qi Wang, Yanrui Yu, Ye Yuan, Rui Mao, and Tianfei Zhou. Videorft: Incentivizing video reasoning capability in mllms via reinforced fine-tuning.arXiv preprint arXiv:2505.12434,

  47. [47]

    Video-thinker: Sparking” thinking with videos” via reinforcement learning.arXiv preprint arXiv:2510.23473,

    Shijian Wang, Jiarui Jin, Xingjian Wang, Linxin Song, Run- hao Fu, Hecheng Wang, Zongyuan Ge, Yuan Lu, and Xuelian Cheng. Video-thinker: Sparking” thinking with videos” via reinforcement learning.arXiv preprint arXiv:2510.23473,

  48. [48]

    LVBench: An Extreme Long Video Understanding Benchmark

    Weihan Wang, Zehai He, Wenyi Hong, Yean Cheng, Xiao- han Zhang, Ji Qi, Xiaotao Gu, Shiyu Huang, Bin Xu, Yuxiao Dong, et al. Lvbench: An extreme long video understanding benchmark.arXiv preprint arXiv:2406.08035, 2024. 2, 3, 7, 8, 5

  49. [49]

    Time-R1: Post-Training Large Vision Language Model for Temporal Video Grounding

    Ye Wang, Ziheng Wang, Boshen Xu, Yang Du, Kejun Lin, Zihan Xiao, Zihao Yue, Jianzhong Ju, Liang Zhang, Dingyi Yang, et al. Time-r1: Post-training large vision lan- guage model for temporal video grounding.arXiv preprint arXiv:2503.13377, 2025. 2, 3, 5

  50. [50]

    Sari: Structured audio reasoning via curriculum-guided reinforcement learning.arXiv preprint arXiv:2504.15900, 2025

    Cheng Wen, Tingwei Guo, Shuaijiang Zhao, Wei Zou, and Xiangang Li. Sari: Structured audio reasoning via curriculum-guided reinforcement learning.arXiv preprint arXiv:2504.15900, 2025. 3

  51. [51]

    Longvideobench: A benchmark for long-context interleaved video-language understanding

    Haoning Wu, Dongxu Li, Bei Chen, and Junnan Li. Longvideobench: A benchmark for long-context interleaved video-language understanding. InAdvances in Neural Infor- mation Processing Systems, pages 28828–28857, 2024. 2, 3

  52. [52]

    Reinforcing Spatial Reasoning in Vision-Language Models with Interwoven Thinking and Visual Drawing

    Junfei Wu, Jian Guan, Kaituo Feng, Qiang Liu, Shu Wu, Liang Wang, Wei Wu, and Tieniu Tan. Reinforcing spatial reasoning in vision-language models with interwoven think- ing and visual drawing.arXiv preprint arXiv:2506.09965,

  53. [53]

    Llava-cot: Let vision language models reason step-by-step

    Guowei Xu, Peng Jin, Ziang Wu, Hao Li, Yibing Song, Lichao Sun, and Li Yuan. Llava-cot: Let vision language models reason step-by-step. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 2087– 2098, 2025. 3

  54. [54]

    Vidchapters-7m: Video chapters at scale

    Antoine Yang, Arsha Nagrani, Ivan Laptev, Josef Sivic, and Cordelia Schmid. Vidchapters-7m: Video chapters at scale. Advances in Neural Information Processing Systems, 36: 49428–49444, 2023. 2

  55. [55]

    Qwen3 Technical Report

    An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report.arXiv preprint arXiv:2505.09388, 2025. 6, 5

  56. [56]

    Mermaid: Multi-perspective self-reflective agents with generative augmentation for emotion recogni- tion

    Zhongyu Yang, Junhao Song, Siyang Song, Wei Pang, and Yingfang Yuan. Mermaid: Multi-perspective self-reflective agents with generative augmentation for emotion recogni- tion. InProceedings of the 2025 Conference on Empiri- cal Methods in Natural Language Processing, pages 24650– 24666, 2025. 3

  57. [57]

    Timeexpert: An expert-guided video llm for video temporal grounding

    Zuhao Yang, Yingchen Yu, Yunqing Zhao, Shijian Lu, and Song Bai. Timeexpert: An expert-guided video llm for video temporal grounding. InProceedings of the IEEE/CVF In- ternational Conference on Computer Vision, pages 24286– 24296, 2025. 2 11

  58. [58]

    VideoLLaMA 3: Frontier Multimodal Foundation Models for Image and Video Understanding

    Boqiang Zhang, Kehan Li, Zesen Cheng, Zhiqiang Hu, Yuqian Yuan, Guanzheng Chen, Sicong Leng, Yuming Jiang, Hang Zhang, Xin Li, et al. Videollama 3: Frontier multi- modal foundation models for image and video understand- ing.arXiv preprint arXiv:2501.13106, 2025. 2, 3

  59. [59]

    Thinking with videos: Multimodal tool- augmented reinforcement learning for long video reasoning

    Haoji Zhang, Xin Gu, Jiawen Li, Chixiang Ma, Sule Bai, Chubin Zhang, Bowen Zhang, Zhichao Zhou, Dongliang He, and Yansong Tang. Thinking with videos: Multimodal tool- augmented reinforcement learning for long video reasoning. arXiv preprint arXiv:2508.04416, 2025. 2, 3, 7

  60. [60]

    Lmms-eval: Re- ality check on the evaluation of large multimodal models

    Kaichen Zhang, Bo Li, Peiyuan Zhang, Fanyi Pu, Joshua Adrian Cahyono, Kairui Hu, Shuai Liu, Yuanhan Zhang, Jingkang Yang, Chunyuan Li, et al. Lmms-eval: Re- ality check on the evaluation of large multimodal models. In Findings of the Association for Computational Linguistics: NAACL 2025, pages 881–916, 2025. 7, 5

  61. [61]

    Open- mmreasoner: Pushing the frontiers for multimodal rea- soning with an open and general recipe.arXiv preprint arXiv:2511.16334, 2025

    Kaichen Zhang, Keming Wu, Zuhao Yang, Kairui Hu, Bin Wang, Ziwei Liu, Xingxuan Li, and Lidong Bing. Open- mmreasoner: Pushing the frontiers for multimodal rea- soning with an open and general recipe.arXiv preprint arXiv:2511.16334, 2025. 3

  62. [62]

    Sglang: Efficient execution of structured language model programs

    Lianmin Zheng, Liangsheng Yin, Zhiqiang Xie, Chuyue Livia Sun, Jeff Huang, Cody Hao Yu, Shiyi Cao, Christos Kozyrakis, Ion Stoica, Joseph E Gonzalez, et al. Sglang: Efficient execution of structured language model programs. InAdvances in neural information processing systems, pages 62557–62583, 2024. 4

  63. [63]

    DeepEyes: Incentivizing "Thinking with Images" via Reinforcement Learning

    Ziwei Zheng, Michael Yang, Jack Hong, Chenxiao Zhao, Guohai Xu, Le Yang, Chao Shen, and Xing Yu. Deep- eyes: Incentivizing” thinking with images” via reinforce- ment learning.arXiv preprint arXiv:2505.14362, 2025. 3

  64. [64]

    Omni-r1: Reinforcement learning for omnimodal reasoning via two-system collaboration.arXiv preprint arXiv:2505.20256, 2025

    Hao Zhong, Muzhi Zhu, Zongze Du, Zheng Huang, Canyu Zhao, Mingyu Liu, Wen Wang, Hao Chen, and Chunhua Shen. Omni-r1: Reinforcement learning for omnimodal reasoning via two-system collaboration.arXiv preprint arXiv:2505.20256, 2025. 3

  65. [65]

    Thinking with Long Videos

    Luowei Zhou, Chenliang Xu, and Jason Corso. Towards automatic learning of procedures from web instructional videos. InProceedings of the AAAI conference on artificial intelligence, 2018. 2 12 LongVT: Incentivizing “Thinking with Long Videos” via Native Tool Calling Supplementary Material Outline This Supplementary Material complements the main paper, prov...

  66. [66]

    global- to-local

    LongVT Performs Human-Aligned Think- ing like Leading Proprietary LMMs The core philosophy of our proposed interleaved Multi- modal Chain-of-Tool-Thought (iMCoTT) entails a “global- to-local” thinking pattern: the model first performs a coarse skim to formulate a hypothesis, and subsequently invokes the nativecrop video()tool to inspect specific tempo- ra...

  67. [67]

    black-box

    What Motivates VideoSIAH? Unveiling the Data Contamination in Qwen-VL Series With the rapid advancements of LMMs, model performance on various benchmarks has steadily improved. However, the “black-box” nature of training data raises a critical ques- tion:Do these improvements reflect genuine reasoning ca- pability, or are they partly due to the model memo...

  68. [68]

    Additional VideoSIAH Details Source Purpose Samples LLaV A-CoT [53] General Visual Reasoning 54,591 OpenVLThinker [6] Complex Reasoning 2,829 We-Math 2.0 [33] Mathematical Reasoning 602 Table 5.Detailed Statistics of Image-based CoT Data for Cold- Start SFT. Breakdown of Image-based CoT Data.As detailed in Table 5, we construct a diverse mixture of image-...

  69. [69]

    For a sequence of to- kensx= (x 1, x2,

    Additional Methodological Details Next-Token Prediction.During SFT, we train our model by minimizing the negative log-likelihood of the target to- kens given their preceding context. For a sequence of to- kensx= (x 1, x2, . . . , xT )and a model parameterized byθ that defines conditional probabilitiesp θ(xt |x <t), the loss function is defined as L(θ) =− ...

  70. [70]

    segment,

    Reflection Trajectory: From Verbose Self- Correction to Internalized Tool Usage We visualize the evolution of the model’s internal thought process in Figure 7 (left). Echoing the training dynam- ics observed in DeepEyes [63], the trajectory of reflection token proportion discloses a distinct three-phase evolution from exploratory correction to efficient t...

  71. [71]

    of Training Steps 3000 160 1600 No

    Additional Implementation Details Component SFT RL RFT Optimizer AdamW [30] AdamW AdamW Learning Rate (LR) 5e-5 1e-6 5e-5 LR Scheduler cosine constant cosine Weight Decay 0.0 1e-2 0.0 No. of Training Steps 3000 160 1600 No. of Warmup Steps 300 0 160 Max Length 51200 52384 51200 Dynamic Batch Size True False True Remove Padding True True True Liger Kernel ...

  72. [72]

    To optimize training throughput and mini- mize memory overhead, we employ an online stream pack- ing strategy on iterable datasets

    framework. To optimize training throughput and mini- mize memory overhead, we employ an online stream pack- ing strategy on iterable datasets. Specifically, instead of padding individual sequences, we concatenate input sam- ples to fill a fixed buffer size of 51,200 tokens, thereby elim- inating redundant computation on padding tokens. Incom- ing data is ...

  73. [73]

    blindly rephrasing

    Inference Efficiency Analysis Efficiency Analysis.We present a comparative analysis of inference latency across four benchmarks in Table 7. De- spite incorporating multi-turn tool interactions, LongVT- 7B-RFT demonstrates remarkable efficiency, achieving the lowest latency on VideoMMMU (1329.8 seconds) and LVBench (1509.3 seconds), and maintaining highly ...

  74. [74]

    Figure 8 shows the RL prompt template, while Figure 9 presents the evaluation prompts used in LLM-as-a-Judge [55] for measuring an- swer’s accuracy during RL

    Examples Prompts and Data Examples.To enhance reproducibil- ity and transparency, we provide concrete examples of the key resources used in our experiments. Figure 8 shows the RL prompt template, while Figure 9 presents the evaluation prompts used in LLM-as-a-Judge [55] for measuring an- swer’s accuracy during RL. One representative sample from both SFT a...

  75. [75]

    which video-game device

    Failure Case Analysis To further illustrate the instability of the RL-only variant discussed in Section 5.3 of the main paper, we present a rep- resentative failure case. As shown in Figure 14, the model correctly recognizes the need to invoke a tool to inspect the glass coffee table. However, after receiving the resampled video frames, it fails to integr...

  76. [76]

    Manager Agent

    Limitation and Future Direction While our efficiency analysis in Section 13 confirms that multi-turn tool interactions do not impose significant la- tency penalties, the memory footprint of such recursive rea- soning remains a bottleneck. The single-agent architecture of LongVT is constrained by the inherent context window of the underlying LMM: as the nu...

  77. [77]

    Broader Impact LongVT advances the field of long-video understanding by introducing an agentic framework capable of proactive ev- idence seeking and self-correction. By enabling LMMs to dynamically inspect and re-examine video segments, this work addresses critical reliability issues—such as hallu- cinations and temporal misalignment that hinder the de- p...

  78. [78]

    type\": \

    Ethical Considerations Advancing Reliability and Safety.LongVT is explicitly designed to enhance the reliability of video LMMs by mit- igating hallucinations through on-demand visual verifica- tion. By grounding answers in retrieved video evidence, the system reduces the likelihood of fabricating events or misinterpreting context, thereby fostering more t...