LongVT: Incentivizing "Thinking with Long Videos" via Native Tool Calling

Bo Li; Chengwei Qin; Kaichen Zhang; Keming Wu; Lidong Bing; Shijian Lu; Sicong Leng; Sudong Wang; Xingxuan Li; Yifan Zhang; Zuhao Yang

arxiv: 2511.20785 · v3 · pith:VXOB4GETnew · submitted 2025-11-25 · 💻 cs.CV

Zuhao Yang, Sudong Wang, Kaichen Zhang, Keming Wu, Sicong Leng, Yifan Zhang, Bo Li, Chengwei Qin, Shijian Lu, Xingxuan Li, Lidong Bing

Pith reviewed 2026-05-22 12:22 UTC · model grok-4.3

classification 💻 cs.CV

keywords long video understanding · multimodal chain-of-thought · agentic reasoning · video cropping tool · temporal grounding · large multimodal models · reinforcement learning

The pith

Large multimodal models can improve long-video reasoning by using their built-in temporal grounding to crop and resample key clips as a native tool.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that large multimodal models often hallucinate on long videos because relevant evidence is sparse and spread out over time. It introduces LongVT as an agentic system that lets the model first review the full video and then use its own sense of timing to select and zoom into specific clips by resampling finer frames. This global-to-local loop repeats until the answer rests on actual retrieved visual evidence rather than guesswork. To support the approach the authors create the VideoSIAH data suite with hundreds of thousands of training examples across three stages and a 1280-pair evaluation benchmark. They train the model in three stages that include supervised fine-tuning and reinforcement learning, then show consistent gains over strong baselines on four long-video benchmarks.

Core claim

By exploiting LMMs' inherent temporal grounding ability as a native video cropping tool to zoom in on specific clips and resample finer-grained frames within an interleaved Multimodal Chain-of-Tool-Thought, and training via a three-stage strategy on the VideoSIAH dataset, LongVT enables thinking with long videos and outperforms existing strong baselines across four challenging long-video understanding and reasoning benchmarks.

What carries the argument

The native video cropping tool that uses the model's own temporal grounding to select relevant clips and resample them at higher frame rates inside the global-to-local reasoning loop.
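The global-to-local loop described above can be sketched as a small control loop. This is a minimal illustration, not the authors' implementation: the `Step` schema, the `policy` callable, the uniform `sample_timestamps` sampler, and the 600-second duration are all assumptions introduced here. The two crop windows and the final answer in the scripted policy are borrowed from the Figure 12 caption purely as illustrative values.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Step:
    """One model turn (hypothetical schema): a crop request or a final answer."""
    crop: Optional[Tuple[float, float]] = None  # (start_time, end_time) in seconds
    answer: Optional[str] = None


def sample_timestamps(start: float, end: float, num_frames: int) -> List[float]:
    """Uniformly spaced frame timestamps inside [start, end] (assumed sampler)."""
    if num_frames < 2:
        return [start]
    stride = (end - start) / (num_frames - 1)
    return [start + i * stride for i in range(num_frames)]


def global_to_local(
    policy: Callable[[List[List[float]]], Step],
    duration: float,
    frames_per_view: int = 8,
    max_turns: int = 4,
) -> Tuple[Optional[str], List[Tuple[float, float]]]:
    """Skim the whole video once, then crop and resample until the policy answers."""
    views = [sample_timestamps(0.0, duration, frames_per_view)]  # global skim
    crops: List[Tuple[float, float]] = []
    for _ in range(max_turns):
        step = policy(views)
        if step.answer is not None:
            return step.answer, crops
        if step.crop is None:
            break
        start = max(0.0, step.crop[0])        # clamp the window to the video
        end = min(duration, step.crop[1])
        if end <= start:
            break
        crops.append((start, end))
        # Same frame budget over a shorter span gives a denser local view.
        views.append(sample_timestamps(start, end, frames_per_view))
    return None, crops


def scripted_policy(views: List[List[float]]) -> Step:
    """Stand-in for the model: replays a fixed script, one step per turn."""
    script = [
        Step(crop=(297.0, 305.0)),   # first window misses the evidence
        Step(crop=(344.0, 372.0)),   # refined window
        Step(answer="US flag"),
    ]
    return script[len(views) - 1]


answer, crops = global_to_local(scripted_policy, duration=600.0)
print(answer, crops)
```

Because the frame budget per view is fixed, a shorter window is sampled more densely than the global skim, which is the sense in which cropping "zooms in". The `max_turns` cap is a design choice of this sketch to guarantee termination.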

If this is right

The model learns to decide when to stop the reasoning loop once evidence is sufficient rather than continuing indefinitely.
Training data that mixes tool-use examples with reinforcement learning helps the model avoid over-cropping or missing distant evidence.
Releasing the VideoSIAH training and evaluation sets lets others test similar native-tool approaches on their own models.
The same global-to-local pattern can be applied to other tasks where evidence must be located across long sequences.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If the native cropping works without extra modules, similar built-in grounding could be used for audio tracks or multi-camera setups without custom tool training.
Extending the loop to include multiple rounds of cropping on the same clip might help when initial selections still lack detail.
Measuring how often the model chooses to crop versus answering directly would show whether the tool is used efficiently.
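The last point, measuring how often the model crops versus answering directly, could be computed from logged reasoning traces along these lines. The trace format (a list of action names per question) and the action name `"crop_video"` as a log label are assumptions made for this sketch.

```python
from typing import Dict, List


def tool_usage_stats(traces: List[List[str]]) -> Dict[str, float]:
    """Summarize crop usage over a set of traces.

    Each trace is the ordered list of action names for one question,
    e.g. ["crop_video", "crop_video", "answer"] (hypothetical logging format).
    """
    if not traces:
        raise ValueError("need at least one trace")
    crop_counts = [sum(1 for action in trace if action == "crop_video") for trace in traces]
    n = len(traces)
    return {
        # Fraction of questions where the model cropped at least once.
        "crop_rate": sum(1 for c in crop_counts if c > 0) / n,
        # Average number of crop calls per question.
        "mean_crops": sum(crop_counts) / n,
    }


example_traces = [
    ["answer"],
    ["crop_video", "answer"],
    ["crop_video", "crop_video", "answer"],
    ["answer"],
]
print(tool_usage_stats(example_traces))
```

A low crop rate combined with unchanged accuracy would suggest the tool is rarely needed, while a high mean crop count per question would point to over-cropping.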

Load-bearing premise

The model's existing temporal grounding skill can be used directly to crop videos accurately without adding new errors or requiring separate training for the cropping action itself.

What would settle it

Run the model on a long video containing a clearly defined event at a known timestamp and check whether the clips it chooses to crop and inspect actually contain that event.
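A minimal sketch of that check, assuming the model's crop windows are logged as `(start, end)` pairs in seconds. The windows below are the two from the Figure 12 caption; the event interval is hypothetical and only there to make the example concrete.

```python
from typing import List, Optional, Tuple

Window = Tuple[float, float]  # (start_time, end_time) in seconds


def covers(window: Window, event: Window) -> bool:
    """True if the cropped window fully contains the event interval."""
    return window[0] <= event[0] and event[1] <= window[1]


def first_covering_turn(windows: List[Window], event: Window) -> Optional[int]:
    """Index of the first crop that contains the event, or None if no crop does."""
    for turn, window in enumerate(windows):
        if covers(window, event):
            return turn
    return None


crop_windows = [(297.0, 305.0), (344.0, 372.0)]  # windows from the Figure 12 caption
known_event = (350.0, 360.0)                     # hypothetical ground-truth interval
print(first_covering_turn(crop_windows, known_event))
```

Full containment is a strict criterion chosen for this sketch; an overlap threshold would be a softer alternative.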

Figures

Figures reproduced from arXiv: 2511.20785 by Bo Li, Chengwei Qin, Kaichen Zhang, Keming Wu, Lidong Bing, Shijian Lu, Sicong Leng, Sudong Wang, Xingxuan Li, Yifan Zhang, Zuhao Yang.

**Figure 1.** Interleaved Multimodal Chain-of-Tool-Thought (iMCoTT). Compared to prior text-based Chain-of-Thought (CoT) reasoning, iMCoTT in our proposed LongVT can natively perform self-reflection via calling `crop_video(start_time, end_time)` tool. It proposes a time window after a global preview, proactively fetches the corresponding short clip, rethinks based on the new evidence, and determines whether to refine or a…

**Figure 2.** Data Pipeline of VideoSIAH. We construct a semi-automatic data pipeline that integrates several state-of-the-art LMMs [1, 5, 12, 43] to sequentially perform long video segmentation, video clip captioning, segment-in-a-haystack QA generation, cross-modal QA filtering, and iMCoTT generation. Icons with human silhouettes denote human-in-the-loop validation, where annotators inspect a small set of representati…

**Figure 3.** Ablations on Reward Design. The left panel shows training dynamics under different accuracy and time rewards, and the right panel shows the effect of tool-call reward on tool usage. Answer Accuracy. Let $K$ be the number of sampled rollouts in a group. For the $k$-th rollout ($k \in \{1, \dots, K\}$), let $\hat{a}^{(k)}$ denote its generated answer and let $a^{\star}$ denote the ground-truth answer. We employ LLM-as-a-Judge [55] t…

**Figure 4.** Overall Framework of LongVT. Our approach processes long-form videos in a human-like two-stage manner. Specifically, LongVT is augmented with interleaved Multimodal Chain-of-Tool-Thought (iMCoTT): first performs a global skim over sampled video frames to form a coarse hypothesis about when evidence likely occurs; then invokes a native video tool `crop_video(start_time, end_time)` to resample finer-grained fr…

**Figure 5.** Comparison of Watching Strategies Proposed by Gemini 2.5 Pro [5] and GPT-5 Thinking [42]. Best viewed when zoomed in.

| Model | Setting | VideoMME [9] (w/o subtitle) | VideoMMMU [13] (adaptation∗) | VideoMMMU [13] (comprehension) | VideoMMMU [13] (perception) | VideoSIAH-Eval (test) |
|---|---|---|---|---|---|---|
| Qwen2.5-VL-7B-Instruct [1] | Original | 64.3 | 35.7 | 44.3 | 56.7 | 33.8 |
| | No Visual | 40.1 | 27.0 | 38.3 | 39.3 | 12.7 |
| | Rearranged Choices | 56.0 | 31.6 | 40.3 | 67.0 | - |
| Qwen3-VL-8B-Instruct [44] | Original | 69.3 | 40.7 | 60.… | | |

**Figure 6.** Category Distribution of VideoSIAH-Eval. We present the distribution of video types (a) and question types (b), highlighting the diversity of our proposed benchmark. [Plot text: "Reflection Words Over Training"; x-axis: Training Step; y-axis: Reflection Word Proportion (raw and smoothed); "Word Cloud from Las…"]

**Figure 7.** Trend of Reflection-Related Words and the Corresponding Word Cloud across All Rollouts.

12. Additional Implementation Details

| Component | SFT | RL | RFT |
|---|---|---|---|
| Optimizer | AdamW [30] | AdamW | AdamW |
| Learning Rate (LR) | 5e-5 | 1e-6 | 5e-5 |
| LR Scheduler | cosine | constant | cosine |
| Weight Decay | 0.0 | 1e-2 | 0.0 |
| No. of Training Steps | 3000 | 160 | 1600 |
| No. of Warmup Steps | 300 | 0 | 160 |
| Max Length | 51200 | 52384 | 51200 |
| Dynamic Batch Size | True | False | True |
| Rem… | | | |

**Figure 8.** Prompt Template Utilized for RL. This template outlines the structural guidelines and system instructions provided to the model during the RL training phase. Below are two answers to a question. Question is [Question], [Standard Answer] is the standard answer to the question, and [Model_answer] is the answer extracted from a model's output to this question. Judge how consistent the two answers are. Scoring…

**Figure 9.** Evaluation Prompt for LLM-as-a-Judge. We present the full system instruction used to query the judge model. This prompt defines the scoring criteria and guidelines to ensure consistent evaluation of the model’s generated responses.

**Figure 10.** Representative Data Example for SFT and RFT. The example illustrates the input format and the corresponding ground-truth response used to train the model across both fine-tuning stages.

**Figure 11.** An Example of Single-turn Inference with Self-Correction. The model initially misidentifies the basin color as pink. However, through the reasoning process (highlighted in the “Thinking” block), it explicitly decides to double-check the frames, corrects the hallucinations, and outputs the correct answer (Blue).

**Figure 12.** An Example of Multi-step Inference Involving Tool Interaction. In this complex query, the model initially crops an incorrect time window (297s-305s) which lacks the target visual information. Recognizing this error during the reasoning phase, it refines the parameters and calls the tool again with the correct window (344s-372s) to successfully identify the US flag.

**Figure 13.** Qualitative Comparison between Textual CoT and Our Designed iMCoTT. The baseline textual CoT (left) relies on hallucinated memory, confidently providing an incorrect answer regarding the cars’ colors (“Black and Yellow”). In contrast, our model (right) actively engages with the video content via tool usage. Despite an initial mis-localization (90s-120s), the model explicitly detects the absence of the tar…

**Figure 14.** Failure Case of the RL-only Variant. This example demonstrates the model’s inability to maintain the logical flow after a tool interaction without prior SFT. Although the model initiates a tool call to inspect the blurred region, it fails to utilize the returned observation to answer the user’s question. Instead, it loses the conversational context and hallucinates a repetition of the general video descri…
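The implementation details captured alongside Figure 7 list per-stage training settings for SFT, RL, and RFT. As a convenience, they can be collected into a small config object. The values below are copied from that listing; the class name, field names, and the `warmup_fraction` helper are assumptions of this sketch and do not come from the released code.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class StageConfig:
    """Per-stage training settings (field names are hypothetical)."""
    optimizer: str
    learning_rate: float
    lr_scheduler: str
    weight_decay: float
    training_steps: int
    warmup_steps: int
    max_length: int
    dynamic_batch_size: bool

    @property
    def warmup_fraction(self) -> float:
        """Share of training steps spent in warmup."""
        return self.warmup_steps / self.training_steps


# Values as listed in the implementation-details table next to Figure 7.
STAGES: Dict[str, StageConfig] = {
    "sft": StageConfig("AdamW", 5e-5, "cosine", 0.0, 3000, 300, 51200, True),
    "rl": StageConfig("AdamW", 1e-6, "constant", 1e-2, 160, 0, 52384, False),
    "rft": StageConfig("AdamW", 5e-5, "cosine", 0.0, 1600, 160, 51200, True),
}

for name, cfg in STAGES.items():
    print(name, cfg.learning_rate, cfg.lr_scheduler, cfg.warmup_fraction)
```

Laid out this way, the RL stage stands out: it uses a much smaller constant learning rate, no warmup, and far fewer steps than the two fine-tuning stages.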

Abstract

Large multimodal models (LMMs) have shown great potential for video reasoning with textual Chain-of-Thought. However, they remain vulnerable to hallucinations, especially when processing long-form videos where evidence is sparse and temporally dispersed. Inspired by how humans comprehend long videos - by first skimming globally and then examining relevant clips for details - we introduce LongVT, an end-to-end agentic framework that enables "Thinking with Long Videos" via interleaved Multimodal Chain-of-Tool-Thought. Specifically, we exploit LMMs' inherent temporal grounding ability as a native video cropping tool to zoom in on a specific video clip and resample finer-grained video frames. This global-to-local reasoning loop continues until answers are grounded in retrieved visual evidence. Given the scarcity of fine-grained question-answering (QA) data for the long video reasoning task, we curate and will release a data suite named VideoSIAH to facilitate both training and evaluation. Specifically, our training dataset consists of 247.9K samples for tool-integrated cold-start supervised fine-tuning, 1.6K samples for agentic reinforcement learning, and 15.4K samples for agentic reinforcement fine-tuning, respectively. Our evaluation benchmark consists of 1,280 QA pairs that are carefully curated through a semi-automatic data pipeline with human-in-the-loop validation. With a meticulously designed three-stage training strategy and extensive empirical validation, LongVT consistently outperforms existing strong baselines across four challenging long-video understanding and reasoning benchmarks. Our codes, data, and model checkpoints are publicly available at https://github.com/EvolvingLMMs-Lab/LongVT .

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

LongVT shows a workable native cropping tool for LMM agents on long videos with public data and staged training, but the grounding reliability is assumed rather than measured.

The main point is that LongVT lets an LMM use its own temporal grounding to crop and zoom into relevant clips inside an agentic reasoning loop, and the authors back this with a new dataset plus three-stage training that beats baselines on four long-video benchmarks. They treat the cropping as a native tool call within an interleaved multimodal chain-of-tool-thought, so the model decides when to pull a finer-grained segment and resample frames until the answer is visually grounded. This mirrors how people skim then focus on long videos and aims to reduce hallucinations when evidence is sparse and spread out.

The VideoSIAH suite they release (247.9K samples for cold-start SFT on tool use, smaller RL sets, and a 1,280-pair eval set with human validation), together with public code and checkpoints, is a clear practical plus for anyone wanting to reproduce or extend the setup. The staged training recipe looks like a reasonable way to bootstrap the agent behavior without starting from zero. The incremental novelty is in making the temporal grounding directly callable as a tool rather than bolting on external modules.

On the soft side, the whole claim rests on the untested idea that the LMM already has reliable temporal grounding that can be invoked without systematic errors or extra training. There is no reported precision or recall on the predicted start and end times, no ablation showing what happens when crops are off, and no diagnostic of how often bad crops affect final accuracy. If grounding mistakes are frequent on long sparse videos, the benchmark gains could come mainly from the data curation or the RL stages instead of the native-tool mechanism itself. The abstract claims consistent outperformance but the summary gives no numbers or error breakdowns, so the effect size is hard to judge from what is here.

This is for labs working on agentic multimodal models and long-video QA who need concrete recipes and open resources. A reader who wants to try tool-augmented reasoning on videos will get usable ideas and code to build from. It deserves a serious referee because the approach is described clearly, the resources are public, and the empirical claims can be checked directly, though any review would likely push for ablations on the cropping accuracy to pin down where the gains actually come from.

Referee Report

2 major / 2 minor

Summary. The paper introduces LongVT, an end-to-end agentic framework for long-video reasoning that interleaves multimodal chain-of-tool-thought with native video cropping. It exploits LMMs' inherent temporal grounding to select and resample relevant clips rather than relying on a separately trained localization module, using a three-stage pipeline (247.9K-sample cold-start SFT, 1.6K-sample agentic RL, 15.4K-sample agentic reinforcement fine-tuning) on the newly curated VideoSIAH dataset. The authors report consistent gains over strong baselines on four long-video understanding and reasoning benchmarks and contribute a 1,280-pair evaluation benchmark, with public code, data, and checkpoints.

Significance. If the reported gains prove robust and attributable to the native-tool mechanism, the work offers a practical route to scalable long-video reasoning that avoids the overhead of training dedicated localization modules. The public release of training data, evaluation set, and model weights is a clear strength for reproducibility and follow-on research.

major comments (2)

[§4] §4 (Experiments) and associated tables: the central outperformance claim is presented without any diagnostic metrics on the reliability of the native temporal-grounding tool (e.g., precision/recall or temporal IoU of predicted start/end times against ground-truth relevant segments). Because the method explicitly relies on this unmeasured component to produce accurate crops, the absence of such analysis leaves open the possibility that gains derive primarily from data curation or the RL stages rather than the claimed native-tool loop.
[§3.2] §3.2 (Tool-integrated cold-start SFT) and §3.3 (Agentic RL): no ablation isolates the contribution of the cropping tool itself. Removing or replacing the native grounding step with oracle crops or a separately trained localizer would directly test whether the reported improvements require the specific assumption that LMMs already possess reliable, error-free temporal grounding.
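For reference, the temporal IoU the first comment asks for is the standard interval overlap ratio. A minimal sketch follows, assuming windows are `(start, end)` pairs in seconds; the function name is illustrative and the paper's exact evaluation protocol is not specified in this summary.

```python
from typing import Tuple

Window = Tuple[float, float]  # (start_time, end_time) in seconds


def temporal_iou(pred: Window, gt: Window) -> float:
    """Intersection over union of two time intervals."""
    if pred[1] < pred[0] or gt[1] < gt[0]:
        raise ValueError("each window must satisfy start <= end")
    intersection = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - intersection
    return intersection / union if union > 0 else 0.0


# Half-overlapping 10-second windows: 5 s shared out of 15 s covered.
print(temporal_iou((10.0, 20.0), (15.0, 25.0)))
# Disjoint windows score zero.
print(temporal_iou((297.0, 305.0), (344.0, 372.0)))
```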

minor comments (2)

[Abstract] The abstract and §1 state that LongVT 'consistently outperforms existing strong baselines' yet provide no numerical deltas, baseline names, or error bars; these should be added for immediate readability.
[§3] Notation for the interleaved reasoning loop (e.g., how tool calls are formatted and how frame resampling is performed) is described only at a high level; a concrete example or pseudocode would improve clarity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below and commit to revisions that directly incorporate the suggested analyses to strengthen the evidence for the native tool-calling mechanism.

Referee: [§4] §4 (Experiments) and associated tables: the central outperformance claim is presented without any diagnostic metrics on the reliability of the native temporal-grounding tool (e.g., precision/recall or temporal IoU of predicted start/end times against ground-truth relevant segments). Because the method explicitly relies on this unmeasured component to produce accurate crops, the absence of such analysis leaves open the possibility that gains derive primarily from data curation or the RL stages rather than the claimed native-tool loop.

Authors: We agree that direct diagnostic metrics on the native temporal-grounding tool would provide stronger support for attributing gains to the tool-calling loop rather than data or RL alone. In the revised manuscript we will add a dedicated analysis subsection in §4 reporting precision, recall, and temporal IoU of the model’s predicted start/end times against the ground-truth relevant segments available in the VideoSIAH evaluation set. These metrics will be computed on the 1,280-pair held-out benchmark and included alongside the existing task-performance tables. revision: yes
Referee: [§3.2] §3.2 (Tool-integrated cold-start SFT) and §3.3 (Agentic RL): no ablation isolates the contribution of the cropping tool itself. Removing or replacing the native grounding step with oracle crops or a separately trained localizer would directly test whether the reported improvements require the specific assumption that LMMs already possess reliable, error-free temporal grounding.

Authors: We concur that an ablation isolating the native cropping tool is necessary to validate the core claim. In the revision we will add experiments that replace the native grounding step with (i) oracle ground-truth crops and (ii) a separately trained temporal localizer baseline. The resulting performance deltas on the four long-video benchmarks will be reported in §4 (new rows in Table 2 or an additional ablation table). These comparisons will quantify how much the integrated native-tool loop contributes beyond data curation and the RL stages, while acknowledging that the LMM is not assumed to be error-free—the RL phases are explicitly designed to incentivize robust tool use despite occasional grounding inaccuracies. revision: yes
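One way the promised precision and recall could be defined is by duration: precision is the share of the predicted window that lies inside the ground-truth segment, and recall is the share of the ground-truth segment that the prediction covers. This is a sketch under that assumption; the rebuttal does not say which definition the authors would use, and the function names are hypothetical.

```python
from typing import List, Tuple

Window = Tuple[float, float]  # (start_time, end_time) in seconds


def window_precision_recall(pred: Window, gt: Window) -> Tuple[float, float]:
    """Duration-based precision and recall of a predicted crop window."""
    overlap = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    pred_len = pred[1] - pred[0]
    gt_len = gt[1] - gt[0]
    precision = overlap / pred_len if pred_len > 0 else 0.0
    recall = overlap / gt_len if gt_len > 0 else 0.0
    return precision, recall


def mean_precision_recall(pairs: List[Tuple[Window, Window]]) -> Tuple[float, float]:
    """Average the per-sample scores over (prediction, ground truth) pairs."""
    if not pairs:
        raise ValueError("need at least one pair")
    scores = [window_precision_recall(pred, gt) for pred, gt in pairs]
    n = len(scores)
    return sum(p for p, _ in scores) / n, sum(r for _, r in scores) / n


# An over-wide crop keeps recall at 1.0 but drags precision down to 0.2.
print(window_precision_recall((0.0, 100.0), (40.0, 60.0)))
```

Reporting the two numbers separately distinguishes over-cropping (low precision, high recall) from missed evidence (low recall), which a single IoU value cannot do.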

Circularity Check

0 steps flagged

No circularity: empirical framework with external benchmarks and public artifacts

The paper describes an agentic LMM framework trained in three stages on curated VideoSIAH data (247.9K SFT, 1.6K RL, and 15.4K RFT samples) and evaluated on four long-video benchmarks; the VideoSIAH evaluation benchmark contains 1,280 QA pairs. No equations, derivations, or fitted parameters are presented that reduce by construction to the inputs. The core assumption of inherent temporal grounding is invoked as a starting point for tool use and then validated through end-to-end performance gains rather than being defined in terms of the outputs. Public code, data, and checkpoints allow independent reproduction against external benchmarks, confirming the derivation chain is self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The framework rests on the assumption that current LMMs already contain usable temporal grounding that can be turned into a reliable cropping tool; no new physical entities or mathematical axioms are introduced beyond standard supervised and reinforcement learning setups.

axioms (1)

domain assumption: LMMs possess inherent temporal grounding ability usable as a native cropping tool
Invoked in the description of the global-to-local reasoning loop.

pith-pipeline@v0.9.0 · 5861 in / 1278 out tokens · 31124 ms · 2026-05-22T12:22:16.466338+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

we exploit LMMs' inherent temporal grounding ability as a native video cropping tool... joint answer-temporal grounding reward... $R^{(k)} = R^{(k)}_{\text{acc}} + R^{(k)}_{\text{format}} + R^{(k)}_{\text{time}}$ with IoU

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear

three-stage training strategy... cold-start SFT + agentic RL + RFT
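The reward quoted in the first passage above sums an accuracy term, a format term, and a time term tied to IoU. A minimal sketch of that composition follows. Everything beyond the three-term sum is an assumption made here for illustration: that the accuracy score lies in [0, 1], that the format term is binary, that the time term equals the temporal IoU of the predicted and ground-truth windows, and that a rollout with no crop gets a time term of zero.

```python
from typing import Optional, Tuple

Window = Tuple[float, float]  # (start_time, end_time) in seconds


def _temporal_iou(pred: Window, gt: Window) -> float:
    """Intersection over union of two time intervals."""
    intersection = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - intersection
    return intersection / union if union > 0 else 0.0


def rollout_reward(
    accuracy_score: float,
    format_ok: bool,
    pred_window: Optional[Window],
    gt_window: Window,
) -> float:
    """Unweighted sum R = R_acc + R_format + R_time for one rollout (assumed forms)."""
    r_acc = accuracy_score                      # e.g. a judge score in [0, 1] (assumed)
    r_format = 1.0 if format_ok else 0.0        # binary format check (assumed)
    r_time = _temporal_iou(pred_window, gt_window) if pred_window is not None else 0.0
    return r_acc + r_format + r_time


# Correct, well-formatted answer with a half-overlapping crop.
print(rollout_reward(1.0, True, (10.0, 20.0), (15.0, 25.0)))
# Wrong, badly formatted answer with no crop at all.
print(rollout_reward(0.0, False, None, (15.0, 25.0)))
```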

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 9 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

ParaVT: Taming the Tool Prior Paradox for Parallel Tool Use in Agentic Video Reinforcement Learning
cs.CV 2026-05 unverdicted novelty 7.0

ParaVT is a parallel video tool-calling RL framework that resolves the Tool Prior Paradox via PARA-GRPO, delivering +7.9% average gains on six long-video benchmarks and raising format compliance from 0.13 to 0.64.
VISD: Enhancing Video Reasoning via Structured Self-Distillation
cs.CV 2026-05 unverdicted novelty 7.0

VISD improves VideoLLM reasoning performance and training efficiency by combining structured multi-dimensional self-distillation feedback with RL via direction-magnitude decoupling, curriculum scheduling, and EMA stab...
Listening with Time: Precise Temporal Awareness for Long-Form Audio Understanding
eess.AS 2026-04 unverdicted novelty 7.0

LAT-Audio introduces a global-to-local reasoning approach with TWA-CoT that outperforms prior models on temporal tasks for audio up to 30 minutes.
SVAgent: Storyline-Guided Long Video Understanding via Cross-Modal Multi-Agent Collaboration
cs.CV 2026-04 unverdicted novelty 7.0

SVAgent improves long video question answering by constructing storylines via multi-agent collaboration and aligning cross-modal predictions for more robust, human-like reasoning.
ParaVT: Taming the Tool Prior Paradox for Parallel Tool Use in Agentic Video Reinforcement Learning
cs.CV 2026-05 unverdicted novelty 6.0

ParaVT introduces the first multi-agent RL framework for parallel video tool calling in LMMs, using PARA-GRPO to resolve the Tool Prior Paradox and achieve +7.9% average improvement over Qwen3-VL baseline across six b...
VideoSeeker: Incentivizing Instance-level Video Understanding via Native Agentic Tool Invocation
cs.CV 2026-05 unverdicted novelty 6.0

VideoSeeker integrates agentic reasoning and visual prompts into LVLMs via automated data synthesis, cold-start supervision, and RL training, yielding +13.7% gains on instance-level video tasks over baselines includin...
Training Long-Context Vision-Language Models Effectively with Generalization Beyond 128K Context
cs.CV 2026-05 unverdicted novelty 6.0

Continued pre-training with balanced long-document VQA data extends a 7B LVLM to 128K context, improving long-document VQA by 7.1% and generalizing to 512K without further training.
VISD: Enhancing Video Reasoning via Structured Self-Distillation
cs.CV 2026-05 unverdicted novelty 5.0

VISD adds structured privileged feedback from a judge model and a direction-magnitude decoupling trick to let VideoLLMs learn token-level credit assignment while keeping RL stable, yielding higher accuracy and roughly...
VISD: Enhancing Video Reasoning via Structured Self-Distillation
cs.CV 2026-05 unverdicted novelty 5.0

VISD improves VideoLLM reasoning by adding multi-dimensional diagnostic self-distillation and RL decoupling, yielding higher accuracy, better grounding, and nearly 2x faster training convergence.

Reference graph

Works this paper leans on

78 extracted references · 78 canonical work pages · cited by 6 Pith papers · 24 internal anchors

[1]

Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2. 5-vl technical report.arXiv preprint arXiv:2502.13923, 2025. 3, 4, 5, 7, 8, 1, 2

work page internal anchor Pith review Pith/arXiv arXiv 2025
[2]

Temporalbench: Benchmarking fine- grained temporal understanding for multimodal video mod- els.arXiv preprint arXiv:2410.10818, 2024

Mu Cai, Reuben Tan, Jianrui Zhang, Bocheng Zou, Kai Zhang, Feng Yao, Fangrui Zhu, Jing Gu, Yiwu Zhong, Yuzhang Shang, et al. Temporalbench: Benchmarking fine- grained temporal understanding for multimodal video mod- els.arXiv preprint arXiv:2410.10818, 2024. 2

work page arXiv 2024
[3]

Eliciting good teach- ing from humans for machine learners.Artificial Intelli- gence, 217:198–215, 2014

Maya Cakmak and Andrea L Thomaz. Eliciting good teach- ing from humans for machine learners.Artificial Intelli- gence, 217:198–215, 2014. 2

work page 2014
[4]

Scaling rl to long videos.arXiv preprint arXiv:2507.07966, 2025

Yukang Chen, Wei Huang, Baifeng Shi, Qinghao Hu, Han- rong Ye, Ligeng Zhu, Zhijian Liu, Pavlo Molchanov, Jan Kautz, Xiaojuan Qi, et al. Scaling rl to long videos.arXiv preprint arXiv:2507.07966, 2025. 3, 4, 5

work page arXiv 2025
[5]

Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blis- tein, Ori Ram, Dan Zhang, Evan Rosen, et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities.arXiv preprint arXiv:2507.06261, 2025. 2, 4, 5, 1

work page internal anchor Pith review Pith/arXiv arXiv 2025
[6]

OpenVLThinker: Complex Vision-Language Reasoning via Iterative SFT-RL Cycles

Yihe Deng, Hritik Bansal, Fan Yin, Nanyun Peng, Wei Wang, and Kai-Wei Chang. Openvlthinker: An early ex- ploration to complex vision-language reasoning via iterative self-improvement.arXiv preprint arXiv:2503.17352, 2025. 3

work page internal anchor Pith review Pith/arXiv arXiv 2025
[7]

Grit: Teaching mllms to think with images

Yue Fan, Xuehai He, Diji Yang, Kaizhi Zheng, Ching-Chen Kuo, Yuting Zheng, Sravana Jyothi Narayanaraju, Xinze Guan, and Xin Eric Wang. Grit: Teaching mllms to think with images. InAdvances in Neural Information Processing Systems, 2025. 3 9

work page 2025
[8]

Video-R1: Reinforcing Video Reasoning in MLLMs

Kaituo Feng, Kaixiong Gong, Bohao Li, Zonghao Guo, Yibing Wang, Tianshuo Peng, Junfei Wu, Xiaoying Zhang, Benyou Wang, and Xiangyu Yue. Video-r1: Reinforcing video reasoning in mllms.arXiv preprint arXiv:2503.21776,

work page internal anchor Pith review Pith/arXiv arXiv
[9]

Video-mme: The first-ever comprehensive evaluation benchmark of multi-modal llms in video analysis

Chaoyou Fu, Yuhan Dai, Yongdong Luo, Lei Li, Shuhuai Ren, Renrui Zhang, Zihan Wang, Chenyu Zhou, Yunhang Shen, Mengdan Zhang, et al. Video-mme: The first-ever comprehensive evaluation benchmark of multi-modal llms in video analysis. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 24108–24118, 2025. 2, 3, 7, 8, 5

work page 2025
[10]

Tall: Temporal activity localization via language query

Jiyang Gao, Chen Sun, Zhenheng Yang, and Ram Nevatia. Tall: Temporal activity localization via language query. In Proceedings of the IEEE international conference on com- puter vision, pages 5267–5275, 2017. 2, 8, 9

work page 2017
[11]

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning.arXiv preprint arXiv:2501.12948, 2025. 2, 3, 9

work page internal anchor Pith review Pith/arXiv arXiv 2025
[12]

GLM-4.5V and GLM-4.1V-Thinking: Towards Versatile Multimodal Reasoning with Scalable Reinforcement Learning

Wenyi Hong, Wenmeng Yu, Xiaotao Gu, Guo Wang, Guob- ing Gan, Haomiao Tang, Jiale Cheng, Ji Qi, Junhui Ji, Li- hang Pan, et al. Glm-4.5v and glm-4.1v-thinking: Towards versatile multimodal reasoning with scalable reinforcement learning.arXiv preprint arXiv:2507.01006, 2025. 3, 4

work page internal anchor Pith review Pith/arXiv arXiv 2025
[13]

Video-MMMU: Evaluating Knowledge Acquisition from Multi-Discipline Professional Videos

Kairui Hu, Penghao Wu, Fanyi Pu, Wang Xiao, Yuanhan Zhang, Xiang Yue, Bo Li, and Ziwei Liu. Video-mmmu: Evaluating knowledge acquisition from multi-discipline pro- fessional videos.arXiv preprint arXiv:2501.13826, 2025. 2, 7, 8, 5

work page internal anchor Pith review Pith/arXiv arXiv 2025
[14]

Multimodal 2gpretraining for dense video cap- tioning.arXiv preprint arXiv:2011.11760, 2020

Gabriel Huang, Bo Pang, Zhenhai Zhu, Clara Rivera, and Radu Soricut. Multimodal 2gpretraining for dense video cap- tioning.arXiv preprint arXiv:2011.11760, 2020. 2

work page arXiv 2011
[15]

Vision-R1: Incentivizing Reasoning Capability in Multimodal Large Language Models

Wenxuan Huang, Bohan Jia, Zijie Zhai, Shaosheng Cao, Zheyu Ye, Fei Zhao, Zhe Xu, Yao Hu, and Shaohui Lin. Vision-r1: Incentivizing reasoning capability in multimodal large language models.arXiv preprint arXiv:2503.06749,

work page internal anchor Pith review Pith/arXiv arXiv
[16]

GPT-4o System Card

Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perel- man, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Weli- hinda, Alan Hayes, Alec Radford, et al. Gpt-4o system card. arXiv preprint arXiv:2410.21276, 2024. 7, 8

work page internal anchor Pith review Pith/arXiv arXiv 2024
[17]

OpenAI o1 System Card

Aaron Jaech, Adam Kalai, Adam Lerer, Adam Richard- son, Ahmed El-Kishky, Aiden Low, Alec Helyar, Aleksander Madry, Alex Beutel, Alex Carney, et al. Openai o1 system card.arXiv preprint arXiv:2412.16720, 2024. 3

[18] Ranjay Krishna, Kenji Hata, Frederic Ren, Li Fei-Fei, and Juan Carlos Niebles. Dense-captioning events in videos. In Proceedings of the IEEE International Conference on Computer Vision, pages 706–715, 2017.

[19] Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with PagedAttention. In Proceedings of the 29th Symposium on Operating Systems Principles, pages 611–626, 2023.

[20] Sicong Leng, Jing Wang, Jiaxi Li, Hao Zhang, Zhiqiang Hu, Boqiang Zhang, Yuming Jiang, Hang Zhang, Xin Li, Lidong Bing, et al. MMR1: Enhancing multimodal reasoning with variance-aware sampling and open resources. arXiv preprint arXiv:2509.21268, 2025.

[21] Gang Li, Jizhong Liu, Heinrich Dinkel, Yadong Niu, Junbo Zhang, and Jian Luan. Reinforcement learning outperforms supervised fine-tuning: A case study on audio question answering. arXiv preprint arXiv:2503.11197, 2025.

[22] Jiaxiang Li, Siliang Zeng, Hoi-To Wai, Chenliang Li, Alfredo Garcia, and Mingyi Hong. Getting more juice out of the SFT data: Reward learning from human demonstration improves SFT for LLM alignment. In Advances in Neural Information Processing Systems, pages 124292–124318, 2024.

[23] Kunchang Li, Yali Wang, Yinan He, Yizhuo Li, Yi Wang, Yi Liu, Zun Wang, Jilan Xu, Guo Chen, Ping Luo, et al. MVBench: A comprehensive multi-modal video understanding benchmark. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22195–22206, 2024.

[24] Xinhao Li, Ziang Yan, Desen Meng, Lu Dong, Xiangyu Zeng, Yinan He, Yali Wang, Yu Qiao, Yi Wang, and Limin Wang. VideoChat-R1: Enhancing spatio-temporal perception via reinforcement fine-tuning. arXiv preprint arXiv:2504.06958, 2025.

[25] Yixuan Li, Changli Tang, Jimin Zhuang, Yudong Yang, Guangzhi Sun, Wei Li, Zejun Ma, and Chao Zhang. Improving LLM video understanding with 16 frames per second. arXiv preprint arXiv:2503.13956, 2025.

[26] Yuanxin Liu, Shicheng Li, Yi Liu, Yuxiang Wang, Shuhuai Ren, Lei Li, Sishuo Chen, Xu Sun, and Lu Hou. TempCompass: Do video LLMs really understand videos? arXiv preprint arXiv:2403.00476, 2024.

[27] Yuqi Liu, Bohao Peng, Zhisheng Zhong, Zihao Yue, Fanbin Lu, Bei Yu, and Jiaya Jia. Seg-Zero: Reasoning-chain guided segmentation via cognitive reinforcement. arXiv preprint arXiv:2503.06520, 2025.

[28] Ziyu Liu, Zeyi Sun, Yuhang Zang, Xiaoyi Dong, Yuhang Cao, Haodong Duan, Dahua Lin, and Jiaqi Wang. Visual-RFT: Visual reinforcement fine-tuning. In Proceedings of the IEEE International Conference on Computer Vision, 2025.

[29] LMMs-Lab. LMMs Engine: A simple, unified multimodal framework for pretraining and finetuning, 2025.

[30] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In International Conference on Learning Representations, 2019.

[31] Fanqing Meng, Lingxiao Du, Zongkai Liu, Zhixiang Zhou, Quanfeng Lu, Daocheng Fu, Tiancheng Han, Botian Shi, Wenhai Wang, Junjun He, et al. MM-Eureka: Exploring the frontiers of multimodal reasoning with rule-based reinforcement learning. arXiv preprint arXiv:2503.07365, 2025.

[32] Zhanfeng Mo, Xingxuan Li, Yuntao Chen, and Lidong Bing. Multi-agent tool-integrated policy optimization. arXiv preprint arXiv:2510.04678, 2025.

[33] Runqi Qiao, Qiuna Tan, Peiqing Yang, Yanzi Wang, Xiaowan Wang, Enhui Wan, Sitong Zhou, Guanting Dong, Yuchen Zeng, Yida Xu, et al. We-Math 2.0: A versatile MathBook system for incentivizing visual mathematical reasoning. arXiv preprint arXiv:2508.10433, 2025.

[34] Shuhuai Ren, Linli Yao, Shicheng Li, Xu Sun, and Lu Hou. TimeChat: A time-sensitive multimodal large language model for long video understanding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14313–14323, 2024.

[35] Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. DeepSeekMath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024.

[36] Haozhan Shen, Peng Liu, Jingcheng Li, Chunxin Fang, Yibo Ma, Jiajia Liao, Qiaoli Shen, Zilun Zhang, Kangjia Zhao, Qianqian Zhang, et al. VLM-R1: A stable and generalizable R1-style large vision-language model. arXiv preprint arXiv:2504.07615, 2025.

[37] Guangming Sheng, Chi Zhang, Zilingfeng Ye, Xibin Wu, Wang Zhang, Ru Zhang, Yanghua Peng, Haibin Lin, and Chuan Wu. HybridFlow: A flexible and efficient RLHF framework. In Proceedings of the Twentieth European Conference on Computer Systems, pages 1279–1297, 2025.

[38] Enxin Song, Wenhao Chai, Guanhong Wang, Yucheng Zhang, Haoyang Zhou, Feiyang Wu, Haozhe Chi, Xun Guo, Tian Ye, Yanting Zhang, et al. MovieChat: From dense token to sparse memory for long video understanding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18221–18232, 2024.

[39] Alex Su, Haozhe Wang, Weiming Ren, Fangzhen Lin, and Wenhu Chen. Pixel Reasoner: Incentivizing pixel-space reasoning with curiosity-driven reinforcement learning. arXiv preprint arXiv:2505.15966, 2025.

[40] Haoyuan Sun, Jiaqi Wu, Bo Xia, Yifu Luo, Yifei Zhao, Kai Qin, Xufei Lv, Tiantian Zhang, Yongzhe Chang, and Xueqian Wang. Reinforcement fine-tuning powers reasoning capability of multimodal large language models. arXiv preprint arXiv:2505.18536, 2025.

[41] Gemini Team, Petko Georgiev, Ving Ian Lei, Ryan Burnell, Libin Bai, Anmol Gulati, Garrett Tanzer, Damien Vincent, Zhufeng Pan, Shibo Wang, et al. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context. arXiv preprint arXiv:2403.05530, 2024.

[42] OpenAI Team. Introducing GPT-5. https://openai.com/index/introducing-gpt-5/, 2025.

[43] OpenAI Team. Thinking with images. https://openai.com/index/thinking-with-images/.

[44] Qwen Team. Qwen3-VL: Sharper vision, deeper thought, broader action. https://qwen.ai/blog?from=research.latest-advancements-list&id=99f0335c4ad9ff6153e517418d48535ab6d8afef.

[45] Shulin Tian, Ruiqi Wang, Hongming Guo, Penghao Wu, Yuhao Dong, Xiuying Wang, Jingkang Yang, Hao Zhang, Hongyuan Zhu, and Ziwei Liu. Ego-R1: Chain-of-tool-thought for ultra-long egocentric video reasoning. arXiv preprint arXiv:2506.13654, 2025.

[46] Qi Wang, Yanrui Yu, Ye Yuan, Rui Mao, and Tianfei Zhou. VideoRFT: Incentivizing video reasoning capability in MLLMs via reinforced fine-tuning. arXiv preprint arXiv:2505.12434.

[47] Shijian Wang, Jiarui Jin, Xingjian Wang, Linxin Song, Runhao Fu, Hecheng Wang, Zongyuan Ge, Yuan Lu, and Xuelian Cheng. Video-Thinker: Sparking “thinking with videos” via reinforcement learning. arXiv preprint arXiv:2510.23473.

[48] Weihan Wang, Zehai He, Wenyi Hong, Yean Cheng, Xiaohan Zhang, Ji Qi, Xiaotao Gu, Shiyu Huang, Bin Xu, Yuxiao Dong, et al. LVBench: An extreme long video understanding benchmark. arXiv preprint arXiv:2406.08035, 2024.

[49] Ye Wang, Ziheng Wang, Boshen Xu, Yang Du, Kejun Lin, Zihan Xiao, Zihao Yue, Jianzhong Ju, Liang Zhang, Dingyi Yang, et al. Time-R1: Post-training large vision language model for temporal video grounding. arXiv preprint arXiv:2503.13377, 2025.

[50] Cheng Wen, Tingwei Guo, Shuaijiang Zhao, Wei Zou, and Xiangang Li. SARI: Structured audio reasoning via curriculum-guided reinforcement learning. arXiv preprint arXiv:2504.15900, 2025.

[51] Haoning Wu, Dongxu Li, Bei Chen, and Junnan Li. LongVideoBench: A benchmark for long-context interleaved video-language understanding. In Advances in Neural Information Processing Systems, pages 28828–28857, 2024.

[52] Junfei Wu, Jian Guan, Kaituo Feng, Qiang Liu, Shu Wu, Liang Wang, Wei Wu, and Tieniu Tan. Reinforcing spatial reasoning in vision-language models with interwoven thinking and visual drawing. arXiv preprint arXiv:2506.09965.

[53] Guowei Xu, Peng Jin, Ziang Wu, Hao Li, Yibing Song, Lichao Sun, and Li Yuan. LLaVA-CoT: Let vision language models reason step-by-step. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 2087–2098, 2025.

[54] Antoine Yang, Arsha Nagrani, Ivan Laptev, Josef Sivic, and Cordelia Schmid. VidChapters-7M: Video chapters at scale. Advances in Neural Information Processing Systems, 36:49428–49444, 2023.

[55] An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025.

[56] Zhongyu Yang, Junhao Song, Siyang Song, Wei Pang, and Yingfang Yuan. Mermaid: Multi-perspective self-reflective agents with generative augmentation for emotion recognition. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 24650–24666, 2025.

[57] Zuhao Yang, Yingchen Yu, Yunqing Zhao, Shijian Lu, and Song Bai. TimeExpert: An expert-guided video LLM for video temporal grounding. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 24286–24296, 2025.

[58] Boqiang Zhang, Kehan Li, Zesen Cheng, Zhiqiang Hu, Yuqian Yuan, Guanzheng Chen, Sicong Leng, Yuming Jiang, Hang Zhang, Xin Li, et al. VideoLLaMA 3: Frontier multimodal foundation models for image and video understanding. arXiv preprint arXiv:2501.13106, 2025.

[59] Haoji Zhang, Xin Gu, Jiawen Li, Chixiang Ma, Sule Bai, Chubin Zhang, Bowen Zhang, Zhichao Zhou, Dongliang He, and Yansong Tang. Thinking with videos: Multimodal tool-augmented reinforcement learning for long video reasoning. arXiv preprint arXiv:2508.04416, 2025.

[60] Kaichen Zhang, Bo Li, Peiyuan Zhang, Fanyi Pu, Joshua Adrian Cahyono, Kairui Hu, Shuai Liu, Yuanhan Zhang, Jingkang Yang, Chunyuan Li, et al. LMMs-Eval: Reality check on the evaluation of large multimodal models. In Findings of the Association for Computational Linguistics: NAACL 2025, pages 881–916, 2025.

[61] Kaichen Zhang, Keming Wu, Zuhao Yang, Kairui Hu, Bin Wang, Ziwei Liu, Xingxuan Li, and Lidong Bing. OpenMMReasoner: Pushing the frontiers for multimodal reasoning with an open and general recipe. arXiv preprint arXiv:2511.16334, 2025.

[62] Lianmin Zheng, Liangsheng Yin, Zhiqiang Xie, Chuyue Livia Sun, Jeff Huang, Cody Hao Yu, Shiyi Cao, Christos Kozyrakis, Ion Stoica, Joseph E Gonzalez, et al. SGLang: Efficient execution of structured language model programs. In Advances in Neural Information Processing Systems, pages 62557–62583, 2024.

[63] Ziwei Zheng, Michael Yang, Jack Hong, Chenxiao Zhao, Guohai Xu, Le Yang, Chao Shen, and Xing Yu. DeepEyes: Incentivizing “thinking with images” via reinforcement learning. arXiv preprint arXiv:2505.14362, 2025.

[64] Hao Zhong, Muzhi Zhu, Zongze Du, Zheng Huang, Canyu Zhao, Mingyu Liu, Wen Wang, Hao Chen, and Chunhua Shen. Omni-R1: Reinforcement learning for omnimodal reasoning via two-system collaboration. arXiv preprint arXiv:2505.20256, 2025.

[65] Luowei Zhou, Chenliang Xu, and Jason Corso. Towards automatic learning of procedures from web instructional videos. In Proceedings of the AAAI Conference on Artificial Intelligence, 2018.

LongVT: Incentivizing “Thinking with Long Videos” via Native Tool Calling
Supplementary Material

Outline

This Supplementary Material complements the main paper, prov...

LongVT Performs Human-Aligned Thinking like Leading Proprietary LMMs

The core philosophy of our proposed interleaved Multimodal Chain-of-Tool-Thought (iMCoTT) entails a “global-to-local” thinking pattern: the model first performs a coarse skim to formulate a hypothesis, and subsequently invokes the native crop_video() tool to inspect specific tempora...
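As a rough illustration of the global-to-local pattern, the sketch below computes frame timestamps for a sparse pass over a whole video and a denser pass over a cropped window. The helper name `sample_timestamps`, the 600-second duration, the frame rates, and the window bounds are all illustrative assumptions; this is not the paper's implementation of the cropping tool.

```python
def sample_timestamps(start: float, end: float, fps: float) -> list[float]:
    """Return evenly spaced timestamps (in seconds) covering [start, end) at `fps`."""
    if end <= start or fps <= 0:
        raise ValueError("need end > start and fps > 0")
    step = 1.0 / fps
    count = int(round((end - start) * fps))
    return [round(start + i * step, 3) for i in range(count)]


# Global skim: a hypothetical 600 s video sampled sparsely (one frame per 10 s).
global_pass = sample_timestamps(0.0, 600.0, fps=0.1)

# Local zoom: a hypothetical 10 s window resampled densely (two frames per second).
local_pass = sample_timestamps(120.0, 130.0, fps=2.0)
```

The point of the sketch is only the contrast in sampling density: the local pass spends its frame budget on a short window instead of the full timeline.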

What Motivates VideoSIAH? Unveiling the Data Contamination in Qwen-VL Series

With the rapid advancements of LMMs, model performance on various benchmarks has steadily improved. However, the “black-box” nature of training data raises a critical question: Do these improvements reflect genuine reasoning capability, or are they partly due to the model memo...

Additional VideoSIAH Details

| Source | Purpose | Samples |
|---|---|---|
| LLaVA-CoT [53] | General Visual Reasoning | 54,591 |
| OpenVLThinker [6] | Complex Reasoning | 2,829 |
| We-Math 2.0 [33] | Mathematical Reasoning | 602 |

Table 5. Detailed Statistics of Image-based CoT Data for Cold-Start SFT.

Breakdown of Image-based CoT Data. As detailed in Table 5, we construct a diverse mixture of image-...

Additional Methodological Details

Next-Token Prediction. During SFT, we train our model by minimizing the negative log-likelihood of the target tokens given their preceding context. For a sequence of tokens $x = (x_1, x_2, \ldots, x_T)$ and a model parameterized by $\theta$ that defines conditional probabilities $p_\theta(x_t \mid x_{<t})$, the loss function is defined as $\mathcal{L}(\theta) = -$ ...
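The excerpt is cut off before the loss is written out. A minimal sketch, assuming the standard token-level form $\mathcal{L}(\theta) = -\sum_{t=1}^{T} \log p_\theta(x_t \mid x_{<t})$ (an assumption, since the source formula is truncated), computes the loss from per-token probabilities. The function name `nll_loss` and the input `token_probs` are hypothetical.

```python
import math


def nll_loss(token_probs: list[float]) -> float:
    """Negative log-likelihood summed over target tokens.

    token_probs[t] stands for p_theta(x_t | x_<t), assumed to be precomputed
    by the model for each target token.
    """
    if any(p <= 0.0 or p > 1.0 for p in token_probs):
        raise ValueError("probabilities must lie in (0, 1]")
    return -sum(math.log(p) for p in token_probs)


# Three target tokens with assumed probabilities 0.5, 0.25 and 1.0.
loss = nll_loss([0.5, 0.25, 1.0])
```

Many training frameworks average this sum over the number of target tokens; whether the paper sums or averages cannot be read from the truncated excerpt.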

Reflection Trajectory: From Verbose Self-Correction to Internalized Tool Usage

We visualize the evolution of the model’s internal thought process in Figure 7 (left). Echoing the training dynamics observed in DeepEyes [63], the trajectory of reflection token proportion discloses a distinct three-phase evolution from exploratory correction to efficient t...

Additional Implementation Details

| Component | SFT | RL | RFT |
|---|---|---|---|
| Optimizer | AdamW [30] | AdamW | AdamW |
| Learning Rate (LR) | 5e-5 | 1e-6 | 5e-5 |
| LR Scheduler | cosine | constant | cosine |
| Weight Decay | 0.0 | 1e-2 | 0.0 |
| No. of Training Steps | 3000 | 160 | 1600 |
| No. of Warmup Steps | 300 | 0 | 160 |
| Max Length | 51200 | 52384 | 51200 |
| Dynamic Batch Size | True | False | True |
| Remove Padding | True | True | True |
| Liger Kernel | ... |

framework. To optimize training throughput and minimize memory overhead, we employ an online stream packing strategy on iterable datasets. Specifically, instead of padding individual sequences, we concatenate input samples to fill a fixed buffer size of 51,200 tokens, thereby eliminating redundant computation on padding tokens. Incoming data is ...
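The packing idea in this paragraph can be sketched as a greedy generator over an iterable of tokenized samples. Only the 51,200-token buffer size comes from the text; the name `pack_stream`, the rule of emitting the buffer when the next sample does not fit, and the truncation of oversize samples are assumptions made for illustration.

```python
from typing import Iterable, Iterator


def pack_stream(
    samples: Iterable[list[int]], buffer_size: int = 51_200
) -> Iterator[list[int]]:
    """Greedily concatenate token sequences into buffers of at most `buffer_size` tokens."""
    buffer: list[int] = []
    for tokens in samples:
        # Assumption: a sample longer than the buffer is truncated, not split.
        tokens = tokens[:buffer_size]
        # Emit the current buffer when the next sample would overflow it.
        if buffer and len(buffer) + len(tokens) > buffer_size:
            yield buffer
            buffer = []
        buffer = buffer + tokens
    if buffer:
        yield buffer


# Toy usage with a tiny buffer so the behaviour is easy to inspect.
packed = list(pack_stream([[1] * 4, [2] * 3, [3] * 5, [4] * 2], buffer_size=8))
```

A production packer would also need to record sample boundaries so that attention does not leak across packed sequences; that bookkeeping is omitted here.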

Inference Efficiency Analysis

Efficiency Analysis. We present a comparative analysis of inference latency across four benchmarks in Table 7. Despite incorporating multi-turn tool interactions, LongVT-7B-RFT demonstrates remarkable efficiency, achieving the lowest latency on VideoMMMU (1329.8 seconds) and LVBench (1509.3 seconds), and maintaining highly ...

Examples

Prompts and Data Examples. To enhance reproducibility and transparency, we provide concrete examples of the key resources used in our experiments. Figure 8 shows the RL prompt template, while Figure 9 presents the evaluation prompts used in LLM-as-a-Judge [55] for measuring the answer’s accuracy during RL. One representative sample from both SFT a...

Failure Case Analysis

To further illustrate the instability of the RL-only variant discussed in Section 5.3 of the main paper, we present a representative failure case. As shown in Figure 14, the model correctly recognizes the need to invoke a tool to inspect the glass coffee table. However, after receiving the resampled video frames, it fails to integr...

Limitation and Future Direction

While our efficiency analysis in Section 13 confirms that multi-turn tool interactions do not impose significant latency penalties, the memory footprint of such recursive reasoning remains a bottleneck. The single-agent architecture of LongVT is constrained by the inherent context window of the underlying LMM: as the nu...

Broader Impact

LongVT advances the field of long-video understanding by introducing an agentic framework capable of proactive evidence seeking and self-correction. By enabling LMMs to dynamically inspect and re-examine video segments, this work addresses critical reliability issues, such as hallucinations and temporal misalignment, that hinder the dep...

Ethical Considerations

Advancing Reliability and Safety. LongVT is explicitly designed to enhance the reliability of video LMMs by mitigating hallucinations through on-demand visual verification. By grounding answers in retrieved video evidence, the system reduces the likelihood of fabricating events or misinterpreting context, thereby fostering more t...


[12] [12]

GLM-4.5V and GLM-4.1V-Thinking: Towards Versatile Multimodal Reasoning with Scalable Reinforcement Learning

Wenyi Hong, Wenmeng Yu, Xiaotao Gu, Guo Wang, Guob- ing Gan, Haomiao Tang, Jiale Cheng, Ji Qi, Junhui Ji, Li- hang Pan, et al. Glm-4.5v and glm-4.1v-thinking: Towards versatile multimodal reasoning with scalable reinforcement learning.arXiv preprint arXiv:2507.01006, 2025. 3, 4

work page internal anchor Pith review Pith/arXiv arXiv 2025

[13] [13]

Video-MMMU: Evaluating Knowledge Acquisition from Multi-Discipline Professional Videos

Kairui Hu, Penghao Wu, Fanyi Pu, Wang Xiao, Yuanhan Zhang, Xiang Yue, Bo Li, and Ziwei Liu. Video-mmmu: Evaluating knowledge acquisition from multi-discipline pro- fessional videos.arXiv preprint arXiv:2501.13826, 2025. 2, 7, 8, 5

work page internal anchor Pith review Pith/arXiv arXiv 2025

[14] [14]

Multimodal 2gpretraining for dense video cap- tioning.arXiv preprint arXiv:2011.11760, 2020

Gabriel Huang, Bo Pang, Zhenhai Zhu, Clara Rivera, and Radu Soricut. Multimodal 2gpretraining for dense video cap- tioning.arXiv preprint arXiv:2011.11760, 2020. 2

work page arXiv 2011

[15] [15]

Vision-R1: Incentivizing Reasoning Capability in Multimodal Large Language Models

Wenxuan Huang, Bohan Jia, Zijie Zhai, Shaosheng Cao, Zheyu Ye, Fei Zhao, Zhe Xu, Yao Hu, and Shaohui Lin. Vision-r1: Incentivizing reasoning capability in multimodal large language models.arXiv preprint arXiv:2503.06749,

work page internal anchor Pith review Pith/arXiv arXiv

[16] [16]

GPT-4o System Card

Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perel- man, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Weli- hinda, Alan Hayes, Alec Radford, et al. Gpt-4o system card. arXiv preprint arXiv:2410.21276, 2024. 7, 8

work page internal anchor Pith review Pith/arXiv arXiv 2024

[17] [17]

OpenAI o1 System Card

Aaron Jaech, Adam Kalai, Adam Lerer, Adam Richard- son, Ahmed El-Kishky, Aiden Low, Alec Helyar, Aleksander Madry, Alex Beutel, Alex Carney, et al. Openai o1 system card.arXiv preprint arXiv:2412.16720, 2024. 3

work page internal anchor Pith review Pith/arXiv arXiv 2024

[18] [18]

Dense-captioning events in videos

Ranjay Krishna, Kenji Hata, Frederic Ren, Li Fei-Fei, and Juan Carlos Niebles. Dense-captioning events in videos. In Proceedings of the IEEE international conference on com- puter vision, pages 706–715, 2017. 2

work page 2017

[19] [19]

Efficient memory management for large language model serving with pagedattention

Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with pagedattention. InPro- ceedings of the 29th symposium on operating systems prin- ciples, pages 611–626, 2023. 5

work page 2023

[20] [20]

Mmr1: Enhancing multimodal reasoning with variance-aware sampling and open resources.arXiv preprint arXiv:2509.21268, 2025

Sicong Leng, Jing Wang, Jiaxi Li, Hao Zhang, Zhiqiang Hu, Boqiang Zhang, Yuming Jiang, Hang Zhang, Xin Li, Lidong Bing, et al. Mmr1: Enhancing multimodal reasoning with variance-aware sampling and open resources.arXiv preprint arXiv:2509.21268, 2025. 3

work page arXiv 2025

[21] [21]

Reinforcement learning outperforms supervised fine-tuning: A case study on audio question an- swering.arXiv preprint arXiv:2503.11197, 2025

Gang Li, Jizhong Liu, Heinrich Dinkel, Yadong Niu, Junbo Zhang, and Jian Luan. Reinforcement learning outperforms supervised fine-tuning: A case study on audio question an- swering.arXiv preprint arXiv:2503.11197, 2025. 3

work page arXiv 2025

[22] [22]

Getting more juice out of the sft data: Reward learning from human demonstration im- proves sft for llm alignment

Jiaxiang Li, Siliang Zeng, Hoi-To Wai, Chenliang Li, Al- fredo Garcia, and Mingyi Hong. Getting more juice out of the sft data: Reward learning from human demonstration im- proves sft for llm alignment. InAdvances in Neural Informa- tion Processing Systems, pages 124292–124318, 2024. 9

work page 2024

[23] [23]

Mvbench: A comprehensive multi-modal video understand- ing benchmark

Kunchang Li, Yali Wang, Yinan He, Yizhuo Li, Yi Wang, Yi Liu, Zun Wang, Jilan Xu, Guo Chen, Ping Luo, et al. Mvbench: A comprehensive multi-modal video understand- ing benchmark. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22195– 22206, 2024. 2

work page 2024

[24] [24]

VideoChat-R1: Enhancing Spatio-Temporal Perception via Reinforcement Fine-Tuning

Xinhao Li, Ziang Yan, Desen Meng, Lu Dong, Xiangyu Zeng, Yinan He, Yali Wang, Yu Qiao, Yi Wang, and Limin Wang. Videochat-r1: Enhancing spatio-temporal perception via reinforcement fine-tuning.arXiv preprint arXiv:2504.06958, 2025. 2, 3, 6

work page internal anchor Pith review Pith/arXiv arXiv 2025

[25] [25]

Im- proving llm video understanding with 16 frames per second

Yixuan Li, Changli Tang, Jimin Zhuang, Yudong Yang, Guangzhi Sun, Wei Li, Zejun Ma, and Chao Zhang. Im- proving llm video understanding with 16 frames per second. arXiv preprint arXiv:2503.13956, 2025. 2

work page arXiv 2025

[26] [26]

TempCompass: Do Video LLMs Really Understand Videos?

Yuanxin Liu, Shicheng Li, Yi Liu, Yuxiang Wang, Shuhuai Ren, Lei Li, Sishuo Chen, Xu Sun, and Lu Hou. Tempcom- pass: Do video llms really understand videos?arXiv preprint arXiv:2403.00476, 2024. 2

work page internal anchor Pith review Pith/arXiv arXiv 2024

[27] [27]

Seg-Zero: Reasoning-Chain Guided Segmentation via Cognitive Reinforcement

Yuqi Liu, Bohao Peng, Zhisheng Zhong, Zihao Yue, Fanbin Lu, Bei Yu, and Jiaya Jia. Seg-zero: Reasoning-chain guided segmentation via cognitive reinforcement.arXiv preprint arXiv:2503.06520, 2025. 3

work page internal anchor Pith review Pith/arXiv arXiv 2025

[28]

Visual-rft: Visual reinforcement fine-tuning

Ziyu Liu, Zeyi Sun, Yuhang Zang, Xiaoyi Dong, Yuhang Cao, Haodong Duan, Dahua Lin, and Jiaqi Wang. Visual-rft: Visual reinforcement fine-tuning. In Proceedings of the IEEE international conference on computer vision, 2025.

[29]

Lmms engine: A simple, unified multimodal framework for pretraining and finetuning

LMMs-Lab. Lmms engine: A simple, unified multimodal framework for pretraining and finetuning, 2025.

[30]

Decoupled weight decay regularization

Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In International Conference on Learning Representations, 2019.

[31]

MM-Eureka: Exploring the Frontiers of Multimodal Reasoning with Rule-based Reinforcement Learning

Fanqing Meng, Lingxiao Du, Zongkai Liu, Zhixiang Zhou, Quanfeng Lu, Daocheng Fu, Tiancheng Han, Botian Shi, Wenhai Wang, Junjun He, et al. Mm-eureka: Exploring the frontiers of multimodal reasoning with rule-based reinforcement learning. arXiv preprint arXiv:2503.07365, 2025.

[32]

Multi-agent tool-integrated policy optimization

Zhanfeng Mo, Xingxuan Li, Yuntao Chen, and Lidong Bing. Multi-agent tool-integrated policy optimization. arXiv preprint arXiv:2510.04678, 2025.

[33]

We-math 2.0: A versatile 10 mathbook system for incentivizing visual mathematical reasoning

Runqi Qiao, Qiuna Tan, Peiqing Yang, Yanzi Wang, Xiaowan Wang, Enhui Wan, Sitong Zhou, Guanting Dong, Yuchen Zeng, Yida Xu, et al. We-math 2.0: A versatile 10 mathbook system for incentivizing visual mathematical reasoning. arXiv preprint arXiv:2508.10433, 2025.

[34]

Timechat: A time-sensitive multimodal large language model for long video understanding

Shuhuai Ren, Linli Yao, Shicheng Li, Xu Sun, and Lu Hou. Timechat: A time-sensitive multimodal large language model for long video understanding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14313–14323, 2024.

[35]

DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024.

[36]

VLM-R1: A Stable and Generalizable R1-style Large Vision-Language Model

Haozhan Shen, Peng Liu, Jingcheng Li, Chunxin Fang, Yibo Ma, Jiajia Liao, Qiaoli Shen, Zilun Zhang, Kangjia Zhao, Qianqian Zhang, et al. Vlm-r1: A stable and generalizable r1-style large vision-language model. arXiv preprint arXiv:2504.07615, 2025.

[37]

Hybridflow: A flexible and efficient rlhf framework

Guangming Sheng, Chi Zhang, Zilingfeng Ye, Xibin Wu, Wang Zhang, Ru Zhang, Yanghua Peng, Haibin Lin, and Chuan Wu. Hybridflow: A flexible and efficient rlhf framework. In Proceedings of the Twentieth European Conference on Computer Systems, pages 1279–1297, 2025.

[38]

Moviechat: From dense token to sparse memory for long video understanding

Enxin Song, Wenhao Chai, Guanhong Wang, Yucheng Zhang, Haoyang Zhou, Feiyang Wu, Haozhe Chi, Xun Guo, Tian Ye, Yanting Zhang, et al. Moviechat: From dense token to sparse memory for long video understanding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18221–18232, 2024.

[39]

Pixel Reasoner: Incentivizing Pixel-Space Reasoning with Curiosity-Driven Reinforcement Learning

Alex Su, Haozhe Wang, Weiming Ren, Fangzhen Lin, and Wenhu Chen. Pixel reasoner: Incentivizing pixel-space reasoning with curiosity-driven reinforcement learning. arXiv preprint arXiv:2505.15966, 2025.

[40]

Reinforcement fine-tuning powers reasoning capability of multimodal large language models

Haoyuan Sun, Jiaqi Wu, Bo Xia, Yifu Luo, Yifei Zhao, Kai Qin, Xufei Lv, Tiantian Zhang, Yongzhe Chang, and Xueqian Wang. Reinforcement fine-tuning powers reasoning capability of multimodal large language models. arXiv preprint arXiv:2505.18536, 2025.

[41]

Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context

Gemini Team, Petko Georgiev, Ving Ian Lei, Ryan Burnell, Libin Bai, Anmol Gulati, Garrett Tanzer, Damien Vincent, Zhufeng Pan, Shibo Wang, et al. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context. arXiv preprint arXiv:2403.05530, 2024.

[42]

Introducing gpt-5

OpenAI Team. Introducing gpt-5. https://openai.com/index/introducing-gpt-5/, 2025.

[43]

Thinking with images

OpenAI Team. Thinking with images. https://openai.com/index/thinking-with-images/,

[44]

Qwen3-vl: Sharper vision, deeper thought, broader action

Qwen Team. Qwen3-vl: Sharper vision, deeper thought, broader action. https://qwen.ai/blog?from=research.latest-advancements-list&id=99f0335c4ad9ff6153e517418d48535ab6d8afef,

[45]

Ego-r1: Chain-of-tool-thought for ultra-long egocentric video reasoning

Shulin Tian, Ruiqi Wang, Hongming Guo, Penghao Wu, Yuhao Dong, Xiuying Wang, Jingkang Yang, Hao Zhang, Hongyuan Zhu, and Ziwei Liu. Ego-r1: Chain-of-tool-thought for ultra-long egocentric video reasoning. arXiv preprint arXiv:2506.13654, 2025.

[46]

Videorft: Incentivizing video reasoning capability in mllms via reinforced fine-tuning

Qi Wang, Yanrui Yu, Ye Yuan, Rui Mao, and Tianfei Zhou. Videorft: Incentivizing video reasoning capability in mllms via reinforced fine-tuning. arXiv preprint arXiv:2505.12434,

[47]

Video-thinker: Sparking "thinking with videos" via reinforcement learning

Shijian Wang, Jiarui Jin, Xingjian Wang, Linxin Song, Runhao Fu, Hecheng Wang, Zongyuan Ge, Yuan Lu, and Xuelian Cheng. Video-thinker: Sparking "thinking with videos" via reinforcement learning. arXiv preprint arXiv:2510.23473,

[48]

LVBench: An Extreme Long Video Understanding Benchmark

Weihan Wang, Zehai He, Wenyi Hong, Yean Cheng, Xiaohan Zhang, Ji Qi, Xiaotao Gu, Shiyu Huang, Bin Xu, Yuxiao Dong, et al. Lvbench: An extreme long video understanding benchmark. arXiv preprint arXiv:2406.08035, 2024.

[49]

Time-R1: Post-Training Large Vision Language Model for Temporal Video Grounding

Ye Wang, Ziheng Wang, Boshen Xu, Yang Du, Kejun Lin, Zihan Xiao, Zihao Yue, Jianzhong Ju, Liang Zhang, Dingyi Yang, et al. Time-r1: Post-training large vision language model for temporal video grounding. arXiv preprint arXiv:2503.13377, 2025.

[50]

Sari: Structured audio reasoning via curriculum-guided reinforcement learning

Cheng Wen, Tingwei Guo, Shuaijiang Zhao, Wei Zou, and Xiangang Li. Sari: Structured audio reasoning via curriculum-guided reinforcement learning. arXiv preprint arXiv:2504.15900, 2025.

[51]

Longvideobench: A benchmark for long-context interleaved video-language understanding

Haoning Wu, Dongxu Li, Bei Chen, and Junnan Li. Longvideobench: A benchmark for long-context interleaved video-language understanding. In Advances in Neural Information Processing Systems, pages 28828–28857, 2024.

[52]

Reinforcing Spatial Reasoning in Vision-Language Models with Interwoven Thinking and Visual Drawing

Junfei Wu, Jian Guan, Kaituo Feng, Qiang Liu, Shu Wu, Liang Wang, Wei Wu, and Tieniu Tan. Reinforcing spatial reasoning in vision-language models with interwoven thinking and visual drawing. arXiv preprint arXiv:2506.09965,

[53]

Llava-cot: Let vision language models reason step-by-step

Guowei Xu, Peng Jin, Ziang Wu, Hao Li, Yibing Song, Lichao Sun, and Li Yuan. Llava-cot: Let vision language models reason step-by-step. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 2087–2098, 2025.

[54]

Vidchapters-7m: Video chapters at scale

Antoine Yang, Arsha Nagrani, Ivan Laptev, Josef Sivic, and Cordelia Schmid. Vidchapters-7m: Video chapters at scale. Advances in Neural Information Processing Systems, 36:49428–49444, 2023.

[55]

Qwen3 Technical Report

An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025.

[56]

Mermaid: Multi-perspective self-reflective agents with generative augmentation for emotion recognition

Zhongyu Yang, Junhao Song, Siyang Song, Wei Pang, and Yingfang Yuan. Mermaid: Multi-perspective self-reflective agents with generative augmentation for emotion recognition. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 24650–24666, 2025.

[57]

Timeexpert: An expert-guided video llm for video temporal grounding

Zuhao Yang, Yingchen Yu, Yunqing Zhao, Shijian Lu, and Song Bai. Timeexpert: An expert-guided video llm for video temporal grounding. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 24286–24296, 2025.

[58]

VideoLLaMA 3: Frontier Multimodal Foundation Models for Image and Video Understanding

Boqiang Zhang, Kehan Li, Zesen Cheng, Zhiqiang Hu, Yuqian Yuan, Guanzheng Chen, Sicong Leng, Yuming Jiang, Hang Zhang, Xin Li, et al. Videollama 3: Frontier multimodal foundation models for image and video understanding. arXiv preprint arXiv:2501.13106, 2025.

[59]

Thinking with videos: Multimodal tool-augmented reinforcement learning for long video reasoning

Haoji Zhang, Xin Gu, Jiawen Li, Chixiang Ma, Sule Bai, Chubin Zhang, Bowen Zhang, Zhichao Zhou, Dongliang He, and Yansong Tang. Thinking with videos: Multimodal tool-augmented reinforcement learning for long video reasoning. arXiv preprint arXiv:2508.04416, 2025.

[60]

Lmms-eval: Reality check on the evaluation of large multimodal models

Kaichen Zhang, Bo Li, Peiyuan Zhang, Fanyi Pu, Joshua Adrian Cahyono, Kairui Hu, Shuai Liu, Yuanhan Zhang, Jingkang Yang, Chunyuan Li, et al. Lmms-eval: Reality check on the evaluation of large multimodal models. In Findings of the Association for Computational Linguistics: NAACL 2025, pages 881–916, 2025.

[61]

Openmmreasoner: Pushing the frontiers for multimodal reasoning with an open and general recipe

Kaichen Zhang, Keming Wu, Zuhao Yang, Kairui Hu, Bin Wang, Ziwei Liu, Xingxuan Li, and Lidong Bing. Openmmreasoner: Pushing the frontiers for multimodal reasoning with an open and general recipe. arXiv preprint arXiv:2511.16334, 2025.

[62]

Sglang: Efficient execution of structured language model programs

Lianmin Zheng, Liangsheng Yin, Zhiqiang Xie, Chuyue Livia Sun, Jeff Huang, Cody Hao Yu, Shiyi Cao, Christos Kozyrakis, Ion Stoica, Joseph E Gonzalez, et al. Sglang: Efficient execution of structured language model programs. In Advances in neural information processing systems, pages 62557–62583, 2024.

[63]

DeepEyes: Incentivizing "Thinking with Images" via Reinforcement Learning

Ziwei Zheng, Michael Yang, Jack Hong, Chenxiao Zhao, Guohai Xu, Le Yang, Chao Shen, and Xing Yu. Deepeyes: Incentivizing "thinking with images" via reinforcement learning. arXiv preprint arXiv:2505.14362, 2025.

[64]

Omni-r1: Reinforcement learning for omnimodal reasoning via two-system collaboration

Hao Zhong, Muzhi Zhu, Zongze Du, Zheng Huang, Canyu Zhao, Mingyu Liu, Wen Wang, Hao Chen, and Chunhua Shen. Omni-r1: Reinforcement learning for omnimodal reasoning via two-system collaboration. arXiv preprint arXiv:2505.20256, 2025.

[65]

Towards automatic learning of procedures from web instructional videos

Luowei Zhou, Chenliang Xu, and Jason Corso. Towards automatic learning of procedures from web instructional videos. In Proceedings of the AAAI conference on artificial intelligence, 2018.

LongVT: Incentivizing “Thinking with Long Videos” via Native Tool Calling

Supplementary Material

Outline

This Supplementary Material complements the main paper, prov...

LongVT Performs Human-Aligned Thinking like Leading Proprietary LMMs

The core philosophy of our proposed interleaved Multimodal Chain-of-Tool-Thought (iMCoTT) entails a “global-to-local” thinking pattern: the model first performs a coarse skim to formulate a hypothesis, and subsequently invokes the native crop_video() tool to inspect specific tempora...
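The global-to-local pattern described above (a coarse skim, then a finer look at a chosen window) can be illustrated with a minimal sketch of the sampling arithmetic a clip-cropping tool might perform. This is an assumption-laden illustration, not the LongVT implementation: the function name `resample_clip_timestamps`, its parameters, and the sampling rates are hypothetical.

```python
def resample_clip_timestamps(start_s, end_s, fps, video_duration_s):
    """Return timestamps (in seconds) sampled at `fps` inside [start_s, end_s].

    Hypothetical helper: names and behaviour are illustrative assumptions,
    not the interface of the tool described in the paper.
    """
    # Clamp the requested window to the bounds of the video.
    start = max(0.0, min(float(start_s), float(video_duration_s)))
    end = max(0.0, min(float(end_s), float(video_duration_s)))
    if fps <= 0:
        raise ValueError("fps must be positive")
    if end <= start:
        raise ValueError("clip window is empty after clamping")

    step = 1.0 / fps
    # Number of samples that fit in the window, including the start frame.
    count = int((end - start) * fps) + 1
    return [round(start + i * step, 3) for i in range(count)]


# Coarse skim of a 10-minute video: one frame every 8 seconds.
coarse = resample_clip_timestamps(0, 600, fps=0.125, video_duration_s=600)

# Zoom into a hypothesised 10-second window at 2 frames per second.
fine = resample_clip_timestamps(120, 130, fps=2, video_duration_s=600)
```

The coarse pass covers the whole video cheaply, while the second call spends the frame budget only on the window the model wants to verify.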

What Motivates VideoSIAH? Unveiling the Data Contamination in Qwen-VL Series

With the rapid advancements of LMMs, model performance on various benchmarks has steadily improved. However, the “black-box” nature of training data raises a critical question: Do these improvements reflect genuine reasoning capability, or are they partly due to the model memo...

Additional VideoSIAH Details

Source              Purpose                    Samples
LLaVA-CoT [53]      General Visual Reasoning   54,591
OpenVLThinker [6]   Complex Reasoning          2,829
We-Math 2.0 [33]    Mathematical Reasoning     602

Table 5. Detailed Statistics of Image-based CoT Data for Cold-Start SFT.

Breakdown of Image-based CoT Data. As detailed in Table 5, we construct a diverse mixture of image-...

Additional Methodological Details

Next-Token Prediction. During SFT, we train our model by minimizing the negative log-likelihood of the target tokens given their preceding context. For a sequence of tokens $x = (x_1, x_2, \ldots, x_T)$ and a model parameterized by $\theta$ that defines conditional probabilities $p_\theta(x_t \mid x_{<t})$, the loss function is defined as $L(\theta) = -$ ...
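Because the loss expression is truncated in this rendering, the sketch below uses the standard token-level negative log-likelihood that the sentence describes in words, i.e. the sum over t of -log p_θ(x_t | x_<t). Whether the paper sums or averages over tokens, and which tokens count as targets, is not visible here, so the `average` switch and the function name `nll_loss` are assumptions made for illustration.

```python
import math


def nll_loss(target_probs, average=False):
    """Negative log-likelihood of a token sequence.

    `target_probs[t]` is the probability the model assigned to the correct
    token at position t given the preceding tokens. The `average` flag is an
    illustrative assumption; the source does not show the normalisation.
    """
    if not target_probs:
        raise ValueError("need at least one target token")
    total = 0.0
    for p in target_probs:
        if not 0.0 < p <= 1.0:
            raise ValueError("probabilities must lie in (0, 1]")
        total -= math.log(p)
    return total / len(target_probs) if average else total


# A two-token sequence whose correct tokens received probability 0.5 and 0.25.
summed = nll_loss([0.5, 0.25])
mean = nll_loss([0.5, 0.25], average=True)
```

A perfectly predicted sequence (all probabilities equal to 1) gives a loss of zero, and the loss grows as the model assigns less probability to the target tokens.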

Reflection Trajectory: From Verbose Self-Correction to Internalized Tool Usage

We visualize the evolution of the model’s internal thought process in Figure 7 (left). Echoing the training dynamics observed in DeepEyes [63], the trajectory of reflection token proportion discloses a distinct three-phase evolution from exploratory correction to efficient t...

Additional Implementation Details

Component              SFT          RL         RFT
Optimizer              AdamW [30]   AdamW      AdamW
Learning Rate (LR)     5e-5         1e-6       5e-5
LR Scheduler           cosine       constant   cosine
Weight Decay           0.0          1e-2       0.0
No. of Training Steps  3000         160        1600
No. of Warmup Steps    300          0          160
Max Length             51200        52384      51200
Dynamic Batch Size     True         False      True
Remove Padding         True         True       True
Liger Kernel ...

framework. To optimize training throughput and minimize memory overhead, we employ an online stream packing strategy on iterable datasets. Specifically, instead of padding individual sequences, we concatenate input samples to fill a fixed buffer size of 51,200 tokens, thereby eliminating redundant computation on padding tokens. Incoming data is ...
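The packing idea above can be sketched as a small generator over an iterable dataset. The source text is cut off before it explains how incoming data that overflows the buffer is handled, so two choices below are assumptions rather than the authors' method: a sample that does not fit starts a new buffer, and a sample longer than the buffer is truncated. The name `pack_stream` is likewise hypothetical.

```python
from typing import Iterable, Iterator, List


def pack_stream(
    samples: Iterable[List[int]], buffer_size: int = 51_200
) -> Iterator[List[List[int]]]:
    """Greedily group tokenised samples into packs of at most `buffer_size` tokens.

    Each yielded pack is a list of samples, so sample boundaries are kept.
    Overflow handling is an assumption (see the lead-in), not taken from the paper.
    """
    if buffer_size <= 0:
        raise ValueError("buffer_size must be positive")

    pack: List[List[int]] = []
    used = 0
    for sample in samples:
        tokens = list(sample)[:buffer_size]  # assumption: truncate oversize samples
        if not tokens:
            continue
        if used + len(tokens) > buffer_size:
            # The current pack is full enough; emit it and start a new one.
            yield pack
            pack, used = [], 0
        pack.append(tokens)
        used += len(tokens)
    if pack:
        yield pack


# Toy stream with a tiny buffer so the behaviour is easy to follow.
toy_stream = iter([[1] * 3, [2] * 4, [3] * 5, [4] * 10, [5] * 2])
packs = list(pack_stream(toy_stream, buffer_size=8))
pack_lengths = [sum(len(s) for s in p) for p in packs]
```

Because the function only iterates once over its input, it works on streaming datasets that cannot be indexed or sorted by length in advance, which is the setting the paragraph describes.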

Inference Efficiency Analysis

Efficiency Analysis. We present a comparative analysis of inference latency across four benchmarks in Table 7. Despite incorporating multi-turn tool interactions, LongVT-7B-RFT demonstrates remarkable efficiency, achieving the lowest latency on VideoMMMU (1329.8 seconds) and LVBench (1509.3 seconds), and maintaining highly ...

Examples

Prompts and Data Examples. To enhance reproducibility and transparency, we provide concrete examples of the key resources used in our experiments. Figure 8 shows the RL prompt template, while Figure 9 presents the evaluation prompts used in LLM-as-a-Judge [55] for measuring answer accuracy during RL. One representative sample from both SFT a...

Failure Case Analysis

To further illustrate the instability of the RL-only variant discussed in Section 5.3 of the main paper, we present a representative failure case. As shown in Figure 14, the model correctly recognizes the need to invoke a tool to inspect the glass coffee table. However, after receiving the resampled video frames, it fails to integr...

Limitation and Future Direction

While our efficiency analysis in Section 13 confirms that multi-turn tool interactions do not impose significant latency penalties, the memory footprint of such recursive reasoning remains a bottleneck. The single-agent architecture of LongVT is constrained by the inherent context window of the underlying LMM: as the nu...

Broader Impact

LongVT advances the field of long-video understanding by introducing an agentic framework capable of proactive evidence seeking and self-correction. By enabling LMMs to dynamically inspect and re-examine video segments, this work addresses critical reliability issues, such as hallucinations and temporal misalignment, that hinder the dep...

Ethical Considerations

Advancing Reliability and Safety. LongVT is explicitly designed to enhance the reliability of video LMMs by mitigating hallucinations through on-demand visual verification. By grounding answers in retrieved video evidence, the system reduces the likelihood of fabricating events or misinterpreting context, thereby fostering more t...