
arxiv: 2605.16079 · v1 · pith:NCDFAWGEnew · submitted 2026-05-15 · 💻 cs.CV · cs.AI · cs.HC

VideoSeeker: Incentivizing Instance-level Video Understanding via Native Agentic Tool Invocation

Pith reviewed 2026-05-20 19:32 UTC · model grok-4.3

classification 💻 cs.CV · cs.AI · cs.HC
keywords instance-level video understanding · agentic tool invocation · vision-language models · visual prompts · data synthesis pipeline · reinforcement learning · proactive perception

The pith

VideoSeeker lets vision-language models proactively call visual tools to locate and understand specific instances in video.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents VideoSeeker as a way to move beyond text prompts in video understanding tasks that require exact spatial and temporal references to objects or events. Standard large vision-language models struggle here because they reason mainly from language rather than actively examining fine-grained visual evidence. VideoSeeker builds in agentic tool invocation so the model can request and retrieve relevant video segments on its own. A four-stage automated pipeline creates the necessary training data at scale, followed by cold-start supervision and reinforcement learning to embed the new behavior. If the approach holds, models gain clear advantages on precise instance-level tasks and show carry-over benefits to broader video benchmarks.

Core claim

VideoSeeker shows that native agentic tool invocation, combined with visual prompts, allows large vision-language models to perform proactive perception and retrieval of instance-level video content. The model internalizes this capability through a four-stage automated data synthesis pipeline plus cold-start supervision and RL training, delivering an average 13.7 percent improvement over baselines on instance-level video understanding tasks and outperforming closed-source systems such as GPT-4o and Gemini-2.5-Pro while transferring to general video understanding benchmarks.

What carries the argument

Native agentic tool invocation, the mechanism that lets the model decide when and how to perceive and retrieve specific video segments through visual prompts rather than relying on external text instructions.
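
As a rough illustration of the mechanism, the loop below shows a model that either requests a video segment or commits to an answer on each turn. It is a minimal sketch under stated assumptions: `crop_video`, `toy_policy`, `ToolCall`, and `Answer` are hypothetical names invented here, the policy is a hand-written stand-in for the trained model, and the paper's actual tool set and call format are not given in this summary.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple, Union

Segment = Tuple[float, float]  # stand-in for the decoded frames of a clip


@dataclass
class ToolCall:
    name: str
    start_s: float
    end_s: float


@dataclass
class Answer:
    text: str


Action = Union[ToolCall, Answer]


def crop_video(start_s: float, end_s: float) -> Segment:
    """Hypothetical tool: a real system would return frames for this span."""
    return (start_s, end_s)


def toy_policy(history: List[object]) -> Action:
    """Stand-in for the model: request one segment, then answer from it."""
    observed = [h for h in history if isinstance(h, tuple)]
    if not observed:
        return ToolCall("crop_video", 12.0, 18.0)
    start_s, end_s = observed[-1]
    return Answer(f"target seen between {start_s:.0f}s and {end_s:.0f}s")


def run_agent(
    query: str,
    policy: Callable[[List[object]], Action],
    tools: Dict[str, Callable[[float, float], Segment]],
    max_turns: int = 4,
) -> str:
    """Alternate between policy decisions and tool observations."""
    history: List[object] = [query]
    for _ in range(max_turns):
        action = policy(history)
        if isinstance(action, Answer):
            return action.text
        # The policy chose to look at the video: run the tool, keep the result.
        history.append(tools[action.name](action.start_s, action.end_s))
    return "no answer within turn budget"
```

The point of the design is that the decision to look at the video sits inside the policy, not in an external controller; the turn budget is an assumption added here to keep the loop finite.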

If this is right

  • Models achieve an average 13.7 percent gain on instance-level video understanding tasks that demand precise spatiotemporal localization.
  • Performance exceeds that of closed-source models including GPT-4o and Gemini-2.5-Pro on those tasks.
  • Capabilities transfer to general video understanding benchmarks beyond the instance-level setting.
  • User interaction improves because visual prompts replace imprecise text descriptions for referencing specific video content.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same tool-invocation pattern could apply to other sequential data such as audio streams or sensor time series where proactive localization of events matters.
  • Real-time applications like live monitoring or robotics might benefit if the internalized tool calling reduces latency compared with external prompting systems.
  • The automated data pipeline could be reused to bootstrap similar capabilities in new domains without large manual annotation efforts.

Load-bearing premise

The four-stage automated pipeline can produce enough high-quality instance-level video data to make cold-start supervision and reinforcement learning effective.
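
The four stages named in the paper's Figure 2 (text filtering, video-level verification, pixel-level mask generation, visual prompt rendering) can be pictured as a chain of filters in which a sample is dropped as soon as any stage rejects it. The sketch below is illustrative only: every stage body is a placeholder assumption, the field names are invented, and the mask stage returns a fixed box instead of calling SAM3.

```python
from typing import Callable, Dict, List, Optional

Sample = Dict[str, object]
Stage = Callable[[Sample], Optional[Sample]]  # None means "reject"


def text_filter(s: Sample) -> Optional[Sample]:
    # Stage 1 stand-in: keep QA pairs that name a concrete target.
    return s if s.get("target") else None


def video_verify(s: Sample) -> Optional[Sample]:
    # Stage 2 stand-in: require a unique target, then attach a semantic tag.
    if s.get("target_count") != 1:
        return None
    return {**s, "tag": str(s["target"])}


def mask_generate(s: Sample) -> Optional[Sample]:
    # Stage 3 stand-in: a fixed box where a segmentation model would run.
    return {**s, "mask": (0, 0, 10, 10)}


def render_and_rewrite(s: Sample) -> Optional[Sample]:
    # Stage 4 stand-in: make the question depend on the visual prompt.
    question = str(s["question"]).replace(str(s["target"]), "<vp>")
    return {**s, "question": question, "prompt_type": "box"}


def run_pipeline(samples: List[Sample], stages: List[Stage]) -> List[Sample]:
    """Apply stages in order; drop a sample at the first rejection."""
    kept: List[Sample] = []
    for sample in samples:
        current: Optional[Sample] = sample
        for stage in stages:
            current = stage(current)
            if current is None:
                break
        if current is not None:
            kept.append(current)
    return kept


STAGES: List[Stage] = [text_filter, video_verify, mask_generate, render_and_rewrite]
```

Ordering the cheap text-only check first means the expensive video and mask stages only see samples that already passed it, which is presumably why the first stage is described as low-cost.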

What would settle it

Running the trained model on the same instance-level video understanding benchmarks without the agentic tool-calling behavior and finding performance equal to or below standard text-prompt baselines would disprove the central claim.

Figures

Figures reproduced from arXiv: 2605.16079 by Feng Zhao, Jiawei Zhao, Jiayin Cai, Lin Chen, Qing Miao, Qisheng Su, Wenxuan Huang, Xiaolong Jiang, Yao Hu, Yiming Zhao, Yukun Qi, Yu Zeng, Zehui Chen, Zhen Fang.

Figure 1: Overview of VideoSeeker. (A): Instance-level video understanding tasks require models to accurately locate and reason about specific instances in videos guided by visual prompts, given a video, a visual prompt frame, and a query. Compared to text-only prompts that require lengthy referential descriptions, visual prompts provide a more intuitive interaction method. (B): Pipeline overview. We design a four-s…
Figure 2: Our Data Pipeline. (1) Low-cost Text Filtering rapidly filters pure text QA pairs; (2) Video-level Verification verifies target uniqueness and generates semantic tags; (3) Pixel-level Mask Generation produces pixel-wise masks via SAM3; (4) Visual Prompt Rendering renders diverse visual prompt types and rewrites QA to depend on them. In a nutshell, our contributions are as follows: • We propose VideoSeeker,…
Figure 3: Effect of Data Scale. Data Ablation. We construct several subsets by progressively increasing the sampling ratio from the full training corpus to investigate the impact of SFT data scale on model performance. As shown in …
Figure 4: Distillation Paradox. The heterogeneous distillation paradox: stronger teachers may produce weaker students. We experiment with two teacher models: Qwen3-VL-235B-A22B-Thinking and Gemini-3.1-Pro, achieving 78.4% and 83.8% accuracy on the rejection-sampled dataset respectively. After SFT training Qwen3-VL-8B, the resulting student models achieve 70.4% and 64.7% on V2P-Bench, with relative performance deg…
Figure 5: Inference Latency. Time Efficiency. We uniformly evaluate inference costs under the Agent mode. As illustrated in …
Figure 6: RL training curves.
Figure 7: Case Study 1. The model invokes tools to proactively perceive instances and retrieve video …
Figure 8: Case Study 2. The question only requires visual cue information, so the model adaptively …
Figure 9: Prompt of Text Filtering. Video Verification Prompt System Prompt: You are a video understanding expert specializing in visual prompting data construction. Your task: Given a video and an existing question-answer pair about the video, analyze whether the question targets a specific, uniquely identifiable object/person in the video, and if so, produce structured metadata for constructing visual-prompted QA …
Figure 10: Prompt of Video Verification. Rendering and Rewrite Prompt System Prompt: You are an expert dataset writer for visual prompted video QA. Your job is to rewrite QA text so that answering requires the visual prompt on the frame, not the target’s original textual attributes. Context: - Upstream data already contains: question, options - The placeholder <vp> marks where the target reference should be replaced…
Figure 11: Prompt of Rendering and Rewriting.
Original abstract

Large Vision-Language Models (LVLMs) have shown significant progress in video understanding, yet they face substantial challenges in tasks requiring precise spatiotemporal localization at the instance level. Existing methods primarily rely on text prompts for human-model interaction, but these prompts struggle to provide precise spatial and temporal references, resulting in poor user experience. Furthermore, current approaches typically decouple visual perception from language reasoning, centering reasoning around language rather than visual content, which limits the model's ability to proactively perceive fine-grained visual evidence. To address these challenges, we propose VideoSeeker, a novel paradigm for instance-level video understanding through visual prompts. VideoSeeker seamlessly integrates agentic reasoning with instance-level video understanding tasks, enabling the model to proactively perceive and retrieve relevant video segments on demand. We construct a four-stage fully automated data synthesis pipeline to efficiently generate large-scale, high-quality instance-level video data. We internalize tool-calling and proactive perception capabilities into the model via cold-start supervision and RL training, building a powerful video understanding model. Experiments demonstrate that our model achieves an average improvement of +13.7% over baselines on instance-level video understanding tasks, surpassing powerful closed-source models such as GPT-4o and Gemini-2.5-Pro, while also showing effective transferability on general video understanding benchmarks. The relevant datasets and code will be released publicly.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces VideoSeeker, a paradigm for instance-level video understanding in Large Vision-Language Models that integrates agentic reasoning with visual prompts to enable proactive perception and retrieval of relevant video segments. It describes a four-stage fully automated data synthesis pipeline to generate large-scale instance-level video data, which supports cold-start supervision followed by RL training to internalize tool-calling capabilities. The central empirical claim is an average +13.7% improvement over baselines on instance-level video understanding tasks, with outperformance of closed-source models such as GPT-4o and Gemini-2.5-Pro, plus effective transfer to general video understanding benchmarks. The authors commit to public release of the relevant datasets and code.

Significance. If the performance claims are substantiated with rigorous experimental details, this work could meaningfully advance video understanding by moving beyond text-prompt interactions toward native agentic tool use for precise spatiotemporal localization. The combination of automated data synthesis, cold-start, and RL to incentivize proactive visual perception addresses a recognized limitation in current LVLMs. The planned public release of datasets and code is a clear strength that would support reproducibility and community follow-up.

major comments (2)
  1. [Abstract] Abstract: The headline claim of a +13.7% average improvement (and outperformance of GPT-4o and Gemini-2.5-Pro) is presented without any description of the experimental protocol, specific baselines, task definitions, number of evaluation instances, statistical tests, or error bars. This absence prevents verification of the central performance result and its attribution to native agentic tool invocation.
  2. [Abstract / Methods] Four-stage data synthesis pipeline (Abstract and Methods): The pipeline is asserted to be fully automated and to produce high-quality instance-level supervision suitable for cold-start and RL, yet no quantitative validation is reported (e.g., precision/recall on generated bounding boxes or temporal intervals, inter-annotator agreement, or failure-mode analysis). Without such checks, it is unclear whether systematic spatial/temporal misalignments in the synthetic labels could inflate downstream benchmark scores.
minor comments (2)
  1. [Abstract] Abstract: The phrase 'native agentic tool invocation' is introduced without a brief parenthetical gloss; adding one sentence clarifying how tool calling differs from standard text prompting would improve immediate readability for a broad CV audience.
  2. [Abstract] Abstract: The transferability claim on 'general video understanding benchmarks' would benefit from naming the specific benchmarks (e.g., ActivityNet, MSVD) even at the abstract level.
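
The error bars requested in major comment 1 could be produced with a paired bootstrap over per-item correctness. The sketch below shows one common way to do this and is not the paper's protocol: the function name, the 0/1 correctness encoding, the resample count, and the percentile interval are all assumptions made here.

```python
import random
from typing import List, Tuple


def paired_bootstrap_ci(
    base: List[int],
    model: List[int],
    n_boot: int = 2000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float, float]:
    """Return (mean gain, lower bound, upper bound) for model minus baseline.

    `base` and `model` hold 0/1 correctness for the same items in the same order.
    """
    if len(base) != len(model) or not base:
        raise ValueError("inputs must be non-empty and the same length")
    rng = random.Random(seed)
    n = len(base)
    diffs = [m - b for b, m in zip(base, model)]
    means = []
    for _ in range(n_boot):
        # Resample items with replacement, keeping each pair together.
        resample = [diffs[rng.randrange(n)] for _ in range(n)]
        means.append(sum(resample) / n)
    means.sort()
    lower = means[int((alpha / 2) * n_boot)]
    upper = means[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return sum(diffs) / n, lower, upper
```

Resampling the paired differences, not each system separately, keeps item difficulty matched between the two systems, which usually tightens the interval when both are scored on the same benchmark items.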

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their insightful comments and for recognizing the potential of VideoSeeker to advance instance-level video understanding. We provide detailed responses to the major comments below and outline the revisions we will make to address the concerns raised.

Point-by-point responses
  1. Referee: [Abstract] Abstract: The headline claim of a +13.7% average improvement (and outperformance of GPT-4o and Gemini-2.5-Pro) is presented without any description of the experimental protocol, specific baselines, task definitions, number of evaluation instances, statistical tests, or error bars. This absence prevents verification of the central performance result and its attribution to native agentic tool invocation.

    Authors: The experimental details supporting the +13.7% improvement, including the specific baselines (such as open-source LVLMs and closed-source models like GPT-4o), task definitions for instance-level understanding, evaluation instances, and results with statistical significance, are comprehensively described in the Experiments section (Section 4) and associated tables. We agree that including a high-level summary of the protocol in the abstract would improve readability. We will revise the abstract to incorporate a concise description of the evaluation setup and direct readers to the main text for full details, thereby strengthening the presentation of our central claims. revision: yes

  2. Referee: [Abstract / Methods] Four-stage data synthesis pipeline (Abstract and Methods): The pipeline is asserted to be fully automated and to produce high-quality instance-level supervision suitable for cold-start and RL, yet no quantitative validation is reported (e.g., precision/recall on generated bounding boxes or temporal intervals, inter-annotator agreement, or failure-mode analysis). Without such checks, it is unclear whether systematic spatial/temporal misalignments in the synthetic labels could inflate downstream benchmark scores.

    Authors: We recognize the value of providing quantitative evidence for the quality of our automated data synthesis pipeline. The four-stage pipeline is detailed in Section 3, emphasizing its fully automated nature for scalability. To address the referee's concern, we will include additional validation results in the revised Methods section, such as precision and recall metrics computed on a sample of generated annotations verified against human judgments, as well as a discussion of potential failure modes and how they are mitigated. This addition will confirm the reliability of the synthetic data and its contribution to the observed performance gains. revision: yes
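
The precision and recall check promised in the second response could, for temporal intervals, take roughly the following form. This is a sketch under assumptions not found in the source: intervals are (start, end) pairs in seconds, predictions are matched greedily and one-to-one to human-verified intervals, and a match requires IoU of at least 0.5.

```python
from typing import List, Tuple

Interval = Tuple[float, float]


def interval_iou(a: Interval, b: Interval) -> float:
    """Intersection over union of two closed time intervals."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def precision_recall(
    pred: List[Interval], gold: List[Interval], thr: float = 0.5
) -> Tuple[float, float]:
    """Greedy one-to-one matching of predictions to gold intervals."""
    matched = set()
    true_pos = 0
    for p in pred:
        best_j, best_iou = None, thr
        for j, g in enumerate(gold):
            if j in matched:
                continue
            iou = interval_iou(p, g)
            if iou >= best_iou:
                best_j, best_iou = j, iou
        if best_j is not None:
            matched.add(best_j)
            true_pos += 1
    precision = true_pos / len(pred) if pred else 0.0
    recall = true_pos / len(gold) if gold else 0.0
    return precision, recall
```

The same structure carries over to bounding boxes or masks by swapping `interval_iou` for a spatial IoU; the threshold is a free choice and would need to be reported alongside the numbers.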

Circularity Check

0 steps flagged

No significant circularity; performance claims are direct experimental outcomes

Full rationale

The paper presents an engineering approach consisting of a four-stage automated data synthesis pipeline, cold-start supervision, and RL training to internalize tool-calling capabilities, followed by empirical evaluation reporting a +13.7% average improvement on instance-level video understanding tasks. No mathematical derivations, equations, or first-principles predictions appear in the provided text. The claimed results are framed as outcomes of running the trained model on benchmarks rather than quantities defined in terms of fitted parameters or reduced by construction to the pipeline inputs. No self-citation load-bearing steps, uniqueness theorems, or ansatzes are invoked in the abstract or description. The central claims rest on standard ML training and evaluation procedures that remain independent of the reported metrics.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central performance claim depends on the unverified quality of the automated data pipeline and the effectiveness of cold-start plus RL in internalizing tool-calling behavior; these are not independently evidenced in the provided text.

axioms (1)
  • [domain assumption] Large Vision-Language Models have shown significant progress in video understanding yet face challenges in precise spatiotemporal localization.
    Stated in the opening of the abstract as background.
invented entities (1)
  • VideoSeeker paradigm (no independent evidence)
    Purpose: Seamless integration of agentic reasoning with instance-level video understanding via visual prompts and native tool invocation.
    Newly introduced framework whose effectiveness is asserted but not demonstrated beyond the abstract claim.

pith-pipeline@v0.9.0 · 5819 in / 1221 out tokens · 54504 ms · 2026-05-20T19:32:18.149705+00:00 · methodology

