Pith · machine review for the scientific record

arxiv: 2504.06958 · v5 · submitted 2025-04-09 · 💻 cs.CV

Recognition: 2 Lean theorem links

VideoChat-R1: Enhancing Spatio-Temporal Perception via Reinforcement Fine-Tuning

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 20:52 UTC · model grok-4.3

classification 💻 cs.CV
keywords: reinforcement fine-tuning · video multimodal models · spatio-temporal perception · temporal grounding · object tracking · video reasoning · rule-based rewards

The pith

Reinforcement fine-tuning with rule-based temporal rewards creates a video model with state-of-the-art spatio-temporal perception.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether adding targeted rule-based rewards during reinforcement fine-tuning can strengthen how multimodal models handle temporal and spatial relations in videos. The authors apply this jointly across several perception tasks to produce VideoChat-R1. The resulting model records large gains on temporal grounding and object tracking while also lifting scores on general question-answering benchmarks. It keeps the base model's chat abilities intact and supports a new inference method called temporal clue-driven reasoning. The work aims at data-efficient ways to build more reliable video dialogue systems.

Core claim

By applying reinforcement fine-tuning with rule-based rewards that target temporal associations across multiple spatio-temporal perception tasks, the resulting VideoChat-R1 model achieves substantial improvements in video understanding tasks such as temporal grounding and object tracking, while preserving or enhancing performance on general QA benchmarks and enabling a temporal clue-driven reasoning approach for dialogue.

What carries the argument

Reinforcement Fine-Tuning (RFT) driven by rule-based rewards that emphasize long-range temporal associations in video data.
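The paper's reward equations are not reproduced on this page. As a purely illustrative sketch (the IoU form, the span representation, and the 0.1 format bonus below are assumptions, not the paper's actual formulation), a deterministic rule-based reward for temporal grounding can be computed directly from a predicted and a ground-truth time span:

```python
# Illustrative rule-based temporal reward for RFT on temporal grounding.
# Assumed design: temporal IoU between predicted and gold (start, end)
# spans, plus a small bonus when the model's answer parses cleanly.

def temporal_iou(pred, gold):
    """IoU between two (start, end) intervals in seconds."""
    inter = max(0.0, min(pred[1], gold[1]) - max(pred[0], gold[0]))
    union = max(pred[1], gold[1]) - min(pred[0], gold[0])
    return inter / union if union > 0 else 0.0

def rule_based_reward(pred_span, gold_span, answer_well_formed):
    """Deterministic reward: IoU term plus a format-compliance bonus."""
    reward = temporal_iou(pred_span, gold_span)
    if answer_well_formed:  # e.g. the output parsed as "start to end"
        reward += 0.1
    return reward

print(rule_based_reward((2.0, 8.0), (4.0, 10.0), True))
```

Because such a reward is computed by rule rather than by a learned critic, it needs no preference data, which is the mechanism behind the data-efficiency claim.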

If this is right

  • Large gains on temporal grounding (+31.8) and object tracking (+31.2).
  • Better results on general QA benchmarks.
  • More reliable video dialogue systems.
  • A new temporal clue-driven reasoning schema for inference.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same reward design could transfer to other sequential data types such as audio or sensor streams.
  • It may lower the data volume needed to adapt video models for new domains.
  • Testing on real-time streaming video would show whether the gains survive latency constraints.

Load-bearing premise

The rule-based rewards on temporal associations will create broad video-reasoning gains that hold up on new cases rather than relying on hidden data artifacts.

What would settle it

Evaluation on a fresh set of video tasks or longer sequences where the model shows no gain or a drop relative to the base model without the reinforcement step.

Original abstract

Reinforcement Learning (RL) benefits Large Language Models (LLMs) for complex reasoning. Inspired by this, we explore integrating spatio-temporal specific rewards into Multimodal Large Language Models (MLLMs) to address the unique challenges of video understanding, such as long-range temporal associations. This paper investigates how rule-based rewards, particularly temporal ones, can improve video reasoning and their generalizability. Our study proposes Reinforcement Fine-Tuning (RFT) as a data-efficient method to enhance video reasoning on specific tasks without sacrificing original capabilities. Through joint RFT on multiple spatio-temporal perception tasks, we developed VideoChat-R1, a powerful Video MLLM. VideoChat-R1 achieves state-of-the-art spatio-temporal perception, demonstrating significant improvements in tasks like temporal grounding (+31.8) and object tracking (+31.2), while also improving general QA benchmarks. The enhanced perception and preserved chat abilities contribute to a more reliable video dialogue system, leading to our "Temporal Clue-driven Reasoning" inference schema. This work provides a foundation for developing robust, real-world video comprehension agents.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces VideoChat-R1, a Video Multimodal Large Language Model enhanced via Reinforcement Fine-Tuning (RFT) that incorporates rule-based rewards targeting spatio-temporal perception tasks such as long-range temporal associations. It claims state-of-the-art results on video understanding benchmarks, with reported gains of +31.8 on temporal grounding and +31.2 on object tracking, alongside improvements on general QA tasks, preservation of chat capabilities, and a new 'Temporal Clue-driven Reasoning' inference schema.

Significance. If the gains prove robust under matched baselines and non-circular reward definitions, the work would offer a data-efficient RL approach for video MLLMs that improves targeted perception without degrading general capabilities. The joint multi-task RFT and inference schema could provide a practical template for building more reliable video dialogue agents.

major comments (2)
  1. [Abstract] The reported deltas (+31.8 temporal grounding, +31.2 object tracking) are presented without any definition or equation for the rule-based temporal rewards, so it is impossible to determine whether the gains arise from RFT or from direct optimization of the evaluation metrics themselves.
  2. [Experiments] No information is given on training data volume or composition, on whether baselines received identical supervised fine-tuning on the same splits, or on ablation controls that isolate the contribution of the spatio-temporal rewards versus standard SFT.
minor comments (2)
  1. [Abstract] The acronym RFT is used without initial expansion; add '(Reinforcement Fine-Tuning)' on first use.
  2. [Tables/Figures] Figure captions and table headers should explicitly state the exact evaluation metrics (e.g., mIoU, accuracy) and baseline model versions for each reported number.
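To make the requested metric definitions concrete: mIoU and R1@0.5 for temporal grounding are conventionally computed as below. This is a generic sketch of the standard definitions, not necessarily the paper's exact evaluation code.

```python
# Generic temporal-grounding metrics (sketch): mean IoU over queries, and
# R1@0.5, the fraction of queries whose top prediction reaches IoU >= 0.5.
# Whether the paper reports exactly these variants is an assumption.

def iou(a, b):
    """IoU of two (start, end) intervals in seconds."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = max(a[1], b[1]) - min(a[0], b[0])
    return inter / union if union > 0 else 0.0

def miou(preds, golds):
    """Mean IoU across paired predicted/gold spans."""
    return sum(iou(p, g) for p, g in zip(preds, golds)) / len(preds)

def recall_at(preds, golds, thresh=0.5):
    """R1@thresh: share of queries whose prediction clears the IoU bar."""
    return sum(iou(p, g) >= thresh for p, g in zip(preds, golds)) / len(preds)

preds = [(0.0, 4.0), (10.0, 20.0)]
golds = [(2.0, 4.0), (10.0, 15.0)]
print(miou(preds, golds), recall_at(preds, golds))
```

Stating which of these variants each table reports, and at which thresholds, is exactly the disambiguation the minor comment asks for.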

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address the major comments point-by-point below and will incorporate clarifications and additional details in the revised manuscript.

Point-by-point responses
  1. Referee: [Abstract] The reported deltas (+31.8 temporal grounding, +31.2 object tracking) are presented without any definition or equation for the rule-based temporal rewards, so it is impossible to determine whether the gains arise from RFT or from direct optimization of the evaluation metrics themselves.

    Authors: We agree the abstract omits the reward definitions. In the revision we will insert a concise description of the rule-based temporal rewards (e.g., verification of long-range temporal associations and object-relation consistency via deterministic rules) and note that these rewards are distinct from the downstream evaluation metrics. The full mathematical formulation appears in Section 3.2; the observed gains result from the RL optimization process rather than direct metric hacking. revision: yes

  2. Referee: [Experiments] No information is given on training data volume or composition, on whether baselines received identical supervised fine-tuning on the same splits, or on ablation controls that isolate the contribution of the spatio-temporal rewards versus standard SFT.

    Authors: We apologize for the insufficient detail in the main text. The revision will explicitly state the training data volume and composition (including dataset sources and split sizes), confirm that all reported baselines underwent identical supervised fine-tuning on the same data splits, and add ablation experiments that directly compare the full RFT setting against standard SFT without the spatio-temporal rewards. These controls will be placed in the Experiments section (with additional results in the supplement if space-constrained). revision: yes
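The distinction the authors draw between the reward and the optimization can be made concrete. GRPO-style RFT (the family introduced in the cited DeepSeekMath work and used by several cited follow-ups; that VideoChat-R1 uses exactly this estimator is an assumption) converts a group of sampled-response rewards into relative advantages:

```python
# Sketch of GRPO-style group-relative advantages: each sampled response's
# reward is normalized against its own group's mean and std, so no value
# network is needed. Population std and the zero-variance fallback are
# implementation choices assumed here for illustration.
import statistics

def group_relative_advantages(rewards):
    """Map a group of scalar rewards to zero-mean, unit-std advantages."""
    mu = statistics.mean(rewards)
    sigma = statistics.pstdev(rewards)
    if sigma == 0:  # all responses scored the same: no learning signal
        return [0.0 for _ in rewards]
    return [(r - mu) / sigma for r in rewards]

print(group_relative_advantages([0.2, 0.6, 1.0]))
```

Responses scored above their group's mean are reinforced and those below are suppressed; the benchmark metric itself never has to enter the update, which is the authors' point about the rewards being distinct from the evaluation.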

Circularity Check

0 steps flagged

No significant circularity: purely empirical claims with no derivation chain

Full rationale

The paper presents an empirical study of Reinforcement Fine-Tuning (RFT) applied to video MLLMs using rule-based rewards for spatio-temporal tasks. No equations, derivations, or theoretical steps exist that could reduce to self-definitions, fitted inputs renamed as predictions, or self-citation load-bearing arguments. Reported gains (e.g., +31.8 temporal grounding) are measured outcomes on benchmarks, not quantities forced by construction from the training procedure itself. The work is self-contained as an experimental report; any reward-design details would be implementation specifics rather than circular logic in a claimed derivation.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the unproven premise that rule-based temporal rewards will reliably improve video reasoning in MLLMs; no free parameters or invented entities are stated in the abstract.

axioms (1)
  • Domain assumption: Rule-based rewards focused on temporal associations can be defined that improve video reasoning without harming general capabilities.
    Invoked when the authors state that joint RFT on spatio-temporal tasks yields both perception gains and preserved chat ability.

pith-pipeline@v0.9.0 · 5515 in / 1157 out tokens · 22884 ms · 2026-05-15T20:52:13.629443+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • Cost.FunctionalEquation washburn_uniqueness_aczel · unclear

    Relation between the paper passage and the cited Recognition theorem:

    "Through joint RFT on multiple spatio-temporal perception tasks, we developed VideoChat-R1... significant improvements in tasks like temporal grounding (+31.8) and object tracking (+31.2)"

  • Foundation.DimensionForcing dimension_forced · unclear

    Relation between the paper passage and the cited Recognition theorem:

    "we explore integrating spatio-temporal specific rewards into Multimodal Large Language Models (MLLMs) to address the unique challenges of video understanding, such as long-range temporal associations"

What do these tags mean?

  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 21 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. EvoGround: Self-Evolving Video Agents for Video Temporal Grounding

    cs.CV · 2026-05 · unverdicted · novelty 7.0

    A proposer-solver agent pair achieves supervised-level video temporal grounding and fine-grained captioning from 2.5K unlabeled videos via self-reinforcing evolution.

  2. AdaFocus: Adaptive Relevance-Diversity Sampling with Zero-Cache Look-back for Efficient Long Video Understanding

    cs.CV · 2026-05 · unverdicted · novelty 7.0

    AdaFocus achieves better accuracy on long-video benchmarks with roughly 33 times fewer visual tokens by combining query-aware adaptive sampling and zero-cache disk-based refinement.

  3. MMVIAD: Multi-view Multi-task Video Understanding for Industrial Anomaly Detection

    cs.CV · 2026-05 · unverdicted · novelty 7.0

    MMVIAD is the first multi-view continuous video dataset for industrial anomaly detection with four supported tasks, and the VISTA model improves average benchmark scores from 45.0 to 57.5 on unseen data while surpassi...

  4. VISD: Enhancing Video Reasoning via Structured Self-Distillation

    cs.CV · 2026-05 · unverdicted · novelty 7.0

    VISD improves VideoLLM reasoning performance and training efficiency by combining structured multi-dimensional self-distillation feedback with RL via direction-magnitude decoupling, curriculum scheduling, and EMA stab...

  5. SVAgent: Storyline-Guided Long Video Understanding via Cross-Modal Multi-Agent Collaboration

    cs.CV · 2026-04 · unverdicted · novelty 7.0

    SVAgent improves long video question answering by constructing storylines via multi-agent collaboration and aligning cross-modal predictions for more robust, human-like reasoning.

  6. Motion-o: Trajectory-Grounded Video Reasoning

    cs.CV · 2026-03 · conditional · novelty 7.0

    Motion-o extends VLMs with Motion Chain of Thought (MCoT) using <motion/> tags and perturbation rewards to make object trajectories explicit and supervised in video reasoning.

  7. From Priors to Perception: Grounding Video-LLMs in Physical Reality

    cs.CV · 2026-05 · unverdicted · novelty 6.0

    Video-LLMs fail physical reasoning due to semantic prior dominance rather than perception deficits; a new programmatic adversarial curriculum and visual-anchored reasoning chain enable substantial gains via standard L...

  8. Beyond Perceptual Shortcuts: Causal-Inspired Debiasing Optimization for Generalizable Video Reasoning in Lightweight MLLMs

    cs.CV · 2026-05 · unverdicted · novelty 6.0

    VideoThinker improves lightweight MLLM video reasoning by creating a bias model to capture shortcuts and applying causal debiasing policy optimization to push away from them, achieving SOTA efficiency with minimal data.

  9. Co-Evolving Policy Distillation

    cs.LG · 2026-04 · unverdicted · novelty 6.0

    CoPD integrates multiple expert capabilities by running parallel RLVR training with bidirectional online policy distillation among experts, outperforming mixed RLVR and sequential OPD while surpassing domain-specific ...

  10. Video-ToC: Video Tree-of-Cue Reasoning

    cs.CV · 2026-04 · unverdicted · novelty 6.0

    Video-ToC adds tree-guided cue localization, demand-based RL rewards, and automated datasets to video LLMs, reporting better results than prior methods on six understanding benchmarks plus a hallucination test.

  11. Chain-of-Glimpse: Search-Guided Progressive Object-Grounded Reasoning for Video Understanding

    cs.CV · 2026-04 · unverdicted · novelty 6.0

    Chain-of-Glimpse is a reinforcement learning framework that builds progressive, spatially grounded reasoning traces around task-relevant objects in videos to enable more accurate and interpretable multi-step decisions.

  12. Video-MME-v2: Towards the Next Stage in Benchmarks for Comprehensive Video Understanding

    cs.CV · 2026-04 · unverdicted · novelty 6.0

    Video-MME-v2 is a new benchmark that applies progressive visual-to-reasoning levels and non-linear group scoring to expose gaps in video MLLM capabilities.

  13. Reinforce to Learn, Elect to Reason: A Dual Paradigm for Video Reasoning

    cs.CV · 2026-04 · unverdicted · novelty 6.0

    RLER trains video-reasoning models with three task-driven RL rewards for evidence production and elects the best answer from a few candidates via evidence consistency scoring, yielding 6.3% average gains on eight benchmarks.

  14. STRIVE: Structured Spatiotemporal Exploration for Reinforcement Learning in Video Question Answering

    cs.CV · 2026-04 · unverdicted · novelty 6.0

    STRIVE stabilizes RL for video QA by creating spatiotemporal video variants and using importance-aware sampling, yielding consistent gains over baselines on six benchmarks.

  15. GraphThinker: Reinforcing Temporally Grounded Video Reasoning with Event Graph Thinking

    cs.CV · 2026-02 · unverdicted · novelty 6.0

    GraphThinker reduces temporal hallucinations in video reasoning by constructing event-based scene graphs and applying visual attention rewards in reinforcement finetuning.

  16. VISD: Enhancing Video Reasoning via Structured Self-Distillation

    cs.CV · 2026-05 · unverdicted · novelty 5.0

    VISD improves VideoLLM reasoning by adding multi-dimensional diagnostic self-distillation and RL decoupling, yielding higher accuracy, better grounding, and nearly 2x faster training convergence.

  17. VISD: Enhancing Video Reasoning via Structured Self-Distillation

    cs.CV · 2026-05 · unverdicted · novelty 5.0

    VISD adds structured privileged feedback from a judge model and a direction-magnitude decoupling trick to let VideoLLMs learn token-level credit assignment while keeping RL stable, yielding higher accuracy and roughly...

  18. Towards Fine-grained Temporal Perception: Post-Training Large Audio-Language Models with Audio-Side Time Prompt

    cs.SD · 2026-04 · unverdicted · novelty 5.0

    TimePro-RL interleaves timestamp embeddings in audio sequences and applies RL post-SFT to boost temporal alignment in LALMs, yielding gains on grounding, event detection, and dense captioning.

  19. OmniJigsaw: Enhancing Omni-Modal Reasoning via Modality-Orchestrated Reordering

    cs.CV · 2026-04 · unverdicted · novelty 5.0

    OmniJigsaw is a self-supervised proxy task that reconstructs shuffled audio-visual clips via joint integration, sample-level selection, and clip-level masking strategies, yielding gains on 15 video, audio, and reasoni...

  20. RCoT-Seg: Reinforced Chain-of-Thought for Video Reasoning and Segmentation

    cs.CV · 2026-05 · unverdicted · novelty 4.0

    RCoT-Seg uses GRPO-reinforced keyframe selection from a CoT-start corpus followed by SAM2 mask propagation to improve video object segmentation under implicit temporal instructions over prior MLLM sampling methods.

  21. EasyVideoR1: Easier RL for Video Understanding

    cs.CV · 2026-04 · unverdicted · novelty 4.0

    EasyVideoR1 delivers an optimized RL pipeline for video understanding in large vision-language models, achieving 1.47x throughput gains and aligned results on 22 benchmarks.

Reference graph

Works this paper leans on

42 extracted references · 42 canonical work pages · cited by 19 Pith papers · 13 internal anchors

  1. [1] Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2.5-VL technical report. arXiv preprint arXiv:2502.13923, 2025.

  2. [2] Zhuo Cao, Bingqing Zhang, Heming Du, Xin Yu, Xue Li, and Sen Wang. FlashVTG: Feature layering and adaptive score handling network for video temporal grounding. arXiv preprint arXiv:2412.13441, 2024.

  3. [3] Huilin Deng, Ding Zou, Rui Ma, Hongchen Luo, Yang Cao, and Yu Kang. Boosting the generalization and reasoning of vision language models with curriculum reinforcement learning. arXiv preprint arXiv:2503.07065, 2025.

  4. [4] Yihe Deng, Hritik Bansal, Fan Yin, Nanyun Peng, Wei Wang, and Kai-Wei Chang. OpenVLThinker: An early exploration to complex vision-language reasoning via iterative self-improvement. arXiv preprint arXiv:2503.17352, 2025.

  5. [5] Kaituo Feng, Kaixiong Gong, Bohao Li, Zonghao Guo, Yibing Wang, Tianshuo Peng, Benyou Wang, and Xiangyu Yue. Video-R1: Reinforcing video reasoning in MLLMs. arXiv preprint arXiv:2503.21776, 2025.

  6. [6] Chaoyou Fu, Yuhan Dai, Yondong Luo, Lei Li, Shuhuai Ren, Renrui Zhang, Zihan Wang, Chenyu Zhou, Yunhang Shen, Mengdan Zhang, et al. Video-MME: The first-ever comprehensive evaluation benchmark of multi-modal LLMs in video analysis. arXiv preprint arXiv:2405.21075, 2024.

  7. [7] Jiyang Gao, Chen Sun, Zhenheng Yang, and Ram Nevatia. TALL: Temporal activity localization via language query. In Proceedings of the IEEE International Conference on Computer Vision, pages 5267–5275, 2017.

  8. [8] Aleksandr Gordeev, Vladimir Dokholyan, Irina Tolstykh, and Maksim Kuprashevich. Saliency-guided DETR for moment retrieval and highlight detection. arXiv preprint arXiv:2410.01615, 2024.

  9. [9] Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025.

  10. [10] Lianghua Huang, Xin Zhao, and Kaiqi Huang. GOT-10k: A large high-diversity benchmark for generic object tracking in the wild. IEEE Transactions on Pattern Analysis and Machine Intelligence, 43(5):1562–1577, 2019.

  11. [11] Zhenpeng Huang, Xinhao Li, Jiaqi Li, Jing Wang, Xiangyu Zeng, Cheng Liang, Tao Wu, Xi Chen, Liang Li, and Limin Wang. Online video understanding: A comprehensive benchmark and memory-augmented method. arXiv preprint arXiv:2501.00584, 2024.

  12. [12] Aaron Jaech, Adam Kalai, Adam Lerer, Adam Richardson, Ahmed El-Kishky, Aiden Low, Alec Helyar, Aleksander Madry, Alex Beutel, Alex Carney, et al. OpenAI o1 system card. arXiv preprint arXiv:2412.16720, 2024.

  13. [13] Ranjay Krishna, Kenji Hata, Frederic Ren, Li Fei-Fei, and Juan Carlos Niebles. Dense-captioning events in videos. In Proceedings of the IEEE International Conference on Computer Vision, pages 706–715, 2017.

  14. [14] KunChang Li, Yinan He, Yi Wang, Yizhuo Li, Wenhai Wang, Ping Luo, Yali Wang, Limin Wang, and Yu Qiao. VideoChat: Chat-centric video understanding. arXiv preprint arXiv:2305.06355, 2023.

  15. [15] Kunchang Li, Yali Wang, Yinan He, Yizhuo Li, Yi Wang, Yi Liu, Zun Wang, Jilan Xu, Guo Chen, Ping Luo, et al. MVBench: A comprehensive multi-modal video understanding benchmark. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22195–22206, 2024.

  16. [16] Xinhao Li, Zhenpeng Huang, Jing Wang, Kunchang Li, and Limin Wang. VideoEval: Comprehensive benchmark suite for low-cost evaluation of video foundation model. arXiv preprint arXiv:2407.06491, 2024.

  17. [17] Xinhao Li, Yi Wang, Jiashuo Yu, Xiangyu Zeng, Yuhan Zhu, Haian Huang, Jianfei Gao, Kunchang Li, Yinan He, Chenting Wang, et al. VideoChat-Flash: Hierarchical compression for long-context video modeling. arXiv preprint arXiv:2501.00574, 2024.

  18. [18] Yuqi Liu, Bohao Peng, Zhisheng Zhong, Zihao Yue, Fanbin Lu, Bei Yu, and Jiaya Jia. Seg-Zero: Reasoning-chain guided segmentation via cognitive reinforcement. arXiv preprint arXiv:2503.06520, 2025.

  19. [19] Ziyu Liu, Zeyi Sun, Yuhang Zang, Xiaoyi Dong, Yuhang Cao, Haodong Duan, Dahua Lin, and Jiaqi Wang. Visual-RFT: Visual reinforcement fine-tuning. arXiv preprint arXiv:2503.01785, 2025.

  20. [20] Muhammad Maaz, Hanoona Rasheed, Salman Khan, and Fahad Shahbaz Khan. Video-ChatGPT: Towards detailed video understanding via large vision and language models. arXiv preprint arXiv:2306.05424, 2023.

  21. [21] Viorica Patraucean, Lucas Smaira, Ankush Gupta, Adria Recasens, Larisa Markeeva, Dylan Banarse, Skanda Koppula, Mateusz Malinowski, Yi Yang, Carl Doersch, et al. Perception Test: A diagnostic benchmark for multimodal video models. Advances in Neural Information Processing Systems, 36:42748–42761, 2023.

  22. [22] Yingzhe Peng, Gongrui Zhang, Miaosen Zhang, Zhiyuan You, Jie Liu, Qipeng Zhu, Kai Yang, Xingzhong Xu, Xin Geng, and Xu Yang. LMM-R1: Empowering 3B LMMs with strong reasoning abilities through two-stage rule-based RL. arXiv preprint arXiv:2503.07536, 2025.

  23. [23] Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, Y. K. Li, Y. Wu, et al. DeepSeekMath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024.

  24. [24] Kimi Team, Angang Du, Bofei Gao, Bowei Xing, Changjiu Jiang, Cheng Chen, Cheng Li, Chenjun Xiao, Chenzhuang Du, Chonghua Liao, et al. Kimi k1.5: Scaling reinforcement learning with LLMs. arXiv preprint arXiv:2501.12599, 2025.

  25. [25] Jiawei Wang, Liping Yuan, Yuchen Zhang, and Haomiao Sun. Tarsier: Recipes for training and evaluating large video description models. arXiv preprint arXiv:2407.00634, 2024.

  26. [26] Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, et al. Qwen2-VL: Enhancing vision-language model's perception of the world at any resolution. arXiv preprint arXiv:2409.12191, 2024.

  27. [27] Ye Wang, Boshen Xu, Zihao Yue, Zihan Xiao, Ziheng Wang, Liang Zhang, Dingyi Yang, Wenxuan Wang, and Qin Jin. TimeZero: Temporal video grounding with reasoning-guided LVLM. arXiv preprint arXiv:2503.13377, 2025.

  28. [28] Yi Wang, Kunchang Li, Xinhao Li, Jiashuo Yu, Yinan He, Guo Chen, Baoqi Pei, Rongkun Zheng, Zun Wang, Yansong Shi, et al. InternVideo2: Scaling foundation models for multimodal video understanding. In European Conference on Computer Vision, pages 396–416. Springer, 2024.

  29. [29] Yi Wang, Xinhao Li, Ziang Yan, Yinan He, Jiashuo Yu, Xiangyu Zeng, Chenting Wang, Changlian Ma, Haian Huang, Jianfei Gao, et al. InternVideo2.5: Empowering video MLLMs with long and rich context modeling. arXiv preprint arXiv:2501.12386, 2025.

  30. [30] Haoning Wu, Dongxu Li, Bei Chen, and Junnan Li. LongVideoBench: A benchmark for long-context interleaved video-language understanding. Advances in Neural Information Processing Systems, 37:28828–28857, 2024.

  31. [31] Junbin Xiao, Angela Yao, Yicong Li, and Tat-Seng Chua. Can I trust your answer? Visually grounded video question answering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13204–13214, 2024.

  32. [32] Yifan Xu, Xinhao Li, Yichun Yang, Rui Huang, and Limin Wang. Fine-grained video-text retrieval: A new benchmark and method. arXiv preprint arXiv:2501.00513, 2024.

  33. [33] Ziang Yan, Zhilin Li, Yinan He, Chenting Wang, Kunchang Li, Xinhao Li, Xiangyu Zeng, Zilei Wang, Yali Wang, Yu Qiao, et al. Task preference optimization: Improving multimodal large language models with vision task alignment. arXiv preprint arXiv:2412.19326, 2024.

  34. [34] An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, et al. Qwen2.5 technical report. arXiv preprint arXiv:2412.15115, 2024.

  35. [35] Yi Yang, Xiaoxuan He, Hongkun Pan, Xiyan Jiang, Yan Deng, Xingtao Yang, Haoyu Lu, Dacheng Yin, Fengyun Rao, Minfeng Zhu, et al. R1-Onevision: Advancing generalized multimodal reasoning through cross-modal formalization. arXiv preprint arXiv:2503.10615, 2025.

  36. [36] En Yu, Liang Zhao, Yana Wei, Jinrong Yang, Dongming Wu, Lingyu Kong, Haoran Wei, Tiancai Wang, Zheng Ge, Xiangyu Zhang, et al. Merlin: Empowering multimodal LLMs with foresight minds. In European Conference on Computer Vision, pages 425–443. Springer, 2024.

  37. [37] Xiangyu Zeng, Kunchang Li, Chenting Wang, Xinhao Li, Tianxiang Jiang, Ziang Yan, Songze Li, Yansong Shi, Zhengrong Yue, Yi Wang, et al. TimeSuite: Improving MLLMs for long video understanding via grounded tuning. arXiv preprint arXiv:2410.19702, 2024.

  38. [39] Yufei Zhan, Yousong Zhu, Shurong Zheng, Hongyin Zhao, Fan Yang, Ming Tang, and Jinqiao Wang. Vision-R1: Evolving human-free alignment in large vision-language models via vision-guided reinforcement learning. arXiv preprint arXiv:2503.18013, 2025.

  39. [40] Jingyi Zhang, Jiaxing Huang, Huanjin Yao, Shunyu Liu, Xikun Zhang, Shijian Lu, and Dacheng Tao. R1-VL: Learning to reason with multimodal large language models via step-wise group relative policy optimization. arXiv preprint arXiv:2503.12937, 2025.

  40. [41] Yuanhan Zhang, Jinming Wu, Wei Li, Bo Li, Zejun Ma, Ziwei Liu, and Chunyuan Li. Video instruction tuning with synthetic data. arXiv preprint arXiv:2410.02713, 2024.

  41. [42] Jiaxing Zhao, Xihan Wei, and Liefeng Bo. R1-Omni: Explainable omni-multimodal emotion recognition with reinforcement learning. arXiv e-prints, pages arXiv–2503, 2025.

  42. [43] Hengguang Zhou, Xirui Li, Ruochen Wang, Minhao Cheng, Tianyi Zhou, and Cho-Jui Hsieh. R1-Zero's "aha moment" in visual reasoning on a 2B non-SFT model. arXiv preprint arXiv:2503.05132, 2025.