Pith · machine review for the scientific record

arxiv: 2504.06958 · v5 · submitted 2025-04-09 · 💻 cs.CV

Recognition: 2 Lean theorem links

VideoChat-R1: Enhancing Spatio-Temporal Perception via Reinforcement Fine-Tuning

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 20:52 UTC · model grok-4.3

classification 💻 cs.CV
keywords: reinforcement fine-tuning · video multimodal models · spatio-temporal perception · temporal grounding · object tracking · video reasoning · rule-based rewards

The pith

Reinforcement fine-tuning with rule-based temporal rewards creates a video model with state-of-the-art spatio-temporal perception.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether adding targeted rule-based rewards during reinforcement fine-tuning can strengthen how multimodal models handle temporal and spatial relations in videos. The authors apply this jointly across several perception tasks to produce VideoChat-R1. The resulting model records large gains on temporal grounding and object tracking while also lifting scores on general question-answering benchmarks. It keeps the base model's chat abilities intact and supports a new inference method called temporal clue-driven reasoning. The work aims at data-efficient ways to build more reliable video dialogue systems.

Core claim

By applying reinforcement fine-tuning with rule-based rewards that target temporal associations across multiple spatio-temporal perception tasks, the resulting VideoChat-R1 model achieves substantial improvements in video understanding tasks such as temporal grounding and object tracking, while preserving or enhancing performance on general QA benchmarks and enabling a temporal clue-driven reasoning approach for dialogue.

What carries the argument

Reinforcement Fine-Tuning (RFT) driven by rule-based rewards that emphasize long-range temporal associations in video data.
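The paper's reward equations are not reproduced on this page. As a purely illustrative sketch (the IoU form, the span representation, and the 0.1 format bonus below are assumptions, not the paper's actual formulation), a deterministic rule-based reward for temporal grounding can be computed directly from a predicted and a ground-truth time span:

```python
# Illustrative rule-based temporal reward for RFT on temporal grounding.
# Assumed design: temporal IoU between predicted and gold (start, end)
# spans, plus a small bonus when the model's answer parses cleanly.

def temporal_iou(pred, gold):
    """IoU between two (start, end) intervals in seconds."""
    inter = max(0.0, min(pred[1], gold[1]) - max(pred[0], gold[0]))
    union = max(pred[1], gold[1]) - min(pred[0], gold[0])
    return inter / union if union > 0 else 0.0

def rule_based_reward(pred_span, gold_span, answer_well_formed):
    """Deterministic reward: IoU term plus a format-compliance bonus."""
    reward = temporal_iou(pred_span, gold_span)
    if answer_well_formed:  # e.g. the output parsed as "start to end"
        reward += 0.1
    return reward

print(rule_based_reward((2.0, 8.0), (4.0, 10.0), True))
```

Because such a reward is computed by rule rather than by a learned critic, it needs no preference data, which is the mechanism behind the data-efficiency claim.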

If this is right

  • Large gains on temporal grounding (+31.8) and object tracking (+31.2).
  • Better results on general QA benchmarks.
  • More reliable video dialogue systems.
  • A new temporal clue-driven reasoning schema for inference.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same reward design could transfer to other sequential data types such as audio or sensor streams.
  • It may lower the data volume needed to adapt video models for new domains.
  • Testing on real-time streaming video would show whether the gains survive latency constraints.

Load-bearing premise

The rule-based rewards on temporal associations will create broad video-reasoning gains that hold up on new cases rather than relying on hidden data artifacts.

What would settle it

Evaluation on a fresh set of video tasks or longer sequences where the model shows no gain or a drop relative to the base model without the reinforcement step.

Original abstract

Reinforcement Learning (RL) benefits Large Language Models (LLMs) for complex reasoning. Inspired by this, we explore integrating spatio-temporal specific rewards into Multimodal Large Language Models (MLLMs) to address the unique challenges of video understanding, such as long-range temporal associations. This paper investigates how rule-based rewards, particularly temporal ones, can improve video reasoning and their generalizability. Our study proposes Reinforcement Fine-Tuning (RFT) as a data-efficient method to enhance video reasoning on specific tasks without sacrificing original capabilities. Through joint RFT on multiple spatio-temporal perception tasks, we developed VideoChat-R1, a powerful Video MLLM. VideoChat-R1 achieves state-of-the-art spatio-temporal perception, demonstrating significant improvements in tasks like temporal grounding (+31.8) and object tracking (+31.2), while also improving general QA benchmarks. The enhanced perception and preserved chat abilities contribute to a more reliable video dialogue system, leading to our "Temporal Clue-driven Reasoning" inference schema. This work provides a foundation for developing robust, real-world video comprehension agents.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces VideoChat-R1, a Video Multimodal Large Language Model enhanced via Reinforcement Fine-Tuning (RFT) that incorporates rule-based rewards targeting spatio-temporal perception tasks such as long-range temporal associations. It claims state-of-the-art results on video understanding benchmarks, with reported gains of +31.8 on temporal grounding and +31.2 on object tracking, alongside improvements on general QA tasks, preservation of chat capabilities, and a new 'Temporal Clue-driven Reasoning' inference schema.

Significance. If the gains prove robust under matched baselines and non-circular reward definitions, the work would offer a data-efficient RL approach for video MLLMs that improves targeted perception without degrading general capabilities. The joint multi-task RFT and inference schema could provide a practical template for building more reliable video dialogue agents.

major comments (2)
  1. [Abstract] The reported deltas (+31.8 temporal grounding, +31.2 object tracking) are presented without any definition or equation for the rule-based temporal rewards, so it is impossible to determine whether the gains arise from RFT or from direct optimization of the evaluation metrics themselves.
  2. [Experiments] No information is given on training data volume or composition, on whether baselines received identical supervised fine-tuning on the same splits, or on ablation controls that isolate the contribution of the spatio-temporal rewards versus standard SFT.
minor comments (2)
  1. [Abstract] The acronym RFT is used without initial expansion; add '(Reinforcement Fine-Tuning)' on first use.
  2. [Tables/Figures] Figure captions and table headers should explicitly state the exact evaluation metrics (e.g., mIoU, accuracy) and baseline model versions for each reported number.
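To make the requested metric definitions concrete: mIoU and R1@0.5 for temporal grounding are conventionally computed as below. This is a generic sketch of the standard definitions, not necessarily the paper's exact evaluation code.

```python
# Generic temporal-grounding metrics (sketch): mean IoU over queries, and
# R1@0.5, the fraction of queries whose top prediction reaches IoU >= 0.5.
# Whether the paper reports exactly these variants is an assumption.

def iou(a, b):
    """IoU of two (start, end) intervals in seconds."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = max(a[1], b[1]) - min(a[0], b[0])
    return inter / union if union > 0 else 0.0

def miou(preds, golds):
    """Mean IoU across paired predicted/gold spans."""
    return sum(iou(p, g) for p, g in zip(preds, golds)) / len(preds)

def recall_at(preds, golds, thresh=0.5):
    """R1@thresh: share of queries whose prediction clears the IoU bar."""
    return sum(iou(p, g) >= thresh for p, g in zip(preds, golds)) / len(preds)

preds = [(0.0, 4.0), (10.0, 20.0)]
golds = [(2.0, 4.0), (10.0, 15.0)]
print(miou(preds, golds), recall_at(preds, golds))
```

Stating which of these variants each table reports, and at which thresholds, is exactly the disambiguation the minor comment asks for.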

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address the major comments point-by-point below and will incorporate clarifications and additional details in the revised manuscript.

Point-by-point responses
  1. Referee: [Abstract] The reported deltas (+31.8 temporal grounding, +31.2 object tracking) are presented without any definition or equation for the rule-based temporal rewards, so it is impossible to determine whether the gains arise from RFT or from direct optimization of the evaluation metrics themselves.

    Authors: We agree the abstract omits the reward definitions. In the revision we will insert a concise description of the rule-based temporal rewards (e.g., verification of long-range temporal associations and object-relation consistency via deterministic rules) and note that these rewards are distinct from the downstream evaluation metrics. The full mathematical formulation appears in Section 3.2; the observed gains result from the RL optimization process rather than direct metric hacking. revision: yes

  2. Referee: [Experiments] No information is given on training data volume or composition, on whether baselines received identical supervised fine-tuning on the same splits, or on ablation controls that isolate the contribution of the spatio-temporal rewards versus standard SFT.

    Authors: We apologize for the insufficient detail in the main text. The revision will explicitly state the training data volume and composition (including dataset sources and split sizes), confirm that all reported baselines underwent identical supervised fine-tuning on the same data splits, and add ablation experiments that directly compare the full RFT setting against standard SFT without the spatio-temporal rewards. These controls will be placed in the Experiments section (with additional results in the supplement if space-constrained). revision: yes
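The distinction the authors draw between the reward and the optimization can be made concrete. GRPO-style RFT (the family introduced in the cited DeepSeekMath work and used by several cited follow-ups; that VideoChat-R1 uses exactly this estimator is an assumption) converts a group of sampled-response rewards into relative advantages:

```python
# Sketch of GRPO-style group-relative advantages: each sampled response's
# reward is normalized against its own group's mean and std, so no value
# network is needed. Population std and the zero-variance fallback are
# implementation choices assumed here for illustration.
import statistics

def group_relative_advantages(rewards):
    """Map a group of scalar rewards to zero-mean, unit-std advantages."""
    mu = statistics.mean(rewards)
    sigma = statistics.pstdev(rewards)
    if sigma == 0:  # all responses scored the same: no learning signal
        return [0.0 for _ in rewards]
    return [(r - mu) / sigma for r in rewards]

print(group_relative_advantages([0.2, 0.6, 1.0]))
```

Responses scored above their group's mean are reinforced and those below are suppressed; the benchmark metric itself never has to enter the update, which is the authors' point about the rewards being distinct from the evaluation.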

Circularity Check

0 steps flagged

No significant circularity: purely empirical claims with no derivation chain

Full rationale

The paper presents an empirical study of Reinforcement Fine-Tuning (RFT) applied to video MLLMs using rule-based rewards for spatio-temporal tasks. No equations, derivations, or theoretical steps exist that could reduce to self-definitions, fitted inputs renamed as predictions, or self-citation load-bearing arguments. Reported gains (e.g., +31.8 temporal grounding) are measured outcomes on benchmarks, not quantities forced by construction from the training procedure itself. The work is self-contained as an experimental report; any reward-design details would be implementation specifics rather than circular logic in a claimed derivation.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the unproven premise that rule-based temporal rewards will reliably improve video reasoning in MLLMs; no free parameters or invented entities are stated in the abstract.

axioms (1)
  • Domain assumption: Rule-based rewards focused on temporal associations can be defined that improve video reasoning without harming general capabilities.
    Invoked when the authors state that joint RFT on spatio-temporal tasks yields both perception gains and preserved chat ability.

pith-pipeline@v0.9.0 · 5515 in / 1157 out tokens · 22884 ms · 2026-05-15T20:52:13.629443+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • Cost.FunctionalEquation washburn_uniqueness_aczel · unclear

    Relation between the paper passage and the cited Recognition theorem:

    "Through joint RFT on multiple spatio-temporal perception tasks, we developed VideoChat-R1... significant improvements in tasks like temporal grounding (+31.8) and object tracking (+31.2)"

  • Foundation.DimensionForcing dimension_forced · unclear

    Relation between the paper passage and the cited Recognition theorem:

    "we explore integrating spatio-temporal specific rewards into Multimodal Large Language Models (MLLMs) to address the unique challenges of video understanding, such as long-range temporal associations"

What do these tags mean?

  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 21 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. EvoGround: Self-Evolving Video Agents for Video Temporal Grounding

    cs.CV · 2026-05 · unverdicted · novelty 7.0

    A proposer-solver agent pair achieves supervised-level video temporal grounding and fine-grained captioning from 2.5K unlabeled videos via self-reinforcing evolution.

  2. AdaFocus: Adaptive Relevance-Diversity Sampling with Zero-Cache Look-back for Efficient Long Video Understanding

    cs.CV · 2026-05 · unverdicted · novelty 7.0

    AdaFocus achieves better accuracy on long-video benchmarks with roughly 33 times fewer visual tokens by combining query-aware adaptive sampling and zero-cache disk-based refinement.

  3. MMVIAD: Multi-view Multi-task Video Understanding for Industrial Anomaly Detection

    cs.CV · 2026-05 · unverdicted · novelty 7.0

    MMVIAD is the first multi-view continuous video dataset for industrial anomaly detection with four supported tasks, and the VISTA model improves average benchmark scores from 45.0 to 57.5 on unseen data while surpassi...

  4. VISD: Enhancing Video Reasoning via Structured Self-Distillation

    cs.CV · 2026-05 · unverdicted · novelty 7.0

    VISD improves VideoLLM reasoning performance and training efficiency by combining structured multi-dimensional self-distillation feedback with RL via direction-magnitude decoupling, curriculum scheduling, and EMA stab...

  5. SVAgent: Storyline-Guided Long Video Understanding via Cross-Modal Multi-Agent Collaboration

    cs.CV · 2026-04 · unverdicted · novelty 7.0

    SVAgent improves long video question answering by constructing storylines via multi-agent collaboration and aligning cross-modal predictions for more robust, human-like reasoning.

  6. Motion-o: Trajectory-Grounded Video Reasoning

    cs.CV · 2026-03 · conditional · novelty 7.0

    Motion-o extends VLMs with Motion Chain of Thought (MCoT) using <motion/> tags and perturbation rewards to make object trajectories explicit and supervised in video reasoning.

  7. From Priors to Perception: Grounding Video-LLMs in Physical Reality

    cs.CV · 2026-05 · unverdicted · novelty 6.0

    Video-LLMs fail physical reasoning due to semantic prior dominance rather than perception deficits; a new programmatic adversarial curriculum and visual-anchored reasoning chain enable substantial gains via standard L...

  8. Beyond Perceptual Shortcuts: Causal-Inspired Debiasing Optimization for Generalizable Video Reasoning in Lightweight MLLMs

    cs.CV · 2026-05 · unverdicted · novelty 6.0

    VideoThinker improves lightweight MLLM video reasoning by creating a bias model to capture shortcuts and applying causal debiasing policy optimization to push away from them, achieving SOTA efficiency with minimal data.

  9. Co-Evolving Policy Distillation

    cs.LG · 2026-04 · unverdicted · novelty 6.0

    CoPD integrates multiple expert capabilities by running parallel RLVR training with bidirectional online policy distillation among experts, outperforming mixed RLVR and sequential OPD while surpassing domain-specific ...

  10. Video-ToC: Video Tree-of-Cue Reasoning

    cs.CV · 2026-04 · unverdicted · novelty 6.0

    Video-ToC adds tree-guided cue localization, demand-based RL rewards, and automated datasets to video LLMs, reporting better results than prior methods on six understanding benchmarks plus a hallucination test.

  11. Chain-of-Glimpse: Search-Guided Progressive Object-Grounded Reasoning for Video Understanding

    cs.CV · 2026-04 · unverdicted · novelty 6.0

    Chain-of-Glimpse is a reinforcement learning framework that builds progressive, spatially grounded reasoning traces around task-relevant objects in videos to enable more accurate and interpretable multi-step decisions.

  12. Video-MME-v2: Towards the Next Stage in Benchmarks for Comprehensive Video Understanding

    cs.CV · 2026-04 · unverdicted · novelty 6.0

    Video-MME-v2 is a new benchmark that applies progressive visual-to-reasoning levels and non-linear group scoring to expose gaps in video MLLM capabilities.

  13. Reinforce to Learn, Elect to Reason: A Dual Paradigm for Video Reasoning

    cs.CV · 2026-04 · unverdicted · novelty 6.0

    RLER trains video-reasoning models with three task-driven RL rewards for evidence production and elects the best answer from a few candidates via evidence consistency scoring, yielding 6.3% average gains on eight benchmarks.

  14. STRIVE: Structured Spatiotemporal Exploration for Reinforcement Learning in Video Question Answering

    cs.CV · 2026-04 · unverdicted · novelty 6.0

    STRIVE stabilizes RL for video QA by creating spatiotemporal video variants and using importance-aware sampling, yielding consistent gains over baselines on six benchmarks.

  15. GraphThinker: Reinforcing Temporally Grounded Video Reasoning with Event Graph Thinking

    cs.CV · 2026-02 · unverdicted · novelty 6.0

    GraphThinker reduces temporal hallucinations in video reasoning by constructing event-based scene graphs and applying visual attention rewards in reinforcement finetuning.

  16. VISD: Enhancing Video Reasoning via Structured Self-Distillation

    cs.CV · 2026-05 · unverdicted · novelty 5.0

    VISD improves VideoLLM reasoning by adding multi-dimensional diagnostic self-distillation and RL decoupling, yielding higher accuracy, better grounding, and nearly 2x faster training convergence.

  17. VISD: Enhancing Video Reasoning via Structured Self-Distillation

    cs.CV · 2026-05 · unverdicted · novelty 5.0

    VISD adds structured privileged feedback from a judge model and a direction-magnitude decoupling trick to let VideoLLMs learn token-level credit assignment while keeping RL stable, yielding higher accuracy and roughly...

  18. Towards Fine-grained Temporal Perception: Post-Training Large Audio-Language Models with Audio-Side Time Prompt

    cs.SD · 2026-04 · unverdicted · novelty 5.0

    TimePro-RL interleaves timestamp embeddings in audio sequences and applies RL post-SFT to boost temporal alignment in LALMs, yielding gains on grounding, event detection, and dense captioning.

  19. OmniJigsaw: Enhancing Omni-Modal Reasoning via Modality-Orchestrated Reordering

    cs.CV · 2026-04 · unverdicted · novelty 5.0

    OmniJigsaw is a self-supervised proxy task that reconstructs shuffled audio-visual clips via joint integration, sample-level selection, and clip-level masking strategies, yielding gains on 15 video, audio, and reasoni...

  20. RCoT-Seg: Reinforced Chain-of-Thought for Video Reasoning and Segmentation

    cs.CV · 2026-05 · unverdicted · novelty 4.0

    RCoT-Seg uses GRPO-reinforced keyframe selection from a CoT-start corpus followed by SAM2 mask propagation to improve video object segmentation under implicit temporal instructions over prior MLLM sampling methods.

  21. EasyVideoR1: Easier RL for Video Understanding

    cs.CV · 2026-04 · unverdicted · novelty 4.0

    EasyVideoR1 delivers an optimized RL pipeline for video understanding in large vision-language models, achieving 1.47x throughput gains and aligned results on 22 benchmarks.

Reference graph

Works this paper leans on

42 extracted references · 42 canonical work pages · cited by 19 Pith papers · 13 internal anchors

  1. [1] Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2.5-VL technical report. arXiv preprint arXiv:2502.13923, 2025.

  2. [2] Zhuo Cao, Bingqing Zhang, Heming Du, Xin Yu, Xue Li, and Sen Wang. FlashVTG: Feature layering and adaptive score handling network for video temporal grounding. arXiv preprint arXiv:2412.13441, 2024.

  3. [3] Huilin Deng, Ding Zou, Rui Ma, Hongchen Luo, Yang Cao, and Yu Kang. Boosting the generalization and reasoning of vision language models with curriculum reinforcement learning. arXiv preprint arXiv:2503.07065, 2025.

  4. [4] Yihe Deng, Hritik Bansal, Fan Yin, Nanyun Peng, Wei Wang, and Kai-Wei Chang. OpenVLThinker: An early exploration to complex vision-language reasoning via iterative self-improvement. arXiv preprint arXiv:2503.17352, 2025.

  5. [5] Kaituo Feng, Kaixiong Gong, Bohao Li, Zonghao Guo, Yibing Wang, Tianshuo Peng, Benyou Wang, and Xiangyu Yue. Video-R1: Reinforcing video reasoning in MLLMs. arXiv preprint arXiv:2503.21776, 2025.

  6. [6] Chaoyou Fu, Yuhan Dai, Yondong Luo, Lei Li, Shuhuai Ren, Renrui Zhang, Zihan Wang, Chenyu Zhou, Yunhang Shen, Mengdan Zhang, et al. Video-MME: The first-ever comprehensive evaluation benchmark of multi-modal LLMs in video analysis. arXiv preprint arXiv:2405.21075, 2024.

  7. [7] Jiyang Gao, Chen Sun, Zhenheng Yang, and Ram Nevatia. TALL: Temporal activity localization via language query. In Proceedings of the IEEE International Conference on Computer Vision, pages 5267–5275, 2017.

  8. [8] Aleksandr Gordeev, Vladimir Dokholyan, Irina Tolstykh, and Maksim Kuprashevich. Saliency-guided DETR for moment retrieval and highlight detection. arXiv preprint arXiv:2410.01615, 2024.

  9. [9] Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025.

  10. [10] Lianghua Huang, Xin Zhao, and Kaiqi Huang. GOT-10k: A large high-diversity benchmark for generic object tracking in the wild. IEEE Transactions on Pattern Analysis and Machine Intelligence, 43(5):1562–1577, 2019.

  11. [11] Zhenpeng Huang, Xinhao Li, Jiaqi Li, Jing Wang, Xiangyu Zeng, Cheng Liang, Tao Wu, Xi Chen, Liang Li, and Limin Wang. Online video understanding: A comprehensive benchmark and memory-augmented method. arXiv preprint arXiv:2501.00584, 2024.

  12. [12] Aaron Jaech, Adam Kalai, Adam Lerer, Adam Richardson, Ahmed El-Kishky, Aiden Low, Alec Helyar, Aleksander Madry, Alex Beutel, Alex Carney, et al. OpenAI o1 system card. arXiv preprint arXiv:2412.16720, 2024.

  13. [13] Ranjay Krishna, Kenji Hata, Frederic Ren, Li Fei-Fei, and Juan Carlos Niebles. Dense-captioning events in videos. In Proceedings of the IEEE International Conference on Computer Vision, pages 706–715, 2017.

  14. [14] KunChang Li, Yinan He, Yi Wang, Yizhuo Li, Wenhai Wang, Ping Luo, Yali Wang, Limin Wang, and Yu Qiao. VideoChat: Chat-centric video understanding. arXiv preprint arXiv:2305.06355, 2023.

  15. [15] Kunchang Li, Yali Wang, Yinan He, Yizhuo Li, Yi Wang, Yi Liu, Zun Wang, Jilan Xu, Guo Chen, Ping Luo, et al. MVBench: A comprehensive multi-modal video understanding benchmark. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22195–22206, 2024.

  16. [16] Xinhao Li, Zhenpeng Huang, Jing Wang, Kunchang Li, and Limin Wang. VideoEval: Comprehensive benchmark suite for low-cost evaluation of video foundation model. arXiv preprint arXiv:2407.06491, 2024.

  17. [17] Xinhao Li, Yi Wang, Jiashuo Yu, Xiangyu Zeng, Yuhan Zhu, Haian Huang, Jianfei Gao, Kunchang Li, Yinan He, Chenting Wang, et al. VideoChat-Flash: Hierarchical compression for long-context video modeling. arXiv preprint arXiv:2501.00574, 2024.

  18. [18] Yuqi Liu, Bohao Peng, Zhisheng Zhong, Zihao Yue, Fanbin Lu, Bei Yu, and Jiaya Jia. Seg-Zero: Reasoning-chain guided segmentation via cognitive reinforcement. arXiv preprint arXiv:2503.06520, 2025.

  19. [19] Ziyu Liu, Zeyi Sun, Yuhang Zang, Xiaoyi Dong, Yuhang Cao, Haodong Duan, Dahua Lin, and Jiaqi Wang. Visual-RFT: Visual reinforcement fine-tuning. arXiv preprint arXiv:2503.01785, 2025.

  20. [20] Muhammad Maaz, Hanoona Rasheed, Salman Khan, and Fahad Shahbaz Khan. Video-ChatGPT: Towards detailed video understanding via large vision and language models. arXiv preprint arXiv:2306.05424, 2023.

  21. [21] Viorica Patraucean, Lucas Smaira, Ankush Gupta, Adria Recasens, Larisa Markeeva, Dylan Banarse, Skanda Koppula, Mateusz Malinowski, Yi Yang, Carl Doersch, et al. Perception Test: A diagnostic benchmark for multimodal video models. Advances in Neural Information Processing Systems, 36:42748–42761, 2023.

  22. [22] Yingzhe Peng, Gongrui Zhang, Miaosen Zhang, Zhiyuan You, Jie Liu, Qipeng Zhu, Kai Yang, Xingzhong Xu, Xin Geng, and Xu Yang. LMM-R1: Empowering 3B LMMs with strong reasoning abilities through two-stage rule-based RL. arXiv preprint arXiv:2503.07536, 2025.

  23. [23] Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, Y. K. Li, Y. Wu, et al. DeepSeekMath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024.

  24. [24] Kimi Team, Angang Du, Bofei Gao, Bowei Xing, Changjiu Jiang, Cheng Chen, Cheng Li, Chenjun Xiao, Chenzhuang Du, Chonghua Liao, et al. Kimi k1.5: Scaling reinforcement learning with LLMs. arXiv preprint arXiv:2501.12599, 2025.

  25. [25] Jiawei Wang, Liping Yuan, Yuchen Zhang, and Haomiao Sun. Tarsier: Recipes for training and evaluating large video description models. arXiv preprint arXiv:2407.00634, 2024.

  26. [26] Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, et al. Qwen2-VL: Enhancing vision-language model's perception of the world at any resolution. arXiv preprint arXiv:2409.12191, 2024.

  27. [27] Ye Wang, Boshen Xu, Zihao Yue, Zihan Xiao, Ziheng Wang, Liang Zhang, Dingyi Yang, Wenxuan Wang, and Qin Jin. TimeZero: Temporal video grounding with reasoning-guided LVLM. arXiv preprint arXiv:2503.13377, 2025.

  28. [28] Yi Wang, Kunchang Li, Xinhao Li, Jiashuo Yu, Yinan He, Guo Chen, Baoqi Pei, Rongkun Zheng, Zun Wang, Yansong Shi, et al. InternVideo2: Scaling foundation models for multimodal video understanding. In European Conference on Computer Vision, pages 396–416. Springer, 2024.

  29. [29] Yi Wang, Xinhao Li, Ziang Yan, Yinan He, Jiashuo Yu, Xiangyu Zeng, Chenting Wang, Changlian Ma, Haian Huang, Jianfei Gao, et al. InternVideo2.5: Empowering video MLLMs with long and rich context modeling. arXiv preprint arXiv:2501.12386, 2025.

  30. [30] Haoning Wu, Dongxu Li, Bei Chen, and Junnan Li. LongVideoBench: A benchmark for long-context interleaved video-language understanding. Advances in Neural Information Processing Systems, 37:28828–28857, 2024.

  31. [31] Junbin Xiao, Angela Yao, Yicong Li, and Tat-Seng Chua. Can I trust your answer? Visually grounded video question answering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13204–13214, 2024.

  32. [32] Yifan Xu, Xinhao Li, Yichun Yang, Rui Huang, and Limin Wang. Fine-grained video-text retrieval: A new benchmark and method. arXiv preprint arXiv:2501.00513, 2024.

  33. [33] Ziang Yan, Zhilin Li, Yinan He, Chenting Wang, Kunchang Li, Xinhao Li, Xiangyu Zeng, Zilei Wang, Yali Wang, Yu Qiao, et al. Task preference optimization: Improving multimodal large language models with vision task alignment. arXiv preprint arXiv:2412.19326, 2024.

  34. [34] An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, et al. Qwen2.5 technical report. arXiv preprint arXiv:2412.15115, 2024.

  35. [35] Yi Yang, Xiaoxuan He, Hongkun Pan, Xiyan Jiang, Yan Deng, Xingtao Yang, Haoyu Lu, Dacheng Yin, Fengyun Rao, Minfeng Zhu, et al. R1-Onevision: Advancing generalized multimodal reasoning through cross-modal formalization. arXiv preprint arXiv:2503.10615, 2025.

  36. [36] En Yu, Liang Zhao, Yana Wei, Jinrong Yang, Dongming Wu, Lingyu Kong, Haoran Wei, Tiancai Wang, Zheng Ge, Xiangyu Zhang, et al. Merlin: Empowering multimodal LLMs with foresight minds. In European Conference on Computer Vision, pages 425–443. Springer, 2024.

  37. [37] Xiangyu Zeng, Kunchang Li, Chenting Wang, Xinhao Li, Tianxiang Jiang, Ziang Yan, Songze Li, Yansong Shi, Zhengrong Yue, Yi Wang, et al. TimeSuite: Improving MLLMs for long video understanding via grounded tuning. arXiv preprint arXiv:2410.19702, 2024.

  38. [39] Yufei Zhan, Yousong Zhu, Shurong Zheng, Hongyin Zhao, Fan Yang, Ming Tang, and Jinqiao Wang. Vision-R1: Evolving human-free alignment in large vision-language models via vision-guided reinforcement learning. arXiv preprint arXiv:2503.18013, 2025.

  39. [40] Jingyi Zhang, Jiaxing Huang, Huanjin Yao, Shunyu Liu, Xikun Zhang, Shijian Lu, and Dacheng Tao. R1-VL: Learning to reason with multimodal large language models via step-wise group relative policy optimization. arXiv preprint arXiv:2503.12937, 2025.

  40. [41] Yuanhan Zhang, Jinming Wu, Wei Li, Bo Li, Zejun Ma, Ziwei Liu, and Chunyuan Li. Video instruction tuning with synthetic data. arXiv preprint arXiv:2410.02713, 2024.

  41. [42] Jiaxing Zhao, Xihan Wei, and Liefeng Bo. R1-Omni: Explainable omni-multimodal emotion recognition with reinforcement learning. arXiv e-prints, pages arXiv–2503, 2025.

  42. [43] Hengguang Zhou, Xirui Li, Ruochen Wang, Minhao Cheng, Tianyi Zhou, and Cho-Jui Hsieh. R1-Zero's "aha moment" in visual reasoning on a 2B non-SFT model. arXiv preprint arXiv:2503.05132, 2025.