{"paper":{"title":"VideoChat-R1: Enhancing Spatio-Temporal Perception via Reinforcement Fine-Tuning","license":"http://creativecommons.org/licenses/by/4.0/","headline":"Reinforcement fine-tuning with rule-based temporal rewards creates a video model with state-of-the-art spatio-temporal perception.","cross_cats":[],"primary_cat":"cs.CV","authors_text":"Desen Meng, Limin Wang, Lu Dong, Xiangyu Zeng, Xinhao Li, Yali Wang, Yinan He, Yi Wang, Yu Qiao, Ziang Yan","submitted_at":"2025-04-09T15:09:27Z","abstract_excerpt":"Reinforcement Learning (RL) benefits Large Language Models (LLMs) for complex reasoning. Inspired by this, we explore integrating spatio-temporal specific rewards into Multimodal Large Language Models (MLLMs) to address the unique challenges of video understanding, such as long-range temporal associations. This paper investigates how rule-based rewards, particularly temporal ones, can improve video reasoning and their generalizability. Our study proposes Reinforcement Fine-Tuning (RFT) as a data-efficient method to enhance video reasoning on specific tasks without sacrificing original capabili"},"claims":{"count":4,"items":[{"kind":"strongest_claim","text":"Through joint RFT on multiple spatio-temporal perception tasks, we developed VideoChat-R1, a powerful Video MLLM. 
VideoChat-R1 achieves state-of-the-art spatio-temporal perception, demonstrating significant improvements in tasks like temporal grounding (+31.8) and object tracking (+31.2), while also improving general QA benchmarks.","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"That carefully designed rule-based rewards focused on temporal associations will produce generalizable improvements in video reasoning without post-hoc tuning or hidden data selection that inflates the reported deltas.","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"Reinforcement fine-tuning with temporal rewards produces VideoChat-R1, a video MLLM showing large gains on spatio-temporal perception benchmarks such as +31.8 temporal grounding and +31.2 object tracking.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"},{"kind":"headline","text":"Reinforcement fine-tuning with rule-based temporal rewards creates a video model with state-of-the-art spatio-temporal perception.","source":"verdict.pith_extraction.headline","status":"machine_extracted","claim_id":"C4","attestation":"unclaimed"}],"snapshot_sha256":"1e6ba8c30e20bfae80cc583289f8d278c090c847f62641c85a531daaed64360d"},"source":{"id":"2504.06958","kind":"arxiv","version":5},"verdict":{"id":"46600adc-b342-4d2b-93fa-f09c53eabfa1","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-15T20:52:13.629443Z","strongest_claim":"Through joint RFT on multiple spatio-temporal perception tasks, we developed VideoChat-R1, a powerful Video MLLM. 
VideoChat-R1 achieves state-of-the-art spatio-temporal perception, demonstrating significant improvements in tasks like temporal grounding (+31.8) and object tracking (+31.2), while also improving general QA benchmarks.","one_line_summary":"Reinforcement fine-tuning with temporal rewards produces VideoChat-R1, a video MLLM showing large gains on spatio-temporal perception benchmarks such as +31.8 temporal grounding and +31.2 object tracking.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"That carefully designed rule-based rewards focused on temporal associations will produce generalizable improvements in video reasoning without post-hoc tuning or hidden data selection that inflates the reported deltas.","pith_extraction_headline":"Reinforcement fine-tuning with rule-based temporal rewards creates a video model with state-of-the-art spatio-temporal perception."},"references":{"count":42,"sample":[{"doi":"","year":2025,"title":"Qwen2.5-VL Technical Report","work_id":"69dffacb-bfe8-442d-be86-48624c60426f","ref_index":1,"cited_arxiv_id":"2502.13923","is_internal_anchor":true},{"doi":"","year":2024,"title":"FlashVTG: Feature layering and adaptive score handling network for video temporal grounding","work_id":"c5f0f767-bf04-40e1-894c-327d95458eb7","ref_index":2,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2025,"title":"Boosting the generalization and reasoning of vision language models with curriculum reinforcement learning","work_id":"45c3a58c-d62f-4f0b-84d6-98d7c29563e4","ref_index":3,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2025,"title":"OpenVLThinker: Complex vision-language reasoning via iterative SFT-RL cycles","work_id":"de4c64e1-82b1-4f70-8311-a3539e7bf400","ref_index":4,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2025,"title":"Video-R1: Reinforcing Video Reasoning in 
MLLMs","work_id":"0ce88332-564c-4361-8e2a-3850eb1ace9c","ref_index":5,"cited_arxiv_id":"2503.21776","is_internal_anchor":true}],"resolved_work":42,"snapshot_sha256":"5d9ce39e683f9d81d550955ea6fc3b6eb7b34451bcb19dbec72ab096f15e4a6e","internal_anchors":19},"formal_canon":{"evidence_count":2,"snapshot_sha256":"c514041f82825babf008d72b0e9474228497e59a03ee864cd9356814755bc085"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"}