{"paper":{"title":"R1-VL: Learning to Reason with Multimodal Large Language Models via Step-wise Group Relative Policy Optimization","license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","headline":"Step-wise reinforcement learning enables multimodal models to improve their own reasoning beyond imitation.","cross_cats":["cs.CL","cs.CV","cs.LG"],"primary_cat":"cs.AI","authors_text":"Dacheng Tao, Huanjin Yao, Jiaxing Huang, Jingyi Zhang, Shijian Lu, Shunyu Liu, Xikun Zhang","submitted_at":"2025-03-17T08:51:44Z","abstract_excerpt":"Recent studies generally enhance MLLMs' reasoning capabilities via supervised fine-tuning on high-quality chain-of-thought reasoning data, which often leads models to merely imitate successful reasoning paths without understanding what the wrong reasoning paths are. In this work, we aim to enhance the MLLMs' reasoning ability beyond passively imitating positive reasoning paths. To this end, we design Step-wise Group Relative Policy Optimization (StepGRPO), a new online reinforcement learning framework that enables MLLMs to self-improve reasoning ability via simple, effective and dense step-wis"},"claims":{"count":4,"items":[{"kind":"strongest_claim","text":"With the proposed StepGRPO, we introduce R1-VL, a series of MLLMs with outstanding capabilities in step-by-step reasoning. 
Extensive experiments over 8 benchmarks demonstrate the superiority of our methods.","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"The rule-based StepRAR and StepRVR rewards accurately identify necessary and logically sound reasoning steps without introducing bias or rewarding superficial patterns that do not reflect true understanding.","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"R1-VL uses StepGRPO with rule-based StepRAR and StepRVR rewards to let MLLMs learn step-by-step reasoning beyond imitation of positive paths.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"},{"kind":"headline","text":"Step-wise reinforcement learning enables multimodal models to improve their own reasoning beyond imitation.","source":"verdict.pith_extraction.headline","status":"machine_extracted","claim_id":"C4","attestation":"unclaimed"}],"snapshot_sha256":"9981077fccd724e6bf512dd79966b64c92c25c7eca8f6570b9e047e9a0354caf"},"source":{"id":"2503.12937","kind":"arxiv","version":2},"verdict":{"id":"0ec518c3-bb2d-4666-9be2-a66ec6fe1948","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-16T14:59:54.216667Z","strongest_claim":"With the proposed StepGRPO, we introduce R1-VL, a series of MLLMs with outstanding capabilities in step-by-step reasoning. 
Extensive experiments over 8 benchmarks demonstrate the superiority of our methods.","one_line_summary":"R1-VL uses StepGRPO with rule-based StepRAR and StepRVR rewards to let MLLMs learn step-by-step reasoning beyond imitation of positive paths.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"The rule-based StepRAR and StepRVR rewards accurately identify necessary and logically sound reasoning steps without introducing bias or rewarding superficial patterns that do not reflect true understanding.","pith_extraction_headline":"Step-wise reinforcement learning enables multimodal models to improve their own reasoning beyond imitation."},"references":{"count":57,"sample":[{"doi":"","year":2024,"title":"Claude 3.5 sonnet, 2024","work_id":"b72d9c68-3a94-4a1a-92b1-eb1a97352e5f","ref_index":1,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2025,"title":"Qwen2.5-VL Technical Report","work_id":"69dffacb-bfe8-442d-be86-48624c60426f","ref_index":2,"cited_arxiv_id":"2502.13923","is_internal_anchor":true},{"doi":"","year":2022,"title":"Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback","work_id":"a1f2574b-a899-4713-be60-c87ba332656c","ref_index":3,"cited_arxiv_id":"2204.05862","is_internal_anchor":true},{"doi":"","year":1901,"title":"Language models are few-shot learners","work_id":"5b23bebc-10b7-4150-9a97-e3f37825079e","ref_index":4,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2024,"title":"arXiv preprint arXiv:2406.10858 , 
year=","work_id":"2523ac3a-94a4-4667-a2ce-de8ecadb2936","ref_index":5,"cited_arxiv_id":"","is_internal_anchor":false}],"resolved_work":57,"snapshot_sha256":"c7bcb0b619aa1a5c59c53dd876f2931341fd9459117f9da8da3b08a10c942e14","internal_anchors":24},"formal_canon":{"evidence_count":3,"snapshot_sha256":"33bd4268d721bb47534d28f93bb8333b277175ca0a5870078de40cdc32ad31a5"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"}