{"paper":{"title":"SFT or RL? An Early Investigation into Training R1-Like Reasoning Large Vision-Language Models","license":"http://creativecommons.org/licenses/by/4.0/","headline":"SFT induces pseudo reasoning paths that undermine subsequent RL in vision-language models.","cross_cats":[],"primary_cat":"cs.CL","authors_text":"Cihang Xie, Fali Wang, Haoqin Tu, Hardy Chen, Hui Liu, Xianfeng Tang, Xinya Du, Yuyin Zhou","submitted_at":"2025-04-10T16:54:05Z","abstract_excerpt":"This work revisits the dominant supervised fine-tuning (SFT) then reinforcement learning (RL) paradigm for training Large Vision-Language Models (LVLMs), and reveals a key finding: SFT can significantly undermine subsequent RL by inducing ``pseudo reasoning paths'' imitated from expert models. While these paths may resemble the native reasoning paths of RL models, they often involve prolonged, hesitant, less informative steps, and incorrect reasoning. To systematically study this effect, we introduce VLAA-Thinking, a new multimodal dataset designed to support reasoning in LVLMs. Constructed vi"},"claims":{"count":4,"items":[{"kind":"strongest_claim","text":"SFT can significantly undermine subsequent RL by inducing ``pseudo reasoning paths'' imitated from expert models. While these paths may resemble the native reasoning paths of RL models, they often involve prolonged, hesitant, less informative steps, and incorrect reasoning.","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"That the performance gap between SFT-then-RL and RL-only is caused by the induction of pseudo-reasoning paths rather than differences in data difficulty, reward design, or training hyperparameters.","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"SFT induces pseudo-reasoning paths that undermine RL in LVLMs, while RL with GRPO and mixed perception-cognition rewards on the new VLAA-Thinking dataset produces more genuine reasoning and top leaderboard performance.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"},{"kind":"headline","text":"SFT induces pseudo reasoning paths that undermine subsequent RL in vision-language models.","source":"verdict.pith_extraction.headline","status":"machine_extracted","claim_id":"C4","attestation":"unclaimed"}],"snapshot_sha256":"0c4e0d2b8c80bd84a211c982cab46aa5df07177372c2d37d3e131b8d316876d0"},"source":{"id":"2504.11468","kind":"arxiv","version":1},"verdict":{"id":"8058202a-4513-4352-9ac4-d91df8e07a1e","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-17T15:39:06.693023Z","strongest_claim":"SFT can significantly undermine subsequent RL by inducing ``pseudo reasoning paths'' imitated from expert models. While these paths may resemble the native reasoning paths of RL models, they often involve prolonged, hesitant, less informative steps, and incorrect reasoning.","one_line_summary":"SFT induces pseudo-reasoning paths that undermine RL in LVLMs, while RL with GRPO and mixed perception-cognition rewards on the new VLAA-Thinking dataset produces more genuine reasoning and top leaderboard performance.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"That the performance gap between SFT-then-RL and RL-only is caused by the induction of pseudo-reasoning paths rather than differences in data difficulty, reward design, or training hyperparameters.","pith_extraction_headline":"SFT induces pseudo reasoning paths that undermine subsequent RL in vision-language models."},"references":{"count":25,"sample":[{"doi":"","year":null,"title":"**Replace references to “description”, “caption” and ”rationale”** with wording that references **“the image.”** - For example, “The description says...” could become “The image shows...” - “The capti","work_id":"fa56cd8a-a83b-40cd-9646-70a2cc4490bf","ref_index":1,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":null,"title":"**Preserve all line breaks, punctuation, and spacing** as much as possible, and make **no additional edits** outside of these replacements","work_id":"13af225f-5c58-4177-9c35-0db9c08151ec","ref_index":2,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2024,"title":"—— Here is the input: {input} Figure 10: Prompt for answer rewriting with GPT-4-Turbo","work_id":"1653d069-e272-4934-8d7e-7d6e41093a32","ref_index":3,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":null,"title":"MathVista: The Test Mini split of MathVista dataset; overall accuracy","work_id":"90c903b1-c672-419c-9a3e-2efc1e1f4146","ref_index":4,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":null,"title":"MathVision: The Full test set of MathVision; overall accuracy","work_id":"112a63fe-d265-48a4-9f9c-3414d617e282","ref_index":5,"cited_arxiv_id":"","is_internal_anchor":false}],"resolved_work":25,"snapshot_sha256":"9be76a11b87e06e3feac0f739f610cb7455062d0e26108ff56a02b72ec75f4c3","internal_anchors":0},"formal_canon":{"evidence_count":2,"snapshot_sha256":"2a4fb0a1a501ae463f2a94eb803bc456bfeb34635e8584518cdf2579a99a56dc"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"}