{"record_type":"pith_number_record","schema_url":"https://pith.science/schemas/pith-number/v1.json","pith_number":"pith:2025:O672LEEW77T7WB3SZGNDBUGL4U","short_pith_number":"pith:O672LEEW","schema_version":"1.0","canonical_sha256":"77bfa59096ffe7fb0772c99a30d0cbe50abc4ccafe74fad7fbc453e9469792ba","source":{"kind":"arxiv","id":"2504.11468","version":1},"attestation_state":"computed","paper":{"title":"SFT or RL? An Early Investigation into Training R1-Like Reasoning Large Vision-Language Models","license":"http://creativecommons.org/licenses/by/4.0/","headline":"SFT induces pseudo reasoning paths that undermine subsequent RL in vision-language models.","cross_cats":[],"primary_cat":"cs.CL","authors_text":"Cihang Xie, Fali Wang, Haoqin Tu, Hardy Chen, Hui Liu, Xianfeng Tang, Xinya Du, Yuyin Zhou","submitted_at":"2025-04-10T16:54:05Z","abstract_excerpt":"This work revisits the dominant supervised fine-tuning (SFT) then reinforcement learning (RL) paradigm for training Large Vision-Language Models (LVLMs), and reveals a key finding: SFT can significantly undermine subsequent RL by inducing ``pseudo reasoning paths'' imitated from expert models. While these paths may resemble the native reasoning paths of RL models, they often involve prolonged, hesitant, less informative steps, and incorrect reasoning. To systematically study this effect, we introduce VLAA-Thinking, a new multimodal dataset designed to support reasoning in LVLMs. Constructed vi"},"verification_status":{"content_addressed":true,"pith_receipt":true,"author_attested":false,"weak_author_claims":0,"strong_author_claims":0,"externally_anchored":false,"storage_verified":false,"citation_signatures":0,"replication_records":0,"graph_snapshot":true,"references_resolved":true,"formal_links_present":true},"canonical_record":{"source":{"id":"2504.11468","kind":"arxiv","version":1},"metadata":{"license":"http://creativecommons.org/licenses/by/4.0/","primary_cat":"cs.CL","submitted_at":"2025-04-10T16:54:05Z","cross_cats_sorted":[],"title_canon_sha256":"91c06de3452239265422fec0bb7bfc6768f80afb481f73766f6cdd37c08d11ad","abstract_canon_sha256":"86f110acf42ce70033c15b9dfc44bfdb0cb15e4c6844ca5669f276b2aac0b858"},"schema_version":"1.0"},"receipt":{"kind":"pith_receipt","key_id":"pith-v1-2026-05","algorithm":"ed25519","signed_at":"2026-05-17T23:38:13.672935Z","signature_b64":"5q30JkiO9QO8J2Fl/lhJFM6CYwLag8DC+bzC3tSfWQ19AvESTX9V7O8NxQkmiN/hpRtOvfjtnirxN83BmmnKBw==","signed_message":"canonical_sha256_bytes","builder_version":"pith-number-builder-2026-05-17-v1","receipt_version":"0.3","canonical_sha256":"77bfa59096ffe7fb0772c99a30d0cbe50abc4ccafe74fad7fbc453e9469792ba","last_reissued_at":"2026-05-17T23:38:13.672223Z","signature_status":"signed_v1","first_computed_at":"2026-05-17T23:38:13.672223Z","public_key_fingerprint":"8d4b5ee74e4693bcd1df2446408b0d54"},"graph_snapshot":{"paper":{"title":"SFT or RL? An Early Investigation into Training R1-Like Reasoning Large Vision-Language Models","license":"http://creativecommons.org/licenses/by/4.0/","headline":"SFT induces pseudo reasoning paths that undermine subsequent RL in vision-language models.","cross_cats":[],"primary_cat":"cs.CL","authors_text":"Cihang Xie, Fali Wang, Haoqin Tu, Hardy Chen, Hui Liu, Xianfeng Tang, Xinya Du, Yuyin Zhou","submitted_at":"2025-04-10T16:54:05Z","abstract_excerpt":"This work revisits the dominant supervised fine-tuning (SFT) then reinforcement learning (RL) paradigm for training Large Vision-Language Models (LVLMs), and reveals a key finding: SFT can significantly undermine subsequent RL by inducing ``pseudo reasoning paths'' imitated from expert models. While these paths may resemble the native reasoning paths of RL models, they often involve prolonged, hesitant, less informative steps, and incorrect reasoning. To systematically study this effect, we introduce VLAA-Thinking, a new multimodal dataset designed to support reasoning in LVLMs. Constructed vi"},"claims":{"count":4,"items":[{"kind":"strongest_claim","text":"SFT can significantly undermine subsequent RL by inducing ``pseudo reasoning paths'' imitated from expert models. While these paths may resemble the native reasoning paths of RL models, they often involve prolonged, hesitant, less informative steps, and incorrect reasoning.","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"That the performance gap between SFT-then-RL and RL-only is caused by the induction of pseudo-reasoning paths rather than differences in data difficulty, reward design, or training hyperparameters.","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"SFT induces pseudo-reasoning paths that undermine RL in LVLMs, while RL with GRPO and mixed perception-cognition rewards on the new VLAA-Thinking dataset produces more genuine reasoning and top leaderboard performance.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"},{"kind":"headline","text":"SFT induces pseudo reasoning paths that undermine subsequent RL in vision-language models.","source":"verdict.pith_extraction.headline","status":"machine_extracted","claim_id":"C4","attestation":"unclaimed"}],"snapshot_sha256":"0c4e0d2b8c80bd84a211c982cab46aa5df07177372c2d37d3e131b8d316876d0"},"source":{"id":"2504.11468","kind":"arxiv","version":1},"verdict":{"id":"8058202a-4513-4352-9ac4-d91df8e07a1e","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-17T15:39:06.693023Z","strongest_claim":"SFT can significantly undermine subsequent RL by inducing ``pseudo reasoning paths'' imitated from expert models. While these paths may resemble the native reasoning paths of RL models, they often involve prolonged, hesitant, less informative steps, and incorrect reasoning.","one_line_summary":"SFT induces pseudo-reasoning paths that undermine RL in LVLMs, while RL with GRPO and mixed perception-cognition rewards on the new VLAA-Thinking dataset produces more genuine reasoning and top leaderboard performance.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"That the performance gap between SFT-then-RL and RL-only is caused by the induction of pseudo-reasoning paths rather than differences in data difficulty, reward design, or training hyperparameters.","pith_extraction_headline":"SFT induces pseudo reasoning paths that undermine subsequent RL in vision-language models."},"references":{"count":25,"sample":[{"doi":"","year":null,"title":"**Replace references to “description”, “caption” and ”rationale”** with wording that references **“the image.”** - For example, “The description says...” could become “The image shows...” - “The capti","work_id":"fa56cd8a-a83b-40cd-9646-70a2cc4490bf","ref_index":1,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":null,"title":"**Preserve all line breaks, punctuation, and spacing** as much as possible, and make **no additional edits** outside of these replacements","work_id":"13af225f-5c58-4177-9c35-0db9c08151ec","ref_index":2,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2024,"title":"—— Here is the input: {input} Figure 10: Prompt for answer rewriting with GPT-4-Turbo","work_id":"1653d069-e272-4934-8d7e-7d6e41093a32","ref_index":3,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":null,"title":"MathVista: The Test Mini split of MathVista dataset; overall accuracy","work_id":"90c903b1-c672-419c-9a3e-2efc1e1f4146","ref_index":4,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":null,"title":"MathVision: The Full test set of MathVision; overall accuracy","work_id":"112a63fe-d265-48a4-9f9c-3414d617e282","ref_index":5,"cited_arxiv_id":"","is_internal_anchor":false}],"resolved_work":25,"snapshot_sha256":"9be76a11b87e06e3feac0f739f610cb7455062d0e26108ff56a02b72ec75f4c3","internal_anchors":0},"formal_canon":{"evidence_count":2,"snapshot_sha256":"2a4fb0a1a501ae463f2a94eb803bc456bfeb34635e8584518cdf2579a99a56dc"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"},"aliases":[{"alias_kind":"arxiv","alias_value":"2504.11468","created_at":"2026-05-17T23:38:13.672334+00:00"},{"alias_kind":"arxiv_version","alias_value":"2504.11468v1","created_at":"2026-05-17T23:38:13.672334+00:00"},{"alias_kind":"doi","alias_value":"10.48550/arxiv.2504.11468","created_at":"2026-05-17T23:38:13.672334+00:00"},{"alias_kind":"pith_short_12","alias_value":"O672LEEW77T7","created_at":"2026-05-18T12:33:37.589309+00:00"},{"alias_kind":"pith_short_16","alias_value":"O672LEEW77T7WB3S","created_at":"2026-05-18T12:33:37.589309+00:00"},{"alias_kind":"pith_short_8","alias_value":"O672LEEW","created_at":"2026-05-18T12:33:37.589309+00:00"}],"events":[],"event_summary":{},"paper_claims":[],"inbound_citations":{"count":20,"internal_anchor_count":20,"sample":[{"citing_arxiv_id":"2509.18847","citing_title":"Failure Makes the Agent Stronger: Enhancing Accuracy through Structured Reflection for Reliable Tool Interactions","ref_index":9,"is_internal_anchor":true},{"citing_arxiv_id":"2509.22746","citing_title":"Mixture-of-Visual-Thoughts: Exploring Context-Adaptive Reasoning Mode Selection for General Visual Reasoning","ref_index":5,"is_internal_anchor":true},{"citing_arxiv_id":"2511.17652","citing_title":"TeamPath: Building MultiModal Pathology Experts with Reasoning AI Copilots","ref_index":62,"is_internal_anchor":true},{"citing_arxiv_id":"2507.02592","citing_title":"WebSailor: Navigating Super-human Reasoning for Web Agent","ref_index":3,"is_internal_anchor":true},{"citing_arxiv_id":"2511.22396","citing_title":"Asking like Socrates: Socrates helps VLMs understand remote sensing images","ref_index":3,"is_internal_anchor":true},{"citing_arxiv_id":"2512.12623","citing_title":"Reasoning Within the Mind: Dynamic Multimodal Interleaving in Latent Space","ref_index":51,"is_internal_anchor":true},{"citing_arxiv_id":"2511.05271","citing_title":"DeepEyesV2: Toward Agentic Multimodal Model","ref_index":5,"is_internal_anchor":true},{"citing_arxiv_id":"2509.24251","citing_title":"Latent Visual Reasoning","ref_index":3,"is_internal_anchor":true},{"citing_arxiv_id":"2603.19500","citing_title":"Teaching an Agent to Sketch One Part at a Time","ref_index":5,"is_internal_anchor":true},{"citing_arxiv_id":"2507.21046","citing_title":"A Survey of Self-Evolving Agents: What, When, How, and Where to Evolve on the Path to Artificial Super Intelligence","ref_index":240,"is_internal_anchor":true},{"citing_arxiv_id":"2605.13230","citing_title":"Teacher-Guided Policy Optimization for LLM Distillation","ref_index":1,"is_internal_anchor":true},{"citing_arxiv_id":"2604.03179","citing_title":"Understanding the Role of Hallucination in Reinforcement Post-Training of Multimodal Reasoning Models","ref_index":3,"is_internal_anchor":true},{"citing_arxiv_id":"2605.12004","citing_title":"Learning Agentic Policy from Action Guidance","ref_index":4,"is_internal_anchor":true},{"citing_arxiv_id":"2605.08817","citing_title":"How You Begin is How You Reason: Driving Exploration in RLVR via Prefix-Tuned Priors","ref_index":3,"is_internal_anchor":true},{"citing_arxiv_id":"2604.22498","citing_title":"CGC: Compositional Grounded Contrast for Fine-Grained Multi-Image Understanding","ref_index":6,"is_internal_anchor":true},{"citing_arxiv_id":"2604.08545","citing_title":"Act Wisely: Cultivating Meta-Cognitive Tool Use in Agentic Multimodal Models","ref_index":4,"is_internal_anchor":true},{"citing_arxiv_id":"2604.05117","citing_title":"Watch Before You Answer: Learning from Visually Grounded Post-Training","ref_index":12,"is_internal_anchor":true},{"citing_arxiv_id":"2604.18839","citing_title":"One Step Forward and K Steps Back: Better Reasoning with Denoising Recursion Models","ref_index":144,"is_internal_anchor":true},{"citing_arxiv_id":"2604.15306","citing_title":"Generalization in LLM Problem Solving: The Case of the Shortest Path","ref_index":9,"is_internal_anchor":true},{"citing_arxiv_id":"2604.17614","citing_title":"Characterizing Model-Native Skills","ref_index":76,"is_internal_anchor":true}]},"formal_canon":{"evidence_count":2,"sample":[],"anchors":[]},"links":{"html":"https://pith.science/pith/O672LEEW77T7WB3SZGNDBUGL4U","json":"https://pith.science/pith/O672LEEW77T7WB3SZGNDBUGL4U.json","graph_json":"https://pith.science/api/pith-number/O672LEEW77T7WB3SZGNDBUGL4U/graph.json","events_json":"https://pith.science/api/pith-number/O672LEEW77T7WB3SZGNDBUGL4U/events.json","paper":"https://pith.science/paper/O672LEEW"},"agent_actions":{"view_html":"https://pith.science/pith/O672LEEW77T7WB3SZGNDBUGL4U","download_json":"https://pith.science/pith/O672LEEW77T7WB3SZGNDBUGL4U.json","view_paper":"https://pith.science/paper/O672LEEW","resolve_alias":"https://pith.science/api/pith-number/resolve?arxiv=2504.11468&json=true","fetch_graph":"https://pith.science/api/pith-number/O672LEEW77T7WB3SZGNDBUGL4U/graph.json","fetch_events":"https://pith.science/api/pith-number/O672LEEW77T7WB3SZGNDBUGL4U/events.json","actions":{"anchor_timestamp":"https://pith.science/pith/O672LEEW77T7WB3SZGNDBUGL4U/action/timestamp_anchor","attest_storage":"https://pith.science/pith/O672LEEW77T7WB3SZGNDBUGL4U/action/storage_attestation","attest_author":"https://pith.science/pith/O672LEEW77T7WB3SZGNDBUGL4U/action/author_attestation","sign_citation":"https://pith.science/pith/O672LEEW77T7WB3SZGNDBUGL4U/action/citation_signature","submit_replication":"https://pith.science/pith/O672LEEW77T7WB3SZGNDBUGL4U/action/replication_record"}},"created_at":"2026-05-17T23:38:13.672334+00:00","updated_at":"2026-05-17T23:38:13.672334+00:00"}