{"paper":{"title":"AlphaDrive: Unleashing the Power of VLMs in Autonomous Driving via Reinforcement Learning and Reasoning","license":"http://creativecommons.org/licenses/by-nc-sa/4.0/","headline":"Reinforcement learning with tailored rewards and a two-stage strategy improves vision-language models for autonomous driving planning.","cross_cats":["cs.RO"],"primary_cat":"cs.CV","authors_text":"Bo Jiang, Qian Zhang, Shaoyu Chen, Wenyu Liu, Xinggang Wang","submitted_at":"2025-03-10T17:59:42Z","abstract_excerpt":"OpenAI o1 and DeepSeek R1 achieve or even surpass human expert-level performance in complex domains like mathematics and science, with reinforcement learning (RL) and reasoning playing a crucial role. In autonomous driving, recent end-to-end models have greatly improved planning performance but still struggle with long-tailed problems due to limited common sense and reasoning abilities. Some studies integrate vision-language models (VLMs) into autonomous driving, but they typically rely on pre-trained models with simple supervised fine-tuning (SFT) on driving data, without further exploration "},"claims":{"count":4,"items":[{"kind":"strongest_claim","text":"AlphaDrive significantly improves both planning performance and training efficiency compared to using only SFT or without reasoning, and following RL training exhibits emergent multimodal planning capabilities.","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"That the four GRPO-based RL rewards and two-stage training strategy produce generalizable, safe improvements on real-world driving data rather than overfitting to the training distribution.","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"AlphaDrive uses GRPO-based RL rewards and two-stage SFT+RL training on VLMs to improve autonomous driving planning performance 
and efficiency while producing emergent multimodal capabilities.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"},{"kind":"headline","text":"Reinforcement learning with tailored rewards and a two-stage strategy improves vision-language models for autonomous driving planning.","source":"verdict.pith_extraction.headline","status":"machine_extracted","claim_id":"C4","attestation":"unclaimed"}],"snapshot_sha256":"0f635f6f218a8fe37b2ed36a843b8b0ed939e4aa0dd1f623f4fea8311164831e"},"source":{"id":"2503.07608","kind":"arxiv","version":1},"verdict":{"id":"8b8c67fa-e1a6-49b9-b03a-3a5416fde7cf","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-16T20:03:30.788130Z","strongest_claim":"AlphaDrive significantly improves both planning performance and training efficiency compared to using only SFT or without reasoning, and following RL training exhibits emergent multimodal planning capabilities.","one_line_summary":"AlphaDrive uses GRPO-based RL rewards and two-stage SFT+RL training on VLMs to improve autonomous driving planning performance and efficiency while producing emergent multimodal capabilities.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"That the four GRPO-based RL rewards and two-stage training strategy produce generalizable, safe improvements on real-world driving data rather than overfitting to the training distribution.","pith_extraction_headline":"Reinforcement learning with tailored rewards and a two-stage strategy improves vision-language models for autonomous driving planning."},"references":{"count":49,"sample":[{"doi":"","year":null,"title":"GPT-4 Technical Report","work_id":"b928e041-6991-4c08-8c81-0359e4097c7b","ref_index":1,"cited_arxiv_id":"2303.08774","is_internal_anchor":true},{"doi":"","year":2022,"title":"Flamingo: a visual language model for few-shot 
learning","work_id":"2d29aa49-7f72-4532-8c66-e33ed3d6d8a8","ref_index":2,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2023,"title":"Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond","work_id":"cbc2bb21-b6bb-46c0-80bf-107e195ffe10","ref_index":3,"cited_arxiv_id":"2308.12966","is_internal_anchor":true},{"doi":"","year":2005,"title":"Meteor: An automatic metric for mt evaluation with improved correlation with human judgments","work_id":"90de9967-cc22-427f-91fb-ed50f063376c","ref_index":4,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2023,"title":"RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control","work_id":"ff438a8a-8003-4fae-9131-acd418b3597b","ref_index":5,"cited_arxiv_id":"2307.15818","is_internal_anchor":true}],"resolved_work":49,"snapshot_sha256":"a80272b7b288b32d0835e4f514004ed6eafe32c0ace5e6a9eca451ddc76446f5","internal_anchors":16},"formal_canon":{"evidence_count":2,"snapshot_sha256":"4050fc7d31763a3a0ce57228bfdcbd91b98fa13dafc1b3731d305dae64b84142"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"}