{"paper":{"title":"DriveVLA-W0: World Models Amplify Data Scaling Law in Autonomous Driving","license":"http://creativecommons.org/licenses/by/4.0/","headline":"Adding world modeling to predict future images lets vision-language-action models use large driving datasets more effectively and accelerate performance gains as data scales.","cross_cats":["cs.AI"],"primary_cat":"cs.CV","authors_text":"Bing Zhan, Chufeng Tang, Haochen Wang, Lue Fan, Lu Hou, Shuyao Shang, Weisong Liu, Xiaoman Wang, Yasong An, Yingyan Li, Yuntao Chen, Yuqi Wang, Zhaoxiang Zhang","submitted_at":"2025-10-14T17:59:47Z","abstract_excerpt":"Scaling Vision-Language-Action (VLA) models on large-scale data offers a promising path to achieving a more generalized driving intelligence. However, VLA models are limited by a ``supervision deficit'': the vast model capacity is supervised by sparse, low-dimensional actions, leaving much of their representational power underutilized. To remedy this, we propose \\textbf{DriveVLA-W0}, a training paradigm that employs world modeling to predict future images. This task generates a dense, self-supervised signal that compels the model to learn the underlying dynamics of the driving environment. We "},"claims":{"count":4,"items":[{"kind":"strongest_claim","text":"we propose DriveVLA-W0, a training paradigm that employs world modeling to predict future images. ... 
Crucially, it amplifies the data scaling law, showing that performance gains accelerate as the training dataset size increases.","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"That the added world modeling task of predicting future images supplies a dense, unbiased self-supervised signal that meaningfully utilizes unused model capacity without requiring extra labels or introducing new failure modes in driving dynamics.","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"DriveVLA-W0 adds world modeling to predict future images in VLA models, overcoming sparse action supervision and amplifying data scaling laws on NAVSIM benchmarks and a large in-house dataset.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"},{"kind":"headline","text":"Adding world modeling to predict future images lets vision-language-action models use large driving datasets more effectively and accelerate performance gains as data scales.","source":"verdict.pith_extraction.headline","status":"machine_extracted","claim_id":"C4","attestation":"unclaimed"}],"snapshot_sha256":"b1bf9098c357bc593fcd6e5671323d349b63e1254a40e7a4893235c251225b2a"},"source":{"id":"2510.12796","kind":"arxiv","version":2},"verdict":{"id":"cf6793d1-473e-4e13-9da8-1ff37c4b0c32","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-17T06:42:20.341660Z","strongest_claim":"we propose DriveVLA-W0, a training paradigm that employs world modeling to predict future images. ... 
Crucially, it amplifies the data scaling law, showing that performance gains accelerate as the training dataset size increases.","one_line_summary":"DriveVLA-W0 adds world modeling to predict future images in VLA models, overcoming sparse action supervision and amplifying data scaling laws on NAVSIM benchmarks and a large in-house dataset.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"That the added world modeling task of predicting future images supplies a dense, unbiased self-supervised signal that meaningfully utilizes unused model capacity without requiring extra labels or introducing new failure modes in driving dynamics.","pith_extraction_headline":"Adding world modeling to predict future images lets vision-language-action models use large driving datasets more effectively and accelerate performance gains as data scales."},"references":{"count":39,"sample":[{"doi":"","year":2025,"title":"Covla: Comprehensive vision-language-action dataset for autonomous driving","work_id":"6783599d-5a5e-4a21-a4b0-e2ed5582cf30","ref_index":1,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":null,"title":"Qwen2.5-VL Technical Report","work_id":"69dffacb-bfe8-442d-be86-48624c60426f","ref_index":2,"cited_arxiv_id":"2502.13923","is_internal_anchor":true},{"doi":"","year":null,"title":"Scaling Laws of Motion Forecasting and Planning – Technical Report","work_id":"dfe35d03-cdc6-4f73-940e-1ae1ceb82a53","ref_index":3,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":null,"title":"Vavim and vavam: Autonomous driving through video generative modeling","work_id":"b75ea66d-dafb-43ec-9345-f50eb3d615e2","ref_index":4,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":null,"title":"$\\pi_0$: A Vision-Language-Action Flow Model for General Robot 
Control","work_id":"f790abdc-a796-482f-a40d-f8ee035ecfc2","ref_index":6,"cited_arxiv_id":"2410.24164","is_internal_anchor":true}],"resolved_work":39,"snapshot_sha256":"81a2968186f58f7db9dd6e0c27c178cccb4f6c53537cbf5a001d03d57aa9f2f7","internal_anchors":15},"formal_canon":{"evidence_count":2,"snapshot_sha256":"854227cfb28bcd6bbc663def7c2d1a8978711bed00d69c4e32b3263629f9c637"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"}