Recognition: 2 theorem links · Lean Theorem
DriveVLA-W0: World Models Amplify Data Scaling Law in Autonomous Driving
Pith reviewed 2026-05-17 06:42 UTC · model grok-4.3
The pith
Adding world modeling to predict future images lets vision-language-action models use large driving datasets more effectively and accelerate performance gains as data scales.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
DriveVLA-W0 remedies the supervision deficit in VLA models by training them to predict future images through world modeling. This supplies a dense self-supervised signal for learning driving dynamics and amplifies the data scaling law: performance improvements accelerate as the training dataset grows.
What carries the argument
The world modeling task of predicting future images, which generates a dense self-supervised signal to capture driving environment dynamics and is paired with a lightweight action expert for deployment.
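As a concrete reading of that mechanism, here is a minimal PyTorch-style sketch of joint training on sparse action labels plus a dense future-image target. The module names (vla_backbone, world_head, action_expert), the loss weight lambda_wm, and the MSE loss forms are illustrative assumptions, not the paper's actual implementation:

import torch.nn.functional as F

def training_step(vla_backbone, world_head, action_expert,
                  frames, language, future_frame, expert_action,
                  lambda_wm=1.0):
    # Shared representation from the VLA backbone.
    h = vla_backbone(frames, language)
    # Dense self-supervised signal: predict the future camera image
    # (or its tokens/features, depending on the VLA archetype).
    loss_world = F.mse_loss(world_head(h), future_frame)
    # Sparse supervised signal: the low-dimensional expert action.
    loss_action = F.mse_loss(action_expert(h), expert_action)
    return loss_action + lambda_wm * loss_world

The point of the construction is that loss_world supervises every pixel (or token) of the predicted future frame, while loss_action supervises only a few trajectory values per sample.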
If this is right
- The method outperforms standard BEV and VLA baselines on NAVSIM v1/v2 and large in-house datasets.
- Performance gains accelerate rather than plateau as the training dataset size increases.
- The same world modeling approach works for both discrete-token autoregressive VLAs and continuous-feature diffusion VLAs (a sketch of the two head types follows this list).
- A lightweight action expert enables real-time inference while preserving the benefits of world modeling pretraining.
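Following up on the dual-instantiation point, a minimal sketch of what the two world-model heads could look like; the module names, shapes, and the rectified-flow-style noising path are assumptions for illustration, not the paper's architecture:

import torch
import torch.nn as nn
import torch.nn.functional as F

class ARWorldHead(nn.Module):
    # For VLAs over discrete visual tokens: predict next-image tokens
    # autoregressively with a cross-entropy loss.
    def __init__(self, d_model, vocab_size):
        super().__init__()
        self.to_logits = nn.Linear(d_model, vocab_size)

    def loss(self, h, future_tokens):
        logits = self.to_logits(h)  # (batch, tokens, vocab)
        return F.cross_entropy(logits.flatten(0, 1), future_tokens.flatten())

class DiffusionWorldHead(nn.Module):
    # For VLAs over continuous visual features: denoise a noised future
    # feature map conditioned on the backbone state (rectified-flow style).
    def __init__(self, d_model):
        super().__init__()
        self.denoiser = nn.Sequential(
            nn.Linear(2 * d_model, d_model), nn.GELU(),
            nn.Linear(d_model, d_model))

    def loss(self, h, future_feat):
        noise = torch.randn_like(future_feat)
        t = torch.rand(future_feat.size(0), 1, 1)    # random flow time
        x_t = (1 - t) * future_feat + t * noise      # straight-line path
        v_pred = self.denoiser(torch.cat([h, x_t], dim=-1))
        return F.mse_loss(v_pred, noise - future_feat)  # velocity target

Both heads consume the same backbone state h, which is what would let one training paradigm serve both VLA archetypes.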
Where Pith is reading between the lines
- Similar auxiliary prediction tasks could address supervision deficits in other large multimodal models beyond driving.
- Autonomous driving development might require fewer expensive action labels if world modeling is used to bootstrap representations.
- The amplified scaling could be tested by varying the prediction horizon or scene complexity to find optimal world modeling setups.
Load-bearing premise
Predicting future images supplies a dense, unbiased self-supervised signal that engages unused model capacity without introducing new failure modes or biases in driving dynamics.
What would settle it
Train identical VLA architectures with and without the world modeling objective on progressively larger subsets of the same driving data and measure whether the performance advantage of the world modeling version widens as dataset size increases.
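A minimal sketch of the analysis step of that settling experiment, with invented numbers standing in for measured scores; the saturating power-law form and its parameters are assumptions:

import numpy as np
from scipy.optimize import curve_fit

# Fraction of the training set used, and hypothetical held-out driving
# scores for a plain VLA and a world-model-augmented one.
fractions = np.array([0.1, 0.25, 0.5, 1.0])
score_vla = np.array([78.0, 80.5, 82.0, 83.2])  # illustrative only
score_w0 = np.array([78.5, 82.0, 85.5, 89.0])   # illustrative only

def power_law(n, a, b, c):
    # Saturating power law: score(n) = c - a * n^(-b).
    return c - a * n ** (-b)

for name, y in [("baseline VLA", score_vla), ("DriveVLA-W0", score_w0)]:
    (a, b, c), _ = curve_fit(power_law, fractions, y,
                             p0=(5.0, 0.5, 90.0), maxfev=10000)
    print(f"{name}: fitted exponent b = {b:.2f}")

A reliably larger fitted exponent for the world-modeling variant, under matched compute, is what "amplified scaling" would look like here; a constant offset alone would not support the claim.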
Original abstract
Scaling Vision-Language-Action (VLA) models on large-scale data offers a promising path to achieving a more generalized driving intelligence. However, VLA models are limited by a "supervision deficit": the vast model capacity is supervised by sparse, low-dimensional actions, leaving much of their representational power underutilized. To remedy this, we propose DriveVLA-W0, a training paradigm that employs world modeling to predict future images. This task generates a dense, self-supervised signal that compels the model to learn the underlying dynamics of the driving environment. We showcase the paradigm's versatility by instantiating it for two dominant VLA archetypes: an autoregressive world model for VLAs that use discrete visual tokens, and a diffusion world model for those operating on continuous visual features. Building on the rich representations learned from world modeling, we introduce a lightweight action expert to address the inference latency for real-time deployment. Extensive experiments on the NAVSIM v1/v2 benchmark and a 680x larger in-house dataset demonstrate that DriveVLA-W0 significantly outperforms BEV and VLA baselines. Crucially, it amplifies the data scaling law, showing that performance gains accelerate as the training dataset size increases.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes DriveVLA-W0, a training paradigm that augments Vision-Language-Action (VLA) models for autonomous driving by adding a world-modeling objective to predict future images. This supplies a dense self-supervised signal to mitigate the supervision deficit from sparse action labels. The paradigm is instantiated for both autoregressive (discrete-token) and diffusion (continuous-feature) VLAs; a lightweight action expert is added for low-latency inference. Experiments on NAVSIM v1/v2 and a 680x larger in-house dataset report outperformance over BEV and standard VLA baselines, with the central claim that the approach amplifies the data scaling law by accelerating performance gains as dataset size increases.
Significance. If the scaling-amplification claim is substantiated, the work would offer a practical route to better utilizing model capacity on large driving datasets without extra labels, potentially improving data efficiency and generalization in embodied AI. The dual instantiation across VLA architectures and the addition of the action expert for deployment are pragmatic strengths. The result would be of interest to the autonomous-driving and scaling-laws communities provided the comparative scaling evidence is supplied.
major comments (2)
- [Experiments] Experiments / scaling results: The central claim that DriveVLA-W0 'amplifies the data scaling law, showing that performance gains accelerate as the training dataset size increases' (Abstract) is load-bearing for the title and contribution. The manuscript must report performance at multiple dataset scales (e.g., 10%, 50%, 100% of the in-house set) for both DriveVLA-W0 and matched VLA baselines, with either raw curves or fitted exponents demonstrating a steeper slope under the world-modeling regime. Full-scale results alone do not isolate the amplification effect from differences in total training compute, token count, or capacity.
- [Method] §3 (World Model Instantiation): The assumption that the future-image prediction task supplies an unbiased, dense signal that meaningfully utilizes unused capacity without introducing new failure modes in driving dynamics is stated but not quantitatively validated. Ablations isolating the world-modeling loss weight, its effect on action prediction accuracy, and any introduced dynamics bias are required to support the 'remedy' narrative.
minor comments (2)
- [Abstract] Abstract: The phrase '680x larger in-house dataset' should be accompanied by absolute sizes or a reference to the baseline dataset used for the multiplier.
- [Method] Notation: Define the precise form of the combined loss (world-model + action) and any weighting hyper-parameters at first use rather than deferring to supplementary material.
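One plausible form of the combined objective the referee asks to have defined, written here as an assumption rather than the manuscript's actual notation:

\mathcal{L}_{\text{total}} = \mathcal{L}_{\text{action}} + \lambda \, \mathcal{L}_{\text{WM}}

where \mathcal{L}_{\text{WM}} is the future-image prediction loss (cross-entropy over discrete visual tokens for the autoregressive instantiation, a denoising objective for the diffusion one) and \lambda is the weighting hyper-parameter whose value the referee wants reported at first use.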
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. The comments highlight important ways to strengthen the evidence for our central claims. We respond to each major comment below and will revise the manuscript accordingly.
Point-by-point responses
- Referee: [Experiments] Experiments / scaling results: The central claim that DriveVLA-W0 'amplifies the data scaling law, showing that performance gains accelerate as the training dataset size increases' (Abstract) is load-bearing for the title and contribution. The manuscript must report performance at multiple dataset scales (e.g., 10%, 50%, 100% of the in-house set) for both DriveVLA-W0 and matched VLA baselines, with either raw curves or fitted exponents demonstrating a steeper slope under the world-modeling regime. Full-scale results alone do not isolate the amplification effect from differences in total training compute, token count, or capacity.
  Authors: We agree that isolating the amplification effect requires explicit scaling curves across dataset sizes. The original manuscript demonstrates gains on NAVSIM (smaller scale) versus the full 680× in-house dataset and compares against matched VLA baselines, but does not include intermediate fractions of the in-house set. In the revision we will add experiments training both DriveVLA-W0 and the baseline VLA on 10%, 50%, and 100% subsets of the in-house data under matched compute budgets. We will report the resulting performance curves together with fitted scaling exponents to quantify the steeper slope under the world-modeling regime. revision: yes
- Referee: [Method] §3 (World Model Instantiation): The assumption that the future-image prediction task supplies an unbiased, dense signal that meaningfully utilizes unused capacity without introducing new failure modes in driving dynamics is stated but not quantitatively validated. Ablations isolating the world-modeling loss weight, its effect on action prediction accuracy, and any introduced dynamics bias are required to support the 'remedy' narrative.
  Authors: We acknowledge that quantitative validation of the world-modeling objective is needed. In the revised manuscript we will include ablations that vary the world-modeling loss weight while measuring (i) action-prediction accuracy on held-out planning metrics and (ii) potential dynamics bias via closed-loop trajectory consistency and collision rates. These results will show that the added dense signal improves representation utilization without measurable degradation in driving dynamics. revision: yes
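A minimal sketch of the promised ablation protocol; train_and_eval is a hypothetical helper standing in for the full training pipeline, and the metric names are assumptions:

def run_lambda_ablation(train_and_eval, weights=(0.0, 0.1, 0.5, 1.0, 2.0)):
    # Sweep the world-modeling loss weight; lambda_wm = 0.0 recovers the
    # plain VLA baseline for the same architecture and data.
    results = {}
    for lam in weights:
        metrics = train_and_eval(lambda_wm=lam)
        results[lam] = {
            "pdms": metrics["pdms"],                          # planning accuracy
            "traj_consistency": metrics["traj_consistency"],  # dynamics bias (i)
            "collision_rate": metrics["collision_rate"],      # dynamics bias (ii)
        }
    return results

Reporting such a sweep would separate the contribution of the dense signal from any bias it introduces into driving dynamics.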
Circularity Check
Empirical training paradigm with external benchmarks exhibits no derivation circularity
Full rationale
The paper describes an empirical training paradigm (DriveVLA-W0) that adds world modeling for future image prediction to VLA models, followed by a lightweight action expert, and reports performance on NAVSIM v1/v2 plus a 680x in-house dataset. The scaling-law amplification claim rests on observed performance trends across dataset sizes in these external benchmarks rather than any equations, fitted parameters renamed as predictions, or self-citation chains that reduce the result to its own inputs by construction. No self-definitional steps, ansatzes, or uniqueness theorems appear in the provided text; the work is self-contained against independent benchmarks.
Lean theorems connected to this paper
- IndisputableMonolith.Foundation.LawOfExistence.defect_zero_iff_one (tag: echoes)
  ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.
  Matched passage: "This task generates a dense, self-supervised signal that compels the model to learn the underlying dynamics of the driving environment."
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 21 Pith papers
- MindVLA-U1: VLA Beats VA with Unified Streaming Architecture for Autonomous Driving
  MindVLA-U1 introduces a unified streaming VLA with shared backbone, framewise memory, and language-guided action diffusion that surpasses human drivers on WOD-E2E planning metrics.
- VECTOR-Drive: Tightly Coupled Vision-Language and Trajectory Expert Routing for End-to-End Autonomous Driving
  VECTOR-DRIVE couples vision-language reasoning and trajectory planning in one Transformer via semantic expert routing and flow-matching, reaching 88.91 driving score on Bench2Drive.
- ReflectDrive-2: Reinforcement-Learning-Aligned Self-Editing for Discrete Diffusion Driving
  ReflectDrive-2 achieves 91.0 PDMS on NAVSIM with camera input by training a discrete diffusion model to self-edit trajectories via RL-aligned AutoEdit.
- Driving Intents Amplify Planning-Oriented Reinforcement Learning
  DIAL uses intent-conditioned CFG and multi-intent GRPO to expand and preserve diverse modes in continuous-action preference RL, lifting RFS to 9.14 and surpassing both prior best (8.5) and human demonstration (8.13).
- MindVLA-U1: VLA Beats VA with Unified Streaming Architecture for Autonomous Driving
  MindVLA-U1 is the first unified streaming VLA architecture that surpasses human drivers on WOD-E2E planning metrics while matching VA latency and preserving language interfaces.
- CoWorld-VLA: Thinking in a Multi-Expert World Model for Autonomous Driving
  CoWorld-VLA encodes world information into four expert tokens that condition a diffusion-based planner, yielding competitive collision avoidance and trajectory accuracy on the NAVSIM benchmark.
- CoWorld-VLA: Thinking in a Multi-Expert World Model for Autonomous Driving
  CoWorld-VLA extracts semantic, geometric, dynamic, and trajectory expert tokens from multi-source supervision and feeds them into a diffusion-based hierarchical planner, achieving competitive collision avoidance and t...
- DriveFuture: Future-Aware Latent World Models for Autonomous Driving
  DriveFuture achieves SOTA results on NAVSIM by conditioning latent world model states on future predictions to directly inform trajectory planning.
- ReflectDrive-2: Reinforcement-Learning-Aligned Self-Editing for Discrete Diffusion Driving
  ReflectDrive-2 combines masked discrete diffusion with RL-aligned self-editing to generate and refine driving trajectories, reaching 91.0 PDMS on NAVSIM camera-only and 94.8 in best-of-6.
- Infrastructure-Centric World Models: Bridging Temporal Depth and Spatial Breadth for Roadside Perception
  Infrastructure-centric world models use roadside sensors' temporal depth to complement vehicle spatial breadth for better traffic simulation and prediction.
- SceneScribe-1M: A Large-Scale Video Dataset with Comprehensive Geometric and Semantic Annotations
  SceneScribe-1M is a new dataset of 1 million videos with semantic text, camera parameters, dense depth, and consistent 3D point tracks to support monocular depth estimation, scene reconstruction, point tracking, and t...
- ExploreVLA: Dense World Modeling and Exploration for End-to-End Autonomous Driving
  ExploreVLA augments VLA driving models with future RGB and depth prediction for dense supervision and uses prediction uncertainty as a safety-gated intrinsic reward for RL-based exploration, reaching SOTA PDMS 93.7 on NAVSIM.
- DVGT-2: Vision-Geometry-Action Model for Autonomous Driving at Scale
  DVGT-2 is a streaming vision-geometry-action model that jointly reconstructs dense 3D geometry and plans trajectories online, achieving better reconstruction than prior batch methods while transferring directly to pla...
- OpenVO: Open-World Visual Odometry with Temporal Dynamics Awareness
  OpenVO estimates ego-motion from monocular dashcam footage with varying observation rates and uncalibrated cameras by encoding temporal dynamics in a two-frame regression framework and using 3D priors from foundation ...
- DriveLaW: Unifying Planning and Video Generation in a Latent Driving World
  DriveLaW unifies video world modeling and trajectory planning by injecting video-generator latents into a diffusion planner, achieving SOTA video prediction and a new record on the NAVSIM planning benchmark.
- SimScale: Learning to Drive via Real-World Simulation at Scale
  SimScale synthesizes unseen driving states from real logs via neural rendering and reactive environments, generates pseudo-expert trajectories, and shows that co-training on real plus simulated data improves planning ...
- EponaV2: Driving World Model with Comprehensive Future Reasoning
  EponaV2 advances perception-free driving world models by forecasting comprehensive future 3D geometry and semantic representations, achieving SOTA planning performance on NAVSIM benchmarks.
- Driving Intents Amplify Planning-Oriented Reinforcement Learning
  DIAL expands continuous-action driving policies via intent-conditioned flow matching and multi-intent GRPO, lifting best-of-N preference scores above human demonstrations for the first time on WOD-E2E.
- CRAFT: Counterfactual-to-Interactive Reinforcement Fine-Tuning for Driving Policies
  CRAFT is an on-policy RL fine-tuning framework that decomposes closed-loop policy gradients into a group-normalized counterfactual proxy plus residual correction from interaction events, achieving top closed-loop perf...
- SpanVLA: Efficient Action Bridging and Learning from Negative-Recovery Samples for Vision-Language-Action Model
  SpanVLA reduces action generation latency via flow-matching conditioned on history and improves robustness by training on negative-recovery samples with GRPO and a dedicated reasoning dataset.
- OmniVLA-RL: A Vision-Language-Action Model with Spatial Understanding and Online RL
  OmniVLA-RL uses a mix-of-transformers architecture and flow-matching reformulated as SDE with group segmented policy optimization to surpass prior VLA models on LIBERO benchmarks.
Reference graph
Works this paper leans on
- [1] Hidehisa Arai, Keita Miwa, Kento Sasaki, Kohei Watanabe, Yu Yamaguchi, Shunsuke Aoki, and Issei Yamamoto. CoVLA: Comprehensive vision-language-action dataset for autonomous driving. In 2025 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), pp. 1933–1943. IEEE, 2025.
- [2] Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, Humen Zhong, Yuanzhi Zhu, Mingkun Yang, Zhaohai Li, Jianqiang Wan, Pengfei Wang, Wei Ding, Zheren Fu, Yiheng Xu, Jiabo Ye, Xi Zhang, Tianbao Xie, Zesen Cheng, Hang Zhang, Zhibo Yang, Haiyang Xu, and Junyang Lin. Qwen2.5-VL technical report. ar...
- [3] Mustafa Baniodeh, Kratarth Goel, Scott Ettinger, Carlos Fuertes, Ari Seff, Tim Shen, Cole Gulino, Chenjie Yang, Ghassen Jerfel, Dokook Choe, et al. Scaling laws of motion forecasting and planning - a technical report. arXiv preprint arXiv:2506.08228.
- [4] Florent Bartoccioni, Elias Ramzi, Victor Besnier, Shashanka Venkataramanan, Tuan-Hung Vu, Yihong Xu, Loick Chambon, Spyros Gidaris, Serkan Odabas, David Hurych, et al. VaViM and VaVAM: Autonomous driving through video generative modeling. arXiv preprint arXiv:2502.15672.
- [5] π0: A Vision-Language-Action Flow Model for General Robot Control. arXiv preprint arXiv:2410.24164.
- [6] Holger Caesar, Juraj Kabzan, Kok Seang Tan, Whye Kit Fong, Eric Wolff, Alex Lang, Luke Fletcher, Oscar Beijbom, and Sammy Omari. nuPlan: A closed-loop ML-based planning benchmark for autonomous vehicles. arXiv preprint arXiv:2106.11810.
- [7] Wei Cao, Marcel Hallgarten, Tianyu Li, Daniel Dauner, Xunjiang Gu, Caojun Wang, Yakov Miron, Marco Aiello, Hongyang Li, Igor Gilitschenski, et al. Pseudo-simulation for autonomous driving. arXiv preprint arXiv:2506.04218.
- [8] Jun Cen, Chaohui Yu, Hangjie Yuan, Yuming Jiang, Siteng Huang, Jiayan Guo, Xin Li, Yibing Song, Hao Luo, Fan Wang, et al. WorldVLA: Towards autoregressive action world model. arXiv preprint arXiv:2506.21539.
- [9] Lihong Chen, Hossein Hassani, and Soodeh Nikan. TS-VLM: Text-guided SoftSort pooling for vision-language models in multi-view driving reasoning. arXiv preprint arXiv:2505.12670.
- [10] Yuntao Chen, Yuqi Wang, and Zhaoxiang Zhang. DrivingGPT: Unifying driving world modeling and planning with multi-modal autoregressive transformers. arXiv preprint arXiv:2412.18607.
- [11] Daniel Dauner, Marcel Hallgarten, Tianyu Li, Xinshuo Weng, Zhiyu Huang, Zetong Yang, Hongyang Li, Igor Gilitschenski, Boris Ivanovic, Marco Pavone, et al. NAVSIM: Data-driven non-reactive autonomous vehicle simulation and benchmarking. arXiv preprint arXiv:2406.15349.
- [12] Cunxin Fan, Xiaosong Jia, Yihang Sun, Yixiao Wang, Jianglan Wei, Ziyang Gong, Xiangyu Zhao, Masayoshi Tomizuka, Xue Yang, Junchi Yan, et al. Interleave-VLA: Enhancing robot manipulation with interleaved image-text instructions. arXiv preprint arXiv:2505.02152.
- [13] Renju Feng, Ning Xi, Duanfeng Chu, Rukang Wang, Zejian Deng, Anzheng Wang, Liping Lu, Jinxiang Wang, and Yanjun Huang. Artemis: Autoregressive end-to-end trajectory planning with mixture of experts for autonomous driving. arXiv preprint arXiv:2504.19580.
- [14] Haoyu Fu, Diankun Zhang, Zongchuang Zhao, Jianfeng Cui, Dingkang Liang, Chong Zhang, Dingyuan Zhang, Hongwei Xie, Bing Wang, and Xiang Bai. ORION: A holistic end-to-end autonomous driving framework by vision-language instructed action generation. arXiv preprint arXiv:2503.19755.
- [15] Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, et al. Training compute-optimal large language models. arXiv preprint arXiv:2203.15556.
- [16] Anthony Hu, Lloyd Russell, Hudson Yeo, Zak Murez, George Fedoseev, Alex Kendall, Jamie Shotton, and Gianluca Corrado. GAIA-1: A generative world model for autonomous driving. arXiv preprint arXiv:2309.17080.
- [17] ST-P3: End-to-end vision-based autonomous driving via spatial-temporal feature learning. Yihan Hu, Jiazhi Yang, Li Chen, Keyu Li, Chonghao Sima, Xizhou Zhu, Siqi Chai, Senyao Du, Tianwei Lin, Wenhai Wang, et al. Goal-oriented autonomous driving. arXiv preprint arXiv:2212.10156.
- [18] Junjie Huang, Guan Huang, Zheng Zhu, and Dalong Du. BEVDet: High-performance multi-camera 3D object detection in bird-eye-view. arXiv preprint arXiv:2112.11790.
- [19] Jyh-Jing Hwang, Runsheng Xu, Hubert Lin, Wei-Chih Hung, Jingwei Ji, Kristy Choi, Di Huang, Tong He, Paul Covington, Benjamin Sapp, et al. EMMA: End-to-end multimodal model for autonomous driving. arXiv preprint arXiv:2410.23262.
- [20] Physical Intelligence, Kevin Black, Noah Brown, James Darpinian, Karan Dhabalia, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, et al. π0.5: A vision-language-action model with open-world generalization. arXiv preprint arXiv:2504.16054.
- [21] Anqing Jiang, Yu Gao, Zhigang Sun, Yiru Wang, Jijun Wang, Jinghao Chai, Qian Cao, Yuweng Heng, Hao Jiang, Yunda Dong, et al. DiffVLA: Vision-language guided diffusion planning for autonomous driving. arXiv preprint arXiv:2505.19381.
- [22] Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B. Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361.
- [23] Yingyan Li, Lue Fan, Jiawei He, Yuqi Wang, Yuntao Chen, Zhaoxiang Zhang, and Tieniu Tan. Enhancing end-to-end autonomous driving with latent world model. arXiv preprint arXiv:2406.08481, 2024a. Yingyan Li, Yuqi Wang, Yang Liu, Jiawei He, Lue Fan, and Zhaoxiang Zhang. End-to-end driving with online trajectory evaluation via BEV world model. arXiv preprint ...
- [24] Xingchao Liu, Chengyue Gong, and Qiang Liu. Flow straight and fast: Learning to generate and transfer data with rectified flow. arXiv preprint arXiv:2209.03003.
- [25] Karl Pertsch, Kyle Stachowicz, Brian Ichter, Danny Driess, Suraj Nair, Quan Vuong, Oier Mees, Chelsea Finn, and Sergey Levine. FAST: Efficient action tokenization for vision-language-action models. arXiv preprint arXiv:2501.09747.
- [26] Katrin Renz, Long Chen, Ana-Maria Marcu, Jan Hünermann, Benoit Hanotte, Alice Karnsund, Jamie Shotton, Elahe Arani, and Oleg Sinavski. CarLLaVA: Vision language models for camera-only closed-loop driving. arXiv preprint arXiv:2406.10165.
- [27] Oriane Siméoni, Huy V. Vo, Maximilian Seitzer, Federico Baldassarre, Maxime Oquab, Cijo Jose, Vasil Khalidov, Marc Szafraniec, Seungeun Yi, Michaël Ramamonjisoa, et al. DINOv3. arXiv preprint arXiv:2508.10104.
- [28] Qiao Sun, Shiduo Zhang, Danjiao Ma, Jingzhe Shi, Derun Li, Simian Luo, Yu Wang, Ningyi Xu, Guangzhi Cao, and Hang Zhao. Large trajectory models are scalable motion predictors and planners. arXiv preprint arXiv:2310.19620.
- [29] Haochen Wang, Anlin Zheng, Yucheng Zhao, Tiancai Wang, Zheng Ge, Xiangyu Zhang, and Zhaoxiang Zhang. Reconstructive visual instruction tuning. arXiv preprint arXiv:2410.09575, 2024a. Xinlong Wang, Xiaosong Zhang, Zhengxiong Luo, Quan Sun, Yufeng Cui, Jinsheng Wang, Fan Zhang, Yueze Wang, Zhen Li, Qiying Yu, et al. Emu3: Next-token prediction is all you nee...
- [30] Zhenjie Yang, Yilin Chai, Xiaosong Jia, Qifeng Li, Yuqian Shao, Xuekai Zhu, Haisheng Su, and Junchi Yan. DriveMoE: Mixture-of-experts for vision-language-action model in end-to-end autonomous driving. arXiv preprint arXiv:2505.16278.
- [31] Wenhao Yao, Zhenxin Li, Shiyi Lan, Zi Wang, Xinglong Sun, Jose M. Alvarez, and Zuxuan Wu. DriveSuprim: Towards precise trajectory selection for end-to-end planning. arXiv preprint arXiv:2506.06659.
- [32] Jianhao Yuan, Shuyang Sun, Daniel Omeiza, Bo Zhao, Paul Newman, Lars Kunze, and Matthew Gadd. RAG-Driver: Generalisable driving explanations with retrieval-augmented in-context learning in multi-modal large language model. arXiv preprint arXiv:2402.10828.
- [33] Shuang Zeng, Xinyuan Chang, Mengwei Xie, Xinran Liu, Yifan Bai, Zheng Pan, Mu Xu, Xing Wei, and Ning Guo. FutureSightDrive: Thinking visually with spatio-temporal CoT for autonomous driving. arXiv preprint arXiv:2505.17685.
- [34] Jiawei Zhang, Xuan Yang, Taiqi Wang, Yu Yao, Aleksandr Petiushko, and Bo Li. SafeAuto: Knowledge-enhanced safe autonomous driving with multimodal foundation models. arXiv preprint arXiv:2503.00211.
- [35] Lunjun Zhang, Yuwen Xiong, Ze Yang, Sergio Casas, Rui Hu, and Raquel Urtasun. Learning unsupervised world models for autonomous driving via discrete diffusion. arXiv preprint arXiv:2311.01017.
- [36] Wenzhao Zheng, Zetian Xia, Yuanhui Huang, Sicheng Zuo, Jie Zhou, and Jiwen Lu. Doe-1: Closed-loop autonomous driving with large world model. arXiv preprint arXiv:2412.09627, 2024a. Yupeng Zheng, Zhongpu Xia, Qichao Zhang, Teng Zhang, Ben Lu, Xiaochuang Huo, Chao Han, Yixian Li, Mengjie Yu, Bu Jin, et al. Preliminary investigation into data scaling laws fo...
- [37] ...and evaluate their final PDMS scores. The results confirm a strong positive correlation: the 6V A checkpoint, which had superior generative fidelity, also achieves higher planning performance after fine-tuning. This provides compelling evidence that the model's ability to generate high-quality, realistic future images is directly linked to its capacity for...
- [38] ...then showed many LMs were undertrained and derived a compute-optimal prescription that scales model size and tokens proportionally. In computer vision, Zhai et al. (2022) charted ViT scaling law with stable training recipes, and ViT-22B (Dehghani et al., ...
- [39] ...scaled ViTs to 22B parameters, verifying predictable multi-task improvements. Lin et al. (2025) conducted a large-scale study of imitation-learning data scaling in robotics, and found near power-law gains from increasing environmental and object diversity with improved zero-shot generalization. In autonomous driving, STR (Sun et al., ...
- [40] ...shows large trajectory models scale steadily in both prediction and planning, and Baniodeh et al. (2025) reported power-law improvements for joint motion forecasting and planning with large driving datasets. For end-to-end driving, Naumann et al. (2025) observe roughly log-linear gains in both open- and closed-loop metrics as training data scale increases...