Recognition: 2 theorem links · Lean Theorem
DriveVLA-W0: World Models Amplify Data Scaling Law in Autonomous Driving
Pith reviewed 2026-05-17 06:42 UTC · model grok-4.3
The pith
Adding world modeling to predict future images lets vision-language-action models use large driving datasets more effectively and accelerate performance gains as data scales.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
DriveVLA-W0 remedies the supervision deficit in VLA models by training them to predict future images through world modeling. This supplies a dense self-supervised signal for learning driving dynamics and amplifies the data scaling law: performance improvements accelerate as the training dataset grows.
What carries the argument
The world modeling task of predicting future images, which generates a dense self-supervised signal to capture driving environment dynamics and is paired with a lightweight action expert for deployment.
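As a concrete reading of that mechanism, here is a minimal PyTorch-style sketch of joint training on sparse action labels plus a dense future-image target. The module names (vla_backbone, world_head, action_expert), the loss weight lambda_wm, and the MSE loss forms are illustrative assumptions, not the paper's actual implementation:

import torch.nn.functional as F

def training_step(vla_backbone, world_head, action_expert,
                  frames, language, future_frame, expert_action,
                  lambda_wm=1.0):
    # Shared representation from the VLA backbone.
    h = vla_backbone(frames, language)
    # Dense self-supervised signal: predict the future camera image
    # (or its tokens/features, depending on the VLA archetype).
    loss_world = F.mse_loss(world_head(h), future_frame)
    # Sparse supervised signal: the low-dimensional expert action.
    loss_action = F.mse_loss(action_expert(h), expert_action)
    return loss_action + lambda_wm * loss_world

The point of the construction is that loss_world supervises every pixel (or token) of the predicted future frame, while loss_action supervises only a few trajectory values per sample.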
If this is right
- The method outperforms standard BEV and VLA baselines on NAVSIM v1/v2 and large in-house datasets.
- Performance gains accelerate rather than plateau as the training dataset size increases.
- The same world modeling approach works for both discrete-token autoregressive VLAs and continuous-feature diffusion VLAs (a sketch of the two head types follows this list).
- A lightweight action expert enables real-time inference while preserving the benefits of world modeling pretraining.
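Following up on the dual-instantiation point, a minimal sketch of what the two world-model heads could look like; the module names, shapes, and the rectified-flow-style noising path are assumptions for illustration, not the paper's architecture:

import torch
import torch.nn as nn
import torch.nn.functional as F

class ARWorldHead(nn.Module):
    # For VLAs over discrete visual tokens: predict next-image tokens
    # autoregressively with a cross-entropy loss.
    def __init__(self, d_model, vocab_size):
        super().__init__()
        self.to_logits = nn.Linear(d_model, vocab_size)

    def loss(self, h, future_tokens):
        logits = self.to_logits(h)  # (batch, tokens, vocab)
        return F.cross_entropy(logits.flatten(0, 1), future_tokens.flatten())

class DiffusionWorldHead(nn.Module):
    # For VLAs over continuous visual features: denoise a noised future
    # feature map conditioned on the backbone state (rectified-flow style).
    def __init__(self, d_model):
        super().__init__()
        self.denoiser = nn.Sequential(
            nn.Linear(2 * d_model, d_model), nn.GELU(),
            nn.Linear(d_model, d_model))

    def loss(self, h, future_feat):
        noise = torch.randn_like(future_feat)
        t = torch.rand(future_feat.size(0), 1, 1)    # random flow time
        x_t = (1 - t) * future_feat + t * noise      # straight-line path
        v_pred = self.denoiser(torch.cat([h, x_t], dim=-1))
        return F.mse_loss(v_pred, noise - future_feat)  # velocity target

Both heads consume the same backbone state h, which is what would let one training paradigm serve both VLA archetypes.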
Where Pith is reading between the lines
- Similar auxiliary prediction tasks could address supervision deficits in other large multimodal models beyond driving.
- Autonomous driving development might require fewer expensive action labels if world modeling is used to bootstrap representations.
- The amplified scaling could be tested by varying the prediction horizon or scene complexity to find optimal world modeling setups.
Load-bearing premise
Predicting future images supplies a dense, unbiased self-supervised signal that engages unused model capacity without introducing new failure modes or biases in driving dynamics.
What would settle it
Train identical VLA architectures with and without the world modeling objective on progressively larger subsets of the same driving data and measure whether the performance advantage of the world modeling version widens as dataset size increases.
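A minimal sketch of the analysis step of that settling experiment, with invented numbers standing in for measured scores; the saturating power-law form and its parameters are assumptions:

import numpy as np
from scipy.optimize import curve_fit

# Fraction of the training set used, and hypothetical held-out driving
# scores for a plain VLA and a world-model-augmented one.
fractions = np.array([0.1, 0.25, 0.5, 1.0])
score_vla = np.array([78.0, 80.5, 82.0, 83.2])  # illustrative only
score_w0 = np.array([78.5, 82.0, 85.5, 89.0])   # illustrative only

def power_law(n, a, b, c):
    # Saturating power law: score(n) = c - a * n^(-b).
    return c - a * n ** (-b)

for name, y in [("baseline VLA", score_vla), ("DriveVLA-W0", score_w0)]:
    (a, b, c), _ = curve_fit(power_law, fractions, y,
                             p0=(5.0, 0.5, 90.0), maxfev=10000)
    print(f"{name}: fitted exponent b = {b:.2f}")

A reliably larger fitted exponent for the world-modeling variant, under matched compute, is what "amplified scaling" would look like here; a constant offset alone would not support the claim.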
Original abstract
Scaling Vision-Language-Action (VLA) models on large-scale data offers a promising path to achieving a more generalized driving intelligence. However, VLA models are limited by a "supervision deficit": the vast model capacity is supervised by sparse, low-dimensional actions, leaving much of their representational power underutilized. To remedy this, we propose DriveVLA-W0, a training paradigm that employs world modeling to predict future images. This task generates a dense, self-supervised signal that compels the model to learn the underlying dynamics of the driving environment. We showcase the paradigm's versatility by instantiating it for two dominant VLA archetypes: an autoregressive world model for VLAs that use discrete visual tokens, and a diffusion world model for those operating on continuous visual features. Building on the rich representations learned from world modeling, we introduce a lightweight action expert to address the inference latency for real-time deployment. Extensive experiments on the NAVSIM v1/v2 benchmark and a 680x larger in-house dataset demonstrate that DriveVLA-W0 significantly outperforms BEV and VLA baselines. Crucially, it amplifies the data scaling law, showing that performance gains accelerate as the training dataset size increases.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes DriveVLA-W0, a training paradigm that augments Vision-Language-Action (VLA) models for autonomous driving by adding a world-modeling objective to predict future images. This supplies a dense self-supervised signal to mitigate the supervision deficit from sparse action labels. The paradigm is instantiated for both autoregressive (discrete-token) and diffusion (continuous-feature) VLAs; a lightweight action expert is added for low-latency inference. Experiments on NAVSIM v1/v2 and a 680x larger in-house dataset report outperformance over BEV and standard VLA baselines, with the central claim that the approach amplifies the data scaling law by accelerating performance gains as dataset size increases.
Significance. If the scaling-amplification claim is substantiated, the work would offer a practical route to better utilizing model capacity on large driving datasets without extra labels, potentially improving data efficiency and generalization in embodied AI. The dual instantiation across VLA architectures and the addition of the action expert for deployment are pragmatic strengths. The result would be of interest to the autonomous-driving and scaling-laws communities provided the comparative scaling evidence is supplied.
major comments (2)
- [Experiments] Experiments / scaling results: The central claim that DriveVLA-W0 'amplifies the data scaling law, showing that performance gains accelerate as the training dataset size increases' (Abstract) is load-bearing for the title and contribution. The manuscript must report performance at multiple dataset scales (e.g., 10%, 50%, 100% of the in-house set) for both DriveVLA-W0 and matched VLA baselines, with either raw curves or fitted exponents demonstrating a steeper slope under the world-modeling regime. Full-scale results alone do not isolate the amplification effect from differences in total training compute, token count, or capacity.
- [Method] §3 (World Model Instantiation): The assumption that the future-image prediction task supplies an unbiased, dense signal that meaningfully utilizes unused capacity without introducing new failure modes in driving dynamics is stated but not quantitatively validated. Ablations isolating the world-modeling loss weight, its effect on action prediction accuracy, and any introduced dynamics bias are required to support the 'remedy' narrative.
minor comments (2)
- [Abstract] Abstract: The phrase '680x larger in-house dataset' should be accompanied by absolute sizes or a reference to the baseline dataset used for the multiplier.
- [Method] Notation: Define the precise form of the combined loss (world-model + action) and any weighting hyper-parameters at first use rather than deferring to supplementary material.
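One plausible form of the combined objective the referee asks to have defined, written here as an assumption rather than the manuscript's actual notation:

\mathcal{L}_{\text{total}} = \mathcal{L}_{\text{action}} + \lambda \, \mathcal{L}_{\text{WM}}

where \mathcal{L}_{\text{WM}} is the future-image prediction loss (cross-entropy over discrete visual tokens for the autoregressive instantiation, a denoising objective for the diffusion one) and \lambda is the weighting hyper-parameter whose value the referee wants reported at first use.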
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. The comments highlight important ways to strengthen the evidence for our central claims. We respond to each major comment below and will revise the manuscript accordingly.
Point-by-point responses
- Referee: [Experiments] Experiments / scaling results: The central claim that DriveVLA-W0 'amplifies the data scaling law, showing that performance gains accelerate as the training dataset size increases' (Abstract) is load-bearing for the title and contribution. The manuscript must report performance at multiple dataset scales (e.g., 10%, 50%, 100% of the in-house set) for both DriveVLA-W0 and matched VLA baselines, with either raw curves or fitted exponents demonstrating a steeper slope under the world-modeling regime. Full-scale results alone do not isolate the amplification effect from differences in total training compute, token count, or capacity.
  Authors: We agree that isolating the amplification effect requires explicit scaling curves across dataset sizes. The original manuscript demonstrates gains on NAVSIM (smaller scale) versus the full 680× in-house dataset and compares against matched VLA baselines, but does not include intermediate fractions of the in-house set. In the revision we will add experiments training both DriveVLA-W0 and the baseline VLA on 10%, 50%, and 100% subsets of the in-house data under matched compute budgets. We will report the resulting performance curves together with fitted scaling exponents to quantify the steeper slope under the world-modeling regime. revision: yes
- Referee: [Method] §3 (World Model Instantiation): The assumption that the future-image prediction task supplies an unbiased, dense signal that meaningfully utilizes unused capacity without introducing new failure modes in driving dynamics is stated but not quantitatively validated. Ablations isolating the world-modeling loss weight, its effect on action prediction accuracy, and any introduced dynamics bias are required to support the 'remedy' narrative.
  Authors: We acknowledge that quantitative validation of the world-modeling objective is needed. In the revised manuscript we will include ablations that vary the world-modeling loss weight while measuring (i) action-prediction accuracy on held-out planning metrics and (ii) potential dynamics bias via closed-loop trajectory consistency and collision rates. These results will show that the added dense signal improves representation utilization without measurable degradation in driving dynamics. revision: yes
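A minimal sketch of the promised ablation protocol; train_and_eval is a hypothetical helper standing in for the full training pipeline, and the metric names are assumptions:

def run_lambda_ablation(train_and_eval, weights=(0.0, 0.1, 0.5, 1.0, 2.0)):
    # Sweep the world-modeling loss weight; lambda_wm = 0.0 recovers the
    # plain VLA baseline for the same architecture and data.
    results = {}
    for lam in weights:
        metrics = train_and_eval(lambda_wm=lam)
        results[lam] = {
            "pdms": metrics["pdms"],                          # planning accuracy
            "traj_consistency": metrics["traj_consistency"],  # dynamics bias (i)
            "collision_rate": metrics["collision_rate"],      # dynamics bias (ii)
        }
    return results

Reporting such a sweep would separate the contribution of the dense signal from any bias it introduces into driving dynamics.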
Circularity Check
Empirical training paradigm with external benchmarks exhibits no derivation circularity
Full rationale
The paper describes an empirical training paradigm (DriveVLA-W0) that adds world modeling for future image prediction to VLA models, followed by a lightweight action expert, and reports performance on NAVSIM v1/v2 plus a 680x in-house dataset. The scaling-law amplification claim rests on observed performance trends across dataset sizes in these external benchmarks rather than any equations, fitted parameters renamed as predictions, or self-citation chains that reduce the result to its own inputs by construction. No self-definitional steps, ansatzes, or uniqueness theorems appear in the provided text; the work is self-contained against independent benchmarks.
Lean theorems connected to this paper
- IndisputableMonolith.Foundation.LawOfExistence.defect_zero_iff_one (tag: echoes)
  ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.
  Matched passage: "This task generates a dense, self-supervised signal that compels the model to learn the underlying dynamics of the driving environment."
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 21 Pith papers
- MindVLA-U1: VLA Beats VA with Unified Streaming Architecture for Autonomous Driving
  MindVLA-U1 introduces a unified streaming VLA with shared backbone, framewise memory, and language-guided action diffusion that surpasses human drivers on WOD-E2E planning metrics.
- VECTOR-Drive: Tightly Coupled Vision-Language and Trajectory Expert Routing for End-to-End Autonomous Driving
  VECTOR-DRIVE couples vision-language reasoning and trajectory planning in one Transformer via semantic expert routing and flow-matching, reaching 88.91 driving score on Bench2Drive.
- ReflectDrive-2: Reinforcement-Learning-Aligned Self-Editing for Discrete Diffusion Driving
  ReflectDrive-2 achieves 91.0 PDMS on NAVSIM with camera input by training a discrete diffusion model to self-edit trajectories via RL-aligned AutoEdit.
- Driving Intents Amplify Planning-Oriented Reinforcement Learning
  DIAL uses intent-conditioned CFG and multi-intent GRPO to expand and preserve diverse modes in continuous-action preference RL, lifting RFS to 9.14 and surpassing both prior best (8.5) and human demonstration (8.13).
- MindVLA-U1: VLA Beats VA with Unified Streaming Architecture for Autonomous Driving
  MindVLA-U1 is the first unified streaming VLA architecture that surpasses human drivers on WOD-E2E planning metrics while matching VA latency and preserving language interfaces.
- CoWorld-VLA: Thinking in a Multi-Expert World Model for Autonomous Driving
  CoWorld-VLA encodes world information into four expert tokens that condition a diffusion-based planner, yielding competitive collision avoidance and trajectory accuracy on the NAVSIM benchmark.
- CoWorld-VLA: Thinking in a Multi-Expert World Model for Autonomous Driving
  CoWorld-VLA extracts semantic, geometric, dynamic, and trajectory expert tokens from multi-source supervision and feeds them into a diffusion-based hierarchical planner, achieving competitive collision avoidance and t...
- DriveFuture: Future-Aware Latent World Models for Autonomous Driving
  DriveFuture achieves SOTA results on NAVSIM by conditioning latent world model states on future predictions to directly inform trajectory planning.
- ReflectDrive-2: Reinforcement-Learning-Aligned Self-Editing for Discrete Diffusion Driving
  ReflectDrive-2 combines masked discrete diffusion with RL-aligned self-editing to generate and refine driving trajectories, reaching 91.0 PDMS on NAVSIM camera-only and 94.8 in best-of-6.
- Infrastructure-Centric World Models: Bridging Temporal Depth and Spatial Breadth for Roadside Perception
  Infrastructure-centric world models use roadside sensors' temporal depth to complement vehicle spatial breadth for better traffic simulation and prediction.
- SceneScribe-1M: A Large-Scale Video Dataset with Comprehensive Geometric and Semantic Annotations
  SceneScribe-1M is a new dataset of 1 million videos with semantic text, camera parameters, dense depth, and consistent 3D point tracks to support monocular depth estimation, scene reconstruction, point tracking, and t...
- ExploreVLA: Dense World Modeling and Exploration for End-to-End Autonomous Driving
  ExploreVLA augments VLA driving models with future RGB and depth prediction for dense supervision and uses prediction uncertainty as a safety-gated intrinsic reward for RL-based exploration, reaching SOTA PDMS 93.7 on NAVSIM.
- DVGT-2: Vision-Geometry-Action Model for Autonomous Driving at Scale
  DVGT-2 is a streaming vision-geometry-action model that jointly reconstructs dense 3D geometry and plans trajectories online, achieving better reconstruction than prior batch methods while transferring directly to pla...
- OpenVO: Open-World Visual Odometry with Temporal Dynamics Awareness
  OpenVO estimates ego-motion from monocular dashcam footage with varying observation rates and uncalibrated cameras by encoding temporal dynamics in a two-frame regression framework and using 3D priors from foundation ...
- DriveLaW: Unifying Planning and Video Generation in a Latent Driving World
  DriveLaW unifies video world modeling and trajectory planning by injecting video-generator latents into a diffusion planner, achieving SOTA video prediction and a new record on the NAVSIM planning benchmark.
- SimScale: Learning to Drive via Real-World Simulation at Scale
  SimScale synthesizes unseen driving states from real logs via neural rendering and reactive environments, generates pseudo-expert trajectories, and shows that co-training on real plus simulated data improves planning ...
- EponaV2: Driving World Model with Comprehensive Future Reasoning
  EponaV2 advances perception-free driving world models by forecasting comprehensive future 3D geometry and semantic representations, achieving SOTA planning performance on NAVSIM benchmarks.
- Driving Intents Amplify Planning-Oriented Reinforcement Learning
  DIAL expands continuous-action driving policies via intent-conditioned flow matching and multi-intent GRPO, lifting best-of-N preference scores above human demonstrations for the first time on WOD-E2E.
- CRAFT: Counterfactual-to-Interactive Reinforcement Fine-Tuning for Driving Policies
  CRAFT is an on-policy RL fine-tuning framework that decomposes closed-loop policy gradients into a group-normalized counterfactual proxy plus residual correction from interaction events, achieving top closed-loop perf...
- SpanVLA: Efficient Action Bridging and Learning from Negative-Recovery Samples for Vision-Language-Action Model
  SpanVLA reduces action generation latency via flow-matching conditioned on history and improves robustness by training on negative-recovery samples with GRPO and a dedicated reasoning dataset.
- OmniVLA-RL: A Vision-Language-Action Model with Spatial Understanding and Online RL
  OmniVLA-RL uses a mix-of-transformers architecture and flow-matching reformulated as SDE with group segmented policy optimization to surpass prior VLA models on LIBERO benchmarks.
Reference graph
Works this paper leans on
- [1] Hidehisa Arai, Keita Miwa, Kento Sasaki, Kohei Watanabe, Yu Yamaguchi, Shunsuke Aoki, and Issei Yamamoto. CoVLA: Comprehensive vision-language-action dataset for autonomous driving. In 2025 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), pp. 1933–1943. IEEE, 2025.
- [2] Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, Humen Zhong, Yuanzhi Zhu, Mingkun Yang, Zhaohai Li, Jianqiang Wan, Pengfei Wang, Wei Ding, Zheren Fu, Yiheng Xu, Jiabo Ye, Xi Zhang, Tianbao Xie, Zesen Cheng, Hang Zhang, Zhibo Yang, Haiyang Xu, and Junyang Lin. Qwen2.5-VL technical report. ar...
- [3] Mustafa Baniodeh, Kratarth Goel, Scott Ettinger, Carlos Fuertes, Ari Seff, Tim Shen, Cole Gulino, Chenjie Yang, Ghassen Jerfel, Dokook Choe, et al. Scaling laws of motion forecasting and planning - a technical report. arXiv preprint arXiv:2506.08228.
- [4] Florent Bartoccioni, Elias Ramzi, Victor Besnier, Shashanka Venkataramanan, Tuan-Hung Vu, Yihong Xu, Loick Chambon, Spyros Gidaris, Serkan Odabas, David Hurych, et al. VaViM and VaVAM: Autonomous driving through video generative modeling. arXiv preprint arXiv:2502.15672.
- [5] π0: A Vision-Language-Action Flow Model for General Robot Control. arXiv preprint arXiv:2410.24164.
- [6] Holger Caesar, Juraj Kabzan, Kok Seang Tan, Whye Kit Fong, Eric Wolff, Alex Lang, Luke Fletcher, Oscar Beijbom, and Sammy Omari. nuPlan: A closed-loop ML-based planning benchmark for autonomous vehicles. arXiv preprint arXiv:2106.11810.
- [7] Wei Cao, Marcel Hallgarten, Tianyu Li, Daniel Dauner, Xunjiang Gu, Caojun Wang, Yakov Miron, Marco Aiello, Hongyang Li, Igor Gilitschenski, et al. Pseudo-simulation for autonomous driving. arXiv preprint arXiv:2506.04218.
- [8] Jun Cen, Chaohui Yu, Hangjie Yuan, Yuming Jiang, Siteng Huang, Jiayan Guo, Xin Li, Yibing Song, Hao Luo, Fan Wang, et al. WorldVLA: Towards autoregressive action world model. arXiv preprint arXiv:2506.21539.
- [9] Lihong Chen, Hossein Hassani, and Soodeh Nikan. TS-VLM: Text-guided SoftSort pooling for vision-language models in multi-view driving reasoning. arXiv preprint arXiv:2505.12670.
- [10] Yuntao Chen, Yuqi Wang, and Zhaoxiang Zhang. DrivingGPT: Unifying driving world modeling and planning with multi-modal autoregressive transformers. arXiv preprint arXiv:2412.18607.
- [11] Daniel Dauner, Marcel Hallgarten, Tianyu Li, Xinshuo Weng, Zhiyu Huang, Zetong Yang, Hongyang Li, Igor Gilitschenski, Boris Ivanovic, Marco Pavone, et al. NAVSIM: Data-driven non-reactive autonomous vehicle simulation and benchmarking. arXiv preprint arXiv:2406.15349.
- [12] Cunxin Fan, Xiaosong Jia, Yihang Sun, Yixiao Wang, Jianglan Wei, Ziyang Gong, Xiangyu Zhao, Masayoshi Tomizuka, Xue Yang, Junchi Yan, et al. Interleave-VLA: Enhancing robot manipulation with interleaved image-text instructions. arXiv preprint arXiv:2505.02152.
- [13] Renju Feng, Ning Xi, Duanfeng Chu, Rukang Wang, Zejian Deng, Anzheng Wang, Liping Lu, Jinxiang Wang, and Yanjun Huang. Artemis: Autoregressive end-to-end trajectory planning with mixture of experts for autonomous driving. arXiv preprint arXiv:2504.19580.
- [14] Haoyu Fu, Diankun Zhang, Zongchuang Zhao, Jianfeng Cui, Dingkang Liang, Chong Zhang, Dingyuan Zhang, Hongwei Xie, Bing Wang, and Xiang Bai. ORION: A holistic end-to-end autonomous driving framework by vision-language instructed action generation. arXiv preprint arXiv:2503.19755.
- [15] Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, et al. Training compute-optimal large language models. arXiv preprint arXiv:2203.15556.
- [16] Anthony Hu, Lloyd Russell, Hudson Yeo, Zak Murez, George Fedoseev, Alex Kendall, Jamie Shotton, and Gianluca Corrado. GAIA-1: A generative world model for autonomous driving. arXiv preprint arXiv:2309.17080.
- [17] ST-P3: End-to-end vision-based autonomous driving via spatial-temporal feature learning. Yihan Hu, Jiazhi Yang, Li Chen, Keyu Li, Chonghao Sima, Xizhou Zhu, Siqi Chai, Senyao Du, Tianwei Lin, Wenhai Wang, et al. Goal-oriented autonomous driving. arXiv preprint arXiv:2212.10156.
- [18] Junjie Huang, Guan Huang, Zheng Zhu, and Dalong Du. BEVDet: High-performance multi-camera 3D object detection in bird-eye-view. arXiv preprint arXiv:2112.11790.
- [19] Jyh-Jing Hwang, Runsheng Xu, Hubert Lin, Wei-Chih Hung, Jingwei Ji, Kristy Choi, Di Huang, Tong He, Paul Covington, Benjamin Sapp, et al. EMMA: End-to-end multimodal model for autonomous driving. arXiv preprint arXiv:2410.23262.
- [20] Physical Intelligence, Kevin Black, Noah Brown, James Darpinian, Karan Dhabalia, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, et al. π0.5: A vision-language-action model with open-world generalization. arXiv preprint arXiv:2504.16054.
- [21] Anqing Jiang, Yu Gao, Zhigang Sun, Yiru Wang, Jijun Wang, Jinghao Chai, Qian Cao, Yuweng Heng, Hao Jiang, Yunda Dong, et al. DiffVLA: Vision-language guided diffusion planning for autonomous driving. arXiv preprint arXiv:2505.19381.
- [22] Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B. Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361.
- [23] Yingyan Li, Lue Fan, Jiawei He, Yuqi Wang, Yuntao Chen, Zhaoxiang Zhang, and Tieniu Tan. Enhancing end-to-end autonomous driving with latent world model. arXiv preprint arXiv:2406.08481, 2024a. Yingyan Li, Yuqi Wang, Yang Liu, Jiawei He, Lue Fan, and Zhaoxiang Zhang. End-to-end driving with online trajectory evaluation via BEV world model. arXiv preprint ...
- [24] Xingchao Liu, Chengyue Gong, and Qiang Liu. Flow straight and fast: Learning to generate and transfer data with rectified flow. arXiv preprint arXiv:2209.03003.
- [25] Karl Pertsch, Kyle Stachowicz, Brian Ichter, Danny Driess, Suraj Nair, Quan Vuong, Oier Mees, Chelsea Finn, and Sergey Levine. FAST: Efficient action tokenization for vision-language-action models. arXiv preprint arXiv:2501.09747.
- [26] Katrin Renz, Long Chen, Ana-Maria Marcu, Jan Hünermann, Benoit Hanotte, Alice Karnsund, Jamie Shotton, Elahe Arani, and Oleg Sinavski. CarLLaVA: Vision language models for camera-only closed-loop driving. arXiv preprint arXiv:2406.10165.
- [27] Oriane Siméoni, Huy V. Vo, Maximilian Seitzer, Federico Baldassarre, Maxime Oquab, Cijo Jose, Vasil Khalidov, Marc Szafraniec, Seungeun Yi, Michaël Ramamonjisoa, et al. DINOv3. arXiv preprint arXiv:2508.10104.
- [28] Qiao Sun, Shiduo Zhang, Danjiao Ma, Jingzhe Shi, Derun Li, Simian Luo, Yu Wang, Ningyi Xu, Guangzhi Cao, and Hang Zhao. Large trajectory models are scalable motion predictors and planners. arXiv preprint arXiv:2310.19620.
- [29] Haochen Wang, Anlin Zheng, Yucheng Zhao, Tiancai Wang, Zheng Ge, Xiangyu Zhang, and Zhaoxiang Zhang. Reconstructive visual instruction tuning. arXiv preprint arXiv:2410.09575, 2024a. Xinlong Wang, Xiaosong Zhang, Zhengxiong Luo, Quan Sun, Yufeng Cui, Jinsheng Wang, Fan Zhang, Yueze Wang, Zhen Li, Qiying Yu, et al. Emu3: Next-token prediction is all you nee...
- [30] Zhenjie Yang, Yilin Chai, Xiaosong Jia, Qifeng Li, Yuqian Shao, Xuekai Zhu, Haisheng Su, and Junchi Yan. DriveMoE: Mixture-of-experts for vision-language-action model in end-to-end autonomous driving. arXiv preprint arXiv:2505.16278.
- [31] Wenhao Yao, Zhenxin Li, Shiyi Lan, Zi Wang, Xinglong Sun, Jose M. Alvarez, and Zuxuan Wu. DriveSuprim: Towards precise trajectory selection for end-to-end planning. arXiv preprint arXiv:2506.06659.
- [32] Jianhao Yuan, Shuyang Sun, Daniel Omeiza, Bo Zhao, Paul Newman, Lars Kunze, and Matthew Gadd. RAG-Driver: Generalisable driving explanations with retrieval-augmented in-context learning in multi-modal large language model. arXiv preprint arXiv:2402.10828.
- [33] Shuang Zeng, Xinyuan Chang, Mengwei Xie, Xinran Liu, Yifan Bai, Zheng Pan, Mu Xu, Xing Wei, and Ning Guo. FutureSightDrive: Thinking visually with spatio-temporal CoT for autonomous driving. arXiv preprint arXiv:2505.17685.
- [34] Jiawei Zhang, Xuan Yang, Taiqi Wang, Yu Yao, Aleksandr Petiushko, and Bo Li. SafeAuto: Knowledge-enhanced safe autonomous driving with multimodal foundation models. arXiv preprint arXiv:2503.00211.
- [35] Lunjun Zhang, Yuwen Xiong, Ze Yang, Sergio Casas, Rui Hu, and Raquel Urtasun. Learning unsupervised world models for autonomous driving via discrete diffusion. arXiv preprint arXiv:2311.01017.
- [36] Wenzhao Zheng, Zetian Xia, Yuanhui Huang, Sicheng Zuo, Jie Zhou, and Jiwen Lu. Doe-1: Closed-loop autonomous driving with large world model. arXiv preprint arXiv:2412.09627, 2024a. Yupeng Zheng, Zhongpu Xia, Qichao Zhang, Teng Zhang, Ben Lu, Xiaochuang Huo, Chao Han, Yixian Li, Mengjie Yu, Bu Jin, et al. Preliminary investigation into data scaling laws fo...
- [37] ...and evaluate their final PDMS scores. The results confirm a strong positive correlation: the 6V A checkpoint, which had superior generative fidelity, also achieves higher planning performance after fine-tuning. This provides compelling evidence that the model's ability to generate high-quality, realistic future images is directly linked to its capacity for...
- [38] ...then showed many LMs were undertrained and derived a compute-optimal prescription that scales model size and tokens proportionally. In computer vision, Zhai et al. (2022) charted ViT scaling law with stable training recipes, and ViT-22B (Dehghani et al., ...
- [39] ...scaled ViTs to 22B parameters, verifying predictable multi-task improvements. Lin et al. (2025) conducted a large-scale study of imitation-learning data scaling in robotics, and found near power-law gains from increasing environmental and object diversity with improved zero-shot generalization. In autonomous driving, STR (Sun et al., ...
- [40] ...shows large trajectory models scale steadily in both prediction and planning, and Baniodeh et al. (2025) reported power-law improvements for joint motion forecasting and planning with large driving datasets. For end-to-end driving, Naumann et al. (2025) observe roughly log-linear gains in both open- and closed-loop metrics as training data scale increases...