pith. sign in

arxiv: 2605.28544 · v1 · pith:4YOKZKTWnew · submitted 2026-05-27 · 💻 cs.CV

DriveWAM: Video Generative Priors Enable Scalable World-Action Modeling for Autonomous Driving

Pith reviewed 2026-06-29 13:39 UTC · model grok-4.3

classification 💻 cs.CV
keywords autonomous drivingvideo diffusionworld modelsaction generationflow matchingend-to-end planninggenerative priorsvideo-action modeling
0
0 comments X

The pith

DriveWAM adapts a pretrained video diffusion transformer into an autoregressive video-action policy by unifying video frames and driving actions in a single token sequence under joint flow matching.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that video generative models pretrained on large-scale data already encode motion and temporal dynamics useful for driving, and that these priors can be transferred to action generation without major architectural overhaul. It does so by interleaving video and action tokens in one temporal stream and optimizing them together with flow matching while freezing the core transformer weights. Scene intent from a separate vision-language model steers the generation at each chunk, and a selective memory cache keeps long rollouts from growing unbounded. Results on standard driving benchmarks improve as the number of training clips rises, indicating that the approach benefits from scale in a way that purely task-specific models may not.

Core claim

DriveWAM organizes video and action streams into a unified temporal token sequence and trains them under a joint flow-matching objective, preserving the pretrained video-generation architecture while adapting its large-scale video priors to action generation. Scene-evolving driving guidance from a frozen VLM supplies chunk-specific semantic intent, and selective KV memory maintains bounded modality-aware pools through relevance-redundancy selection at inference. Experiments on NAVSIM and the PhysicalAI-Autonomous-Vehicles benchmark show strong planning performance, and a data-scaling study from 4k to 100k driving clips confirms the scaling potential of world-action modeling for end-to-end au

What carries the argument

Unified temporal token sequence of video and actions trained with joint flow-matching on a frozen video diffusion transformer backbone.

If this is right

  • Strong planning performance is achieved on NAVSIM and PhysicalAI-Autonomous-Vehicles benchmarks.
  • Performance improves as training data increases from 4k to 100k driving clips.
  • Scene-evolving guidance from a frozen VLM supplies high-level intent to the generation process.
  • Selective KV memory keeps long-horizon video-action rollouts computationally bounded.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same unified-sequence approach could be tested on other embodied control problems where video priors already exist, such as robotic navigation.
  • If the scaling trend continues beyond 100k clips, the method would imply that collecting larger unlabeled driving video sets becomes the dominant lever for policy improvement.
  • The selective memory mechanism suggests a general pattern for keeping autoregressive world models tractable in any long-horizon visual task.
  • Combining the frozen VLM guidance with additional modalities like lidar or maps could be explored without retraining the core transformer.

Load-bearing premise

The joint flow-matching objective on the unified video-action token sequence successfully transfers the pretrained video-generation priors to action generation without requiring substantial architectural changes.

What would settle it

If planning metrics on NAVSIM or PhysicalAI-Autonomous-Vehicles show no improvement when training data scales from 4k to 100k clips, or if DriveWAM performs no better than the unmodified video model, the transfer of priors would be falsified.

Figures

Figures reproduced from arXiv: 2605.28544 by Bo Zhang, Chen Shi, Jinrui Xu, Kehua Sheng, Li Jiang, Shaoshuai Shi.

Figure 1
Figure 1. Figure 1: Overview of DriveWAM, which adapts a pretrained video generation backbone into a [PITH_FULL_IMAGE:figures/full_fig_p004_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Attention mask used during Drive￾WAM training. Colored entries indicate al￾lowed attention; blank entries are masked. Reference frame Chunk 1 Chunk 2 Chunk 3 After query chunk 4 After query chunk 5 Low score High score Query chunk [PITH_FULL_IMAGE:figures/full_fig_p006_2.png] view at source ↗
Figure 4
Figure 4. Figure 4: Qualitative results on NAVSIM (left) and PhysicalAI-Autonomous-Vehicles benchmark [PITH_FULL_IMAGE:figures/full_fig_p009_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Data scaling on PhysicalAI￾Autonomous-Vehicles. Data Scaling. We investigate the data scalability of Drive￾WAM by varying the training set size from 4k to 20k and 100k clips under a fixed 50k-iteration training procedure. As shown in [PITH_FULL_IMAGE:figures/full_fig_p009_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Representative scene tagging results for dataset curation. For each clip, the left panel shows [PITH_FULL_IMAGE:figures/full_fig_p015_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: Examples of scene-evolving VLM guidance. The guidance adapts to changing scene [PITH_FULL_IMAGE:figures/full_fig_p015_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: Qualitative results on NAVSIM (top two rows) and PhysicalAI-Autonomous-Vehicles [PITH_FULL_IMAGE:figures/full_fig_p017_8.png] view at source ↗
read the original abstract

Pretrained foundation models have become an important basis for end-to-end autonomous driving. In contrast to vision-language models pretrained primarily on static image-text pairs, video generative models capture temporal dynamics and motion priors that are naturally suited for driving. We present DriveWAM, a driving world-action model that adapts a pretrained video diffusion transformer into an autoregressive video-action policy. DriveWAM organizes video and action streams into a unified temporal token sequence and trains them under a joint flow-matching objective, preserving the pretrained video-generation architecture while adapting its large-scale video priors to action generation. To incorporate high-level scene understanding, we introduce scene-evolving driving guidance, where a frozen VLM produces chunk-specific semantic intent to guide video-action generation. To keep long-horizon rollout bounded, we further introduce selective KV memory, which maintains bounded modality-aware video and action memory pools through relevance-redundancy cache selection at inference time. Experiments on NAVSIM and the PhysicalAI-Autonomous-Vehicles benchmark show that DriveWAM achieves strong planning performance, and a data-scaling study from 4k to 100k driving clips further confirms the scaling potential of world-action modeling for end-to-end autonomous driving.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 0 minor

Summary. The manuscript introduces DriveWAM, a driving world-action model that adapts a pretrained video diffusion transformer into an autoregressive video-action policy. The model unifies video and action streams into a temporal token sequence trained under a joint flow-matching objective to preserve video-generation priors. It adds scene-evolving driving guidance using a frozen VLM for semantic intent and selective KV memory for bounded long-horizon rollouts. Experiments on NAVSIM and PhysicalAI-Autonomous-Vehicles benchmarks are claimed to show strong planning performance, with a scaling study from 4k to 100k clips supporting the approach's potential for end-to-end autonomous driving.

Significance. If the experimental results hold, the paper would make a significant contribution by demonstrating an effective method to transfer large-scale video generative priors to action generation for driving tasks. The joint objective and memory management techniques could enable scalable world-action modeling, with the data scaling study providing evidence for the benefits of larger datasets in this domain.

major comments (1)
  1. Abstract: The abstract asserts 'strong planning performance' and 'scaling potential' but supplies no quantitative metrics, baseline comparisons, ablation results, or error analysis. Without these details the support for the central claim cannot be evaluated.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the detailed review and constructive comment. We agree that the abstract would benefit from greater specificity to support its claims and will revise it accordingly in the next version of the manuscript.

read point-by-point responses
  1. Referee: Abstract: The abstract asserts 'strong planning performance' and 'scaling potential' but supplies no quantitative metrics, baseline comparisons, ablation results, or error analysis. Without these details the support for the central claim cannot be evaluated.

    Authors: We agree with this observation. The current abstract is intentionally concise but does not include the specific numbers that would allow readers to immediately assess the strength of the reported results. In the revised manuscript we will expand the abstract to include the primary quantitative outcomes (e.g., planning metrics on NAVSIM and PhysicalAI-Autonomous-Vehicles, the performance delta relative to the strongest baselines, and the scaling trend from 4k to 100k clips) while remaining within typical abstract length limits. This change directly addresses the concern without altering the technical content of the paper. revision: yes

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The abstract and description present DriveWAM as an empirical adaptation of an external pretrained video diffusion transformer using a joint flow-matching objective on unified video-action tokens, with scene guidance from a frozen VLM and selective KV memory. Performance is validated on external benchmarks (NAVSIM, PhysicalAI-Autonomous-Vehicles) and a scaling study from 4k to 100k clips. No equations, fitted parameters renamed as predictions, self-citations, or derivation steps that reduce claims to inputs by construction are present. The approach is self-contained against external benchmarks with no load-bearing internal reductions.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available; no equations, training details, or modeling assumptions are stated that would allow identification of free parameters, axioms, or invented entities.

pith-pipeline@v0.9.1-grok · 5757 in / 1155 out tokens · 40779 ms · 2026-06-29T13:39:39.605654+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Forward citations

Cited by 4 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Kamera: Unified Position-Invariant Multimodal KV Cache for Training-Free Reuse

    cs.DC 2026-06 unverdicted novelty 6.0

    Kamera stores a low-rank patch with each position-free KV chunk to restore cross-chunk conditioning lost in naive reuse, enabling cheap reordering, sliding windows, and recall across attention mechanisms.

  2. Diffusion Transformer World-Action Model for AV Scene Prediction

    cs.CV 2026-06 unverdicted novelty 6.0

    A Diffusion Transformer world model in V-JEPA2 latent space predicts action-conditioned future scenes on nuScenes, outperforming regression on KID/FID while preserving steering controllability and adding a jump model ...

  3. Discrete-WAM: Unified Discrete Vision-Action Token Editing for World-Policy Learning

    cs.RO 2026-06 unverdicted novelty 5.0

    Discrete-WAM unifies world modeling and policy learning for autonomous driving by representing observations, states, decisions, and actions as tokens in one space and using hierarchical token editing for planning.

  4. World Action Models: A Survey

    cs.RO 2026-06 unverdicted novelty 3.0

    A survey that clarifies boundaries and organizes World Action Models by generation requirements and predictive substrates, identifying a trend toward generating less of the future.

Reference graph

Works this paper leans on

63 extracted references · 21 canonical work pages · cited by 4 Pith papers · 14 internal anchors

  1. [1]

    $\pi_0$: A Vision-Language-Action Flow Model for General Robot Control

    Kevin Black, Noah Brown, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, Lachy Groom, Karol Hausman, Brian Ichter, et al. π0: A vision-language-action flow model for general robot control.arXiv preprint arXiv:2410.24164, 2024

  2. [2]

    $\pi_{0.5}$: a Vision-Language-Action Model with Open-World Generalization

    Physical Intelligence, Kevin Black, Noah Brown, James Darpinian, Karan Dhabalia, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, et al. π0.5: a vision-language-action model with open-world generalization.arXiv preprint arXiv:2504.16054, 2025

  3. [3]

    DriveVLA-w0: World models amplify data scaling law in autonomous driving

    Yingyan Li, Shuyao Shang, Weisong Liu, Bing Zhan, Haochen Wang, Yuqi Wang, Yuntao Chen, Xiaoman Wang, AnYasong, Chufeng Tang, Lu Hou, Lue Fan, and Zhaoxiang Zhang. DriveVLA-w0: World models amplify data scaling law in autonomous driving. InICLR, 2026

  4. [4]

    Recogdrive: A reinforced cognitive framework for end-to-end autonomous driving

    Yongkang Li, Kaixin Xiong, Xiangyu Guo, Fang Li, Sixu Yan, Gangwei Xu, Lijun Zhou, Long Chen, Haiyang Sun, Bing Wang, et al. Recogdrive: A reinforced cognitive framework for end-to-end autonomous driving. InICLR, 2026

  5. [5]

    Simlingo: Vision-only closed-loop autonomous driving with language-action alignment

    Katrin Renz, Long Chen, Elahe Arani, and Oleg Sinavski. Simlingo: Vision-only closed-loop autonomous driving with language-action alignment. InCVPR, 2025

  6. [6]

    Drivemoe: Mixture-of-experts for vision-language-action model in end-to-end autonomous driving

    Zhenjie Yang, Yilin Chai, Xiaosong Jia, Qifeng Li, Yuqian Shao, Xuekai Zhu, Haisheng Su, and Junchi Yan. Drivemoe: Mixture-of-experts for vision-language-action model in end-to-end autonomous driving. InCVPR, 2026

  7. [7]

    Langcoop: Collaborative driving with language

    Xiangbo Gao, Yuheng Wu, Rujia Wang, Chenxi Liu, Yang Zhou, and Zhengzhong Tu. Langcoop: Collaborative driving with language. InCVPR, 2025

  8. [8]

    Emu3: Next-Token Prediction is All You Need

    Xinlong Wang, Xiaosong Zhang, Zhengxiong Luo, Quan Sun, Yufeng Cui, Jinsheng Wang, Fan Zhang, Yueze Wang, Zhen Li, Qiying Yu, et al. Emu3: Next-token prediction is all you need.arXiv preprint arXiv:2409.18869, 2024

  9. [9]

    Qwen3-VL Technical Report

    Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, Wenbin Ge, Zhifang Guo, Qidong Huang, Jie Huang, Fei Huang, Binyuan Hui, Shutong Jiang, Zhaohai Li, Mingsheng Li, Mei Li, Kaixin Li, Zicheng Lin, Junyang Lin, Xuejing Liu, Jiawei Liu, Chenglong Liu, Yang Liu, Dayiheng Liu, Shixuan ...

  10. [10]

    Visual instruction tuning

    Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. InNeurIPS, 2023

  11. [11]

    PaliGemma: A versatile 3B VLM for transfer

    Lucas Beyer, Andreas Steiner, André Susano Pinto, Alexander Kolesnikov, Xiao Wang, Daniel Salz, Maxim Neumann, Ibrahim Alabdulmohsin, Michael Tschannen, Emanuele Bugliarello, et al. Paligemma: A versatile 3b vlm for transfer.arXiv preprint arXiv:2407.07726, 2024

  12. [12]

    Flamingo: a visual language model for few-shot learning

    Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, et al. Flamingo: a visual language model for few-shot learning. InNeurIPS, 2022

  13. [13]

    Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution

    Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, et al. Qwen2-vl: Enhancing vision-language model’s perception of the world at any resolution.arXiv preprint arXiv:2409.12191, 2024

  14. [14]

    Drivedreamer-policy: A geometry-grounded world-action model for unified generation and planning.arXiv preprint arXiv:2604.01765, 2026

    Yang Zhou, Xiaofeng Wang, Hao Shao, Letian Wang, Guosheng Zhao, Jiangnan Shao, Jiagang Zhu, Tingdong Yu, Zheng Zhu, Guan Huang, et al. Drivedreamer-policy: A geometry-grounded world-action model for unified generation and planning.arXiv preprint arXiv:2604.01765, 2026

  15. [15]

    Futuresightdrive: Thinking visually with spatio-temporal cot for autonomous driving

    Shuang Zeng, Xinyuan Chang, Mengwei Xie, Xinran Liu, Yifan Bai, Zheng Pan, Mu Xu, and Xing Wei. Futuresightdrive: Thinking visually with spatio-temporal cot for autonomous driving. InNeurIPS, 2025

  16. [16]

    Cosmos policy: Fine-tuning video models for visuomotor control and planning

    Moo Jin Kim, Yihuai Gao, Tsung-Yi Lin, Yen-Chen Lin, Yunhao Ge, Grace Lam, Percy Liang, Shuran Song, Ming-Yu Liu, Chelsea Finn, and Jinwei Gu. Cosmos policy: Fine-tuning video models for visuomotor control and planning. InICLR, 2026. 10

  17. [17]

    World Action Models are Zero-shot Policies

    Seonghyeon Ye, Yunhao Ge, Kaiyuan Zheng, Shenyuan Gao, Sihyun Yu, George Kurian, Suneel Indupuru, You Liang Tan, Chuning Zhu, Jiannan Xiang, et al. World action models are zero-shot policies.arXiv preprint arXiv:2602.15922, 2026

  18. [18]

    Fast-WAM: Do World Action Models Need Test-time Future Imagination?

    Tianyuan Yuan, Zibin Dong, Yicheng Liu, and Hang Zhao. Fast-wam: Do world action models need test-time future imagination?arXiv preprint arXiv:2603.16666, 2026

  19. [19]

    Causal World Modeling for Robot Control

    Lin Li, Qihang Zhang, Yiming Luo, Shuai Yang, Ruilin Wang, Fei Han, Mingrui Yu, Zelin Gao, Nan Xue, Xing Zhu, et al. Causal world modeling for robot control.arXiv preprint arXiv:2601.21998, 2026

  20. [20]

    Gigaworld-policy: An efficient action-centered world–action model.arXiv preprint arXiv:2603.17240, 2026

    Angen Ye, Boyuan Wang, Chaojun Ni, Guan Huang, Guosheng Zhao, Hao Li, Hengtao Li, Jie Li, Jindi Lv, Jingyu Liu, et al. Gigaworld-policy: An efficient action-centered world–action model.arXiv preprint arXiv:2603.17240, 2026

  21. [21]

    Video prediction policy: A generalist robot policy with predictive visual representations

    Yucheng Hu, Yanjiang Guo, Pengchao Wang, Xiaoyu Chen, Yen-Jen Wang, Jianke Zhang, Koushil Sreenath, Chaochao Lu, and Jianyu Chen. Video prediction policy: A generalist robot policy with predictive visual representations. InICML, 2025

  22. [22]

    Bridging scene generation and planning: Driving with world model via unifying vision and motion representation.arXiv preprint arXiv:2603.14948, 2026

    Xingtai Gui, Meijie Zhang, Tianyi Yan, Wencheng Han, Jiahao Gong, Feiyang Tan, Cheng-zhong Xu, and Jianbing Shen. Bridging scene generation and planning: Driving with world model via unifying vision and motion representation.arXiv preprint arXiv:2603.14948, 2026

  23. [23]

    Vavim and vavam: Autonomous driving through video generative modeling.arXiv preprint arXiv:2502.15672, 2025

    Florent Bartoccioni, Elias Ramzi, Victor Besnier, Shashanka Venkataramanan, Tuan-Hung Vu, Yihong Xu, Loick Chambon, Spyros Gidaris, Serkan Odabas, David Hurych, Renaud Marlet, Alexandre Boulch, Mickael Chen, Eloi Zablocki, Andrei Bursuc, Eduardo Valle, and Matthieu Cord. Vavim and vavam: Autonomous driving through video generative modeling.arXiv preprint ...

  24. [24]

    Epona: Autoregressive diffusion world model for autonomous driving

    Kaiwen Zhang, Zhenyu Tang, Xiaotao Hu, Xingang Pan, Xiaoyang Guo, Yuan Liu, Jingwei Huang, Li Yuan, Qian Zhang, Xiao-Xiao Long, et al. Epona: Autoregressive diffusion world model for autonomous driving. InICCV, 2025

  25. [25]

    Yaron Lipman, Ricky T. Q. Chen, Heli Ben-Hamu, Maximilian Nickel, and Matthew Le. Flow matching for generative modeling. InICLR, 2023

  26. [26]

    Flow caching for autoregressive video generation

    Yuexiao Ma, Xuzhe Zheng, Jing Xu, Xiwei Xu, Feng Ling, Xiawu Zheng, Huafeng Kuang, Huixia Li, Xing Wang, Xuefeng Xiao, et al. Flow caching for autoregressive video generation. InICLR, 2026

  27. [27]

    Navsim: Data-driven non-reactive autonomous vehicle simulation and benchmarking

    Daniel Dauner, Marcel Hallgarten, Tianyu Li, Xinshuo Weng, Zhiyu Huang, Zetong Yang, Hongyang Li, Igor Gilitschenski, Boris Ivanovic, Marco Pavone, et al. Navsim: Data-driven non-reactive autonomous vehicle simulation and benchmarking. InNeurIPS, 2024

  28. [28]

    Alpamayo-R1: Bridging Reasoning and Action Prediction for Generalizable Autonomous Driving in the Long Tail

    Yan Wang, Wenjie Luo, Junjie Bai, Yulong Cao, Tong Che, Ke Chen, Yuxiao Chen, Jenna Diamond, Yifan Ding, Wenhao Ding, et al. Alpamayo-r1: Bridging reasoning and action prediction for generalizable autonomous driving in the long tail.arXiv preprint arXiv:2511.00088, 2025

  29. [29]

    Drivegpt4: Interpretable end-to-end autonomous driving via large language model

    Zhenhua Xu, Yujia Zhang, Enze Xie, Zhen Zhao, Yong Guo, Kwan-Yee K Wong, Zhenguo Li, and Hengshuang Zhao. Drivegpt4: Interpretable end-to-end autonomous driving via large language model. IEEE Robotics and Automation Letters, 2024

  30. [30]

    Drivevlm: The convergence of autonomous driving and large vision-language models

    Xiaoyu Tian, Junru Gu, Bailin Li, Yicheng Liu, Zhiyong Zhao, Yang Wang, Kun Zhan, Peng Jia, Xianpeng Lang, and Hang Zhao. Drivevlm: The convergence of autonomous driving and large vision-language models. InCoRL, 2024

  31. [31]

    Senna: Bridging Large Vision-Language Models and End-to-End Autonomous Driving

    Bo Jiang, Shaoyu Chen, Bencheng Liao, Xingyu Zhang, Wei Yin, Qian Zhang, Chang Huang, Wenyu Liu, and Xinggang Wang. Senna: Bridging large vision-language models and end-to-end autonomous driving. arXiv preprint arXiv:2410.22313, 2024

  32. [32]

    DriveLM: Driving with graph visual question answering

    Chonghao Sima, Katrin Renz, Kashyap Chitta, Li Chen, Hanxue Zhang, Chengen Xie, Jens Beißwenger, Ping Luo, Andreas Geiger, and Hongyang Li. DriveLM: Driving with graph visual question answering. In ECCV, 2024

  33. [33]

    Drivemlm: Aligning multi-modal large language models with behavioral planning states for autonomous driving.arXiv preprint arXiv:2312.09245, 2023

    Wenhai Wang, Jiangwei Xie, ChuanYang Hu, Haoming Zou, Jianan Fan, Wenwen Tong, Yang Wen, Silei Wu, Hanming Deng, Zhiqi Li, et al. Drivemlm: Aligning multi-modal large language models with behavioral planning states for autonomous driving.arXiv preprint arXiv:2312.09245, 2023

  34. [34]

    A language agent for autonomous driving.arXiv preprint arXiv:2311.10813, 2023

    Jiageng Mao, Junjie Ye, Yuxi Qian, Marco Pavone, and Yue Wang. A language agent for autonomous driving.arXiv preprint arXiv:2311.10813, 2023. 11

  35. [35]

    Shihao Wang, Zhiding Yu, Xiaohui Jiang, Shiyi Lan, Min Shi, Nadine Chang, Jan Kautz, Ying Li, and Jose M. Alvarez. OmniDrive: A holistic vision-language dataset for autonomous driving with counterfactual reasoning. InCVPR, 2025

  36. [36]

    EMMA: End-to-End Multimodal Model for Autonomous Driving

    Jyh-Jing Hwang, Runsheng Xu, Hubert Lin, Wei-Chih Hung, Jingwei Ji, Kristy Choi, Di Huang, Tong He, Paul Covington, Benjamin Sapp, et al. Emma: End-to-end multimodal model for autonomous driving. arXiv preprint arXiv:2410.23262, 2024

  37. [37]

    Zhao, Yun Zhang, Zhiyu Huang, Bolei Zhou, and Jiaqi Ma

    Zewei Zhou, Tianhui Cai, Seth Z. Zhao, Yun Zhang, Zhiyu Huang, Bolei Zhou, and Jiaqi Ma. Au- toVLA: A vision-language-action model for end-to-end autonomous driving with adaptive reasoning and reinforcement fine-tuning. InNeurIPS, 2025

  38. [38]

    MoVQ: Modulating quantized vectors for high-fidelity image generation

    Chuanxia Zheng, Long Tung Vuong, Jianfei Cai, and Dinh Phung. MoVQ: Modulating quantized vectors for high-fidelity image generation. InNeurIPS, 2022

  39. [39]

    Scalable diffusion models with transformers

    William Peebles and Saining Xie. Scalable diffusion models with transformers. InICCV, 2023

  40. [40]

    Wan: Open and Advanced Large-Scale Video Generative Models

    Team Wan, Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianxiao Yang, et al. Wan: Open and advanced large-scale video generative models.arXiv preprint arXiv:2503.20314, 2025

  41. [41]

    Cogvideox: Text-to-video diffusion models with an expert transformer

    Zhuoyi Yang, Jiayan Teng, Wendi Zheng, Ming Ding, Shiyu Huang, Jiazheng Xu, Yuanming Yang, Wenyi Hong, Xiaohan Zhang, Guanyu Feng, Da Yin, Yuxuan.Zhang, Weihan Wang, Yean Cheng, Bin Xu, Xiaotao Gu, Yuxiao Dong, and Jie Tang. Cogvideox: Text-to-video diffusion models with an expert transformer. In ICLR, 2025

  42. [42]

    Neural discrete representation learning

    Aaron Van Den Oord, Oriol Vinyals, et al. Neural discrete representation learning. InNeurIPS, 2017

  43. [43]

    Language models are unsupervised multitask learners.OpenAI blog, 2019

    Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. Language models are unsupervised multitask learners.OpenAI blog, 2019

  44. [44]

    Self forcing: Bridging the train-test gap in autoregressive video diffusion

    Xun Huang, Zhengqi Li, Guande He, Mingyuan Zhou, and Eli Shechtman. Self forcing: Bridging the train-test gap in autoregressive video diffusion. InNeurIPS, 2025

  45. [45]

    Diffusion forcing: Next-token prediction meets full-sequence diffusion

    Boyuan Chen, Diego Martí Monsó, Yilun Du, Max Simchowitz, Russ Tedrake, and Vincent Sitzmann. Diffusion forcing: Next-token prediction meets full-sequence diffusion. InNeurIPS, 2024

  46. [46]

    From slow bidirectional to fast autoregressive video diffusion models

    Tianwei Yin, Qiang Zhang, Richard Zhang, William T Freeman, Fredo Durand, Eli Shechtman, and Xun Huang. From slow bidirectional to fast autoregressive video diffusion models. InCVPR, 2025

  47. [47]

    Sparse video-gen: Accelerating video diffusion transformers with spatial-temporal sparsity

    Haocheng Xi, Shuo Yang, Yilong Zhao, Chenfeng Xu, Muyang Li, Xiuyu Li, Yujun Lin, Han Cai, Jintao Zhang, Dacheng Li, Jianfei Chen, Ion Stoica, Kurt Keutzer, and Song Han. Sparse video-gen: Accelerating video diffusion transformers with spatial-temporal sparsity. InICML, 2025

  48. [48]

    Sparse videogen2: Accelerate video generation with sparse attention via semantic-aware permutation

    Shuo Yang, Haocheng Xi, Yilong Zhao, Muyang Li, Jintao Zhang, Han Cai, Yujun Lin, Xiuyu Li, Chenfeng Xu, Kelly Peng, Jianfei Chen, Song Han, Kurt Keutzer, and Ion Stoica. Sparse videogen2: Accelerate video generation with sparse attention via semantic-aware permutation. InNeurIPS, 2025

  49. [49]

    Adaptive caching for faster video generation with diffusion transformers

    Kumara Kahatapitiya, Haozhe Liu, Sen He, Ding Liu, Menglin Jia, Chenyang Zhang, Michael S Ryoo, and Tian Xie. Adaptive caching for faster video generation with diffusion transformers. InICCV, 2025

  50. [50]

    Flow straight and fast: Learning to generate and transfer data with rectified flow

    Xingchao Liu, Chengyue Gong, and qiang liu. Flow straight and fast: Learning to generate and transfer data with rectified flow. InICLR, 2023

  51. [51]

    Fifo-diffusion: Generating infinite videos from text without training.NeurIPS, 2024

    Jihwan Kim, Junoh Kang, Jinyoung Choi, and Bohyung Han. Fifo-diffusion: Generating infinite videos from text without training.NeurIPS, 2024

  52. [52]

    X-world: Controllable ego-centric multi-camera world models for scalable end-to-end driving.arXiv preprint arXiv:2603.19979, 2026

    Chaoda Zheng, Sean Li, Jinhao Deng, Zhennan Wang, Shijia Chen, Liqiang Xiao, Ziheng Chi, Hongbin Lin, Kangjie Chen, Boyang Wang, et al. X-world: Controllable ego-centric multi-camera world models for scalable end-to-end driving.arXiv preprint arXiv:2603.19979, 2026

  53. [53]

    Hydra-MDP: End-to-end Multimodal Planning with Multi-target Hydra-Distillation

    Zhenxin Li, Kailin Li, Shihao Wang, Shiyi Lan, Zhiding Yu, Yishen Ji, Zhiqi Li, Ziyue Zhu, Jan Kautz, Zuxuan Wu, et al. Hydra-mdp: End-to-end multimodal planning with multi-target hydra-distillation.arXiv preprint arXiv:2406.06978, 2024

  54. [54]

    Planning-oriented autonomous driving

    Yihan Hu, Jiazhi Yang, Li Chen, Keyu Li, Chonghao Sima, Xizhou Zhu, Siqi Chai, Senyao Du, Tianwei Lin, Wenhai Wang, et al. Planning-oriented autonomous driving. InCVPR, 2023

  55. [55]

    Transfuser: Imitation with transformer-based sensor fusion for autonomous driving.TPAMI, 2023

    Kashyap Chitta, Aditya Prakash, Bernhard Jaeger, Zehao Yu, Katrin Renz, and Andreas Geiger. Transfuser: Imitation with transformer-based sensor fusion for autonomous driving.TPAMI, 2023. 12

  56. [56]

    Para-drive: Parallelized architecture for real-time autonomous driving

    Xinshuo Weng, Boris Ivanovic, Yan Wang, Yue Wang, and Marco Pavone. Para-drive: Parallelized architecture for real-time autonomous driving. InCVPR, 2024

  57. [57]

    Enhancing end-to-end autonomous driving with latent world model

    Yingyan Li, Lue Fan, Jiawei He, Yuqi Wang, Yuntao Chen, Zhaoxiang Zhang, and Tieniu Tan. Enhancing end-to-end autonomous driving with latent world model. InICLR, 2025

  58. [58]

    Diffusiondrive: Truncated diffusion model for end-to-end autonomous driving

    Bencheng Liao, Shaoyu Chen, Haoran Yin, Bo Jiang, Cheng Wang, Sixu Yan, Xinbang Zhang, Xiangyu Li, Ying Zhang, Qian Zhang, et al. Diffusiondrive: Truncated diffusion model for end-to-end autonomous driving. InCVPR, 2025

  59. [59]

    End-to-end driving with online trajectory evaluation via bev world model

    Yingyan Li, Yuqi Wang, Yang Liu, Jiawei He, Lue Fan, and Zhaoxiang Zhang. End-to-end driving with online trajectory evaluation via bev world model. InICCV, 2025

  60. [60]

    Openscene: The largest up-to-date 3d occupancy prediction benchmark in autonomous driving.https://github.com/OpenDriveLab/OpenScene, 2023

    OpenScene Contributors. Openscene: The largest up-to-date 3d occupancy prediction benchmark in autonomous driving.https://github.com/OpenDriveLab/OpenScene, 2023

  61. [61]

    Towards learning-based planning: The nuplan benchmark for real-world autonomous driving

    Napat Karnchanachari, Dimitris Geromichalos, Kok Seang Tan, Nanxiang Li, Christopher Eriksen, Shakiba Yaghoubi, Noushin Mehdipour, Gianmarco Bernasconi, Whye Kit Fong, Yiluan Guo, et al. Towards learning-based planning: The nuplan benchmark for real-world autonomous driving. InICRA, 2024

  62. [62]

    Decoupled weight decay regularization

    Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. InICLR, 2019

  63. [63]

    Generalized predictive model for autonomous driving

    Jiazhi Yang, Shenyuan Gao, Yihang Qiu, Li Chen, Tianyu Li, Bo Dai, Kashyap Chitta, Penghao Wu, Jia Zeng, Ping Luo, et al. Generalized predictive model for autonomous driving. InCVPR, 2024. 13 A Dataset Curation The PhysicalAI-Autonomous-Vehicles benchmark contains roughly 1,700 hours of driving organized into 306,152 20-second clips. To focus evaluation a...