DriveWAM: Video Generative Priors Enable Scalable World-Action Modeling for Autonomous Driving

Bo Zhang; Chen Shi; Jinrui Xu; Kehua Sheng; Li Jiang; Shaoshuai Shi

arxiv: 2605.28544 · v1 · pith:4YOKZKTWnew · submitted 2026-05-27 · 💻 cs.CV

DriveWAM: Video Generative Priors Enable Scalable World-Action Modeling for Autonomous Driving

Chen Shi , Jinrui Xu , Shaoshuai Shi , Kehua Sheng , Bo Zhang , Li Jiang This is my paper

Pith reviewed 2026-06-29 13:39 UTC · model grok-4.3

classification 💻 cs.CV

keywords autonomous drivingvideo diffusionworld modelsaction generationflow matchingend-to-end planninggenerative priorsvideo-action modeling

0 comments

The pith

DriveWAM adapts a pretrained video diffusion transformer into an autoregressive video-action policy by unifying video frames and driving actions in a single token sequence under joint flow matching.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that video generative models pretrained on large-scale data already encode motion and temporal dynamics useful for driving, and that these priors can be transferred to action generation without major architectural overhaul. It does so by interleaving video and action tokens in one temporal stream and optimizing them together with flow matching while freezing the core transformer weights. Scene intent from a separate vision-language model steers the generation at each chunk, and a selective memory cache keeps long rollouts from growing unbounded. Results on standard driving benchmarks improve as the number of training clips rises, indicating that the approach benefits from scale in a way that purely task-specific models may not.

Core claim

DriveWAM organizes video and action streams into a unified temporal token sequence and trains them under a joint flow-matching objective, preserving the pretrained video-generation architecture while adapting its large-scale video priors to action generation. Scene-evolving driving guidance from a frozen VLM supplies chunk-specific semantic intent, and selective KV memory maintains bounded modality-aware pools through relevance-redundancy selection at inference. Experiments on NAVSIM and the PhysicalAI-Autonomous-Vehicles benchmark show strong planning performance, and a data-scaling study from 4k to 100k driving clips confirms the scaling potential of world-action modeling for end-to-end au

What carries the argument

Unified temporal token sequence of video and actions trained with joint flow-matching on a frozen video diffusion transformer backbone.

If this is right

Strong planning performance is achieved on NAVSIM and PhysicalAI-Autonomous-Vehicles benchmarks.
Performance improves as training data increases from 4k to 100k driving clips.
Scene-evolving guidance from a frozen VLM supplies high-level intent to the generation process.
Selective KV memory keeps long-horizon video-action rollouts computationally bounded.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same unified-sequence approach could be tested on other embodied control problems where video priors already exist, such as robotic navigation.
If the scaling trend continues beyond 100k clips, the method would imply that collecting larger unlabeled driving video sets becomes the dominant lever for policy improvement.
The selective memory mechanism suggests a general pattern for keeping autoregressive world models tractable in any long-horizon visual task.
Combining the frozen VLM guidance with additional modalities like lidar or maps could be explored without retraining the core transformer.

Load-bearing premise

The joint flow-matching objective on the unified video-action token sequence successfully transfers the pretrained video-generation priors to action generation without requiring substantial architectural changes.

What would settle it

If planning metrics on NAVSIM or PhysicalAI-Autonomous-Vehicles show no improvement when training data scales from 4k to 100k clips, or if DriveWAM performs no better than the unmodified video model, the transfer of priors would be falsified.

Figures

Figures reproduced from arXiv: 2605.28544 by Bo Zhang, Chen Shi, Jinrui Xu, Kehua Sheng, Li Jiang, Shaoshuai Shi.

**Figure 2.** Figure 2: Attention mask used during DriveWAM training. Colored entries indicate allowed attention; blank entries are masked. Reference frame Chunk 1 Chunk 2 Chunk 3 After query chunk 4 After query chunk 5 Low score High score Query chunk [PITH_FULL_IMAGE:figures/full_fig_p006_2.png] view at source ↗

**Figure 4.** Figure 4: Qualitative results on NAVSIM (left) and PhysicalAI-Autonomous-Vehicles benchmark [PITH_FULL_IMAGE:figures/full_fig_p009_4.png] view at source ↗

**Figure 5.** Figure 5: Data scaling on PhysicalAIAutonomous-Vehicles. Data Scaling. We investigate the data scalability of DriveWAM by varying the training set size from 4k to 20k and 100k clips under a fixed 50k-iteration training procedure. As shown in [PITH_FULL_IMAGE:figures/full_fig_p009_5.png] view at source ↗

**Figure 6.** Figure 6: Representative scene tagging results for dataset curation. For each clip, the left panel shows [PITH_FULL_IMAGE:figures/full_fig_p015_6.png] view at source ↗

**Figure 7.** Figure 7: Examples of scene-evolving VLM guidance. The guidance adapts to changing scene [PITH_FULL_IMAGE:figures/full_fig_p015_7.png] view at source ↗

**Figure 8.** Figure 8: Qualitative results on NAVSIM (top two rows) and PhysicalAI-Autonomous-Vehicles [PITH_FULL_IMAGE:figures/full_fig_p017_8.png] view at source ↗

read the original abstract

Pretrained foundation models have become an important basis for end-to-end autonomous driving. In contrast to vision-language models pretrained primarily on static image-text pairs, video generative models capture temporal dynamics and motion priors that are naturally suited for driving. We present DriveWAM, a driving world-action model that adapts a pretrained video diffusion transformer into an autoregressive video-action policy. DriveWAM organizes video and action streams into a unified temporal token sequence and trains them under a joint flow-matching objective, preserving the pretrained video-generation architecture while adapting its large-scale video priors to action generation. To incorporate high-level scene understanding, we introduce scene-evolving driving guidance, where a frozen VLM produces chunk-specific semantic intent to guide video-action generation. To keep long-horizon rollout bounded, we further introduce selective KV memory, which maintains bounded modality-aware video and action memory pools through relevance-redundancy cache selection at inference time. Experiments on NAVSIM and the PhysicalAI-Autonomous-Vehicles benchmark show that DriveWAM achieves strong planning performance, and a data-scaling study from 4k to 100k driving clips further confirms the scaling potential of world-action modeling for end-to-end autonomous driving.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

DriveWAM adapts a video diffusion transformer to joint video-action generation for driving with unified tokens, VLM guidance, and selective KV memory, but the abstract supplies no metrics to support the performance claims.

read the letter

DriveWAM takes a pretrained video diffusion transformer and repurposes it as an autoregressive policy by packing video frames and actions into one token sequence trained under joint flow matching. The three concrete additions are scene-evolving guidance from a frozen VLM that supplies chunk-level intent, selective KV memory that keeps separate video and action caches bounded by relevance-redundancy pruning, and the decision to leave the core transformer mostly unchanged.

Those choices are sensible engineering moves. Video models already encode motion, so folding actions into the same sequence avoids building a separate policy head from scratch. The memory trick directly tackles the rollout length problem that usually kills autoregressive video models in driving. The scaling experiment idea, even if only sketched, shows they are thinking about data efficiency rather than single-run results.

The obvious gap is the complete absence of numbers. The abstract says the model achieves strong planning performance on NAVSIM and PhysicalAI-AV and that scaling from 4k to 100k clips works, yet it gives no success rates, no baseline comparisons, no ablation on the new components, and no failure cases. Without those, it is impossible to tell whether the joint objective actually transfers useful priors or whether the model simply learns to copy the video distribution while actions lag. The transfer assumption itself may be load-bearing and needs direct evidence.

The paper is aimed at researchers working on foundation-model approaches to end-to-end driving. A reader already following video diffusion or world-model work will pick up the tokenization and memory details quickly. It is coherent enough on its own terms to deserve referee time; the method is described clearly and the practical constraints are addressed, even if the current evidence is thin.

Referee Report

1 major / 0 minor

Summary. The manuscript introduces DriveWAM, a driving world-action model that adapts a pretrained video diffusion transformer into an autoregressive video-action policy. The model unifies video and action streams into a temporal token sequence trained under a joint flow-matching objective to preserve video-generation priors. It adds scene-evolving driving guidance using a frozen VLM for semantic intent and selective KV memory for bounded long-horizon rollouts. Experiments on NAVSIM and PhysicalAI-Autonomous-Vehicles benchmarks are claimed to show strong planning performance, with a scaling study from 4k to 100k clips supporting the approach's potential for end-to-end autonomous driving.

Significance. If the experimental results hold, the paper would make a significant contribution by demonstrating an effective method to transfer large-scale video generative priors to action generation for driving tasks. The joint objective and memory management techniques could enable scalable world-action modeling, with the data scaling study providing evidence for the benefits of larger datasets in this domain.

major comments (1)

Abstract: The abstract asserts 'strong planning performance' and 'scaling potential' but supplies no quantitative metrics, baseline comparisons, ablation results, or error analysis. Without these details the support for the central claim cannot be evaluated.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the detailed review and constructive comment. We agree that the abstract would benefit from greater specificity to support its claims and will revise it accordingly in the next version of the manuscript.

read point-by-point responses

Referee: Abstract: The abstract asserts 'strong planning performance' and 'scaling potential' but supplies no quantitative metrics, baseline comparisons, ablation results, or error analysis. Without these details the support for the central claim cannot be evaluated.

Authors: We agree with this observation. The current abstract is intentionally concise but does not include the specific numbers that would allow readers to immediately assess the strength of the reported results. In the revised manuscript we will expand the abstract to include the primary quantitative outcomes (e.g., planning metrics on NAVSIM and PhysicalAI-Autonomous-Vehicles, the performance delta relative to the strongest baselines, and the scaling trend from 4k to 100k clips) while remaining within typical abstract length limits. This change directly addresses the concern without altering the technical content of the paper. revision: yes

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The abstract and description present DriveWAM as an empirical adaptation of an external pretrained video diffusion transformer using a joint flow-matching objective on unified video-action tokens, with scene guidance from a frozen VLM and selective KV memory. Performance is validated on external benchmarks (NAVSIM, PhysicalAI-Autonomous-Vehicles) and a scaling study from 4k to 100k clips. No equations, fitted parameters renamed as predictions, self-citations, or derivation steps that reduce claims to inputs by construction are present. The approach is self-contained against external benchmarks with no load-bearing internal reductions.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available; no equations, training details, or modeling assumptions are stated that would allow identification of free parameters, axioms, or invented entities.

pith-pipeline@v0.9.1-grok · 5757 in / 1155 out tokens · 40779 ms · 2026-06-29T13:39:39.605654+00:00 · methodology

discussion (0)

Forward citations

Cited by 4 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Kamera: Unified Position-Invariant Multimodal KV Cache for Training-Free Reuse
cs.DC 2026-06 unverdicted novelty 6.0

Kamera stores a low-rank patch with each position-free KV chunk to restore cross-chunk conditioning lost in naive reuse, enabling cheap reordering, sliding windows, and recall across attention mechanisms.
Diffusion Transformer World-Action Model for AV Scene Prediction
cs.CV 2026-06 unverdicted novelty 6.0

A Diffusion Transformer world model in V-JEPA2 latent space predicts action-conditioned future scenes on nuScenes, outperforming regression on KID/FID while preserving steering controllability and adding a jump model ...
Discrete-WAM: Unified Discrete Vision-Action Token Editing for World-Policy Learning
cs.RO 2026-06 unverdicted novelty 5.0

Discrete-WAM unifies world modeling and policy learning for autonomous driving by representing observations, states, decisions, and actions as tokens in one space and using hierarchical token editing for planning.
World Action Models: A Survey
cs.RO 2026-06 unverdicted novelty 3.0

A survey that clarifies boundaries and organizes World Action Models by generation requirements and predictive substrates, identifying a trend toward generating less of the future.

Reference graph

Works this paper leans on

63 extracted references · 21 canonical work pages · cited by 4 Pith papers · 14 internal anchors

[1]

$\pi_0$: A Vision-Language-Action Flow Model for General Robot Control

Kevin Black, Noah Brown, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, Lachy Groom, Karol Hausman, Brian Ichter, et al. π0: A vision-language-action flow model for general robot control.arXiv preprint arXiv:2410.24164, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[2]

$\pi_{0.5}$: a Vision-Language-Action Model with Open-World Generalization

Physical Intelligence, Kevin Black, Noah Brown, James Darpinian, Karan Dhabalia, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, et al. π0.5: a vision-language-action model with open-world generalization.arXiv preprint arXiv:2504.16054, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[3]

DriveVLA-w0: World models amplify data scaling law in autonomous driving

Yingyan Li, Shuyao Shang, Weisong Liu, Bing Zhan, Haochen Wang, Yuqi Wang, Yuntao Chen, Xiaoman Wang, AnYasong, Chufeng Tang, Lu Hou, Lue Fan, and Zhaoxiang Zhang. DriveVLA-w0: World models amplify data scaling law in autonomous driving. InICLR, 2026

2026
[4]

Recogdrive: A reinforced cognitive framework for end-to-end autonomous driving

Yongkang Li, Kaixin Xiong, Xiangyu Guo, Fang Li, Sixu Yan, Gangwei Xu, Lijun Zhou, Long Chen, Haiyang Sun, Bing Wang, et al. Recogdrive: A reinforced cognitive framework for end-to-end autonomous driving. InICLR, 2026

2026
[5]

Simlingo: Vision-only closed-loop autonomous driving with language-action alignment

Katrin Renz, Long Chen, Elahe Arani, and Oleg Sinavski. Simlingo: Vision-only closed-loop autonomous driving with language-action alignment. InCVPR, 2025

2025
[6]

Drivemoe: Mixture-of-experts for vision-language-action model in end-to-end autonomous driving

Zhenjie Yang, Yilin Chai, Xiaosong Jia, Qifeng Li, Yuqian Shao, Xuekai Zhu, Haisheng Su, and Junchi Yan. Drivemoe: Mixture-of-experts for vision-language-action model in end-to-end autonomous driving. InCVPR, 2026

2026
[7]

Langcoop: Collaborative driving with language

Xiangbo Gao, Yuheng Wu, Rujia Wang, Chenxi Liu, Yang Zhou, and Zhengzhong Tu. Langcoop: Collaborative driving with language. InCVPR, 2025

2025
[8]

Emu3: Next-Token Prediction is All You Need

Xinlong Wang, Xiaosong Zhang, Zhengxiong Luo, Quan Sun, Yufeng Cui, Jinsheng Wang, Fan Zhang, Yueze Wang, Zhen Li, Qiying Yu, et al. Emu3: Next-token prediction is all you need.arXiv preprint arXiv:2409.18869, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[9]

Qwen3-VL Technical Report

Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, Wenbin Ge, Zhifang Guo, Qidong Huang, Jie Huang, Fei Huang, Binyuan Hui, Shutong Jiang, Zhaohai Li, Mingsheng Li, Mei Li, Kaixin Li, Zicheng Lin, Junyang Lin, Xuejing Liu, Jiawei Liu, Chenglong Liu, Yang Liu, Dayiheng Liu, Shixuan ...

work page internal anchor Pith review Pith/arXiv arXiv 2025
[10]

Visual instruction tuning

Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. InNeurIPS, 2023

2023
[11]

PaliGemma: A versatile 3B VLM for transfer

Lucas Beyer, Andreas Steiner, André Susano Pinto, Alexander Kolesnikov, Xiao Wang, Daniel Salz, Maxim Neumann, Ibrahim Alabdulmohsin, Michael Tschannen, Emanuele Bugliarello, et al. Paligemma: A versatile 3b vlm for transfer.arXiv preprint arXiv:2407.07726, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[12]

Flamingo: a visual language model for few-shot learning

Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, et al. Flamingo: a visual language model for few-shot learning. InNeurIPS, 2022

2022
[13]

Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution

Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, et al. Qwen2-vl: Enhancing vision-language model’s perception of the world at any resolution.arXiv preprint arXiv:2409.12191, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[14]

Drivedreamer-policy: A geometry-grounded world-action model for unified generation and planning.arXiv preprint arXiv:2604.01765, 2026

Yang Zhou, Xiaofeng Wang, Hao Shao, Letian Wang, Guosheng Zhao, Jiangnan Shao, Jiagang Zhu, Tingdong Yu, Zheng Zhu, Guan Huang, et al. Drivedreamer-policy: A geometry-grounded world-action model for unified generation and planning.arXiv preprint arXiv:2604.01765, 2026

work page arXiv 2026
[15]

Futuresightdrive: Thinking visually with spatio-temporal cot for autonomous driving

Shuang Zeng, Xinyuan Chang, Mengwei Xie, Xinran Liu, Yifan Bai, Zheng Pan, Mu Xu, and Xing Wei. Futuresightdrive: Thinking visually with spatio-temporal cot for autonomous driving. InNeurIPS, 2025

2025
[16]

Cosmos policy: Fine-tuning video models for visuomotor control and planning

Moo Jin Kim, Yihuai Gao, Tsung-Yi Lin, Yen-Chen Lin, Yunhao Ge, Grace Lam, Percy Liang, Shuran Song, Ming-Yu Liu, Chelsea Finn, and Jinwei Gu. Cosmos policy: Fine-tuning video models for visuomotor control and planning. InICLR, 2026. 10

2026
[17]

World Action Models are Zero-shot Policies

Seonghyeon Ye, Yunhao Ge, Kaiyuan Zheng, Shenyuan Gao, Sihyun Yu, George Kurian, Suneel Indupuru, You Liang Tan, Chuning Zhu, Jiannan Xiang, et al. World action models are zero-shot policies.arXiv preprint arXiv:2602.15922, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[18]

Fast-WAM: Do World Action Models Need Test-time Future Imagination?

Tianyuan Yuan, Zibin Dong, Yicheng Liu, and Hang Zhao. Fast-wam: Do world action models need test-time future imagination?arXiv preprint arXiv:2603.16666, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[19]

Causal World Modeling for Robot Control

Lin Li, Qihang Zhang, Yiming Luo, Shuai Yang, Ruilin Wang, Fei Han, Mingrui Yu, Zelin Gao, Nan Xue, Xing Zhu, et al. Causal world modeling for robot control.arXiv preprint arXiv:2601.21998, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[20]

Gigaworld-policy: An efficient action-centered world–action model.arXiv preprint arXiv:2603.17240, 2026

Angen Ye, Boyuan Wang, Chaojun Ni, Guan Huang, Guosheng Zhao, Hao Li, Hengtao Li, Jie Li, Jindi Lv, Jingyu Liu, et al. Gigaworld-policy: An efficient action-centered world–action model.arXiv preprint arXiv:2603.17240, 2026

work page arXiv 2026
[21]

Video prediction policy: A generalist robot policy with predictive visual representations

Yucheng Hu, Yanjiang Guo, Pengchao Wang, Xiaoyu Chen, Yen-Jen Wang, Jianke Zhang, Koushil Sreenath, Chaochao Lu, and Jianyu Chen. Video prediction policy: A generalist robot policy with predictive visual representations. InICML, 2025

2025
[22]

Bridging scene generation and planning: Driving with world model via unifying vision and motion representation.arXiv preprint arXiv:2603.14948, 2026

Xingtai Gui, Meijie Zhang, Tianyi Yan, Wencheng Han, Jiahao Gong, Feiyang Tan, Cheng-zhong Xu, and Jianbing Shen. Bridging scene generation and planning: Driving with world model via unifying vision and motion representation.arXiv preprint arXiv:2603.14948, 2026

work page arXiv 2026
[23]

Vavim and vavam: Autonomous driving through video generative modeling.arXiv preprint arXiv:2502.15672, 2025

Florent Bartoccioni, Elias Ramzi, Victor Besnier, Shashanka Venkataramanan, Tuan-Hung Vu, Yihong Xu, Loick Chambon, Spyros Gidaris, Serkan Odabas, David Hurych, Renaud Marlet, Alexandre Boulch, Mickael Chen, Eloi Zablocki, Andrei Bursuc, Eduardo Valle, and Matthieu Cord. Vavim and vavam: Autonomous driving through video generative modeling.arXiv preprint ...

work page arXiv 2025
[24]

Epona: Autoregressive diffusion world model for autonomous driving

Kaiwen Zhang, Zhenyu Tang, Xiaotao Hu, Xingang Pan, Xiaoyang Guo, Yuan Liu, Jingwei Huang, Li Yuan, Qian Zhang, Xiao-Xiao Long, et al. Epona: Autoregressive diffusion world model for autonomous driving. InICCV, 2025

2025
[25]

Yaron Lipman, Ricky T. Q. Chen, Heli Ben-Hamu, Maximilian Nickel, and Matthew Le. Flow matching for generative modeling. InICLR, 2023

2023
[26]

Flow caching for autoregressive video generation

Yuexiao Ma, Xuzhe Zheng, Jing Xu, Xiwei Xu, Feng Ling, Xiawu Zheng, Huafeng Kuang, Huixia Li, Xing Wang, Xuefeng Xiao, et al. Flow caching for autoregressive video generation. InICLR, 2026

2026
[27]

Navsim: Data-driven non-reactive autonomous vehicle simulation and benchmarking

Daniel Dauner, Marcel Hallgarten, Tianyu Li, Xinshuo Weng, Zhiyu Huang, Zetong Yang, Hongyang Li, Igor Gilitschenski, Boris Ivanovic, Marco Pavone, et al. Navsim: Data-driven non-reactive autonomous vehicle simulation and benchmarking. InNeurIPS, 2024

2024
[28]

Alpamayo-R1: Bridging Reasoning and Action Prediction for Generalizable Autonomous Driving in the Long Tail

Yan Wang, Wenjie Luo, Junjie Bai, Yulong Cao, Tong Che, Ke Chen, Yuxiao Chen, Jenna Diamond, Yifan Ding, Wenhao Ding, et al. Alpamayo-r1: Bridging reasoning and action prediction for generalizable autonomous driving in the long tail.arXiv preprint arXiv:2511.00088, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[29]

Drivegpt4: Interpretable end-to-end autonomous driving via large language model

Zhenhua Xu, Yujia Zhang, Enze Xie, Zhen Zhao, Yong Guo, Kwan-Yee K Wong, Zhenguo Li, and Hengshuang Zhao. Drivegpt4: Interpretable end-to-end autonomous driving via large language model. IEEE Robotics and Automation Letters, 2024

2024
[30]

Drivevlm: The convergence of autonomous driving and large vision-language models

Xiaoyu Tian, Junru Gu, Bailin Li, Yicheng Liu, Zhiyong Zhao, Yang Wang, Kun Zhan, Peng Jia, Xianpeng Lang, and Hang Zhao. Drivevlm: The convergence of autonomous driving and large vision-language models. InCoRL, 2024

2024
[31]

Senna: Bridging Large Vision-Language Models and End-to-End Autonomous Driving

Bo Jiang, Shaoyu Chen, Bencheng Liao, Xingyu Zhang, Wei Yin, Qian Zhang, Chang Huang, Wenyu Liu, and Xinggang Wang. Senna: Bridging large vision-language models and end-to-end autonomous driving. arXiv preprint arXiv:2410.22313, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[32]

DriveLM: Driving with graph visual question answering

Chonghao Sima, Katrin Renz, Kashyap Chitta, Li Chen, Hanxue Zhang, Chengen Xie, Jens Beißwenger, Ping Luo, Andreas Geiger, and Hongyang Li. DriveLM: Driving with graph visual question answering. In ECCV, 2024

2024
[33]

Drivemlm: Aligning multi-modal large language models with behavioral planning states for autonomous driving.arXiv preprint arXiv:2312.09245, 2023

Wenhai Wang, Jiangwei Xie, ChuanYang Hu, Haoming Zou, Jianan Fan, Wenwen Tong, Yang Wen, Silei Wu, Hanming Deng, Zhiqi Li, et al. Drivemlm: Aligning multi-modal large language models with behavioral planning states for autonomous driving.arXiv preprint arXiv:2312.09245, 2023

work page arXiv 2023
[34]

A language agent for autonomous driving.arXiv preprint arXiv:2311.10813, 2023

Jiageng Mao, Junjie Ye, Yuxi Qian, Marco Pavone, and Yue Wang. A language agent for autonomous driving.arXiv preprint arXiv:2311.10813, 2023. 11

work page arXiv 2023
[35]

Shihao Wang, Zhiding Yu, Xiaohui Jiang, Shiyi Lan, Min Shi, Nadine Chang, Jan Kautz, Ying Li, and Jose M. Alvarez. OmniDrive: A holistic vision-language dataset for autonomous driving with counterfactual reasoning. InCVPR, 2025

2025
[36]

EMMA: End-to-End Multimodal Model for Autonomous Driving

Jyh-Jing Hwang, Runsheng Xu, Hubert Lin, Wei-Chih Hung, Jingwei Ji, Kristy Choi, Di Huang, Tong He, Paul Covington, Benjamin Sapp, et al. Emma: End-to-end multimodal model for autonomous driving. arXiv preprint arXiv:2410.23262, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[37]

Zhao, Yun Zhang, Zhiyu Huang, Bolei Zhou, and Jiaqi Ma

Zewei Zhou, Tianhui Cai, Seth Z. Zhao, Yun Zhang, Zhiyu Huang, Bolei Zhou, and Jiaqi Ma. Au- toVLA: A vision-language-action model for end-to-end autonomous driving with adaptive reasoning and reinforcement fine-tuning. InNeurIPS, 2025

2025
[38]

MoVQ: Modulating quantized vectors for high-fidelity image generation

Chuanxia Zheng, Long Tung Vuong, Jianfei Cai, and Dinh Phung. MoVQ: Modulating quantized vectors for high-fidelity image generation. InNeurIPS, 2022

2022
[39]

Scalable diffusion models with transformers

William Peebles and Saining Xie. Scalable diffusion models with transformers. InICCV, 2023

2023
[40]

Wan: Open and Advanced Large-Scale Video Generative Models

Team Wan, Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianxiao Yang, et al. Wan: Open and advanced large-scale video generative models.arXiv preprint arXiv:2503.20314, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[41]

Cogvideox: Text-to-video diffusion models with an expert transformer

Zhuoyi Yang, Jiayan Teng, Wendi Zheng, Ming Ding, Shiyu Huang, Jiazheng Xu, Yuanming Yang, Wenyi Hong, Xiaohan Zhang, Guanyu Feng, Da Yin, Yuxuan.Zhang, Weihan Wang, Yean Cheng, Bin Xu, Xiaotao Gu, Yuxiao Dong, and Jie Tang. Cogvideox: Text-to-video diffusion models with an expert transformer. In ICLR, 2025

2025
[42]

Neural discrete representation learning

Aaron Van Den Oord, Oriol Vinyals, et al. Neural discrete representation learning. InNeurIPS, 2017

2017
[43]

Language models are unsupervised multitask learners.OpenAI blog, 2019

Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. Language models are unsupervised multitask learners.OpenAI blog, 2019

2019
[44]

Self forcing: Bridging the train-test gap in autoregressive video diffusion

Xun Huang, Zhengqi Li, Guande He, Mingyuan Zhou, and Eli Shechtman. Self forcing: Bridging the train-test gap in autoregressive video diffusion. InNeurIPS, 2025

2025
[45]

Diffusion forcing: Next-token prediction meets full-sequence diffusion

Boyuan Chen, Diego Martí Monsó, Yilun Du, Max Simchowitz, Russ Tedrake, and Vincent Sitzmann. Diffusion forcing: Next-token prediction meets full-sequence diffusion. InNeurIPS, 2024

2024
[46]

From slow bidirectional to fast autoregressive video diffusion models

Tianwei Yin, Qiang Zhang, Richard Zhang, William T Freeman, Fredo Durand, Eli Shechtman, and Xun Huang. From slow bidirectional to fast autoregressive video diffusion models. InCVPR, 2025

2025
[47]

Sparse video-gen: Accelerating video diffusion transformers with spatial-temporal sparsity

Haocheng Xi, Shuo Yang, Yilong Zhao, Chenfeng Xu, Muyang Li, Xiuyu Li, Yujun Lin, Han Cai, Jintao Zhang, Dacheng Li, Jianfei Chen, Ion Stoica, Kurt Keutzer, and Song Han. Sparse video-gen: Accelerating video diffusion transformers with spatial-temporal sparsity. InICML, 2025

2025
[48]

Sparse videogen2: Accelerate video generation with sparse attention via semantic-aware permutation

Shuo Yang, Haocheng Xi, Yilong Zhao, Muyang Li, Jintao Zhang, Han Cai, Yujun Lin, Xiuyu Li, Chenfeng Xu, Kelly Peng, Jianfei Chen, Song Han, Kurt Keutzer, and Ion Stoica. Sparse videogen2: Accelerate video generation with sparse attention via semantic-aware permutation. InNeurIPS, 2025

2025
[49]

Adaptive caching for faster video generation with diffusion transformers

Kumara Kahatapitiya, Haozhe Liu, Sen He, Ding Liu, Menglin Jia, Chenyang Zhang, Michael S Ryoo, and Tian Xie. Adaptive caching for faster video generation with diffusion transformers. InICCV, 2025

2025
[50]

Flow straight and fast: Learning to generate and transfer data with rectified flow

Xingchao Liu, Chengyue Gong, and qiang liu. Flow straight and fast: Learning to generate and transfer data with rectified flow. InICLR, 2023

2023
[51]

Fifo-diffusion: Generating infinite videos from text without training.NeurIPS, 2024

Jihwan Kim, Junoh Kang, Jinyoung Choi, and Bohyung Han. Fifo-diffusion: Generating infinite videos from text without training.NeurIPS, 2024

2024
[52]

X-world: Controllable ego-centric multi-camera world models for scalable end-to-end driving.arXiv preprint arXiv:2603.19979, 2026

Chaoda Zheng, Sean Li, Jinhao Deng, Zhennan Wang, Shijia Chen, Liqiang Xiao, Ziheng Chi, Hongbin Lin, Kangjie Chen, Boyang Wang, et al. X-world: Controllable ego-centric multi-camera world models for scalable end-to-end driving.arXiv preprint arXiv:2603.19979, 2026

work page arXiv 2026
[53]

Hydra-MDP: End-to-end Multimodal Planning with Multi-target Hydra-Distillation

Zhenxin Li, Kailin Li, Shihao Wang, Shiyi Lan, Zhiding Yu, Yishen Ji, Zhiqi Li, Ziyue Zhu, Jan Kautz, Zuxuan Wu, et al. Hydra-mdp: End-to-end multimodal planning with multi-target hydra-distillation.arXiv preprint arXiv:2406.06978, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[54]

Planning-oriented autonomous driving

Yihan Hu, Jiazhi Yang, Li Chen, Keyu Li, Chonghao Sima, Xizhou Zhu, Siqi Chai, Senyao Du, Tianwei Lin, Wenhai Wang, et al. Planning-oriented autonomous driving. InCVPR, 2023

2023
[55]

Transfuser: Imitation with transformer-based sensor fusion for autonomous driving.TPAMI, 2023

Kashyap Chitta, Aditya Prakash, Bernhard Jaeger, Zehao Yu, Katrin Renz, and Andreas Geiger. Transfuser: Imitation with transformer-based sensor fusion for autonomous driving.TPAMI, 2023. 12

2023
[56]

Para-drive: Parallelized architecture for real-time autonomous driving

Xinshuo Weng, Boris Ivanovic, Yan Wang, Yue Wang, and Marco Pavone. Para-drive: Parallelized architecture for real-time autonomous driving. InCVPR, 2024

2024
[57]

Enhancing end-to-end autonomous driving with latent world model

Yingyan Li, Lue Fan, Jiawei He, Yuqi Wang, Yuntao Chen, Zhaoxiang Zhang, and Tieniu Tan. Enhancing end-to-end autonomous driving with latent world model. InICLR, 2025

2025
[58]

Diffusiondrive: Truncated diffusion model for end-to-end autonomous driving

Bencheng Liao, Shaoyu Chen, Haoran Yin, Bo Jiang, Cheng Wang, Sixu Yan, Xinbang Zhang, Xiangyu Li, Ying Zhang, Qian Zhang, et al. Diffusiondrive: Truncated diffusion model for end-to-end autonomous driving. InCVPR, 2025

2025
[59]

End-to-end driving with online trajectory evaluation via bev world model

Yingyan Li, Yuqi Wang, Yang Liu, Jiawei He, Lue Fan, and Zhaoxiang Zhang. End-to-end driving with online trajectory evaluation via bev world model. InICCV, 2025

2025
[60]

Openscene: The largest up-to-date 3d occupancy prediction benchmark in autonomous driving.https://github.com/OpenDriveLab/OpenScene, 2023

OpenScene Contributors. Openscene: The largest up-to-date 3d occupancy prediction benchmark in autonomous driving.https://github.com/OpenDriveLab/OpenScene, 2023

2023
[61]

Towards learning-based planning: The nuplan benchmark for real-world autonomous driving

Napat Karnchanachari, Dimitris Geromichalos, Kok Seang Tan, Nanxiang Li, Christopher Eriksen, Shakiba Yaghoubi, Noushin Mehdipour, Gianmarco Bernasconi, Whye Kit Fong, Yiluan Guo, et al. Towards learning-based planning: The nuplan benchmark for real-world autonomous driving. InICRA, 2024

2024
[62]

Decoupled weight decay regularization

Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. InICLR, 2019

2019
[63]

Generalized predictive model for autonomous driving

Jiazhi Yang, Shenyuan Gao, Yihang Qiu, Li Chen, Tianyu Li, Bo Dai, Kashyap Chitta, Penghao Wu, Jia Zeng, Ping Luo, et al. Generalized predictive model for autonomous driving. InCVPR, 2024. 13 A Dataset Curation The PhysicalAI-Autonomous-Vehicles benchmark contains roughly 1,700 hours of driving organized into 306,152 20-second clips. To focus evaluation a...

2024

[1] [1]

$\pi_0$: A Vision-Language-Action Flow Model for General Robot Control

Kevin Black, Noah Brown, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, Lachy Groom, Karol Hausman, Brian Ichter, et al. π0: A vision-language-action flow model for general robot control.arXiv preprint arXiv:2410.24164, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[2] [2]

$\pi_{0.5}$: a Vision-Language-Action Model with Open-World Generalization

Physical Intelligence, Kevin Black, Noah Brown, James Darpinian, Karan Dhabalia, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, et al. π0.5: a vision-language-action model with open-world generalization.arXiv preprint arXiv:2504.16054, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[3] [3]

DriveVLA-w0: World models amplify data scaling law in autonomous driving

Yingyan Li, Shuyao Shang, Weisong Liu, Bing Zhan, Haochen Wang, Yuqi Wang, Yuntao Chen, Xiaoman Wang, AnYasong, Chufeng Tang, Lu Hou, Lue Fan, and Zhaoxiang Zhang. DriveVLA-w0: World models amplify data scaling law in autonomous driving. InICLR, 2026

2026

[4] [4]

Recogdrive: A reinforced cognitive framework for end-to-end autonomous driving

Yongkang Li, Kaixin Xiong, Xiangyu Guo, Fang Li, Sixu Yan, Gangwei Xu, Lijun Zhou, Long Chen, Haiyang Sun, Bing Wang, et al. Recogdrive: A reinforced cognitive framework for end-to-end autonomous driving. InICLR, 2026

2026

[5] [5]

Simlingo: Vision-only closed-loop autonomous driving with language-action alignment

Katrin Renz, Long Chen, Elahe Arani, and Oleg Sinavski. Simlingo: Vision-only closed-loop autonomous driving with language-action alignment. InCVPR, 2025

2025

[6] [6]

Drivemoe: Mixture-of-experts for vision-language-action model in end-to-end autonomous driving

Zhenjie Yang, Yilin Chai, Xiaosong Jia, Qifeng Li, Yuqian Shao, Xuekai Zhu, Haisheng Su, and Junchi Yan. Drivemoe: Mixture-of-experts for vision-language-action model in end-to-end autonomous driving. InCVPR, 2026

2026

[7] [7]

Langcoop: Collaborative driving with language

Xiangbo Gao, Yuheng Wu, Rujia Wang, Chenxi Liu, Yang Zhou, and Zhengzhong Tu. Langcoop: Collaborative driving with language. InCVPR, 2025

2025

[8] [8]

Emu3: Next-Token Prediction is All You Need

Xinlong Wang, Xiaosong Zhang, Zhengxiong Luo, Quan Sun, Yufeng Cui, Jinsheng Wang, Fan Zhang, Yueze Wang, Zhen Li, Qiying Yu, et al. Emu3: Next-token prediction is all you need.arXiv preprint arXiv:2409.18869, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[9] [9]

Qwen3-VL Technical Report

Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, Wenbin Ge, Zhifang Guo, Qidong Huang, Jie Huang, Fei Huang, Binyuan Hui, Shutong Jiang, Zhaohai Li, Mingsheng Li, Mei Li, Kaixin Li, Zicheng Lin, Junyang Lin, Xuejing Liu, Jiawei Liu, Chenglong Liu, Yang Liu, Dayiheng Liu, Shixuan ...

work page internal anchor Pith review Pith/arXiv arXiv 2025

[10] [10]

Visual instruction tuning

Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. InNeurIPS, 2023

2023

[11] [11]

PaliGemma: A versatile 3B VLM for transfer

Lucas Beyer, Andreas Steiner, André Susano Pinto, Alexander Kolesnikov, Xiao Wang, Daniel Salz, Maxim Neumann, Ibrahim Alabdulmohsin, Michael Tschannen, Emanuele Bugliarello, et al. Paligemma: A versatile 3b vlm for transfer.arXiv preprint arXiv:2407.07726, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[12] [12]

Flamingo: a visual language model for few-shot learning

Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, et al. Flamingo: a visual language model for few-shot learning. InNeurIPS, 2022

2022

[13] [13]

Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution

Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, et al. Qwen2-vl: Enhancing vision-language model’s perception of the world at any resolution.arXiv preprint arXiv:2409.12191, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[14] [14]

Drivedreamer-policy: A geometry-grounded world-action model for unified generation and planning.arXiv preprint arXiv:2604.01765, 2026

Yang Zhou, Xiaofeng Wang, Hao Shao, Letian Wang, Guosheng Zhao, Jiangnan Shao, Jiagang Zhu, Tingdong Yu, Zheng Zhu, Guan Huang, et al. Drivedreamer-policy: A geometry-grounded world-action model for unified generation and planning.arXiv preprint arXiv:2604.01765, 2026

work page arXiv 2026

[15] [15]

Futuresightdrive: Thinking visually with spatio-temporal cot for autonomous driving

Shuang Zeng, Xinyuan Chang, Mengwei Xie, Xinran Liu, Yifan Bai, Zheng Pan, Mu Xu, and Xing Wei. Futuresightdrive: Thinking visually with spatio-temporal cot for autonomous driving. InNeurIPS, 2025

2025

[16] [16]

Cosmos policy: Fine-tuning video models for visuomotor control and planning

Moo Jin Kim, Yihuai Gao, Tsung-Yi Lin, Yen-Chen Lin, Yunhao Ge, Grace Lam, Percy Liang, Shuran Song, Ming-Yu Liu, Chelsea Finn, and Jinwei Gu. Cosmos policy: Fine-tuning video models for visuomotor control and planning. InICLR, 2026. 10

2026

[17] [17]

World Action Models are Zero-shot Policies

Seonghyeon Ye, Yunhao Ge, Kaiyuan Zheng, Shenyuan Gao, Sihyun Yu, George Kurian, Suneel Indupuru, You Liang Tan, Chuning Zhu, Jiannan Xiang, et al. World action models are zero-shot policies.arXiv preprint arXiv:2602.15922, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026

[18] [18]

Fast-WAM: Do World Action Models Need Test-time Future Imagination?

Tianyuan Yuan, Zibin Dong, Yicheng Liu, and Hang Zhao. Fast-wam: Do world action models need test-time future imagination?arXiv preprint arXiv:2603.16666, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026

[19] [19]

Causal World Modeling for Robot Control

Lin Li, Qihang Zhang, Yiming Luo, Shuai Yang, Ruilin Wang, Fei Han, Mingrui Yu, Zelin Gao, Nan Xue, Xing Zhu, et al. Causal world modeling for robot control.arXiv preprint arXiv:2601.21998, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026

[20] [20]

Gigaworld-policy: An efficient action-centered world–action model.arXiv preprint arXiv:2603.17240, 2026

Angen Ye, Boyuan Wang, Chaojun Ni, Guan Huang, Guosheng Zhao, Hao Li, Hengtao Li, Jie Li, Jindi Lv, Jingyu Liu, et al. Gigaworld-policy: An efficient action-centered world–action model.arXiv preprint arXiv:2603.17240, 2026

work page arXiv 2026

[21] [21]

Video prediction policy: A generalist robot policy with predictive visual representations

Yucheng Hu, Yanjiang Guo, Pengchao Wang, Xiaoyu Chen, Yen-Jen Wang, Jianke Zhang, Koushil Sreenath, Chaochao Lu, and Jianyu Chen. Video prediction policy: A generalist robot policy with predictive visual representations. InICML, 2025

2025

[22] [22]

Bridging scene generation and planning: Driving with world model via unifying vision and motion representation.arXiv preprint arXiv:2603.14948, 2026

Xingtai Gui, Meijie Zhang, Tianyi Yan, Wencheng Han, Jiahao Gong, Feiyang Tan, Cheng-zhong Xu, and Jianbing Shen. Bridging scene generation and planning: Driving with world model via unifying vision and motion representation.arXiv preprint arXiv:2603.14948, 2026

work page arXiv 2026

[23] [23]

Vavim and vavam: Autonomous driving through video generative modeling.arXiv preprint arXiv:2502.15672, 2025

Florent Bartoccioni, Elias Ramzi, Victor Besnier, Shashanka Venkataramanan, Tuan-Hung Vu, Yihong Xu, Loick Chambon, Spyros Gidaris, Serkan Odabas, David Hurych, Renaud Marlet, Alexandre Boulch, Mickael Chen, Eloi Zablocki, Andrei Bursuc, Eduardo Valle, and Matthieu Cord. Vavim and vavam: Autonomous driving through video generative modeling.arXiv preprint ...

work page arXiv 2025

[24] [24]

Epona: Autoregressive diffusion world model for autonomous driving

Kaiwen Zhang, Zhenyu Tang, Xiaotao Hu, Xingang Pan, Xiaoyang Guo, Yuan Liu, Jingwei Huang, Li Yuan, Qian Zhang, Xiao-Xiao Long, et al. Epona: Autoregressive diffusion world model for autonomous driving. InICCV, 2025

2025

[25] [25]

Yaron Lipman, Ricky T. Q. Chen, Heli Ben-Hamu, Maximilian Nickel, and Matthew Le. Flow matching for generative modeling. InICLR, 2023

2023

[26] [26]

Flow caching for autoregressive video generation

Yuexiao Ma, Xuzhe Zheng, Jing Xu, Xiwei Xu, Feng Ling, Xiawu Zheng, Huafeng Kuang, Huixia Li, Xing Wang, Xuefeng Xiao, et al. Flow caching for autoregressive video generation. InICLR, 2026

2026

[27] [27]

Navsim: Data-driven non-reactive autonomous vehicle simulation and benchmarking

Daniel Dauner, Marcel Hallgarten, Tianyu Li, Xinshuo Weng, Zhiyu Huang, Zetong Yang, Hongyang Li, Igor Gilitschenski, Boris Ivanovic, Marco Pavone, et al. Navsim: Data-driven non-reactive autonomous vehicle simulation and benchmarking. InNeurIPS, 2024

2024

[28] [28]

Alpamayo-R1: Bridging Reasoning and Action Prediction for Generalizable Autonomous Driving in the Long Tail

Yan Wang, Wenjie Luo, Junjie Bai, Yulong Cao, Tong Che, Ke Chen, Yuxiao Chen, Jenna Diamond, Yifan Ding, Wenhao Ding, et al. Alpamayo-r1: Bridging reasoning and action prediction for generalizable autonomous driving in the long tail.arXiv preprint arXiv:2511.00088, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[29] [29]

Drivegpt4: Interpretable end-to-end autonomous driving via large language model

Zhenhua Xu, Yujia Zhang, Enze Xie, Zhen Zhao, Yong Guo, Kwan-Yee K Wong, Zhenguo Li, and Hengshuang Zhao. Drivegpt4: Interpretable end-to-end autonomous driving via large language model. IEEE Robotics and Automation Letters, 2024

2024

[30] [30]

Drivevlm: The convergence of autonomous driving and large vision-language models

Xiaoyu Tian, Junru Gu, Bailin Li, Yicheng Liu, Zhiyong Zhao, Yang Wang, Kun Zhan, Peng Jia, Xianpeng Lang, and Hang Zhao. Drivevlm: The convergence of autonomous driving and large vision-language models. InCoRL, 2024

2024

[31] [31]

Senna: Bridging Large Vision-Language Models and End-to-End Autonomous Driving

Bo Jiang, Shaoyu Chen, Bencheng Liao, Xingyu Zhang, Wei Yin, Qian Zhang, Chang Huang, Wenyu Liu, and Xinggang Wang. Senna: Bridging large vision-language models and end-to-end autonomous driving. arXiv preprint arXiv:2410.22313, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[32] [32]

DriveLM: Driving with graph visual question answering

Chonghao Sima, Katrin Renz, Kashyap Chitta, Li Chen, Hanxue Zhang, Chengen Xie, Jens Beißwenger, Ping Luo, Andreas Geiger, and Hongyang Li. DriveLM: Driving with graph visual question answering. In ECCV, 2024

2024

[33] [33]

Drivemlm: Aligning multi-modal large language models with behavioral planning states for autonomous driving.arXiv preprint arXiv:2312.09245, 2023

Wenhai Wang, Jiangwei Xie, ChuanYang Hu, Haoming Zou, Jianan Fan, Wenwen Tong, Yang Wen, Silei Wu, Hanming Deng, Zhiqi Li, et al. Drivemlm: Aligning multi-modal large language models with behavioral planning states for autonomous driving.arXiv preprint arXiv:2312.09245, 2023

work page arXiv 2023

[34] [34]

A language agent for autonomous driving.arXiv preprint arXiv:2311.10813, 2023

Jiageng Mao, Junjie Ye, Yuxi Qian, Marco Pavone, and Yue Wang. A language agent for autonomous driving.arXiv preprint arXiv:2311.10813, 2023. 11

work page arXiv 2023

[35] [35]

Shihao Wang, Zhiding Yu, Xiaohui Jiang, Shiyi Lan, Min Shi, Nadine Chang, Jan Kautz, Ying Li, and Jose M. Alvarez. OmniDrive: A holistic vision-language dataset for autonomous driving with counterfactual reasoning. InCVPR, 2025

2025

[36] [36]

EMMA: End-to-End Multimodal Model for Autonomous Driving

Jyh-Jing Hwang, Runsheng Xu, Hubert Lin, Wei-Chih Hung, Jingwei Ji, Kristy Choi, Di Huang, Tong He, Paul Covington, Benjamin Sapp, et al. Emma: End-to-end multimodal model for autonomous driving. arXiv preprint arXiv:2410.23262, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[37] [37]

Zhao, Yun Zhang, Zhiyu Huang, Bolei Zhou, and Jiaqi Ma

Zewei Zhou, Tianhui Cai, Seth Z. Zhao, Yun Zhang, Zhiyu Huang, Bolei Zhou, and Jiaqi Ma. Au- toVLA: A vision-language-action model for end-to-end autonomous driving with adaptive reasoning and reinforcement fine-tuning. InNeurIPS, 2025

2025

[38] [38]

MoVQ: Modulating quantized vectors for high-fidelity image generation

Chuanxia Zheng, Long Tung Vuong, Jianfei Cai, and Dinh Phung. MoVQ: Modulating quantized vectors for high-fidelity image generation. InNeurIPS, 2022

2022

[39] [39]

Scalable diffusion models with transformers

William Peebles and Saining Xie. Scalable diffusion models with transformers. InICCV, 2023

2023

[40] [40]

Wan: Open and Advanced Large-Scale Video Generative Models

Team Wan, Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianxiao Yang, et al. Wan: Open and advanced large-scale video generative models.arXiv preprint arXiv:2503.20314, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[41] [41]

Cogvideox: Text-to-video diffusion models with an expert transformer

Zhuoyi Yang, Jiayan Teng, Wendi Zheng, Ming Ding, Shiyu Huang, Jiazheng Xu, Yuanming Yang, Wenyi Hong, Xiaohan Zhang, Guanyu Feng, Da Yin, Yuxuan.Zhang, Weihan Wang, Yean Cheng, Bin Xu, Xiaotao Gu, Yuxiao Dong, and Jie Tang. Cogvideox: Text-to-video diffusion models with an expert transformer. In ICLR, 2025

2025

[42] [42]

Neural discrete representation learning

Aaron Van Den Oord, Oriol Vinyals, et al. Neural discrete representation learning. InNeurIPS, 2017

2017

[43] [43]

Language models are unsupervised multitask learners.OpenAI blog, 2019

Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. Language models are unsupervised multitask learners.OpenAI blog, 2019

2019

[44] [44]

Self forcing: Bridging the train-test gap in autoregressive video diffusion

Xun Huang, Zhengqi Li, Guande He, Mingyuan Zhou, and Eli Shechtman. Self forcing: Bridging the train-test gap in autoregressive video diffusion. InNeurIPS, 2025

2025

[45] [45]

Diffusion forcing: Next-token prediction meets full-sequence diffusion

Boyuan Chen, Diego Martí Monsó, Yilun Du, Max Simchowitz, Russ Tedrake, and Vincent Sitzmann. Diffusion forcing: Next-token prediction meets full-sequence diffusion. InNeurIPS, 2024

2024

[46] [46]

From slow bidirectional to fast autoregressive video diffusion models

Tianwei Yin, Qiang Zhang, Richard Zhang, William T Freeman, Fredo Durand, Eli Shechtman, and Xun Huang. From slow bidirectional to fast autoregressive video diffusion models. InCVPR, 2025

2025

[47] [47]

Sparse video-gen: Accelerating video diffusion transformers with spatial-temporal sparsity

Haocheng Xi, Shuo Yang, Yilong Zhao, Chenfeng Xu, Muyang Li, Xiuyu Li, Yujun Lin, Han Cai, Jintao Zhang, Dacheng Li, Jianfei Chen, Ion Stoica, Kurt Keutzer, and Song Han. Sparse video-gen: Accelerating video diffusion transformers with spatial-temporal sparsity. InICML, 2025

2025

[48] [48]

Sparse videogen2: Accelerate video generation with sparse attention via semantic-aware permutation

Shuo Yang, Haocheng Xi, Yilong Zhao, Muyang Li, Jintao Zhang, Han Cai, Yujun Lin, Xiuyu Li, Chenfeng Xu, Kelly Peng, Jianfei Chen, Song Han, Kurt Keutzer, and Ion Stoica. Sparse videogen2: Accelerate video generation with sparse attention via semantic-aware permutation. InNeurIPS, 2025

2025

[49] [49]

Adaptive caching for faster video generation with diffusion transformers

Kumara Kahatapitiya, Haozhe Liu, Sen He, Ding Liu, Menglin Jia, Chenyang Zhang, Michael S Ryoo, and Tian Xie. Adaptive caching for faster video generation with diffusion transformers. InICCV, 2025

2025

[50] [50]

Flow straight and fast: Learning to generate and transfer data with rectified flow

Xingchao Liu, Chengyue Gong, and qiang liu. Flow straight and fast: Learning to generate and transfer data with rectified flow. InICLR, 2023

2023

[51] [51]

Fifo-diffusion: Generating infinite videos from text without training.NeurIPS, 2024

Jihwan Kim, Junoh Kang, Jinyoung Choi, and Bohyung Han. Fifo-diffusion: Generating infinite videos from text without training.NeurIPS, 2024

2024

[52] [52]

X-world: Controllable ego-centric multi-camera world models for scalable end-to-end driving.arXiv preprint arXiv:2603.19979, 2026

Chaoda Zheng, Sean Li, Jinhao Deng, Zhennan Wang, Shijia Chen, Liqiang Xiao, Ziheng Chi, Hongbin Lin, Kangjie Chen, Boyang Wang, et al. X-world: Controllable ego-centric multi-camera world models for scalable end-to-end driving.arXiv preprint arXiv:2603.19979, 2026

work page arXiv 2026

[53] [53]

Hydra-MDP: End-to-end Multimodal Planning with Multi-target Hydra-Distillation

Zhenxin Li, Kailin Li, Shihao Wang, Shiyi Lan, Zhiding Yu, Yishen Ji, Zhiqi Li, Ziyue Zhu, Jan Kautz, Zuxuan Wu, et al. Hydra-mdp: End-to-end multimodal planning with multi-target hydra-distillation.arXiv preprint arXiv:2406.06978, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[54] [54]

Planning-oriented autonomous driving

Yihan Hu, Jiazhi Yang, Li Chen, Keyu Li, Chonghao Sima, Xizhou Zhu, Siqi Chai, Senyao Du, Tianwei Lin, Wenhai Wang, et al. Planning-oriented autonomous driving. InCVPR, 2023

2023

[55] [55]

Transfuser: Imitation with transformer-based sensor fusion for autonomous driving.TPAMI, 2023

Kashyap Chitta, Aditya Prakash, Bernhard Jaeger, Zehao Yu, Katrin Renz, and Andreas Geiger. Transfuser: Imitation with transformer-based sensor fusion for autonomous driving.TPAMI, 2023. 12

2023

[56] [56]

Para-drive: Parallelized architecture for real-time autonomous driving

Xinshuo Weng, Boris Ivanovic, Yan Wang, Yue Wang, and Marco Pavone. Para-drive: Parallelized architecture for real-time autonomous driving. InCVPR, 2024

2024

[57] [57]

Enhancing end-to-end autonomous driving with latent world model

Yingyan Li, Lue Fan, Jiawei He, Yuqi Wang, Yuntao Chen, Zhaoxiang Zhang, and Tieniu Tan. Enhancing end-to-end autonomous driving with latent world model. InICLR, 2025

2025

[58] [58]

Diffusiondrive: Truncated diffusion model for end-to-end autonomous driving

Bencheng Liao, Shaoyu Chen, Haoran Yin, Bo Jiang, Cheng Wang, Sixu Yan, Xinbang Zhang, Xiangyu Li, Ying Zhang, Qian Zhang, et al. Diffusiondrive: Truncated diffusion model for end-to-end autonomous driving. InCVPR, 2025

2025

[59] [59]

End-to-end driving with online trajectory evaluation via bev world model

Yingyan Li, Yuqi Wang, Yang Liu, Jiawei He, Lue Fan, and Zhaoxiang Zhang. End-to-end driving with online trajectory evaluation via bev world model. InICCV, 2025

2025

[60] [60]

Openscene: The largest up-to-date 3d occupancy prediction benchmark in autonomous driving.https://github.com/OpenDriveLab/OpenScene, 2023

OpenScene Contributors. Openscene: The largest up-to-date 3d occupancy prediction benchmark in autonomous driving.https://github.com/OpenDriveLab/OpenScene, 2023

2023

[61] [61]

Towards learning-based planning: The nuplan benchmark for real-world autonomous driving

Napat Karnchanachari, Dimitris Geromichalos, Kok Seang Tan, Nanxiang Li, Christopher Eriksen, Shakiba Yaghoubi, Noushin Mehdipour, Gianmarco Bernasconi, Whye Kit Fong, Yiluan Guo, et al. Towards learning-based planning: The nuplan benchmark for real-world autonomous driving. InICRA, 2024

2024

[62] [62]

Decoupled weight decay regularization

Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. InICLR, 2019

2019

[63] [63]

Generalized predictive model for autonomous driving

Jiazhi Yang, Shenyuan Gao, Yihang Qiu, Li Chen, Tianyu Li, Bo Dai, Kashyap Chitta, Penghao Wu, Jia Zeng, Ping Luo, et al. Generalized predictive model for autonomous driving. InCVPR, 2024. 13 A Dataset Curation The PhysicalAI-Autonomous-Vehicles benchmark contains roughly 1,700 hours of driving organized into 306,152 20-second clips. To focus evaluation a...

2024