Learning Vision-Language-Action World Models for Autonomous Driving
Pith reviewed 2026-05-10 17:02 UTC · model grok-4.3
The pith
VLA-World unifies predictive imagination and reflective reasoning to enhance autonomous driving foresight and safety.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
VLA-World first uses an action-derived feasible trajectory to guide generation of the next-frame image, capturing rich spatial and temporal cues. It then reasons over this self-generated imagined frame to refine the predicted trajectory. Supported by the nuScenes-GR-20K dataset and a three-stage training strategy, this unification of imagination and reflection yields higher performance and better interpretability in autonomous driving tasks.
What carries the argument
The action-guided future-frame generation followed by reflective reasoning over the imagined scene, which refines the trajectory prediction.
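The imagine-then-reflect loop can be sketched in a few lines. Every function name below (predict_initial_trajectory, imagine_next_frame, reflect_and_refine) and the toy lateral correction are illustrative placeholders, not the paper's actual interfaces:

```python
# Hypothetical sketch of VLA-World's imagine-then-reflect loop.
# All function bodies are toy stand-ins for the paper's learned modules.

def predict_initial_trajectory(observation):
    # Stage 1: the VLA policy proposes an action-derived feasible
    # trajectory as (x, y) waypoints in the ego frame.
    return [(0.0, 0.0), (1.0, 0.2), (2.0, 0.5)]

def imagine_next_frame(observation, trajectory):
    # Stage 2: the world model renders the next frame conditioned on
    # the trajectory; a dict stands in for an image tensor here.
    return {"frame": "imagined", "conditioned_on": trajectory}

def reflect_and_refine(trajectory, imagined_frame):
    # Stage 3: the model reasons over its own imagined future and
    # adjusts the waypoints; a fixed lateral shrink stands in for the
    # learned refinement.
    return [(x, y * 0.9) for x, y in trajectory]

def vla_world_step(observation):
    tau0 = predict_initial_trajectory(observation)
    frame = imagine_next_frame(observation, tau0)
    tau1 = reflect_and_refine(tau0, frame)
    return tau0, frame, tau1
```

The structural point the sketch makes explicit: the refined trajectory depends on a frame that itself depends on the initial trajectory, which is exactly where generation errors can feed back into planning.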
If this is right
- Improved trajectory prediction accuracy by incorporating future scene reasoning.
- Better interpretability of the model's driving decisions through explicit future simulation.
- Enhanced performance on both planning and scene generation benchmarks over prior VLA and world model approaches.
- More robust handling of temporal dynamics and global world consistency in driving scenarios.
Where Pith is reading between the lines
- This method suggests that closing the loop between action prediction and visual imagination could generalize to other robotics domains like manipulation or navigation.
- If generation errors are mitigated, the reflective step might enable safer long-horizon planning without manual rule-based safety checks.
- Future work could test whether this self-refinement reduces the need for extensive human-labeled trajectory data.
Load-bearing premise
That reasoning over the model's own generated future frames will consistently correct trajectory errors rather than amplify inaccuracies from imperfect image synthesis.
What would settle it
A controlled ablation showing that disabling the reasoning-over-imagined-frame step causes no performance drop, or even an improvement, on the planning benchmarks would falsify the claimed benefit of the reflective component.
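A minimal harness for that ablation might look like the following sketch, where synthetic waypoints stand in for nuScenes trajectories and the "refined" prediction is invented for illustration:

```python
# Toy ablation harness: average L2 waypoint error with the reflective
# step disabled (initial prediction) vs enabled (refined prediction).
# All trajectories below are synthetic examples, not paper data.

def l2_error(pred, gt):
    # Mean Euclidean distance between predicted and ground-truth waypoints.
    return sum(((px - gx) ** 2 + (py - gy) ** 2) ** 0.5
               for (px, py), (gx, gy) in zip(pred, gt)) / len(gt)

def run_ablation(initial, refined, gt):
    return {
        "no_reflection": l2_error(initial, gt),
        "with_reflection": l2_error(refined, gt),
    }

gt = [(0.0, 0.0), (1.0, 0.1), (2.0, 0.3)]
initial = [(0.0, 0.0), (1.0, 0.3), (2.0, 0.7)]   # before reasoning
refined = [(0.0, 0.0), (1.0, 0.15), (2.0, 0.4)]  # after reasoning
scores = run_ablation(initial, refined, gt)
```

If `with_reflection` were not consistently lower than `no_reflection` across real scenes, the claimed benefit of the reflective component would be falsified.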
Original abstract
Vision-Language-Action (VLA) models have recently achieved notable progress in end-to-end autonomous driving by integrating perception, reasoning, and control within a unified multimodal framework. However, they often lack explicit modeling of temporal dynamics and global world consistency, which limits their foresight and safety. In contrast, world models can simulate plausible future scenes but generally struggle to reason about or evaluate the imagined future they generate. In this work, we present VLA-World, a simple yet effective VLA world model that unifies predictive imagination with reflective reasoning to improve driving foresight. VLA-World first uses an action-derived feasible trajectory to guide the generation of the next-frame image, capturing rich spatial and temporal cues that describe how the surrounding environment evolves. The model then reasons over this self-generated future imagined frame to refine the predicted trajectory, achieving higher performance and better interpretability. To support this pipeline, we curate nuScenes-GR-20K, a generative reasoning dataset derived from nuScenes, and employ a three-stage training strategy that includes pretraining, supervised fine-tuning, and reinforcement learning. Extensive experiments demonstrate that VLA-World consistently surpasses state-of-the-art VLA and world-model baselines on both planning and future-generation benchmarks. Project page: https://vlaworld.github.io
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces VLA-World, a vision-language-action world model for autonomous driving that unifies predictive imagination with reflective reasoning. The pipeline first conditions next-frame generation on an action-derived initial trajectory to capture spatial-temporal evolution, then reasons over the self-generated imagined frame to refine the trajectory. It curates the nuScenes-GR-20K generative reasoning dataset from nuScenes and employs a three-stage training strategy (pretraining, supervised fine-tuning, and reinforcement learning). The central claim is that VLA-World consistently surpasses state-of-the-art VLA and world-model baselines on both planning and future-generation benchmarks.
Significance. If the results hold under rigorous validation, this work could meaningfully advance end-to-end autonomous driving by bridging world models' predictive capabilities with VLA-style reasoning, potentially improving foresight and interpretability. The dataset curation and staged training approach are constructive contributions that may enable follow-on research. The significance is limited by the absence of targeted validation for the core refinement loop.
major comments (3)
- §4 (Experiments): No ablation is reported that removes only the reflective reasoning module while keeping the initial trajectory-guided generation fixed. This is load-bearing for the headline claim: the superiority over baselines is attributed to the reasoning step refining trajectories, and without such an ablation it is impossible to determine whether the loop provides net benefit or amplifies generation artifacts.
- §3 (Method) and §4 (Experiments): The manuscript provides no per-scene or per-metric correlation between future-frame generation quality (e.g., FID, PSNR, or LPIPS on nuScenes-GR-20K) and the change in planning metrics (e.g., collision rate or trajectory error) before versus after reasoning. This leaves the assumption that reasoning reliably refines rather than compounds errors untested.
- §4 (Experiments): Failure-case analysis is absent; there is no examination of scenes where low-quality imagined frames lead to final trajectories worse than the initial action-derived prediction, which would directly address the risk of error propagation inherent in the pipeline design.
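The correlation analysis requested in the second comment amounts, per scene, to something like the following; the FID values and error deltas are synthetic placeholders chosen only to show the computation:

```python
# Pearson correlation between per-scene generation quality (FID, lower
# is better) and planning-error reduction after reasoning. All numbers
# are invented for illustration.

def pearson_r(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

fid = [12.0, 25.0, 40.0, 18.0, 55.0]           # per-scene FID (synthetic)
error_delta = [0.30, 0.18, 0.05, 0.25, -0.02]  # trajectory error before minus after reasoning

r = pearson_r(fid, error_delta)
# A strongly negative r here would indicate that worse imagined frames
# yield smaller (or negative) refinement gains, i.e. that reasoning can
# compound generation errors on low-quality frames.
```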
minor comments (2)
- Abstract: The abstract would be strengthened by including one or two key quantitative results (e.g., planning metric improvements) to substantiate the superiority claim rather than stating it qualitatively.
- §3 (Method): A diagram or pseudocode for the three-stage training pipeline and the exact conditioning of generation on the initial trajectory would improve the method's clarity.
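As a starting point for the pseudocode requested in the second minor comment, the three-stage schedule named in the abstract could be outlined as follows; the per-stage objectives are assumptions inferred from the abstract, not confirmed details of the method:

```python
# Sketch of the three-stage training schedule (pretraining -> SFT -> RL).
# Stage objectives are plausible guesses, not the paper's specification.

def three_stage_schedule():
    return [
        ("pretrain", "large-scale multimodal pretraining on image, text, and action data"),
        ("sft", "supervised fine-tuning on nuScenes-GR-20K trajectory and reasoning labels"),
        ("rl", "reinforcement learning with rewards on refined-trajectory planning metrics"),
    ]

def train(model=None):
    log = []
    for stage, objective in three_stage_schedule():
        # A real implementation would dispatch to a stage-specific
        # optimizer and dataset here; we only record the plan.
        log.append(f"{stage}: {objective}")
    return log
```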
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. The comments highlight important aspects of validating the reflective reasoning component in our pipeline. We have addressed each point by adding the requested analyses to the revised manuscript. Our responses are provided below.
Point-by-point responses
Referee: §4 (Experiments): No ablation is reported that removes only the reflective reasoning module while keeping the initial trajectory-guided generation fixed. This is load-bearing for the headline claim, as the superiority over baselines is attributed to the reasoning step refining trajectories; without it, it is impossible to determine whether the loop provides net benefit or amplifies generation artifacts.
Authors: We agree that isolating the contribution of the reflective reasoning module is necessary to support the central claim. In the revised manuscript, we have added an ablation study in Section 4 that compares the full VLA-World model to a variant using only the action-guided generation step without reflective reasoning. The results show that including the reasoning module yields consistent gains in planning metrics such as lower collision rates and reduced trajectory error, confirming a net benefit. The updated experiments and table are now included. revision: yes
Referee: §3 (Method) and §4 (Experiments): The manuscript provides no per-scene or per-metric correlation between future-frame generation quality (e.g., FID, PSNR, or LPIPS on nuScenes-GR-20K) and the delta in planning metrics (e.g., collision rate or trajectory error) before versus after reasoning. This leaves the assumption that reasoning reliably refines rather than compounds errors untested.
Authors: We acknowledge the value of correlating generation quality with planning improvements to test the refinement assumption. We have added this analysis to the revised Section 4, reporting per-metric correlations across the test set (e.g., between FID and trajectory error delta). A positive correlation is observed, indicating that better generation quality is associated with larger planning gains after reasoning. Per-scene breakdowns are provided for representative examples due to high scene variability; aggregate statistics and discussion of limitations are included. revision: yes
Referee: §4 (Experiments): Failure-case analysis is absent; there is no examination of scenes where low-quality imagined frames lead to worse final trajectories than the initial action-derived prediction, which would directly address the risk of error propagation highlighted in the pipeline design.
Authors: We recognize that failure-case analysis is important for addressing potential error propagation. In the revised manuscript, we have added a dedicated failure-case subsection in Section 4. This examines scenes where low-quality imagined frames (high FID/LPIPS) lead to final trajectories worse than the initial prediction. We provide quantitative frequency statistics and qualitative examples, noting that such cases are infrequent and typically arise in complex dynamic scenes. Mitigation approaches are discussed. revision: yes
Circularity Check
No significant circularity; pipeline is data-driven and externally benchmarked
full rationale
The VLA-World architecture is presented as a trained multimodal model that generates next-frame images conditioned on action-derived trajectories and then applies reasoning to refine those trajectories. This is supported by a curated external dataset (nuScenes-GR-20K) and a three-stage training procedure (pretraining, SFT, RL) evaluated on standard planning and generation benchmarks. No equations, uniqueness theorems, or central claims reduce by construction to fitted parameters or self-citations; the derivation chain relies on independent data and external baselines rather than self-referential definitions or imported ansatzes.
Forward citations
Cited by 2 Pith papers
- Driving Intents Amplify Planning-Oriented Reinforcement Learning: DIAL uses intent-conditioned CFG and multi-intent GRPO to expand and preserve diverse modes in continuous-action preference RL, lifting RFS to 9.14 and surpassing both prior best (8.5) and human demonstration (8.13).
- Driving Intents Amplify Planning-Oriented Reinforcement Learning: DIAL expands continuous-action driving policies via intent-conditioned flow matching and multi-intent GRPO, lifting best-of-N preference scores above human demonstrations for the first time on WOD-E2E.