
arxiv: 2504.18576 · v2 · submitted 2025-04-22 · 💻 cs.RO

DriVerse: Navigation World Model for Driving Simulation via Multimodal Trajectory Prompting and Motion Alignment

Pith reviewed 2026-05-22 17:47 UTC · model grok-4.3

classification 💻 cs.RO
keywords driving simulation · world models · video generation · trajectory prompting · motion alignment · autonomous driving · generative models · navigation

The pith

DriVerse generates higher-fidelity driving videos from one image and a future trajectory by converting the path into textual prompts and 2D motion priors, then applying a motion alignment step.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to fix the mismatch between trajectory controls and the internal features of 2D generative models that has produced low-quality driving scene videos in prior work. It does so by turning a planned trajectory into two forms of guidance at once: text tokens drawn from a fixed vocabulary of driving trends, and 2D spatial motion priors derived from the 3D path. A separate lightweight module then enforces frame-to-frame consistency only on the pixels that belong to moving objects. The resulting model produces trajectory-specific videos that match real driving data more closely than earlier specialized systems, all while using very little additional training and no new data. Accurate simulations of this kind would let researchers test autonomous driving planners against realistic future scenes instead of relying on coarse command inputs.

Core claim

DriVerse is a generative model that takes a single starting image and a future trajectory and produces a corresponding video of the driving scene. It achieves explicit control by first tokenizing the trajectory into language prompts drawn from a predefined trend vocabulary and second by projecting the 3D trajectory into 2D spatial motion priors that steer the static elements of the scene. A motion alignment module then improves temporal consistency specifically for dynamic pixels across frames. When trained with minimal updates and no extra data, the model yields higher-quality future video predictions than prior specialized approaches on both the nuScenes and Waymo datasets.
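The tokenization step can be illustrated with a minimal sketch. It assumes ego-frame waypoints (x forward, y left, in metres) and a four-word vocabulary; the vocabulary, the thresholds, and the function name `trajectory_to_prompt` are all hypothetical, since this summary does not give the paper's actual trend vocabulary or tokenization rules.

```python
import math

# Hypothetical trend vocabulary, chosen for illustration only.
STOP, STRAIGHT, LEFT, RIGHT = "stop", "go straight", "turn left", "turn right"


def trajectory_to_prompt(waypoints, turn_threshold=math.radians(5.0), stop_threshold=0.1):
    """Map ego-frame (x forward, y left) waypoints in metres to a trend prompt."""
    segments = list(zip(waypoints, waypoints[1:]))
    headings = [math.atan2(b[1] - a[1], b[0] - a[0]) for a, b in segments]
    lengths = [math.dist(a, b) for a, b in segments]

    tokens = []
    for i in range(1, len(segments)):
        # Heading change between consecutive segments, wrapped to [-pi, pi).
        delta = (headings[i] - headings[i - 1] + math.pi) % (2 * math.pi) - math.pi
        if lengths[i] < stop_threshold:
            token = STOP
        elif delta > turn_threshold:
            token = LEFT
        elif delta < -turn_threshold:
            token = RIGHT
        else:
            token = STRAIGHT
        # Collapse consecutive repeats so the prompt stays short.
        if not tokens or tokens[-1] != token:
            tokens.append(token)
    return ", then ".join(tokens)
```

A path that runs straight and then bends left, such as `[(0, 0), (1, 0), (2, 0), (2.9, 0.4), (3.5, 1.2)]`, yields the string `go straight, then turn left`, which could then be passed to a text-conditioned generator.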

What carries the argument

Multimodal trajectory prompting that supplies both tokenized textual trend prompts and 2D spatial motion priors extracted from 3D trajectories, together with a lightweight motion alignment module that targets inter-frame consistency of dynamic pixels.
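One plausible way to turn a 3D path into 2D spatial guidance is a pinhole projection of the waypoints onto the image plane. The sketch below is an assumption-laden illustration, not the paper's procedure: it assumes a flat road, a forward-facing undistorted camera aligned with the ego heading, and hypothetical parameter names (`fx`, `fy`, `cx`, `cy`, `camera_height`).

```python
def project_waypoints(waypoints, fx, fy, cx, cy, camera_height, width, height):
    """Project ground-plane waypoints (forward, left) in metres to pixel coordinates.

    Assumes a pinhole camera looking along the ego forward axis, mounted
    camera_height metres above a flat road (illustrative assumptions).
    """
    pixels = []
    for forward, left in waypoints:
        if forward <= 0.1:
            # Points at or behind the camera cannot be projected.
            continue
        # Camera frame: x right, y down, z forward.
        u = cx + fx * (-left) / forward
        v = cy + fy * camera_height / forward
        if 0 <= u < width and 0 <= v < height:
            pixels.append((u, v))
    return pixels
```

The resulting pixel track could be rasterized into a per-frame guidance map; how DriVerse actually encodes its motion priors is not described in this summary.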

If this is right

  • Trajectory-specific videos become usable for evaluating actual autonomous driving planners instead of relying on coarse text commands or discrete signals.
  • Static scene content and dynamic object motion are controlled more precisely because guidance is supplied in both language and spatial forms.
  • Temporal coherence of moving elements improves over long sequences once the alignment module enforces consistency on dynamic pixels.
  • The same performance gains appear on multiple real-world driving datasets while requiring only minimal training and no new data collection.
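The idea of enforcing consistency only on dynamic pixels can be sketched as a masked temporal loss. This is a hypothetical illustration, not the paper's module: it assumes per-pixel values that have already been put in correspondence across frames (for example by a point tracker) and a boolean mask marking which pixels belong to moving objects.

```python
def dynamic_consistency_loss(frames, dynamic_mask):
    """Mean squared change between consecutive frames, over dynamic pixels only.

    frames: list of T frames, each a flat list of N floats for N tracked pixels.
    dynamic_mask: list of N booleans, True where the pixel belongs to a moving object.
    Both inputs are illustrative assumptions about the data layout.
    """
    total, count = 0.0, 0
    for prev, curr in zip(frames, frames[1:]):
        for p, c, is_dynamic in zip(prev, curr, dynamic_mask):
            if is_dynamic:
                total += (c - p) ** 2
                count += 1
    # With no dynamic pixels the loss is defined as zero.
    return total / count if count else 0.0
```

Static pixels contribute nothing to this term, so large background changes caused by ego motion do not dominate the signal for moving objects.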

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same two-channel prompting pattern could be tested on video generation tasks outside driving, such as robot manipulation or animated character control.
  • Because the alignment module is lightweight, it could be added to many existing base generative models without retraining the entire system from scratch.
  • If the trend vocabulary proves sufficient, future work might explore whether learned vocabularies could further reduce the gap between text and precise spatial control.

Load-bearing premise

The combination of a fixed trend vocabulary for text prompts, 2D motion priors from 3D trajectories, and the motion alignment module is enough to correct the alignment problems that arise when trajectory signals are fed directly into a base 2D generative model.

What would settle it

Run the same future-video generation benchmarks on nuScenes and Waymo and check whether DriVerse videos show lower trajectory adherence or visual quality than baselines that use direct trajectory input or discrete controls. If they do, the central claim fails.
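Trajectory adherence could be scored, for instance, with an average displacement error between the commanded path and the path recovered from the generated video. The sketch below assumes both paths are already expressed as equal-length lists of 2D points in the same frame; how the recovered path is estimated, and whether the paper uses this exact metric, is not stated in this summary.

```python
import math


def average_displacement_error(planned, recovered):
    """Mean Euclidean distance between corresponding waypoints of two paths."""
    if not planned or len(planned) != len(recovered):
        raise ValueError("paths must be non-empty and of equal length")
    return sum(math.dist(p, r) for p, r in zip(planned, recovered)) / len(planned)
```

A lower value means the generated video follows the commanded path more closely, so a comparison across methods would report this number per model on the same set of trajectories.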

Figures

Figures reproduced from arXiv: 2504.18576 by Chenming Wu, Dingkang Liang, Ji Wan, Jun Wang, Xiaofan Li, Yumeng Zhang, Zhao Yang, Zhihao Xu.

Figure 1: Our proposed navigation world model for driving simulation, referred to as …
Figure 2: Overview of the DriVerse framework. Given a single scene image and a future trajectory, DriVerse decomposes the generation task …
Figure 3: The top and bottom parts of the figure visualize, through …
Figure 4: Qualitative comparison with existing methods.
Figure 5: Qualitative comparison of inference results on the Waymo Open Dataset. The top two rows show the diverse future predictions …
Original abstract

This paper presents DriVerse, a generative model for simulating navigation-driven driving scenes from a single image and a future trajectory. Previous autonomous driving world models either directly feed the trajectory or discrete control signals into the generation pipeline, leading to poor alignment between the control inputs and the implicit features of the 2D base generative model, which results in low-fidelity video outputs. Some methods use coarse textual commands or discrete vehicle control signals, which lack the precision to guide fine-grained, trajectory-specific video generation, making them unsuitable for evaluating actual autonomous driving algorithms. DriVerse introduces explicit trajectory guidance in two complementary forms: it tokenizes trajectories into textual prompts using a predefined trend vocabulary for seamless language integration, and converts 3D trajectories into 2D spatial motion priors to enhance control over static content within the driving scene. To better handle dynamic objects, we further introduce a lightweight motion alignment module, which focuses on the inter-frame consistency of dynamic pixels, significantly enhancing the temporal coherence of moving elements over long sequences. With minimal training and no need for additional data, DriVerse outperforms specialized models on future video generation tasks across both the nuScenes and Waymo datasets. The code and models will be released to the public.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper introduces DriVerse, a generative world model for driving simulation that produces future video from a single image and a future trajectory. It addresses misalignment in prior models by tokenizing trajectories into textual prompts via a predefined trend vocabulary, converting 3D trajectories into 2D spatial motion priors, and adding a lightweight motion alignment module to improve inter-frame consistency for dynamic objects. The central claim is that this multimodal approach yields higher-fidelity outputs than specialized models on nuScenes and Waymo with only minimal training and no extra data.

Significance. If the outperformance claims are substantiated, the work could improve controllability in driving world models, enabling more precise trajectory-guided video generation for autonomous driving evaluation. The explicit use of tokenized language prompts and 2D motion priors offers a lightweight alternative to direct control injection or coarse commands.

major comments (1)
  1. Abstract: the claim that DriVerse 'outperforms specialized models on future video generation tasks across both the nuScenes and Waymo datasets' is presented without any quantitative metrics (FVD, FID, trajectory adherence), baselines, ablation results, or experimental protocol details. This directly undermines assessment of whether the tokenized prompts, 2D priors, and motion alignment module produce the asserted gains rather than implementation artifacts.
minor comments (2)
  1. The size, construction, and coverage of the 'predefined trend vocabulary' are not specified, which affects reproducibility of the textual prompting component.
  2. The base generative model, exact training regime (e.g., number of epochs, learning rate schedule), and definition of 'minimal training' should be stated explicitly to clarify the data-efficiency claim.
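
For reference, FID and FVD, the two fidelity metrics named in the major comment, both rest on the Fréchet distance between Gaussians fitted to feature embeddings of real samples (mean \mu_r, covariance \Sigma_r) and generated samples (\mu_g, \Sigma_g). The standard definition is reproduced below; which feature extractors and sample counts the paper uses is not stated in this summary.

```latex
d^2\big((\mu_r, \Sigma_r), (\mu_g, \Sigma_g)\big)
  = \lVert \mu_r - \mu_g \rVert_2^2
  + \operatorname{Tr}\!\big(\Sigma_r + \Sigma_g - 2\,(\Sigma_r \Sigma_g)^{1/2}\big)
```

Lower values indicate that the generated distribution is closer to the real one in feature space.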

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their thoughtful review and for highlighting an important clarity issue in the abstract. We address the comment below and will incorporate revisions to strengthen the manuscript.

Point-by-point responses
  1. Referee: Abstract: the claim that DriVerse 'outperforms specialized models on future video generation tasks across both the nuScenes and Waymo datasets' is presented without any quantitative metrics (FVD, FID, trajectory adherence), baselines, ablation results, or experimental protocol details. This directly undermines assessment of whether the tokenized prompts, 2D priors, and motion alignment module produce the asserted gains rather than implementation artifacts.

    Authors: We agree that the abstract claim would be more informative with supporting quantitative evidence. The full manuscript (Section 4 and Tables 1-3) reports FVD, FID, and trajectory adherence metrics, along with comparisons to baselines such as DriveDreamer, Vista, and GAIA-1, plus ablations isolating the contributions of tokenized prompts, 2D motion priors, and the motion alignment module. Experimental protocols (training details, dataset splits, evaluation metrics) are described in Section 3. To address the referee's concern directly, we will revise the abstract to include the key quantitative results (e.g., relative FVD improvements on nuScenes and Waymo) and a brief reference to the evaluation protocol. This change will make the performance claims self-contained while preserving the abstract's brevity.

    revision: yes

Circularity Check

0 steps flagged

No circularity: novel modules and empirical claims are independent of inputs

Full rationale

The paper introduces new architectural elements—tokenized textual prompts via a predefined trend vocabulary, conversion of 3D trajectories to 2D spatial motion priors, and a lightweight motion alignment module for inter-frame consistency—explicitly to address alignment problems in existing base generative models. These components are described as additions rather than redefinitions or fits of prior outputs. The central claim of outperformance on nuScenes and Waymo with minimal training is presented as an empirical result to be validated externally, with no equations or derivations shown that reduce by construction to the input data or self-citations. No load-bearing self-citation chains, fitted parameters renamed as predictions, or ansatzes smuggled via prior work are evident in the provided text. The derivation chain remains self-contained as a proposed system whose validity rests on future experimental verification rather than tautological equivalence to its own inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The abstract provides no details on specific parameters, axioms, or new entities. It names a 'predefined trend vocabulary' and a 'lightweight motion alignment module' but specifies neither.




Reference graph

Works this paper leans on

76 extracted references · 76 canonical work pages · 15 internal anchors

  1. [1]

    Frozen in time: A joint video and image encoder for end-to-end retrieval

    Max Bain, Arsha Nagrani, G ¨ul Varol, and Andrew Zisser- man. Frozen in time: A joint video and image encoder for end-to-end retrieval. In Proceedings of the IEEE/CVF Inter- national Conference on Computer Vision, pages 1728–1738,

  2. [2]

    Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets

    Andreas Blattmann, Tim Dockhorn, Sumith Kulal, Daniel Mendelevitch, Maciej Kilian, Dominik Lorenz, Yam Levi, Zion English, Vikram V oleti, Adam Letts, et al. Stable video diffusion: Scaling latent video diffusion models to large datasets. arXiv preprint arXiv:2311.15127, 2023. 3

  3. [3]

    Muvo: A multimodal generative world model for autonomous driving with geometric representations

    Daniel Bogdoll, Yitian Yang, and J Marius Z ¨ollner. Muvo: A multimodal generative world model for autonomous driving with geometric representations. arXiv preprint arXiv:2311.11762, 2023. 2

  4. [4]

    Video generation models as world simula- tors

    Tim Brooks, Bill Peebles, Connor Holmes, Will DePue, Yufei Guo, Li Jing, David Schnurr, Joe Taylor, Troy Luhman, Eric Luhman, Clarence Ng, Ricky Wang, and Aditya Ramesh. Video generation models as world simula- tors. https://openai.com/index/video-generation-models-as- world-simulators/, 2024. 3

  5. [5]

    nuscenes: A mul- timodal dataset for autonomous driving

    Holger Caesar, Varun Bankiti, Alex H Lang, Sourabh V ora, Venice Erin Liong, Qiang Xu, Anush Krishnan, Yu Pan, Giancarlo Baldan, and Oscar Beijbom. nuscenes: A mul- timodal dataset for autonomous driving. In CVPR, pages 11621–11631, 2020. 6

  6. [6]

    Egocentric vehicle dense video captioning

    Feiyu Chen, Cong Xu, Qi Jia, Yihua Wang, Yuhan Liu, Hao- tian Zhang, and Endong Wang. Egocentric vehicle dense video captioning. In Proceedings of the 32nd ACM Inter- national Conference on Multimedia , pages 137–146, 2024. 3

  7. [7]

    Videocrafter1: Open diffusion models for high-quality video generation, 2023

    Haoxin Chen, Menghan Xia, Yingqing He, Yong Zhang, Xiaodong Cun, Shaoshu Yang, Jinbo Xing, Yaofang Liu, Qifeng Chen, Xintao Wang, Chao Weng, and Ying Shan. Videocrafter1: Open diffusion models for high-quality video generation, 2023. 3

  8. [8]

    Motion- Conditioned Diffusion Model for Controllable Video Synthesis,

    Tsai-Shien Chen, Chieh Hubert Lin, Hung-Yu Tseng, Tsung- Yi Lin, and Ming-Hsuan Yang. Motion-conditioned diffu- sion model for controllable video synthesis. arXiv preprint arXiv:2304.14404, 2023. 3

  9. [9]

    Control-a-video: Controllable text-to-video generation with diffusion models

    Weifeng Chen, Jie Wu, Pan Xie, Hefeng Wu, Jiashi Li, Xin Xia, Xuefeng Xiao, and Liang Lin. Control-a-video: Controllable text-to-video generation with diffusion models. arXiv preprint arXiv:2305.13840, 2023. 3

  10. [10]

    Seine: Short-to-long video diffu- sion model for generative transition and prediction

    Xinyuan Chen, Yaohui Wang, Lingjun Zhang, Shaobin Zhuang, Xin Ma, Jiashuo Yu, Yali Wang, Dahua Lin, Yu Qiao, and Ziwei Liu. Seine: Short-to-long video diffu- sion model for generative transition and prediction. In The Twelfth International Conference on Learning Representa- tions, 2023. 3

  11. [11]

    Driving- gpt: Unifying driving world modeling and planning with multi-modal autoregressive transformers

    Yuntao Chen, Yuqi Wang, and Zhaoxiang Zhang. Driving- gpt: Unifying driving world modeling and planning with multi-modal autoregressive transformers. arXiv preprint arXiv:2412.18607, 2024. 3, 4

  12. [12]

    Structure and content-guided video synthesis with diffusion models

    Patrick Esser, Johnathan Chiu, Parmida Atighehchian, Jonathan Granskog, and Anastasis Germanidis. Structure and content-guided video synthesis with diffusion models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 7346–7356, 2023. 3

  13. [13]

    Magicdrive: Street view generation with diverse 3d geometry control.arXiv preprint arXiv:2310.02601, 2023

    Ruiyuan Gao, Kai Chen, Enze Xie, Lanqing Hong, Zhenguo Li, Dit-Yan Yeung, and Qiang Xu. Magicdrive: Street view generation with diverse 3d geometry control. arXiv preprint arXiv:2310.02601, 2023. 2

  14. [14]

    Vista: A generalizable driving world model with high fidelity and versatile controllability

    Shenyuan Gao, Jiazhi Yang, Li Chen, Kashyap Chitta, Yi- hang Qiu, Andreas Geiger, Jun Zhang, and Hongyang Li. Vista: A generalizable driving world model with high fidelity and versatile controllability. In NeurIPS, 2024. 2, 4, 6, 7

  15. [15]

    Imagine-2-drive: High-fidelity world modeling in carla for autonomous vehi- cles

    Anant Garg and K Madhava Krishna. Imagine-2-drive: High-fidelity world modeling in carla for autonomous vehi- cles. arXiv preprint arXiv:2411.10171, 2024. 2

  16. [16]

    Worldgpt: Empowering llm as multimodal world model

    Zhiqi Ge, Hongzhe Huang, Mingze Zhou, Juncheng Li, Guoming Wang, Siliang Tang, and Yueting Zhuang. Worldgpt: Empowering llm as multimodal world model. In Proceedings of the 32nd ACM International Conference on Multimedia, pages 7346–7355, 2024. 2

  17. [17]

    Dome: Tam- ing diffusion model into high-fidelity controllable occupancy world model

    Songen Gu, Wei Yin, Bu Jin, Xiaoyang Guo, Junming Wang, Haodong Li, Qian Zhang, and Xiaoxiao Long. Dome: Tam- ing diffusion model into high-fidelity controllable occupancy world model. arXiv preprint arXiv:2410.10429, 2024. 2

  18. [18]

    World models for autonomous driving: An initial survey

    Yanchen Guan, Haicheng Liao, Zhenning Li, Jia Hu, Runze Yuan, Yunjian Li, Guohui Zhang, and Chengzhong Xu. World models for autonomous driving: An initial survey. IEEE Transactions on Intelligent Vehicles, 2024. 3

  19. [19]

    Infinitydrive: Breaking time limits in driving world models

    Xi Guo, Chenjing Ding, Haoxuan Dou, Xin Zhang, Weix- uan Tang, and Wei Wu. Infinitydrive: Breaking time limits in driving world models. arXiv preprint arXiv:2412.01522,

  20. [20]

    Sparsectrl: Adding sparse con- trols to text-to-video diffusion models

    Yuwei Guo, Ceyuan Yang, Anyi Rao, Maneesh Agrawala, Dahua Lin, and Bo Dai. Sparsectrl: Adding sparse con- trols to text-to-video diffusion models. arXiv preprint arXiv:2311.16933, 2023. 3

  21. [21]

    Photorealistic video generation with diffusion models

    Agrim Gupta, Lijun Yu, Kihyuk Sohn, Xiuye Gu, Meera Hahn, Li Fei-Fei, Irfan Essa, Lu Jiang, and Jos ´e Lezama. Photorealistic video generation with diffusion models. arXiv preprint arXiv:2312.06662, 2023. 3

  22. [22]

    World models

    David Ha and J ¨urgen Schmidhuber. World models. In NeurIPS, 2018. 3

  23. [23]

    CameraCtrl: Enabling Camera Control for Text-to-Video Generation

    Hao He, Yinghao Xu, Yuwei Guo, Gordon Wetzstein, Bo Dai, Hongsheng Li, and Ceyuan Yang. Cameractrl: Enabling camera control for text-to-video generation. arXiv preprint arXiv:2404.02101, 2024. 3

  24. [24]

    Latent Video Diffusion Models for High-Fidelity Long Video Generation

    Yingqing He, Tianyu Yang, Yong Zhang, Ying Shan, and Qifeng Chen. Latent video diffusion models for high-fidelity video generation with arbitrary lengths. arXiv preprint arXiv:2211.13221, 2022. 3

  25. [25]

    Gans trained by a two time-scale update rule converge to a local nash equilib- rium

    Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilib- rium. In NeurIPS, volume 30, 2017. 6

  26. [26]

    Denoising diffu- sion probabilistic models

    Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffu- sion probabilistic models. Advances in Neural Information Processing Systems, 33:6840–6851, 2020. 3 10

  27. [27]

    Video Diffusion Models

    Jonathan Ho, Tim Salimans, Alexey Gritsenko, William Chan, Mohammad Norouzi, and David J Fleet. Video dif- fusion models. arXiv:2204.03458, 2022. 3

  28. [28]

    CogVideo: Large-scale Pretraining for Text-to-Video Generation via Transformers

    Wenyi Hong, Ming Ding, Wendi Zheng, Xinghan Liu, and Jie Tang. Cogvideo: Large-scale pretraining for text-to-video generation via transformers. arXiv preprint arXiv:2205.15868, 2022. 3

  29. [29]

    GAIA-1: A Generative World Model for Autonomous Driving

    Anthony Hu, Lloyd Russell, Hudson Yeo, Zak Murez, George Fedoseev, Alex Kendall, Jamie Shotton, and Gian- luca Corrado. Gaia-1: A generative world model for au- tonomous driving. arXiv preprint arXiv:2309.17080, 2023. 2, 3

  30. [30]

    LoRA: Low-Rank Adaptation of Large Language Models

    Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen- Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. Lora: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685, 2021. 3

  31. [31]

    Animate anyone: Consistent and controllable image-to-video synthesis for character animation

    Li Hu, Xin Gao, Peng Zhang, Ke Sun, Bang Zhang, and Liefeng Bo. Animate anyone: Consistent and controllable image-to-video synthesis for character animation. arXiv preprint arXiv:2311.17117, 2023. 3

  32. [32]

    Driving- world: Constructingworld model for autonomous driving via video gpt

    Xiaotao Hu, Wei Yin, Mingkai Jia, Junyuan Deng, Xiaoyang Guo, Qian Zhang, Xiaoxiao Long, and Ping Tan. Driving- world: Constructingworld model for autonomous driving via video gpt. arXiv preprint arXiv:2412.19505, 2024. 4

  33. [34]

    Adriver-i: A general world model for autonomous driving

    Fan Jia, Weixin Mao, Yingfei Liu, Yucheng Zhao, Yuqing Wen, Chi Zhang, Xiangyu Zhang, and Tiancai Wang. Adriver-i: A general world model for autonomous driving. arXiv preprint arXiv:2311.13549, 2023. 3

  34. [35]

    Dive: Dit-based video generation with enhanced control

    Junpeng Jiang, Gangyi Hong, Lijun Zhou, Enhui Ma, Heng- tong Hu, Xia Zhou, Jie Xiang, Fan Liu, Kaicheng Yu, Haiyang Sun, et al. Dive: Dit-based video generation with enhanced control. arXiv preprint arXiv:2409.01595, 2024. 3

  35. [36]

    Cotracker3: Simpler and better point tracking by pseudo-labelling real videos

    Nikita Karaev, Iurii Makarov, Jianyuan Wang, Natalia Neverova, Andrea Vedaldi, and Christian Rupprecht. Co- tracker3: Simpler and better point tracking by pseudo- labelling real videos. In Proc. arXiv:2410.11831, 2024. 5

  36. [37]

    Dreampose: Fashion video synthesis with stable diffusion

    Johanna Karras, Aleksander Holynski, Ting-Chun Wang, and Ira Kemelmacher-Shlizerman. Dreampose: Fashion video synthesis with stable diffusion. In Proceedings of the IEEE/CVF International Conference on Computer Vision , pages 22680–22690, 2023. 3

  37. [38]

    Text2video-zero: Text-to- image diffusion models are zero-shot video generators.IEEE International Conference on Computer Vision (ICCV), 2023

    Levon Khachatryan, Andranik Movsisyan, Vahram Tade- vosyan, Roberto Henschel, Zhangyang Wang, Shant Navasardyan, and Humphrey Shi. Text2video-zero: Text-to- image diffusion models are zero-shot video generators.IEEE International Conference on Computer Vision (ICCV), 2023. 3

  38. [39]

    Drivegan: Towards a controllable high-quality neural simulation

    Seung Wook Kim, Jonah Philion, Antonio Torralba, and Sanja Fidler. Drivegan: Towards a controllable high-quality neural simulation. In CVPR, pages 5820–5829, 2021. 3, 7

  39. [40]

    VideoPoet: A Large Language Model for Zero-Shot Video Generation

    Dan Kondratyuk, Lijun Yu, Xiuye Gu, Jos ´e Lezama, Jonathan Huang, Rachel Hornung, Hartwig Adam, Hassan Akbari, Yair Alon, Vighnesh Birodkar, et al. Videopoet: A large language model for zero-shot video generation. arXiv preprint arXiv:2312.14125, 2023. 3

  40. [41]

    Drivingdiffusion: Layout-guided multi-view driving scenarios video genera- tion with latent diffusion model

    Xiaofan Li, Yifu Zhang, and Xiaoqing Ye. Drivingdiffusion: Layout-guided multi-view driving scenarios video genera- tion with latent diffusion model. In European Conference on Computer Vision, pages 469–485. Springer, 2024. 2, 3, 6

  41. [42]

    Seeing the future, perceiving the future: A unified driving world model for future generation and perception

    Dingkang Liang, Dingyuan Zhang, Xin Zhou, Sifan Tu, Tianrui Feng, Xiaofan Li, Yumeng Zhang, Mingyang Du, Xiao Tan, and Xiang Bai. Seeing the future, perceiving the future: A unified driving world model for future generation and perception. arXiv preprint arXiv:2503.13587, 2025. 4

  42. [43]

    Wovogen: World volume-aware diffusion for con- trollable multi-camera driving scene generation

    Jiachen Lu, Ze Huang, Zeyu Yang, Jiahui Zhang, and Li Zhang. Wovogen: World volume-aware diffusion for con- trollable multi-camera driving scene generation. In ECCV, pages 329–345. Springer, 2024. 2, 3, 7

  43. [44]

    Latte: Latent Diffusion Transformer for Video Generation

    Xin Ma, Yaohui Wang, Gengyun Jia, Xinyuan Chen, Zi- wei Liu, Yuan-Fang Li, Cunjian Chen, and Yu Qiao. Latte: Latent diffusion transformer for video generation. arXiv preprint arXiv:2401.03048, 2024. 3

  44. [45]

    Driveworld: 4d pre-trained scene understanding via world models for autonomous driving

    Chen Min, Dawei Zhao, Liang Xiao, Jian Zhao, Xinli Xu, Zheng Zhu, Lei Jin, Jianshu Li, Yulan Guo, Junliang Xing, et al. Driveworld: 4d pre-trained scene understanding via world models for autonomous driving. In CVPR, pages 15522–15533, 2024. 2

  45. [46]

    Orb-slam: a versatile and accurate monocular slam system

    Raul Mur-Artal, JMM Montiel, and JD Tardos. Orb-slam: a versatile and accurate monocular slam system. IEEE Trans- actions on Robotics, 31(5):1147–1163, 2015. 8

  46. [47]

    Scalable Diffusion Models with Transformers

    Geoffrey Peebles and Saining Xie. Scalable diffusion models with transformers. arXiv preprint arXiv:2212.09748, 2022. 4

  47. [48]

    Scalable diffusion models with transformers

    William Peebles and Saining Xie. Scalable diffusion models with transformers. In Proceedings of the IEEE/CVF Inter- national Conference on Computer Vision, pages 4195–4205,

  48. [49]

    Com- positional 3d scene generation using locally conditioned dif- fusion

    Ryan Po, Wang Yifan, and Vladislav Golyanik et al. Com- positional 3d scene generation using locally conditioned dif- fusion. In ArXiv, 2023. 3

  49. [50]

    High-resolution image synthesis with latent diffusion models

    Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Bj ¨orn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10684–10695, 2022. 3

  50. [51]

    Mm-diffusion: Learning multi-modal diffusion mod- els for joint audio and video generation

    Ludan Ruan, Yiyang Ma, Huan Yang, Huiguo He, Bei Liu, Jianlong Fu, Nicholas Jing Yuan, Qin Jin, and Baining Guo. Mm-diffusion: Learning multi-modal diffusion mod- els for joint audio and video generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10219–10228, 2023. 3

  51. [52]

    Denoising Diffusion Implicit Models

    Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. arXiv preprint arXiv:2010.02502, 2020. 3

  52. [53]

    Scalability in perception for autonomous driving: Waymo open dataset

    Pei Sun, Henrik Kretzschmar, Xerxes Dotiwalla, Aurelien Chouard, Vijaysai Patnaik, Paul Tsui, James Guo, Yin Zhou, Yuning Chai, Benjamin Caine, et al. Scalability in perception for autonomous driving: Waymo open dataset. In CVPR, pages 2446–2454, 2020. 6 11

  53. [54]

    The role of world mod- els in shaping autonomous driving: A comprehensive survey

    Sifan Tu, Xin Zhou, Dingkang Liang, Xingyu Jiang, Yumeng Zhang, Xiaofan Li, and Xiang Bai. The role of world mod- els in shaping autonomous driving: A comprehensive survey. arXiv preprint arXiv:2502.10498, 2025. 3

  54. [55]

    Fvd: A new metric for video generation

    Thomas Unterthiner, Sjoerd van Steenkiste, Karol Ku- rach, Rapha ¨el Marinier, Marcin Michalski, and Syl- vain Gelly. Fvd: A new metric for video generation. https://openreview.net/forum?id=rylgEULtdN, 2019. 6

  55. [56]

    Occsora: 4d occupancy generation models as world simulators for au- tonomous driving

    Lening Wang, Wenzhao Zheng, Yilong Ren, Han Jiang, Zhiyong Cui, Haiyang Yu, and Jiwen Lu. Occsora: 4d occupancy generation models as world simulators for au- tonomous driving. arXiv preprint arXiv:2405.20337, 2024. 2

  56. [57]

    Drivedreamer: Towards real-world- drive world models for autonomous driving

    Xiaofeng Wang, Zheng Zhu, Guan Huang, Xinze Chen, Jia- gang Zhu, and Jiwen Lu. Drivedreamer: Towards real-world- drive world models for autonomous driving. InECCV, pages 55–72. Springer, 2024. 4, 7

  57. [58]

    Drivedreamer: Towards real-world- driven world models for autonomous driving

    Xiaofeng Wang, Zheng Zhu, Guan Huang, Xinze Chen, Jia- gang Zhu, and Jiwen Lu. Drivedreamer: Towards real-world- driven world models for autonomous driving. In ECCV,

  58. [59]

    Driving into the future: Multiview visual forecasting and planning with world model for au- tonomous driving

    Yuqi Wang, Jiawei He, Lue Fan, Hongxin Li, Yuntao Chen, and Zhaoxiang Zhang. Driving into the future: Multiview visual forecasting and planning with world model for au- tonomous driving. In CVPR, pages 14749–14759, 2024. 2, 4, 7

  59. [60]

    Driving into the future: Multiview visual forecasting and planning with world model for au- tonomous driving

    Yuqi Wang, Jiawei He, Lue Fan, Hongxin Li, Yuntao Chen, and Zhaoxiang Zhang. Driving into the future: Multiview visual forecasting and planning with world model for au- tonomous driving. In CVPR, pages 14749–14759, 2024. 4

  60. [61]

    Motionctrl: A unified and flexible motion controller for video generation

    Zhouxia Wang, Ziyang Yuan, Xintao Wang, Tianshui Chen, Menghan Xia, Ping Luo, and Ying Shan. Motionctrl: A unified and flexible motion controller for video generation. arXiv preprint arXiv:2312.03641, 2023. 3

  61. [62]

    Wan: Open and Advanced Large-Scale Video Generative Models

    WanTeam, Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianx- iao Yang, Jianyuan Zeng, Jiayu Wang, Jingfeng Zhang, Jin- gren Zhou, Jinkai Wang, Jixuan Chen, Kai Zhu, Kang Zhao, Keyu Yan, Lianghua Huang, Mengyang Feng, Ningyi Zhang, Pandeng Li, Pingyu Wu, Ruihang Chu, Ruili Feng, Shiwei Zhang, Siyang Sun, Tao Fang...

  62. [64]

    Panacea: Panoramic and controllable video generation for autonomous driving

    Yuqing Wen, Yucheng Zhao, Yingfei Liu, Fan Jia, Yanhui Wang, Chong Luo, Chi Zhang, Tiancai Wang, Xiaoyan Sun, and Xiangyu Zhang. Panacea: Panoramic and controllable video generation for autonomous driving. In CVPR, pages 6902–6912, 2024. 3

  63. [65]

    Holodrive: Holistic 2d-3d multi-modal street scene generation for autonomous driving

    Zehuan Wu, Jingcheng Ni, Xiaodong Wang, Yuxin Guo, Rui Chen, Lewei Lu, Jifeng Dai, and Yuwen Xiong. Holodrive: Holistic 2d-3d multi-modal street scene generation for autonomous driving. arXiv preprint arXiv:2412.01407, 2024. 2

  64. [66]

    Magicanimate: Temporally consistent human image animation using diffusion model

    Zhongcong Xu, Jianfeng Zhang, Jun Hao Liew, Hanshu Yan, Jia-Wei Liu, Chenxu Zhang, Jiashi Feng, and Mike Zheng Shou. Magicanimate: Temporally consistent human image animation using diffusion model. arXiv preprint arXiv:2311.16498, 2023. 3

  65. [67]

    Generalized predictive model for autonomous driving

    Jiazhi Yang, Shenyuan Gao, Yihang Qiu, Li Chen, Tianyu Li, Bo Dai, Kashyap Chitta, Penghao Wu, Jia Zeng, Ping Luo, Jun Zhang, Andreas Geiger, Yu Qiao, and Hongyang Li. Generalized predictive model for autonomous driving. In CVPR, 2024. 4, 7

  66. [68]

    Direct-a-video: Customized video generation with user-directed camera movement and object motion

    Shiyuan Yang, Liang Hou, Haibin Huang, Chongyang Ma, Pengfei Wan, Di Zhang, Xiaodong Chen, and Jing Liao. Direct-a-video: Customized video generation with user-directed camera movement and object motion. arXiv preprint arXiv:2402.03162, 2024. 3

  67. [69]

    Physical informed driving world model

    Zhuoran Yang, Xi Guo, Chenjing Ding, Chiyu Wang, and Wei Wu. Physical informed driving world model. arXiv preprint arXiv:2412.08410, 2024. 2

  68. [70]

    Dualdiff+: Dual-branch diffusion for high-fidelity video generation with reward guidance

    Zhao Yang, Zezhong Qian, Xiaofan Li, Weixiang Xu, Gongpeng Zhao, Ruohong Yu, Lingsi Zhu, and Longjun Liu. Dualdiff+: Dual-branch diffusion for high-fidelity video generation with reward guidance. arXiv preprint arXiv:2503.03689, 2025. 3

  69. [71]

    CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer

    Zhuoyi Yang, Jiayan Teng, Wendi Zheng, Ming Ding, Shiyu Huang, Jiazheng Xu, Yuanming Yang, Wenyi Hong, Xiaohan Zhang, Guanyu Feng, et al. Cogvideox: Text-to-video diffusion models with an expert transformer. arXiv preprint arXiv:2408.06072, 2024. 3

  70. [72]

    PixArt-$\alpha$: Fast Training of Diffusion Transformer for Photorealistic Text-to-Image Synthesis

    Jiahui Zhang, Kangle Han, Zhen Li, Di He, Hao Fan, Yinpeng Wu, Lei Zhou, Ping Liu, Jiaying Dong, Dongdong Chen, et al. Pixart-α: A powerful text-to-image generation foundation model. arXiv preprint arXiv:2310.00426, 2023. 4

  71. [73]

    I2VGen-XL: High-Quality Image-to-Video Synthesis via Cascaded Diffusion Models

    Shiwei Zhang, Jiayu Wang, Yingya Zhang, Kang Zhao, Hangjie Yuan, Zhiwu Qin, Xiang Wang, Deli Zhao, and Jingren Zhou. I2vgen-xl: High-quality image-to-video synthesis via cascaded diffusion models. arXiv preprint arXiv:2311.04145, 2023. 3

  72. [74]

    Bevworld: A multimodal world model for autonomous driving via unified bev latent space

    Yumeng Zhang, Shi Gong, Kaixin Xiong, Xiaoqing Ye, Xiao Tan, Fan Wang, Jizhou Huang, Hua Wu, and Haifeng Wang. Bevworld: A multimodal world model for autonomous driving via unified bev latent space. arXiv preprint arXiv:2407.05679, 2024. 2

  73. [75]

    Controlvideo: Training-free controllable text-to-video generation

    Yabo Zhang, Yuxiang Wei, Dongsheng Jiang, Xiaopeng Zhang, Wangmeng Zuo, and Qi Tian. Controlvideo: Training-free controllable text-to-video generation. arXiv preprint arXiv:2305.13077, 2023. 3

  74. [76]

    Drivedreamer4d: World models are effective data machines for 4d driving scene representation

    Guosheng Zhao, Chaojun Ni, Xiaofeng Wang, Zheng Zhu, Xueyang Zhang, Yida Wang, Guan Huang, Xinze Chen, Boyuan Wang, Youyi Zhang, et al. Drivedreamer4d: World models are effective data machines for 4d driving scene representation. arXiv preprint arXiv:2410.13571, 2024. 2, 4

  75. [77]

    Occworld: Learning a 3d occupancy world model for autonomous driving

    Wenzhao Zheng, Weiliang Chen, Yuanhui Huang, Borui Zhang, Yueqi Duan, and Jiwen Lu. Occworld: Learning a 3d occupancy world model for autonomous driving. In ECCV, pages 55–72. Springer, 2024. 2

  76. [78]

    Hermes: A unified self-driving world model for simultaneous 3d scene understanding and generation

    Xin Zhou, Dingkang Liang, Sifan Tu, Xiwu Chen, Yikang Ding, Dingyuan Zhang, Feiyang Tan, Hengshuang Zhao, and Xiang Bai. Hermes: A unified self-driving world model for simultaneous 3d scene understanding and generation. arXiv preprint arXiv:2501.14729, 2025. 2