DriVerse: Navigation World Model for Driving Simulation via Multimodal Trajectory Prompting and Motion Alignment

Chenming Wu; Dingkang Liang; Ji Wan; Jun Wang; Xiaofan Li; Yumeng Zhang; Zhao Yang; Zhihao Xu

arXiv: 2504.18576 · v2 · submitted 2025-04-22 · 💻 cs.RO

Xiaofan Li, Chenming Wu, Zhao Yang, Zhihao Xu, Dingkang Liang, Yumeng Zhang, Ji Wan, Jun Wang

Pith reviewed 2026-05-22 17:47 UTC · model grok-4.3

classification 💻 cs.RO

keywords driving simulation · world models · video generation · trajectory prompting · motion alignment · autonomous driving · generative models · navigation

The pith

DriVerse generates higher-fidelity driving videos from one image and a future trajectory by converting paths into textual prompts and 2D motion priors plus a motion alignment step.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to fix the mismatch between trajectory controls and the internal features of 2D generative models that has produced low-quality driving scene videos in prior work. It does so by turning a planned trajectory into two forms of guidance at once: text tokens drawn from a fixed vocabulary of driving trends, and 2D spatial motion priors derived from the 3D path. A separate lightweight module then enforces frame-to-frame consistency only on the pixels that belong to moving objects. The resulting model produces trajectory-specific videos that match real driving data more closely than earlier specialized systems, all while using very little additional training and no new data. Accurate simulations of this kind would let researchers test autonomous driving planners against realistic future scenes instead of relying on coarse command inputs.

Core claim

DriVerse is a generative model that takes a single starting image and a future trajectory and produces a corresponding video of the driving scene. It achieves explicit control by first tokenizing the trajectory into language prompts drawn from a predefined trend vocabulary and second by projecting the 3D trajectory into 2D spatial motion priors that steer the static elements of the scene. A motion alignment module then improves temporal consistency specifically for dynamic pixels across frames. When trained with minimal updates and no extra data, the model yields higher-quality future video predictions than prior specialized approaches on both the nuScenes and Waymo datasets.
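
The tokenization step can be made concrete with a minimal sketch. It assumes that the trend vocabulary bins the heading of each trajectory step into 12 angular sectors; the sector count comes from the paper passage quoted further down this page, while the token names, the sector layout, the ego-frame convention, and the function `tokenize_trajectory` are hypothetical and are not the authors' implementation.

```python
import math

NUM_SECTORS = 12  # sector count quoted from the paper; the layout below is assumed


def tokenize_trajectory(waypoints):
    """Map consecutive (x, y) waypoints to hypothetical trend tokens.

    Assumes an ego frame with x pointing forward and y pointing left.
    Each step's heading is binned into one of NUM_SECTORS equal sectors,
    with sector 0 centred on "straight ahead".
    """
    width = 2 * math.pi / NUM_SECTORS
    tokens = []
    for (x0, y0), (x1, y1) in zip(waypoints, waypoints[1:]):
        dx, dy = x1 - x0, y1 - y0
        if dx == 0 and dy == 0:
            tokens.append("<stay>")  # no displacement, so no heading to bin
            continue
        heading = math.atan2(dy, dx)  # radians in (-pi, pi]
        shifted = (heading + width / 2) % (2 * math.pi)
        sector = int(shifted // width) % NUM_SECTORS  # guard the wrap-around edge
        tokens.append(f"<dir_{sector}>")
    return tokens


# Two steps straight ahead, then one step to the left.
path = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (2.0, 1.0)]
print(" ".join(tokenize_trajectory(path)))
```

A fixed vocabulary like this keeps the prompt within tokens a language-conditioned generator can embed, at the cost of quantizing the heading; how the paper handles speed or step length is not stated on this page.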

What carries the argument

Multimodal trajectory prompting that supplies both tokenized textual trend prompts and 2D spatial motion priors extracted from 3D trajectories, together with a lightweight motion alignment module that targets inter-frame consistency of dynamic pixels.
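
One plausible way to turn a 3D trajectory into 2D spatial guidance is a pinhole-camera projection. The sketch below assumes that camera model; the intrinsics (`fx`, `fy`, `cx`, `cy`), the camera-frame convention, the sample path, and the function name `project_trajectory` are illustrative choices, not values or APIs taken from the paper.

```python
def project_trajectory(points_cam, fx, fy, cx, cy):
    """Project 3D points in the camera frame onto the image plane.

    points_cam: iterable of (X, Y, Z) with X right, Y down, Z forward
    (a common camera convention, assumed here). Points at or behind the
    camera (Z <= 0) map to None because they cannot appear in the image.
    """
    pixels = []
    for X, Y, Z in points_cam:
        if Z <= 0:
            pixels.append(None)
            continue
        u = fx * X / Z + cx  # horizontal pixel coordinate
        v = fy * Y / Z + cy  # vertical pixel coordinate
        pixels.append((u, v))
    return pixels


# Hypothetical intrinsics and a ground-level path 1.5 m below the camera.
future_path = [(0.0, 1.5, 5.0), (0.5, 1.5, 10.0), (1.5, 1.5, 20.0)]
for uv in project_trajectory(future_path, fx=1000.0, fy=1000.0, cx=800.0, cy=450.0):
    print(uv)
```

Waypoints farther along the path land closer to the horizon line of the image, which is what lets a projected path act as a spatial cue for how static content should move between frames.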

If this is right

Trajectory-specific videos become usable for evaluating actual autonomous driving planners instead of relying on coarse text commands or discrete signals.
Static scene content and dynamic object motion are controlled more precisely because guidance is supplied in both language and spatial forms.
Temporal coherence of moving elements improves over long sequences once the alignment module enforces consistency on dynamic pixels.
The same performance gains appear on multiple real-world driving datasets while requiring only minimal training and no new data collection.
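
The idea of enforcing consistency only on dynamic pixels can be sketched as a motion-weighted loss. This is an illustration under stated assumptions, not the paper's loss: the weighting scheme (absolute frame-to-frame change in a reference clip), the squared-error form, and the name `motion_weighted_consistency` are all hypothetical.

```python
def motion_weighted_consistency(pred_frames, ref_frames, eps=1e-8):
    """Weighted mean squared error between frame-to-frame changes.

    pred_frames, ref_frames: lists of frames, each a flat list of floats
    with the same length. A pixel's weight is the absolute change of the
    reference between consecutive frames, so pixels that are static in
    the reference contribute nothing (an assumed weighting scheme).
    """
    num, den = 0.0, 0.0
    for t in range(1, len(ref_frames)):
        for p_prev, p_cur, r_prev, r_cur in zip(
            pred_frames[t - 1], pred_frames[t], ref_frames[t - 1], ref_frames[t]
        ):
            ref_delta = r_cur - r_prev    # motion in the reference clip
            pred_delta = p_cur - p_prev   # motion in the generated clip
            weight = abs(ref_delta)
            num += weight * (pred_delta - ref_delta) ** 2
            den += weight
    return num / (den + eps)  # eps avoids division by zero for static clips


reference = [[0.0, 0.0], [0.0, 1.0]]   # only the second pixel moves
generated = [[0.0, 0.0], [0.5, 0.0]]   # the wrong pixel moves
print(motion_weighted_consistency(generated, reference))
```

In this toy case the spurious change at the first pixel is ignored because that pixel is static in the reference; only the missed motion at the second pixel is penalized. A practical variant would presumably operate on latents or tracked points rather than raw pixel lists.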

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same two-channel prompting pattern could be tested on video generation tasks outside driving, such as robot manipulation or animated character control.
Because the alignment module is lightweight, it could be added to many existing base generative models without retraining the entire system from scratch.
If the trend vocabulary proves sufficient, future work might explore whether learned vocabularies could further reduce the gap between text and precise spatial control.

Load-bearing premise

The combination of a fixed trend vocabulary for text prompts, 2D motion priors from 3D trajectories, and the motion alignment module is enough to correct the alignment problems that arise when trajectory signals are fed directly into a base 2D generative model.

What would settle it

Run the same future-video generation benchmarks on nuScenes and Waymo and check whether DriVerse videos show lower trajectory adherence or lower visual quality than baselines that use direct trajectory input or discrete controls. If they do, the central claim fails.
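
"Trajectory adherence" is not defined on this page. One common proxy is the average displacement error between the commanded trajectory and a trajectory recovered from the generated video; the sketch below assumes that metric and assumes the recovered waypoints are already available (how to estimate them is out of scope here). The function name is hypothetical.

```python
import math


def average_displacement_error(commanded, recovered):
    """Mean Euclidean distance between matching 2D waypoints.

    commanded: (x, y) waypoints given to the generator.
    recovered: (x, y) waypoints estimated from the generated video,
    sampled at the same time steps (assumed to be precomputed).
    """
    if len(commanded) != len(recovered):
        raise ValueError("trajectories must have the same number of waypoints")
    if not commanded:
        raise ValueError("trajectories must be non-empty")
    total = sum(
        math.hypot(cx - rx, cy - ry)
        for (cx, cy), (rx, ry) in zip(commanded, recovered)
    )
    return total / len(commanded)


commanded = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
recovered = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0)]
print(average_displacement_error(commanded, recovered))
```

Lower is better, and the comparison is only meaningful if every method is scored with the same trajectory-recovery procedure.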

Figures

Figures reproduced from arXiv: 2504.18576 by Chenming Wu, Dingkang Liang, Ji Wan, Jun Wang, Xiaofan Li, Yumeng Zhang, Zhao Yang, Zhihao Xu.

**Figure 2.** Overview of the DriVerse framework. Given a single scene image and a future trajectory, DriVerse decomposes the generation task ...

**Figure 3.** The top and bottom parts of the figure visualize, through ...

**Figure 4.** Qualitative comparison with existing methods.

**Figure 5.** Qualitative comparison of inference results on the Waymo Open Dataset. The top two rows show the diverse future predictions ...

Original abstract

This paper presents DriVerse, a generative model for simulating navigation-driven driving scenes from a single image and a future trajectory. Previous autonomous driving world models either directly feed the trajectory or discrete control signals into the generation pipeline, leading to poor alignment between the control inputs and the implicit features of the 2D base generative model, which results in low-fidelity video outputs. Some methods use coarse textual commands or discrete vehicle control signals, which lack the precision to guide fine-grained, trajectory-specific video generation, making them unsuitable for evaluating actual autonomous driving algorithms. DriVerse introduces explicit trajectory guidance in two complementary forms: it tokenizes trajectories into textual prompts using a predefined trend vocabulary for seamless language integration, and converts 3D trajectories into 2D spatial motion priors to enhance control over static content within the driving scene. To better handle dynamic objects, we further introduce a lightweight motion alignment module, which focuses on the inter-frame consistency of dynamic pixels, significantly enhancing the temporal coherence of moving elements over long sequences. With minimal training and no need for additional data, DriVerse outperforms specialized models on future video generation tasks across both the nuScenes and Waymo datasets. The code and models will be released to the public.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

DriVerse turns trajectories into text tokens plus 2D motion priors and adds a small alignment module to tighten control in driving video generators, but the outperformance claim on nuScenes and Waymo still needs the actual numbers to be convincing.

DriVerse's main contribution is a multimodal way to prompt a driving video generator using tokenized trajectories and 2D motion priors, backed by a motion alignment module for better dynamic object handling. The authors tokenize future paths with a trend vocabulary so the signal fits naturally into a language-conditioned model, then map the 3D trajectory to 2D spatial priors that guide static scene elements. They add a lightweight module that enforces inter-frame consistency specifically on moving pixels to reduce drift over longer clips.

This directly targets the alignment gap they describe in prior work, which feeds either raw controls or coarse text commands into base generative models. The approach is practical because it claims to work with minimal training and no extra data, which matters for simulation pipelines that need to stay lightweight. The ideas are clearly motivated by real problems in autonomous driving evaluation, where trajectory-specific video is needed to test planners.

The soft spot is exactly what the stress-test flags: the abstract asserts outperformance on nuScenes and Waymo yet supplies no metrics, no baseline descriptions, no ablation results, and no training details. Without those, it is impossible to tell whether the new modules close the gap or whether gains come from implementation choices or dataset quirks. If the full paper includes quantitative scores on FVD, FID, or trajectory adherence plus clear comparisons, that would strengthen the case considerably.

This paper is aimed at researchers building world models or simulators for self-driving systems. Readers working on generative video for robotics or control would find the prompting and alignment tricks worth examining even if they end up adapting only parts of it. It deserves a serious referee because the problem is well-posed and the proposed components are described at a level that can be evaluated and reproduced. I would send it out for review to get the experimental evidence checked rather than desk-rejecting it on the abstract alone.

Referee Report

1 major / 2 minor

Summary. The paper introduces DriVerse, a generative world model for driving simulation that produces future video from a single image and a future trajectory. It addresses misalignment in prior models by tokenizing trajectories into textual prompts via a predefined trend vocabulary, converting 3D trajectories into 2D spatial motion priors, and adding a lightweight motion alignment module to improve inter-frame consistency for dynamic objects. The central claim is that this multimodal approach yields higher-fidelity outputs than specialized models on nuScenes and Waymo with only minimal training and no extra data.

Significance. If the outperformance claims are substantiated, the work could improve controllability in driving world models, enabling more precise trajectory-guided video generation for autonomous driving evaluation. The explicit use of tokenized language prompts and 2D motion priors offers a lightweight alternative to direct control injection or coarse commands.

major comments (1)

Abstract: the claim that DriVerse 'outperforms specialized models on future video generation tasks across both the nuScenes and Waymo datasets' is presented without any quantitative metrics (FVD, FID, trajectory adherence), baselines, ablation results, or experimental protocol details. This directly undermines assessment of whether the tokenized prompts, 2D priors, and motion alignment module produce the asserted gains rather than implementation artifacts.
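
For reference, FID and FVD are both Fréchet distances between Gaussian fits to feature embeddings of real and generated samples; they differ in the feature extractor (image features for FID, video features for FVD). The standard form is given below, where $\mu_r, \Sigma_r$ and $\mu_g, \Sigma_g$ denote the mean and covariance of the real and generated feature sets. This is the textbook definition, not a formula restated from the paper, and the feature extractors and sample counts the authors use are not given on this page.

```latex
\[
d_F^2 \;=\; \lVert \mu_r - \mu_g \rVert_2^2
\;+\; \operatorname{Tr}\!\left( \Sigma_r + \Sigma_g - 2\,(\Sigma_r \Sigma_g)^{1/2} \right)
\]
```

Lower values indicate that the generated feature distribution is closer to the real one, so a reported improvement should state the extractor and evaluation set alongside the score.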

minor comments (2)

The size, construction, and coverage of the 'predefined trend vocabulary' are not specified, which affects reproducibility of the textual prompting component.
The base generative model, exact training regime (e.g., number of epochs, learning rate schedule), and definition of 'minimal training' should be stated explicitly to clarify the data-efficiency claim.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their thoughtful review and for highlighting an important clarity issue in the abstract. We address the comment below and will incorporate revisions to strengthen the manuscript.

Referee: Abstract: the claim that DriVerse 'outperforms specialized models on future video generation tasks across both the nuScenes and Waymo datasets' is presented without any quantitative metrics (FVD, FID, trajectory adherence), baselines, ablation results, or experimental protocol details. This directly undermines assessment of whether the tokenized prompts, 2D priors, and motion alignment module produce the asserted gains rather than implementation artifacts.

Authors: We agree that the abstract claim would be more informative with supporting quantitative evidence. The full manuscript (Section 4 and Tables 1-3) reports FVD, FID, and trajectory adherence metrics, along with comparisons to baselines such as DriveDreamer, Vista, and GAIA-1, plus ablations isolating the contributions of tokenized prompts, 2D motion priors, and the motion alignment module. Experimental protocols (training details, dataset splits, evaluation metrics) are described in Section 3. To address the referee's concern directly, we will revise the abstract to include the key quantitative results (e.g., relative FVD improvements on nuScenes and Waymo) and a brief reference to the evaluation protocol. This change will make the performance claims self-contained while preserving the abstract's brevity.

Revision: yes

Circularity Check

0 steps flagged

No circularity: novel modules and empirical claims are independent of inputs

The paper introduces new architectural elements—tokenized textual prompts via a predefined trend vocabulary, conversion of 3D trajectories to 2D spatial motion priors, and a lightweight motion alignment module for inter-frame consistency—explicitly to address alignment problems in existing base generative models. These components are described as additions rather than redefinitions or fits of prior outputs. The central claim of outperformance on nuScenes and Waymo with minimal training is presented as an empirical result to be validated externally, with no equations or derivations shown that reduce by construction to the input data or self-citations. No load-bearing self-citation chains, fitted parameters renamed as predictions, or ansatzes smuggled via prior work are evident in the provided text. The derivation chain remains self-contained as a proposed system whose validity rests on future experimental verification rather than tautological equivalence to its own inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract provides no details on specific parameters, axioms, or new entities; the model introduces 'predefined trend vocabulary' and 'lightweight motion alignment module' but without specifics.

pith-pipeline@v0.9.0 · 5767 in / 1044 out tokens · 122486 ms · 2026-05-22T17:47:16.658658+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: Multimodal Trajectory Prompting (MTP) ... trend vocabulary ... 12 angular sectors ... Trajectory-Guided Spatial Anchors (TSA) ... Latent Motion Alignment (LMA) ... motion-weighted consistency loss

IndisputableMonolith/Foundation/AlexanderDuality.lean · alexander_duality_circle_linking
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: Dynamic Window Generation (DWG) ... anchor visibility Vt ... heading-angle change

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

76 extracted references · 76 canonical work pages · 15 internal anchors

[1]

Frozen in time: A joint video and image encoder for end-to-end retrieval

Max Bain, Arsha Nagrani, G ¨ul Varol, and Andrew Zisser- man. Frozen in time: A joint video and image encoder for end-to-end retrieval. In Proceedings of the IEEE/CVF Inter- national Conference on Computer Vision, pages 1728–1738,

work page
[2]

Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets

Andreas Blattmann, Tim Dockhorn, Sumith Kulal, Daniel Mendelevitch, Maciej Kilian, Dominik Lorenz, Yam Levi, Zion English, Vikram V oleti, Adam Letts, et al. Stable video diffusion: Scaling latent video diffusion models to large datasets. arXiv preprint arXiv:2311.15127, 2023. 3

work page internal anchor Pith review Pith/arXiv arXiv 2023
[3]

Muvo: A multimodal generative world model for autonomous driving with geometric representations

Daniel Bogdoll, Yitian Yang, and J Marius Z ¨ollner. Muvo: A multimodal generative world model for autonomous driving with geometric representations. arXiv preprint arXiv:2311.11762, 2023. 2

work page arXiv 2023
[4]

Video generation models as world simula- tors

Tim Brooks, Bill Peebles, Connor Holmes, Will DePue, Yufei Guo, Li Jing, David Schnurr, Joe Taylor, Troy Luhman, Eric Luhman, Clarence Ng, Ricky Wang, and Aditya Ramesh. Video generation models as world simula- tors. https://openai.com/index/video-generation-models-as- world-simulators/, 2024. 3

work page 2024
[5]

nuscenes: A mul- timodal dataset for autonomous driving

Holger Caesar, Varun Bankiti, Alex H Lang, Sourabh V ora, Venice Erin Liong, Qiang Xu, Anush Krishnan, Yu Pan, Giancarlo Baldan, and Oscar Beijbom. nuscenes: A mul- timodal dataset for autonomous driving. In CVPR, pages 11621–11631, 2020. 6

work page 2020
[6]

Egocentric vehicle dense video captioning

Feiyu Chen, Cong Xu, Qi Jia, Yihua Wang, Yuhan Liu, Hao- tian Zhang, and Endong Wang. Egocentric vehicle dense video captioning. In Proceedings of the 32nd ACM Inter- national Conference on Multimedia , pages 137–146, 2024. 3

work page 2024
[7]

Videocrafter1: Open diffusion models for high-quality video generation, 2023

Haoxin Chen, Menghan Xia, Yingqing He, Yong Zhang, Xiaodong Cun, Shaoshu Yang, Jinbo Xing, Yaofang Liu, Qifeng Chen, Xintao Wang, Chao Weng, and Ying Shan. Videocrafter1: Open diffusion models for high-quality video generation, 2023. 3

work page 2023
[8]

Motion- Conditioned Diffusion Model for Controllable Video Synthesis,

Tsai-Shien Chen, Chieh Hubert Lin, Hung-Yu Tseng, Tsung- Yi Lin, and Ming-Hsuan Yang. Motion-conditioned diffu- sion model for controllable video synthesis. arXiv preprint arXiv:2304.14404, 2023. 3

work page arXiv 2023
[9]

Control-a-video: Controllable text-to-video generation with diffusion models

Weifeng Chen, Jie Wu, Pan Xie, Hefeng Wu, Jiashi Li, Xin Xia, Xuefeng Xiao, and Liang Lin. Control-a-video: Controllable text-to-video generation with diffusion models. arXiv preprint arXiv:2305.13840, 2023. 3

work page arXiv 2023
[10]

Seine: Short-to-long video diffu- sion model for generative transition and prediction

Xinyuan Chen, Yaohui Wang, Lingjun Zhang, Shaobin Zhuang, Xin Ma, Jiashuo Yu, Yali Wang, Dahua Lin, Yu Qiao, and Ziwei Liu. Seine: Short-to-long video diffu- sion model for generative transition and prediction. In The Twelfth International Conference on Learning Representa- tions, 2023. 3

work page 2023
[11]

Driving- gpt: Unifying driving world modeling and planning with multi-modal autoregressive transformers

Yuntao Chen, Yuqi Wang, and Zhaoxiang Zhang. Driving- gpt: Unifying driving world modeling and planning with multi-modal autoregressive transformers. arXiv preprint arXiv:2412.18607, 2024. 3, 4

work page arXiv 2024
[12]

Structure and content-guided video synthesis with diffusion models

Patrick Esser, Johnathan Chiu, Parmida Atighehchian, Jonathan Granskog, and Anastasis Germanidis. Structure and content-guided video synthesis with diffusion models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 7346–7356, 2023. 3

work page 2023
[13]

Magicdrive: Street view generation with diverse 3d geometry control.arXiv preprint arXiv:2310.02601, 2023

Ruiyuan Gao, Kai Chen, Enze Xie, Lanqing Hong, Zhenguo Li, Dit-Yan Yeung, and Qiang Xu. Magicdrive: Street view generation with diverse 3d geometry control. arXiv preprint arXiv:2310.02601, 2023. 2

work page arXiv 2023
[14]

Vista: A generalizable driving world model with high fidelity and versatile controllability

Shenyuan Gao, Jiazhi Yang, Li Chen, Kashyap Chitta, Yi- hang Qiu, Andreas Geiger, Jun Zhang, and Hongyang Li. Vista: A generalizable driving world model with high fidelity and versatile controllability. In NeurIPS, 2024. 2, 4, 6, 7

work page 2024
[15]

Imagine-2-drive: High-fidelity world modeling in carla for autonomous vehi- cles

Anant Garg and K Madhava Krishna. Imagine-2-drive: High-fidelity world modeling in carla for autonomous vehi- cles. arXiv preprint arXiv:2411.10171, 2024. 2

work page arXiv 2024
[16]

Worldgpt: Empowering llm as multimodal world model

Zhiqi Ge, Hongzhe Huang, Mingze Zhou, Juncheng Li, Guoming Wang, Siliang Tang, and Yueting Zhuang. Worldgpt: Empowering llm as multimodal world model. In Proceedings of the 32nd ACM International Conference on Multimedia, pages 7346–7355, 2024. 2

work page 2024
[17]

Dome: Tam- ing diffusion model into high-fidelity controllable occupancy world model

Songen Gu, Wei Yin, Bu Jin, Xiaoyang Guo, Junming Wang, Haodong Li, Qian Zhang, and Xiaoxiao Long. Dome: Tam- ing diffusion model into high-fidelity controllable occupancy world model. arXiv preprint arXiv:2410.10429, 2024. 2

work page arXiv 2024
[18]

World models for autonomous driving: An initial survey

Yanchen Guan, Haicheng Liao, Zhenning Li, Jia Hu, Runze Yuan, Yunjian Li, Guohui Zhang, and Chengzhong Xu. World models for autonomous driving: An initial survey. IEEE Transactions on Intelligent Vehicles, 2024. 3

work page 2024
[19]

Infinitydrive: Breaking time limits in driving world models

Xi Guo, Chenjing Ding, Haoxuan Dou, Xin Zhang, Weix- uan Tang, and Wei Wu. Infinitydrive: Breaking time limits in driving world models. arXiv preprint arXiv:2412.01522,

work page arXiv
[20]

Sparsectrl: Adding sparse con- trols to text-to-video diffusion models

Yuwei Guo, Ceyuan Yang, Anyi Rao, Maneesh Agrawala, Dahua Lin, and Bo Dai. Sparsectrl: Adding sparse con- trols to text-to-video diffusion models. arXiv preprint arXiv:2311.16933, 2023. 3

work page arXiv 2023
[21]

Photorealistic video generation with diffusion models

Agrim Gupta, Lijun Yu, Kihyuk Sohn, Xiuye Gu, Meera Hahn, Li Fei-Fei, Irfan Essa, Lu Jiang, and Jos ´e Lezama. Photorealistic video generation with diffusion models. arXiv preprint arXiv:2312.06662, 2023. 3

work page arXiv 2023
[22]

World models

David Ha and J ¨urgen Schmidhuber. World models. In NeurIPS, 2018. 3

work page 2018
[23]

CameraCtrl: Enabling Camera Control for Text-to-Video Generation

Hao He, Yinghao Xu, Yuwei Guo, Gordon Wetzstein, Bo Dai, Hongsheng Li, and Ceyuan Yang. Cameractrl: Enabling camera control for text-to-video generation. arXiv preprint arXiv:2404.02101, 2024. 3

work page internal anchor Pith review Pith/arXiv arXiv 2024
[24]

Latent Video Diffusion Models for High-Fidelity Long Video Generation

Yingqing He, Tianyu Yang, Yong Zhang, Ying Shan, and Qifeng Chen. Latent video diffusion models for high-fidelity video generation with arbitrary lengths. arXiv preprint arXiv:2211.13221, 2022. 3

work page internal anchor Pith review Pith/arXiv arXiv 2022
[25]

Gans trained by a two time-scale update rule converge to a local nash equilib- rium

Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilib- rium. In NeurIPS, volume 30, 2017. 6

work page 2017
[26]

Denoising diffu- sion probabilistic models

Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffu- sion probabilistic models. Advances in Neural Information Processing Systems, 33:6840–6851, 2020. 3 10

work page 2020
[27]

Video Diffusion Models

Jonathan Ho, Tim Salimans, Alexey Gritsenko, William Chan, Mohammad Norouzi, and David J Fleet. Video dif- fusion models. arXiv:2204.03458, 2022. 3

work page internal anchor Pith review Pith/arXiv arXiv 2022
[28]

CogVideo: Large-scale Pretraining for Text-to-Video Generation via Transformers

Wenyi Hong, Ming Ding, Wendi Zheng, Xinghan Liu, and Jie Tang. Cogvideo: Large-scale pretraining for text-to-video generation via transformers. arXiv preprint arXiv:2205.15868, 2022. 3

work page internal anchor Pith review Pith/arXiv arXiv 2022
[29]

GAIA-1: A Generative World Model for Autonomous Driving

Anthony Hu, Lloyd Russell, Hudson Yeo, Zak Murez, George Fedoseev, Alex Kendall, Jamie Shotton, and Gian- luca Corrado. Gaia-1: A generative world model for au- tonomous driving. arXiv preprint arXiv:2309.17080, 2023. 2, 3

work page internal anchor Pith review Pith/arXiv arXiv 2023
[30]

LoRA: Low-Rank Adaptation of Large Language Models

Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen- Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. Lora: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685, 2021. 3

work page internal anchor Pith review Pith/arXiv arXiv 2021
[31]

Animate anyone: Consistent and controllable image-to-video synthesis for character animation

Li Hu, Xin Gao, Peng Zhang, Ke Sun, Bang Zhang, and Liefeng Bo. Animate anyone: Consistent and controllable image-to-video synthesis for character animation. arXiv preprint arXiv:2311.17117, 2023. 3

work page arXiv 2023
[32]

Driving- world: Constructingworld model for autonomous driving via video gpt

Xiaotao Hu, Wei Yin, Mingkai Jia, Junyuan Deng, Xiaoyang Guo, Qian Zhang, Xiaoxiao Long, and Ping Tan. Driving- world: Constructingworld model for autonomous driving via video gpt. arXiv preprint arXiv:2412.19505, 2024. 4

work page arXiv 2024
[34]

Adriver-i: A general world model for autonomous driving

Fan Jia, Weixin Mao, Yingfei Liu, Yucheng Zhao, Yuqing Wen, Chi Zhang, Xiangyu Zhang, and Tiancai Wang. Adriver-i: A general world model for autonomous driving. arXiv preprint arXiv:2311.13549, 2023. 3

work page arXiv 2023
[35]

Dive: Dit-based video generation with enhanced control

Junpeng Jiang, Gangyi Hong, Lijun Zhou, Enhui Ma, Heng- tong Hu, Xia Zhou, Jie Xiang, Fan Liu, Kaicheng Yu, Haiyang Sun, et al. Dive: Dit-based video generation with enhanced control. arXiv preprint arXiv:2409.01595, 2024. 3

work page arXiv 2024
[36]

Cotracker3: Simpler and better point tracking by pseudo-labelling real videos

Nikita Karaev, Iurii Makarov, Jianyuan Wang, Natalia Neverova, Andrea Vedaldi, and Christian Rupprecht. Co- tracker3: Simpler and better point tracking by pseudo- labelling real videos. In Proc. arXiv:2410.11831, 2024. 5

work page arXiv 2024
[37]

Dreampose: Fashion video synthesis with stable diffusion

Johanna Karras, Aleksander Holynski, Ting-Chun Wang, and Ira Kemelmacher-Shlizerman. Dreampose: Fashion video synthesis with stable diffusion. In Proceedings of the IEEE/CVF International Conference on Computer Vision , pages 22680–22690, 2023. 3

work page 2023
[38]

Text2video-zero: Text-to- image diffusion models are zero-shot video generators.IEEE International Conference on Computer Vision (ICCV), 2023

Levon Khachatryan, Andranik Movsisyan, Vahram Tade- vosyan, Roberto Henschel, Zhangyang Wang, Shant Navasardyan, and Humphrey Shi. Text2video-zero: Text-to- image diffusion models are zero-shot video generators.IEEE International Conference on Computer Vision (ICCV), 2023. 3

work page 2023
[39]

Drivegan: Towards a controllable high-quality neural simulation

Seung Wook Kim, Jonah Philion, Antonio Torralba, and Sanja Fidler. Drivegan: Towards a controllable high-quality neural simulation. In CVPR, pages 5820–5829, 2021. 3, 7

work page 2021
[40]

VideoPoet: A Large Language Model for Zero-Shot Video Generation

Dan Kondratyuk, Lijun Yu, Xiuye Gu, Jos ´e Lezama, Jonathan Huang, Rachel Hornung, Hartwig Adam, Hassan Akbari, Yair Alon, Vighnesh Birodkar, et al. Videopoet: A large language model for zero-shot video generation. arXiv preprint arXiv:2312.14125, 2023. 3

work page internal anchor Pith review Pith/arXiv arXiv 2023
[41]

Drivingdiffusion: Layout-guided multi-view driving scenarios video genera- tion with latent diffusion model

Xiaofan Li, Yifu Zhang, and Xiaoqing Ye. Drivingdiffusion: Layout-guided multi-view driving scenarios video genera- tion with latent diffusion model. In European Conference on Computer Vision, pages 469–485. Springer, 2024. 2, 3, 6

work page 2024
[42]

Seeing the future, perceiving the future: A unified driving world model for future generation and perception

Dingkang Liang, Dingyuan Zhang, Xin Zhou, Sifan Tu, Tianrui Feng, Xiaofan Li, Yumeng Zhang, Mingyang Du, Xiao Tan, and Xiang Bai. Seeing the future, perceiving the future: A unified driving world model for future generation and perception. arXiv preprint arXiv:2503.13587, 2025. 4

work page arXiv 2025
[43]

Wovogen: World volume-aware diffusion for con- trollable multi-camera driving scene generation

Jiachen Lu, Ze Huang, Zeyu Yang, Jiahui Zhang, and Li Zhang. Wovogen: World volume-aware diffusion for con- trollable multi-camera driving scene generation. In ECCV, pages 329–345. Springer, 2024. 2, 3, 7

work page 2024
[44]

Latte: Latent Diffusion Transformer for Video Generation

Xin Ma, Yaohui Wang, Gengyun Jia, Xinyuan Chen, Zi- wei Liu, Yuan-Fang Li, Cunjian Chen, and Yu Qiao. Latte: Latent diffusion transformer for video generation. arXiv preprint arXiv:2401.03048, 2024. 3

work page internal anchor Pith review Pith/arXiv arXiv 2024
[45]

Driveworld: 4d pre-trained scene understanding via world models for autonomous driving

Chen Min, Dawei Zhao, Liang Xiao, Jian Zhao, Xinli Xu, Zheng Zhu, Lei Jin, Jianshu Li, Yulan Guo, Junliang Xing, et al. Driveworld: 4d pre-trained scene understanding via world models for autonomous driving. In CVPR, pages 15522–15533, 2024. 2

work page 2024
[46]

Orb-slam: a versatile and accurate monocular slam system

Raul Mur-Artal, JMM Montiel, and JD Tardos. Orb-slam: a versatile and accurate monocular slam system. IEEE Trans- actions on Robotics, 31(5):1147–1163, 2015. 8

work page 2015
[47]

Scalable Diffusion Models with Transformers

Geoffrey Peebles and Saining Xie. Scalable diffusion models with transformers. arXiv preprint arXiv:2212.09748, 2022. 4

work page internal anchor Pith review Pith/arXiv arXiv 2022
[48]

Scalable diffusion models with transformers

William Peebles and Saining Xie. Scalable diffusion models with transformers. In Proceedings of the IEEE/CVF Inter- national Conference on Computer Vision, pages 4195–4205,

work page
[49]

Com- positional 3d scene generation using locally conditioned dif- fusion

Ryan Po, Wang Yifan, and Vladislav Golyanik et al. Com- positional 3d scene generation using locally conditioned dif- fusion. In ArXiv, 2023. 3

work page 2023
[50]

High-resolution image synthesis with latent diffusion models

Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Bj ¨orn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10684–10695, 2022. 3

work page 2022
[51]

Mm-diffusion: Learning multi-modal diffusion mod- els for joint audio and video generation

Ludan Ruan, Yiyang Ma, Huan Yang, Huiguo He, Bei Liu, Jianlong Fu, Nicholas Jing Yuan, Qin Jin, and Baining Guo. Mm-diffusion: Learning multi-modal diffusion mod- els for joint audio and video generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10219–10228, 2023. 3

work page 2023
[52]

Denoising Diffusion Implicit Models

Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. arXiv preprint arXiv:2010.02502, 2020. 3

[53]

Scalability in perception for autonomous driving: Waymo open dataset

Pei Sun, Henrik Kretzschmar, Xerxes Dotiwalla, Aurelien Chouard, Vijaysai Patnaik, Paul Tsui, James Guo, Yin Zhou, Yuning Chai, Benjamin Caine, et al. Scalability in perception for autonomous driving: Waymo open dataset. In CVPR, pages 2446–2454, 2020. 6

[54]

The role of world models in shaping autonomous driving: A comprehensive survey

Sifan Tu, Xin Zhou, Dingkang Liang, Xingyu Jiang, Yumeng Zhang, Xiaofan Li, and Xiang Bai. The role of world models in shaping autonomous driving: A comprehensive survey. arXiv preprint arXiv:2502.10498, 2025. 3

[55]

Fvd: A new metric for video generation

Thomas Unterthiner, Sjoerd van Steenkiste, Karol Kurach, Raphaël Marinier, Marcin Michalski, and Sylvain Gelly. Fvd: A new metric for video generation. https://openreview.net/forum?id=rylgEULtdN, 2019. 6

[56]

Occsora: 4d occupancy generation models as world simulators for autonomous driving

Lening Wang, Wenzhao Zheng, Yilong Ren, Han Jiang, Zhiyong Cui, Haiyang Yu, and Jiwen Lu. Occsora: 4d occupancy generation models as world simulators for autonomous driving. arXiv preprint arXiv:2405.20337, 2024. 2

[57]

Drivedreamer: Towards real-world-driven world models for autonomous driving

Xiaofeng Wang, Zheng Zhu, Guan Huang, Xinze Chen, Jiagang Zhu, and Jiwen Lu. Drivedreamer: Towards real-world-driven world models for autonomous driving. In ECCV, pages 55–72. Springer, 2024. 4, 7

[58]

Drivedreamer: Towards real-world-driven world models for autonomous driving

Xiaofeng Wang, Zheng Zhu, Guan Huang, Xinze Chen, Jiagang Zhu, and Jiwen Lu. Drivedreamer: Towards real-world-driven world models for autonomous driving. In ECCV,

[59]

Driving into the future: Multiview visual forecasting and planning with world model for autonomous driving

Yuqi Wang, Jiawei He, Lue Fan, Hongxin Li, Yuntao Chen, and Zhaoxiang Zhang. Driving into the future: Multiview visual forecasting and planning with world model for autonomous driving. In CVPR, pages 14749–14759, 2024. 2, 4, 7

[60]

Driving into the future: Multiview visual forecasting and planning with world model for autonomous driving

Yuqi Wang, Jiawei He, Lue Fan, Hongxin Li, Yuntao Chen, and Zhaoxiang Zhang. Driving into the future: Multiview visual forecasting and planning with world model for autonomous driving. In CVPR, pages 14749–14759, 2024. 4

[61]

Motionctrl: A unified and flexible motion controller for video generation

Zhouxia Wang, Ziyang Yuan, Xintao Wang, Tianshui Chen, Menghan Xia, Ping Luo, and Ying Shan. Motionctrl: A unified and flexible motion controller for video generation. arXiv preprint arXiv:2312.03641, 2023. 3

[62]

Wan: Open and Advanced Large-Scale Video Generative Models

WanTeam, Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianxiao Yang, Jianyuan Zeng, Jiayu Wang, Jingfeng Zhang, Jingren Zhou, Jinkai Wang, Jixuan Chen, Kai Zhu, Kang Zhao, Keyu Yan, Lianghua Huang, Mengyang Feng, Ningyi Zhang, Pandeng Li, Pingyu Wu, Ruihang Chu, Ruili Feng, Shiwei Zhang, Siyang Sun, Tao Fang...

[64]

Panacea: Panoramic and controllable video generation for autonomous driving

Yuqing Wen, Yucheng Zhao, Yingfei Liu, Fan Jia, Yanhui Wang, Chong Luo, Chi Zhang, Tiancai Wang, Xiaoyan Sun, and Xiangyu Zhang. Panacea: Panoramic and controllable video generation for autonomous driving. In CVPR, pages 6902–6912, 2024. 3

[65]

Holodrive: Holistic 2d-3d multi-modal street scene generation for autonomous driving

Zehuan Wu, Jingcheng Ni, Xiaodong Wang, Yuxin Guo, Rui Chen, Lewei Lu, Jifeng Dai, and Yuwen Xiong. Holodrive: Holistic 2d-3d multi-modal street scene generation for autonomous driving. arXiv preprint arXiv:2412.01407, 2024. 2

[66]

Magicanimate: Temporally consistent human image animation using diffusion model

Zhongcong Xu, Jianfeng Zhang, Jun Hao Liew, Hanshu Yan, Jia-Wei Liu, Chenxu Zhang, Jiashi Feng, and Mike Zheng Shou. Magicanimate: Temporally consistent human image animation using diffusion model. arXiv preprint arXiv:2311.16498, 2023. 3

[67]

Generalized predictive model for autonomous driving

Jiazhi Yang, Shenyuan Gao, Yihang Qiu, Li Chen, Tianyu Li, Bo Dai, Kashyap Chitta, Penghao Wu, Jia Zeng, Ping Luo, Jun Zhang, Andreas Geiger, Yu Qiao, and Hongyang Li. Generalized predictive model for autonomous driving. In CVPR, 2024. 4, 7

[68]

Direct-a-video: Customized video generation with user-directed camera movement and object motion

Shiyuan Yang, Liang Hou, Haibin Huang, Chongyang Ma, Pengfei Wan, Di Zhang, Xiaodong Chen, and Jing Liao. Direct-a-video: Customized video generation with user-directed camera movement and object motion. arXiv preprint arXiv:2402.03162, 2024. 3

[69]

Physical informed driving world model

Zhuoran Yang, Xi Guo, Chenjing Ding, Chiyu Wang, and Wei Wu. Physical informed driving world model. arXiv preprint arXiv:2412.08410, 2024. 2

[70]

Dualdiff+: Dual-branch diffusion for high-fidelity video generation with reward guidance

Zhao Yang, Zezhong Qian, Xiaofan Li, Weixiang Xu, Gongpeng Zhao, Ruohong Yu, Lingsi Zhu, and Longjun Liu. Dualdiff+: Dual-branch diffusion for high-fidelity video generation with reward guidance. arXiv preprint arXiv:2503.03689, 2025. 3

[71]

CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer

Zhuoyi Yang, Jiayan Teng, Wendi Zheng, Ming Ding, Shiyu Huang, Jiazheng Xu, Yuanming Yang, Wenyi Hong, Xiaohan Zhang, Guanyu Feng, et al. Cogvideox: Text-to-video diffusion models with an expert transformer. arXiv preprint arXiv:2408.06072, 2024. 3

[72]

PixArt-$\alpha$: Fast Training of Diffusion Transformer for Photorealistic Text-to-Image Synthesis

Jiahui Zhang, Kangle Han, Zhen Li, Di He, Hao Fan, Yinpeng Wu, Lei Zhou, Ping Liu, Jiaying Dong, Dongdong Chen, et al. Pixart-α: A powerful text-to-image generation foundation model. arXiv preprint arXiv:2310.00426, 2023. 4

[73]

I2VGen-XL: High-Quality Image-to-Video Synthesis via Cascaded Diffusion Models

Shiwei Zhang, Jiayu Wang, Yingya Zhang, Kang Zhao, Hangjie Yuan, Zhiwu Qin, Xiang Wang, Deli Zhao, and Jingren Zhou. I2vgen-xl: High-quality image-to-video synthesis via cascaded diffusion models. arXiv preprint arXiv:2311.04145, 2023. 3

[74]

Bevworld: A multimodal world model for autonomous driving via unified bev latent space

Yumeng Zhang, Shi Gong, Kaixin Xiong, Xiaoqing Ye, Xiao Tan, Fan Wang, Jizhou Huang, Hua Wu, and Haifeng Wang. Bevworld: A multimodal world model for autonomous driving via unified bev latent space. arXiv preprint arXiv:2407.05679, 2024. 2

[75]

Controlvideo: Training-free controllable text-to-video generation

Yabo Zhang, Yuxiang Wei, Dongsheng Jiang, Xiaopeng Zhang, Wangmeng Zuo, and Qi Tian. Controlvideo: Training-free controllable text-to-video generation. arXiv preprint arXiv:2305.13077, 2023. 3

[76]

Drivedreamer4d: World models are effective data machines for 4d driving scene representation

Guosheng Zhao, Chaojun Ni, Xiaofeng Wang, Zheng Zhu, Xueyang Zhang, Yida Wang, Guan Huang, Xinze Chen, Boyuan Wang, Youyi Zhang, et al. Drivedreamer4d: World models are effective data machines for 4d driving scene representation. arXiv preprint arXiv:2410.13571, 2024. 2, 4

[77]

Occworld: Learning a 3d occupancy world model for autonomous driving

Wenzhao Zheng, Weiliang Chen, Yuanhui Huang, Borui Zhang, Yueqi Duan, and Jiwen Lu. Occworld: Learning a 3d occupancy world model for autonomous driving. In ECCV, pages 55–72. Springer, 2024. 2

[78]

Hermes: A unified self-driving world model for simultaneous 3d scene understanding and generation

Xin Zhou, Dingkang Liang, Sifan Tu, Xiwu Chen, Yikang Ding, Dingyuan Zhang, Feiyang Tan, Hengshuang Zhao, and Xiang Bai. Hermes: A unified self-driving world model for simultaneous 3d scene understanding and generation. arXiv preprint arXiv:2501.14729, 2025. 2

