DriVerse: Navigation World Model for Driving Simulation via Multimodal Trajectory Prompting and Motion Alignment
Pith reviewed 2026-05-22 17:47 UTC · model grok-4.3
The pith
DriVerse generates higher-fidelity driving videos from one image and a future trajectory by converting the trajectory into textual prompts and 2D motion priors, then applying a motion alignment step.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
DriVerse is a generative model that takes a single starting image and a future trajectory and produces a corresponding video of the driving scene. It achieves explicit control in two ways: it tokenizes the trajectory into language prompts drawn from a predefined trend vocabulary, and it projects the 3D trajectory into 2D spatial motion priors that steer the static elements of the scene. A motion alignment module then improves temporal consistency specifically for dynamic pixels across frames. Trained with minimal updates and no extra data, the model yields higher-quality future video predictions than prior specialized approaches on both the nuScenes and Waymo datasets.
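To make the tokenization step concrete, here is a minimal sketch of turning a trajectory into trend tokens. The token names (`<dir_0>` and so on, `<stop>`), the function names, and the use of 12 equal heading sectors are assumptions for illustration only; the paper's actual trend vocabulary is not specified on this page.

```python
import math

# Hypothetical trend vocabulary: one token per 30-degree heading sector.
# Sector 0 is centred on "straight ahead"; positive angles turn left.
NUM_SECTORS = 12
SECTOR_WIDTH = 2 * math.pi / NUM_SECTORS


def sector_token(dx: float, dy: float) -> str:
    """Map one displacement (x forward, y left, in metres) to a sector token."""
    angle = math.atan2(dy, dx)  # heading of the step, in (-pi, pi]
    # Shift by half a sector so that sector 0 spans [-15, +15) degrees.
    index = math.floor((angle + SECTOR_WIDTH / 2) / SECTOR_WIDTH) % NUM_SECTORS
    return f"<dir_{index}>"


def trajectory_to_prompt(waypoints):
    """Turn a list of (x, y) ego-frame waypoints into a token string."""
    tokens = []
    for (x0, y0), (x1, y1) in zip(waypoints, waypoints[1:]):
        dx, dy = x1 - x0, y1 - y0
        if math.hypot(dx, dy) < 1e-6:
            tokens.append("<stop>")  # no movement between waypoints
        else:
            tokens.append(sector_token(dx, dy))
    return " ".join(tokens)
```

A trajectory that drives straight, pauses, and then moves left would become a short token sequence that a text encoder can consume alongside an ordinary caption. A real vocabulary would likely also encode speed or curvature, which this sketch ignores.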
What carries the argument
Multimodal trajectory prompting that supplies both tokenized textual trend prompts and 2D spatial motion priors extracted from 3D trajectories, together with a lightweight motion alignment module that targets inter-frame consistency of dynamic pixels.
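The spatial half of the prompting can be illustrated with a plain pinhole projection of future waypoints onto the image plane. This is a sketch under stated assumptions, not the paper's procedure: the waypoints are assumed to lie on a flat ground plane in the ego frame, the camera is assumed to look straight ahead from a fixed height, and `project_waypoints`, `cam_height`, and the intrinsics `fx`, `fy`, `cx`, `cy` are illustrative names.

```python
def project_waypoints(waypoints, fx, fy, cx, cy, cam_height=1.5, min_depth=0.1):
    """Project ground-plane ego waypoints to pixel coordinates.

    waypoints: list of (x, y) in metres, x forward and y to the left.
    fx, fy, cx, cy: pinhole intrinsics in pixels (assumed known).
    cam_height: assumed camera height above the ground, in metres.
    Returns a list of (u, v) pixels, with u to the right and v downward.
    """
    pixels = []
    for x, y in waypoints:
        if x <= min_depth:
            # Point is behind or too close to the camera: not projectable.
            continue
        # Camera frame: right = -y, down = cam_height (ground is below), depth = x.
        u = fx * (-y) / x + cx
        v = fy * cam_height / x + cy
        pixels.append((u, v))
    return pixels
```

The resulting pixel track could then be rasterized into a heat map or a set of anchor points and fed to the generator as a spatial condition. Points further ahead land closer to the horizon row `cy`, which is why distant waypoints crowd together in the image.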
If this is right
- Trajectory-specific videos become usable for evaluating actual autonomous driving planners instead of relying on coarse text commands or discrete signals.
- Static scene content and dynamic object motion are controlled more precisely because guidance is supplied in both language and spatial forms.
- Temporal coherence of moving elements improves over long sequences once the alignment module enforces consistency on dynamic pixels.
- The same performance gains appear on multiple real-world driving datasets while requiring only minimal training and no new data collection.
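The idea of enforcing consistency only on dynamic pixels can be sketched as a temporal loss weighted by how much each pixel moves in the ground-truth video. This is an illustrative stand-in, not the paper's motion alignment module: it operates on raw pixel values in plain Python lists rather than on latents, and the function name and weighting scheme are assumptions.

```python
def motion_weighted_consistency_loss(pred_frames, target_frames, eps=1e-8):
    """Temporal-difference error, weighted towards pixels that move.

    pred_frames, target_frames: lists of frames with equal shapes; each
    frame is a flat list of float pixel values.
    """
    weighted_error = 0.0
    weight_sum = 0.0
    for t in range(1, len(target_frames)):
        for i in range(len(target_frames[t])):
            target_delta = target_frames[t][i] - target_frames[t - 1][i]
            pred_delta = pred_frames[t][i] - pred_frames[t - 1][i]
            # Magnitude of ground-truth change is a crude "dynamic pixel" mask.
            weight = abs(target_delta)
            weighted_error += weight * (pred_delta - target_delta) ** 2
            weight_sum += weight
    # eps avoids division by zero when the target video is fully static.
    return weighted_error / (weight_sum + eps)
```

Because static pixels receive zero weight, errors there are invisible to this term, so it would have to be combined with an ordinary reconstruction or diffusion loss rather than used alone.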
Where Pith is reading between the lines
- The same two-channel prompting pattern could be tested on video generation tasks outside driving, such as robot manipulation or animated character control.
- Because the alignment module is lightweight, it could be added to many existing base generative models without retraining the entire system from scratch.
- If the trend vocabulary proves sufficient, future work might explore whether learned vocabularies could further reduce the gap between text and precise spatial control.
Load-bearing premise
The combination of a fixed trend vocabulary for text prompts, 2D motion priors from 3D trajectories, and the motion alignment module is enough to correct the alignment problems that arise when trajectory signals are fed directly into a base 2D generative model.
What would settle it
Run the same future-video generation benchmarks on nuScenes and Waymo and check whether DriVerse videos show lower trajectory adherence or visual quality than baselines that use direct trajectory input or discrete controls. If they do, the central claim fails.
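One simple way to quantify trajectory adherence is the average displacement error between the commanded trajectory and the trajectory recovered from the generated video. The sketch below assumes the recovered waypoints are already available (for example from a visual odometry system, which is outside the scope of the sketch); the paper's exact adherence metric is not stated on this page, so treat this as one plausible choice.

```python
import math


def average_displacement_error(reference, recovered):
    """Mean Euclidean distance between matching waypoints, in input units.

    reference: commanded trajectory, list of (x, y) tuples.
    recovered: trajectory estimated from the generated video, same length.
    """
    if not reference or len(reference) != len(recovered):
        raise ValueError("trajectories must be non-empty and of equal length")
    total = sum(math.dist(p, q) for p, q in zip(reference, recovered))
    return total / len(reference)
```

Lower values mean the generated video follows the requested path more closely, so a baseline comparison would report this number for each method on the same set of trajectories.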
Original abstract
This paper presents DriVerse, a generative model for simulating navigation-driven driving scenes from a single image and a future trajectory. Previous autonomous driving world models either directly feed the trajectory or discrete control signals into the generation pipeline, leading to poor alignment between the control inputs and the implicit features of the 2D base generative model, which results in low-fidelity video outputs. Some methods use coarse textual commands or discrete vehicle control signals, which lack the precision to guide fine-grained, trajectory-specific video generation, making them unsuitable for evaluating actual autonomous driving algorithms. DriVerse introduces explicit trajectory guidance in two complementary forms: it tokenizes trajectories into textual prompts using a predefined trend vocabulary for seamless language integration, and converts 3D trajectories into 2D spatial motion priors to enhance control over static content within the driving scene. To better handle dynamic objects, we further introduce a lightweight motion alignment module, which focuses on the inter-frame consistency of dynamic pixels, significantly enhancing the temporal coherence of moving elements over long sequences. With minimal training and no need for additional data, DriVerse outperforms specialized models on future video generation tasks across both the nuScenes and Waymo datasets. The code and models will be released to the public.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces DriVerse, a generative world model for driving simulation that produces future video from a single image and a future trajectory. It addresses misalignment in prior models by tokenizing trajectories into textual prompts via a predefined trend vocabulary, converting 3D trajectories into 2D spatial motion priors, and adding a lightweight motion alignment module to improve inter-frame consistency for dynamic objects. The central claim is that this multimodal approach yields higher-fidelity outputs than specialized models on nuScenes and Waymo with only minimal training and no extra data.
Significance. If the outperformance claims are substantiated, the work could improve controllability in driving world models, enabling more precise trajectory-guided video generation for autonomous driving evaluation. The explicit use of tokenized language prompts and 2D motion priors offers a lightweight alternative to direct control injection or coarse commands.
major comments (1)
- Abstract: the claim that DriVerse 'outperforms specialized models on future video generation tasks across both the nuScenes and Waymo datasets' is presented without any quantitative metrics (FVD, FID, trajectory adherence), baselines, ablation results, or experimental protocol details. This directly undermines assessment of whether the tokenized prompts, 2D priors, and motion alignment module produce the asserted gains rather than implementation artifacts.
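For context on the metrics named above, FID and FVD are both Fréchet distances between Gaussians fitted to feature embeddings of real and generated samples (image features for FID, video features for FVD). The symbols below are introduced here for illustration and are not taken from the paper: $\mu_r, \Sigma_r$ are the mean and covariance of real-sample features, and $\mu_g, \Sigma_g$ those of generated-sample features.

```latex
d^2\bigl((\mu_r, \Sigma_r), (\mu_g, \Sigma_g)\bigr)
  = \lVert \mu_r - \mu_g \rVert_2^2
  + \operatorname{Tr}\!\left( \Sigma_r + \Sigma_g - 2\,(\Sigma_r \Sigma_g)^{1/2} \right)
```

Lower values indicate that the generated distribution is closer to the real one, so an outperformance claim on these metrics means a smaller distance than the baselines under the same evaluation protocol.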
minor comments (2)
- The size, construction, and coverage of the 'predefined trend vocabulary' are not specified, which affects reproducibility of the textual prompting component.
- The base generative model, exact training regime (e.g., number of epochs, learning rate schedule), and definition of 'minimal training' should be stated explicitly to clarify the data-efficiency claim.
Simulated Author's Rebuttal
We thank the referee for their thoughtful review and for highlighting an important clarity issue in the abstract. We address the comment below and will incorporate revisions to strengthen the manuscript.
- Referee: Abstract: the claim that DriVerse 'outperforms specialized models on future video generation tasks across both the nuScenes and Waymo datasets' is presented without any quantitative metrics (FVD, FID, trajectory adherence), baselines, ablation results, or experimental protocol details. This directly undermines assessment of whether the tokenized prompts, 2D priors, and motion alignment module produce the asserted gains rather than implementation artifacts.
  Authors: We agree that the abstract claim would be more informative with supporting quantitative evidence. The full manuscript (Section 4 and Tables 1-3) reports FVD, FID, and trajectory adherence metrics, along with comparisons to baselines such as DriveDreamer, Vista, and GAIA-1, plus ablations isolating the contributions of tokenized prompts, 2D motion priors, and the motion alignment module. Experimental protocols (training details, dataset splits, evaluation metrics) are described in Section 3. To address the referee's concern directly, we will revise the abstract to include the key quantitative results (e.g., relative FVD improvements on nuScenes and Waymo) and a brief reference to the evaluation protocol. This change will make the performance claims self-contained while preserving the abstract's brevity.
  Revision: yes
Circularity Check
No circularity: novel modules and empirical claims are independent of inputs
The paper introduces new architectural elements—tokenized textual prompts via a predefined trend vocabulary, conversion of 3D trajectories to 2D spatial motion priors, and a lightweight motion alignment module for inter-frame consistency—explicitly to address alignment problems in existing base generative models. These components are described as additions rather than redefinitions or fits of prior outputs. The central claim of outperformance on nuScenes and Waymo with minimal training is presented as an empirical result to be validated externally, with no equations or derivations shown that reduce by construction to the input data or self-citations. No load-bearing self-citation chains, fitted parameters renamed as predictions, or ansatzes smuggled via prior work are evident in the provided text. The derivation chain remains self-contained as a proposed system whose validity rests on future experimental verification rather than tautological equivalence to its own inputs.
Axiom & Free-Parameter Ledger
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean: washburn_uniqueness_aczel (unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Multimodal Trajectory Prompting (MTP) ... trend vocabulary ... 12 angular sectors ... Trajectory-Guided Spatial Anchors (TSA) ... Latent Motion Alignment (LMA) ... motion-weighted consistency loss
- IndisputableMonolith/Foundation/AlexanderDuality.lean: alexander_duality_circle_linking (unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Dynamic Window Generation (DWG) ... anchor visibility Vt ... heading-angle change
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.