ScenarioControl: Vision-Language Controllable Vectorized Latent Scenario Generation
Pith reviewed 2026-05-10 06:20 UTC · model grok-4.3
The pith
A cross-global attention module lets text or image prompts control the creation of full 3D driving scenes in vectorized space.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
ScenarioControl is the first vision-language control mechanism for learned driving scenario generation. Given a text prompt or an input image, it synthesizes diverse, realistic 3D scenario rollouts including the map, 3D boxes of reactive actors over time, pedestrians, driving infrastructure, and ego camera observations. The method generates scenes in a vectorized latent space that represents road structure and dynamic agents jointly. To connect multimodal control with sparse vectorized scene elements, it employs a cross-global control mechanism that integrates cross-attention with a lightweight global-context branch, enabling fine-grained control over road layout and traffic conditions while preserving realism. The rollouts are temporally consistent across the perspectives of different actors in the scene and support long-horizon continuation of driving scenarios.
What carries the argument
The cross-global control mechanism, which combines cross-attention with a lightweight global-context branch to map sparse multimodal inputs onto the joint vectorized latent representation of roads and agents.
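To make the described architecture concrete, here is a minimal PyTorch sketch of what such a cross-global block could look like, assuming scene and prompt tokens share an embedding dimension. The class name, layer choices, and pooling scheme are illustrative assumptions, not the paper's implementation.

```python
# Hypothetical sketch of a cross-global control block: per-token
# cross-attention from vectorized scene latents to prompt tokens (local
# pathway), plus a lightweight global-context branch that pools scene and
# prompt features and broadcasts a single correction to every token.
import torch
import torch.nn as nn

class CrossGlobalControlBlock(nn.Module):
    def __init__(self, dim: int = 256, n_heads: int = 8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.norm_local = nn.LayerNorm(dim)
        self.global_mlp = nn.Sequential(
            nn.Linear(2 * dim, dim), nn.GELU(), nn.Linear(dim, dim)
        )
        self.norm_global = nn.LayerNorm(dim)

    def forward(self, scene_tokens: torch.Tensor, prompt_tokens: torch.Tensor):
        # scene_tokens: (B, N, dim) latent road/agent vectors
        # prompt_tokens: (B, M, dim) encoded text or image features
        attended, _ = self.cross_attn(
            query=self.norm_local(scene_tokens),
            key=prompt_tokens,
            value=prompt_tokens,
        )
        scene_tokens = scene_tokens + attended  # residual: fine-grained control
        pooled = torch.cat(
            [scene_tokens.mean(dim=1), prompt_tokens.mean(dim=1)], dim=-1
        )                                       # (B, 2*dim) global context
        scene_tokens = scene_tokens + self.norm_global(
            self.global_mlp(pooled)
        ).unsqueeze(1)                          # broadcast over all N tokens
        return scene_tokens
```

Because both pathways are residual, a near-zero prompt contribution leaves the base generative latents intact, which is one way such a design could preserve realism when the control signal is weak.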
If this is right
- Users can specify desired road shapes and traffic density through ordinary language or reference photos instead of hand-crafted scene files.
- Scenarios remain coherent when viewed from multiple actor perspectives and can be extended over long time horizons without drift.
- The vectorized output format supports direct use in physics-based simulators for testing autonomous driving systems.
- The released text-annotated dataset lowers the barrier for training additional controllable generation models.
Where Pith is reading between the lines
- This style of control could let simulation engineers create targeted edge-case tests on demand rather than relying on large pre-recorded datasets.
- The same latent space might support editing operations, such as changing weather or inserting specific obstacles after initial generation.
- If the vectorized representation proves general enough, similar techniques could transfer to other structured environments like warehouse robotics or urban planning.
Load-bearing premise
The control module can translate brief text or image cues into precise, consistent changes across the entire vectorized scene without breaking realism or continuity.
What would settle it
Run controlled experiments that measure how accurately generated road layouts and actor trajectories match the semantics of the input prompt or image, then compare those scores against prior generation baselines on the released dataset.
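As a sketch of what such a measurement could look like, the snippet below scores prompt adherence by rasterizing each generated vectorized scene and comparing CLIP embeddings of the rendering and the prompt. The `rasterize_scene` helper (vectorized scene to PIL image) and the dataset interface are assumptions; open_clip is one possible off-the-shelf scorer, not the paper's protocol.

```python
# Hypothetical control-adherence scorer: rasterize each generated vectorized
# scene, embed the rendering and the prompt with a CLIP-style model, and
# report mean cosine similarity per method.
import torch
import open_clip

model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="laion2b_s34b_b79k"
)
tokenizer = open_clip.get_tokenizer("ViT-B-32")
model.eval()

@torch.no_grad()
def prompt_adherence(prompts, scenes, rasterize_scene):
    scores = []
    for prompt, scene in zip(prompts, scenes):
        image = preprocess(rasterize_scene(scene)).unsqueeze(0)  # (1, 3, H, W)
        text = tokenizer([prompt])                               # (1, 77)
        img = model.encode_image(image)
        txt = model.encode_text(text)
        img = img / img.norm(dim=-1, keepdim=True)
        txt = txt / txt.norm(dim=-1, keepdim=True)
        scores.append((img * txt).sum().item())   # cosine similarity
    return sum(scores) / len(scores)              # higher = better adherence
```

Comparing such a score for ScenarioControl against prior generators on the released text-annotated dataset would make the "compare favorably" claim directly checkable.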
Original abstract
We introduce ScenarioControl, the first vision-language control mechanism for learned driving scenario generation. Given a text prompt or an input image, ScenarioControl synthesizes diverse, realistic 3D scenario rollouts - including map, 3D boxes of reactive actors over time, pedestrians, driving infrastructure, and ego camera observations. The method generates scenes in a vectorized latent space that represents road structure and dynamic agents jointly. To connect multimodal control with sparse vectorized scene elements, we propose a cross-global control mechanism that integrates cross-attention with a lightweight global-context branch, enabling fine-grained control over road layout and traffic conditions while preserving realism. The method produces temporally consistent scenario rollouts from the perspectives different actors in the scene, supporting long-horizon continuation of driving scenarios. To facilitate training and evaluation, we release a dataset with text annotations aligned to vectorized map structures. Extensive experiments validate that the control adherence and fidelity of ScenarioControl compare favorable to all tested methods across all experiments. Project webpage: https://light.princeton.edu/ScenarioControl
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces ScenarioControl as the first vision-language control mechanism for learned driving scenario generation. Given text prompts or input images, it synthesizes diverse, realistic 3D scenario rollouts in a vectorized latent space that jointly represents road structure and dynamic agents, including maps, reactive 3D actor boxes over time, pedestrians, infrastructure, and ego camera observations. The core technical contribution is a cross-global control mechanism that combines cross-attention with a lightweight global-context branch to map sparse multimodal inputs onto vectorized scene elements while aiming to preserve realism and temporal consistency across long-horizon rollouts from multiple actor perspectives. The authors release a dataset with text annotations aligned to vectorized maps and report that experiments show favorable control adherence and fidelity relative to tested baselines.
Significance. If the central claims hold, this would represent a meaningful advance in controllable scenario generation for autonomous driving simulation, as it enables intuitive multimodal (text/image) conditioning over complex, reactive, long-horizon vectorized scenes rather than relying solely on structured inputs. The dataset release is a concrete positive contribution that could support future work in vision-language driving models. The approach targets a practical gap between high-level control signals and detailed, temporally coherent scene synthesis.
Major comments (2)
- [§3.2] The cross-global control mechanism (cross-attention plus lightweight global-context branch) is presented as the key to connecting sparse multimodal prompts to the joint vectorized representation of map and agents. However, the description provides no explicit analysis, equations, or ablation results demonstrating how the global branch avoids diluting local actor-map interactions or producing averaged/non-reactive trajectories over long horizons. This directly bears on the central claims of temporal consistency and realism in the generated rollouts.
- [Experiments] The abstract and manuscript assert that extensive experiments validate superior control adherence and fidelity across all tested methods, yet no quantitative tables, specific metrics (e.g., control error, realism scores, temporal consistency measures), ablation studies on the global-context branch, or details on baseline implementations and post-hoc choices are referenced in the provided review materials. Without these, the load-bearing performance claims cannot be verified; a hypothetical instantiation of such metrics is sketched after the author responses below.
Minor comments (2)
- [Abstract] 'compare favorable' is grammatically incorrect and should read 'compare favorably'.
- [Abstract] 'from the perspectives different actors' is missing the preposition 'of' and should read 'from the perspectives of different actors'.
Simulated Author's Rebuttal
We thank the referee for the constructive review and for acknowledging the potential significance of ScenarioControl for multimodal controllable scenario generation in autonomous driving. We address the major comments point by point below, providing clarifications from the manuscript and committing to revisions that strengthen the presentation of our technical contributions and experimental validation.
Point-by-point responses
-
Referee: [§3.2] The cross-global control mechanism (cross-attention plus lightweight global-context branch) is presented as the key to connecting sparse multimodal prompts to the joint vectorized representation of map and agents. However, the description provides no explicit analysis, equations, or ablation results demonstrating how the global branch avoids diluting local actor-map interactions or producing averaged/non-reactive trajectories over long horizons. This directly bears on the central claims of temporal consistency and realism in the generated rollouts.
Authors: We appreciate the referee's focus on this core technical element. Section 3.2 of the manuscript details the cross-global control mechanism, including the equations for the cross-attention layers that map multimodal inputs to vectorized scene elements and the formulation of the lightweight global-context branch (implemented as a residual MLP operating on pooled scene features). The design explicitly uses separate pathways and residual connections to prevent the global branch from overriding local actor-map interactions. To make this analysis more explicit and directly address concerns about averaged or non-reactive trajectories, we will add a dedicated paragraph in §3.2 with mathematical reasoning on preservation of locality and an expanded ablation study in the experiments section. The ablation will quantify effects on trajectory reactivity (e.g., via variance in actor accelerations) and long-horizon temporal consistency (e.g., via multi-view box overlap metrics). Revision: yes.
-
Referee: [Experiments] The abstract and manuscript assert that extensive experiments validate superior control adherence and fidelity across all tested methods, yet no quantitative tables, specific metrics (e.g., control error, realism scores, temporal consistency measures), ablation studies on the global-context branch, or details on baseline implementations and post-hoc choices are referenced in the provided review materials. Without these, the load-bearing performance claims cannot be verified.
Authors: We regret if the quantitative results were not readily apparent in the review materials provided. The manuscript's Experiments section (Section 4) contains multiple tables with the requested metrics: control adherence is measured via prompt-to-scene alignment error and image-conditioned fidelity scores; realism is evaluated with adapted FID and perceptual metrics on generated maps/actors; temporal consistency is reported via long-horizon rollout stability across actor perspectives. Ablation results on the global-context branch appear in a dedicated table, and baseline implementations (with post-hoc choices and hyperparameters) are detailed in Appendix B and the supplementary material. We will revise the main text to add explicit cross-references to these tables and metrics in both the abstract and §4, and expand the global-branch ablation to align with the new analysis in §3.2. If any tables were inadvertently omitted from the review copy, we will ensure the revised submission includes complete, clearly labeled results. Revision: yes.
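For concreteness, here is one hypothetical way to instantiate the quantities named in these exchanges: trajectory reactivity as the variance of actor accelerations, cross-view temporal consistency as mean IoU of the same actor's bird's-eye-view boxes rendered from two rollout perspectives, and the Fréchet distance underlying an adapted FID, with the feature extractor (e.g., a network over rasterized maps) left abstract. Array layouts and the axis-aligned box simplification are assumptions, not the paper's code.

```python
# Hypothetical implementations of the ablation/evaluation metrics named in
# the referee exchange: reactivity, multi-view consistency, and the Frechet
# distance behind an adapted FID score.
import numpy as np
from scipy import linalg

def acceleration_variance(trajectories: np.ndarray, dt: float = 0.1) -> float:
    """trajectories: (n_actors, n_steps, 2) xy positions per time step."""
    vel = np.diff(trajectories, axis=1) / dt    # (n_actors, n_steps-1, 2)
    acc = np.diff(vel, axis=1) / dt             # (n_actors, n_steps-2, 2)
    # Near-zero variance across time would suggest averaged, non-reactive motion.
    return float(np.linalg.norm(acc, axis=-1).var())

def bev_box_iou(a: np.ndarray, b: np.ndarray) -> float:
    """Axis-aligned BEV boxes as (x_min, y_min, x_max, y_max) in a shared frame."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def multi_view_consistency(boxes_a, boxes_b) -> float:
    """Mean IoU over actors matched by index across two rollout perspectives."""
    return float(np.mean([bev_box_iou(a, b) for a, b in zip(boxes_a, boxes_b)]))

def frechet_distance(feats_real: np.ndarray, feats_gen: np.ndarray) -> float:
    """Standard Frechet distance between Gaussian fits of (n, d) feature sets."""
    mu_r, mu_g = feats_real.mean(axis=0), feats_gen.mean(axis=0)
    cov_r = np.cov(feats_real, rowvar=False)
    cov_g = np.cov(feats_gen, rowvar=False)
    covmean = linalg.sqrtm(cov_r @ cov_g)   # matrix square root
    if np.iscomplexobj(covmean):            # discard numerical imaginary parts
        covmean = covmean.real
    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(cov_r + cov_g - 2.0 * covmean))
```

All three are cheap to compute per rollout, which would make the requested ablations on the global-context branch straightforward to report.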
Circularity Check
No significant circularity in derivation chain
Full rationale
The paper introduces ScenarioControl as a new vision-language control mechanism using cross-attention plus a lightweight global-context branch to map multimodal inputs onto a vectorized latent space for driving scenarios. No equations or steps in the provided abstract or description reduce the claimed outputs (diverse realistic rollouts, temporal consistency) to quantities defined by the method's own fitted parameters or by self-citation chains. The central claims rest on experimental validation against baselines and a released dataset with text annotations, which are independent of the internal control mechanism. This is a standard architectural proposal whose performance assertions are externally falsifiable and not forced by construction.