V2XCrafter: Learning to Generate Driving Scene Across Agents
Pith reviewed 2026-06-29 08:52 UTC · model grok-4.3
The pith
V2XCrafter generates high-fidelity collaborative driving scenes with consistency across multiple vehicle views.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
V2XCrafter is the first framework for generating controllable and realistic collaborative driving scenes across agents' camera views. It develops a progressive multi-agent diffusion model based on a single-agent backbone, using neighboring agents' latent states as reference signals to progressively guide the single-to-multi diffusion. To address cross-vehicle inconsistency, it proposes a cross-agent attention module that leverages a collaboration view graph and learnable jointly observed object representation to model the dynamic cross-agent camera view relationships.
What carries the argument
A progressive multi-agent diffusion model, guided by neighboring agents' latent states as reference signals, combined with a cross-agent attention module that uses a collaboration view graph and learnable jointly observed object representations.
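To make the graph-restricted attention idea concrete, here is a minimal sketch, not the paper's implementation. It assumes each agent holds a small list of latent token vectors and that the collaboration view graph is a boolean adjacency matrix marking which agents share part of their view. The names `cross_agent_attention`, `tokens`, and `view_graph` are hypothetical, and the learned query/key/value projections and the jointly observed object representations are omitted for brevity.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_agent_attention(tokens, view_graph):
    """Attend from each agent's tokens to its own and its graph neighbors' tokens.

    tokens[a] is a list of D-dimensional token vectors for agent a.
    view_graph[a][b] is True when agents a and b are assumed to share a view.
    Identity projections are used here (an illustrative simplification).
    """
    num_agents = len(tokens)
    dim = len(tokens[0][0])
    output = []
    for a in range(num_agents):
        # Keys and values: the agent's own tokens plus those of its neighbors.
        visible = [b for b in range(num_agents) if b == a or view_graph[a][b]]
        keys = [tok for b in visible for tok in tokens[b]]
        agent_out = []
        for query in tokens[a]:
            # Scaled dot-product scores against every visible key.
            scores = [
                sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
                for key in keys
            ]
            weights = softmax(scores)
            # Weighted sum of the visible tokens (values equal keys here).
            agent_out.append(
                [sum(w * key[d] for w, key in zip(weights, keys)) for d in range(dim)]
            )
        output.append(agent_out)
    return output
```

In this sketch an agent with no edges in the graph attends only to its own tokens, so its output is unaffected by what other agents generate. That is the property a view graph is meant to provide: information flows only between agents whose views overlap.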
If this is right
- High-fidelity and controllable street views can be generated.
- Consistency across agents is achieved for physical attributes such as color and category of jointly observed objects.
- The generated data effectively enhances downstream collaborative 3D object detection tasks.
Where Pith is reading between the lines
- The method could supplement scarce real-world V2X datasets for training perception systems.
- Similar progressive diffusion approaches might apply to other multi-view or multi-sensor generation problems in robotics.
- Controllability could allow targeted simulation of rare driving scenarios.
Load-bearing premise
Neighboring agents' latent states can be used as reference signals to progressively guide single-to-multi diffusion while preserving generation quality.
What would settle it
Generated scenes where the same object shows inconsistent color or category from different agents' perspectives would disprove the consistency claim.
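One way such a check could be run is sketched below. It assumes that some upstream step (annotation or an attribute classifier, neither specified in the source) yields a color and category for each object as seen by each agent. The function name `attribute_consistency` and the tuple layout are hypothetical.

```python
from collections import defaultdict


def attribute_consistency(observations):
    """Return (consistency_rate, inconsistent_object_ids).

    observations: iterable of (agent_id, object_id, color, category) tuples,
    one per object instance as seen in one agent's generated view.
    Only objects seen by at least two distinct agents are scored.
    """
    by_object = defaultdict(list)
    for agent_id, object_id, color, category in observations:
        by_object[object_id].append((agent_id, color, category))

    # Keep objects that are jointly observed by two or more agents.
    shared = {
        oid: obs
        for oid, obs in by_object.items()
        if len({agent for agent, _, _ in obs}) >= 2
    }
    # An object is inconsistent if agents disagree on color or category.
    inconsistent = sorted(
        oid
        for oid, obs in shared.items()
        if len({(color, category) for _, color, category in obs}) > 1
    )
    rate = 1.0 - len(inconsistent) / len(shared) if shared else 1.0
    return rate, inconsistent


observations = [
    ("ego", "car_1", "red", "car"),
    ("cav", "car_1", "red", "car"),
    ("ego", "car_2", "blue", "car"),
    ("cav", "car_2", "white", "car"),   # color disagrees across agents
    ("ego", "ped_1", "n/a", "pedestrian"),  # seen by one agent only, not scored
]
rate, bad = attribute_consistency(observations)
print(rate, bad)  # 0.5 ['car_2']
```

A rate well below 1.0 on jointly observed objects would count against the consistency claim; how the attributes are extracted from generated images is left open here.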
Original abstract
Collaborative driving systems leverage vehicle-to-everything (V2X) communication for multi-agent collaborative perception to enhance driving safety, yet they remain constrained by scarce annotated real-world V2X driving datasets and limited generalization across diverse driving conditions. While image generation technology offers a feasible solution for data augmentation, existing methods tailored for single-vehicle multi-view scenarios face two fundamental challenges in multi-agent driving settings: (1) the expansion of the learning objective degrades generation quality, and (2) the highly dynamic variations across agents hinder the modeling of consistency for physical attributes (e.g., color, category) in jointly observed objects. To bridge this gap, we propose V2XCrafter, the first framework for generating controllable and realistic collaborative driving scene across agents' camera views. For effective learning, we develop a progressive multi-agent diffusion model based on a single-agent backbone, using neighboring agents' latent states as reference signals to progressively guide the single-to-multi diffusion. To address cross-vehicle inconsistency, we propose a cross-agent attention module that leverages a collaboration view graph and learnable jointly observed object representation to model the dynamic cross-agent camera view relationships. Experiments have shown that V2XCrafter can generate high-fidelity and controllable street views with consistency across agents, thereby effectively enhancing the downstream collaborative 3D object detection tasks.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces V2XCrafter as the first framework for generating controllable, high-fidelity collaborative driving scenes across multiple agents' camera views in V2X settings. It develops a progressive multi-agent diffusion model that starts from a single-agent backbone and uses neighboring agents' latent states as reference signals to guide single-to-multi diffusion, together with a cross-agent attention module that employs a collaboration view graph and learnable jointly observed object representations to enforce consistency in physical attributes (color, category) of shared objects. The approach targets data scarcity and generalization issues in collaborative perception and is reported to improve downstream collaborative 3D object detection.
Significance. If the central claims hold, the work provides a practical route to augment limited real-world V2X datasets with consistent multi-view scenes, which could meaningfully improve training and generalization of collaborative perception models and thereby contribute to safer autonomous driving systems.
Major comments (2)
- [Abstract and §3] The progressive multi-agent diffusion relies on injecting neighboring agents' latent states as reference signals while preserving generation quality. This assumption is load-bearing for the downstream detection claim; when viewpoint/motion differences are large, misalignment on dynamic objects can conflict with the denoising objective. The manuscript must supply concrete evidence (e.g., targeted ablations on occlusion or lighting variation cases) that the cross-agent attention module and collaboration view graph reliably prevent artifact introduction without degrading fidelity.
- [Abstract] The assertion that the generated scenes 'effectively enhanc[e] the downstream collaborative 3D object detection tasks' is central yet unsupported by any visible quantitative metrics, ablation tables, or baseline comparisons. Without these results, it is impossible to verify whether consistency gains translate to detection improvements or whether post-hoc design choices influenced the reported outcomes.
Minor comments (1)
- The abstract states 'Experiments have shown' without any numerical highlights; adding one or two key metrics (e.g., FID, consistency scores, or detection mAP gains) would improve immediate readability.
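For reference, the FID mentioned in the comment above is commonly defined as the Fréchet distance between Gaussian fits to Inception features of real and generated images. The symbols below are the standard ones, not notation taken from the manuscript: $\mu_r, \Sigma_r$ are the feature mean and covariance for real images, and $\mu_g, \Sigma_g$ those for generated images.

```latex
\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^2
  + \operatorname{Tr}\!\left( \Sigma_r + \Sigma_g - 2\,(\Sigma_r \Sigma_g)^{1/2} \right)
```

Lower values indicate that the generated feature distribution is closer to the real one. FID alone does not measure cross-agent consistency, which is why the comment lists it alongside consistency scores and detection mAP.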
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address each major comment below and will revise the manuscript to strengthen the evidence for our claims.
- Referee: [Abstract and §3] The progressive multi-agent diffusion relies on injecting neighboring agents' latent states as reference signals while preserving generation quality. This assumption is load-bearing for the downstream detection claim; when viewpoint/motion differences are large, misalignment on dynamic objects can conflict with the denoising objective. The manuscript must supply concrete evidence (e.g., targeted ablations on occlusion or lighting variation cases) that the cross-agent attention module and collaboration view graph reliably prevent artifact introduction without degrading fidelity.
  Authors: We agree that targeted evidence under challenging conditions is necessary. While our experiments evaluate on V2X datasets with natural variations, we will add specific ablations on occlusion and lighting variation cases in the revised manuscript to demonstrate that the cross-agent attention module and collaboration view graph maintain consistency without introducing artifacts or degrading fidelity.
  Revision: yes
- Referee: [Abstract] The assertion that the generated scenes 'effectively enhanc[e] the downstream collaborative 3D object detection tasks' is central yet unsupported by any visible quantitative metrics, ablation tables, or baseline comparisons. Without these results, it is impossible to verify whether consistency gains translate to detection improvements or whether post-hoc design choices influenced the reported outcomes.
  Authors: The full manuscript contains quantitative results and baseline comparisons for the downstream collaborative 3D object detection task in the experiments section. To improve clarity, we will revise the abstract to include key quantitative metrics and ensure the relevant tables are prominently referenced from the abstract and introduction.
  Revision: yes
Circularity Check
No circularity: method introduces novel components without self-referential reduction
The paper describes a progressive multi-agent diffusion model built on a single-agent backbone and a cross-agent attention module using a collaboration view graph, but provides no equations, no fitted parameters renamed as predictions, and no load-bearing self-citations in the abstract or described sections. The central claims about consistency across agents and downstream detection gains are presented as outcomes of the new architecture rather than definitions or fits that reduce to the inputs by construction. This is a standard empirical method paper whose derivation chain remains independent of the target results.