arxiv: 2605.29471 · v1 · pith:EOK4XZX4new · submitted 2026-05-28 · 💻 cs.CV

V2XCrafter: Learning to Generate Driving Scene Across Agents

Pith reviewed 2026-06-29 08:52 UTC · model grok-4.3

classification 💻 cs.CV
keywords V2X · driving scene generation · diffusion model · collaborative perception · multi-agent · 3D object detection · cross-agent attention

The pith

V2XCrafter generates high-fidelity collaborative driving scenes with consistency across multiple vehicle views.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Collaborative driving relies on V2X communication but lacks enough real data for training. Current image generation methods lose quality and fail to keep object properties consistent when extended to multiple agents. V2XCrafter addresses this by extending a single-agent diffusion model into a progressive multi-agent version that uses other agents' latent states as guides. It adds a cross-agent attention module to link views through a graph and shared object representations. The generated scenes improve performance on collaborative 3D object detection.

Core claim

V2XCrafter is presented as the first framework for generating controllable and realistic collaborative driving scenes across agents' camera views. It builds a progressive multi-agent diffusion model on a single-agent backbone, using neighboring agents' latent states as reference signals to guide the transition from single-agent to multi-agent generation. To address cross-vehicle inconsistency, it proposes a cross-agent attention module that uses a collaboration view graph and a learnable representation of jointly observed objects to model the dynamic relationships between agents' camera views.

What carries the argument

Progressive multi-agent diffusion model guided by neighboring agents' latent states as reference signals, combined with a cross-agent attention module using a collaboration view graph and learnable jointly observed object representations.
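
One way to picture the cross-agent attention module: treat each camera view as a node, link views that jointly observe an object, and let a view attend only to itself and its linked views. The Python sketch below is illustrative only. The abstract gives no formulation, so the edge rule (at least one shared object), the function names `build_view_graph` and `cross_view_attention`, the view identifiers, and the one-vector-per-view features are all assumptions, not the authors' design.

```python
import math


def build_view_graph(observed):
    """observed: dict mapping view id -> set of object ids visible in that view.

    Assumed edge rule: two views are linked when they share at least one object.
    Returns a dict mapping each view id to the set of linked view ids.
    """
    views = sorted(observed)
    return {
        v: {u for u in views if u != v and observed[v] & observed[u]}
        for v in views
    }


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_view_attention(features, graph, view):
    """Scaled dot-product attention from `view` to itself and its graph neighbours.

    features: dict mapping view id -> feature vector (list of floats).
    Returns the fused feature vector and the attention weight per attended view.
    """
    keys = [view] + sorted(graph[view])
    query = features[view]
    scale = math.sqrt(len(query))
    scores = [
        sum(q * k for q, k in zip(query, features[key])) / scale for key in keys
    ]
    weights = softmax(scores)
    fused = [
        sum(w * features[key][i] for w, key in zip(weights, keys))
        for i in range(len(query))
    ]
    return fused, dict(zip(keys, weights))


# Hypothetical scene: two vehicles share object 2; a roadside camera shares nothing.
observed = {"car_A/front": {1, 2}, "car_B/front": {2, 3}, "rsu/cam0": {4}}
features = {
    "car_A/front": [1.0, 0.0],
    "car_B/front": [0.0, 1.0],
    "rsu/cam0": [0.5, 0.5],
}
graph = build_view_graph(observed)
fused, weights = cross_view_attention(features, graph, "car_A/front")
```

Restricting attention to graph neighbours keeps unrelated views (here the hypothetical roadside camera) from influencing each other. The learnable object representation the paper describes is not modelled in this sketch.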

If this is right

  • High-fidelity and controllable street views can be generated.
  • Consistency across agents is achieved for physical attributes such as color and category of jointly observed objects.
  • The generated data effectively enhances downstream collaborative 3D object detection tasks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The method could supplement scarce real-world V2X datasets for training perception systems.
  • Similar progressive diffusion approaches might apply to other multi-view or multi-sensor generation problems in robotics.
  • Controllability could allow targeted simulation of rare driving scenarios.

Load-bearing premise

Neighboring agents' latent states can be used as reference signals to progressively guide single-to-multi diffusion while preserving generation quality.
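
To make the premise concrete, here is a toy Python sketch of one possible reading of "reference signals": an agent's latent is nudged toward the mean of its neighbours' latents, with a weight that ramps up over the denoising steps. The abstract does not say how the reference enters the model, so the linear ramp, the interpolation rule, and the names `reference_weight` and `guide_with_references` are assumptions made purely for illustration.

```python
def reference_weight(step, total_steps, max_weight=0.5):
    """Assumed linear ramp: 0 at the first step, max_weight at the last step."""
    if total_steps <= 1:
        return max_weight
    return max_weight * step / (total_steps - 1)


def guide_with_references(latent, neighbor_latents, weight):
    """Blend `latent` toward the element-wise mean of the neighbours' latents.

    latent: list of floats for the agent being generated.
    neighbor_latents: list of equally sized float lists (may be empty).
    weight: 0 keeps the latent unchanged, 1 replaces it with the reference mean.
    """
    if not neighbor_latents:
        return list(latent)
    count = len(neighbor_latents)
    reference = [sum(values) / count for values in zip(*neighbor_latents)]
    return [(1 - weight) * x + weight * r for x, r in zip(latent, reference)]


# Hypothetical two-dimensional latents for one agent and two neighbours.
latent = [0.0, 2.0]
neighbors = [[2.0, 2.0], [4.0, 2.0]]
for step in range(5):
    latent = guide_with_references(latent, neighbors, reference_weight(step, 5))
```

The sketch also shows where the premise could strain: if the neighbours' latents encode a very different viewpoint, any such blending pulls the agent's latent away from what its own view should contain.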

What would settle it

Generated scenes where the same object shows inconsistent color or category from different agents' perspectives would disprove the consistency claim.
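
The check described above can be sketched in a few lines of Python. The sketch assumes that per-agent attribute labels for each object (for example from a classifier run on the generated images) are already available; that input format, the agent identifiers, and the function names `find_inconsistent_objects` and `consistency_rate` are hypothetical and not part of the paper.

```python
from collections import defaultdict


def find_inconsistent_objects(annotations, attributes=("color", "category")):
    """Report jointly observed objects whose attributes differ across agents.

    annotations: list of (agent_id, object_id, attrs) tuples, where attrs is a
    dict such as {"color": "red", "category": "car"}.
    Returns {object_id: {attribute: {agent_id: value}}} for each conflict.
    """
    per_object = defaultdict(dict)
    for agent_id, object_id, attrs in annotations:
        per_object[object_id][agent_id] = attrs

    conflicts = {}
    for object_id, by_agent in per_object.items():
        if len(by_agent) < 2:  # seen by one agent only: nothing to compare
            continue
        for attribute in attributes:
            values = {agent: attrs.get(attribute) for agent, attrs in by_agent.items()}
            if len(set(values.values())) > 1:
                conflicts.setdefault(object_id, {})[attribute] = values
    return conflicts


def consistency_rate(annotations, attributes=("color", "category")):
    """Fraction of jointly observed objects with no attribute conflict.

    Returns None when no object is seen by two or more agents.
    """
    seen_by = defaultdict(set)
    for agent_id, object_id, _ in annotations:
        seen_by[object_id].add(agent_id)
    joint = [obj for obj, agents in seen_by.items() if len(agents) >= 2]
    if not joint:
        return None
    conflicts = find_inconsistent_objects(annotations, attributes)
    return 1 - len(conflicts) / len(joint)


# Hypothetical labels: object 7 changes color between the two agents.
annotations = [
    ("ego", 7, {"color": "red", "category": "car"}),
    ("cav1", 7, {"color": "blue", "category": "car"}),
    ("ego", 8, {"color": "white", "category": "truck"}),
    ("cav1", 8, {"color": "white", "category": "truck"}),
    ("ego", 9, {"color": "black", "category": "car"}),
]
conflicts = find_inconsistent_objects(annotations)
rate = consistency_rate(annotations)
```

Any non-empty `conflicts` result on generated scenes would count against the consistency claim. How reliable the attribute labels are is a separate question that this sketch does not address.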

Original abstract

Collaborative driving systems leverage vehicle-to-everything (V2X) communication for multi-agent collaborative perception to enhance driving safety, yet they remain constrained by scarce annotated real-world V2X driving datasets and limited generalization across diverse driving conditions. While image generation technology offers a feasible solution for data augmentation, existing methods tailored for single-vehicle multi-view scenarios face two fundamental challenges in multi-agent driving settings: (1) the expansion of the learning objective degrades generation quality, and (2) the highly dynamic variations across agents hinder the modeling of consistency for physical attributes (e.g., color, category) in jointly observed objects. To bridge this gap, we propose V2XCrafter, the first framework for generating controllable and realistic collaborative driving scene across agents' camera views. For effective learning, we develop a progressive multi-agent diffusion model based on a single-agent backbone, using neighboring agents' latent states as reference signals to progressively guide the single-to-multi diffusion. To address cross-vehicle inconsistency, we propose a cross-agent attention module that leverages a collaboration view graph and learnable jointly observed object representation to model the dynamic cross-agent camera view relationships. Experiments have shown that V2XCrafter can generate high-fidelity and controllable street views with consistency across agents, thereby effectively enhancing the downstream collaborative 3D object detection tasks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces V2XCrafter as the first framework for generating controllable, high-fidelity collaborative driving scenes across multiple agents' camera views in V2X settings. It develops a progressive multi-agent diffusion model that starts from a single-agent backbone and uses neighboring agents' latent states as reference signals to guide single-to-multi diffusion, together with a cross-agent attention module that employs a collaboration view graph and learnable jointly observed object representations to enforce consistency in physical attributes (color, category) of shared objects. The approach targets data scarcity and generalization issues in collaborative perception and is reported to improve downstream collaborative 3D object detection.

Significance. If the central claims hold, the work provides a practical route to augment limited real-world V2X datasets with consistent multi-view scenes, which could meaningfully improve training and generalization of collaborative perception models and thereby contribute to safer autonomous driving systems.

major comments (2)
  1. [Abstract and §3] The progressive multi-agent diffusion relies on injecting neighboring agents' latent states as reference signals while preserving generation quality. This assumption is load-bearing for the downstream detection claim; when viewpoint/motion differences are large, misalignment on dynamic objects can conflict with the denoising objective. The manuscript must supply concrete evidence (e.g., targeted ablations on occlusion or lighting variation cases) that the cross-agent attention module and collaboration view graph reliably prevent artifact introduction without degrading fidelity.
  2. [Abstract] The assertion that the generated scenes 'effectively enhanc[e] the downstream collaborative 3D object detection tasks' is central yet unsupported by any visible quantitative metrics, ablation tables, or baseline comparisons. Without these results, it is impossible to verify whether consistency gains translate to detection improvements or whether post-hoc design choices influenced the reported outcomes.
minor comments (1)
  1. The abstract states 'Experiments have shown' without any numerical highlights; adding one or two key metrics (e.g., FID, consistency scores, or detection mAP gains) would improve immediate readability.
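
For reference, FID, one of the metrics suggested above, is conventionally defined as the Fréchet distance between two Gaussians fitted to Inception-network features of the real and the generated images. The symbols below (means \mu_r, \mu_g and covariances \Sigma_r, \Sigma_g for the real and generated sets) are the usual notation for this metric, not notation taken from the paper. Lower values indicate closer feature distributions. This states the definition only and implies no value for V2XCrafter.

```latex
\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^2
  + \operatorname{Tr}\!\left( \Sigma_r + \Sigma_g - 2\,(\Sigma_r \Sigma_g)^{1/2} \right)
```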

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and will revise the manuscript to strengthen the evidence for our claims.

Point-by-point responses
  1. Referee: [Abstract and §3] The progressive multi-agent diffusion relies on injecting neighboring agents' latent states as reference signals while preserving generation quality. This assumption is load-bearing for the downstream detection claim; when viewpoint/motion differences are large, misalignment on dynamic objects can conflict with the denoising objective. The manuscript must supply concrete evidence (e.g., targeted ablations on occlusion or lighting variation cases) that the cross-agent attention module and collaboration view graph reliably prevent artifact introduction without degrading fidelity.

    Authors: We agree that targeted evidence under challenging conditions is necessary. While our experiments evaluate on V2X datasets with natural variations, we will add specific ablations on occlusion and lighting variation cases in the revised manuscript to demonstrate that the cross-agent attention module and collaboration view graph maintain consistency without introducing artifacts or degrading fidelity.

    revision: yes

  2. Referee: [Abstract] The assertion that the generated scenes 'effectively enhanc[e] the downstream collaborative 3D object detection tasks' is central yet unsupported by any visible quantitative metrics, ablation tables, or baseline comparisons. Without these results, it is impossible to verify whether consistency gains translate to detection improvements or whether post-hoc design choices influenced the reported outcomes.

    Authors: The full manuscript contains quantitative results and baseline comparisons for the downstream collaborative 3D object detection task in the experiments section. To improve clarity, we will revise the abstract to include key quantitative metrics and ensure the relevant tables are prominently referenced from the abstract and introduction.

    revision: yes

Circularity Check

0 steps flagged

No circularity: method introduces novel components without self-referential reduction

full rationale

The paper describes a progressive multi-agent diffusion model built on a single-agent backbone and a cross-agent attention module using a collaboration view graph, but provides no equations, no fitted parameters renamed as predictions, and no load-bearing self-citations in the abstract or described sections. The central claims about consistency across agents and downstream detection gains are presented as outcomes of the new architecture rather than definitions or fits that reduce to the inputs by construction. This is a standard empirical method paper whose derivation chain remains independent of the target results.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The abstract-only input supplies no explicit free parameters, axioms, or invented entities; the full paper would be required to enumerate diffusion schedule choices, attention hyperparameters, or any new object representations.
