V2XCrafter: Learning to Generate Driving Scene Across Agents

Sam Kwong; Senkang Hu; Yanan Ma; Yihang Tao; Yuguang Fang; Yu Guo; Zihan Fang

arxiv: 2605.29471 · v1 · pith:EOK4XZX4new · submitted 2026-05-28 · 💻 cs.CV

Yihang Tao, Yu Guo, Senkang Hu, Yanan Ma, Zihan Fang, Sam Kwong, Yuguang Fang

Pith reviewed 2026-06-29 08:52 UTC · model grok-4.3

classification 💻 cs.CV

keywords V2X · driving scene generation · diffusion model · collaborative perception · multi-agent · 3D object detection · cross-agent attention

The pith

V2XCrafter generates high-fidelity collaborative driving scenes with consistency across multiple vehicle views.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Collaborative driving relies on V2X communication but lacks enough real data for training. Current image generation methods lose quality and fail to keep object properties consistent when extended to multiple agents. V2XCrafter addresses this by extending a single-agent diffusion model into a progressive multi-agent version that uses other agents' latent states as guides. It adds a cross-agent attention module to link views through a graph and shared object representations. The generated scenes improve performance on collaborative 3D object detection.

Core claim

V2XCrafter is the first framework for generating controllable and realistic collaborative driving scenes across agents' camera views. It develops a progressive multi-agent diffusion model based on a single-agent backbone, using neighboring agents' latent states as reference signals to progressively guide the single-to-multi diffusion. To address cross-vehicle inconsistency, it proposes a cross-agent attention module that leverages a collaboration view graph and learnable jointly observed object representation to model the dynamic cross-agent camera view relationships.

What carries the argument

Progressive multi-agent diffusion model guided by neighboring agents' latent states as reference signals, combined with a cross-agent attention module using a collaboration view graph and learnable jointly observed object representations.
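
As an illustration only, the graph-restricted attention idea can be sketched as follows. This is not the authors' implementation: the function name `cross_agent_attention`, the boolean `view_graph` adjacency, and the token layout are assumptions made for this sketch, and the learnable jointly observed object representations are left out. It only shows how a view graph can limit which camera views exchange information.

```python
import math


def cross_agent_attention(tokens, view_graph):
    """Toy single-head attention restricted by a view graph.

    tokens[v] is a list of feature vectors (lists of floats) for camera view v.
    view_graph[i][j] is True when view i may attend to view j.
    Both structures are hypothetical; the paper's actual graph is not specified here.
    """
    n_views = len(tokens)
    dim = len(tokens[0][0])
    out = []
    for i in range(n_views):
        # A view always attends to itself, plus any neighbor allowed by the graph.
        allowed = [j for j in range(n_views) if j == i or view_graph[i][j]]
        keys = [k for j in allowed for k in tokens[j]]
        view_out = []
        for q in tokens[i]:
            # Scaled dot-product scores against every allowed key.
            scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in keys]
            # Numerically stable softmax.
            m = max(scores)
            exps = [math.exp(s - m) for s in scores]
            z = sum(exps)
            weights = [e / z for e in exps]
            # Keys double as values in this sketch (no learned projections).
            view_out.append(
                [sum(w * k[d] for w, k in zip(weights, keys)) for d in range(dim)]
            )
        out.append(view_out)
    return out


# Views 0 and 1 overlap; view 2 shares nothing with the others.
tokens = [[[1.0, 0.0]], [[0.0, 1.0]], [[1.0, 2.0]]]
graph = [[False, True, False], [True, False, False], [False, False, False]]
mixed = cross_agent_attention(tokens, graph)
```

In this toy setup, the isolated view 2 is returned unchanged, while views 0 and 1 each become a weighted blend of both tokens. A real module would add learned query, key, and value projections and multiple heads.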

If this is right

High-fidelity and controllable street views can be generated.
Consistency across agents is achieved for physical attributes such as color and category of jointly observed objects.
The generated data effectively enhances downstream collaborative 3D object detection tasks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The method could supplement scarce real-world V2X datasets for training perception systems.
Similar progressive diffusion approaches might apply to other multi-view or multi-sensor generation problems in robotics.
Controllability could allow targeted simulation of rare driving scenarios.

Load-bearing premise

Neighboring agents' latent states can be used as reference signals to progressively guide single-to-multi diffusion while preserving generation quality.

What would settle it

Generated scenes where the same object shows inconsistent color or category from different agents' perspectives would disprove the consistency claim.
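
A check of this kind could be sketched as below. It assumes, hypothetically, that some attribute classifier has already labeled each generated object per agent with a color and category; the tuple layout and the function name `inconsistent_objects` are illustrative and do not come from the paper.

```python
from collections import defaultdict


def inconsistent_objects(observations):
    """Find jointly observed objects whose attributes disagree across agents.

    observations: iterable of (agent_id, object_id, color, category) tuples.
    Returns {object_id: set of distinct (color, category) pairs} for every
    object seen by at least two agents with more than one distinct pair.
    """
    per_object = defaultdict(dict)
    for agent_id, object_id, color, category in observations:
        per_object[object_id][agent_id] = (color, category)

    flagged = {}
    for object_id, by_agent in per_object.items():
        distinct = set(by_agent.values())
        # Only objects seen by two or more agents can be inconsistent.
        if len(by_agent) >= 2 and len(distinct) > 1:
            flagged[object_id] = distinct
    return flagged


observations = [
    ("agent_a", "car_1", "red", "car"),
    ("agent_b", "car_1", "red", "car"),
    ("agent_a", "car_2", "blue", "car"),
    ("agent_b", "car_2", "white", "car"),  # color disagrees across agents
    ("agent_a", "ped_1", "n/a", "pedestrian"),  # seen by one agent only
]
flagged = inconsistent_objects(observations)
```

Here only `car_2` is flagged. The fraction of flagged objects among all jointly observed ones would give a simple consistency score, though its reliability depends on the attribute classifier.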

Original abstract

Collaborative driving systems leverage vehicle-to-everything (V2X) communication for multi-agent collaborative perception to enhance driving safety, yet they remain constrained by scarce annotated real-world V2X driving datasets and limited generalization across diverse driving conditions. While image generation technology offers a feasible solution for data augmentation, existing methods tailored for single-vehicle multi-view scenarios face two fundamental challenges in multi-agent driving settings: (1) the expansion of the learning objective degrades generation quality, and (2) the highly dynamic variations across agents hinder the modeling of consistency for physical attributes (e.g., color, category) in jointly observed objects. To bridge this gap, we propose V2XCrafter, the first framework for generating controllable and realistic collaborative driving scene across agents' camera views. For effective learning, we develop a progressive multi-agent diffusion model based on a single-agent backbone, using neighboring agents' latent states as reference signals to progressively guide the single-to-multi diffusion. To address cross-vehicle inconsistency, we propose a cross-agent attention module that leverages a collaboration view graph and learnable jointly observed object representation to model the dynamic cross-agent camera view relationships. Experiments have shown that V2XCrafter can generate high-fidelity and controllable street views with consistency across agents, thereby effectively enhancing the downstream collaborative 3D object detection tasks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note

Private letter to a colleague

V2XCrafter claims first multi-agent V2X scene generation via progressive diffusion plus cross-agent attention, but the abstract supplies no numbers or ablations to check whether consistency or downstream gains actually hold.

The main thing here is a new technical combination aimed at a real bottleneck: generating consistent camera views across multiple vehicles when real V2X data is scarce. The progressive single-to-multi diffusion that starts from a single-agent backbone and injects neighboring latents as references, paired with the collaboration view graph and learnable joint-object representations in the attention module, is presented as the fix for both quality drop-off and physical-attribute drift. That framing is new relative to single-vehicle multi-view generators.

What the work does cleanly is name the two concrete problems—objective expansion hurting fidelity, and viewpoint/motion differences breaking consistency on shared objects—and tie them directly to the V2X setting. The downstream claim that the generated data improves collaborative 3D detection is the practical hook.

The soft spot is the complete absence of any quantitative evidence in the supplied abstract. No FID or detection mAP deltas, no ablation on the graph construction or the progressive schedule, and no discussion of how the method behaves when dynamic objects are only partially shared or when lighting differs sharply between agents. The stress-test worry about latent misalignment introducing artifacts on transients is therefore still open; if the full experiments do not show that the attention module keeps color/category stable without new hallucinations, the consistency story weakens. Citation pattern looks standard for the subfield, with no obvious self-loop.

This is for people working on data augmentation for multi-agent perception or on controllable diffusion for driving scenes. A reader who already knows the single-agent baselines will get the most out of it. The idea is coherent enough and the problem important enough that it should go to serious referees rather than desk reject; they can check whether the claimed gains survive the dynamic cases the abstract flags.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces V2XCrafter as the first framework for generating controllable, high-fidelity collaborative driving scenes across multiple agents' camera views in V2X settings. It develops a progressive multi-agent diffusion model that starts from a single-agent backbone and uses neighboring agents' latent states as reference signals to guide single-to-multi diffusion, together with a cross-agent attention module that employs a collaboration view graph and learnable jointly observed object representations to enforce consistency in physical attributes (color, category) of shared objects. The approach targets data scarcity and generalization issues in collaborative perception and is reported to improve downstream collaborative 3D object detection.

Significance. If the central claims hold, the work provides a practical route to augment limited real-world V2X datasets with consistent multi-view scenes, which could meaningfully improve training and generalization of collaborative perception models and thereby contribute to safer autonomous driving systems.

major comments (2)

[Abstract and §3] The progressive multi-agent diffusion relies on injecting neighboring agents' latent states as reference signals while preserving generation quality. This assumption is load-bearing for the downstream detection claim; when viewpoint/motion differences are large, misalignment on dynamic objects can conflict with the denoising objective. The manuscript must supply concrete evidence (e.g., targeted ablations on occlusion or lighting variation cases) that the cross-agent attention module and collaboration view graph reliably prevent artifact introduction without degrading fidelity.
[Abstract] The assertion that the generated scenes 'effectively enhanc[e] the downstream collaborative 3D object detection tasks' is central yet unsupported by any visible quantitative metrics, ablation tables, or baseline comparisons. Without these results, it is impossible to verify whether consistency gains translate to detection improvements or whether post-hoc design choices influenced the reported outcomes.

minor comments (1)

The abstract states 'Experiments have shown' without any numerical highlights; adding one or two key metrics (e.g., FID, consistency scores, or detection mAP gains) would improve immediate readability.
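
For reference, the FID mentioned here is the Fréchet distance between Gaussian fits to Inception features of real and generated images. With feature means $\mu_r, \mu_g$ and covariances $\Sigma_r, \Sigma_g$ for the real and generated sets (symbols chosen for this note, not taken from the paper):

```latex
\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^2
  + \operatorname{Tr}\!\left( \Sigma_r + \Sigma_g - 2\,(\Sigma_r \Sigma_g)^{1/2} \right)
```

Lower values indicate closer feature statistics. FID is computed per image set and does not by itself measure cross-agent consistency, so a separate consistency score would still be needed.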

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and will revise the manuscript to strengthen the evidence for our claims.

Referee: [Abstract and §3] The progressive multi-agent diffusion relies on injecting neighboring agents' latent states as reference signals while preserving generation quality. This assumption is load-bearing for the downstream detection claim; when viewpoint/motion differences are large, misalignment on dynamic objects can conflict with the denoising objective. The manuscript must supply concrete evidence (e.g., targeted ablations on occlusion or lighting variation cases) that the cross-agent attention module and collaboration view graph reliably prevent artifact introduction without degrading fidelity.

Authors: We agree that targeted evidence under challenging conditions is necessary. While our experiments evaluate on V2X datasets with natural variations, we will add specific ablations on occlusion and lighting variation cases in the revised manuscript to demonstrate that the cross-agent attention module and collaboration view graph maintain consistency without introducing artifacts or degrading fidelity. (Revision: yes)
Referee: [Abstract] The assertion that the generated scenes 'effectively enhanc[e] the downstream collaborative 3D object detection tasks' is central yet unsupported by any visible quantitative metrics, ablation tables, or baseline comparisons. Without these results, it is impossible to verify whether consistency gains translate to detection improvements or whether post-hoc design choices influenced the reported outcomes.

Authors: The full manuscript contains quantitative results and baseline comparisons for the downstream collaborative 3D object detection task in the experiments section. To improve clarity, we will revise the abstract to include key quantitative metrics and ensure the relevant tables are prominently referenced from the abstract and introduction. (Revision: yes)

Circularity Check

0 steps flagged

No circularity: method introduces novel components without self-referential reduction

The paper describes a progressive multi-agent diffusion model built on a single-agent backbone and a cross-agent attention module using a collaboration view graph, but provides no equations, no fitted parameters renamed as predictions, and no load-bearing self-citations in the abstract or described sections. The central claims about consistency across agents and downstream detection gains are presented as outcomes of the new architecture rather than definitions or fits that reduce to the inputs by construction. This is a standard empirical method paper whose derivation chain remains independent of the target results.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only input supplies no explicit free parameters, axioms, or invented entities; full paper would be required to enumerate diffusion schedule choices, attention hyperparameters, or any new object representations.

Reference graph

Works this paper leans on

55 extracted references · 18 canonical work pages · 3 internal anchors

[2]

URLhttps://arxiv.org/abs/2502.13923

work page internal anchor Pith review Pith/arXiv arXiv
[3]

Lang, Sourabh Vora, VeniceErin Liong, Qiang Xu, AnushKrishnan, Yu Pan, Giancarlo Baldan, and Oscar Beijbom

HolgerCaesar, Varun Bankiti, Alex H. Lang, Sourabh Vora, VeniceErin Liong, Qiang Xu, AnushKrishnan, Yu Pan, Giancarlo Baldan, and Oscar Beijbom. nuscenes: A multimodal dataset for autonomous driving. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2020

2020
[4]

Geodiffu- sion: Text-prompted geometric control for object detection data generation

Kai Chen, Enze Xie, Zhe Chen, Yibo Wang, Lanqing HONG, Zhenguo Li, and Dit-Yan Yeung. Geodiffu- sion: Text-prompted geometric control for object detection data generation. InThe Twelfth Interna- tional Conference on Learning Representations, 2024. URLhttps://openreview.net/forum?id= xBfQZWeDRH

2024
[5]

CARLA: An Open Urban Driving Simulator

Alexey Dosovitskiy, German Ros, Felipe Codevilla, Antonio Lopez, and Vladlen Koltun. CARLA: An Open Urban Driving Simulator. InProceedings of the 1st Annual Conference on Robot Learning (CoRL), pages 1–16, 2017

2017
[6]

PACP: Priority-Aware Collaborative Perception for Connected and Autonomous Vehicles.IEEE Transactions on Mobile Computing, 23(12):15003–15018, 2024

Zhengru Fang, Senkang Hu, Haonan An, Yuang Zhang, Jingjing Wang, Hangcheng Cao, Xianhao Chen, and Yuguang Fang. PACP: Priority-Aware Collaborative Perception for Connected and Autonomous Vehicles.IEEE Transactions on Mobile Computing, 23(12):15003–15018, 2024

2024
[7]

Agent-Centric Observation Adaptation for Robust Visual Control under Dynamic Perturbations

Zhengru Fang, Yu Guo, Fei Liu, Yuang Zhang, Yihang Tao, Senkang Hu, Wenbo Ding, and Yuguang Fang. Agent-centric visual reinforcement learning under dynamic perturbations.arXiv preprint arXiv:2604.24661, 2026. URLhttps://arxiv.org/abs/2604.24661

work page internal anchor Pith review Pith/arXiv arXiv 2026
[8]

MagicDrive: Street view generation with diverse 3d geometry control

RuiyuanGao, KaiChen, EnzeXie, LanqingHong, ZhenguoLi, Dit-YanYeung, andQiangXu. MagicDrive: Street view generation with diverse 3d geometry control. InInternational Conference on Learning Representations, 2024

2024
[9]

MagicDrive-V2: High- resolution long video generation for autonomous driving with adaptive control

Ruiyuan Gao, Kai Chen, Bo Xiao, Lanqing Hong, Zhenguo Li, and Qiang Xu. MagicDrive-V2: High- resolution long video generation for autonomous driving with adaptive control. InProceedings of the IEEE/CVF International Conference on Computer Vision, 2025

2025
[10]

Onerestore: A universal restoration framework for composite degradation

Yu Guo, Yuan Gao, Yuxu Lu, Huilin Zhu, Ryan Wen Liu, and Shengfeng He. Onerestore: A universal restoration framework for composite degradation. InEuropean conference on computer vision, pages 255–272. Springer, 2024

2024
[11]

Neptune-x: Active x-to-maritime generation for universal maritime object detection

Yu Guo, Shengfeng He, Yuxu Lu, Haonan An, Yihang Tao, Huilin Zhu, Jingxian Liu, and Yuguang Fang. Neptune-x: Active x-to-maritime generation for universal maritime object detection. InAnnual Conference on Neural Information Processing Systems, 2025

2025
[12]

Deep residual learning for image recognition

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016

2016
[13]

Gans trained by a two time-scale update rule converge to a local nash equilibrium

Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilibrium. InProceedings of the 31st 15/35 V2XCrafter: Learning to Generate Driving Scene Across Agents International Conference on Neural Information Processing Systems, NIPS’17, pages 6629–664...

2017
[14]

Denoising diffusion probabilistic models

Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. InProceedings of the 34th International Conference on Neural Information Processing Systems, NIPS ’20, Red Hook, NY, USA, 2020. Curran Associates Inc. ISBN 9781713829546

2020
[15]

Collaborative perception for connected and autonomous driving: Challenges, possible solutions and opportunities.IEEE Wireless Communications, 32(5):228–234, 2025

Senkang Hu, Zhengru Fang, Yiqin Deng, Xianhao Chen, and Yuguang Fang. Collaborative perception for connected and autonomous driving: Challenges, possible solutions and opportunities.IEEE Wireless Communications, 32(5):228–234, 2025. doi: 10.1109/MWC.002.2400348

work page doi:10.1109/mwc.002.2400348 2025
[16]

CP- Guard: Malicious agent detection and defense in collaborative bird’s eye view perception.Proceedings of the AAAI Conference on Artificial Intelligence, 39(22):23203–23211, Apr

Senkang Hu, Yihang Tao, Guowen Xu, Yiqin Deng, Xianhao Chen, Yuguang Fang, and Sam Kwong. CP- Guard: Malicious agent detection and defense in collaborative bird’s eye view perception.Proceedings of the AAAI Conference on Artificial Intelligence, 39(22):23203–23211, Apr. 2025

2025
[17]

Fourier position embedding: Enhancing attention’s periodic extension for length generalization

Ermo Hua, Che Jiang, Xingtai Lv, Kaiyan Zhang, Youbang Sun, Yuchen Fan, Xuekai Zhu, Biqing Qi, Ning Ding, and Bowen Zhou. Fourier position embedding: Enhancing attention’s periodic extension for length generalization. InForty-second International Conference on Machine Learning, 2025. URL https://openreview.net/forum?id=ZfDNDkg7Dh

2025
[18]

Exponentialtaillocalrademachercomplexity risk bounds without the bernstein condition.J

VarunKanade, PatrickRebeschini, andTomaszVaškevičiu. Exponentialtaillocalrademachercomplexity risk bounds without the bernstein condition.J. Mach. Learn. Res., 25(1), January 2024. ISSN 1532- 4435

2024
[19]

Multi-concept customization of text-to-image diffusion

Nupur Kumari, Bingliang Zhang, Richard Zhang, Eli Shechtman, and Jun-Yan Zhu. Multi-concept customization of text-to-image diffusion. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023

2023
[20]

Controllable text-to-image gen- eration

Bowen Li, Xiaojuan Qi, Thomas Lukasiewicz, and Philip Torr. Controllable text-to-image gen- eration. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Gar- nett, editors,Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019. URL https://proceedings.neurips.cc/paper_files/paper/2019/file/ 1d7...

2019
[21]

Instruct2see: Learningtoremoveanyobstructions across distributions.arXiv preprint arXiv:2505.17649, 2025

JunhangLi,YuGuo,ChuhuaXian,andShengfengHe. Instruct2see: Learningtoremoveanyobstructions across distributions.arXiv preprint arXiv:2505.17649, 2025. URLhttps://arxiv.org/abs/2505. 17649

work page arXiv 2025
[22]

Gligen: Open-set grounded text-to-image generation.CVPR, 2023

Yuheng Li, Haotian Liu, Qingyang Wu, Fangzhou Mu, Jianwei Yang, Jianfeng Gao, Chunyuan Li, and Yong Jae Lee. Gligen: Open-set grounded text-to-image generation.CVPR, 2023

2023
[23]

Conflict-averse gradient descent for multi-task learning

Bo Liu, Xingchao Liu, Xiaojie Jin, Peter Stone, and Qiang Liu. Conflict-averse gradient descent for multi-task learning. In M. Ranzato, A. Beygelzimer, Y. Dauphin, P.S. Liang, and J. Wortman Vaughan, editors,Advances in Neural Information Processing Systems, volume 34, pages 18878–18890. Curran As- sociates, Inc., 2021. URLhttps://proceedings.neurips.cc/p...

2021
[24]

Toward collaborative autonomous driving: Simulation platform and end-to-end system.IEEE Transactions on Pattern Analysis and Machine Intelligence, 47(8): 6566–6584, 2025

Genjia Liu, Yue Hu, Chenxin Xu, Weibo Mao, Junhao Ge, Zhengxiang Huang, Yifan Lu, Yinda Xu, Junkai Xia, Yafei Wang, and Siheng Chen. Toward collaborative autonomous driving: Simulation platform and end-to-end system.IEEE Transactions on Pattern Analysis and Machine Intelligence, 47(8): 6566–6584, 2025. doi: 10.1109/TPAMI.2025.3560327. 16/35 V2XCrafter: Le...

work page doi:10.1109/tpami.2025.3560327 2025
[25]

Bev- fusion: Multi-task multi-sensor fusion with unified bird’s-eye view representation

Zhijian Liu, Haotian Tang, Alexander Amini, Xingyu Yang, Huizi Mao, Daniela Rus, and Song Han. Bev- fusion: Multi-task multi-sensor fusion with unified bird’s-eye view representation. InIEEE International Conference on Robotics and Automation (ICRA), 2023

2023
[26]

Sense4fl: Vehic- ular crowdsensing enhanced federated learning for object detection in autonomous driving.IEEE Transactions on Mobile Computing, pages 1–15, 2026

Yanan Ma, Senkang Hu, Zhengru Fang, Yun Ji, Yiqin Deng, and Yuguang Fang. Sense4fl: Vehic- ular crowdsensing enhanced federated learning for object detection in autonomous driving.IEEE Transactions on Mobile Computing, pages 1–15, 2026. doi: 10.1109/TMC.2026.3674333

work page doi:10.1109/tmc.2026.3674333 2026
[27]

Driveworld: 4d pre-trained scene understanding via world models for autonomous driving

Chen Min, Dawei Zhao, Liang Xiao, Jian Zhao, Xinli Xu, Zheng Zhu, Lei Jin, Jianshu Li, Yulan Guo, Junliang Xing, Liping Jing, Yiming Nie, and Bin Dai. Driveworld: 4d pre-trained scene understanding via world models for autonomous driving. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 15522–15533, June 2024

2024
[28]

Wein- berger, Bharath Hariharan, and Wei-Lun Chao

Tai-Yu Pan, Sooyoung Jeon, Mengdi Fan, Jinsu Yoo, Zhenyang Feng, Mark Campbell, Kilian Q. Wein- berger, Bharath Hariharan, and Wei-Lun Chao. Transfer your perspective: Controllable 3d generation from any viewpoint in a driving scene. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 12027–12036, June 2025

2025
[29]

Learning transferable visual models from natural language supervision

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision. In Marina Meila and Tong Zhang, editors, Proceedings of the 38th International Conference on Machine...

2021
[30]

Hierarchical Text-Conditional Image Generation with CLIP Latents

Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text- conditional image generation with clip latents.arXiv preprint arXiv:2204.06125, 2022. URL https://arxiv.org/abs/2204.06125

work page internal anchor Pith review Pith/arXiv arXiv 2022
[31]

High-resolution image synthesis with latent diffusion models

Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 10684–10695, June 2022

2022
[32]

Photorealistic text-to-image diffusion models with deep language understanding

Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily L Denton, Kamyar Ghasemipour, Raphael Gontijo Lopes, Burcu Karagol Ayan, Tim Salimans, Jonathan Ho, David J Fleet, and Mohammad Norouzi. Photorealistic text-to-image diffusion models with deep language understanding. In S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. O...

2022
[33]

Scalability in perception for autonomous driving: Waymo open dataset

Pei Sun, Henrik Kretzschmar, Xerxes Dotiwalla, Aurelien Chouard, Vijaysai Patnaik, Paul Tsui, James Guo, Yin Zhou, Yuning Chai, Benjamin Caine, Vijay Vasudevan, Wei Han, Jiquan Ngiam, Hang Zhao, Aleksei Timofeev, Scott Ettinger, Maxim Krivokon, Amy Gao, Aditya Joshi, Yu Zhang, Jonathon Shlens, Zhifeng Chen, and Dragomir Anguelov. Scalability in perception...

2020
[34]

Street-viewimagegenerationfromabird’s-eyeview layout.IEEE Robotics and Automation Letters, 9(4):3578–3585, 2024

AlexanderSwerdlow, RunshengXu, andBoleiZhou. Street-viewimagegenerationfromabird’s-eyeview layout.IEEE Robotics and Automation Letters, 9(4):3578–3585, 2024. doi: 10.1109/LRA.2024.3368234. 17/35 V2XCrafter: Learning to Generate Driving Scene Across Agents

work page doi:10.1109/lra.2024.3368234 2024
[35]

Digital twin and drl-driven semantic dissemination for 6g autonomous driving service

Yihang Tao, Jun Wu, Xi Lin, Shahid Mumtaz, and Soumaya Cherkaoui. Digital twin and drl-driven semantic dissemination for 6g autonomous driving service. InGLOBECOM 2023 - 2023 IEEE Global Com- munications Conference, pages 2075–2080, 2023. doi: 10.1109/GLOBECOM54140.2023.10437455

work page doi:10.1109/globecom54140.2023.10437455 2023
[36]

Drl-driven digital twin function virtualization for adaptive service response in 6g networks.IEEE Networking Letters, 5(2):125–129, 2023

Yihang Tao, Jun Wu, Xi Lin, and Wu Yang. Drl-driven digital twin function virtualization for adaptive service response in 6g networks.IEEE Networking Letters, 5(2):125–129, 2023. doi: 10.1109/LNET. 2023.3269766

work page doi:10.1109/lnet 2023
[37]

Yihang Tao, Jun Wu, Qianqian Pan, Ali Kashif Bashir, and Marwan Omar. O-ran-based digital twin function virtualization for sustainable iov service response: An asynchronous hierarchical reinforcement learning approach.IEEE Transactions on Green Communications and Networking, 8(3):1049–1060,
[38]

doi: 10.1109/TGCN.2024.3435796

work page doi:10.1109/tgcn.2024.3435796 2024
[39]

Directed-cp: Directed collaborative perception for connected and autonomous vehicles via proactive attention

Yihang Tao, Senkang Hu, Zhengru Fang, and Yuguang Fang. Directed-cp: Directed collaborative perception for connected and autonomous vehicles via proactive attention. In2025 IEEE International Conference on Robotics and Automation (ICRA), pages 7004–7010, 2025. doi: 10.1109/ICRA55743. 2025.11127818

work page doi:10.1109/icra55743 2025
[40]

Learning mutual view information graph for adaptive adversarial collaborative perception.arXiv preprint arXiv:2602.19596, 2026

Yihang Tao, Senkang Hu, Haonan An, Zhengru Fang, Hangcheng Cao, and Yuguang Fang. Learning mutual view information graph for adaptive adversarial collaborative perception.arXiv preprint arXiv:2602.19596, 2026. URLhttps://arxiv.org/abs/2602.19596

work page arXiv 2026
[41]

Gcp: Guarded collaborative perception with spatial-temporal aware malicious agent detection.IEEE Transactions on Dependable and Secure Computing, pages 1–14, 2026

Yihang Tao, Senkang Hu, Yue Hu, Haonan An, Hangcheng Cao, and Yuguang Fang. Gcp: Guarded collaborative perception with spatial-temporal aware malicious agent detection.IEEE Transactions on Dependable and Secure Computing, pages 1–14, 2026. doi: 10.1109/TDSC.2026.3693684

work page doi:10.1109/tdsc.2026.3693684 2026
[42]

Con- fidence dimension for deep learning based on hoeffding inequality and relative evaluation.arXiv preprint arXiv:2203.09082, 2022

Runqi Wang, Linlin Yang, Baochang Zhang, Wentao Zhu, David Doermann, and Guodong Guo. Con- fidence dimension for deep learning based on hoeffding inequality and relative evaluation.arXiv preprint arXiv:2203.09082, 2022. URLhttps://arxiv.org/abs/2203.09082

work page arXiv 2022
[43]

V2VNet: Vehicle-to-Vehicle Communication for Joint Perception and Prediction

Tsun-Hsuan Wang, Sivabalan Manivasagam, Ming Liang, Bin Yang, Wenyuan Zeng, and Raquel Urtasun. V2VNet: Vehicle-to-Vehicle Communication for Joint Perception and Prediction. InEuropean Conference on Computer Vision (ECCV), pages 605–621, 2020

2020
[44]

Drive- dreamer: Towards real-world-drive world models for autonomous driving

Xiaofeng Wang, Zheng Zhu, Guan Huang, Xinze Chen, Jiagang Zhu, and Jiwen Lu. Drive- dreamer: Towards real-world-drive world models for autonomous driving. InComputer Vision - ECCV 2024: 18th European Conference, Milan, Italy, September 29-October 4, 2024, Proceedings, Part XLVIII, pages 55–72, Berlin, Heidelberg, 2024. Springer-Verlag. ISBN 978-3-031-7319...

work page doi:10.1007/978-3-031-73195-2_4 2024
[45]

Editable scene simulation for autonomous driving via collaborative llm-agents

Yuxi Wei, Zi Wang, Yifan Lu, Chenxin Xu, Changxing Liu, Hao Zhao, Siheng Chen, and Yanfeng Wang. Editable scene simulation for autonomous driving via collaborative llm-agents. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2024

2024
[46]

Tune-a-video: One-shot tuning of image diffusion models for text-to-video generation

Jay Zhangjie Wu, Yixiao Ge, Xintao Wang, Stan Weixian Lei, Yuchao Gu, Yufei Shi, Wynne Hsu, Ying Shan, Xiaohu Qie, and Mike Zheng Shou. Tune-a-video: One-shot tuning of image diffusion models for text-to-video generation. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 7623–7633, 2023

2023
[47]

In: 2023 IEEE International Conference on Robotics and Automation (ICRA), pp

Hao Xiang, Runsheng Xu, Xin Xia, Zhaoliang Zheng, Bolei Zhou, and Jiaqi Ma. V2xp-asg: Generating adversarial scenes for vehicle-to-everything perception. In2023 IEEE International Conference on Robotics and Automation (ICRA), pages 3584–3591, 2023. doi: 10.1109/ICRA48891.2023.10161384. 18/35 V2XCrafter: Learning to Generate Driving Scene Across Agents

work page doi:10.1109/icra48891.2023.10161384 2023
[48]

V2x-real: A largs-scale dataset for vehicle-to-everything cooperative perception

Hao Xiang, Zhaoliang Zheng, Xin Xia, Runsheng Xu, Letian Gao, Zewei Zhou, Xu Han, Xinkai Ji, Mingxi Li, Zonglin Meng, Li Jin, Mingyue Lei, Zhaoyang Ma, Zihang He, Haoxuan Ma, Yunshuang Yuan, Yingqian Zhao, and Jiaqi Ma. V2x-real: A largs-scale dataset for vehicle-to-everything cooperative perception. InComputer Vision - ECCV 2024: 18th European Conference...

2024
[49]

OPV2V: An Open Benchmark Dataset and Fusion Pipeline for Perception with Vehicle-to-Vehicle Communication

Runsheng Xu, Hao Xiang, Xin Xia, Xu Han, Jinlong Li, and Jiaqi Ma. OPV2V: An Open Benchmark Dataset and Fusion Pipeline for Perception with Vehicle-to-Vehicle Communication. In2022 Interna- tional Conference on Robotics and Automation (ICRA), pages 2583–2589, Philadelphia, PA, USA, May
[50]

Kairui Yang, Enhui Ma, Jibin Peng, Qing Guo, Di Lin, and Kaicheng Yu. Bevcontrol: Accurately controlling street-view elements with multi-perspective consistency via bev sketch layout. arXiv preprint arXiv:2308.01661, 2023. URL https://arxiv.org/abs/2308.01661

[51]

Lvmin Zhang, Anyi Rao, and Maneesh Agrawala. Adding conditional control to text-to-image diffusion models. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 3836–3847, October 2023.

[52]

Wenliang Zhao, Lujia Bai, Yongming Rao, Jie Zhou, and Jiwen Lu. Unipc: A unified predictor-corrector framework for fast sampling of diffusion models. NeurIPS, 2023.

[53]

Guangcong Zheng, Xianpan Zhou, Xuewei Li, Zhongang Qi, Ying Shan, and Xi Li. Layoutdiffusion: Controllable diffusion model for layout-to-image generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 22490–22499, June 2023.

[54]

Yupeng Zheng, Pengxuan Yang, Zebin Xing, Qichao Zhang, Yuhang Zheng, Yinfeng Gao, Pengfei Li, Teng Zhang, Zhongpu Xia, Peng Jia, XianPeng Lang, and Dongbin Zhao. World4drive: End-to-end autonomous driving via intention-aware physical latent world model. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 28632–28642, Oc...

2025
[55]

for multi-agent feature fusion. For data augmentation experiments, we additionally evaluate V2VNet
[56]

and Late Fusion [14] baselines, testing both camera-only (C) and camera-LiDAR (C+L) configurations.

B.2. Baseline Implementation Details

To the best of our knowledge, no prior work has addressed the task of multi-agent collaborative driving scene generation. Therefore, we establish baselines by adapting state-of-the-art single-agent driving scene generati...

[1] [2]

URL https://arxiv.org/abs/2502.13923

[2] [3]

Holger Caesar, Varun Bankiti, Alex H. Lang, Sourabh Vora, Venice Erin Liong, Qiang Xu, Anush Krishnan, Yu Pan, Giancarlo Baldan, and Oscar Beijbom. nuscenes: A multimodal dataset for autonomous driving. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2020.

[3] [4]

Kai Chen, Enze Xie, Zhe Chen, Yibo Wang, Lanqing Hong, Zhenguo Li, and Dit-Yan Yeung. Geodiffusion: Text-prompted geometric control for object detection data generation. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=xBfQZWeDRH

[4] [5]

Alexey Dosovitskiy, German Ros, Felipe Codevilla, Antonio Lopez, and Vladlen Koltun. CARLA: An Open Urban Driving Simulator. In Proceedings of the 1st Annual Conference on Robot Learning (CoRL), pages 1–16, 2017.

[5] [6]

Zhengru Fang, Senkang Hu, Haonan An, Yuang Zhang, Jingjing Wang, Hangcheng Cao, Xianhao Chen, and Yuguang Fang. PACP: Priority-Aware Collaborative Perception for Connected and Autonomous Vehicles. IEEE Transactions on Mobile Computing, 23(12):15003–15018, 2024.

[6] [7]

Zhengru Fang, Yu Guo, Fei Liu, Yuang Zhang, Yihang Tao, Senkang Hu, Wenbo Ding, and Yuguang Fang. Agent-centric visual reinforcement learning under dynamic perturbations. arXiv preprint arXiv:2604.24661, 2026. URL https://arxiv.org/abs/2604.24661

[7] [8]

Ruiyuan Gao, Kai Chen, Enze Xie, Lanqing Hong, Zhenguo Li, Dit-Yan Yeung, and Qiang Xu. MagicDrive: Street view generation with diverse 3d geometry control. In International Conference on Learning Representations, 2024.

[8] [9]

Ruiyuan Gao, Kai Chen, Bo Xiao, Lanqing Hong, Zhenguo Li, and Qiang Xu. MagicDrive-V2: High-resolution long video generation for autonomous driving with adaptive control. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 2025.

[9] [10]

Yu Guo, Yuan Gao, Yuxu Lu, Huilin Zhu, Ryan Wen Liu, and Shengfeng He. Onerestore: A universal restoration framework for composite degradation. In European conference on computer vision, pages 255–272. Springer, 2024.

[10] [11]

Yu Guo, Shengfeng He, Yuxu Lu, Haonan An, Yihang Tao, Huilin Zhu, Jingxian Liu, and Yuguang Fang. Neptune-x: Active x-to-maritime generation for universal maritime object detection. In Annual Conference on Neural Information Processing Systems, 2025.

[11] [12]

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.

[12] [13]

Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilibrium. In Proceedings of the 31st International Conference on Neural Information Processing Systems, NIPS’17, pages 6629–664...

[13] [14]

Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. In Proceedings of the 34th International Conference on Neural Information Processing Systems, NIPS ’20, Red Hook, NY, USA, 2020. Curran Associates Inc. ISBN 9781713829546.

[14] [15]

Senkang Hu, Zhengru Fang, Yiqin Deng, Xianhao Chen, and Yuguang Fang. Collaborative perception for connected and autonomous driving: Challenges, possible solutions and opportunities. IEEE Wireless Communications, 32(5):228–234, 2025. doi: 10.1109/MWC.002.2400348.

[15] [16]

Senkang Hu, Yihang Tao, Guowen Xu, Yiqin Deng, Xianhao Chen, Yuguang Fang, and Sam Kwong. CP-Guard: Malicious agent detection and defense in collaborative bird’s eye view perception. Proceedings of the AAAI Conference on Artificial Intelligence, 39(22):23203–23211, Apr. 2025.

[16] [17]

Ermo Hua, Che Jiang, Xingtai Lv, Kaiyan Zhang, Youbang Sun, Yuchen Fan, Xuekai Zhu, Biqing Qi, Ning Ding, and Bowen Zhou. Fourier position embedding: Enhancing attention’s periodic extension for length generalization. In Forty-second International Conference on Machine Learning, 2025. URL https://openreview.net/forum?id=ZfDNDkg7Dh

[17] [18]

Varun Kanade, Patrick Rebeschini, and Tomasz Vaškevičiu. Exponential tail local rademacher complexity risk bounds without the bernstein condition. J. Mach. Learn. Res., 25(1), January 2024. ISSN 1532-4435.

[18] [19]

Nupur Kumari, Bingliang Zhang, Richard Zhang, Eli Shechtman, and Jun-Yan Zhu. Multi-concept customization of text-to-image diffusion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023.

[19] [20]

Bowen Li, Xiaojuan Qi, Thomas Lukasiewicz, and Philip Torr. Controllable text-to-image generation. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019. URL https://proceedings.neurips.cc/paper_files/paper/2019/file/1d7...

[20] [21]

Junhang Li, Yu Guo, Chuhua Xian, and Shengfeng He. Instruct2see: Learning to remove any obstructions across distributions. arXiv preprint arXiv:2505.17649, 2025. URL https://arxiv.org/abs/2505.17649

[21] [22]

Yuheng Li, Haotian Liu, Qingyang Wu, Fangzhou Mu, Jianwei Yang, Jianfeng Gao, Chunyuan Li, and Yong Jae Lee. Gligen: Open-set grounded text-to-image generation. CVPR, 2023.

[22] [23]

Bo Liu, Xingchao Liu, Xiaojie Jin, Peter Stone, and Qiang Liu. Conflict-averse gradient descent for multi-task learning. In M. Ranzato, A. Beygelzimer, Y. Dauphin, P.S. Liang, and J. Wortman Vaughan, editors, Advances in Neural Information Processing Systems, volume 34, pages 18878–18890. Curran Associates, Inc., 2021. URL https://proceedings.neurips.cc/p...

[23] [24]

Genjia Liu, Yue Hu, Chenxin Xu, Weibo Mao, Junhao Ge, Zhengxiang Huang, Yifan Lu, Yinda Xu, Junkai Xia, Yafei Wang, and Siheng Chen. Toward collaborative autonomous driving: Simulation platform and end-to-end system. IEEE Transactions on Pattern Analysis and Machine Intelligence, 47(8):6566–6584, 2025. doi: 10.1109/TPAMI.2025.3560327.

[24] [25]

Zhijian Liu, Haotian Tang, Alexander Amini, Xingyu Yang, Huizi Mao, Daniela Rus, and Song Han. Bevfusion: Multi-task multi-sensor fusion with unified bird’s-eye view representation. In IEEE International Conference on Robotics and Automation (ICRA), 2023.

[25] [26]

Yanan Ma, Senkang Hu, Zhengru Fang, Yun Ji, Yiqin Deng, and Yuguang Fang. Sense4fl: Vehicular crowdsensing enhanced federated learning for object detection in autonomous driving. IEEE Transactions on Mobile Computing, pages 1–15, 2026. doi: 10.1109/TMC.2026.3674333.

[26] [27]

Chen Min, Dawei Zhao, Liang Xiao, Jian Zhao, Xinli Xu, Zheng Zhu, Lei Jin, Jianshu Li, Yulan Guo, Junliang Xing, Liping Jing, Yiming Nie, and Bin Dai. Driveworld: 4d pre-trained scene understanding via world models for autonomous driving. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 15522–15533, June 2024.

[27] [28]

Tai-Yu Pan, Sooyoung Jeon, Mengdi Fan, Jinsu Yoo, Zhenyang Feng, Mark Campbell, Kilian Q. Weinberger, Bharath Hariharan, and Wei-Lun Chao. Transfer your perspective: Controllable 3d generation from any viewpoint in a driving scene. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 12027–12036, June 2025.

[28] [29]

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision. In Marina Meila and Tong Zhang, editors, Proceedings of the 38th International Conference on Machine...

2021

[29] [30]

Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:2204.06125, 2022. URL https://arxiv.org/abs/2204.06125

[30] [31]

Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 10684–10695, June 2022.

[31] [32]

Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily L Denton, Kamyar Ghasemipour, Raphael Gontijo Lopes, Burcu Karagol Ayan, Tim Salimans, Jonathan Ho, David J Fleet, and Mohammad Norouzi. Photorealistic text-to-image diffusion models with deep language understanding. In S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. O...

2022

[32] [33]

Pei Sun, Henrik Kretzschmar, Xerxes Dotiwalla, Aurelien Chouard, Vijaysai Patnaik, Paul Tsui, James Guo, Yin Zhou, Yuning Chai, Benjamin Caine, Vijay Vasudevan, Wei Han, Jiquan Ngiam, Hang Zhao, Aleksei Timofeev, Scott Ettinger, Maxim Krivokon, Amy Gao, Aditya Joshi, Yu Zhang, Jonathon Shlens, Zhifeng Chen, and Dragomir Anguelov. Scalability in perception...

2020

[33] [34]

Alexander Swerdlow, Runsheng Xu, and Bolei Zhou. Street-view image generation from a bird’s-eye view layout. IEEE Robotics and Automation Letters, 9(4):3578–3585, 2024. doi: 10.1109/LRA.2024.3368234.

[34] [35]

Yihang Tao, Jun Wu, Xi Lin, Shahid Mumtaz, and Soumaya Cherkaoui. Digital twin and drl-driven semantic dissemination for 6g autonomous driving service. In GLOBECOM 2023 - 2023 IEEE Global Communications Conference, pages 2075–2080, 2023. doi: 10.1109/GLOBECOM54140.2023.10437455.

[35] [36]

Yihang Tao, Jun Wu, Xi Lin, and Wu Yang. Drl-driven digital twin function virtualization for adaptive service response in 6g networks. IEEE Networking Letters, 5(2):125–129, 2023. doi: 10.1109/LNET.2023.3269766.

[36] [37]

Yihang Tao, Jun Wu, Qianqian Pan, Ali Kashif Bashir, and Marwan Omar. O-ran-based digital twin function virtualization for sustainable iov service response: An asynchronous hierarchical reinforcement learning approach. IEEE Transactions on Green Communications and Networking, 8(3):1049–1060, 2024. doi: 10.1109/TGCN.2024.3435796.

[38] [39]

Yihang Tao, Senkang Hu, Zhengru Fang, and Yuguang Fang. Directed-cp: Directed collaborative perception for connected and autonomous vehicles via proactive attention. In 2025 IEEE International Conference on Robotics and Automation (ICRA), pages 7004–7010, 2025. doi: 10.1109/ICRA55743.2025.11127818.

[39] [40]

Yihang Tao, Senkang Hu, Haonan An, Zhengru Fang, Hangcheng Cao, and Yuguang Fang. Learning mutual view information graph for adaptive adversarial collaborative perception. arXiv preprint arXiv:2602.19596, 2026. URL https://arxiv.org/abs/2602.19596

[40] [41]

Yihang Tao, Senkang Hu, Yue Hu, Haonan An, Hangcheng Cao, and Yuguang Fang. Gcp: Guarded collaborative perception with spatial-temporal aware malicious agent detection. IEEE Transactions on Dependable and Secure Computing, pages 1–14, 2026. doi: 10.1109/TDSC.2026.3693684.

[41] [42]

Runqi Wang, Linlin Yang, Baochang Zhang, Wentao Zhu, David Doermann, and Guodong Guo. Confidence dimension for deep learning based on hoeffding inequality and relative evaluation. arXiv preprint arXiv:2203.09082, 2022. URL https://arxiv.org/abs/2203.09082

[42] [43]

Tsun-Hsuan Wang, Sivabalan Manivasagam, Ming Liang, Bin Yang, Wenyuan Zeng, and Raquel Urtasun. V2VNet: Vehicle-to-Vehicle Communication for Joint Perception and Prediction. In European Conference on Computer Vision (ECCV), pages 605–621, 2020.

[43] [44]

Xiaofeng Wang, Zheng Zhu, Guan Huang, Xinze Chen, Jiagang Zhu, and Jiwen Lu. Drivedreamer: Towards real-world-drive world models for autonomous driving. In Computer Vision - ECCV 2024: 18th European Conference, Milan, Italy, September 29-October 4, 2024, Proceedings, Part XLVIII, pages 55–72, Berlin, Heidelberg, 2024. Springer-Verlag. ISBN 978-3-031-7319...

doi: 10.1007/978-3-031-73195-2_4

[44] [45]

Yuxi Wei, Zi Wang, Yifan Lu, Chenxin Xu, Changxing Liu, Hao Zhao, Siheng Chen, and Yanfeng Wang. Editable scene simulation for autonomous driving via collaborative llm-agents. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2024.
