pith. machine review for the scientific record.

arxiv: 2604.09206 · v1 · submitted 2026-04-10 · 💻 cs.CV

Recognition: unknown

Long-SCOPE: Fully Sparse Long-Range Cooperative 3D Perception

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 16:44 UTC · model grok-4.3

classification 💻 cs.CV
keywords cooperative 3D perception · long-range sensing · sparse representation · V2X communication · feature association · query generation · autonomous driving · 3D object detection

The pith

Long-SCOPE replaces dense maps with geometry-guided sparse queries and learnable association to achieve accurate cooperative 3D perception at 100-150 meter ranges.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Long-SCOPE, a fully sparse framework for cooperative 3D perception in autonomous driving that targets two practical barriers at long distances. Dense bird's-eye-view representations cause computation to grow quadratically with range, while feature matching between vehicles breaks down under observation and alignment errors. Long-SCOPE uses a Geometry-guided Query Generation module to locate small distant objects via geometric priors and a learnable Context-Aware Association module to match queries despite positional noise. This combination keeps both computation and communication costs low while extending reliable sensing to 100-150 meters, which would make vehicle-to-everything perception viable for highway-scale autonomous driving.
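The scaling contrast can be sketched numerically. The grid resolution and query budget below are assumed illustrative values, not numbers from the paper:

```python
# Back-of-envelope sketch (assumed resolution and query budget, not the
# paper's figures): token count of a dense BEV grid versus a fixed sparse
# query budget as the sensing range grows.

def bev_cells(range_m: float, resolution_m: float = 0.5) -> int:
    """Number of cells in a square BEV grid covering [-range, range]^2."""
    side = int(2 * range_m / resolution_m)
    return side * side

def sparse_queries(range_m: float, budget: int = 900) -> int:
    """A fully sparse detector keeps a fixed query budget regardless of range."""
    return budget

for r in (50, 100, 150):
    print(f"{r:>4} m  BEV cells: {bev_cells(r):>9,}  sparse queries: {sparse_queries(r)}")

# Doubling the range quadruples the dense token count; the sparse budget is flat.
assert bev_cells(100) == 4 * bev_cells(50)
```

The quadratic growth of the dense grid is what makes 150 m ranges expensive for BEV methods, while a sparse query set pays only for the objects it tracks.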

Core claim

Long-SCOPE is a fully sparse long-range cooperative 3D perception framework featuring Geometry-guided Query Generation to accurately detect small distant objects and learnable Context-Aware Association to robustly match cooperative queries despite severe positional noise. It delivers state-of-the-art performance on the V2X-Seq and Griffin datasets, especially in the 100-150 m range, while maintaining competitive computation and communication costs.

What carries the argument

The Geometry-guided Query Generation module produces sparse queries from geometric priors for distant objects, and the learnable Context-Aware Association module matches those queries across vehicles despite alignment errors; together they enable fully sparse rather than dense BEV processing.
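One common geometric prior for placing queries on distant objects is back-projecting a 2D detection onto a flat ground plane using camera intrinsics and mounting height. The sketch below illustrates that generic idea; the paper's actual Geometry-guided Query Generation module may use a different prior:

```python
import numpy as np

# Hypothetical sketch (not the paper's exact module): seed a 3D query by
# intersecting the camera ray through a detection's bottom edge with a
# flat ground plane at known camera height. Camera coords: y points down.

def seed_query_on_ground(u, v, fx, fy, cu, cv, cam_height):
    """Return the 3D point (x, y, z) where the pixel ray hits y = cam_height."""
    # Normalized ray direction through pixel (u, v).
    ray = np.array([(u - cu) / fx, (v - cv) / fy, 1.0])
    # Scale the ray so it reaches the ground plane (requires v below the horizon).
    t = cam_height / ray[1]
    return t * ray

# A pixel 40 rows below the principal point, camera mounted 1.5 m high.
pt = seed_query_on_ground(u=960, v=580, fx=1000, fy=1000, cu=960, cv=540, cam_height=1.5)
print(pt)  # depth pt[2] = 37.5 m: small pixel offsets map to large ranges
```

The example also shows why distant objects are hard: a 40-pixel offset already corresponds to 37.5 m of depth, so small observation errors translate into large positional errors, which is exactly the noise the association module must absorb.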

If this is right

  • Cooperative perception can scale to 150 m without quadratic growth in computation or memory.
  • Only sparse queries need to be exchanged, keeping communication bandwidth low for real-time use.
  • Occlusion handling and extended sensing horizons become practical in multi-vehicle scenarios.
  • Performance advantages appear strongest precisely in the long-range regimes where prior methods degrade.
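The bandwidth point can be made concrete with assumed payload sizes (feature dimensions, channel counts, and query counts below are illustrative, not the paper's reported numbers):

```python
# Rough bandwidth illustration (assumed sizes, not the paper's numbers):
# shipping a dense BEV feature map versus a few hundred sparse queries.

def dense_bytes(range_m, resolution_m=0.5, channels=64, bytes_per=2):  # fp16
    side = int(2 * range_m / resolution_m)
    return side * side * channels * bytes_per

def sparse_bytes(num_queries=300, feat_dim=256, bytes_per=2):
    # Each query: a feature vector plus a 3D reference point (3 values).
    return num_queries * (feat_dim + 3) * bytes_per

print(f"dense @150 m : {dense_bytes(150) / 1e6:.1f} MB")
print(f"sparse       : {sparse_bytes() / 1e3:.1f} KB")
```

Under these assumptions the dense payload is two to three orders of magnitude larger, which is the gap that makes per-frame V2X exchange of full feature volumes impractical.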

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same sparse-query design could reduce bandwidth further in larger fleets by sharing only selected detections rather than full feature volumes.
  • Adding temporal context to the association module might extend the approach to consistent tracking across multiple time steps at distance.
  • The framework points toward perception architectures that remain efficient when the number of cooperating agents grows beyond pairs.

Load-bearing premise

The Geometry-guided Query Generation and learnable Context-Aware Association modules will remain accurate and efficient when real-world V2X observation and alignment errors are larger or differently distributed than those in the V2X-Seq and Griffin datasets.
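Learnable matchers in this space often rest on a differentiable soft-assignment step such as Sinkhorn normalization; whether Long-SCOPE's Context-Aware Association uses this exact mechanism is an assumption here, but the generic pattern looks like this:

```python
import numpy as np

# Generic differentiable matching step (Sinkhorn normalization), a common
# building block of learnable association modules. This is an illustrative
# sketch, not the paper's stated mechanism.

def sinkhorn(scores, iters=50):
    """Normalize a similarity matrix toward a doubly stochastic assignment."""
    log_p = scores.copy()
    for _ in range(iters):
        log_p -= np.log(np.exp(log_p).sum(axis=1, keepdims=True))  # row norm
        log_p -= np.log(np.exp(log_p).sum(axis=0, keepdims=True))  # col norm
    return np.exp(log_p)

# Learned similarity between 3 ego queries and 3 cooperative queries.
scores = np.array([[4.0, 0.0, 0.0],
                   [0.0, 3.0, 1.0],
                   [0.0, 1.0, 3.0]])
P = sinkhorn(scores)
print(P.round(2))  # doubly stochastic, with the strongest mass on the diagonal
assert P.argmax(axis=1).tolist() == [0, 1, 2]
```

Because the whole step is differentiable, the similarity scores can be trained end to end to tolerate positional noise, which is the property the load-bearing premise asks to hold out of distribution.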

What would settle it

A controlled test on data with substantially larger positional noise or on a new V2X dataset with different error statistics where Long-SCOPE falls below dense baselines would show the modules do not generalize as claimed.

Figures

Figures reproduced from arXiv: 2604.09206 by Chenyang Lu, Chuang Zhang, Jiahao Wang, Jianqiang Wang, Jiaru Zhong, Shaobing Xu, Shuocheng Yang, Yuner Zhang, Yuxuan Wang, Zhongwei Jiang, Zikun Xu.

Figure 1: Core challenges in long-range cooperation: (a) observa…
Figure 2: Performance comparison on Griffin-25m dataset. The X-axis and Y-axis represent perception metrics, while the bubble size and color encode the transmission cost on a logarithmic scale.
Figure 3: Task Definition of Cooperative Perception. Co-Visible…
Figure 4: Overview of Long-SCOPE framework, highlighting our novel components: the Geometry-guided Query Generation module for…
Figure 5: Depth derivation for high-vantage agents. We predict the…
Figure 6: Comparison of robustness to localization errors on…
Figure 7: Qualitative comparison of query association perfor…
Original abstract

Cooperative 3D perception via Vehicle-to-Everything communication is a promising paradigm for enhancing autonomous driving, offering extended sensing horizons and occlusion resolution. However, the practical deployment of existing methods is hindered at long distances by two critical bottlenecks: the quadratic computational scaling of dense BEV representations and the fragility of feature association mechanisms under significant observation and alignment errors. To overcome these limitations, we introduce Long-SCOPE, a fully sparse framework designed for robust long-distance cooperative 3D perception. Our method features two novel components: a Geometry-guided Query Generation module to accurately detect small, distant objects, and a learnable Context-Aware Association module that robustly matches cooperative queries despite severe positional noise. Experiments on the V2X-Seq and Griffin datasets validate that Long-SCOPE achieves state-of-the-art performance, particularly in challenging 100-150 m long-range settings, while maintaining highly competitive computation and communication costs.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces Long-SCOPE, a fully sparse framework for long-range cooperative 3D perception via V2X communication. It proposes two novel modules—a Geometry-guided Query Generation module for detecting small distant objects and a learnable Context-Aware Association module for robust query matching under positional noise—to address quadratic scaling in dense BEV representations and fragile feature association at long distances. Experiments on the V2X-Seq and Griffin datasets are reported to achieve state-of-the-art performance particularly in the 100-150 m range while maintaining competitive computation and communication costs.

Significance. If the central claims hold under broader conditions, the work could meaningfully advance practical deployment of cooperative perception systems by mitigating key scalability and robustness bottlenecks at extended ranges, with potential benefits for occlusion handling and sensing horizons in autonomous driving.

major comments (2)
  1. [Experiments] Experiments section: the SOTA claims on V2X-Seq and Griffin lack reported error bars, ablation studies isolating the Geometry-guided Query Generation and Context-Aware Association modules, and failure-case analysis. Without these, it is difficult to verify that the long-range gains are attributable to the proposed components rather than dataset-specific tuning or post-hoc choices.
  2. [Method and Experiments] Method and Experiments sections: the robustness claim for the learnable Context-Aware Association module under 'severe positional noise' is load-bearing for the central long-range performance argument, yet the evaluation uses only the error statistics present in V2X-Seq and Griffin. No stress tests with higher-magnitude, differently correlated, or non-Gaussian noise (e.g., larger calibration drift or timestamp asynchrony) are described, leaving the transferability to real-world V2X deployments unverified.
minor comments (1)
  1. [Abstract] Abstract: the specific quantitative metrics (e.g., mAP@100-150m, communication volume in bytes) underlying the 'state-of-the-art' and 'highly competitive costs' statements could be stated explicitly to allow immediate comparison with prior work.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major point below and will revise the manuscript accordingly to improve experimental rigor and robustness validation.

Point-by-point responses
  1. Referee: [Experiments] Experiments section: the SOTA claims on V2X-Seq and Griffin lack reported error bars, ablation studies isolating the Geometry-guided Query Generation and Context-Aware Association modules, and failure-case analysis. Without these, it is difficult to verify that the long-range gains are attributable to the proposed components rather than dataset-specific tuning or post-hoc choices.

    Authors: We agree that error bars, isolating ablations, and failure-case analysis would strengthen verifiability of the long-range gains. In the revised manuscript we will report error bars over multiple runs with varied random seeds for all main results on V2X-Seq and Griffin. We will expand the ablation table to isolate the individual contributions of the Geometry-guided Query Generation module and the Context-Aware Association module. We will also add a failure-case analysis subsection discussing representative underperformance scenarios at long range. revision: yes

  2. Referee: [Method and Experiments] Method and Experiments sections: the robustness claim for the learnable Context-Aware Association module under 'severe positional noise' is load-bearing for the central long-range performance argument, yet the evaluation uses only the error statistics present in V2X-Seq and Griffin. No stress tests with higher-magnitude, differently correlated, or non-Gaussian noise (e.g., larger calibration drift or timestamp asynchrony) are described, leaving the transferability to real-world V2X deployments unverified.

    Authors: We acknowledge that the current experiments rely on the noise statistics native to V2X-Seq and Griffin. While these datasets already embed realistic calibration and synchronization errors, we will add explicit stress-test experiments in the revision. These will synthetically increase noise magnitude, introduce correlated errors, and apply non-Gaussian distributions to simulate more extreme calibration drift and timestamp asynchrony, thereby providing stronger evidence for transferability. revision: yes
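A minimal sketch of the kind of stress test the rebuttal promises (assumed protocol, not from the paper): inject correlated, heavy-tailed positional noise into one agent's query positions and watch a naive nearest-neighbor association degrade, which is the failure mode a learnable matcher is meant to avoid.

```python
import numpy as np

# Assumed stress-test protocol (illustrative, not the paper's): shared bias
# models calibration drift, Student-t jitter models heavy-tailed per-object
# error; accuracy is how often naive nearest-neighbor matching is correct.

rng = np.random.default_rng(0)

def match_accuracy(noise_scale, n=200, trials=50):
    correct = 0.0
    for _ in range(trials):
        pts = rng.uniform(0, 150, size=(n, 2))                    # ego query positions
        bias = rng.normal(0, noise_scale, size=2)                 # correlated offset
        jitter = noise_scale * rng.standard_t(df=3, size=(n, 2))  # heavy tails
        noisy = pts + bias + jitter                               # cooperative positions
        # Naive association: nearest ego query for each cooperative query.
        d = np.linalg.norm(noisy[:, None, :] - pts[None, :, :], axis=-1)
        correct += (d.argmin(axis=1) == np.arange(n)).mean()
    return correct / trials

for s in (0.5, 2.0, 5.0):
    print(f"noise {s} m -> naive match accuracy {match_accuracy(s):.2f}")
```

Sweeping the noise scale, correlation structure, and tail weight in this way would directly probe the transferability question the referee raises.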

Circularity Check

0 steps flagged

No circularity; empirical ML architecture with external dataset validation

Full rationale

The paper introduces a sparse neural architecture (Geometry-guided Query Generation + learnable Context-Aware Association) for long-range V2X 3D perception and reports experimental results on the public V2X-Seq and Griffin benchmarks. No equations, derivations, or first-principles predictions appear; performance numbers are obtained by training and testing on held-out data rather than by fitting a parameter and relabeling the fit as a prediction. No self-citation chain is required to justify the core modules, and the evaluation uses independent external datasets. This is the standard non-circular case for an empirical CV paper.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on standard supervised deep-learning assumptions (availability of labeled 3D bounding boxes in V2X-Seq and Griffin) plus the unstated premise that the learned modules generalize beyond the training distribution of positional noise. No new physical axioms or invented entities are introduced.

axioms (1)
  • domain assumption: Supervised training on existing V2X datasets produces models that generalize to real-world long-range cooperative scenarios
    Implicit in any empirical claim of SOTA performance on those datasets.


discussion (0)

