RESOLVE: A Multi-Resolution and Multi-Modal Dataset for Roadside Cooperative Perception

Dajiang Suo; Linan Song; Marco De Vincenzi; Shaozu Ding

arxiv: 2606.31895 · v1 · pith:TVE2IX3Fnew · submitted 2026-06-30 · 💻 cs.CV

RESOLVE: A Multi-Resolution and Multi-Modal Dataset for Roadside Cooperative Perception

Shaozu Ding , Linan Song , Marco De Vincenzi , Dajiang Suo This is my paper

Pith reviewed 2026-07-01 05:44 UTC · model grok-4.3

classification 💻 cs.CV

keywords roadside cooperative perceptionmulti-resolution LiDARmultimodal fusion3D object detectionpoint cloud sparsitybenchmark datasettraffic participant trackingurban intersection sensing

0 comments

The pith

RESOLVE dataset enables controlled tests of roadside 3D perception across three fixed LiDAR resolutions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

RESOLVE is a real-world dataset collected at an urban intersection that pairs cameras with LiDAR at three different resolutions while holding all other sensing and environmental factors constant. It supplies over 100,000 images, 26,000 point cloud frames, and 220,000 bounding box annotations covering ten classes of traffic participants across varied lighting and weather. The controlled resolution variants create point cloud distribution shifts from sparsity, distance, and training-inference mismatches. This structure supports direct comparisons of unimodal and camera-LiDAR fusion models in 3D detection and tracking. Benchmark results show how fusion can offset sparsity effects and guide lower-cost roadside sensor choices.

Core claim

RESOLVE supplies synchronized multi-resolution LiDAR and camera data from roadside views along with manual annotations to enable systematic evaluation of unimodal and fusion architectures for 3D detection and tracking under controlled point sparsity shifts.

What carries the argument

The RESOLVE dataset's three fixed LiDAR resolution levels captured with otherwise identical sensing parameters, cameras, and scene conditions.

If this is right

Multimodal fusion recovers detection performance lost to reduced LiDAR point density.
Architectures can be compared without confounding from uncontrolled resolution changes.
Designers can directly assess cost savings from deploying lower-resolution roadside sensors.
Effects of training at one resolution and inferring at another become measurable.
Performance trends hold across detection, tracking, and diverse lighting or weather.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same fixed-factor collection approach could be repeated with added radar or different intersection layouts to test broader generalizability.
The released benchmark code makes it straightforward to evaluate new models on the existing resolution variants.
Insights on sparsity compensation could inform sensor fusion choices in vehicle-to-infrastructure networks beyond this single site.
Similar controlled multi-resolution captures might reveal whether the observed fusion benefits transfer to other perception tasks such as segmentation.

Load-bearing premise

The 220,000 manual bounding box annotations remain accurate and consistent when the same scenes are captured at different LiDAR resolutions and under changing conditions.

What would settle it

Re-annotating a held-out subset of frames using only the lowest-resolution point clouds and finding average IoU agreement below 0.7 with the original labels.

Figures

Figures reproduced from arXiv: 2606.31895 by Dajiang Suo, Linan Song, Marco De Vincenzi, Shaozu Ding.

**Figure 1.** Figure 1: This illustration provides an comprehensive overview of our RESOLVE dataset. We visualize globally fused point clouds of a representative scene, with point clouds from LiDARs of different resolutions encoded in different colors. The sensor installation locations on the infrastructure side are highlighted. Both the high-fidelity simulation map based on CARLA and the HD map is displayed. paradigm focuses on … view at source ↗

**Figure 2.** Figure 2: The illustrations of temporal synchronization and spatial calibration across modalities. (a) shows calibration among LiDARs at different resolutions, including a zoomed-in view of target alignment, with camera frustums overlaid to verify camera–LiDAR extrinsics and coverage. (b) shows point clouds from different resolution LiDARs projected onto multiple camera views to assess cross-modal calibration. Sens… view at source ↗

**Figure 3.** Figure 3: Class-wise statistics under three LiDAR resolutions. (a) Lower resolution reduces valid 3D boxes, especially for small targets. (b) As resolution increases, the average number of points increases, reflecting richer geometric observations. (c) Higher resolution LiDAR provides a slight gain for small targets track, but can extend the effective tracking length of large vehicles. 0 10 20 30 40 # 3D box labels… view at source ↗

**Figure 4.** Figure 4: Statistics on valid boxes per frame and in-box points per box, along with visualization of motion trajectories. (a) On average, each frame contains 16 valid bounding boxes. Resolution variations have minimal impact on this metric. (b) Distribution of LiDAR points per valid 3D box under different resolutions. (c) BEV trajectory visualization on the intersection HD map, with trajectories color-coded by cla… view at source ↗

**Figure 5.** Figure 5: Statistics on the variation of average number of points with distance from object to the center of intersection at different resolutions. The resolution-caused difference is significant at close range but diminishes with increasing distance. the target in each sequence to support temporal tasks. We also provide a format conversion tool that can convert this dataset into nuScenes [6] format, making it [PIT… view at source ↗

**Figure 6.** Figure 6: Qualitative multi-object tracking results of SimpleTrack [38] on our RESOLVE across three resolution settings. Ground truths and predictions with a score threshold of 0.3 are overlaid and visualized at 15 Hz. Failure cases are marked in red circles. achieved by improving positioning accuracy. The average AMOTP decreases by 5.1% from low to medium resolution and by 14.1% from medium to high resolution. In … view at source ↗

**Figure 7.** Figure 7: Agent-level cooperative perception performance under three LiDAR resolutions. geometric details. By exchanging learned BEV features, intermediate fusion methods (e.g., AttFuse [58], F-Cooper [8], CoBEVT [55], and V2X-ViT [57]), which account for agent heterogeneity and inter-agent synchronization, outperform late and early fusion in most settings. The advantage of feature-level fusion is most evident at m… view at source ↗

**Figure 8.** Figure 8: t-SNE plots of the impact of LiDAR resolutions on feature representation capability of unimodal models under Transfusion-L [2] architecture [PITH_FULL_IMAGE:figures/full_fig_p013_8.png] view at source ↗

**Figure 9.** Figure 9: Controlled experimental design under a general multimodal architecture. Top: standard backpropagation, mAP=93.1. Bottom: stop gradients in the LiDAR branch and freeze LiDAR-module parameters, mAP=89.4. 1 and 3 visualize the features of the first layer of LiDAR backbone, while 2 and 4 visualize the fused features. To further investigate how multimodal learning influences LiDAR feature representations, we fo… view at source ↗

**Figure 10.** Figure 10: Unimodal performance degradation caused by different LiDAR resolutions used during training and inference. Voxel Mamba [65] is used as unimodal detector. modal loss [50], the fusion model achieves higher detection performance and stronger feature separability at the fusion layers (marked number 2 in [PITH_FULL_IMAGE:figures/full_fig_p014_10.png] view at source ↗

**Figure 11.** Figure 11: Installation location diagram of infrastructure cameras, marked in red circles [PITH_FULL_IMAGE:figures/full_fig_p021_11.png] view at source ↗

**Figure 12.** Figure 12: A professional team ensures that the LiDAR installations within each group differ only in height and provides specific measurement data for all sensors. 0.17m 22.5° 15° High Mid Low 22.5° Point Cloud Range Max:90m Max:90m Max:110m [PITH_FULL_IMAGE:figures/full_fig_p022_12.png] view at source ↗

**Figure 13.** Figure 13: Influence by sensing range, VFOV, and installation height B Data Annoatation We complete 2D and 3D annotations using the MolarData platform. The annotations undergo multiple rounds of revision and review by professionals to ensure quality. The dataset contains 10 categories of annotated objects: pedestrians, golf carts, motorcycles, bicycles, cars, trucks, vans, construction vehicles, trailers, and buse… view at source ↗

**Figure 14.** Figure 14: Distribution of object dimensions (length, width, height) for each class in RESOLVE dataset. The box plots depict the intra-class statistical spread and emphasize inter-class scale variations. tive of this fused point cloud to effectively reduce the annotation workload. After completing the 3D annotation in the fused point cloud, and combining the spatial calibration results, we project these 3D boundin… view at source ↗

**Figure 15.** Figure 15: Relationship between the number of points within the bounding box and the distance from the object to the center of the intersection at three resolutions [PITH_FULL_IMAGE:figures/full_fig_p024_15.png] view at source ↗

**Figure 16.** Figure 16: Distance-wise mAP of different classes under three resolutions. We use Voxel Mamba [65] as the detection model. Increasing LiDAR resolution can improve mAP, and this advantage is more pronounced at long distances where the number of in-box points is close to the lower limit. C.2 Class Distances As shown in [PITH_FULL_IMAGE:figures/full_fig_p024_16.png] view at source ↗

**Figure 17.** Figure 17: Comparison of point cloud shapes for 10 traffic participants in our RESOLVE dataset at three LiDAR resolutions, including the notes of the distance from the intersection center. The results show that the point cloud density significantly increases with increasing resolution, and the target geometry gradually evolves from sparse outlines to more complete three-dimensional information, enhancing the abili… view at source ↗

**Figure 18.** Figure 18: The illustrations show how we visualize LiDAR backbone features using tSNE. (a) Sparse convolutional features project onto BEV view. (b) Clipping region features of each object and performing average pooling. aggregate the feature block into a C-dimensional vector in R C . Next, we collect all feature vectors at the object-level across frames to form a set of samples X ∈ R Ntotal×C . After standardizatio… view at source ↗

**Figure 19.** Figure 19: Multimodal performance degradation caused by different LiDAR resolutions used during training and inference. Top: BEVFusion [33] is used as multimodal detector. Bottom: UniTR [47] is used as multimodal detector under the low resolution. As the resolution increases, detections become more stable and the tracking bottleneck gradually shifts from detection quality to data association and occlusion handling. … view at source ↗

**Figure 20.** Figure 20: Comparison of detection results between 64/16-beam data obtained by downsampling from 128-beam and real-world 64/16-beam data. data [PITH_FULL_IMAGE:figures/full_fig_p034_20.png] view at source ↗

**Figure 21.** Figure 21: Effects of LiDAR resolution on object detection accuracy at different distances. Distance-conditioned multimodal perception gain. We further explore how multimodal perception gains over unimodal vary with object distance. In roadside sensing, LiDAR point density naturally decreases as object distance increases, leading to progressively sparser measurements for distant objects. As show in [PITH_FULL_IMAGE… view at source ↗

**Figure 22.** Figure 22: Comparison of spatial attention maps induced by different latent codes in the last VSA module of VoxSet [17] architecture across three LiDAR resolutions. E Visualization E.1 3D Object Detection As shown in Figs. 23 to 25, we visualize and compare the detection results of three typical scenarios based on different detection models on our RESOLVE dataset. In sunny scenarios (see [PITH_FULL_IMAGE:figures/fu… view at source ↗

**Figure 23.** Figure 23: Qualitative detection results of BEVFusion [33] in a sunny scene across three resolution settings, where missed or falsely detected objects are marked with purple circles [PITH_FULL_IMAGE:figures/full_fig_p038_23.png] view at source ↗

**Figure 24.** Figure 24: Qualitative detection results of TransFusion-L [2] in a rainy scene across three resolution settings, where missed or falsely detected objects are marked with purple circles [PITH_FULL_IMAGE:figures/full_fig_p039_24.png] view at source ↗

**Figure 25.** Figure 25: Qualitative detection results of LION [32] in a night scene across three resolution settings, where missed or falsely detected objects are marked with purple circles [PITH_FULL_IMAGE:figures/full_fig_p040_25.png] view at source ↗

**Figure 26.** Figure 26: Comparison of tracking results under different detection resolutions across three scenarios. We use Voxel Mamba [65] and SimpleTrack [38] as the detector and tracker [PITH_FULL_IMAGE:figures/full_fig_p041_26.png] view at source ↗

read the original abstract

LiDAR has increasingly been integrated into traffic cameras to expand coverage and mitigate occlusion in roadside cooperative perception. However, how unimodal and camera-LiDAR fusion architectures behave under variations in LiDAR point sparsity induced by sensor configurations and scene-dependent sensing conditions remains underexplored. We introduce RESOLVE, a large-scale real-world benchmark dataset featuring multi-resolution roadside LiDAR and synchronized camera-LiDAR sensing for systematic evaluation of unimodal and fusion-based architectures in roadside 3D detection and tracking. RESOLVE contains over 100k images and 26k point cloud frames with 220k manually annotated bounding boxes, captured at a real-world urban intersection across diverse lighting and weather conditions and spanning 10 classes of traffic participants. In particular, RESOLVE enables controlled evaluation across three LiDAR resolution levels while keeping all other sensing and environmental factors fixed. This allows fair cross-architecture comparisons under point cloud distribution shifts resulting from resolution variations, sensing distance, and training-inference resolution mismatches. Results from extensive benchmark experiments reveal insights into how multimodal fusion can compensate for LiDAR point sparsity, offering clues for designing cost-efficient roadside multimodal perception. The dataset and benchmark codes are available at https://github.com/ASU-Suo-Lab/RESOLVE.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

RESOLVE gives a controlled three-resolution LiDAR dataset captured at one fixed intersection, but the abstract leaves annotation consistency across densities unaddressed.

read the letter

The main thing here is the capture protocol: three LiDAR resolutions recorded simultaneously at the same urban intersection with synchronized cameras and all other factors held constant. This lets people run fair tests of how point sparsity alone affects detection, tracking, and fusion models.

The release itself is solid. Over 100k images and 26k point clouds with 220k boxes across 10 classes and real lighting/weather variation is substantial. Public code and data make it easy for others to reproduce the benchmarks or extend them. The roadside cooperative perception focus matches a practical need for cost-aware sensor placement.

The soft spot is the ground truth. The claim that performance differences come only from resolution shifts requires identical 3D boxes for the exact same objects at each density. Manual labeling of 220k boxes on sparse versus dense clouds can easily produce systematic differences in placement, size, or class, especially under varying conditions. The abstract gives no inter-resolution agreement numbers, no description of whether boxes were annotated independently or copied from the densest cloud, and no per-resolution quality checks. That gap directly weakens the "all other factors fixed" premise.

This paper is for groups working on multimodal roadside perception who want a benchmark for sparsity effects. A reader testing fusion robustness or sensor tradeoffs will get usable data from it.

It deserves peer review so the annotation workflow and consistency evidence can be examined.

Referee Report

1 major / 0 minor

Summary. The paper introduces RESOLVE, a real-world roadside dataset with synchronized multi-resolution LiDAR (three levels) and camera data captured at an urban intersection under varied lighting/weather. It contains >100k images, 26k point-cloud frames, and 220k manually annotated 3D bounding boxes across 10 classes. The central claim is that the dataset supports controlled evaluation of unimodal and camera-LiDAR fusion architectures for 3D detection/tracking by varying only LiDAR resolution while fixing all other sensing and environmental factors, thereby isolating distribution-shift effects; benchmark experiments and public code/data release are also provided.

Significance. A dataset enabling controlled isolation of LiDAR resolution effects in roadside cooperative perception would be a useful addition to the field, particularly given the public release of data and benchmark code. If annotation consistency across resolutions can be verified, the resource would support reproducible cross-architecture comparisons under realistic point-density shifts and training-inference mismatches.

major comments (1)

[Abstract] Abstract and dataset description: The claim that RESOLVE 'enables controlled evaluation across three LiDAR resolution levels while keeping all other sensing and environmental factors fixed' is load-bearing for the paper's contribution. However, no information is supplied on the annotation protocol for the 220k boxes—specifically, whether boxes were annotated independently on each resolution's point cloud, propagated from the highest-resolution scan, or adjusted per resolution. No inter-resolution agreement statistics, inter-annotator agreement, or per-resolution quality metrics are reported. Without these, performance differences cannot be unambiguously attributed to resolution-induced distribution shift.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the constructive feedback. We address the concern about the annotation protocol below and will revise the manuscript accordingly to strengthen the support for our central claim.

read point-by-point responses

Referee: [Abstract] Abstract and dataset description: The claim that RESOLVE 'enables controlled evaluation across three LiDAR resolution levels while keeping all other sensing and environmental factors fixed' is load-bearing for the paper's contribution. However, no information is supplied on the annotation protocol for the 220k boxes—specifically, whether boxes were annotated independently on each resolution's point cloud, propagated from the highest-resolution scan, or adjusted per resolution. No inter-resolution agreement statistics, inter-annotator agreement, or per-resolution quality metrics are reported. Without these, performance differences cannot be unambiguously attributed to resolution-induced distribution shift.

Authors: We agree that the manuscript does not currently supply details on the annotation protocol across resolutions, inter-annotator agreement, or per-resolution quality metrics. This information is necessary to fully support the claim of controlled evaluation by isolating resolution effects. In the revised manuscript we will add a dedicated subsection describing the annotation process (including whether annotations were performed independently per resolution or propagated with adjustments), along with the requested agreement and quality statistics. These additions will enable readers to assess attribution of performance differences to distribution shift. revision: yes

Circularity Check

0 steps flagged

No derivation chain present; dataset release only

full rationale

The manuscript introduces a multi-resolution roadside perception dataset and benchmark protocol. No equations, fitted parameters, predictions, or first-principles derivations appear in the abstract or described full text. Claims about controlled evaluation rest on the physical data collection and manual annotation process rather than any self-referential construction. No self-citation load-bearing steps, ansatz smuggling, or renaming of known results occur. This is the expected 0 outcome for a data-release paper.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The paper contributes a new dataset rather than a derivation; it rests on standard computer-vision data-collection practices.

axioms (1)

domain assumption Manual bounding-box annotations are sufficiently accurate and consistent for quantitative benchmarking across sensor resolutions.
Required for the claim of fair cross-architecture comparisons under controlled conditions.

pith-pipeline@v0.9.1-grok · 5760 in / 1116 out tokens · 26019 ms · 2026-07-01T05:44:29.261978+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

73 extracted references · 51 canonical work pages · 2 internal anchors

[1]

In: Ninth IEEE/ACM International Conference on Internet-of-Things Design and Implementation, IoTDI 2024, Hong Kong, May 13- 16, 2024

Ahmad, F., Shin, C.S., Pang, W., Leong, B., Ghosh, P., Govindan, R.: Coopera- tive infrastructure perception. In: Ninth IEEE/ACM International Conference on Internet-of-Things Design and Implementation, IoTDI 2024, Hong Kong, May 13- 16, 2024. pp. 61–72 (2024).https://doi.org/10.1109/IOTDI61053.2024.00010

work page doi:10.1109/iotdi61053.2024.00010 2024
[2]

In: IEEE/CVF Con- ference on Computer Vision and Pattern Recognition, CVPR 2022, New Orleans, LA, USA, June 18-24, 2022

Bai, X., Hu, Z., Zhu, X., Huang, Q., Chen, Y., Fu, H., Tai, C.: Transfusion: Robust lidar-camera fusion for 3d object detection with transformers. In: IEEE/CVF Con- ference on Computer Vision and Pattern Recognition, CVPR 2022, New Orleans, LA, USA, June 18-24, 2022. pp. 1080–1089 (2022).https://doi.org/10.1109/ CVPR52688.2022.00116

work page arXiv 2022
[3]

In: 2022 IEEE 25th International Conference on Intelligent Trans- portation Systems (ITSC)

Bai, Z., Wu, G., Barth, M.J., Liu, Y., Sisbot, E.A., Oguchi, K.: Pillargrid: Deep learning-based cooperative perception for 3d object detection from onboard- roadside lidar. In: 2022 IEEE 25th International Conference on Intelligent Trans- portation Systems (ITSC). p. 1743–1749. IEEE (Oct 2022).https://doi.org/10. 1109/itsc55140.2022.9921947

work page arXiv 2022
[4]

Me- chanical Systems and Signal Processing204, 110723 (2023).https://doi.org/ 10.1016/j.ymssp.2023.110723

Bai, Z., Wu, G., Barth, M.J., Liu, Y., Sisbot, E.A., Oguchi, K.: Vinet: Lightweight, scalable, and heterogeneous cooperative perception for 3d object detection. Me- chanical Systems and Signal Processing204, 110723 (2023).https://doi.org/ 10.1016/j.ymssp.2023.110723

work page doi:10.1016/j.ymssp.2023.110723 2023
[5]

In: Schenker, P.S

Besl, P.J., McKay, N.D.: Method for registration of 3-D shapes. In: Schenker, P.S. (ed.) Sensor Fusion IV: Control Paradigms and Data Structures. vol. 1611, pp. 586 – 606. International Society for Optics and Photonics, SPIE (1992).https: //doi.org/10.1117/12.57955

work page doi:10.1117/12.57955 1992
[6]

In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (June 2020)

Caesar, H., Bankiti, V., Lang, A.H., Vora, S., Liong, V.E., Xu, Q., Krishnan, A., Pan, Y., Baldan, G., Beijbom, O.: nuscenes: A multimodal dataset for autonomous driving. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (June 2020)

2020
[7]

In: IEEE International Conference on Robotics and Automation, ICRA 2023, London, UK, May 29 - June 2, 2023

Cai, X., Jiang, W., Xu, R., Zhao, W., Ma, J., Liu, S., Li, Y.: Analyzing infrastruc- ture lidar placement with realistic lidar simulation library. In: 2023 IEEE Inter- national Conference on Robotics and Automation (ICRA). pp. 5581–5587 (2023). https://doi.org/10.1109/ICRA48891.2023.10161027

work page doi:10.1109/icra48891.2023.10161027 2023
[8]

In: Proceedings of the 4th ACM/IEEE Symposium on Edge Com- puting

Chen, Q., Ma, X., Tang, S., Guo, J., Yang, Q., Fu, S.: F-cooper: feature based cooperative perception for autonomous vehicle edge computing system using 3d point clouds. In: Proceedings of the 4th ACM/IEEE Symposium on Edge Com- puting. p. 88–100. SEC ’19, Association for Computing Machinery, New York, NY, USA (2019),https://doi.org/10.1145/3318216.3363300

work page doi:10.1145/3318216.3363300 2019
[9]

Contributors, M.: MMDetection3D: OpenMMLab next-generation platform for general 3D object detection.https://github.com/open- mmlab/mmdetection3d (2020), accessed: 25 Jun 2026

2020
[10]

Ding et al

Corral-Soto, E.R., Grandhi, A., He, Y.Y., Rochan, M., Liu, B.: Improving lidar 3d object detection via range-based point cloud density optimization (2023),https: //arxiv.org/abs/2306.05663, accessed: 25 Jun 2026 16 S. Ding et al

work page arXiv 2023
[11]

IEEE Transactions on Intelligent Transportation Systems25(7), 6309–6327 (2024).https://doi.org/10.1109/TITS.2023.3343434

Creß,C.,Bing,Z.,Knoll,A.C.:Intelligenttransportationsystemsusingroadsidein- frastructure: A literature survey. IEEE Transactions on Intelligent Transportation Systems25(7), 6309–6327 (2024).https://doi.org/10.1109/TITS.2023.3343434

work page doi:10.1109/tits.2023.3343434 2024
[12]

Official Journal of the European Union, L 119, 4 May 2016, pp

European Parliament and Council of the European Union: Regulation (eu) 2016/679 of the european parliament and of the council of 27 april 2016 (general data protection regulation). Official Journal of the European Union, L 119, 4 May 2016, pp. 1–88 (2016),https://eur-lex.europa.eu/eli/reg/2016/679/oj/eng, cELEX: 32016R0679; accessed 25 Jun 2026

2016
[13]

Geiger, A., Lenz, P., Stiller, C., Urtasun, R.: Vision meets robotics: The KITTI dataset. Int. J. Robotics Res.32(11), 1231–1237 (2013).https://doi.org/10. 1177/0278364913491297

2013
[14]

Gu, A., Dao, T.: Mamba: Linear-time sequence modeling with selective state spaces (2024),https://arxiv.org/abs/2312.00752, accessed: 25 Jun 2026

work page internal anchor Pith review Pith/arXiv arXiv 2024
[15]

In: Computer Vision - ECCV 2024 - 18th European Conference, Milan, Italy, September 29- October 4, 2024, Proceedings, Part LXXVIII

Hadgi, S., Li, L., Ovsjanikov, M.: To supervise or not to supervise: Understanding and addressing the key challenges of point cloud transfer learning. In: Computer Vision - ECCV 2024 - 18th European Conference, Milan, Italy, September 29- October 4, 2024, Proceedings, Part LXXVIII. pp. 146–163 (2024).https://doi. org/10.1007/978-3-031-73229-4_9

work page doi:10.1007/978-3-031-73229-4_9 2024
[16]

In: 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

Hao, R., Fan, S., Dai, Y., Zhang, Z., Li, C., Wang, Y., Yu, H., Yang, W., Yuan, J., Nie, Z.: Rcooper: A real-world large-scale dataset for roadside cooperative per- ception. In: IEEE/CVF Conference on Computer Vision and Pattern Recogni- tion, CVPR 2024, Seattle, WA, USA, June 16-22, 2024. pp. 22347–22357 (2024). https://doi.org/10.1109/CVPR52733.2024.02109

work page doi:10.1109/cvpr52733.2024.02109 2024
[17]

In: IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2022, New Orleans, LA, USA, June 18-24,

He, C., Li, R., Li, S., Zhang, L.: Voxel set transformer: A set-to-set approach to 3d object detection from point clouds. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2022, New Orleans, LA, USA, June 18-24,

2022
[18]

In: IEEE/CVF Con- ference on Computer Vision and Pattern Recognition, CVPR 2022, New Orleans, LA, USA, June 18-24, 2022

pp. 8407–8417 (2022).https://doi.org/10.1109/CVPR52688.2022.00823

work page doi:10.1109/cvpr52688.2022.00823 2022
[19]

IEEE Transac- tions on Intelligent Vehicles10(2), 1300–1314 (2025).https://doi.org/10.1109/ TIV.2024.3426524

He, Y., Cao, P., Suo, D., Liu, X.: A joint optimization of beam distribution and de- ployment for roadside lidar systems to maximize vehicle perception. IEEE Transac- tions on Intelligent Vehicles10(2), 1300–1314 (2025).https://doi.org/10.1109/ TIV.2024.3426524

work page arXiv 2025
[20]

In: IEEE/CVF Con- ference on Computer Vision and Pattern Recognition, CVPR 2022, New Orleans, LA, USA, June 18-24, 2022

Hu,J.S.K.,Kuai,T.,Waslander,S.L.:Pointdensity-awarevoxelsforlidar3dobject detection. In: IEEE/CVF Conference on Computer Vision and Pattern Recogni- tion, CVPR 2022, New Orleans, LA, USA, June 18-24, 2022. pp. 8459–8468 (2022). https://doi.org/10.1109/CVPR52688.2022.00828

work page doi:10.1109/cvpr52688.2022.00828 2022
[21]

In��

Hu, Q., Liu, D., Hu, W.: Density-insensitive unsupervised domain adaption on 3d object detection. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2023, Vancouver, BC, Canada, June 17-24, 2023. pp. 17556– 17566 (2023).https://doi.org/10.1109/CVPR52729.2023.01684

work page doi:10.1109/cvpr52729.2023.01684 2023
[22]

In: The IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2023)

Hu, Y., Lu, Y., Xu, R., Xie, W., Chen, S., Wang, Y.: Collaboration helps camera overtake lidar in 3d detection. In: The IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2023)

2023
[23]

IEEE 1588: Ieee standard for a precision clock synchronization protocol for net- worked measurement and control systems (2020).https://doi.org/10.1109/ IEEESTD.2020.9120376

work page arXiv 2020
[24]

In: IEEE/CVF International Conference on Computer Vision, ICCV 2023, Paris, France, October 1-6, 2023

Jiang, W., Xiang, H., Cai, X., Xu, R., Ma, J., Li, Y., Lee, G.H., Liu, S.: Optimiz- ing the placement of roadside lidars for autonomous driving. In: 2023 IEEE/CVF International Conference on Computer Vision (ICCV). pp. 18335–18344 (2023). https://doi.org/10.1109/ICCV51070.2023.01685 RESOLVE: A Multi-Resolution and Multi-Modal Roadside Dataset 17

work page doi:10.1109/iccv51070.2023.01685 2023
[25]

Field Robotics2, 1156–1176 (2022).https://doi.org/10.55417/fr.2022038

Krammer, A., Schöller, C., Gulati, D., Lakshminarasimhan, V., Kurz, F., Rosen- baum, D., Lenz, C., Knoll, A.: Providentia — a large-scale sensor system for the assistance of autonomous vehicles and its evaluation. Field Robotics2, 1156–1176 (2022).https://doi.org/10.55417/fr.2022038

work page doi:10.55417/fr.2022038 2022
[26]

In: 50 Years of Integer Programming 1958-2008 - From the Early Years to the State-of-the-Art, pp

Kuhn, H.W.: The hungarian method for the assignment problem. In: 50 Years of Integer Programming 1958-2008 - From the Early Years to the State-of-the-Art, pp. 29–47. Springer (2010).https://doi.org/10.1007/978-3-540-68279-0_2

work page doi:10.1007/978-3-540-68279-0_2 1958
[27]

In: IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2019, Long Beach, CA, USA, June 16-20,

Lang, A.H., Vora, S., Caesar, H., Zhou, L., Yang, J., Beijbom, O.: Pointpillars: Fast encoders for object detection from point clouds. In: IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2019, Long Beach, CA, USA, June 16-20,

2019
[28]

12697–12705 (2019).https://doi.org/10.1109/CVPR.2019.01298

pp. 12697–12705 (2019).https://doi.org/10.1109/CVPR.2019.01298

work page doi:10.1109/cvpr.2019.01298 2019
[29]

Li, H., Cao, B., Liang, Z., Li, W., Oh, J., Chen, Y., Liang, S., Zhou, H., Ma, C., Liu, J., Li, Z., Zhang, P., Long, K., Liu, M., Jiang, J., Yu, C., Liu, S., Yu, H., Li, X.: Cats-v2v: A real-world vehicle-to-vehicle cooperative perception dataset with complex adverse traffic scenarios (2025),https://arxiv.org/abs/2511.11168, accessed: 25 Jun 2026

work page arXiv 2025
[30]

In: Computer Vision - ECCV 2024 - 18th European Conference, Milan, Italy, September 29-October 4, 2024, Proceedings, Part LXIV

Li, S., Ma, L., Li, X.: Domain generalization of 3d object detection by density- resampling. In: Computer Vision - ECCV 2024 - 18th European Conference, Milan, Italy, September 29-October 4, 2024, Proceedings, Part LXIV. pp. 456–473 (2024). https://doi.org/10.1007/978-3-031-73039-9_26

work page doi:10.1007/978-3-031-73039-9_26 2024
[31]

In: 2023 IEEE/RSJ Interna- tionalConferenceonIntelligentRobotsandSystems(IROS).pp.7742–7749(2023)

Li, X., Xie, T., Liu, D., Gao, J., Dai, K., Jiang, Z., Zhao, L., Wang, K.: Poly- mot: A polyhedral framework for 3d multi-object tracking. In: 2023 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). pp. 9391– 9398 (2023).https://doi.org/10.1109/IROS55552.2023.10341778

work page doi:10.1109/iros55552.2023.10341778 2023
[32]

Scott Armstrong, ed.Expert Opinions in Forecasting: The Role of the Delphi Technique

Liu, C., Zhu, M., Ma, C.: H-v2x: A large scale highway dataset for bev percep- tion. In: Computer Vision - ECCV 2024 - 18th European Conference, Milan, Italy, September 29-October 4, 2024, Proceedings, Part I. Lecture Notes in Computer Science, vol. 15059, pp. 139–157. Springer (2024).https://doi.org/10.1007/978- 3-031-73232-4_8

work page doi:10.1007/978- 2024
[33]

In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)

Liu, C., Zhu, M., Zhang, Z., Song, L., Zhao, X., Luo, Q., Wang, Q., Guo, C., Su, K.: Tad-e2e: A large-scale end-to-end autonomous driving dataset. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). pp. 26600– 26609 (October 2025)

2025
[34]

Liu, Z., Hou, J., Wang, X., Ye, X., Wang, J., Zhao, H., Bai, X.: LION: linear group RNN for 3d object detection in point clouds. In: Advances in Neural Informa- tion Processing Systems 37: Annual Conference on Neural Information Processing Systems 2024, NeurIPS 2024, Vancouver, BC, Canada, December 10 - 15, 2024 (2024)

2024
[35]

In: IEEE International Conference on Robotics and Automation, ICRA 2023, London, UK, May 29 - June 2, 2023

Liu, Z., Tang, H., Amini, A., Yang, X., Mao, H., Rus, D.L., Han, S.: Bevfu- sion: Multi-task multi-sensor fusion with unified bird’s-eye view representation. In: IEEE International Conference on Robotics and Automation, ICRA 2023, London, UK, May 29 - June 2, 2023. pp. 2774–2781 (2023).https://doi.org/10.1109/ ICRA48891.2023.10160968

work page arXiv 2023
[36]

In: IEEE/CVF Conference on Com- puter Vision and Pattern Recognition, CVPR 2024, Seattle, WA, USA, June 16-22,

Ma, C., Qiao, L., Zhu, C., Liu, K., Kong, Z., Li, Q., Zhou, X., Kan, Y., Wu, W.: Holovic: Large-scale dataset and benchmark for multi-sensor holographic intersec- tion and vehicle-infrastructure cooperative. In: IEEE/CVF Conference on Com- puter Vision and Pattern Recognition, CVPR 2024, Seattle, WA, USA, June 16-22,

2024
[37]

In: 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

pp. 22129–22138 (2024).https://doi.org/10.1109/CVPR52733.2024.02089

work page doi:10.1109/cvpr52733.2024.02089 2024
[38]

Journal of Machine Learning Research9, 2579–2605 (11 2008) 18 S

van der Maaten, L., Hinton, G., Rachmad, Y.: Visualizing data using t-sne. Journal of Machine Learning Research9, 2579–2605 (11 2008) 18 S. Ding et al

2008
[39]

Science Robotics7(66), eabm6074 (2022).https://doi.org/10.1126/scirobotics.abm6074

Macenski, S., Foote, T., Gerkey, B., Lalancette, C., Woodall, W.: Robot operat- ing system 2: Design, architecture, and uses in the wild. Science Robotics7(66), eabm6074 (2022).https://doi.org/10.1126/scirobotics.abm6074

work page doi:10.1126/scirobotics.abm6074 2022
[40]

society for industrial & applied mathematics (2004)

Madsen, K., Nielsen, H.B., Tingleff, O.: Methods for non-linear least squares prob- lems (2nd ed.). society for industrial & applied mathematics (2004)

2004
[41]

In: Computer Vision - ECCV 2022 Workshops - Tel Aviv, Israel, October 23-27, 2022, Proceedings, Part I

Pang, Z., Li, Z., Wang, N.: Simpletrack: Understanding and rethinking 3d multi- object tracking. In: Computer Vision - ECCV 2022 Workshops - Tel Aviv, Israel, October 23-27, 2022, Proceedings, Part I. pp. 680–696 (2022).https://doi.org/ 10.1007/978-3-031-25056-9_43

work page doi:10.1007/978-3-031-25056-9_43 2022
[42]

In: 2023 IEEE 26th International Con- ference on Intelligent Transportation Systems (ITSC)

Qu, A., Huang, X., Suo, D.: Seip: Simulation-based design and evaluation of infrastructure-based collective perception. In: 2023 IEEE 26th International Con- ference on Intelligent Transportation Systems (ITSC). pp. 3871–3878 (2023). https://doi.org/10.1109/ITSC57777.2023.10422006

work page doi:10.1109/itsc57777.2023.10422006 2023
[43]

Richter, J., Faion, F., Feng, D., Becker, P.B., Sielecki, P., Glaeser, C.: Un- derstanding the domain gap in lidar object detection networks (2022),https: //arxiv.org/abs/2204.10024, accessed: 25 Jun 2026

work page arXiv 2022
[44]

Sekaran, K.C., Geisler, M., Rößle, D., Mohan, A., Cremers, D., Utschick, W., Botsch,M.,Huber,W.,Schön,T.:Urbaning-v2x:Alarge-scalemulti-vehicle,multi- infrastructure dataset across multiple intersections for cooperative perception. In: Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2025, Ne...

2025
[45]

Momentum Contrast for Unsupervised Visual Representation Learning

Sun,P.,Kretzschmar,H.,Dotiwalla,X.,Chouard,A.,Patnaik,V.,Tsui,P.,Guo,J., Zhou, Y., Chai, Y., Caine, B., Vasudevan, V., Han, W., Ngiam, J., Zhao, H., Timo- feev, A., Ettinger, S., Krivokon, M., Gao, A., Joshi, A., Zhang, Y., Shlens, J., Chen, Z., Anguelov, D.: Scalability in perception for autonomous driving: Waymo open dataset. In: 2020 IEEE/CVF Conferenc...

work page doi:10.1109/cvpr42600 2020
[46]

Team, O.D.: Openpcdet: An open-source toolbox for 3d object detection from point clouds.https://github.com/open-mmlab/OpenPCDet(2020), accessed: 25 Jun 2026

2020
[47]

In: Proceedings of the 31st International Conference on Neural Information Processing Systems

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, L., Polosukhin, I.: Attention is all you need. In: Proceedings of the 31st International Conference on Neural Information Processing Systems. p. 6000–6010. NIPS’17, Curran Associates Inc., Red Hook, NY, USA (2017)

2017
[48]

Wang, B., Meng, S., Zhang, L., Wang, C., Huang, J., Li, Y., Ren, H., Xiao, Y., Peng, Y., Ji, J., Zhang, Y., Zhang, Y.: Corp: A multi-modal dataset for campus- oriented roadside perception tasks (2024),https://arxiv.org/abs/2404.03191, accessed: 25 Jun 2026

work page internal anchor Pith review Pith/arXiv arXiv 2024
[49]

In: 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

Wang, H., Shi, C., Shi, S., Lei, M., Wang, S., He, D., Schiele, B., Wang, L.: DSVT: dynamic sparse voxel transformer with rotated sets. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2023, Vancouver, BC, Canada, June 17-24, 2023. pp. 13520–13529 (2023).https://doi.org/10.1109/CVPR52729. 2023.01299

work page doi:10.1109/cvpr52729 2023
[50]

In: IEEE/CVF International Conference on Computer Vision, ICCV 2023, Paris, France, October 1-6, 2023

Wang, H., Tang, H., Shi, S., Li, A., Li, Z., Schiele, B., Wang, L.: Unitr: A uni- fied and efficient multi-modal transformer for bird’s-eye-view representation. In: IEEE/CVF International Conference on Computer Vision, ICCV 2023, Paris, France, October 1-6, 2023. pp. 6769–6779 (2023).https://doi.org/10.1109/ ICCV51070.2023.00625 RESOLVE: A Multi-Resolutio...

work page arXiv 2023
[51]

In: 2022 In- ternational Conference on Robotics and Automation (ICRA)

Wang, H., Zhang, X., Li, Z., Li, J., Wang, K., Lei, Z., Haibing, R.: Ips300+: a challenging multi-modal data sets for intersection perception system. In: 2022 In- ternational Conference on Robotics and Automation (ICRA). p. 2539–2545 (2022). https://doi.org/10.1109/ICRA46639.2022.9811699

work page doi:10.1109/icra46639.2022.9811699 2022
[52]

In: 2025 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)

Wang, X., Qi, S., Zhao, J., Zhou, H., Zhang, S., Wang, G., Tu, K., Guo, S., Zhao, J., Li,J.,Qin,H.,Yang,M.:Mctrack:Aunified3dmulti-objecttrackingframeworkfor autonomous driving. In: 2025 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). pp. 4551–4558 (2025).https://doi.org/10.1109/ IROS60139.2025.11245874

work page arXiv 2025
[53]

In: 2025 IEEE/CVF International Con- ference on Computer Vision (ICCV)

Wei, S., Luo, C., Luo, Y.: Boosting multimodal learning via disentangled gra- dient learning. In: IEEE/CVF International Conference on Computer Vision, ICCV 2025, Honolulu, HI, USA, October 19-25, 2025. pp. 1–10 (2025).https: //doi.org/10.1109/ICCV51701.2025.02124

work page doi:10.1109/iccv51701.2025.02124 2025
[54]

In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G

Wei, Y., Wei, Z., Rao, Y., Li, J., Zhou, J., Lu, J.: Lidar distillation: Bridging the beam-induced domain gap for 3d object detection. In: Computer Vision - ECCV 2022 - 17th European Conference, Tel Aviv, Israel, October 23-27, 2022, Proceed- ings, Part XXXIX. pp. 179–195 (2022).https://doi.org/10.1007/978-3-031- 19842-7_11

work page doi:10.1007/978-3-031- 2022
[55]

In: 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)

Weng, X., Wang, J., Held, D., Kitani, K.: 3d multi-object tracking: A baseline and new evaluation metrics. In: IEEE/RSJ International Conference on Intelligent Robots and Systems, IROS 2020, Las Vegas, NV, USA, October 24, 2020 - January 24, 2021. pp. 10359–10366 (2020).https://doi.org/10.1109/IROS45743.2020. 9341164

work page doi:10.1109/iros45743.2020 2020
[56]

In: Computer Vision – ECCV 2024: 18th European Conference, Milan, Italy, September 29–October 4, 2024, Proceedings, Part LII

Xiang, H., Zheng, Z., Xia, X., Xu, R., Gao, L., Zhou, Z., Han, X., Ji, X., Li, M., Meng, Z., Jin, L., Lei, M., Ma, Z., He, Z., Ma, H., Yuan, Y., Zhao, Y., Ma, J.: V2x-real: A large-scale dataset for vehicle-to-everything cooperative per- ception. In: Computer Vision – ECCV 2024: 18th European Conference, Milan, Italy, September 29–October 4, 2024, Proceed...

work page doi:10.1007/978-3-031-72943-0_26 2024
[57]

IEEE Transactions on Multimedia26, 5536–5547 (2024).https://doi.org/10.1109/TMM.2023.3335879

Xiao, A., Guan, D., Zhang, X., Lu, S.: Domain adaptive lidar point cloud segmenta- tion with 3d spatial consistency. IEEE Transactions on Multimedia26, 5536–5547 (2024).https://doi.org/10.1109/TMM.2023.3335879

work page doi:10.1109/tmm.2023.3335879 2024
[58]

In: Conference on Robot Learning, CoRL 2022, 14-18 December 2022, Auckland, New Zealand

Xu, R., Tu, Z., Xiang, H., Shao, W., Zhou, B., Ma, J.: Cobevt: Cooperative bird’s eye view semantic segmentation with sparse transformers. In: Conference on Robot Learning, CoRL 2022, 14-18 December 2022, Auckland, New Zealand. pp. 989– 1000 (2022),https://proceedings.mlr.press/v205/xu23a.html, accessed: 25 Jun 2026

2022
[59]

In: The IEEE/CVF Computer Vision and Pattern Recognition Conference (CVPR) (2023)

Xu, R., Xia, X., Li, J., Li, H., Zhang, S., Tu, Z., Meng, Z., Xiang, H., Dong, X., Song, R., Yu, H., Zhou, B., Ma, J.: V2v4real: A real-world large-scale dataset for vehicle-to-vehicle cooperative perception. In: The IEEE/CVF Computer Vision and Pattern Recognition Conference (CVPR) (2023)

2023
[60]

In: Computer Vision - ECCV 2022 - 17th European Conference, Tel Aviv, Israel, October 23-27, 2022, Proceedings, Part XXXIX

Xu, R., Xiang, H., Tu, Z., Xia, X., Yang, M., Ma, J.: V2x-vit: Vehicle-to-everything cooperative perception with vision transformer. In: Computer Vision - ECCV 2022 - 17th European Conference, Tel Aviv, Israel, October 23-27, 2022, Proceedings, Part XXXIX. pp. 107–124 (2022).https://doi.org/10.1007/978-3-031-19842- 7_7

work page doi:10.1007/978-3-031-19842- 2022
[61]

In: 2022 In- ternational Conference on Robotics and Automation (ICRA)

Xu, R., Xiang, H., Xia, X., Han, X., Li, J., Ma, J.: OPV2V: an open bench- mark dataset and fusion pipeline for perception with vehicle-to-vehicle commu- nication. In: 2022 International Conference on Robotics and Automation, ICRA 20 S. Ding et al. 2022, Philadelphia, PA, USA, May 23-27, 2022. pp. 2583–2589 (2022).https: //doi.org/10.1109/ICRA46639.2022.9812038

work page doi:10.1109/icra46639.2022.9812038 2022
[62]

Sen- sors18(10) (2018).https://doi.org/10.3390/s18103337

Yan, Y., Mao, Y., Li, B.: Second: Sparsely embedded convolutional detection. Sen- sors18(10) (2018).https://doi.org/10.3390/s18103337

work page doi:10.3390/s18103337 2018
[63]

Advances in Neural Information Processing Systems (NeurIPS) (2025)

Yang, L., Zhang, X., Li, J., Wang, C., Ma, J., Song, Z., Zhao, T., Song, Z., Wang, L., Zhou, M., Shen, Y., Lv, C.: V2x-radar: A multi-modal dataset with 4d radar for cooperative perception. Advances in Neural Information Processing Systems (NeurIPS) (2025)

2025
[64]

In: IEEE/CVF Con- ference on Computer Vision and Pattern Recognition, CVPR 2022, New Orleans, LA, USA, June 18-24, 2022

Ye, X., Shu, M., Li, H., Shi, Y., Li, Y., Wang, G., Tan, X., Ding, E.: Rope3D: The Roadside Perception Dataset for Autonomous Driving and Monocular 3D Object Detection Task. In: Conference on Computer Vision and Pattern Recognition. pp. 21341–21350 (2022).https://doi.org/10.1109/CVPR52688.2022.02065

work page doi:10.1109/cvpr52688.2022.02065 2022
[65]

In: IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2021, virtual, June 19-25, 2021

Yin, T., Zhou, X., Krähenbühl, P.: Center-based 3d object detection and tracking. In: IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2021, virtual, June 19-25, 2021. pp. 11784–11793 (2021).https://doi.org/10.1109/ CVPR46437.2021.01161

work page arXiv 2021
[66]

Yongqiang, D., Dengjiang, W., Gang, C., Bing, M., Xijia, G., Yajun, W., Jian- chao, L., Yanming, F., Juanjuan, L.: Baai-vanjee roadside dataset: Towards the connected automated vehicle highway technologies in challenging environments of china (2021),https://arxiv.org/abs/2105.14370, accessed: 25 Jun 2026

work page arXiv 2021
[67]

In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

Yu, H., Luo, Y., Shu, M., Huo, Y., Yang, Z., Shi, Y., Guo, Z., Li, H., Hu, X., Yuan, J., Nie, Z.: Dair-v2x: A large-scale dataset for vehicle-infrastructure cooperative 3d object detection. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 21361–21370 (2022)

2022
[68]

Zhang, G., Fan, L., He, C., Lei, Z., Zhang, Z., Zhang, L.: Voxel mamba: Group-free statespacemodelsforpointcloudbased3dobjectdetection.In:AdvancesinNeural Information Processing Systems 37: Annual Conference on Neural Information Processing Systems 2024, NeurIPS 2024, Vancouver, BC, Canada, December 10 - 15, 2024 (2024)

2024
[69]

A flexible new technique for camera calibration

Zhang, Z.: A flexible new technique for camera calibration. IEEE Transactions on Pattern Analysis and Machine Intelligence22(11), 1330–1334 (2000).https: //doi.org/10.1109/34.888718

work page doi:10.1109/34.888718 2000
[70]

IEEE transactions on pattern analysis and ma- chine intelligence48 6, 7032–7049 (2026).https://doi.org/10.1109/TPAMI

Zhu, L.Z.Y.L.A.L.L.L.R.M.B.B.M.: Omnihd-scenes: A next-generation multimodal dataset for autonomous driving. IEEE transactions on pattern analysis and ma- chine intelligence48 6, 7032–7049 (2026).https://doi.org/10.1109/TPAMI. 2026.3663672

work page doi:10.1109/tpami 2026
[71]

In: European Conference on Computer Vision

Zhu, X., Sheng, H., Cai, S., Deng, B., Yang, S., Liang, Q., Chen, K., Gao, L., Song, J., Ye, J.: Roscenes: A large-scale multi-view 3d dataset for roadside perception. In: European Conference on Computer Vision. pp. 331–347. Springer (2024)

2024
[72]

In: 2023 IEEE 26th International Con- ference on Intelligent Transportation Systems (ITSC)

Zimmer, W., Creß, C., Nguyen, H.T., Knoll, A.C.: Tumtraf intersection dataset: All you need for urban 3d camera-lidar roadside perception. In: 2023 IEEE 26th International Conference on Intelligent Transportation Systems (ITSC). pp. 1030– 1037 (2023).https://doi.org/10.1109/ITSC57777.2023.10422289

work page doi:10.1109/itsc57777.2023.10422289 2023
[73]

9.754 Meters 6.172 Meters 36' 6

Zimmer, W., Wardana, G.A., Sritharan, S., Zhou, X., Song, R., Knoll, A.C.: Tum- traf v2x cooperative perception dataset. 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) pp. 22668–22677 (2024) RESOLVE: A Multi-Resolution and Multi-Modal Roadside Dataset 21 Appendix A Sensor Setup This section provides more details of sensor setup...

2024

[1] [1]

In: Ninth IEEE/ACM International Conference on Internet-of-Things Design and Implementation, IoTDI 2024, Hong Kong, May 13- 16, 2024

Ahmad, F., Shin, C.S., Pang, W., Leong, B., Ghosh, P., Govindan, R.: Coopera- tive infrastructure perception. In: Ninth IEEE/ACM International Conference on Internet-of-Things Design and Implementation, IoTDI 2024, Hong Kong, May 13- 16, 2024. pp. 61–72 (2024).https://doi.org/10.1109/IOTDI61053.2024.00010

work page doi:10.1109/iotdi61053.2024.00010 2024

[2] [2]

In: IEEE/CVF Con- ference on Computer Vision and Pattern Recognition, CVPR 2022, New Orleans, LA, USA, June 18-24, 2022

Bai, X., Hu, Z., Zhu, X., Huang, Q., Chen, Y., Fu, H., Tai, C.: Transfusion: Robust lidar-camera fusion for 3d object detection with transformers. In: IEEE/CVF Con- ference on Computer Vision and Pattern Recognition, CVPR 2022, New Orleans, LA, USA, June 18-24, 2022. pp. 1080–1089 (2022).https://doi.org/10.1109/ CVPR52688.2022.00116

work page arXiv 2022

[3] [3]

In: 2022 IEEE 25th International Conference on Intelligent Trans- portation Systems (ITSC)

Bai, Z., Wu, G., Barth, M.J., Liu, Y., Sisbot, E.A., Oguchi, K.: Pillargrid: Deep learning-based cooperative perception for 3d object detection from onboard- roadside lidar. In: 2022 IEEE 25th International Conference on Intelligent Trans- portation Systems (ITSC). p. 1743–1749. IEEE (Oct 2022).https://doi.org/10. 1109/itsc55140.2022.9921947

work page arXiv 2022

[4] [4]

Me- chanical Systems and Signal Processing204, 110723 (2023).https://doi.org/ 10.1016/j.ymssp.2023.110723

Bai, Z., Wu, G., Barth, M.J., Liu, Y., Sisbot, E.A., Oguchi, K.: Vinet: Lightweight, scalable, and heterogeneous cooperative perception for 3d object detection. Me- chanical Systems and Signal Processing204, 110723 (2023).https://doi.org/ 10.1016/j.ymssp.2023.110723

work page doi:10.1016/j.ymssp.2023.110723 2023

[5] [5]

In: Schenker, P.S

Besl, P.J., McKay, N.D.: Method for registration of 3-D shapes. In: Schenker, P.S. (ed.) Sensor Fusion IV: Control Paradigms and Data Structures. vol. 1611, pp. 586 – 606. International Society for Optics and Photonics, SPIE (1992).https: //doi.org/10.1117/12.57955

work page doi:10.1117/12.57955 1992

[6] [6]

In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (June 2020)

Caesar, H., Bankiti, V., Lang, A.H., Vora, S., Liong, V.E., Xu, Q., Krishnan, A., Pan, Y., Baldan, G., Beijbom, O.: nuscenes: A multimodal dataset for autonomous driving. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (June 2020)

2020

[7] [7]

In: IEEE International Conference on Robotics and Automation, ICRA 2023, London, UK, May 29 - June 2, 2023

Cai, X., Jiang, W., Xu, R., Zhao, W., Ma, J., Liu, S., Li, Y.: Analyzing infrastruc- ture lidar placement with realistic lidar simulation library. In: 2023 IEEE Inter- national Conference on Robotics and Automation (ICRA). pp. 5581–5587 (2023). https://doi.org/10.1109/ICRA48891.2023.10161027

work page doi:10.1109/icra48891.2023.10161027 2023

[8] [8]

In: Proceedings of the 4th ACM/IEEE Symposium on Edge Com- puting

Chen, Q., Ma, X., Tang, S., Guo, J., Yang, Q., Fu, S.: F-cooper: feature based cooperative perception for autonomous vehicle edge computing system using 3d point clouds. In: Proceedings of the 4th ACM/IEEE Symposium on Edge Com- puting. p. 88–100. SEC ’19, Association for Computing Machinery, New York, NY, USA (2019),https://doi.org/10.1145/3318216.3363300

work page doi:10.1145/3318216.3363300 2019

[9] [9]

Contributors, M.: MMDetection3D: OpenMMLab next-generation platform for general 3D object detection.https://github.com/open- mmlab/mmdetection3d (2020), accessed: 25 Jun 2026

2020

[10] [10]

Ding et al

Corral-Soto, E.R., Grandhi, A., He, Y.Y., Rochan, M., Liu, B.: Improving lidar 3d object detection via range-based point cloud density optimization (2023),https: //arxiv.org/abs/2306.05663, accessed: 25 Jun 2026 16 S. Ding et al

work page arXiv 2023

[11] [11]

IEEE Transactions on Intelligent Transportation Systems25(7), 6309–6327 (2024).https://doi.org/10.1109/TITS.2023.3343434

Creß,C.,Bing,Z.,Knoll,A.C.:Intelligenttransportationsystemsusingroadsidein- frastructure: A literature survey. IEEE Transactions on Intelligent Transportation Systems25(7), 6309–6327 (2024).https://doi.org/10.1109/TITS.2023.3343434

work page doi:10.1109/tits.2023.3343434 2024

[12] [12]

Official Journal of the European Union, L 119, 4 May 2016, pp

European Parliament and Council of the European Union: Regulation (eu) 2016/679 of the european parliament and of the council of 27 april 2016 (general data protection regulation). Official Journal of the European Union, L 119, 4 May 2016, pp. 1–88 (2016),https://eur-lex.europa.eu/eli/reg/2016/679/oj/eng, cELEX: 32016R0679; accessed 25 Jun 2026

2016

[13] [13]

Geiger, A., Lenz, P., Stiller, C., Urtasun, R.: Vision meets robotics: The KITTI dataset. Int. J. Robotics Res.32(11), 1231–1237 (2013).https://doi.org/10. 1177/0278364913491297

2013

[14] [14]

Gu, A., Dao, T.: Mamba: Linear-time sequence modeling with selective state spaces (2024),https://arxiv.org/abs/2312.00752, accessed: 25 Jun 2026

work page internal anchor Pith review Pith/arXiv arXiv 2024

[15] [15]

In: Computer Vision - ECCV 2024 - 18th European Conference, Milan, Italy, September 29- October 4, 2024, Proceedings, Part LXXVIII

Hadgi, S., Li, L., Ovsjanikov, M.: To supervise or not to supervise: Understanding and addressing the key challenges of point cloud transfer learning. In: Computer Vision - ECCV 2024 - 18th European Conference, Milan, Italy, September 29- October 4, 2024, Proceedings, Part LXXVIII. pp. 146–163 (2024).https://doi. org/10.1007/978-3-031-73229-4_9

work page doi:10.1007/978-3-031-73229-4_9 2024

[16] [16]

In: 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

Hao, R., Fan, S., Dai, Y., Zhang, Z., Li, C., Wang, Y., Yu, H., Yang, W., Yuan, J., Nie, Z.: Rcooper: A real-world large-scale dataset for roadside cooperative per- ception. In: IEEE/CVF Conference on Computer Vision and Pattern Recogni- tion, CVPR 2024, Seattle, WA, USA, June 16-22, 2024. pp. 22347–22357 (2024). https://doi.org/10.1109/CVPR52733.2024.02109

work page doi:10.1109/cvpr52733.2024.02109 2024

[17] [17]

In: IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2022, New Orleans, LA, USA, June 18-24,

He, C., Li, R., Li, S., Zhang, L.: Voxel set transformer: A set-to-set approach to 3d object detection from point clouds. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2022, New Orleans, LA, USA, June 18-24,

2022

[18] [18]

In: IEEE/CVF Con- ference on Computer Vision and Pattern Recognition, CVPR 2022, New Orleans, LA, USA, June 18-24, 2022

pp. 8407–8417 (2022).https://doi.org/10.1109/CVPR52688.2022.00823

work page doi:10.1109/cvpr52688.2022.00823 2022

[19] [19]

IEEE Transac- tions on Intelligent Vehicles10(2), 1300–1314 (2025).https://doi.org/10.1109/ TIV.2024.3426524

He, Y., Cao, P., Suo, D., Liu, X.: A joint optimization of beam distribution and de- ployment for roadside lidar systems to maximize vehicle perception. IEEE Transac- tions on Intelligent Vehicles10(2), 1300–1314 (2025).https://doi.org/10.1109/ TIV.2024.3426524

work page arXiv 2025

[20] [20]

In: IEEE/CVF Con- ference on Computer Vision and Pattern Recognition, CVPR 2022, New Orleans, LA, USA, June 18-24, 2022

Hu,J.S.K.,Kuai,T.,Waslander,S.L.:Pointdensity-awarevoxelsforlidar3dobject detection. In: IEEE/CVF Conference on Computer Vision and Pattern Recogni- tion, CVPR 2022, New Orleans, LA, USA, June 18-24, 2022. pp. 8459–8468 (2022). https://doi.org/10.1109/CVPR52688.2022.00828

work page doi:10.1109/cvpr52688.2022.00828 2022

[21] [21]

In��

Hu, Q., Liu, D., Hu, W.: Density-insensitive unsupervised domain adaption on 3d object detection. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2023, Vancouver, BC, Canada, June 17-24, 2023. pp. 17556– 17566 (2023).https://doi.org/10.1109/CVPR52729.2023.01684

work page doi:10.1109/cvpr52729.2023.01684 2023

[22] [22]

In: The IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2023)

Hu, Y., Lu, Y., Xu, R., Xie, W., Chen, S., Wang, Y.: Collaboration helps camera overtake lidar in 3d detection. In: The IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2023)

2023

[23] [23]

IEEE 1588: Ieee standard for a precision clock synchronization protocol for net- worked measurement and control systems (2020).https://doi.org/10.1109/ IEEESTD.2020.9120376

work page arXiv 2020

[24] [24]

In: IEEE/CVF International Conference on Computer Vision, ICCV 2023, Paris, France, October 1-6, 2023

Jiang, W., Xiang, H., Cai, X., Xu, R., Ma, J., Li, Y., Lee, G.H., Liu, S.: Optimiz- ing the placement of roadside lidars for autonomous driving. In: 2023 IEEE/CVF International Conference on Computer Vision (ICCV). pp. 18335–18344 (2023). https://doi.org/10.1109/ICCV51070.2023.01685 RESOLVE: A Multi-Resolution and Multi-Modal Roadside Dataset 17

work page doi:10.1109/iccv51070.2023.01685 2023

[25] [25]

Field Robotics2, 1156–1176 (2022).https://doi.org/10.55417/fr.2022038

Krammer, A., Schöller, C., Gulati, D., Lakshminarasimhan, V., Kurz, F., Rosen- baum, D., Lenz, C., Knoll, A.: Providentia — a large-scale sensor system for the assistance of autonomous vehicles and its evaluation. Field Robotics2, 1156–1176 (2022).https://doi.org/10.55417/fr.2022038

work page doi:10.55417/fr.2022038 2022

[26] [26]

In: 50 Years of Integer Programming 1958-2008 - From the Early Years to the State-of-the-Art, pp

Kuhn, H.W.: The hungarian method for the assignment problem. In: 50 Years of Integer Programming 1958-2008 - From the Early Years to the State-of-the-Art, pp. 29–47. Springer (2010).https://doi.org/10.1007/978-3-540-68279-0_2

work page doi:10.1007/978-3-540-68279-0_2 1958

[27] [27]

In: IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2019, Long Beach, CA, USA, June 16-20,

Lang, A.H., Vora, S., Caesar, H., Zhou, L., Yang, J., Beijbom, O.: Pointpillars: Fast encoders for object detection from point clouds. In: IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2019, Long Beach, CA, USA, June 16-20,

2019

[28] [28]

12697–12705 (2019).https://doi.org/10.1109/CVPR.2019.01298

pp. 12697–12705 (2019).https://doi.org/10.1109/CVPR.2019.01298

work page doi:10.1109/cvpr.2019.01298 2019

[29] [29]

Li, H., Cao, B., Liang, Z., Li, W., Oh, J., Chen, Y., Liang, S., Zhou, H., Ma, C., Liu, J., Li, Z., Zhang, P., Long, K., Liu, M., Jiang, J., Yu, C., Liu, S., Yu, H., Li, X.: Cats-v2v: A real-world vehicle-to-vehicle cooperative perception dataset with complex adverse traffic scenarios (2025),https://arxiv.org/abs/2511.11168, accessed: 25 Jun 2026

work page arXiv 2025

[30] [30]

In: Computer Vision - ECCV 2024 - 18th European Conference, Milan, Italy, September 29-October 4, 2024, Proceedings, Part LXIV

Li, S., Ma, L., Li, X.: Domain generalization of 3d object detection by density- resampling. In: Computer Vision - ECCV 2024 - 18th European Conference, Milan, Italy, September 29-October 4, 2024, Proceedings, Part LXIV. pp. 456–473 (2024). https://doi.org/10.1007/978-3-031-73039-9_26

work page doi:10.1007/978-3-031-73039-9_26 2024

[31] [31]

In: 2023 IEEE/RSJ Interna- tionalConferenceonIntelligentRobotsandSystems(IROS).pp.7742–7749(2023)

Li, X., Xie, T., Liu, D., Gao, J., Dai, K., Jiang, Z., Zhao, L., Wang, K.: Poly- mot: A polyhedral framework for 3d multi-object tracking. In: 2023 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). pp. 9391– 9398 (2023).https://doi.org/10.1109/IROS55552.2023.10341778

work page doi:10.1109/iros55552.2023.10341778 2023

[32] [32]

Scott Armstrong, ed.Expert Opinions in Forecasting: The Role of the Delphi Technique

Liu, C., Zhu, M., Ma, C.: H-v2x: A large scale highway dataset for bev percep- tion. In: Computer Vision - ECCV 2024 - 18th European Conference, Milan, Italy, September 29-October 4, 2024, Proceedings, Part I. Lecture Notes in Computer Science, vol. 15059, pp. 139–157. Springer (2024).https://doi.org/10.1007/978- 3-031-73232-4_8

work page doi:10.1007/978- 2024

[33] [33]

In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)

Liu, C., Zhu, M., Zhang, Z., Song, L., Zhao, X., Luo, Q., Wang, Q., Guo, C., Su, K.: Tad-e2e: A large-scale end-to-end autonomous driving dataset. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). pp. 26600– 26609 (October 2025)

2025

[34] [34]

Liu, Z., Hou, J., Wang, X., Ye, X., Wang, J., Zhao, H., Bai, X.: LION: linear group RNN for 3d object detection in point clouds. In: Advances in Neural Informa- tion Processing Systems 37: Annual Conference on Neural Information Processing Systems 2024, NeurIPS 2024, Vancouver, BC, Canada, December 10 - 15, 2024 (2024)

2024

[35] [35]

In: IEEE International Conference on Robotics and Automation, ICRA 2023, London, UK, May 29 - June 2, 2023

Liu, Z., Tang, H., Amini, A., Yang, X., Mao, H., Rus, D.L., Han, S.: Bevfu- sion: Multi-task multi-sensor fusion with unified bird’s-eye view representation. In: IEEE International Conference on Robotics and Automation, ICRA 2023, London, UK, May 29 - June 2, 2023. pp. 2774–2781 (2023).https://doi.org/10.1109/ ICRA48891.2023.10160968

work page arXiv 2023

[36] [36]

In: IEEE/CVF Conference on Com- puter Vision and Pattern Recognition, CVPR 2024, Seattle, WA, USA, June 16-22,

Ma, C., Qiao, L., Zhu, C., Liu, K., Kong, Z., Li, Q., Zhou, X., Kan, Y., Wu, W.: Holovic: Large-scale dataset and benchmark for multi-sensor holographic intersec- tion and vehicle-infrastructure cooperative. In: IEEE/CVF Conference on Com- puter Vision and Pattern Recognition, CVPR 2024, Seattle, WA, USA, June 16-22,

2024

[37] [37]

In: 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

pp. 22129–22138 (2024).https://doi.org/10.1109/CVPR52733.2024.02089

work page doi:10.1109/cvpr52733.2024.02089 2024

[38] [38]

Journal of Machine Learning Research9, 2579–2605 (11 2008) 18 S

van der Maaten, L., Hinton, G., Rachmad, Y.: Visualizing data using t-sne. Journal of Machine Learning Research9, 2579–2605 (11 2008) 18 S. Ding et al

2008

[39] [39]

Science Robotics7(66), eabm6074 (2022).https://doi.org/10.1126/scirobotics.abm6074

Macenski, S., Foote, T., Gerkey, B., Lalancette, C., Woodall, W.: Robot operat- ing system 2: Design, architecture, and uses in the wild. Science Robotics7(66), eabm6074 (2022).https://doi.org/10.1126/scirobotics.abm6074

work page doi:10.1126/scirobotics.abm6074 2022

[40] [40]

society for industrial & applied mathematics (2004)

Madsen, K., Nielsen, H.B., Tingleff, O.: Methods for non-linear least squares prob- lems (2nd ed.). society for industrial & applied mathematics (2004)

2004

[41] [41]

In: Computer Vision - ECCV 2022 Workshops - Tel Aviv, Israel, October 23-27, 2022, Proceedings, Part I

Pang, Z., Li, Z., Wang, N.: Simpletrack: Understanding and rethinking 3d multi- object tracking. In: Computer Vision - ECCV 2022 Workshops - Tel Aviv, Israel, October 23-27, 2022, Proceedings, Part I. pp. 680–696 (2022).https://doi.org/ 10.1007/978-3-031-25056-9_43

work page doi:10.1007/978-3-031-25056-9_43 2022

[42] [42]

In: 2023 IEEE 26th International Con- ference on Intelligent Transportation Systems (ITSC)

Qu, A., Huang, X., Suo, D.: Seip: Simulation-based design and evaluation of infrastructure-based collective perception. In: 2023 IEEE 26th International Con- ference on Intelligent Transportation Systems (ITSC). pp. 3871–3878 (2023). https://doi.org/10.1109/ITSC57777.2023.10422006

work page doi:10.1109/itsc57777.2023.10422006 2023

[43] [43]

Richter, J., Faion, F., Feng, D., Becker, P.B., Sielecki, P., Glaeser, C.: Un- derstanding the domain gap in lidar object detection networks (2022),https: //arxiv.org/abs/2204.10024, accessed: 25 Jun 2026

work page arXiv 2022

[44] [44]

Sekaran, K.C., Geisler, M., Rößle, D., Mohan, A., Cremers, D., Utschick, W., Botsch,M.,Huber,W.,Schön,T.:Urbaning-v2x:Alarge-scalemulti-vehicle,multi- infrastructure dataset across multiple intersections for cooperative perception. In: Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2025, Ne...

2025

[45] [45]

Momentum Contrast for Unsupervised Visual Representation Learning

Sun,P.,Kretzschmar,H.,Dotiwalla,X.,Chouard,A.,Patnaik,V.,Tsui,P.,Guo,J., Zhou, Y., Chai, Y., Caine, B., Vasudevan, V., Han, W., Ngiam, J., Zhao, H., Timo- feev, A., Ettinger, S., Krivokon, M., Gao, A., Joshi, A., Zhang, Y., Shlens, J., Chen, Z., Anguelov, D.: Scalability in perception for autonomous driving: Waymo open dataset. In: 2020 IEEE/CVF Conferenc...

work page doi:10.1109/cvpr42600 2020

[46] [46]

Team, O.D.: Openpcdet: An open-source toolbox for 3d object detection from point clouds.https://github.com/open-mmlab/OpenPCDet(2020), accessed: 25 Jun 2026

2020

[47] [47]

In: Proceedings of the 31st International Conference on Neural Information Processing Systems

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, L., Polosukhin, I.: Attention is all you need. In: Proceedings of the 31st International Conference on Neural Information Processing Systems. p. 6000–6010. NIPS’17, Curran Associates Inc., Red Hook, NY, USA (2017)

2017

[48] [48]

Wang, B., Meng, S., Zhang, L., Wang, C., Huang, J., Li, Y., Ren, H., Xiao, Y., Peng, Y., Ji, J., Zhang, Y., Zhang, Y.: Corp: A multi-modal dataset for campus- oriented roadside perception tasks (2024),https://arxiv.org/abs/2404.03191, accessed: 25 Jun 2026

work page internal anchor Pith review Pith/arXiv arXiv 2024

[49] [49]

In: 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

Wang, H., Shi, C., Shi, S., Lei, M., Wang, S., He, D., Schiele, B., Wang, L.: DSVT: dynamic sparse voxel transformer with rotated sets. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2023, Vancouver, BC, Canada, June 17-24, 2023. pp. 13520–13529 (2023).https://doi.org/10.1109/CVPR52729. 2023.01299

work page doi:10.1109/cvpr52729 2023

[50] [50]

In: IEEE/CVF International Conference on Computer Vision, ICCV 2023, Paris, France, October 1-6, 2023

Wang, H., Tang, H., Shi, S., Li, A., Li, Z., Schiele, B., Wang, L.: Unitr: A uni- fied and efficient multi-modal transformer for bird’s-eye-view representation. In: IEEE/CVF International Conference on Computer Vision, ICCV 2023, Paris, France, October 1-6, 2023. pp. 6769–6779 (2023).https://doi.org/10.1109/ ICCV51070.2023.00625 RESOLVE: A Multi-Resolutio...

work page arXiv 2023

[51] [51]

In: 2022 In- ternational Conference on Robotics and Automation (ICRA)

Wang, H., Zhang, X., Li, Z., Li, J., Wang, K., Lei, Z., Haibing, R.: Ips300+: a challenging multi-modal data sets for intersection perception system. In: 2022 In- ternational Conference on Robotics and Automation (ICRA). p. 2539–2545 (2022). https://doi.org/10.1109/ICRA46639.2022.9811699

work page doi:10.1109/icra46639.2022.9811699 2022

[52] [52]

In: 2025 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)

Wang, X., Qi, S., Zhao, J., Zhou, H., Zhang, S., Wang, G., Tu, K., Guo, S., Zhao, J., Li,J.,Qin,H.,Yang,M.:Mctrack:Aunified3dmulti-objecttrackingframeworkfor autonomous driving. In: 2025 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). pp. 4551–4558 (2025).https://doi.org/10.1109/ IROS60139.2025.11245874

work page arXiv 2025

[53] [53]

In: 2025 IEEE/CVF International Con- ference on Computer Vision (ICCV)

Wei, S., Luo, C., Luo, Y.: Boosting multimodal learning via disentangled gra- dient learning. In: IEEE/CVF International Conference on Computer Vision, ICCV 2025, Honolulu, HI, USA, October 19-25, 2025. pp. 1–10 (2025).https: //doi.org/10.1109/ICCV51701.2025.02124

work page doi:10.1109/iccv51701.2025.02124 2025

[54] [54]

In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G

Wei, Y., Wei, Z., Rao, Y., Li, J., Zhou, J., Lu, J.: Lidar distillation: Bridging the beam-induced domain gap for 3d object detection. In: Computer Vision - ECCV 2022 - 17th European Conference, Tel Aviv, Israel, October 23-27, 2022, Proceed- ings, Part XXXIX. pp. 179–195 (2022).https://doi.org/10.1007/978-3-031- 19842-7_11

work page doi:10.1007/978-3-031- 2022

[55] [55]

In: 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)

Weng, X., Wang, J., Held, D., Kitani, K.: 3d multi-object tracking: A baseline and new evaluation metrics. In: IEEE/RSJ International Conference on Intelligent Robots and Systems, IROS 2020, Las Vegas, NV, USA, October 24, 2020 - January 24, 2021. pp. 10359–10366 (2020).https://doi.org/10.1109/IROS45743.2020. 9341164

work page doi:10.1109/iros45743.2020 2020

[56] [56]

In: Computer Vision – ECCV 2024: 18th European Conference, Milan, Italy, September 29–October 4, 2024, Proceedings, Part LII

Xiang, H., Zheng, Z., Xia, X., Xu, R., Gao, L., Zhou, Z., Han, X., Ji, X., Li, M., Meng, Z., Jin, L., Lei, M., Ma, Z., He, Z., Ma, H., Yuan, Y., Zhao, Y., Ma, J.: V2x-real: A large-scale dataset for vehicle-to-everything cooperative per- ception. In: Computer Vision – ECCV 2024: 18th European Conference, Milan, Italy, September 29–October 4, 2024, Proceed...

work page doi:10.1007/978-3-031-72943-0_26 2024

[57] [57]

IEEE Transactions on Multimedia26, 5536–5547 (2024).https://doi.org/10.1109/TMM.2023.3335879

Xiao, A., Guan, D., Zhang, X., Lu, S.: Domain adaptive lidar point cloud segmenta- tion with 3d spatial consistency. IEEE Transactions on Multimedia26, 5536–5547 (2024).https://doi.org/10.1109/TMM.2023.3335879

work page doi:10.1109/tmm.2023.3335879 2024

[58] [58]

In: Conference on Robot Learning, CoRL 2022, 14-18 December 2022, Auckland, New Zealand

Xu, R., Tu, Z., Xiang, H., Shao, W., Zhou, B., Ma, J.: Cobevt: Cooperative bird’s eye view semantic segmentation with sparse transformers. In: Conference on Robot Learning, CoRL 2022, 14-18 December 2022, Auckland, New Zealand. pp. 989– 1000 (2022),https://proceedings.mlr.press/v205/xu23a.html, accessed: 25 Jun 2026

2022

[59] [59]

In: The IEEE/CVF Computer Vision and Pattern Recognition Conference (CVPR) (2023)

Xu, R., Xia, X., Li, J., Li, H., Zhang, S., Tu, Z., Meng, Z., Xiang, H., Dong, X., Song, R., Yu, H., Zhou, B., Ma, J.: V2v4real: A real-world large-scale dataset for vehicle-to-vehicle cooperative perception. In: The IEEE/CVF Computer Vision and Pattern Recognition Conference (CVPR) (2023)

2023

[60] [60]

In: Computer Vision - ECCV 2022 - 17th European Conference, Tel Aviv, Israel, October 23-27, 2022, Proceedings, Part XXXIX

Xu, R., Xiang, H., Tu, Z., Xia, X., Yang, M., Ma, J.: V2x-vit: Vehicle-to-everything cooperative perception with vision transformer. In: Computer Vision - ECCV 2022 - 17th European Conference, Tel Aviv, Israel, October 23-27, 2022, Proceedings, Part XXXIX. pp. 107–124 (2022).https://doi.org/10.1007/978-3-031-19842- 7_7

work page doi:10.1007/978-3-031-19842- 2022

[61] [61]

In: 2022 In- ternational Conference on Robotics and Automation (ICRA)

Xu, R., Xiang, H., Xia, X., Han, X., Li, J., Ma, J.: OPV2V: an open bench- mark dataset and fusion pipeline for perception with vehicle-to-vehicle commu- nication. In: 2022 International Conference on Robotics and Automation, ICRA 20 S. Ding et al. 2022, Philadelphia, PA, USA, May 23-27, 2022. pp. 2583–2589 (2022).https: //doi.org/10.1109/ICRA46639.2022.9812038

work page doi:10.1109/icra46639.2022.9812038 2022

[62] [62]

Sen- sors18(10) (2018).https://doi.org/10.3390/s18103337

Yan, Y., Mao, Y., Li, B.: Second: Sparsely embedded convolutional detection. Sen- sors18(10) (2018).https://doi.org/10.3390/s18103337

work page doi:10.3390/s18103337 2018

[63] [63]

Advances in Neural Information Processing Systems (NeurIPS) (2025)

Yang, L., Zhang, X., Li, J., Wang, C., Ma, J., Song, Z., Zhao, T., Song, Z., Wang, L., Zhou, M., Shen, Y., Lv, C.: V2x-radar: A multi-modal dataset with 4d radar for cooperative perception. Advances in Neural Information Processing Systems (NeurIPS) (2025)

2025

[64] [64]

In: IEEE/CVF Con- ference on Computer Vision and Pattern Recognition, CVPR 2022, New Orleans, LA, USA, June 18-24, 2022

Ye, X., Shu, M., Li, H., Shi, Y., Li, Y., Wang, G., Tan, X., Ding, E.: Rope3D: The Roadside Perception Dataset for Autonomous Driving and Monocular 3D Object Detection Task. In: Conference on Computer Vision and Pattern Recognition. pp. 21341–21350 (2022).https://doi.org/10.1109/CVPR52688.2022.02065

work page doi:10.1109/cvpr52688.2022.02065 2022

[65] [65]

In: IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2021, virtual, June 19-25, 2021

Yin, T., Zhou, X., Krähenbühl, P.: Center-based 3d object detection and tracking. In: IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2021, virtual, June 19-25, 2021. pp. 11784–11793 (2021).https://doi.org/10.1109/ CVPR46437.2021.01161

work page arXiv 2021

[66] [66]

Yongqiang, D., Dengjiang, W., Gang, C., Bing, M., Xijia, G., Yajun, W., Jian- chao, L., Yanming, F., Juanjuan, L.: Baai-vanjee roadside dataset: Towards the connected automated vehicle highway technologies in challenging environments of china (2021),https://arxiv.org/abs/2105.14370, accessed: 25 Jun 2026

work page arXiv 2021

[67] [67]

In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

Yu, H., Luo, Y., Shu, M., Huo, Y., Yang, Z., Shi, Y., Guo, Z., Li, H., Hu, X., Yuan, J., Nie, Z.: Dair-v2x: A large-scale dataset for vehicle-infrastructure cooperative 3d object detection. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 21361–21370 (2022)

2022

[68] [68]

Zhang, G., Fan, L., He, C., Lei, Z., Zhang, Z., Zhang, L.: Voxel mamba: Group-free statespacemodelsforpointcloudbased3dobjectdetection.In:AdvancesinNeural Information Processing Systems 37: Annual Conference on Neural Information Processing Systems 2024, NeurIPS 2024, Vancouver, BC, Canada, December 10 - 15, 2024 (2024)

2024

[69] [69]

A flexible new technique for camera calibration

Zhang, Z.: A flexible new technique for camera calibration. IEEE Transactions on Pattern Analysis and Machine Intelligence22(11), 1330–1334 (2000).https: //doi.org/10.1109/34.888718

work page doi:10.1109/34.888718 2000

[70] [70]

IEEE transactions on pattern analysis and ma- chine intelligence48 6, 7032–7049 (2026).https://doi.org/10.1109/TPAMI

Zhu, L.Z.Y.L.A.L.L.L.R.M.B.B.M.: Omnihd-scenes: A next-generation multimodal dataset for autonomous driving. IEEE transactions on pattern analysis and ma- chine intelligence48 6, 7032–7049 (2026).https://doi.org/10.1109/TPAMI. 2026.3663672

work page doi:10.1109/tpami 2026

[71] [71]

In: European Conference on Computer Vision

Zhu, X., Sheng, H., Cai, S., Deng, B., Yang, S., Liang, Q., Chen, K., Gao, L., Song, J., Ye, J.: Roscenes: A large-scale multi-view 3d dataset for roadside perception. In: European Conference on Computer Vision. pp. 331–347. Springer (2024)

2024

[72] [72]

In: 2023 IEEE 26th International Con- ference on Intelligent Transportation Systems (ITSC)

Zimmer, W., Creß, C., Nguyen, H.T., Knoll, A.C.: Tumtraf intersection dataset: All you need for urban 3d camera-lidar roadside perception. In: 2023 IEEE 26th International Conference on Intelligent Transportation Systems (ITSC). pp. 1030– 1037 (2023).https://doi.org/10.1109/ITSC57777.2023.10422289

work page doi:10.1109/itsc57777.2023.10422289 2023

[73] [73]

9.754 Meters 6.172 Meters 36' 6

Zimmer, W., Wardana, G.A., Sritharan, S., Zhou, X., Song, R., Knoll, A.C.: Tum- traf v2x cooperative perception dataset. 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) pp. 22668–22677 (2024) RESOLVE: A Multi-Resolution and Multi-Modal Roadside Dataset 21 Appendix A Sensor Setup This section provides more details of sensor setup...

2024