
arxiv: 2606.31895 · v1 · pith:TVE2IX3Fnew · submitted 2026-06-30 · 💻 cs.CV

RESOLVE: A Multi-Resolution and Multi-Modal Dataset for Roadside Cooperative Perception

Pith reviewed 2026-07-01 05:44 UTC · model grok-4.3

classification 💻 cs.CV
keywords roadside cooperative perception · multi-resolution LiDAR · multimodal fusion · 3D object detection · point cloud sparsity · benchmark dataset · traffic participant tracking · urban intersection sensing

The pith

RESOLVE dataset enables controlled tests of roadside 3D perception across three fixed LiDAR resolutions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

RESOLVE is a real-world dataset collected at an urban intersection that pairs cameras with LiDAR at three different resolutions while holding all other sensing and environmental factors constant. It supplies over 100,000 images, 26,000 point cloud frames, and 220,000 bounding box annotations covering ten classes of traffic participants across varied lighting and weather. The controlled resolution variants create point cloud distribution shifts from sparsity, distance, and training-inference mismatches. This structure supports direct comparisons of unimodal and camera-LiDAR fusion models in 3D detection and tracking. Benchmark results show how fusion can offset sparsity effects and guide lower-cost roadside sensor choices.
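The sparsity and distance shifts described above can be made concrete by counting LiDAR points inside each box and averaging the counts per distance bin. The following is a minimal sketch, not the paper's tooling: it assumes axis-aligned boxes in the form (cx, cy, cz, length, width, height), ignores yaw, and measures distance from an assumed intersection center at the origin. The function names and the toy data are hypothetical.

```python
import math
from collections import defaultdict

def points_in_box(points, box):
    """Count points inside an axis-aligned box (cx, cy, cz, l, w, h). Yaw is ignored (assumption)."""
    cx, cy, cz, l, w, h = box
    return sum(
        1
        for x, y, z in points
        if abs(x - cx) <= l / 2 and abs(y - cy) <= w / 2 and abs(z - cz) <= h / 2
    )

def mean_points_by_distance(points, boxes, bin_width=10.0, center=(0.0, 0.0)):
    """Average in-box point count per distance bin, keyed by the bin's lower edge in meters."""
    sums = defaultdict(lambda: [0, 0])  # bin index -> [total points, number of boxes]
    for box in boxes:
        dist = math.hypot(box[0] - center[0], box[1] - center[1])
        bin_index = int(dist // bin_width)
        sums[bin_index][0] += points_in_box(points, box)
        sums[bin_index][1] += 1
    return {b * bin_width: total / count for b, (total, count) in sorted(sums.items())}

# Toy data for illustration only (not taken from the dataset).
points = [(1.0, 0.0, 0.0), (1.5, 0.5, 0.0), (25.0, 0.0, 0.0)]
boxes = [(1.0, 0.0, 0.0, 4.0, 2.0, 2.0), (25.0, 0.0, 0.0, 4.0, 2.0, 2.0)]
profile = mean_points_by_distance(points, boxes)
print(profile)
```

Running the same function on the point clouds of each resolution level would give one profile per level, which is the kind of comparison the distance-wise figures summarize.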

Core claim

RESOLVE supplies synchronized multi-resolution LiDAR and camera data from roadside views along with manual annotations to enable systematic evaluation of unimodal and fusion architectures for 3D detection and tracking under controlled point sparsity shifts.

What carries the argument

The RESOLVE dataset's three fixed LiDAR resolution levels captured with otherwise identical sensing parameters, cameras, and scene conditions.

If this is right

  • Multimodal fusion recovers detection performance lost to reduced LiDAR point density.
  • Architectures can be compared without confounding from uncontrolled resolution changes.
  • Designers can directly assess cost savings from deploying lower-resolution roadside sensors.
  • Effects of training at one resolution and inferring at another become measurable.
  • Performance trends hold across detection, tracking, and diverse lighting or weather.
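The training-inference mismatch mentioned in the list above is naturally organized as a train-by-test matrix. The sketch below is an illustration under stated assumptions: the resolution labels, the `evaluate` callable, and the toy scoring function are all hypothetical stand-ins, and the numbers it produces are not results from the paper.

```python
from typing import Callable, Dict, Sequence, Tuple

Pair = Tuple[str, str]  # (training resolution, inference resolution)

def mismatch_matrix(resolutions: Sequence[str],
                    evaluate: Callable[[str, str], float]) -> Dict[Pair, float]:
    """Score every (train, test) resolution pair with a user-supplied evaluator."""
    return {(tr, te): evaluate(tr, te) for tr in resolutions for te in resolutions}

def degradation(matrix: Dict[Pair, float], resolutions: Sequence[str]) -> Dict[Pair, float]:
    """Score lost relative to the matched-resolution model for each test resolution."""
    return {
        (tr, te): matrix[(te, te)] - matrix[(tr, te)]
        for tr in resolutions
        for te in resolutions
        if tr != te
    }

RES = ["low", "mid", "high"]  # assumed labels for the three LiDAR levels

def toy_evaluate(train_res: str, test_res: str) -> float:
    # Dummy stand-in for "train a detector on train_res, report a metric on test_res".
    gap = abs(RES.index(train_res) - RES.index(test_res))
    return 1.0 - 0.25 * gap

matrix = mismatch_matrix(RES, toy_evaluate)
drops = degradation(matrix, RES)
print(drops[("low", "high")])
```

In practice `evaluate` would wrap a real training and evaluation run; the diagonal of the matrix is the matched baseline against which each off-diagonal cell is compared.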

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same fixed-factor collection approach could be repeated with added radar or different intersection layouts to test broader generalizability.
  • The released benchmark code makes it straightforward to evaluate new models on the existing resolution variants.
  • Insights on sparsity compensation could inform sensor fusion choices in vehicle-to-infrastructure networks beyond this single site.
  • Similar controlled multi-resolution captures might reveal whether the observed fusion benefits transfer to other perception tasks such as segmentation.

Load-bearing premise

The 220,000 manual bounding box annotations remain accurate and consistent when the same scenes are captured at different LiDAR resolutions and under changing conditions.

What would settle it

Re-annotating a held-out subset of frames using only the lowest-resolution point clouds and finding average IoU agreement below 0.7 with the original labels.
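The agreement test described above can be sketched as a mean IoU over paired boxes. This is a simplified illustration, assuming axis-aligned boxes in the form (cx, cy, cz, length, width, height) and assuming the original and re-annotated boxes are already matched by index. Real annotations usually include a yaw angle, which this sketch ignores, and all names here are hypothetical.

```python
from typing import Sequence, Tuple

Box = Tuple[float, float, float, float, float, float]  # cx, cy, cz, l, w, h (assumed layout)

def axis_overlap(c1: float, s1: float, c2: float, s2: float) -> float:
    """Length of the overlap between two 1D intervals given by center and size."""
    lo = max(c1 - s1 / 2, c2 - s2 / 2)
    hi = min(c1 + s1 / 2, c2 + s2 / 2)
    return max(0.0, hi - lo)

def iou_3d(a: Box, b: Box) -> float:
    """Axis-aligned 3D IoU: intersection volume over union volume."""
    inter = 1.0
    for i in range(3):
        inter *= axis_overlap(a[i], a[i + 3], b[i], b[i + 3])
    union = a[3] * a[4] * a[5] + b[3] * b[4] * b[5] - inter
    return inter / union if union > 0 else 0.0

def mean_agreement(original: Sequence[Box], relabeled: Sequence[Box]) -> float:
    """Average IoU over index-matched box pairs."""
    if not original or len(original) != len(relabeled):
        raise ValueError("need two non-empty, equally long, index-matched box lists")
    return sum(iou_3d(a, b) for a, b in zip(original, relabeled)) / len(original)

THRESHOLD = 0.7  # agreement level named in the text above

# Toy boxes for illustration only.
original = [(0.0, 0.0, 0.0, 4.0, 2.0, 1.5), (10.0, 5.0, 0.0, 1.0, 1.0, 2.0)]
relabeled = [(0.2, 0.0, 0.0, 4.0, 2.0, 1.5), (10.0, 5.0, 0.0, 1.0, 1.0, 2.0)]
score = mean_agreement(original, relabeled)
print(score, score < THRESHOLD)
```

A rotated-box IoU would be needed for a faithful check on oriented vehicles; the axis-aligned version is kept here only because it is short and easy to verify by hand.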

Figures

Figures reproduced from arXiv: 2606.31895 by Dajiang Suo, Linan Song, Marco De Vincenzi, Shaozu Ding.

Figure 1
This illustration provides a comprehensive overview of our RESOLVE dataset. We visualize globally fused point clouds of a representative scene, with point clouds from LiDARs of different resolutions encoded in different colors. The sensor installation locations on the infrastructure side are highlighted. Both the high-fidelity simulation map based on CARLA and the HD map are displayed.
Figure 2
Illustrations of temporal synchronization and spatial calibration across modalities. (a) shows calibration among LiDARs at different resolutions, including a zoomed-in view of target alignment, with camera frustums overlaid to verify camera–LiDAR extrinsics and coverage. (b) shows point clouds from different-resolution LiDARs projected onto multiple camera views to assess cross-modal calibration.
Figure 3
Class-wise statistics under three LiDAR resolutions. (a) Lower resolution reduces valid 3D boxes, especially for small targets. (b) As resolution increases, the average number of points increases, reflecting richer geometric observations. (c) Higher-resolution LiDAR provides only a slight gain for small-target tracks but can extend the effective tracking length of large vehicles.
Figure 4
Statistics on valid boxes per frame and in-box points per box, along with visualization of motion trajectories. (a) On average, each frame contains 16 valid bounding boxes. Resolution variations have minimal impact on this metric. (b) Distribution of LiDAR points per valid 3D box under different resolutions. (c) BEV trajectory visualization on the intersection HD map, with trajectories color-coded by cla…
Figure 5
Statistics on the variation of the average number of points with distance from the object to the center of the intersection at different resolutions. The resolution-caused difference is significant at close range but diminishes with increasing distance.
Figure 6
Qualitative multi-object tracking results of SimpleTrack [38] on our RESOLVE dataset across three resolution settings. Ground truths and predictions with a score threshold of 0.3 are overlaid and visualized at 15 Hz. Failure cases are marked in red circles.
Figure 7
Agent-level cooperative perception performance under three LiDAR resolutions.
Figure 8
t-SNE plots of the impact of LiDAR resolution on the feature representation capability of unimodal models under the TransFusion-L [2] architecture.
Figure 9
Controlled experimental design under a general multimodal architecture. Top: standard backpropagation, mAP=93.1. Bottom: stop gradients in the LiDAR branch and freeze LiDAR-module parameters, mAP=89.4. Markers 1 and 3 visualize the features of the first layer of the LiDAR backbone, while markers 2 and 4 visualize the fused features.
Figure 10
Unimodal performance degradation caused by different LiDAR resolutions used during training and inference. Voxel Mamba [65] is used as the unimodal detector.
Figure 11
Installation location diagram of infrastructure cameras, marked in red circles.
Figure 12
A professional team ensures that the LiDAR installations within each group differ only in height and provides specific measurement data for all sensors.
Figure 13
Influence of sensing range, VFOV, and installation height.
Figure 14
Distribution of object dimensions (length, width, height) for each class in the RESOLVE dataset. The box plots depict the intra-class statistical spread and emphasize inter-class scale variations.
Figure 15
Relationship between the number of points within the bounding box and the distance from the object to the center of the intersection at three resolutions.
Figure 16
Distance-wise mAP of different classes under three resolutions. We use Voxel Mamba [65] as the detection model. Increasing LiDAR resolution can improve mAP, and this advantage is more pronounced at long distances where the number of in-box points is close to the lower limit.
Figure 17
Comparison of point cloud shapes for 10 traffic participants in our RESOLVE dataset at three LiDAR resolutions, annotated with the distance from the intersection center. The results show that the point cloud density significantly increases with increasing resolution, and the target geometry gradually evolves from sparse outlines to more complete three-dimensional information, enhancing the abili…
Figure 18
The illustrations show how we visualize LiDAR backbone features using t-SNE. (a) Sparse convolutional features are projected onto the BEV view. (b) The region features of each object are clipped and average-pooled.
Figure 19
Multimodal performance degradation caused by different LiDAR resolutions used during training and inference. Top: BEVFusion [33] is used as the multimodal detector. Bottom: UniTR [47] is used as the multimodal detector.
Figure 20
Comparison of detection results between 64/16-beam data obtained by downsampling from 128-beam data and real-world 64/16-beam data.
Figure 21
Effects of LiDAR resolution on object detection accuracy at different distances.
Figure 22
Comparison of spatial attention maps induced by different latent codes in the last VSA module of the VoxSet [17] architecture across three LiDAR resolutions.
Figure 23
Qualitative detection results of BEVFusion [33] in a sunny scene across three resolution settings, where missed or falsely detected objects are marked with purple circles.
Figure 24
Qualitative detection results of TransFusion-L [2] in a rainy scene across three resolution settings, where missed or falsely detected objects are marked with purple circles.
Figure 25
Qualitative detection results of LION [32] in a night scene across three resolution settings, where missed or falsely detected objects are marked with purple circles.
Figure 26
Comparison of tracking results under different detection resolutions across three scenarios. We use Voxel Mamba [65] and SimpleTrack [38] as the detector and tracker.
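Figure 20 contrasts real low-beam data with data downsampled from a 128-beam scan. One common way to emulate fewer beams is to keep every k-th laser ring. The sketch below illustrates that idea only; it assumes each point carries a ring index in the range 0 to 127 and that rings are thinned uniformly. The source does not state which downsampling procedure the authors used, and the function name and point layout are hypothetical.

```python
from typing import List, Tuple

Point = Tuple[float, float, float, int]  # x, y, z, ring index (assumed layout)

def downsample_beams(points: List[Point], src_beams: int, dst_beams: int) -> List[Point]:
    """Keep every (src_beams // dst_beams)-th ring to emulate a lower-beam LiDAR."""
    if dst_beams <= 0 or src_beams % dst_beams != 0:
        raise ValueError("dst_beams must be a positive divisor of src_beams")
    step = src_beams // dst_beams
    return [p for p in points if p[3] % step == 0]

# Toy scan: one point per ring of an assumed 128-beam sensor.
scan = [(float(ring), 0.0, 0.0, ring) for ring in range(128)]
scan_64 = downsample_beams(scan, 128, 64)
scan_16 = downsample_beams(scan, 128, 16)
print(len(scan_64), len(scan_16))
```

Uniform ring thinning does not reproduce the vertical field of view or beam distribution of a physical lower-beam sensor, which is one reason a comparison against real 64/16-beam captures is informative.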
Original abstract

LiDAR has increasingly been integrated into traffic cameras to expand coverage and mitigate occlusion in roadside cooperative perception. However, how unimodal and camera-LiDAR fusion architectures behave under variations in LiDAR point sparsity induced by sensor configurations and scene-dependent sensing conditions remains underexplored. We introduce RESOLVE, a large-scale real-world benchmark dataset featuring multi-resolution roadside LiDAR and synchronized camera-LiDAR sensing for systematic evaluation of unimodal and fusion-based architectures in roadside 3D detection and tracking. RESOLVE contains over 100k images and 26k point cloud frames with 220k manually annotated bounding boxes, captured at a real-world urban intersection across diverse lighting and weather conditions and spanning 10 classes of traffic participants. In particular, RESOLVE enables controlled evaluation across three LiDAR resolution levels while keeping all other sensing and environmental factors fixed. This allows fair cross-architecture comparisons under point cloud distribution shifts resulting from resolution variations, sensing distance, and training-inference resolution mismatches. Results from extensive benchmark experiments reveal insights into how multimodal fusion can compensate for LiDAR point sparsity, offering clues for designing cost-efficient roadside multimodal perception. The dataset and benchmark codes are available at https://github.com/ASU-Suo-Lab/RESOLVE.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 0 minor

Summary. The paper introduces RESOLVE, a real-world roadside dataset with synchronized multi-resolution LiDAR (three levels) and camera data captured at an urban intersection under varied lighting/weather. It contains >100k images, 26k point-cloud frames, and 220k manually annotated 3D bounding boxes across 10 classes. The central claim is that the dataset supports controlled evaluation of unimodal and camera-LiDAR fusion architectures for 3D detection/tracking by varying only LiDAR resolution while fixing all other sensing and environmental factors, thereby isolating distribution-shift effects; benchmark experiments and public code/data release are also provided.

Significance. A dataset enabling controlled isolation of LiDAR resolution effects in roadside cooperative perception would be a useful addition to the field, particularly given the public release of data and benchmark code. If annotation consistency across resolutions can be verified, the resource would support reproducible cross-architecture comparisons under realistic point-density shifts and training-inference mismatches.

major comments (1)
  1. [Abstract] Abstract and dataset description: The claim that RESOLVE 'enables controlled evaluation across three LiDAR resolution levels while keeping all other sensing and environmental factors fixed' is load-bearing for the paper's contribution. However, no information is supplied on the annotation protocol for the 220k boxes—specifically, whether boxes were annotated independently on each resolution's point cloud, propagated from the highest-resolution scan, or adjusted per resolution. No inter-resolution agreement statistics, inter-annotator agreement, or per-resolution quality metrics are reported. Without these, performance differences cannot be unambiguously attributed to resolution-induced distribution shift.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback. We address the concern about the annotation protocol below and will revise the manuscript accordingly to strengthen the support for our central claim.

Point-by-point responses
  1. Referee: [Abstract] Abstract and dataset description: The claim that RESOLVE 'enables controlled evaluation across three LiDAR resolution levels while keeping all other sensing and environmental factors fixed' is load-bearing for the paper's contribution. However, no information is supplied on the annotation protocol for the 220k boxes—specifically, whether boxes were annotated independently on each resolution's point cloud, propagated from the highest-resolution scan, or adjusted per resolution. No inter-resolution agreement statistics, inter-annotator agreement, or per-resolution quality metrics are reported. Without these, performance differences cannot be unambiguously attributed to resolution-induced distribution shift.

    Authors: We agree that the manuscript does not currently supply details on the annotation protocol across resolutions, inter-annotator agreement, or per-resolution quality metrics. This information is necessary to fully support the claim of controlled evaluation by isolating resolution effects. In the revised manuscript we will add a dedicated subsection describing the annotation process (including whether annotations were performed independently per resolution or propagated with adjustments), along with the requested agreement and quality statistics. These additions will enable readers to assess attribution of performance differences to distribution shift. revision: yes

Circularity Check

0 steps flagged

No derivation chain present; dataset release only

full rationale

The manuscript introduces a multi-resolution roadside perception dataset and benchmark protocol. No equations, fitted parameters, predictions, or first-principles derivations appear in the abstract or described full text. Claims about controlled evaluation rest on the physical data collection and manual annotation process rather than any self-referential construction. No self-citation load-bearing steps, ansatz smuggling, or renaming of known results occur. This is the expected 0 outcome for a data-release paper.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The paper contributes a new dataset rather than a derivation; it rests on standard computer-vision data-collection practices.

axioms (1)
  • domain assumption: Manual bounding-box annotations are sufficiently accurate and consistent for quantitative benchmarking across sensor resolutions.
    Required for the claim of fair cross-architecture comparisons under controlled conditions.

pith-pipeline@v0.9.1-grok · 5760 in / 1116 out tokens · 26019 ms · 2026-07-01T05:44:29.261978+00:00 · methodology


Reference graph

Works this paper leans on

73 extracted references · 51 canonical work pages · 2 internal anchors

  1. [1]

    Ahmad, F., Shin, C.S., Pang, W., Leong, B., Ghosh, P., Govindan, R.: Cooperative infrastructure perception. In: Ninth IEEE/ACM International Conference on Internet-of-Things Design and Implementation, IoTDI 2024, Hong Kong, May 13-16, 2024. pp. 61–72 (2024). https://doi.org/10.1109/IOTDI61053.2024.00010

  2. [2]

    Bai, X., Hu, Z., Zhu, X., Huang, Q., Chen, Y., Fu, H., Tai, C.: Transfusion: Robust lidar-camera fusion for 3d object detection with transformers. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2022, New Orleans, LA, USA, June 18-24, 2022. pp. 1080–1089 (2022). https://doi.org/10.1109/CVPR52688.2022.00116

  3. [3]

    Bai, Z., Wu, G., Barth, M.J., Liu, Y., Sisbot, E.A., Oguchi, K.: Pillargrid: Deep learning-based cooperative perception for 3d object detection from onboard-roadside lidar. In: 2022 IEEE 25th International Conference on Intelligent Transportation Systems (ITSC). pp. 1743–1749. IEEE (Oct 2022). https://doi.org/10.1109/itsc55140.2022.9921947

  4. [4]

    Bai, Z., Wu, G., Barth, M.J., Liu, Y., Sisbot, E.A., Oguchi, K.: Vinet: Lightweight, scalable, and heterogeneous cooperative perception for 3d object detection. Mechanical Systems and Signal Processing 204, 110723 (2023). https://doi.org/10.1016/j.ymssp.2023.110723

  5. [5]

    Besl, P.J., McKay, N.D.: Method for registration of 3-D shapes. In: Schenker, P.S. (ed.) Sensor Fusion IV: Control Paradigms and Data Structures. vol. 1611, pp. 586–606. International Society for Optics and Photonics, SPIE (1992). https://doi.org/10.1117/12.57955

  6. [6]

    Caesar, H., Bankiti, V., Lang, A.H., Vora, S., Liong, V.E., Xu, Q., Krishnan, A., Pan, Y., Baldan, G., Beijbom, O.: nuscenes: A multimodal dataset for autonomous driving. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (June 2020)

  7. [7]

    Cai, X., Jiang, W., Xu, R., Zhao, W., Ma, J., Liu, S., Li, Y.: Analyzing infrastructure lidar placement with realistic lidar simulation library. In: 2023 IEEE International Conference on Robotics and Automation (ICRA). pp. 5581–5587 (2023). https://doi.org/10.1109/ICRA48891.2023.10161027

  8. [8]

    Chen, Q., Ma, X., Tang, S., Guo, J., Yang, Q., Fu, S.: F-cooper: feature based cooperative perception for autonomous vehicle edge computing system using 3d point clouds. In: Proceedings of the 4th ACM/IEEE Symposium on Edge Computing. pp. 88–100. SEC ’19, Association for Computing Machinery, New York, NY, USA (2019). https://doi.org/10.1145/3318216.3363300

  9. [9]

    Contributors, M.: MMDetection3D: OpenMMLab next-generation platform for general 3D object detection. https://github.com/open-mmlab/mmdetection3d (2020), accessed: 25 Jun 2026

  10. [10]

    Corral-Soto, E.R., Grandhi, A., He, Y.Y., Rochan, M., Liu, B.: Improving lidar 3d object detection via range-based point cloud density optimization (2023), https://arxiv.org/abs/2306.05663, accessed: 25 Jun 2026

  11. [11]

    Creß, C., Bing, Z., Knoll, A.C.: Intelligent transportation systems using roadside infrastructure: A literature survey. IEEE Transactions on Intelligent Transportation Systems 25(7), 6309–6327 (2024). https://doi.org/10.1109/TITS.2023.3343434

  12. [12]

    European Parliament and Council of the European Union: Regulation (eu) 2016/679 of the european parliament and of the council of 27 april 2016 (general data protection regulation). Official Journal of the European Union, L 119, 4 May 2016, pp. 1–88 (2016), https://eur-lex.europa.eu/eli/reg/2016/679/oj/eng, CELEX: 32016R0679; accessed 25 Jun 2026

  13. [13]

    Geiger, A., Lenz, P., Stiller, C., Urtasun, R.: Vision meets robotics: The KITTI dataset. Int. J. Robotics Res. 32(11), 1231–1237 (2013). https://doi.org/10.1177/0278364913491297

  14. [14]

    Gu, A., Dao, T.: Mamba: Linear-time sequence modeling with selective state spaces (2024), https://arxiv.org/abs/2312.00752, accessed: 25 Jun 2026

  15. [15]

    Hadgi, S., Li, L., Ovsjanikov, M.: To supervise or not to supervise: Understanding and addressing the key challenges of point cloud transfer learning. In: Computer Vision - ECCV 2024 - 18th European Conference, Milan, Italy, September 29-October 4, 2024, Proceedings, Part LXXVIII. pp. 146–163 (2024). https://doi.org/10.1007/978-3-031-73229-4_9

  16. [16]

    Hao, R., Fan, S., Dai, Y., Zhang, Z., Li, C., Wang, Y., Yu, H., Yang, W., Yuan, J., Nie, Z.: Rcooper: A real-world large-scale dataset for roadside cooperative perception. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2024, Seattle, WA, USA, June 16-22, 2024. pp. 22347–22357 (2024). https://doi.org/10.1109/CVPR52733.2024.02109

  17. [17]

    He, C., Li, R., Li, S., Zhang, L.: Voxel set transformer: A set-to-set approach to 3d object detection from point clouds. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2022, New Orleans, LA, USA, June 18-24,

  18. [18]
  19. [19]

    He, Y., Cao, P., Suo, D., Liu, X.: A joint optimization of beam distribution and deployment for roadside lidar systems to maximize vehicle perception. IEEE Transactions on Intelligent Vehicles 10(2), 1300–1314 (2025). https://doi.org/10.1109/TIV.2024.3426524

  20. [20]

    Hu, J.S.K., Kuai, T., Waslander, S.L.: Point density-aware voxels for lidar 3d object detection. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2022, New Orleans, LA, USA, June 18-24, 2022. pp. 8459–8468 (2022). https://doi.org/10.1109/CVPR52688.2022.00828

  21. [21]

    Hu, Q., Liu, D., Hu, W.: Density-insensitive unsupervised domain adaption on 3d object detection. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2023, Vancouver, BC, Canada, June 17-24, 2023. pp. 17556–17566 (2023). https://doi.org/10.1109/CVPR52729.2023.01684

  22. [22]

    Hu, Y., Lu, Y., Xu, R., Xie, W., Chen, S., Wang, Y.: Collaboration helps camera overtake lidar in 3d detection. In: The IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2023)

  23. [23]

    IEEE 1588: Ieee standard for a precision clock synchronization protocol for networked measurement and control systems (2020). https://doi.org/10.1109/IEEESTD.2020.9120376

  24. [24]

    Jiang, W., Xiang, H., Cai, X., Xu, R., Ma, J., Li, Y., Lee, G.H., Liu, S.: Optimizing the placement of roadside lidars for autonomous driving. In: 2023 IEEE/CVF International Conference on Computer Vision (ICCV). pp. 18335–18344 (2023). https://doi.org/10.1109/ICCV51070.2023.01685

  25. [25]

    Krammer, A., Schöller, C., Gulati, D., Lakshminarasimhan, V., Kurz, F., Rosenbaum, D., Lenz, C., Knoll, A.: Providentia - a large-scale sensor system for the assistance of autonomous vehicles and its evaluation. Field Robotics 2, 1156–1176 (2022). https://doi.org/10.55417/fr.2022038

  26. [26]

    Kuhn, H.W.: The hungarian method for the assignment problem. In: 50 Years of Integer Programming 1958-2008 - From the Early Years to the State-of-the-Art, pp. 29–47. Springer (2010). https://doi.org/10.1007/978-3-540-68279-0_2

  27. [27]

    Lang, A.H., Vora, S., Caesar, H., Zhou, L., Yang, J., Beijbom, O.: Pointpillars: Fast encoders for object detection from point clouds. In: IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2019, Long Beach, CA, USA, June 16-20, 2019. pp. 12697–12705 (2019). https://doi.org/10.1109/CVPR.2019.01298

  28. [28]

  29. [29]

    Li, H., Cao, B., Liang, Z., Li, W., Oh, J., Chen, Y., Liang, S., Zhou, H., Ma, C., Liu, J., Li, Z., Zhang, P., Long, K., Liu, M., Jiang, J., Yu, C., Liu, S., Yu, H., Li, X.: Cats-v2v: A real-world vehicle-to-vehicle cooperative perception dataset with complex adverse traffic scenarios (2025), https://arxiv.org/abs/2511.11168, accessed: 25 Jun 2026

  30. [30]

    Li, S., Ma, L., Li, X.: Domain generalization of 3d object detection by density-resampling. In: Computer Vision - ECCV 2024 - 18th European Conference, Milan, Italy, September 29-October 4, 2024, Proceedings, Part LXIV. pp. 456–473 (2024). https://doi.org/10.1007/978-3-031-73039-9_26

  31. [31]

    Li, X., Xie, T., Liu, D., Gao, J., Dai, K., Jiang, Z., Zhao, L., Wang, K.: Poly-mot: A polyhedral framework for 3d multi-object tracking. In: 2023 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). pp. 9391–9398 (2023). https://doi.org/10.1109/IROS55552.2023.10341778

  32. [32]

    Liu, C., Zhu, M., Ma, C.: H-v2x: A large scale highway dataset for bev perception. In: Computer Vision - ECCV 2024 - 18th European Conference, Milan, Italy, September 29-October 4, 2024, Proceedings, Part I. Lecture Notes in Computer Science, vol. 15059, pp. 139–157. Springer (2024). https://doi.org/10.1007/978-3-031-73232-4_8

  33. [33]

    Liu, C., Zhu, M., Zhang, Z., Song, L., Zhao, X., Luo, Q., Wang, Q., Guo, C., Su, K.: Tad-e2e: A large-scale end-to-end autonomous driving dataset. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). pp. 26600–26609 (October 2025)

  34. [34]

    Liu, Z., Hou, J., Wang, X., Ye, X., Wang, J., Zhao, H., Bai, X.: LION: linear group RNN for 3d object detection in point clouds. In: Advances in Neural Information Processing Systems 37: Annual Conference on Neural Information Processing Systems 2024, NeurIPS 2024, Vancouver, BC, Canada, December 10 - 15, 2024 (2024)

  35. [35]

    Liu, Z., Tang, H., Amini, A., Yang, X., Mao, H., Rus, D.L., Han, S.: Bevfusion: Multi-task multi-sensor fusion with unified bird’s-eye view representation. In: IEEE International Conference on Robotics and Automation, ICRA 2023, London, UK, May 29 - June 2, 2023. pp. 2774–2781 (2023). https://doi.org/10.1109/ICRA48891.2023.10160968

  36. [36]

    Ma, C., Qiao, L., Zhu, C., Liu, K., Kong, Z., Li, Q., Zhou, X., Kan, Y., Wu, W.: Holovic: Large-scale dataset and benchmark for multi-sensor holographic intersection and vehicle-infrastructure cooperative. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2024, Seattle, WA, USA, June 16-22,

  37. [37]
  38. [38]

    van der Maaten, L., Hinton, G.: Visualizing data using t-sne. Journal of Machine Learning Research 9, 2579–2605 (Nov 2008)

  39. [39]

    Macenski, S., Foote, T., Gerkey, B., Lalancette, C., Woodall, W.: Robot operating system 2: Design, architecture, and uses in the wild. Science Robotics 7(66), eabm6074 (2022). https://doi.org/10.1126/scirobotics.abm6074

  40. [40]

    Madsen, K., Nielsen, H.B., Tingleff, O.: Methods for non-linear least squares problems (2nd ed.). Society for Industrial & Applied Mathematics (2004)

  41. [41]

    Pang, Z., Li, Z., Wang, N.: Simpletrack: Understanding and rethinking 3d multi-object tracking. In: Computer Vision - ECCV 2022 Workshops - Tel Aviv, Israel, October 23-27, 2022, Proceedings, Part I. pp. 680–696 (2022). https://doi.org/10.1007/978-3-031-25056-9_43

  42. [42]

    Qu, A., Huang, X., Suo, D.: Seip: Simulation-based design and evaluation of infrastructure-based collective perception. In: 2023 IEEE 26th International Conference on Intelligent Transportation Systems (ITSC). pp. 3871–3878 (2023). https://doi.org/10.1109/ITSC57777.2023.10422006

  43. [43]

    Richter, J., Faion, F., Feng, D., Becker, P.B., Sielecki, P., Glaeser, C.: Understanding the domain gap in lidar object detection networks (2022), https://arxiv.org/abs/2204.10024, accessed: 25 Jun 2026

  44. [44]

    Sekaran, K.C., Geisler, M., Rößle, D., Mohan, A., Cremers, D., Utschick, W., Botsch, M., Huber, W., Schön, T.: Urbaning-v2x: A large-scale multi-vehicle, multi-infrastructure dataset across multiple intersections for cooperative perception. In: Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2025, Ne...

  45. [45]

    Sun, P., Kretzschmar, H., Dotiwalla, X., Chouard, A., Patnaik, V., Tsui, P., Guo, J., Zhou, Y., Chai, Y., Caine, B., Vasudevan, V., Han, W., Ngiam, J., Zhao, H., Timofeev, A., Ettinger, S., Krivokon, M., Gao, A., Joshi, A., Zhang, Y., Shlens, J., Chen, Z., Anguelov, D.: Scalability in perception for autonomous driving: Waymo open dataset. In: 2020 IEEE/CVF Conferenc...

  46. [46]

    Team, O.D.: Openpcdet: An open-source toolbox for 3d object detection from point clouds. https://github.com/open-mmlab/OpenPCDet (2020), accessed: 25 Jun 2026

  47. [47]

    Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, L., Polosukhin, I.: Attention is all you need. In: Proceedings of the 31st International Conference on Neural Information Processing Systems. pp. 6000–6010. NIPS’17, Curran Associates Inc., Red Hook, NY, USA (2017)

  48. [48]

    Wang, B., Meng, S., Zhang, L., Wang, C., Huang, J., Li, Y., Ren, H., Xiao, Y., Peng, Y., Ji, J., Zhang, Y., Zhang, Y.: Corp: A multi-modal dataset for campus-oriented roadside perception tasks (2024), https://arxiv.org/abs/2404.03191, accessed: 25 Jun 2026

  49. [49]

    Wang, H., Shi, C., Shi, S., Lei, M., Wang, S., He, D., Schiele, B., Wang, L.: DSVT: dynamic sparse voxel transformer with rotated sets. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2023, Vancouver, BC, Canada, June 17-24, 2023. pp. 13520–13529 (2023). https://doi.org/10.1109/CVPR52729.2023.01299

  50. [50]

    Wang, H., Tang, H., Shi, S., Li, A., Li, Z., Schiele, B., Wang, L.: Unitr: A unified and efficient multi-modal transformer for bird’s-eye-view representation. In: IEEE/CVF International Conference on Computer Vision, ICCV 2023, Paris, France, October 1-6, 2023. pp. 6769–6779 (2023). https://doi.org/10.1109/ICCV51070.2023.00625

  51. [51]

    Wang, H., Zhang, X., Li, Z., Li, J., Wang, K., Lei, Z., Haibing, R.: Ips300+: a challenging multi-modal data sets for intersection perception system. In: 2022 International Conference on Robotics and Automation (ICRA). pp. 2539–2545 (2022). https://doi.org/10.1109/ICRA46639.2022.9811699

  52. [52]

    Wang, X., Qi, S., Zhao, J., Zhou, H., Zhang, S., Wang, G., Tu, K., Guo, S., Zhao, J., Li, J., Qin, H., Yang, M.: Mctrack: A unified 3d multi-object tracking framework for autonomous driving. In: 2025 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). pp. 4551–4558 (2025). https://doi.org/10.1109/IROS60139.2025.11245874

  53. [53]

    Wei, S., Luo, C., Luo, Y.: Boosting multimodal learning via disentangled gradient learning. In: IEEE/CVF International Conference on Computer Vision, ICCV 2025, Honolulu, HI, USA, October 19-25, 2025. pp. 1–10 (2025). https://doi.org/10.1109/ICCV51701.2025.02124

  54. [54]

    Wei, Y., Wei, Z., Rao, Y., Li, J., Zhou, J., Lu, J.: Lidar distillation: Bridging the beam-induced domain gap for 3d object detection. In: Computer Vision - ECCV 2022 - 17th European Conference, Tel Aviv, Israel, October 23-27, 2022, Proceedings, Part XXXIX. pp. 179–195 (2022). https://doi.org/10.1007/978-3-031-19842-7_11

  55. [55]

    Weng, X., Wang, J., Held, D., Kitani, K.: 3d multi-object tracking: A baseline and new evaluation metrics. In: IEEE/RSJ International Conference on Intelligent Robots and Systems, IROS 2020, Las Vegas, NV, USA, October 24, 2020 - January 24, 2021. pp. 10359–10366 (2020). https://doi.org/10.1109/IROS45743.2020.9341164

  56. [56]

    Xiang, H., Zheng, Z., Xia, X., Xu, R., Gao, L., Zhou, Z., Han, X., Ji, X., Li, M., Meng, Z., Jin, L., Lei, M., Ma, Z., He, Z., Ma, H., Yuan, Y., Zhao, Y., Ma, J.: V2x-real: A large-scale dataset for vehicle-to-everything cooperative perception. In: Computer Vision – ECCV 2024: 18th European Conference, Milan, Italy, September 29–October 4, 2024, Proceedings, Part LII...

  57. [57]

    Xiao, A., Guan, D., Zhang, X., Lu, S.: Domain adaptive lidar point cloud segmentation with 3d spatial consistency. IEEE Transactions on Multimedia 26, 5536–5547 (2024). https://doi.org/10.1109/TMM.2023.3335879

  58. [58]

    Xu, R., Tu, Z., Xiang, H., Shao, W., Zhou, B., Ma, J.: Cobevt: Cooperative bird’s eye view semantic segmentation with sparse transformers. In: Conference on Robot Learning, CoRL 2022, 14-18 December 2022, Auckland, New Zealand. pp. 989–1000 (2022), https://proceedings.mlr.press/v205/xu23a.html, accessed: 25 Jun 2026

  59. [59]

    Xu, R., Xia, X., Li, J., Li, H., Zhang, S., Tu, Z., Meng, Z., Xiang, H., Dong, X., Song, R., Yu, H., Zhou, B., Ma, J.: V2v4real: A real-world large-scale dataset for vehicle-to-vehicle cooperative perception. In: The IEEE/CVF Computer Vision and Pattern Recognition Conference (CVPR) (2023)

  60. [60]

    Xu, R., Xiang, H., Tu, Z., Xia, X., Yang, M., Ma, J.: V2x-vit: Vehicle-to-everything cooperative perception with vision transformer. In: Computer Vision - ECCV 2022 - 17th European Conference, Tel Aviv, Israel, October 23-27, 2022, Proceedings, Part XXXIX. pp. 107–124 (2022). https://doi.org/10.1007/978-3-031-19842-7_7

  61. [61]

    Xu, R., Xiang, H., Xia, X., Han, X., Li, J., Ma, J.: OPV2V: an open benchmark dataset and fusion pipeline for perception with vehicle-to-vehicle communication. In: 2022 International Conference on Robotics and Automation, ICRA 2022, Philadelphia, PA, USA, May 23-27, 2022. pp. 2583–2589 (2022). https://doi.org/10.1109/ICRA46639.2022.9812038

  62. [62]

    Yan, Y., Mao, Y., Li, B.: Second: Sparsely embedded convolutional detection. Sensors 18(10) (2018). https://doi.org/10.3390/s18103337

  63. [63]

    Yang, L., Zhang, X., Li, J., Wang, C., Ma, J., Song, Z., Zhao, T., Song, Z., Wang, L., Zhou, M., Shen, Y., Lv, C.: V2x-radar: A multi-modal dataset with 4d radar for cooperative perception. Advances in Neural Information Processing Systems (NeurIPS) (2025)

  64. [64]

    Ye, X., Shu, M., Li, H., Shi, Y., Li, Y., Wang, G., Tan, X., Ding, E.: Rope3D: The Roadside Perception Dataset for Autonomous Driving and Monocular 3D Object Detection Task. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2022, New Orleans, LA, USA, June 18-24, 2022. pp. 21341–21350 (2022). https://doi.org/10.1109/CVPR52688.2022.02065

  65. [65]

    Yin, T., Zhou, X., Krähenbühl, P.: Center-based 3d object detection and tracking. In: IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2021, virtual, June 19-25, 2021. pp. 11784–11793 (2021). https://doi.org/10.1109/CVPR46437.2021.01161

  66. [66]

    Yongqiang, D., Dengjiang, W., Gang, C., Bing, M., Xijia, G., Yajun, W., Jianchao, L., Yanming, F., Juanjuan, L.: Baai-vanjee roadside dataset: Towards the connected automated vehicle highway technologies in challenging environments of china (2021), https://arxiv.org/abs/2105.14370, accessed: 25 Jun 2026

  67. [67]

    Yu, H., Luo, Y., Shu, M., Huo, Y., Yang, Z., Shi, Y., Guo, Z., Li, H., Hu, X., Yuan, J., Nie, Z.: Dair-v2x: A large-scale dataset for vehicle-infrastructure cooperative 3d object detection. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 21361–21370 (2022)

  68. [68]

    Zhang, G., Fan, L., He, C., Lei, Z., Zhang, Z., Zhang, L.: Voxel mamba: Group-free state space models for point cloud based 3d object detection. In: Advances in Neural Information Processing Systems 37: Annual Conference on Neural Information Processing Systems 2024, NeurIPS 2024, Vancouver, BC, Canada, December 10 - 15, 2024 (2024)

  69. [69]

    Zhang, Z.: A flexible new technique for camera calibration. IEEE Transactions on Pattern Analysis and Machine Intelligence 22(11), 1330–1334 (2000). https://doi.org/10.1109/34.888718

  70. [70]

    Zhu, L.Z.Y.L.A.L.L.L.R.M.B.B.M.: Omnihd-scenes: A next-generation multimodal dataset for autonomous driving. IEEE Transactions on Pattern Analysis and Machine Intelligence 48(6), 7032–7049 (2026). https://doi.org/10.1109/TPAMI.2026.3663672

  71. [71]

    Zhu, X., Sheng, H., Cai, S., Deng, B., Yang, S., Liang, Q., Chen, K., Gao, L., Song, J., Ye, J.: Roscenes: A large-scale multi-view 3d dataset for roadside perception. In: European Conference on Computer Vision. pp. 331–347. Springer (2024)

  72. [72]

    Zimmer, W., Creß, C., Nguyen, H.T., Knoll, A.C.: Tumtraf intersection dataset: All you need for urban 3d camera-lidar roadside perception. In: 2023 IEEE 26th International Conference on Intelligent Transportation Systems (ITSC). pp. 1030–1037 (2023). https://doi.org/10.1109/ITSC57777.2023.10422289

  73. [73]

    Zimmer, W., Wardana, G.A., Sritharan, S., Zhou, X., Song, R., Knoll, A.C.: Tumtraf v2x cooperative perception dataset. In: 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 22668–22677 (2024)

Appendix A Sensor Setup

This section provides more details of sensor setup...