arxiv: 2404.03191 · v3 · submitted 2024-04-04 · 💻 cs.CV

CORP: A Multi-Modal Dataset for Campus-Oriented Roadside Perception Tasks

Pith reviewed 2026-05-24 02:16 UTC · model grok-4.3

classification 💻 cs.CV
keywords roadside perception · multi-modal dataset · campus scenarios · autonomous driving · LiDAR · object tracking · instance segmentation · benchmark dataset

The pith

CORP is the first public benchmark dataset for multi-modal roadside perception in campus settings.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper notes that existing roadside perception datasets concentrate on urban arterial roads and overlook residential areas such as campuses, which exhibit distinct characteristics. To address this gap, it releases CORP, a collection of over 205k images and 102k point clouds from 18 cameras and 9 LiDAR sensors mounted on utility poles within a university campus. Annotations extend beyond bounding boxes to include unique IDs for tracking and pixel masks for instance segmentation. A sympathetic reader would care because the dataset supplies what is needed to develop and evaluate perception systems for objects and behaviors in these overlooked environments.

Core claim

The authors propose CORP as the first public benchmark dataset tailored for multi-modal roadside perception tasks under campus scenarios. Collected in a university campus, CORP consists of over 205k images plus 102k point clouds captured from 18 cameras and 9 LiDAR sensors with different configurations mounted on roadside utility poles to provide diverse viewpoints. The annotations encompass multi-dimensional information beyond 2D and 3D bounding boxes, providing extra support for 3D seamless tracking and instance segmentation with unique IDs and pixel masks for identifying targets, to enhance the understanding of objects and their behaviors distributed across the campus premises.

What carries the argument

The CORP dataset, built from synchronized multi-modal sensor streams on utility poles together with extended labels for tracking and segmentation.

If this is right

  • Researchers can train and benchmark multi-modal fusion methods on synchronized campus image and point cloud streams from varied viewpoints.
  • Algorithms for 3D object tracking can exploit the unique IDs across frames to maintain identities through campus scenes.
  • Instance segmentation models gain access to pixel masks that link 2D and 3D annotations for the same targets.
  • Perception systems for intelligent transportation can be evaluated on residential-area challenges such as pedestrian and cyclist behaviors near campus buildings.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The multi-view utility-pole setup could be replicated in other non-arterial environments such as parks to test whether the same annotation style generalizes.
  • Comparison experiments between CORP and urban datasets would quantify how much viewpoint and scene-type differences affect current model performance.
  • The dataset's scale and sensor diversity make it suitable for studying long-term object re-identification across repeated campus routes.

Load-bearing premise

Campus scenarios exhibit entirely distinct characteristics from urban arterial roads that are not addressed by existing datasets.

What would settle it

Demonstration that perception models trained solely on existing urban roadside datasets reach equivalent accuracy on campus tasks without retraining or new labels.

Figures

Figures reproduced from arXiv: 2404.03191 by Beibei Wang, Haojie Ren, Jianmin Ji, Jingjing Huang, Lu Zhang, Yanyong Zhang, Yao Li, Yuru Peng, Yuxuan Xiao, Yu Zhang, Zijian Yu.

Figure 1: In the domain of 3D object detection, BEVDepth[…
Figure 2: A BEV overview of the data pattern in CORP. The colored are LiDAR point clouds and the schematic yellow …
Figure 3: An overview of 4 types of coordinate systems involved in CORP, …
Figure 4: CORP image examples with (a) 3D annotations, (b) 2D boxes, and (c) segmentation masks. All are images …
Figure 5: Distribution of target location and orientation. (a) is a stacked overview of targets under BEV in their …
Figure 6: A comparison of the target density in CORP and two of its urban-road counterparts. (a) and (c) are the …
Figure 7: Some challenging scenarios in CORP for object detection and segmentation tasks.
Figure 8: A performance comparison between P3D and IPM. The …
Figure 9: An illustration of camera and roadside coordinate systems. The camera coordinate system is denoted as …
Figure 10: Overview of our P3D method.
  (Appendix .2.2, Implementation details) Once the intrinsic parameters, pose angles and the height of camera sensors are measured, we can lift a 2D target in an image to a 3D point in the camera coordinate system by following the closed-form Eq. (7) and Eq. (8) with no computational cost, given an image-based 2D detector employed beforehand to produce the bounding boxes of interested target…
Figure 11: Flatness profiles of the typical ground surfaces in the dataset and sample images for cameras in the …
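
The text attached to Figure 10 describes lifting a 2D image target to a 3D point once the camera intrinsics, pose angles and mounting height are known. The paper's Eq. (7) and Eq. (8) are not reproduced on this page, so the following is only a generic sketch of the underlying geometry (intersecting a pixel's viewing ray with a flat ground plane), not the authors' P3D formulation. It assumes a pinhole camera with zero roll, a downward pitch angle, a perfectly flat ground, and camera axes x right, y down, z forward; the function name and parameters are hypothetical.

```python
import math
from typing import Optional, Tuple


def lift_pixel_to_ground(
    u: float, v: float,
    fx: float, fy: float, cx: float, cy: float,
    cam_height: float, pitch_rad: float,
) -> Optional[Tuple[float, float, float]]:
    """Intersect the viewing ray of pixel (u, v) with a flat ground plane.

    Returns the 3D point in camera coordinates (x right, y down, z forward),
    or None when the ray does not hit the ground in front of the camera.
    Assumption: zero roll, camera pitched down by pitch_rad, flat ground.
    """
    # Direction of the viewing ray through the pixel (pinhole model).
    dx, dy, dz = (u - cx) / fx, (v - cy) / fy, 1.0
    # World "down" direction expressed in the camera frame.
    ny, nz = math.cos(pitch_rad), math.sin(pitch_rad)
    # The ground plane is the set of points p with n . p = cam_height.
    denom = ny * dy + nz * dz
    if denom <= 1e-9:
        return None  # ray is parallel to the ground or points above the horizon
    t = cam_height / denom
    return (t * dx, t * dy, t * dz)


if __name__ == "__main__":
    # Camera 5 m above ground, pitched 30 degrees down, pixel at the image centre.
    print(lift_pixel_to_ground(960, 540, 1200, 1200, 960, 540, 5.0, math.radians(30)))
```

In practice the pixel would typically be the bottom-centre of a 2D detection box, and uneven ground (the subject of Figure 11) would break the flat-plane assumption used here.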
Original abstract

Numerous roadside perception datasets have been introduced to propel advancements in autonomous driving and intelligent transportation systems research and development. However, it has been observed that the majority of their concentrates is on urban arterial roads, inadvertently overlooking residential areas such as parks and campuses that exhibit entirely distinct characteristics. In light of this gap, we propose CORP, which stands as the first public benchmark dataset tailored for multi-modal roadside perception tasks under campus scenarios. Collected in a university campus, CORP consists of over 205k images plus 102k point clouds captured from 18 cameras and 9 LiDAR sensors. These sensors with different configurations are mounted on roadside utility poles to provide diverse viewpoints within the campus region. The annotations of CORP encompass multi-dimensional information beyond 2D and 3D bounding boxes, providing extra support for 3D seamless tracking and instance segmentation with unique IDs and pixel masks for identifying targets, to enhance the understanding of objects and their behaviors distributed across the campus premises. Unlike other roadside datasets about urban traffic, CORP extends the spectrum to highlight the challenges for multi-modal perception in campuses and other residential areas.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces CORP, claimed as the first public multi-modal roadside perception benchmark for campus scenarios. It comprises over 205k images and 102k point clouds captured by 18 cameras and 9 LiDARs mounted on utility poles, with annotations extending beyond 2D/3D bounding boxes to include unique IDs for 3D tracking and pixel masks for instance segmentation. The central claim is that existing datasets focus on urban arterial roads while overlooking distinct characteristics of residential/campus areas, and that CORP fills this gap with its sensor diversity and multi-dimensional labels.

Significance. If the novelty and distinctness claims hold, CORP would offer a useful addition to the literature by providing data from an underrepresented environment (university campus) with rich annotations supporting tracking and segmentation tasks. The multi-view, multi-modal sensor configuration is a concrete strength for perception research.

major comments (2)
  1. [Abstract] Abstract: The assertion that CORP 'stands as the first public benchmark dataset tailored for multi-modal roadside perception tasks under campus scenarios' and that campuses 'exhibit entirely distinct characteristics' is load-bearing for the contribution but is not supported by any quantitative comparisons (object density, trajectory statistics, scene diversity metrics, or similar) to prior roadside datasets; without these, the gap-filling claim cannot be evaluated.
  2. [Data annotation / labeling sections] Annotation description (full text, data collection and labeling sections): No details are provided on annotation validation procedures, quality control, or metrics such as inter-annotator agreement; this directly affects the claim that the 'multi-dimensional information' and 'unique IDs and pixel masks' meaningfully enhance understanding of objects and behaviors.
minor comments (2)
  1. [Abstract] Abstract: Typo/grammar: 'the majority of their concentrates is' should be rephrased to 'the majority of their concentration is' or 'most of their focus is'.
  2. [Abstract] Abstract: The phrasing 'Unlike other roadside datasets about urban traffic' is imprecise; consider 'Unlike other roadside datasets focused on urban traffic'.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive report. We address each major comment below and commit to revisions that directly respond to the concerns raised.

Point-by-point responses
  1. Referee: [Abstract] Abstract: The assertion that CORP 'stands as the first public benchmark dataset tailored for multi-modal roadside perception tasks under campus scenarios' and that campuses 'exhibit entirely distinct characteristics' is load-bearing for the contribution but is not supported by any quantitative comparisons (object density, trajectory statistics, scene diversity metrics, or similar) to prior roadside datasets; without these, the gap-filling claim cannot be evaluated.

    Authors: We agree that the distinctness claim would be strengthened by quantitative evidence. In the revised manuscript we will insert a new comparison subsection (or table) reporting concrete metrics (object density per frame, average trajectory duration, number of unique object classes per scene, and scene entropy measures) computed on CORP versus representative prior roadside datasets focused on arterial roads. This addition will allow readers to evaluate the claimed gap directly.

    Revision: yes

  2. Referee: [Data annotation / labeling sections] Annotation description (full text, data collection and labeling sections): No details are provided on annotation validation procedures, quality control, or metrics such as inter-annotator agreement; this directly affects the claim that the 'multi-dimensional information' and 'unique IDs and pixel masks' meaningfully enhance understanding of objects and behaviors.

    Authors: We accept that the absence of quality-control details weakens the annotation claims. The revised manuscript will add a concise subsection under Data Annotation that describes the multi-stage validation workflow (initial labeling followed by independent review by two additional annotators), the resolution protocol for disagreements, and the computed inter-annotator agreement scores (e.g., IoU for boxes and masks, ID consistency for tracking).

    Revision: yes
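
The agreement scores named in the second response (IoU for boxes and masks) can be illustrated with a short sketch. It assumes axis-aligned 2D boxes given as (x_min, y_min, x_max, y_max) and masks given as equal-sized grids of 0/1 values; these layouts and the function names are illustrative choices, not the CORP annotation format or the authors' validation code.

```python
from typing import Sequence, Tuple

# Hypothetical box layout: (x_min, y_min, x_max, y_max) in pixels.
Box = Tuple[float, float, float, float]


def box_iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned 2D boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def mask_iou(a: Sequence[Sequence[int]], b: Sequence[Sequence[int]]) -> float:
    """Intersection over union of two binary masks of identical shape."""
    inter = 0
    union = 0
    for row_a, row_b in zip(a, b):
        for pa, pb in zip(row_a, row_b):
            if pa and pb:
                inter += 1
            if pa or pb:
                union += 1
    return inter / union if union > 0 else 0.0


if __name__ == "__main__":
    # Two annotators label the same target with slightly different boxes.
    print(box_iou((0, 0, 2, 2), (1, 1, 3, 3)))
    # Two annotators draw overlapping masks on a 2x2 crop.
    print(mask_iou([[1, 1], [0, 0]], [[1, 0], [1, 0]]))
```

Averaging such per-object scores over a doubly annotated subset would give one possible agreement figure; 3D boxes with orientation would need a rotated-box IoU, which this sketch does not cover.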

Circularity Check

0 steps flagged

No significant circularity

Full rationale

The paper introduces a new multi-modal roadside perception dataset (CORP) collected on a university campus. It contains no equations, derivations, fitted parameters, predictions, or uniqueness theorems. The central claim—that CORP is the first public benchmark for campus scenarios because prior datasets focus on urban arterial roads—is presented as an empirical observation rather than a derived result. No load-bearing step reduces by construction to the paper's own inputs, self-citations, or ansatzes. The contribution is self-contained as a data-collection and annotation effort.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is a dataset introduction paper whose central contribution is the collection and annotation of new empirical data rather than any fitted parameters, unproven axioms, or postulated entities.

pith-pipeline@v0.9.0 · 5755 in / 1071 out tokens · 50232 ms · 2026-05-24T02:16:14.306567+00:00 · methodology


Reference graph

Works this paper leans on

61 extracted references · 61 canonical work pages

  1. [1]

    Continuous emulation and multiscale visualization of traffic flow using stationary roadside sensor data

    Haowen Xu, Anne Berres, Sarah A. Tennille, Srinath K. Ravulaparthy, Chieh Wang, and Jibonananda Sanyal. Continuous emulation and multiscale visualization of traffic flow using stationary roadside sensor data. IEEE Transactions on Intelligent Transportation Systems, 23(8):10530–10541, 2022

  2. [2]

    Vips: Real-time perception fusion for infrastructure-assisted autonomous driving

    Shuyao Shi, Jiahe Cui, Zhehao Jiang, Zhenyu Yan, Guoliang Xing, Jianwei Niu, and Zhenchao Ouyang. Vips: Real-time perception fusion for infrastructure-assisted autonomous driving. In Proceedings of the 28th Annual International Conference on Mobile Computing And Networking, MobiCom ’22, pages 133–146, New York, NY, USA, 2022. Association for Computing Machinery

  3. [3]

    A day on campus - an anomaly detection dataset for events in a single camera

    Pranav Mantini, Zhenggang Li, and K. Shishir Shah. A day on campus - an anomaly detection dataset for events in a single camera. In Hiroshi Ishikawa, Cheng-Lin Liu, Tomas Pajdla, and Jianbo Shi, editors, Computer Vision – ACCV 2020, pages 619–635, Cham, 2021. Springer International Publishing

  4. [4]

    Dair-v2x: A large-scale dataset for vehicle-infrastructure cooperative 3d object detection

    Haibao Yu, Yizhen Luo, Mao Shu, Yiyi Huo, Zebang Yang, Yifeng Shi, Zhenglong Guo, Hanyu Li, Xing Hu, Jirui Yuan, and Zaiqing Nie. Dair-v2x: A large-scale dataset for vehicle-infrastructure cooperative 3d object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 21361–21370, June 2022

  5. [5]

    Rope3d: The roadside perception dataset for autonomous driving and monocular 3d object detection task

    Xiaoqing Ye, Mao Shu, Hanyu Li, Yifeng Shi, Yingying Li, Guangjie Wang, Xiao Tan, and Errui Ding. Rope3d: The roadside perception dataset for autonomous driving and monocular 3d object detection task. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 21341–21350, June 2022

  6. [6]

    Walter Zimmer, Christian Creß, Huu Tung Nguyen, and Alois C. Knoll. A9 intersection dataset: All you need for urban 3d camera-lidar roadside perception, 2023

  7. [7]

    Ips300+: a challenging multi-modal data sets for intersection perception system

    Huanan Wang, Xinyu Zhang, Zhiwei Li, Jun Li, Kun Wang, Zhu Lei, and Ren Haibing. Ips300+: a challenging multi-modal data sets for intersection perception system. In 2022 International Conference on Robotics and Automation (ICRA), pages 2539–2545, 2022

  8. [8]

    A9-dataset: Multi-sensor infrastructure-based dataset for mobility research

    Christian Creß, Walter Zimmer, Leah Strand, Maximilian Fortkord, Siyi Dai, Venkatnarayanan Lakshminarasimhan, and Alois Knoll. A9-dataset: Multi-sensor infrastructure-based dataset for mobility research. In 2022 IEEE Intelligent Vehicles Symposium (IV), pages 965–970, 2022

  9. [9]

    Bevheight: A robust framework for vision-based roadside 3d object detection

    Lei Yang, Kaicheng Yu, Tao Tang, Jun Li, Kun Yuan, Li Wang, Xinyu Zhang, and Peng Chen. Bevheight: A robust framework for vision-based roadside 3d object detection. In IEEE/CVF Conf. on Computer Vision and Pattern Recognition (CVPR), March 2023

  10. [10]

    Bevdepth: Acquisition of reliable depth for multi-view 3d object detection, 2022

    Yinhao Li, Zheng Ge, Guanyi Yu, Jinrong Yang, Zengran Wang, Yukang Shi, Jianjian Sun, and Zeming Li. Bevdepth: Acquisition of reliable depth for multi-view 3d object detection, 2022

  11. [11]

    Bevformer: Learning bird’s-eye-view representation from multi-camera images via spatiotemporal transformers, 2022

    Zhiqi Li, Wenhai Wang, Hongyang Li, Enze Xie, Chonghao Sima, Tong Lu, Qiao Yu, and Jifeng Dai. Bevformer: Learning bird’s-eye-view representation from multi-camera images via spatiotemporal transformers, 2022

  12. [12]

    A revisit of sparse coding based anomaly detection in stacked rnn framework

    Weixin Luo, Wen Liu, and Shenghua Gao. A revisit of sparse coding based anomaly detection in stacked rnn framework. In 2017 IEEE International Conference on Computer Vision (ICCV), pages 341–349, 2017

  13. [13]

    A new comprehensive benchmark for semi-supervised video anomaly detection and anticipation

    Congqi Cao, Yue Lu, Peng Wang, and Yanning Zhang. A new comprehensive benchmark for semi-supervised video anomaly detection and anticipation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 20392–20401, June 2023

  14. [14]

    Developing and testing robust autonomy: The university of sydney campus data set

    Wei Zhou, Julie Stephany Berrio, Charika De Alvis, Mao Shan, Stewart Worrall, James Ward, and Eduardo Nebot. Developing and testing robust autonomy: The university of sydney campus data set. IEEE Intelligent Transportation Systems Magazine, 12(4):23–40, 2020

  15. [15]

    Campus3d: A photogrammetry point cloud benchmark for hierarchical understanding of outdoor scene

    Xinke Li, Chongshou Li, Zekun Tong, Andrew Lim, Junsong Yuan, Yuwei Wu, Jing Tang, and Raymond Huang. Campus3d: A photogrammetry point cloud benchmark for hierarchical understanding of outdoor scene. MM ’20, pages 238–246, New York, NY, USA, 2020. Association for Computing Machinery

  16. [16]

    V2x-seq: A large-scale sequential dataset for vehicle-infrastructure cooperative perception and forecasting

    Haibao Yu, Wenxian Yang, Hongzhi Ruan, Zhenwei Yang, Yingjuan Tang, Xu Gao, Xin Hao, Yifeng Shi, Yifeng Pan, Ning Sun, Juan Song, Jirui Yuan, Ping Luo, and Zaiqing Nie. V2x-seq: A large-scale sequential dataset for vehicle-infrastructure cooperative perception and forecasting. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recogn...

  17. [17]

    Int2: Interactive trajectory prediction at intersections

    Zhijie Yan, Pengfei Li, Zheng Fu, Shaocong Xu, Yongliang Shi, Xiaoxue Chen, Yuhang Zheng, Yang Li, Tianyu Liu, Chuxuan Li, Nairui Luo, Xu Gao, Yilun Chen, Zuoxu Wang, Yifeng Shi, Pengfei Huang, Zhengxiao Han, Jirui Yuan, Jiangtao Gong, Guyue Zhou, Hang Zhao, and Hao Zhao. Int2: Interactive trajectory prediction at intersections. In Proceedings of the IEEE...

  18. [18]

    You only look once: Unified, real-time object detection

    Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi. You only look once: Unified, real-time object detection. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 779–788, 2016

  19. [19]

    Yolo9000: Better, faster, stronger

    Joseph Redmon and Ali Farhadi. Yolo9000: Better, faster, stronger. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 6517–6525, 2017

  20. [20]

    Yolov3: An incremental improvement, 2018

    Joseph Redmon and Ali Farhadi. Yolov3: An incremental improvement, 2018

  21. [21]

    Yolov4: Optimal speed and accuracy of object detection, 2020

    Alexey Bochkovskiy, Chien-Yao Wang, and Hong-Yuan Mark Liao. Yolov4: Optimal speed and accuracy of object detection, 2020

  22. [22]

    YOLOv5 by Ultralytics, May 2020

    Glenn Jocher. YOLOv5 by Ultralytics, May 2020

  23. [23]

    You only learn one representation: Unified network for multiple tasks, 2021

    Chien-Yao Wang, I-Hau Yeh, and Hong-Yuan Mark Liao. You only learn one representation: Unified network for multiple tasks, 2021

  24. [24]

    Yolov7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors, 2022

    Chien-Yao Wang, Alexey Bochkovskiy, and Hong-Yuan Mark Liao. Yolov7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors, 2022

  25. [25]

    YOLO by Ultralytics, January 2023

    Glenn Jocher, Ayush Chaurasia, and Jing Qiu. YOLO by Ultralytics, January 2023

  26. [26]

    Modnet: Motion and appearance based moving object detection network for autonomous driving

    Mennatullah Siam, Heba Mahgoub, Mohamed Zahran, Senthil Yogamani, Martin Jagersand, and Ahmad El-Sallab. Modnet: Motion and appearance based moving object detection network for autonomous driving. In 2018 21st International Conference on Intelligent Transportation Systems (ITSC), pages 2859–2864, 2018

  27. [27]

    Monocular instance motion segmentation for autonomous driving: Kitti instancemotseg dataset and multi-task baseline

    Eslam Mohamed, Mahmoud Ewaisha, Mennatullah Siam, Hazem Rashed, Senthil Yogamani, Waleed Hamdy, Mohamed El-Dakdouky, and Ahmad El-Sallab. Monocular instance motion segmentation for autonomous driving: Kitti instancemotseg dataset and multi-task baseline. In 2021 IEEE Intelligent Vehicles Symposium (IV), pages 114–121, 2021

  28. [28]

    Learning to segment rigid motions from two frames

    Gengshan Yang and Deva Ramanan. Learning to segment rigid motions from two frames. In 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 1266–1275, 2021

  29. [29]

    Discovering objects that can move

    Zhipeng Bao, Pavel Tokmakov, Allan Jabri, Yu-Xiong Wang, Adrien Gaidon, and Martial Hebert. Discovering objects that can move. In 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 11779–11788, 2022

  30. [30]

    Segmenting moving objects via an object-centric layered representation

    Junyu Xie, Weidi Xie, and Andrew Zisserman. Segmenting moving objects via an object-centric layered representation. In S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh, editors, Advances in Neural Information Processing Systems, volume 35, pages 28023–28036. Curran Associates, Inc., 2022

  31. [31]

    Pointpillars: Fast encoders for object detection from point clouds

    Alex H. Lang, Sourabh Vora, Holger Caesar, Lubing Zhou, Jiong Yang, and Oscar Beijbom. Pointpillars: Fast encoders for object detection from point clouds. In 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 12689–12697, 2019

  32. [32]

    Voxelnet: End-to-end learning for point cloud based 3d object detection

    Yin Zhou and Oncel Tuzel. Voxelnet: End-to-end learning for point cloud based 3d object detection. In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4490–4499, 2018

  33. [33]

    3dssd: Point-based 3d single stage object detector

    Zetong Yang, Yanan Sun, Shu Liu, and Jiaya Jia. 3dssd: Point-based 3d single stage object detector. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 11037–11045, 2020

  34. [34]

    Voxel r-cnn: Towards high performance voxel-based 3d object detection, 2021

    Jiajun Deng, Shaoshuai Shi, Peiwei Li, Wengang Zhou, Yanyong Zhang, and Houqiang Li. Voxel r-cnn: Towards high performance voxel-based 3d object detection, 2021

  35. [35]

    Center-based 3d object detection and tracking

    Tianwei Yin, Xingyi Zhou, and Philipp Krähenbühl. Center-based 3d object detection and tracking. In 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 11779–11788, 2021

  36. [36]

    Neighbor-vote: Improving monocular 3d object detection through neighbor distance voting

    Xiaomeng Chu, Jiajun Deng, Yao Li, Zhenxun Yuan, Yanyong Zhang, Jianmin Ji, and Yu Zhang. Neighbor-vote: Improving monocular 3d object detection through neighbor distance voting. In Proceedings of the 29th ACM International Conference on Multimedia, MM ’21, pages 5239–5247, New York, NY, USA, 2021. Association for Computing Machinery

  37. [37]

    Objects are different: Flexible monocular 3d object detection

    Yunpeng Zhang, Jiwen Lu, and Jie Zhou. Objects are different: Flexible monocular 3d object detection. In 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 3288–3297, 2021

  38. [38]

    Imvoxelnet: Image to voxels projection for monocular and multi-view general-purpose 3d object detection

    Danila Rukhovich, Anna Vorontsova, and Anton Konushin. Imvoxelnet: Image to voxels projection for monocular and multi-view general-purpose 3d object detection. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 2397–2406, 2022

  39. [39]

    Monoatt: Online monocular 3d object detection with adaptive token transformer

    Yunsong Zhou, Hongzi Zhu, Quan Liu, Shan Chang, and Minyi Guo. Monoatt: Online monocular 3d object detection with adaptive token transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 17493–17503, June 2023

  40. [40]

    Unimode: Unified monocular 3d object detection, 2024

    Zhuoling Li, Xiaogang Xu, SerNam Lim, and Hengshuang Zhao. Unimode: Unified monocular 3d object detection, 2024

  41. [41]

    Monouni: A unified vehicle and infrastructure-side monocular 3d object detection network with sufficient depth clues

    Jia Jinrang, Zhenjia Li, and Yifeng Shi. Monouni: A unified vehicle and infrastructure-side monocular 3d object detection network with sufficient depth clues. In A. Oh, T. Neumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine, editors, Advances in Neural Information Processing Systems, volume 36, pages 11703–11715. Curran Associates, Inc., 2023

  42. [42]

    Pointpainting: Sequential fusion for 3d object detection

    Sourabh Vora, Alex H Lang, Bassam Helou, and Oscar Beijbom. Pointpainting: Sequential fusion for 3d object detection. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 4604–4612, 2020

  43. [43]

    Bevfusion: Multi-task multi-sensor fusion with unified bird’s-eye view representation

    Zhijian Liu, Haotian Tang, Alexander Amini, Xinyu Yang, Huizi Mao, Daniela L Rus, and Song Han. Bevfusion: Multi-task multi-sensor fusion with unified bird’s-eye view representation. In 2023 IEEE International Conference on Robotics and Automation (ICRA), pages 2774–2781. IEEE, 2023

  44. [44]

    3d-cvf: Generating joint camera and lidar features using cross-view spatial feature fusion for 3d object detection

    Jin Hyeok Yoo, Yecheol Kim, Jisong Kim, and Jun Won Choi. 3d-cvf: Generating joint camera and lidar features using cross-view spatial feature fusion for 3d object detection. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXVII 16, pages 720–736. Springer, 2020

  45. [45]

    Simple online and realtime tracking

    Alex Bewley, Zongyuan Ge, Lionel Ott, Fabio Ramos, and Ben Upcroft. Simple online and realtime tracking. In 2016 IEEE International Conference on Image Processing (ICIP), pages 3464–3468, 2016

  46. [46]

    Simple online and realtime tracking with a deep association metric

    Nicolai Wojke, Alex Bewley, and Dietrich Paulus. Simple online and realtime tracking with a deep association metric. In 2017 IEEE International Conference on Image Processing (ICIP), pages 3645–3649. IEEE, 2017

  47. [47]

    Observation-centric sort: Rethinking sort for robust multi-object tracking

    Jinkun Cao, Jiangmiao Pang, Xinshuo Weng, Rawal Khirodkar, and Kris Kitani. Observation-centric sort: Rethinking sort for robust multi-object tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 9686–9696, June 2023

  48. [48]

    3d multi-object tracking: A baseline and new evaluation metrics

    Xinshuo Weng, Jianren Wang, David Held, and Kris Kitani. 3d multi-object tracking: A baseline and new evaluation metrics. In 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 10359–10366, 2020

  49. [49]

    Cross-modal 3d object detection and tracking for auto-driving

    Yihan Zeng, Chao Ma, Ming Zhu, Zhiming Fan, and Xiaokang Yang. Cross-modal 3d object detection and tracking for auto-driving. In Proc. Int. Conf. Intell. Robots Syst, pages 3850–3857. IEEE, 2021

  50. [50]

    Camera calibrator, 2022

    The MathWorks Inc. Camera calibrator, 2022

  51. [51]

    A flexible new technique for camera calibration

    Zhengyou Zhang. A flexible new technique for camera calibration. IEEE Transactions on pattern analysis and machine intelligence, 22(11):1330–1334, 2000

  52. [52]

    Segment anything, 2023

    Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C. Berg, Wan-Yen Lo, Piotr Dollár, and Ross Girshick. Segment anything, 2023

  53. [53]

    Sustech points: A portable 3d point cloud interactive annotation platform system

    E Li, Shuaijun Wang, Chengyang Li, Dachuan Li, Xiangbin Wu, and Qi Hao. Sustech points: A portable 3d point cloud interactive annotation platform system. In 2020 IEEE Intelligent Vehicles Symposium (IV), pages 1108–1115, 2020

  54. [54]

    U2-onet: A two-level nested octave u-structure network with a multi-scale attention mechanism for moving object segmentation

    Chenjie Wang, Chengyuan Li, Jun Liu, Bin Luo, Xin Su, Yajun Wang, and Yan Gao. U2-onet: A two-level nested octave u-structure network with a multi-scale attention mechanism for moving object segmentation. Remote Sensing, 13(1), 2021

  55. [55]

    Riwnet: A moving object instance segmentation network being robust in adverse weather conditions, 2021

    Chenjie Wang, Chengyuan Li, Bin Luo, Wei Wang, and Jun Liu. Riwnet: A moving object instance segmentation network being robust in adverse weather conditions, 2021

  56. [56]

    Real-time vehicle distance estimation using single view geometry

    Ahmed Ali, Ali Hassan, Afsheen Rafaqat Ali, Hussam Ullah Khan, Wajahat Kazmi, and Aamer Zaheer. Real-time vehicle distance estimation using single view geometry. In 2020 IEEE Winter Conference on Applications of Computer Vision (WACV), pages 1100–1109, 2020

  57. [57]

    Deep learning based vehicle position and orientation estimation via inverse perspective mapping image

    Youngseok Kim and Dongsuk Kum. Deep learning based vehicle position and orientation estimation via inverse perspective mapping image. In 2019 IEEE Intelligent Vehicles Symposium (IV), pages 317–323, 2019

  58. [58]

    Joint vehicle detection and distance prediction via monocular depth estimation

    Chao Shen, Xiangmo Zhao, Zhanwen Liu, Tao Gao, and Jiang Xu. Joint vehicle detection and distance prediction via monocular depth estimation. IET Intelligent Transport Systems, 14(7):753–763, 2020

  59. [59]

    Towards generalization across depth for monocular 3d object detection

    Andrea Simonelli, Samuel Rota Buló, Lorenzo Porzi, Elisa Ricci, and Peter Kontschieder. Towards generalization across depth for monocular 3d object detection. In Andrea Vedaldi, Horst Bischof, Thomas Brox, and Jan-Michael Frahm, editors, Computer Vision – ECCV 2020, pages 767–782, Cham, 2020. Springer International Publishing

  60. [60]

    Towards model generalization for monocular 3d object detection, 2022

    Zhenyu Li, Zehui Chen, Ang Li, Liangji Fang, Qinhong Jiang, Xianming Liu, and Junjun Jiang. Towards model generalization for monocular 3d object detection, 2022

  61. [61]

    Stereo inverse perspective mapping: theory and applications

    Massimo Bertozzi, Alberto Broggi, and Alessandra Fascioli. Stereo inverse perspective mapping: theory and applications. Image and Vision Computing, 16(8):585–590, 1998

Appendix .1 Rationale for P3D

In roadside scenarios, the camera sensor is usually positioned at a certain height Hc above a local ground plane. We define the camera and the road coordina...