
arxiv: 2606.09882 · v1 · pith:OC7U7LMJnew · submitted 2026-06-03 · 💻 cs.CV · cs.LG

WHU-Infra3D: A Full-stack Multi-modal Dataset and Benchmark for 3D Roadside Infrastructure Inventory

Pith reviewed 2026-06-28 06:57 UTC · model grok-4.3

keywords dataset · multi-modal · LiDAR · panoramic imagery · infrastructure inventory · 3D segmentation · attribute recognition · benchmark

The pith

WHU-Infra3D supplies aligned panoramic images, LiDAR scans, and 181k status annotations across 53.8 km to support automated roadside infrastructure health assessment.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents WHU-Infra3D as a multi-modal dataset that pairs panoramic imagery with LiDAR point clouds over roads in three cities. It adds strict 2D-3D instance links, cross-frame tracking, and detailed labels for attributes such as rust or occlusion on thousands of 3D objects. The authors run baselines on five tasks: 2D detection, cross-view matching, 3D geo-identification, point-cloud segmentation, and attribute recognition. They report that existing models show clear drops when moving between cities and fail more often on uncommon defect types. If the dataset's alignments and labels hold, it would let systems move from coarse mapping to tracking the actual condition of poles, signs, and other assets.

Core claim

WHU-Infra3D integrates panoramic imagery and LiDAR point clouds with rigorous 2D-3D instance association and cross-frame tracking, supplying over 181k attribute and status annotations on more than 175k multi-view 2D boxes and thousands of 3D instances to serve as a benchmark for 3D roadside infrastructure inventory and operational health assessment.

What carries the argument

Rigorous 2D-3D instance association combined with cross-frame tracking that connects multi-view 2D bounding boxes to 3D infrastructure objects and their status labels.
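
To make the linking idea concrete, here is a minimal sketch of how a shared instance ID could tie multi-view 2D boxes to a 3D object and its status labels. The class and field names (`Box2D`, `Instance3D`, `instance_id`, `status`) and the toy records are illustrative assumptions, not the dataset's actual schema or file format.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Box2D:
    """One 2D bounding box in one panoramic frame (hypothetical schema)."""
    frame_id: int
    instance_id: str   # assumed to be shared with the 3D instance it depicts
    xyxy: tuple        # (x_min, y_min, x_max, y_max) in pixels


@dataclass
class Instance3D:
    """One 3D object in the point cloud (hypothetical schema)."""
    instance_id: str
    category: str
    status: dict = field(default_factory=dict)  # e.g. {"rust": "severe"}


def group_views(boxes):
    """Group 2D boxes by instance ID, ordered by frame (cross-frame tracks)."""
    tracks = defaultdict(list)
    for box in boxes:
        tracks[box.instance_id].append(box)
    return {k: sorted(v, key=lambda b: b.frame_id) for k, v in tracks.items()}


def unlinked_boxes(boxes, instances):
    """Return 2D boxes whose ID has no 3D counterpart (a basic link check)."""
    known = {inst.instance_id for inst in instances}
    return [b for b in boxes if b.instance_id not in known]


# Toy records, invented for illustration only.
boxes = [
    Box2D(frame_id=11, instance_id="sign_0001", xyxy=(120, 40, 180, 110)),
    Box2D(frame_id=10, instance_id="sign_0001", xyxy=(100, 45, 150, 105)),
    Box2D(frame_id=10, instance_id="pole_0007", xyxy=(300, 10, 330, 400)),
]
instances = [Instance3D("sign_0001", "Traffic Sign", {"rust": "none"})]

tracks = group_views(boxes)
orphans = unlinked_boxes(boxes, instances)
```

Under this assumed layout, `tracks["sign_0001"]` holds the same physical sign seen in frames 10 and 11, and `orphans` lists any 2D box that lacks a 3D counterpart, which is the kind of association error a reader would want quantified.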

If this is right

  • Baselines can be run on 2D detection, 2D cross-view matching, 3D geo-identification, 3D point cloud segmentation, and attribute recognition.
  • Current models exhibit measurable performance drops across cities.
  • Current models are weaker on long-tailed defective status classes.
  • The annotations directly support diagnosis of operational health for urban assets.
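
The cross-city drop and the long-tail weakness listed above can each be summarized by a simple statistic. The sketch below is one possible way to compute them. The function names are hypothetical, and every number in it is a placeholder chosen for easy arithmetic, not a result from the paper.

```python
def cross_city_gap(scores):
    """Mean in-city score minus mean cross-city score.

    `scores` maps (train_city, test_city) to a metric such as mAP.
    """
    in_city = [v for (tr, te), v in scores.items() if tr == te]
    cross = [v for (tr, te), v in scores.items() if tr != te]
    return sum(in_city) / len(in_city) - sum(cross) / len(cross)


def per_class_recall(y_true, y_pred):
    """Recall per status class, so rare classes are not hidden by the average."""
    recall = {}
    for cls in set(y_true):
        idx = [i for i, t in enumerate(y_true) if t == cls]
        hits = sum(1 for i in idx if y_pred[i] == cls)
        recall[cls] = hits / len(idx)
    return recall


# Placeholder scores (NOT from the paper): train city -> test city.
scores = {
    ("Wuhan", "Wuhan"): 0.8,
    ("Wuhan", "Nanjing"): 0.6,
    ("Nanjing", "Nanjing"): 0.7,
    ("Nanjing", "Wuhan"): 0.5,
}
gap = cross_city_gap(scores)

# Placeholder labels (NOT from the paper): 8 common "normal", 2 rare "rust".
y_true = ["normal"] * 8 + ["rust"] * 2
y_pred = ["normal"] * 8 + ["normal", "rust"]
recall = per_class_recall(y_true, y_pred)
```

In this toy case overall accuracy is 90%, yet recall on the rare class is only half, which is why per-class reporting matters for defect statuses.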

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Cities could use models trained on the dataset to rank maintenance priorities by automatically flagging defects.
  • The 2D-3D links could support time-series tracking of individual assets across repeated surveys.
  • Adding more cities or sensor types would test whether the observed domain gaps shrink.

Load-bearing premise

The data collected from three cities and the annotation process produce representative samples of real infrastructure conditions with accurate 2D-3D links that apply beyond the collection sites.

What would settle it

If models trained on WHU-Infra3D showed no improvement over models trained on prior datasets when evaluated against maintenance records or in new cities, that would falsify the claim that the added alignments and status labels advance operational assessment.

Figures

Figures reproduced from arXiv: 2606.09882 by Bisheng Yang, Chong Liu, Luxuan Fu, Xuyu Feng, Zhen Dong.

Figure 1: Conceptual illustration of 3D Roadside Infrastructure
Figure 2: Overview of the WHU-Infra3D dataset. The dataset provides a comprehensive platform for urban infrastructure inventory by integrating (Left) cross-city panoramic imagery with 2D bounding box annotations and (Bottom) LiDAR point clouds with rich 3D annotations (bounding boxes, semantic masks, and instance masks). (Center) The core feature is Instance Association, which establishes consistent cross-frame trac…
Figure 3: Overview of the data collection trajectories across three cities. The panels illustrate the mobile mapping routes in (a) Wuhan, (b) Nanjing, and (c) Shanghai. 3. The WHU-Infra3D Dataset This section presents WHU-Infra3D from three complementary perspectives: data acquisition, full-stack annotation design, and statistical characteristics. We first describe the sensing platforms and collection protocol, t…
Figure 4: Representative 2D bounding box annotations on panoramic images from Wuhan, Shanghai, and Nanjing. The diverse urban layouts, lighting conditions, and infrastructure appearances across these three cities highlight the dataset’s scale and complexity. Note that the images are vertically cropped to focus on the road scenes for better visualization
Figure 6: Illustration of cross-frame instance association in WHU-Infra3D. The colored dashed lines connect the same physical objects (e.g., traffic signs, signal lights, bollards) across sequential panoramic frames (from Frame N-1 to Frame N+1). Note that the images are cropped for better visualization. …rich semantic annotations translate raw sensory data into actionable insights for infrastructure maintenance. To…
Figure 5: Representative visualization of 3D annotations in WHU-Infra3D. From top to bottom: 3D oriented bounding boxes, semantic labels with class-color legend, and instance labels. …robust tracking; and (2) Cross-modal Association, where the 2D image bounding box and the corresponding 3D point cloud instance share an identical ID, establishing a precise one-to-one mapping between the visual and geometric modalities…
Figure 7: Visualization of the hierarchical attribute schema in WHU-Infra3D. The concentric rings represent object categories, attributes, and specific values from the center outwards
Figure 8 illustrates the pronounced long-tail category distributions across both 2D and 3D instance annotations. In both modalities, Street Light, Traffic Sign, and Cylindrical Bollard constitute the majority of instances, whereas Fire Hydrant, Trash Bin, and Spherical Bollard appear comparatively rarely. Beyond the long-tail pattern, a clear cross-modal discrepancy can be observed: Manhole accounts for a much la…
Figure 9: Illustrative examples of rare-status synthesis in infrastructure scenes. The top row shows original infrastructure images, while the subsequent rows present text-guided edited/generated results from Seedream 4.5, GPT-Image-1, Sora-image, and Nano Banana Pro, conditioned on status descriptions such as displaced/open manhole cover, severe rust, and damaged traffic cone. …vancements in generative AI. As illu…
Figure 10: A conceptual framework for autonomous infrastructure asset management driven by Multi-Agent collaboration. The workflow transitions from passive perception to active decision-making under the coordination of a central orchestrator and four specialized agents. WHU-Infra3D provides the multi-modal training and evaluation substrate for the perception, comprehension, and verification modules. 6. Conclusion In…

Original abstract

The paradigm of digital twin cities is shifting from coarse visual mapping toward more precise and actionable digitization of urban assets. However, existing datasets predominantly focus on coarse visual perception, lacking the strict multi-modal alignment and attribute and status diagnosis required for automated infrastructure maintenance. To bridge this gap, we introduce WHU-Infra3D, a large-scale, multi-modal benchmark dataset dedicated to roadside infrastructure inventory. Covering 53.8 km across three cities, WHU-Infra3D uniquely integrates panoramic imagery and LiDAR point clouds with rigorous 2D-3D instance association and cross-frame tracking. Comprising over 175k multi-view 2D bounding boxes alongside thousands of 3D infrastructure instances, the dataset provides over 181k detailed attribute and status annotations (e.g., rust, occlusion) to empower operational health assessment. We establish comprehensive baselines across five core tasks: 2D detection, 2D cross-view matching, 3D geo-identification, 3D point cloud segmentation, and attribute recognition. Extensive evaluations expose significant cross-city domain gaps and inherent vulnerabilities of current models on long-tailed defective statuses, establishing WHU-Infra3D as an essential testbed for advancing scalable, AI-driven urban infrastructure inventory and lifecycle management. The WHU-Infra3D dataset is available at https://github.com/WHU-USI3DV/WHU-Infra3D.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The paper introduces WHU-Infra3D, a large-scale multi-modal dataset for 3D roadside infrastructure inventory covering 53.8 km across three cities. It integrates panoramic imagery and LiDAR point clouds with claimed rigorous 2D-3D instance association and cross-frame tracking, providing over 175k multi-view 2D bounding boxes, thousands of 3D instances, and over 181k attribute/status annotations (e.g., rust, occlusion). Baselines are reported for five tasks (2D detection, 2D cross-view matching, 3D geo-identification, 3D point cloud segmentation, attribute recognition) that expose cross-city domain gaps and model vulnerabilities on long-tailed defective statuses. The dataset is released via GitHub.

Significance. If the 2D-3D associations and status labels prove accurate and representative, the dataset would offer a useful benchmark for infrastructure inventory tasks by supplying scale, multi-modal alignment, and explicit focus on defective conditions that current models struggle with, thereby supporting research on operational health assessment and cross-domain robustness.

major comments (1)
  1. [Abstract] Abstract: the central claims that the dataset supplies 'rigorous 2D-3D instance association' and 'detailed attribute and status annotations' sufficient to 'empower operational health assessment' and 'expose significant cross-city domain gaps' rest on unverified annotation quality. No annotation protocol, inter-annotator agreement statistics, expert-review sampling rate, or quantitative error analysis for the 3D instance linking step is supplied; without these the distinction between genuine long-tailed defective distributions and annotation bias or association noise cannot be made.
minor comments (1)
  1. [Abstract] Abstract: the figures 'over 175k multi-view 2D bounding boxes' and 'over 181k detailed attribute and status annotations' are presented without clarifying their exact relationship or overlap; a brief parenthetical or table reference would improve precision.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive comment on annotation transparency. We agree that additional details on the annotation process are needed to support the claims and will revise the manuscript accordingly.

point-by-point responses
  1. Referee: [Abstract] Abstract: the central claims that the dataset supplies 'rigorous 2D-3D instance association' and 'detailed attribute and status annotations' sufficient to 'empower operational health assessment' and 'expose significant cross-city domain gaps' rest on unverified annotation quality. No annotation protocol, inter-annotator agreement statistics, expert-review sampling rate, or quantitative error analysis for the 3D instance linking step is supplied; without these the distinction between genuine long-tailed defective distributions and annotation bias or association noise cannot be made.

    Authors: We acknowledge the referee's point that the current version does not provide sufficient documentation of annotation quality controls. In the revised manuscript we will add a dedicated subsection (likely in Section 3) that describes: (1) the full annotation protocol for 2D boxes, 3D instance linking, and attribute/status labels; (2) inter-annotator agreement statistics (e.g., Cohen's kappa or average IoU on overlapping annotations); (3) the sampling rate and criteria for expert review; and (4) any quantitative error analysis performed on the 3D association step (e.g., manual verification on a held-out subset). These additions will allow readers to better evaluate potential annotation bias versus genuine long-tailed distributions.

    Revision: yes
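
For readers unfamiliar with the two agreement statistics named in the rebuttal, the sketch below gives plain-Python versions of Cohen's kappa (for categorical status labels) and box IoU (for overlapping 2D annotations). It is a generic illustration under assumed inputs, not the authors' evaluation code, and the sample labels and boxes are invented.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each annotator's marginal label frequencies.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1.0 - expected)


def box_iou(a, b):
    """Intersection over union of two (x_min, y_min, x_max, y_max) boxes."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


# Invented example: two annotators, four objects.
annotator_1 = ["rust", "rust", "ok", "ok"]
annotator_2 = ["rust", "ok", "ok", "ok"]
kappa = cohens_kappa(annotator_1, annotator_2)

# Invented example: two boxes drawn for the same object.
iou = box_iou((0, 0, 2, 2), (1, 1, 3, 3))
```

Kappa corrects raw agreement for the agreement expected by chance, which matters when one status (such as "ok") dominates. IoU plays the analogous role for box geometry.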

Circularity Check

0 steps flagged

No circularity: empirical dataset paper with no derivations or fitted predictions

full rationale

The paper introduces a multi-modal dataset and benchmark for infrastructure inventory, reporting collection over 53.8 km, 175k 2D boxes, thousands of 3D instances, and 181k annotations, plus baseline results on five tasks. No equations, parameter fitting, predictions derived from inputs, or self-citation chains appear in the abstract or described content. The central claims rest on empirical data release and standard benchmark evaluations rather than any reduction of outputs to inputs by construction. This is the expected non-finding for a dataset paper.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is an empirical dataset and benchmark paper with no mathematical derivations or theoretical components; no free parameters, axioms, or invented entities are introduced.


