arXiv: 2510.26641 · v4 · submitted 2025-10-30 · 💻 cs.CV

All You Need for Object Detection: From Pixels, Points, and Prompts to Next-Gen Fusion and Multimodal LLMs/VLMs in Autonomous Vehicles

Pith reviewed 2026-05-18 02:55 UTC · model grok-4.3

classification 💻 cs.CV
keywords autonomous vehicles · object detection · sensor fusion · vision-language models · large language models · multimodal perception · cooperative perception · transformer models

The pith

Synthesizing sensor fusion strategies, categorized datasets, and multimodal LLM and VLM approaches delivers a roadmap for object detection in autonomous vehicles.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper reviews the fundamental spectrum of AV sensors including camera, ultrasonic, LiDAR, and radar along with their fusion strategies and limitations in dynamic driving environments. It introduces a structured categorization of datasets into ego-vehicle, infrastructure-based, and cooperative types such as V2V, V2I, V2X, and I2I to support cross-analysis of data structures. The survey then examines detection methodologies ranging from 2D and 3D pipelines to hybrid fusion and transformer-driven systems powered by Vision Transformers, SLMs, VLMs, and LLMs, with emphasis on emerging generative AI paradigms. This synthesis bridges fragmented knowledge across multimodal perception and contextual reasoning to map current capabilities, open challenges, and future opportunities. A sympathetic reader would care because reliable object detection is central to safe autonomous transportation and the review highlights practical paths forward amid rapid AI advances.

Core claim

The survey claims to deliver a clear roadmap of current capabilities, open challenges, and future opportunities by doing three things: systematically reviewing the spectrum of AV sensors and their fusion strategies; introducing a structured categorization of ego-vehicle, infrastructure-based, and cooperative datasets; and analyzing detection methodologies from 2D/3D pipelines to hybrid sensor fusion, with particular attention to transformer-driven approaches powered by Vision Transformers, Large and Small Language Models, and VLMs.

What carries the argument

The structured categorization of AV datasets into ego-vehicle, infrastructure-based, and cooperative types combined with analysis of sensor fusion strategies and their integration into LLM and VLM-driven perception frameworks.
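
As a rough illustration of what such a categorization looks like as a data structure, the sketch below encodes the three viewpoint categories and the four cooperative link types named in the abstract. It is a minimal sketch under stated assumptions, not the paper's own schema: the class names `Viewpoint`, `Link`, and `DatasetEntry`, the helper `group_by_viewpoint`, and the two `placeholder-*` entries are hypothetical, and the category assigned to each named public dataset is this sketch's reading rather than a value taken from the paper's tables.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Viewpoint(Enum):
    EGO_VEHICLE = "ego-vehicle"
    INFRASTRUCTURE = "infrastructure-based"
    COOPERATIVE = "cooperative"


class Link(Enum):
    V2V = "vehicle-to-vehicle"
    V2I = "vehicle-to-infrastructure"
    V2X = "vehicle-to-everything"
    I2I = "infrastructure-to-infrastructure"


@dataclass(frozen=True)
class DatasetEntry:
    name: str
    viewpoint: Viewpoint
    link: Optional[Link] = None  # only meaningful for cooperative datasets

    def __post_init__(self):
        # A cooperative dataset must name its link type; the others must not.
        if (self.viewpoint is Viewpoint.COOPERATIVE) != (self.link is not None):
            raise ValueError(f"{self.name}: link type and viewpoint disagree")


# Illustrative catalog; the category assignments are assumptions of this sketch.
CATALOG = [
    DatasetEntry("KITTI", Viewpoint.EGO_VEHICLE),
    DatasetEntry("nuScenes", Viewpoint.EGO_VEHICLE),
    DatasetEntry("TUMTraf Intersection", Viewpoint.INFRASTRUCTURE),
    DatasetEntry("placeholder-v2v-set", Viewpoint.COOPERATIVE, Link.V2V),  # hypothetical
    DatasetEntry("placeholder-v2i-set", Viewpoint.COOPERATIVE, Link.V2I),  # hypothetical
]


def group_by_viewpoint(entries):
    """Return {Viewpoint: [dataset names]} preserving input order."""
    groups = {v: [] for v in Viewpoint}
    for entry in entries:
        groups[entry.viewpoint].append(entry.name)
    return groups
```

Calling `group_by_viewpoint(CATALOG)` yields one list of names per category, which is the kind of grouping a cross-analysis of data structures would start from.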

If this is right

  • Understanding sensor capabilities and limitations supports development of more effective fusion strategies for complex environments.
  • Dataset categorization enables better cross-analysis to improve training of robust detection models.
  • Focus on transformer-driven and VLM-powered methods points toward hybrid pipelines for next-generation perception.
  • Identification of open challenges in contextual reasoning guides research in cooperative intelligence and multimodal LLMs.
  • The overall synthesis provides direction for incorporating generative AI into reliable AV object detection systems.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The roadmap could be extended by adding quantitative performance comparisons across the reviewed fusion and VLM methods to aid practical selection.
  • Implications for real-time constraints and computational efficiency in vehicle hardware may need explicit mapping beyond the current analysis.
  • The cooperative dataset categories suggest potential for scaling to city-wide infrastructure networks, which could be tested in simulation.
  • Connections to broader robotics perception tasks indicate the framework might generalize beyond driving scenarios.

Load-bearing premise

The selected literature and categorization of datasets and methods comprehensively represent the fragmented state of multimodal perception without significant selection bias.

What would settle it

Discovery of a substantial body of recent work on AV object detection that uses an entirely different dataset categorization or centers on methods and challenges not addressed in the review would show the roadmap is incomplete.

Figures

Figures reproduced from arXiv: 2510.26641 by Abolfazl Razi, Hazim Alzorgan, Mahlagha Fazeli, Niloufar Mehrabi, Sayed Pedram Haeri Boroujeni.

Figure 1: Visualization of object detection across multiple sensor modalities in autonomous vehicles. The RGB image demonstrates …
Figure 2: The organization of this survey paper.
Figure 3: Overview of major sensors used in AVs based on their types and perception performance. Sensor performance is evaluated …
Figure 4: Overview of major AV datasets based on their specifications and applications.
Figure 5: A comprehensive taxonomy of object detection methods in AVs, categorized into four primary types. Each category …
Figure 6: Overall framework of the 2D camera-based approaches in the context of autonomous driving systems.
Figure 7: Overall framework of the Point-based approaches in 3D Lidar object detection.
Figure 8: Overall framework of the Range-based approaches in 3D Lidar object detection.
Figure 9: Overall framework of the Voxel-based approaches in 3D Lidar object detection.
Figure 10: Overall framework of the Pillar-based approaches in 3D Lidar object detection.
Figure 11: Overall framework of the Voxel-Point approaches in 3D Lidar object detection.
Figure 12: Overall framework of Early-Fusion 3D object detection in the context of autonomous driving systems.
Figure 13: Overall framework of Mid-Fusion 3D object detection in the context of autonomous driving systems.
Figure 14: Overall framework of Late-Fusion 3D object detection in the context of autonomous driving systems.
Figure 15: Performance comparison of the top three algorithms from each detection category (2D, 3D, and 2D–3D fusion object …
Figure 16: Performance comparison of the top three algorithms from each detection category (2D, 3D, and 2D–3D fusion object …
Figure 17: Overall framework of the VLMs in the context of autonomous driving systems.

Original abstract

Autonomous Vehicles (AVs) are transforming the future of transportation through advances in intelligent perception, decision-making, and control systems. However, their success is tied to one core capability, reliable object detection in complex and multimodal environments. While recent breakthroughs in Computer Vision (CV) and Artificial Intelligence (AI) have driven remarkable progress, the field still faces a critical challenge as knowledge remains fragmented across multimodal perception, contextual reasoning, and cooperative intelligence. This survey bridges that gap by delivering a forward-looking analysis of object detection in AVs, emphasizing emerging paradigms such as Vision-Language Models (VLMs), Large Language Models (LLMs), and Generative AI rather than re-examining outdated techniques. We begin by systematically reviewing the fundamental spectrum of AV sensors (camera, ultrasonic, LiDAR, and Radar) and their fusion strategies, highlighting not only their capabilities and limitations in dynamic driving environments but also their potential to integrate with recent advances in LLM/VLM-driven perception frameworks. Next, we introduce a structured categorization of AV datasets that moves beyond simple collections, positioning ego-vehicle, infrastructure-based, and cooperative datasets (e.g., V2V, V2I, V2X, I2I), followed by a cross-analysis of data structures and characteristics. Ultimately, we analyze cutting-edge detection methodologies, ranging from 2D and 3D pipelines to hybrid sensor fusion, with particular attention to emerging transformer-driven approaches powered by Vision Transformers (ViTs), Large and Small Language Models (SLMs), and VLMs. By synthesizing these perspectives, our survey delivers a clear roadmap of current capabilities, open challenges, and future opportunities.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. This manuscript is a survey on object detection for autonomous vehicles. It reviews sensor modalities (camera, ultrasonic, LiDAR, Radar) and fusion strategies, introduces a three-way categorization of datasets (ego-vehicle, infrastructure-based, cooperative/V2X), and examines detection pipelines from 2D/3D methods through transformer, ViT, SLM, VLM, and LLM approaches. The central claim is that this synthesis yields a clear roadmap of current capabilities, open challenges, and future opportunities in multimodal AV perception.

Significance. If the literature selection proves representative, the survey could usefully organize a fragmented field by linking classical sensor fusion to emerging multimodal LLMs/VLMs and cooperative perception. The structured dataset categorization and forward emphasis on next-gen paradigms are strengths that could guide researchers, though the absence of quantitative cross-comparisons limits immediate utility for capability assessment.

major comments (1)
  1. [Introduction] Introduction and abstract: The paper states it performs a 'systematic review' and delivers a 'clear roadmap' via synthesis of sensors, datasets, and methods. No section describes the literature search protocol (databases, keywords, date range, inclusion/exclusion criteria, or total papers screened). This is load-bearing for the central claim because the three-way dataset categorization and emphasis on VLMs/cooperative perception cannot be evaluated for selection bias or recency bias without such details.
minor comments (2)
  1. [Dataset section] The cross-analysis of dataset characteristics would be strengthened by a summary table comparing sample counts, sensor coverage, and annotation types across the ego/infrastructure/cooperative categories.
  2. [Fusion strategies] Notation for fusion strategies (e.g., early/late/hybrid) is used inconsistently between the sensor review and methodology sections; a single glossary or consistent abbreviations would improve readability.
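
One way to act on the second minor comment is a single glossary that maps every fusion label to one canonical stage. The sketch below is a minimal illustration under stated assumptions: the names `FusionStage`, `_ALIASES`, and `normalize_fusion_label` are hypothetical, and the alias table reflects common usage in the fusion literature (data-level, feature-level, decision-level), not the paper's own notation.

```python
from enum import Enum


class FusionStage(Enum):
    EARLY = "early"    # raw sensor data combined before feature extraction
    MID = "mid"        # intermediate features combined
    LATE = "late"      # per-sensor detections combined
    HYBRID = "hybrid"  # more than one stage used together


# Alias table: an assumption of this sketch, not the paper's notation.
_ALIASES = {
    "early": FusionStage.EARLY,
    "data-level": FusionStage.EARLY,
    "mid": FusionStage.MID,
    "middle": FusionStage.MID,
    "feature-level": FusionStage.MID,
    "late": FusionStage.LATE,
    "decision-level": FusionStage.LATE,
    "hybrid": FusionStage.HYBRID,
}


def normalize_fusion_label(label: str) -> FusionStage:
    """Map a free-form label such as 'Early-Fusion' to a canonical stage."""
    key = label.strip().lower().replace("_", "-").replace(" ", "-")
    if key.endswith("-fusion"):
        key = key[: -len("-fusion")]
    try:
        return _ALIASES[key]
    except KeyError:
        raise ValueError(f"unknown fusion label: {label!r}") from None
```

For example, `normalize_fusion_label("Early-Fusion")` and `normalize_fusion_label("data level")` both resolve to `FusionStage.EARLY`, so the sensor review and the methodology sections could share one vocabulary.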

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback on our survey manuscript. We address the major comment on the literature search protocol with a planned revision that makes our selection method transparent.

Point-by-point responses
  1. Referee: [Introduction] Introduction and abstract: The paper states it performs a 'systematic review' and delivers a 'clear roadmap' via synthesis of sensors, datasets, and methods. No section describes the literature search protocol (databases, keywords, date range, inclusion/exclusion criteria, or total papers screened). This is load-bearing for the central claim because the three-way dataset categorization and emphasis on VLMs/cooperative perception cannot be evaluated for selection bias or recency bias without such details.

    Authors: We acknowledge that the manuscript does not include an explicit description of the literature search protocol, which limits the ability to assess selection or recency bias. Our survey is structured as a narrative synthesis emphasizing recent advances in multimodal fusion and VLM/LLM approaches rather than a formal PRISMA-style systematic review. To address this directly, we will add a dedicated subsection in the revised Introduction titled 'Literature Selection and Review Methodology.' This subsection will specify the primary databases (Google Scholar, arXiv, IEEE Xplore, and proceedings from CVPR/ECCV/ICCV), search keywords (e.g., 'object detection autonomous vehicles', 'multimodal sensor fusion', 'vision language models AV', 'cooperative perception V2X'), date range (2015-2024 with emphasis on 2020 onward for emerging paradigms), inclusion criteria (peer-reviewed papers or impactful preprints on sensor fusion, datasets, or LLM/VLM detection pipelines), and exclusion criteria (purely theoretical works without AV application or non-English sources). We believe this addition will allow readers to better evaluate the representativeness of the three-way dataset categorization and the roadmap presented.

    revision: yes
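
The protocol described in this response can be written down as an explicit screening rule, which is what makes selection bias checkable. The sketch below is a minimal illustration under stated assumptions: `Paper`, `Protocol`, and `screen` are hypothetical names, the `impactful_preprint` flag stands in for a human judgment the response does not define, and the keywords are stored only for reference because they drive the search, not the screening.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Paper:
    title: str
    year: int
    language: str             # e.g. "en"
    peer_reviewed: bool
    impactful_preprint: bool  # reviewer's judgment call, not computed here
    has_av_application: bool


@dataclass(frozen=True)
class Protocol:
    first_year: int = 2015
    last_year: int = 2024
    keywords: tuple = (
        "object detection autonomous vehicles",
        "multimodal sensor fusion",
        "vision language models AV",
        "cooperative perception V2X",
    )


def screen(paper, protocol=Protocol()):
    """Return (included, reason) for one candidate paper."""
    if not protocol.first_year <= paper.year <= protocol.last_year:
        return False, "outside date range"
    if paper.language != "en":
        return False, "non-English source"
    if not paper.has_av_application:
        return False, "no AV application"
    if not (paper.peer_reviewed or paper.impactful_preprint):
        return False, "neither peer-reviewed nor impactful preprint"
    return True, "included"
```

Recording the returned reason for every rejected candidate would give the count of papers screened and excluded that the referee asks for.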

Circularity Check

0 steps flagged

No significant circularity in survey synthesis

This paper is a literature survey synthesizing external work on AV object detection sensors, fusion strategies, ego/infrastructure/cooperative datasets, and transformer/VLM/LLM methods. Its central claim of delivering a 'clear roadmap of current capabilities, open challenges, and future opportunities' rests on cited independent sources rather than any internal equations, fitted parameters, or self-referential definitions that reduce to the paper's own inputs by construction. No mathematical derivations, self-citation load-bearing steps, or predictions equivalent to inputs appear in the abstract or described structure. The work is self-contained against external benchmarks via its references.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The survey rests on the domain assumption that AV perception knowledge is fragmented across modalities and that a structured review of sensors, datasets, and LLM/VLM methods can bridge it. No free parameters or invented entities are introduced.

axioms (1)
  • domain assumption: Knowledge in multimodal perception, contextual reasoning, and cooperative intelligence for AV object detection remains fragmented.
    Explicitly stated in the abstract as the motivation and gap the survey addresses.

pith-pipeline@v0.9.0 · 5867 in / 1093 out tokens · 37825 ms · 2026-05-18T02:55:07.663830+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel (tag: unclear)

    Relation between the paper passage and the cited Recognition theorem.

    We begin by systematically reviewing the fundamental spectrum of AV sensors (camera, ultrasonic, LiDAR, and Radar) and their fusion strategies... Next, we introduce a structured categorization of AV datasets... Ultimately, we analyze cutting-edge detection methodologies, ranging from 2D and 3D pipelines to hybrid sensor fusion, with particular attention to emerging transformer-driven approaches powered by Vision Transformers (ViTs), Large and Small Language Models (SLMs), and VLMs.

  • IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction (tag: unclear)

    Relation between the paper passage and the cited Recognition theorem.

    By synthesizing these perspectives, our survey delivers a clear roadmap of current capabilities, open challenges, and future opportunities.

What do these tags mean?

  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

300 extracted references · 300 canonical work pages · 1 internal anchor

  1. [1]

    B. B. ELallid, N. Benamar, M. Bagaa, N. Mrani, Secure and efficient vehicle control of autonomous vehicles using federated deep reinforcement learning, Applied Soft Computing (2025) 113924

  2. [2]

    Khosravian, M

    A. Khosravian, M. Masih-Tehrani, A. Amirkhani, S. Ebrahimi-Nejad, Robust autonomous vehicle control by leveraging multi-stage mpc and quantized cnn in hil framework, Applied Soft Computing 162 (2024) 111802

  3. [3]

    Jiang, K

    H. Jiang, K. Xia, Y. Zhao, Z. Yao, Y. Jiang, Z. He, Environmental impacts and emission reduction methods of vehicles equipped with driving automation systems: An operational-level review, Transportation Research Part C: Emerging Technologies 173 (2025) 104996

  4. [4]

    Grosse, A

    K. Grosse, A. Alahi, A qualitative ai security risk assessment of autonomous vehicles, Transportation Research Part C: Emerging Technologies 169 (2024) 104797

  5. [5]

    K. Wang, C. Shen, X. Li, J. Lu, Uncertainty quantification for safe and reliable autonomous vehicles: A review of methods and applications, IEEE Transactions on Intelligent Transportation Systems (2025)

  6. [6]

    X. Chen, X. Wang, W. Zhao, C. Wang, S. Cheng, Z. Luan, Hierarchical deep reinforcement learning based multi- agent game control for energy consumption and traffic efficiency improving of autonomous vehicles, Energy 323 (2025) 135669

  7. [7]

    L. Zha, C. Gong, K. Lv, Real-time localization and navigation method for autonomous vehicles based on multi- modal data fusion by integrating memory transformer and ddqn, Image and Vision Computing 156 (2025) 105484. 55

  8. [8]

    W. Sun, H. Shao, J. Li, T. Wu, E. Z. Fainman, Multi-type traffic sensor location problem for origin–destination estimation considering spatiotemporal correlation and sensor failure, Transportation Research Part C: Emerging Technologies 179 (2025) 105288

  9. [9]

    X. Chen, S. P. H. Boroujeni, X. Shu, H. Li, A. Razi, Enhancing graph neural networks in large-scale traffic incident analysis with concurrency hypothesis, in: Proceedings of the 32nd ACM International Conference on Advances in Geographic Information Systems, 2024, pp. 196–207

  10. [10]

    Chang, W

    Y. Chang, W. Xiao, B. Coifman, Using spatiotemporal stacks for precise vehicle tracking from roadside 3d lidar data, Transportation research part C: emerging technologies 154 (2023) 104280

  11. [11]

    A. D. Beza, Z. Xie, M. Ramezani, D. Levinson, From lane-less to lane-free: Implications in the era of automated vehicles, Transportation Research Part C: Emerging Technologies 170 (2025) 104898

  12. [12]

    K. Wang, J. Guo, K. Chen, J. Lu, An in-depth examination of slam methods: Challenges, advancements, and applications in complex scenes for autonomous driving, IEEE Transactions on Intelligent Transportation Systems (2025)

  13. [13]

    Y. Zha, W. Shangguan, J. Chen, L. Chai, W. Qiu, A. M. L´ opez, Heterogeneous multiscale cooperative perception for connected autonomous vehicles via v2x interaction, IEEE Internet of Things Journal (2025)

  14. [14]

    Salari, L

    M. Salari, L. Kattan, M. Gentili, Optimal roadside units location for path flow reconstruction in a connected vehicle environment, Transportation Research Part C: Emerging Technologies 138 (2022) 103625

  15. [15]

    Praveen, S

    R. Praveen, S. Hundekari, P. Parida, T. Mittal, A. Sehgal, M. Bhavana, Autonomous vehicle navigation systems: Machine learning for real-time traffic prediction, in: 2025 International Conference on Computational, Communi- cation and Information Technology (ICCCIT), IEEE, 2025, pp. 809–813

  16. [16]

    Mohammadi, R

    A. Mohammadi, R. Ahmari, V. Hemmati, F. Owusu-Ambrose, M. N. Mahmoud, P. Kebria, A. Homaifar, Detection of multiple small biased gps spoofing attacks on autonomous vehicles using time series analysis, IEEE Open Journal of Vehicular Technology (2025)

  17. [17]

    S. D. RS, S. D. Varshni, Embedded large language models for enhanced human-machine interface in autonomous vehicles, in: 2025 International Conference on Multi-Agent Systems for Collaborative Intelligence (ICMSCI), IEEE, 2025, pp. 1143–1150

  18. [18]

    Kumar, P

    H. Kumar, P. Mamoria, D. K. Dewangan, Improving faster r-cnn for vehicle detection under varying conditions with domain adaptation technique, in: 2025 Fourth International Conference on Power, Control and Computing Technologies (ICPC2T), IEEE, 2025, pp. 1–6

  19. [19]

    Shrivastava, V

    A. Shrivastava, V. Kansal, A. Nagpal, K. K. Dixit, K. V. Rajkumar, et al., Ai-powered object detection for autonomous vehicles: A comparative study of machine learning models, in: 2025 International Conference on Computational, Communication and Information Technology (ICCCIT), IEEE, 2025, pp. 612–617

  20. [20]

    Subhedar, M

    J. Subhedar, M. R. Bachute, Insights of semantic segmentation using the deeplab architecture for autonomous driving, MethodsX (2025) 103387

  21. [21]

    S. Chen, X. Li, K. Wang, J. Sun, B. Yang, Ranging research on telematics based on mask r-cnn dual eye stereo vision ranging algorithm, in: The International Conference Optoelectronic Information and Optical Engineering (OIOE2024), Vol. 13513, SPIE, 2025, pp. 884–889

  22. [22]

    S. P. H. Boroujeni, N. Mehrabi, F. Afghah, C. P. McGrath, D. Bhatkar, M. A. Biradar, A. Razi, Toward ai- driven fire imagery: Attributes, challenges, comparisons, and the promise of vlms and llms, Machine Learning with Applications (2025) 100763

  23. [23]

    Y. Tian, F. Lin, Y. Li, T. Zhang, Q. Zhang, X. Fu, J. Huang, X. Dai, Y. Wang, C. Tian, et al., Uavs meet llms: Overviews and perspectives towards agentic low-altitude mobility, Information Fusion 122 (2025) 103158

  24. [24]

    Z. Guo, Z. Yagudin, A. Lykov, M. Konenkov, D. Tsetserukou, Vlm-auto: Vlm-based autonomous driving assistant with human-like behavior and understanding for complex road scenes, in: 2024 2nd International Conference on Foundation and Large Language Models (FLLM), IEEE, 2024, pp. 501–507

  25. [25]

    Y. Wang, S. Wang, Y. Li, M. Liu, Developments in 3d object detection for autonomous driving: A review, IEEE Sensors Journal (2025)

  26. [26]

    H. Wang, J. Liu, H. Dong, Z. Shao, A survey of the multi-sensor fusion object detection task in autonomous driving, Sensors 25 (9) (2025) 2794

  27. [27]

    H. Wang, X. Chen, Q. Yuan, P. Liu, A review of 3d object detection based on autonomous driving, The Visual Computer 41 (3) (2025) 1757–1775

  28. [28]

    Z. Song, L. Liu, F. Jia, Y. Luo, C. Jia, G. Zhang, L. Yang, L. Wang, Robustness-aware 3d object detection in autonomous driving: A review and outlook, IEEE Transactions on Intelligent Transportation Systems (2024)

  29. [29]

    S. Y. Alaba, A. C. Gurbuz, J. E. Ball, Emerging trends in autonomous vehicle perception: Multimodal fusion for 3d object detection, World Electric Vehicle Journal 15 (1) (2024) 20

  30. [30]

    Z. Zou, K. Chen, Z. Shi, Y. Guo, J. Ye, Object detection in 20 years: A survey, Proceedings of the IEEE 111 (3) (2023) 257–276

  31. [31]

    X. Ma, W. Ouyang, A. Simonelli, E. Ricci, 3d object detection from images for autonomous driving: a survey, IEEE Transactions on Pattern Analysis and Machine Intelligence 46 (5) (2023) 3537–3556

  32. [32]

    R. Qian, X. Lai, X. Li, 3d object detection for autonomous driving: A survey, Pattern Recognition 130 (2022) 108796

  33. [33]

    Y. Cui, R. Chen, W. Chu, L. Chen, D. Tian, Y. Li, D. Cao, Deep learning for image and point cloud fusion in autonomous driving: A review, IEEE Transactions on Intelligent Transportation Systems 23 (2) (2021) 722–739

  34. [34]

    D. Feng, C. Haase-Sch¨ utz, L. Rosenbaum, H. Hertlein, C. Glaeser, F. Timm, W. Wiesbeck, K. Dietmayer, Deep 56 multi-modal object detection and semantic segmentation for autonomous driving: Datasets, methods, and chal- lenges, IEEE Transactions on Intelligent Transportation Systems 22 (3) (2020) 1341–1360

  35. [35]

    J. Guo, U. Kurup, M. Shah, Is it safe to drive? an overview of factors, metrics, and datasets for driveability assessment in autonomous driving, IEEE Transactions on Intelligent Transportation Systems 21 (8) (2019) 3135– 3151

  36. [36]

    L. Hu, J. Zhang, J. Zhang, S. Cheng, Y. Wang, W. Zhang, N. Yu, Security analysis and adaptive false data injection against multi-sensor fusion localization for autonomous driving, Information Fusion 117 (2025) 102822

  37. [37]

    S. P. H. Boroujeni, A. Razi, S. Khoshdel, F. Afghah, J. L. Coen, L. O’Neill, P. Fule, A. Watts, N.-M. T. Kokolakis, K. G. Vamvoudakis, A comprehensive survey of research towards ai-enabled unmanned aerial systems in pre-, active-, and post-wildfire management, Information Fusion 108 (2024) 102369

  38. [38]

    H. Du, L. Ren, Y. Wang, X. Cao, C. Sun, Advancements in perception system with multi-sensor fusion for embodied agents, Information Fusion 117 (2025) 102859

  39. [39]

    Wu, Fusion-based modeling of an intelligent algorithm for enhanced object detection using a deep learning approach on radar and camera data, Information Fusion 113 (2025) 102647

    Y. Wu, Fusion-based modeling of an intelligent algorithm for enhanced object detection using a deep learning approach on radar and camera data, Information Fusion 113 (2025) 102647

  40. [40]

    Y. Wu, J. Liu, M. Gong, Q. Miao, W. Ma, C. Xu, Joint semantic segmentation using representations of lidar point clouds and camera images, Information Fusion 108 (2024) 102370

  41. [41]

    S. Li, X. Li, H. Wang, Y. Zhou, Z. Shen, Multi-gnss ppp/ins/vision/lidar tightly integrated system for precise navigation in urban environments, Information Fusion 90 (2023) 218–232

  42. [42]

    Mehrabi, S

    N. Mehrabi, S. P. H. Boroujeni, Age estimation based on facial images using hybrid features and particle swarm optimization, in: 2021 11th International Conference on Computer Engineering and Knowledge (ICCKE), IEEE, 2021, pp. 412–418

  43. [43]

    Sarlak, H

    A. Sarlak, H. Alzorgan, S. P. H. Boroujeni, A. Razi, R. Amin, Enhanced cooperative perception for autonomous vehicles using imperfect communication, in: 2024 20th International Conference on Distributed Computing in Smart Systems and the Internet of Things (DCOSS-IoT), IEEE, 2024, pp. 700–707

  44. [44]

    D. Kent, M. Alyaqoub, X. Lu, H. Khatounabadi, K. Sung, C. Scheller, A. Dalat, A. bin Thabit, R. Whitley, H. Radha, Msu-4s-the michigan state university four seasons dataset, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 22658–22667

  45. [45]

    Zheng, L

    L. Zheng, L. Yang, Q. Lin, W. Ai, M. Liu, S. Lu, J. Liu, H. Ren, J. Mo, X. Bai, et al., Omnihd-scenes: A next-generation multimodal dataset for autonomous driving, arXiv preprint arXiv:2412.10734 (2024)

  46. [46]

    Alibeigi, W

    M. Alibeigi, W. Ljungbergh, A. Tonderski, G. Hess, A. Lilja, C. Lindstr¨ om, D. Motorniuk, J. Fu, J. Widahl, C. Petersson, Zenseact open dataset: A large-scale and diverse multimodal dataset for autonomous driving, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 20178–20188

  47. [47]

    C. A. Diaz-Ruiz, Y. Xia, Y. You, J. Nino, J. Chen, J. Monica, X. Chen, K. Luo, Y. Wang, M. Emond, et al., Ithaca365: Dataset and driving perception under repeated and challenging weather conditions, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 21383–21392

  48. [48]

    J. Mao, M. Niu, C. Jiang, H. Liang, J. Chen, X. Liang, Y. Li, C. Ye, W. Zhang, Z. Li, et al., One million scenes for autonomous driving: Once dataset, arXiv preprint arXiv:2106.11037 (2021)

  49. [49]

    D´ eziel, P

    J.-L. D´ eziel, P. Merriaux, F. Tremblay, D. Lessard, D. Plourde, J. Stanguennec, P. Goulet, P. Olivier, Pixset: An opportunity for 3d computer vision to go beyond point clouds with a full-waveform lidar dataset, in: 2021 ieee international intelligent transportation systems conference (itsc), IEEE, 2021, pp. 2987–2993

  50. [50]

    P. Xiao, Z. Shao, S. Hao, Z. Zhang, X. Chai, J. Jiao, Z. Li, J. Wu, K. Sun, K. Jiang, et al., Pandaset: Ad- vanced sensor suite dataset for autonomous driving, in: 2021 IEEE international intelligent transportation systems conference (ITSC), IEEE, 2021, pp. 3095–3101

  51. [51]

    Geyer, Y

    J. Geyer, Y. Kassahun, M. Mahmudi, X. Ricou, R. Durgesh, A. S. Chung, L. Hauswald, V. H. Pham, M. M¨ uhlegg, S. Dorn, et al., A2d2: Audi autonomous driving dataset, arXiv preprint arXiv:2004.06320 (2020)

  52. [52]

    URLhttps://public.roboflow.com/object-detection/self-driving-car

    Roboflow, Self-driving car dataset, accessed: 2025-02-28 (2025). URLhttps://public.roboflow.com/object-detection/self-driving-car

  53. [53]

    P. Sun, H. Kretzschmar, X. Dotiwalla, A. Chouard, V. Patnaik, P. Tsui, J. Guo, Y. Zhou, Y. Chai, B. Caine, et al., Scalability in perception for autonomous driving: Waymo open dataset, in: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2020, pp. 2446–2454

  54. [54]

    Q.-H. Pham, P. Sevestre, R. S. Pahwa, H. Zhan, C. H. Pang, Y. Chen, A. Mustafa, V. Chandrasekhar, J. Lin, A* 3d dataset: Towards autonomous driving in challenging environments, in: 2020 IEEE International Conference on Robotics and Automation (ICRA), IEEE, 2020, pp. 2267–2273

  55. [55]

    J. Bock, R. Krajewski, T. Moers, S. Runde, L. Vater, L. Eckstein, The ind dataset: A drone dataset of naturalistic road user trajectories at german intersections, in: 2020 IEEE Intelligent Vehicles Symposium (IV), 2020, pp. 1929–1934.doi:10.1109/IV47402.2020.9304839

  56. [56]

    Moers, L

    T. Moers, L. Vater, R. Krajewski, J. Bock, A. Zlocki, L. Eckstein, The exid dataset: A real-world trajectory dataset of highly interactive highway scenarios in germany, in: 2022 IEEE Intelligent Vehicles Symposium (IV), 2022, pp. 958–964.doi:10.1109/IV51971.2022.9827305

  57. [57]

    Caesar, V

    H. Caesar, V. Bankiti, A. H. Lang, S. Vora, V. E. Liong, Q. Xu, A. Krishnan, Y. Pan, G. Baldan, O. Beijbom, nuscenes: A multimodal dataset for autonomous driving, in: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2020, pp. 11621–11631

  58. [58]

    Chang, J

    M.-F. Chang, J. Lambert, P. Sangkloy, J. Singh, S. Bak, A. Hartnett, D. Wang, P. Carr, S. Lucey, D. Ramanan, et al., Argoverse: 3d tracking and forecasting with rich maps, in: Proceedings of the IEEE/CVF conference on 57 computer vision and pattern recognition, 2019, pp. 8748–8757

  59. [59]

    J. Xue, J. Fang, T. Li, B. Zhang, P. Zhang, Z. Ye, J. Dou, Blvd: Building a large-scale 5d semantics benchmark for autonomous driving, in: 2019 International Conference on Robotics and Automation (ICRA), IEEE, 2019, pp. 6685–6691

  60. [60]

    H. Schafer, E. Santana, A. Haden, R. Biasini, A commute in data: The comma2k19 dataset, arXiv preprint arXiv:1812.05752 (2018)

  61. [61]

    F. Yu, W. Xian, Y. Chen, F. Liu, M. Liao, V. Madhavan, T. Darrell, et al., Bdd100k: A diverse driving video database with scalable annotation tooling, arXiv preprint arXiv:1805.04687 2 (5) (2018) 6

  62. [62]

    X. Huang, X. Cheng, Q. Geng, B. Cao, D. Zhou, P. Wang, Y. Lin, R. Yang, The apolloscape dataset for autonomous driving, in: Proceedings of the IEEE conference on computer vision and pattern recognition workshops, 2018, pp. 954–960

  63. [63]

    M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, B. Schiele, The cityscapes dataset for semantic urban scene understanding, in: Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 3213–3223

  64. [64]

    D. Barnes, M. Gadd, P. Murcutt, P. Newman, I. Posner, The oxford radar robotcar dataset: A radar extension to the oxford robotcar dataset, in: 2020 IEEE international conference on robotics and automation (ICRA), IEEE, 2020, pp. 6433–6438

  65. [65]

    A. Geiger, P. Lenz, R. Urtasun, Are we ready for autonomous driving? the kitti vision benchmark suite, in: 2012 IEEE conference on computer vision and pattern recognition, IEEE, 2012, pp. 3354–3361

  66. [66]

    X. Zhu, H. Sheng, S. Cai, B. Deng, S. Yang, Q. Liang, K. Chen, L. Gao, J. Song, J. Ye, Roscenes: A large-scale multi-view 3d dataset for roadside perception, in: European Conference on Computer Vision, Springer, 2024, pp. 331–347

  67. [67]

    W. Zimmer, C. Creß, H. T. Nguyen, A. C. Knoll, Tumtraf intersection dataset: All you need for urban 3d camera-lidar roadside perception, in: 2023 IEEE 26th International Conference on Intelligent Transportation Systems (ITSC), IEEE, 2023, pp. 1030–1037

  68. [68]

    C. Creß, W. Zimmer, L. Strand, M. Fortkord, S. Dai, V. Lakshminarasimhan, A. Knoll, A9-dataset: Multi-sensor infrastructure-based dataset for mobility research, in: 2022 IEEE Intelligent Vehicles Symposium (IV), IEEE, 2022, pp. 965–970

  69. [69]

    H. Wang, X. Zhang, Z. Li, J. Li, K. Wang, Z. Lei, R. Haibing, Ips300+: a challenging multi-modal data sets for intersection perception system, in: 2022 International Conference on Robotics and Automation (ICRA), IEEE, 2022, pp. 2539–2545

  70. [70]

    X. Ye, M. Shu, H. Li, Y. Shi, Y. Li, G. Wang, X. Tan, E. Ding, Rope3d: The roadside perception dataset for autonomous driving and monocular 3d object detection task, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 21341–21350

  71. [71]

    S. Busch, C. Koetsier, J. Axmann, C. Brenner, Lumpi: The leibniz university multi-perspective intersection dataset, in: 2022 IEEE Intelligent Vehicles Symposium (IV), IEEE, 2022, pp. 1127–1134

  72. [72]

    M. Howe, I. Reid, J. Mackenzie, Weakly supervised training of monocular 3d object detectors using wide baseline multi-view traffic camera data, arXiv preprint arXiv:2110.10966 (2021)

  73. [73]

    W. Zhan, L. Sun, D. Wang, H. Shi, A. Clausse, M. Naumann, J. Kummerle, H. Konigshof, C. Stiller, A. de La Fortelle, et al., Interaction dataset: An international, adversarial and cooperative motion dataset in interactive driving scenarios with semantic maps, arXiv preprint arXiv:1910.03088 (2019)

  74. [74]

    Z. Tang, M. Naphade, M.-Y. Liu, X. Yang, S. Birchfield, S. Wang, R. Kumar, D. Anastasiu, J.-N. Hwang, Cityflow: A city-scale benchmark for multi-target multi-camera vehicle tracking and re-identification, in: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2019, pp. 8797–8806

  75. [75]

    A. Ishaq, J. Lahoud, K. More, O. Thawakar, R. Thawkar, D. Dissanayake, N. Ahsan, Y. Li, F. S. Khan, H. Cholakkal, et al., Drivelmm-o1: A step-by-step reasoning dataset and large multimodal model for driving scenario understanding, arXiv preprint arXiv:2503.10621 (2025)

  76. [76]

    K. Chen, Y. Li, W. Zhang, Y. Liu, P. Li, R. Gao, L. Hong, M. Tian, X. Zhao, Z. Li, et al., Automated evaluation of large vision-language models on self-driving corner cases, in: 2025 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), IEEE, 2025, pp. 7817–7826

  77. [77]

    H.-k. Chiu, R. Hachiuma, C.-Y. Wang, S. F. Smith, Y.-C. F. Wang, M.-H. Chen, V2v-llm: Vehicle-to-vehicle cooperative autonomous driving with multi-modal large language models, arXiv preprint arXiv:2502.09980 (2025)

  78. [78]

    C. Sima, K. Renz, K. Chitta, L. Chen, H. Zhang, C. Xie, J. Beißwenger, P. Luo, A. Geiger, H. Li, Drivelm: Driving with graph visual question answering, in: European Conference on Computer Vision, Springer, 2024, pp. 256–274

  79. [79]

    Y. Inoue, Y. Yada, K. Tanahashi, Y. Yamaguchi, Nuscenes-mqa: Integrated evaluation of captions and qa for autonomous driving datasets using markup annotations, in: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2024, pp. 930–938

  80. [80]

    S. Wang, Z. Yu, X. Jiang, S. Lan, M. Shi, N. Chang, J. Kautz, Y. Li, J. M. Alvarez, Omnidrive: A holistic vision-language dataset for autonomous driving with counterfactual reasoning, arXiv preprint arXiv:2504.04348 (2025)
