
arxiv: 2511.08439 · v2 · submitted 2025-11-11 · 💻 cs.AI

Dataset Safety in Autonomous Driving: Requirements, Risks, and Assurance

Pith reviewed 2026-05-17 23:29 UTC · model grok-4.3

classification 💻 cs.AI
keywords dataset safety · autonomous driving · AI perception · ISO/PAS 8800 · data lifecycle · safety analysis · verification and validation · hazard identification

The pith

The paper proposes a framework for developing safe autonomous-driving datasets by aligning with ISO/PAS 8800 and managing the full data lifecycle.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out a structured approach to building datasets that support safe AI systems in autonomous vehicles. It centers on perception systems and introduces the AI Data Flywheel as a way to handle the complete dataset process from initial collection through annotation, curation, and ongoing maintenance. A reader would care because flawed data directly creates hazards that can compromise vehicle safety and reliability. The work adds safety analyses to find risks from insufficient data, defines how to set dataset safety requirements, and outlines verification steps to meet standards. It also surveys recent research to highlight current issues and possible next steps in the field.

Core claim

The paper claims that a structured framework aligned with ISO/PAS 8800 guidelines can produce safe datasets for AI-based perception in autonomous driving. It does so in four ways: it introduces the AI Data Flywheel and a dataset lifecycle covering collection, annotation, curation, and maintenance; it adds rigorous safety analyses to identify hazards arising from dataset insufficiencies; it defines processes for establishing dataset safety requirements; and it proposes verification and validation strategies to ensure compliance.

What carries the argument

The AI Data Flywheel, a cyclic process that drives continuous data collection, annotation, curation, and maintenance to reduce risks from dataset problems.
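
One way to picture the cycle is as a loop in which each pass runs the four lifecycle stages and hands the maintained dataset to the next pass. The sketch below is illustrative only: the `Dataset` container, the stage functions, and the toy labeling and filtering rules are all assumptions made here, not the paper's implementation.

```python
from dataclasses import dataclass, field


@dataclass
class Dataset:
    """Hypothetical container for one flywheel-managed dataset."""
    raw: list = field(default_factory=list)      # collected, not yet labelled
    records: list = field(default_factory=list)  # curated (sample, label) pairs
    history: list = field(default_factory=list)  # one log entry per cycle


def collect(ds, new_samples):
    # Collection: add newly recorded samples to the raw pool.
    ds.raw.extend(new_samples)


def annotate(ds, labeler):
    # Annotation: label every raw sample, then empty the raw pool.
    labelled = [(sample, labeler(sample)) for sample in ds.raw]
    ds.raw.clear()
    return labelled


def curate(labelled):
    # Curation: drop samples without a usable label (toy quality rule).
    return [(sample, label) for sample, label in labelled if label is not None]


def maintain(ds, curated):
    # Maintenance: merge curated records and log the cycle.
    ds.records.extend(curated)
    ds.history.append({"added": len(curated), "total": len(ds.records)})


def flywheel_cycle(ds, new_samples, labeler):
    """Run one pass: collect -> annotate -> curate -> maintain."""
    collect(ds, new_samples)
    maintain(ds, curate(annotate(ds, labeler)))
    return ds


# Two passes with a toy labeler that cannot label blurred frames.
toy_labeler = lambda s: "pedestrian" if s.startswith("ped") else None
ds = Dataset()
flywheel_cycle(ds, ["ped_001", "blur_002"], toy_labeler)
flywheel_cycle(ds, ["ped_003"], toy_labeler)
print(ds.history)
```

The `history` log is the part that matters for assurance: each cycle leaves a record of what was added, which is the kind of artifact a later audit could inspect.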

If this is right

  • Hazards from dataset insufficiencies can be spotted early through the included safety analyses.
  • Clear dataset safety requirements can be set to match ISO/PAS 8800 guidelines.
  • Verification and validation steps can confirm that datasets meet safety standards before use.
  • Insights from reviewed research can point to practical challenges in maintaining safe data over time.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The framework could be adapted to improve data quality in other AI applications that rely on perception, such as robotics or medical imaging.
  • Widespread adoption might encourage shared industry standards for dataset audits in autonomous systems.
  • Testing the lifecycle stages on real driving data collections would show whether the proposed steps actually prevent specific failure modes.

Load-bearing premise

The safety analyses, dataset requirements, and verification strategies will successfully reduce risks from poor data and achieve standard compliance even though no empirical tests or case studies are shown.

What would settle it

An autonomous vehicle system built with datasets following the framework that still suffers a safety incident caused by dataset insufficiencies such as missing edge cases or annotation errors.

Figures

Figures reproduced from arXiv: 2511.08439 by Alireza Abbaspour, B Ravi Kiran, Russel Mohr, Senthil Yogamani, Tejaskumar Balgonda Patil.

Figure 1: Components of a typical Autonomous Driving Pipeline.
Figure 2: Data flywheel from collection, Data quality and diversification, model training, automated labeling model based
Figure 3: Automated data or file selection pipeline with various configurations to retrieve files that satisfy requirements, metadata
Figure 4: Automated annotation quality check model is a semantic segmentation pipeline based on SAM and OpenClip.
Figure 5: Dataset Lifecycle recommended by ISO/PAS 8800 [
Original abstract

Dataset integrity is fundamental to the safety and reliability of AI systems, especially in autonomous driving. This paper presents a structured framework for developing safe datasets aligned with ISO/PAS 8800 guidelines. Using AI-based perception systems as the primary use case, it introduces the AI Data Flywheel and the dataset lifecycle, covering data collection, annotation, curation, and maintenance. The framework incorporates rigorous safety analyses to identify hazards and mitigate risks caused by dataset insufficiencies. It also defines processes for establishing dataset safety requirements and proposes verification and validation strategies to ensure compliance with safety standards. In addition to outlining best practices, the paper reviews recent research and emerging trends in dataset safety and autonomous vehicle development, providing insights into current challenges and future directions. By integrating these perspectives, the paper aims to advance robust, safety-assured AI systems for autonomous driving applications.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 3 minor

Summary. The paper presents a structured framework for developing safe datasets in autonomous driving, focused on AI-based perception systems and aligned with ISO/PAS 8800. It introduces the AI Data Flywheel concept and outlines a dataset lifecycle covering collection, annotation, curation, and maintenance. The framework includes safety analyses to identify hazards from dataset insufficiencies, processes for defining dataset safety requirements, and verification/validation strategies, accompanied by a review of recent research and trends in dataset safety.

Significance. If the framework holds as a coherent and practical proposal, it offers a useful synthesis of best practices for dataset safety assurance in safety-critical AI applications. By explicitly linking data lifecycle stages to ISO/PAS 8800 compliance and hazard mitigation, the work could help standardize approaches in autonomous vehicle development where dataset quality directly affects perception reliability. The literature review component adds value by contextualizing emerging challenges.

major comments (2)
  1. [Safety Analyses and Framework Overview] Abstract and Section on Safety Analyses: the assertion that the framework 'incorporates rigorous safety analyses to identify hazards and mitigate risks' is load-bearing for the assurance claims in the title, yet the provided descriptions remain at the level of process outlines without specifying concrete hazard identification techniques, risk metrics, or failure mode examples tied to perception datasets.
  2. [Verification and Validation Strategies] Section on Verification and Validation Strategies: the proposed V&V strategies reference external ISO/PAS 8800 guidelines as grounding but do not demonstrate how the AI Data Flywheel outputs feed into specific compliance checks or traceability requirements, leaving the central claim of ensured compliance without an internal mechanism for evaluation.
minor comments (3)
  1. [AI Data Flywheel] The AI Data Flywheel is introduced as a novel construct but its operational definition (e.g., feedback loops between stages) could be clarified with a diagram or pseudocode to distinguish it from standard iterative data pipelines.
  2. [Related Work and Trends] Literature review section would benefit from explicit mapping of cited works to specific framework components (e.g., which papers address annotation risks) to strengthen traceability.
  3. [Dataset Safety Requirements] Notation for dataset safety requirements could be made more consistent; terms like 'insufficiencies' and 'hazards' are used interchangeably in places without a glossary.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and the recommendation for minor revision. The comments identify opportunities to strengthen the concreteness of the safety analyses and the traceability of compliance claims. We address each point below and will incorporate targeted revisions to improve clarity while preserving the framework's conceptual scope.

Point-by-point responses
  1. Referee: [Safety Analyses and Framework Overview] Abstract and Section on Safety Analyses: the assertion that the framework 'incorporates rigorous safety analyses to identify hazards and mitigate risks' is load-bearing for the assurance claims in the title, yet the provided descriptions remain at the level of process outlines without specifying concrete hazard identification techniques, risk metrics, or failure mode examples tied to perception datasets.

    Authors: We agree that the current descriptions emphasize process structure over concrete instantiation. The manuscript positions the safety analyses as an integrated component of the dataset lifecycle aligned with ISO/PAS 8800, drawing on the literature review for context. To address the concern, we will expand the relevant section with explicit examples: hazard identification via adapted FMEA for dataset issues, illustrative risk metrics such as coverage ratios for edge cases, and perception-specific failure modes (e.g., annotation errors in low-light conditions or distributional shifts in sensor data). These additions will reference the reviewed research trends without altering the high-level framework.

    revision: yes

  2. Referee: [Verification and Validation Strategies] Section on Verification and Validation Strategies: the proposed V&V strategies reference external ISO/PAS 8800 guidelines as grounding but do not demonstrate how the AI Data Flywheel outputs feed into specific compliance checks or traceability requirements, leaving the central claim of ensured compliance without an internal mechanism for evaluation.

    Authors: We acknowledge the value of making the linkage between Flywheel outputs and compliance mechanisms more explicit. The manuscript already describes the Flywheel as generating stage-specific artifacts (collection logs, annotation quality metrics, curation reports) intended to support traceability. In revision we will add a dedicated workflow diagram and accompanying text that maps these outputs to example ISO/PAS 8800 checks, including traceability requirements and verification criteria. This will illustrate the internal evaluation path while continuing to reference the standard for detailed normative requirements.

    revision: yes
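
The two promised additions, an edge-case coverage metric and a mapping from Flywheel artifacts to checks, can be sketched together. Everything below is hypothetical. The artifact names follow the rebuttal's examples (collection logs, annotation quality metrics, curation reports), but the definition of the coverage ratio, the scenario tags, the check identifiers, and the thresholds are invented placeholders. None of them come from the paper or from ISO/PAS 8800.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Callable


def coverage_ratio(sample_tags, required_scenarios, min_samples=1):
    """Fraction of required scenarios with at least `min_samples` samples.

    Also returns the sorted list of scenarios that fall short.
    """
    required = set(required_scenarios)
    if not required:
        raise ValueError("required_scenarios must not be empty")
    counts = Counter(sample_tags)
    covered = {s for s in required if counts[s] >= min_samples}
    return len(covered) / len(required), sorted(required - covered)


@dataclass(frozen=True)
class Check:
    check_id: str                   # placeholder ID, not a clause of any standard
    artifact: str                   # Flywheel artifact the check consumes
    passes: Callable[[dict], bool]  # verification criterion (invented threshold)


CHECKS = [
    Check("CHK-COLLECT-01", "collection_log",
          lambda a: a.get("sensor_calibrated") is True),
    Check("CHK-ANNOT-01", "annotation_metrics",
          lambda a: a.get("label_error_rate", 1.0) <= 0.02),
    Check("CHK-CURATE-01", "curation_report",
          lambda a: a.get("edge_case_coverage", 0.0) >= 0.9),
]


def trace(artifacts):
    """Return one (check_id, artifact, status) row per check."""
    rows = []
    for check in CHECKS:
        data = artifacts.get(check.artifact)
        if data is None:
            status = "missing artifact"
        else:
            status = "pass" if check.passes(data) else "fail"
        rows.append((check.check_id, check.artifact, status))
    return rows


# Toy data: one required scenario ("snow") has no samples at all.
tags = ["day_clear", "day_clear", "night_rain", "low_light"]
required = ["day_clear", "night_rain", "low_light", "snow"]
ratio, missing = coverage_ratio(tags, required)

artifacts = {
    "collection_log": {"sensor_calibrated": True},
    "curation_report": {"edge_case_coverage": ratio, "missing": missing},
}
for row in trace(artifacts):
    print(row)
```

In this toy run the collection check passes, the annotation check is reported as a missing artifact, and the curation check fails because only three of the four required scenarios are covered. Reporting a missing artifact separately from a failed check is a deliberate choice: an absent record is a traceability gap, not evidence of poor data.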

Circularity Check

0 steps flagged

No significant circularity in framework proposal

Full rationale

The paper presents a structured framework for dataset safety in autonomous driving aligned with external ISO/PAS 8800 guidelines. It introduces the AI Data Flywheel and dataset lifecycle (covering collection, annotation, curation, maintenance) along with safety analyses, requirements processes, and V&V strategies. These elements are proposed as best practices grounded in external standards and a literature review of recent research, with no mathematical derivations, equations, fitted parameters, or predictions that reduce by construction to the paper's own inputs. No self-citation chains, uniqueness theorems, or ansatzes from prior author work serve as load-bearing justifications. The contribution is self-contained as a process-oriented proposal against external benchmarks rather than an internally derived result.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The framework rests on the domain assumption that ISO/PAS 8800 guidelines are appropriate for dataset safety in AI perception and that structured processes can identify and mitigate dataset hazards without additional empirical proof.

axioms (1)
  • domain assumption: ISO/PAS 8800 provides suitable guidelines for safety in AI systems for autonomous driving
    The entire framework is presented as aligned with these guidelines.
invented entities (1)
  • AI Data Flywheel (no independent evidence)
    purpose: To represent the continuous cycle of data collection, annotation, curation, and maintenance for safety assurance
    New concept introduced to organize the dataset lifecycle in the framework.




Reference graph

Works this paper leans on

74 extracted references · 74 canonical work pages · 1 internal anchor

  1. [1]

    The safety risks of ai- driven solutions in autonomous road vehicles,

    F. Mirzarazi, S. Danishvar, and A. Mousavi, “The safety risks of ai- driven solutions in autonomous road vehicles,”World Electric Vehicle Journal, vol. 15, no. 10, p. 438, 2024

  2. [2]

    Neurall: Towards a unified visual perception model for automated driving,

    G. Sistu, I. Leang, S. Chennupati, S. Yogamani, C. Hughes, S. Milz, and S. Rawashdeh, “Neurall: Towards a unified visual perception model for automated driving,” in2019 IEEE Intelligent Transportation Systems Conference (ITSC). IEEE, 2019, pp. 796–803

  3. [3]

    Near-field depth estimation using monocular fisheye camera: A semi-supervised learning approach using sparse lidar data,

    V . R. Kumar, S. Milz, C. Witt, M. Simon, K. Amende, J. Petzold, S. Yogamani, and T. Pech, “Near-field depth estimation using monocular fisheye camera: A semi-supervised learning approach using sparse lidar data,” inCVPR Workshop, vol. 7, 2018, p. 2

  4. [4]

    Overview and empirical analysis of ISP parameter tuning for visual perception in autonomous driving,

    L. Yahiaoui, J. Horgan, B. Deeganet al., “Overview and empirical analysis of ISP parameter tuning for visual perception in autonomous driving,”Journal of Imaging, vol. 5, no. 10, p. 78, 2019

  5. [5]

    AuxNet: Auxiliary Tasks Enhanced Semantic Segmentation for Automated Driving,

    S. Chennupati, G. Sistu, S. Yogamaniet al., “AuxNet: Auxiliary Tasks Enhanced Semantic Segmentation for Automated Driving,” inProceed- ings of the International Conference on Computer Vision Theory and Applications, 2019, pp. 645–652

  6. [6]

    Collaborative perception datasets for autonomous driving: A review,

    N. Wang, D. Shang, Y . Gong, X. Hu, Z. Song, L. Yang, Y . Huang, X. Wang, and J. Lu, “Collaborative perception datasets for autonomous driving: A review,”arXiv preprint arXiv:2504.12696, 2025

  7. [7]

    A survey on autonomous driving datasets: Statistics, annotation quality, and a future outlook,

    M. Liu, E. Yurtsever, J. Fossaert, X. Zhou, W. Zimmer, Y . Cui, B. L. Zagar, and A. C. Knoll, “A survey on autonomous driving datasets: Statistics, annotation quality, and a future outlook,”IEEE Transactions on Intelligent Vehicles, 2024

  8. [8]

    Computer vision for autonomous vehicles: Problems, datasets and state of the art,

    J. Janai, F. G ¨uney, A. Behl, A. Geigeret al., “Computer vision for autonomous vehicles: Problems, datasets and state of the art,”Founda- tions and trends® in computer graphics and vision, vol. 12, no. 1–3, pp. 1–308, 2020

  9. [9]

    Are we hungry for 3d lidar data for semantic segmentation? a survey of datasets and methods,

    B. Gao, Y . Pan, C. Li, S. Geng, and H. Zhao, “Are we hungry for 3d lidar data for semantic segmentation? a survey of datasets and methods,” IEEE Transactions on Intelligent Transportation Systems, vol. 23, no. 7, pp. 6063–6081, 2021

  10. [10]

    Joseph and A

    L. Joseph and A. K. Mondal,Autonomous driving and advanced driver- assistance systems (ADAS): applications, development, legal issues, and testing. CRC Press, 2021

  11. [11]

    X-align: Cross-modal cross-view alignment for bird’s-eye-view segmentation,

    S. Borse, M. Klingner, V . R. Kumar, H. Cai, A. Almuzairee, S. Yoga- mani, and F. Porikli, “X-align: Cross-modal cross-view alignment for bird’s-eye-view segmentation,” inProceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2023, pp. 3287–3297

  12. [12]

    X3kd: Knowledge distillation across modalities, tasks and stages for multi-camera 3d object detection,

    M. Klingner, S. Borse, V . R. Kumar, B. Rezaei, V . Narayanan, S. Yoga- mani, and F. Porikli, “X3kd: Knowledge distillation across modalities, tasks and stages for multi-camera 3d object detection,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 13 343–13 353

  13. [13]

    FisheyeYOLO: Object Detec- tion on Fisheye Cameras for Autonomous Driving,

    H. Rashed, E. Mohamed, G. Sistuet al., “FisheyeYOLO: Object Detec- tion on Fisheye Cameras for Autonomous Driving,”Machine Learning for Autonomous Driving NeurIPSW, 2020

  14. [14]

    Challenges in de- signing datasets and validation for autonomous driving,

    M. U ˇriˇc´aˇr., D. Hurych., P. Kˇr´ıˇzek, and S. Yogamani., “Challenges in de- signing datasets and validation for autonomous driving,” inProceedings of the 14th International Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications - Volume 5 VISAPP: VISAPP ,. SciTePress, 2019, pp. 653–659

  15. [15]

    Artificial intelligence in automated driving: an analysis of safety and cybersecurity challenges,

    R. HAMON, H. JUNKLEWITZ, M. J. I. SANCHEZ, L. D. FERNAN- DEZ, G. E. GOMEZ, A. A. HERRERA, A. KRISTONet al., “Artificial intelligence in automated driving: an analysis of safety and cybersecurity challenges,” 2022

  16. [16]

    A survey on autonomous driving datasets,

    W. Liu, Q. Dong, P. Wang, G. Yang, L. Meng, Y . Song, Y . Shi, and Y . Xue, “A survey on autonomous driving datasets,” in2021 8th International Conference on Dependable Systems and Their Applications (DSA). IEEE, 2021, pp. 399–407

  17. [17]

    Open-sourced data ecosystem in autonomous driving: the present and future,

    H. Li, Y . Li, H. Wang, J. Zeng, H. Xu, P. Cai, L. Chen, J. Yan, F. Xu, L. Xionget al., “Open-sourced data ecosystem in autonomous driving: the present and future,”arXiv preprint arXiv:2312.03408, 2023

  18. [18]

    Synthetic datasets for autonomous driving: A survey,

    Z. Song, Z. He, X. Li, Q. Ma, R. Ming, Z. Mao, H. Pei, L. Peng, J. Hu, D. Yaoet al., “Synthetic datasets for autonomous driving: A survey,” IEEE Transactions on Intelligent Vehicles, vol. 9, no. 1, pp. 1847–1864, 2023

  19. [19]

    Perception datasets for anomaly detection in autonomous driving: A survey,

    D. Bogdoll, S. Uhlemeyer, K. Kowol, and J. M. Z ¨ollner, “Perception datasets for anomaly detection in autonomous driving: A survey,” in 2023 IEEE Intelligent Vehicles Symposium (IV). IEEE, 2023, pp. 1–8

  20. [20]

    A survey on datasets for the decision making of autonomous vehicles,

    Y . Wang, Z. Han, Y . Xing, S. Xu, and J. Wang, “A survey on datasets for the decision making of autonomous vehicles,”IEEE Intelligent Transportation Systems Magazine, vol. 16, no. 2, pp. 23–40, 2024

  21. [21]

    Aide: An automatic data engine for object detection in autonomous driving,

    M. Liang, J.-C. Su, S. Schulter, S. Garg, S. Zhao, Y . Wu, and M. Chandraker, “Aide: An automatic data engine for object detection in autonomous driving,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 14 695–14 706

  22. [22]

    Data-centric evolution in autonomous driving: A comprehensive survey of big data system, data mining, and closed-loop technologies,

    L. Li, W. Shao, W. Dong, Y . Tian, Q. Zhang, K. Yang, and W. Zhang, “Data-centric evolution in autonomous driving: A comprehensive survey of big data system, data mining, and closed-loop technologies,”arXiv preprint arXiv:2401.12888, 2024. 14 IEEE TRANSACTIONS ON INTELLIGENT TRANSPORTATION SYSTEMS

  23. [23]

    Tesla ai day 2022,

    Tesla, “Tesla ai day 2022,” 2022. [Online]. Available: https: //www.youtube.com/watch?v=ODSJsviD SU

  24. [24]

    Upgrading your fleet into an av data engine - scale ai,

    S. AI, “Upgrading your fleet into an av data engine - scale ai,” 2023. [Online]. Available: https://www.youtube.com/watch?v=lbOoXI1EeEs

  25. [25]

    The aurora data engine—advancing the aurora driver through valuable data that drives machine learning,

    Aurora, “The aurora data engine—advancing the aurora driver through valuable data that drives machine learning,” 2021. [Online]. Available: https://www.youtube.com/watch?v=Xe8YtdkMkS8

  26. [26]

    Momenta at cvpr 2023: How data-driven flywheel enables scalable path to full autonomy,

    Momenta, “Momenta at cvpr 2023: How data-driven flywheel enables scalable path to full autonomy,” 2023. [Online]. Available: https://www.youtube.com/watch?v=tNpEeIyuiJs

  27. [27]

    Maptr: Structured modeling and learning for online vectorized hd map construction

    B. Liao, S. Chen, X. Wang, T. Cheng, Q. Zhang, W. Liu, and C. Huang, “Maptr: Structured modeling and learning for online vectorized hd map construction,”arXiv preprint arXiv:2208.14437, 2022

  28. [28]

    Planning-oriented autonomous driving,

    Y . Hu, J. Yang, L. Chen, K. Li, C. Sima, X. Zhu, S. Chai, S. Du, T. Lin, W. Wanget al., “Planning-oriented autonomous driving,” inProceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2023, pp. 17 853–17 862

  29. [29]

    Vad: Vectorized scene representation for efficient autonomous driving,

    B. Jiang, S. Chen, Q. Xu, B. Liao, J. Chen, H. Zhou, Q. Zhang, W. Liu, C. Huang, and X. Wang, “Vad: Vectorized scene representation for efficient autonomous driving,” inProceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 8340–8350

  30. [30]

    Para- drive: Parallelized architecture for real-time autonomous driving,

    X. Weng, B. Ivanovic, Y . Wang, Y . Wang, and M. Pavone, “Para- drive: Parallelized architecture for real-time autonomous driving,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 15 449–15 458

  31. [31]

    Lingo-2: Driving with natural language,

    W. R. Teamet al., “Lingo-2: Driving with natural language,” 2024

  32. [32]

    Navigation-guided sparse scene representation for end-to-end autonomous driving,

    P. Li and D. Cui, “Navigation-guided sparse scene representation for end-to-end autonomous driving,” inThe Thirteenth International Con- ference on Learning Representations

  33. [33]

    Li, Adrien Bardes, Suzanne Petryk, Oscar Ma ˜nas, et al

    F. Bordes, R. Y . Pang, A. Ajay, A. C. Li, A. Bardes, S. Petryk, O. Ma˜nas, Z. Lin, A. Mahmoud, B. Jayaramanet al., “An introduction to vision- language modeling,”arXiv preprint arXiv:2405.17247, 2024

  34. [34]

    DriveVLM: The Convergence of Autonomous Driving and Large Vision-Language Models

    X. Tian, J. Gu, B. Li, Y . Liu, Y . Wang, Z. Zhao, K. Zhan, P. Jia, X. Lang, and H. Zhao, “Drivevlm: The convergence of autonomous driving and large vision-language models,”arXiv preprint arXiv:2402.12289, 2024

  35. [35]

    Covla: Comprehensive vision-language-action dataset for autonomous driving,

    H. Arai, K. Miwa, K. Sasaki, K. Watanabe, Y . Yamaguchi, S. Aoki, and I. Yamamoto, “Covla: Comprehensive vision-language-action dataset for autonomous driving,” in2025 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV). IEEE, 2025, pp. 1933–1943

  36. [36]

    A survey on sensor selection and placement for connected and automated mobility,

    M. Kiraz, F. Sivrikaya, and S. Albayrak, “A survey on sensor selection and placement for connected and automated mobility,”IEEE Open Journal of Intelligent Transportation Systems, 2024

  37. [37]

    A concept for requirements-driven identification and mitigation of dataset gaps for perception tasks in automated driving

    M. S. Moustafa, M. Bieshaar, A. Albrecht, and B. Sick, “A concept for requirements-driven identification and mitigation of dataset gaps for perception tasks in automated driving.”

  38. [38]

    Semantic-aware video compres- sion for automotive cameras,

    Y . Wang, P. H. Chan, and V . Donzella, “Semantic-aware video compres- sion for automotive cameras,”IEEE Transactions on Intelligent Vehicles, vol. 8, no. 6, pp. 3712–3722, 2023

  39. [39]

    A survey on data compression techniques for automotive lidar point clouds,

    R. Roriz, H. Silva, F. Dias, and T. Gomes, “A survey on data compression techniques for automotive lidar point clouds,”Sensors, vol. 24, no. 10, p. 3185, 2024

  40. [40]

    Navya3dseg- navya 3d semantic segmentation dataset design & split generation for autonomous vehicles,

    A. Almin, L. Lemari ´e, A. Duong, and B. R. Kiran, “Navya3dseg- navya 3d semantic segmentation dataset design & split generation for autonomous vehicles,”IEEE Robotics and Automation Letters, vol. 8, no. 9, pp. 5584–5591, 2023

  41. [41]

    Leakage in data mining: Formulation, detection, and avoidance,

    S. Kaufman, S. Rosset, C. Perlich, and O. Stitelman, “Leakage in data mining: Formulation, detection, and avoidance,”ACM Transactions on Knowledge Discovery from Data (TKDD), vol. 6, no. 4, pp. 1–21, 2012

  42. [42]

    D-lede: A data leakage detection method for automotive perception systems

    M. A. A. Babu, S. K. Pandey, D. Durisic, A. C. Koppisetty, and M. Staron, “D-lede: A data leakage detection method for automotive perception systems.”

  43. [43]

    Localization is all you evaluate: Data leakage in online mapping datasets and how to fix it,

    A. Lilja, J. Fu, E. Stenborg, and L. Hammarstrand, “Localization is all you evaluate: Data leakage in online mapping datasets and how to fix it,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 22 150–22 159

  44. [44]

    Zopp: A framework of zero-shot offboard panoptic perception for autonomous driving,

    T. Ma, H. Zhou, Q. Huang, X. Yang, J. Guo, B. Zhang, M. Dou, Y . Qiao, B. Shi, and H. Li, “Zopp: A framework of zero-shot offboard panoptic perception for autonomous driving,”Advances in Neural Information Processing Systems, vol. 37, pp. 140 266–140 291, 2024

  45. [45]

    Run-time introspection of 2d object detection in automated driving systems using learning representations,

    H. Y . Yatbaz, M. Dianati, K. Koufos, and R. Woodman, “Run-time introspection of 2d object detection in automated driving systems using learning representations,”IEEE Transactions on Intelligent Vehicles, vol. 9, no. 6, pp. 5033–5046, 2024

  46. [46]

    Objectlab: Automated diagnosis of mislabeled images in object detection data,

    U. Tkachenko, A. Thyagarajan, and J. Mueller, “Objectlab: Automated diagnosis of mislabeled images in object detection data,”arXiv preprint arXiv:2309.00832, 2023

  47. [47]

    Delving into localization errors for monocular 3d object detection,

    X. Ma, Y . Zhang, D. Xu, D. Zhou, S. Yi, H. Li, and W. Ouyang, “Delving into localization errors for monocular 3d object detection,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 4721–4730

  48. [48]

    Key safety design overview in ai-driven autonomous vehicles,

    V . Vyas and Z. Xu, “Key safety design overview in ai-driven autonomous vehicles,”arXiv preprint arXiv:2412.08862, 2024

  49. [49]

    A comprehensive review on traffic datasets and simulators for autonomous vehicles,

    S. Sarker, B. Maples, I. Islam, M. Fan, C. Papadopoulos, and W. Li, “A comprehensive review on traffic datasets and simulators for autonomous vehicles,”arXiv preprint arXiv:2412.14207, 2024

  50. [50]

    A systematic digital engineering approach to verification & validation of autonomous ground vehicles in off-road environments,

    T. Vilas Samak, C. Vilas Samak, J. Brault, C. Harber, K. McCane, J. Smereka, M. Brudnak, D. Gorsich, and V . Krovi, “A systematic digital engineering approach to verification & validation of autonomous ground vehicles in off-road environments,”arXiv e-prints, pp. arXiv– 2503, 2025

  51. [51]

    Road vehicles – safety and artificial intelligence,

    I. O. for Standardization, “Road vehicles – safety and artificial intelligence,” ISO/PAS Standard No. 8800:2024, 2024. [Online]. Available: https://www.iso.org/standard/83303.html

  52. [52]

    Are we ready for autonomous driving? the kitti vision benchmark suite,

    A. Geiger, P. Lenz, and R. Urtasun, “Are we ready for autonomous driving? the kitti vision benchmark suite,” in2012 IEEE conference on computer vision and pattern recognition. IEEE, 2012, pp. 3354–3361

  53. [53]

    Carla: An open urban driving simulator,

    A. Dosovitskiy, G. Ros, F. Codevilla, A. Lopez, and V . Koltun, “Carla: An open urban driving simulator,” inConference on robot learning. PMLR, 2017, pp. 1–16

  54. [54]

    Lgsvl simulator: A high fidelity simulator for autonomous driving,

    G. Rong, B. H. Shin, H. Tabatabaee, Q. Lu, S. Lemke, M. Mo ˇzeiko, E. Boise, G. Uhm, M. Gerow, S. Mehtaet al., “Lgsvl simulator: A high fidelity simulator for autonomous driving,” in2020 IEEE 23rd International conference on intelligent transportation systems (ITSC). IEEE, 2020, pp. 1–6

  55. [55]

    A survey on image data augmen- tation for deep learning,

    C. Shorten and T. M. Khoshgoftaar, “A survey on image data augmen- tation for deep learning,”Journal of big data, vol. 6, no. 1, pp. 1–48, 2019

  56. [56]

    Bdd100k: A diverse driving dataset for heterogeneous multitask learning,

    F. Yu, H. Chen, X. Wang, W. Xian, Y . Chen, F. Liu, V . Madhavan, and T. Darrell, “Bdd100k: A diverse driving dataset for heterogeneous multitask learning,” inProceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2020, pp. 2636–2645

  57. [57]

    nuscenes: A multimodal dataset for autonomous driving,

    H. Caesar, V . Bankiti, A. H. Lang, S. V ora, V . E. Liong, Q. Xu, A. Krishnan, Y . Pan, G. Baldan, and O. Beijbom, “nuscenes: A multimodal dataset for autonomous driving,” inProceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2020, pp. 11 621–11 631

  58. [58]

    Visualizing data using t-sne

    L. Van der Maaten and G. Hinton, “Visualizing data using t-sne.”Journal of machine learning research, vol. 9, no. 11, 2008

  59. [59]

    Comparing the benefits of pseudonymi- sation and anonymisation under the gdpr,

    M. Hintze and K. El Emam, “Comparing the benefits of pseudonymi- sation and anonymisation under the gdpr,”Journal of Data Protection & Privacy, vol. 2, no. 2, pp. 145–158, 2018

  60. [60]

    A unifying view on dataset shift in classification,

    J. G. Moreno-Torres, T. Raeder, R. Alaiz-Rodríguez, N. V. Chawla, and F. Herrera, “A unifying view on dataset shift in classification,” Pattern recognition, vol. 45, no. 1, pp. 521–530, 2012

  61. [61]

    Deep learning,

    I. Goodfellow, “Deep learning,” 2016

  62. [62]

    Scalability in perception for autonomous driving: Waymo open dataset,

    P. Sun, H. Kretzschmar, X. Dotiwalla, A. Chouard, V. Patnaik, P. Tsui, J. Guo, Y. Zhou, Y. Chai, B. Caine et al., “Scalability in perception for autonomous driving: Waymo open dataset,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2020, pp. 2446–2454

  63. [63]

    Processing, assessing, and enhancing the waymo autonomous vehicle open dataset for driving behavior research,

    X. Hu, Z. Zheng, D. Chen, X. Zhang, and J. Sun, “Processing, assessing, and enhancing the waymo autonomous vehicle open dataset for driving behavior research,” Transportation Research Part C: Emerging Technologies, vol. 134, p. 103490, 2022

  64. [64]

    Semi-automatic framework for traffic landmark annotation,

    W. H. Lee, K. Jung, C. Kang, and H. S. Chang, “Semi-automatic framework for traffic landmark annotation,” IEEE Open Journal of Intelligent Transportation Systems, vol. 2, pp. 1–12, 2021

  65. [65]

    Understanding the effectiveness of lossy compression in machine learning training sets,

    R. Underwood, J. C. Calhoun, S. Di, and F. Cappello, “Understanding the effectiveness of lossy compression in machine learning training sets,” arXiv preprint arXiv:2403.15953, 2024

  66. [66]

    Fast error-bounded lossy hpc data compression with sz,

    S. Di and F. Cappello, “Fast error-bounded lossy hpc data compression with sz,” in 2016 IEEE international parallel and distributed processing symposium (IPDPS). IEEE, 2016, pp. 730–739

  67. [67]

    Iterative compression towards in-distribution features in domain generalization,

    Y. Jiang, T. Zhang, Y. Li, G. Chen, and F. Chen, “Iterative compression towards in-distribution features in domain generalization,” Neurocomputing, vol. 638, p. 130011, 2025

  68. [68]

    Deep-learning-based image compression for microscopy images: An empirical study,

    Y. Zhou, J. Sollmann, and J. Chen, “Deep-learning-based image compression for microscopy images: An empirical study,” Biological Imaging, vol. 4, p. e16, 2024

  69. [69]

    Operability studies and hazard analysis,

    H. Lawley, “Operability studies and hazard analysis,” Chem. Eng. Prog., vol. 70, no. 4, pp. 45–56, 1974

  70. [70]

    A hierarchical hazop-like safety analysis for learning-enabled systems,

    Y. Qi, P. R. Conmy, W. Huang, X. Zhao, and X. Huang, “A hierarchical hazop-like safety analysis for learning-enabled systems,” arXiv preprint arXiv:2206.10216, 2022

  71. [71]

    Dataset fault tree analysis for systematic evaluation of machine learning systems,

    T. Aoki, D. Kawakami, N. Chida, and T. Tomita, “Dataset fault tree analysis for systematic evaluation of machine learning systems,” in 2020 IEEE 25th Pacific Rim International Symposium on Dependable Computing (PRDC). IEEE, 2020, pp. 100–109

  72. [72]

    Introducing the ml fmea,

    P. Schmitt, H. B. Seifert, M. Bijelic, K. Pennar, J. Lopez, and F. Heide, “Introducing the ml fmea,” SAE Technical Paper, Tech. Rep., 2025

  73. [73]

    Stpa for learning-enabled systems: a survey and a new practice,

    Y. Qi, Y. Dong, S. Khastgir, P. Jennings, X. Zhao, and X. Huang, “Stpa for learning-enabled systems: a survey and a new practice,” in 2023 IEEE 26th International Conference on Intelligent Transportation Systems (ITSC). IEEE, 2023, pp. 1381–1388

  74. [74]

    Collaborative perception in autonomous driving: Methods, datasets, and challenges,

    Y. Han, H. Zhang, H. Li, Y. Jin, C. Lang, and Y. Li, “Collaborative perception in autonomous driving: Methods, datasets, and challenges,” IEEE Intelligent Transportation Systems Magazine, vol. 15, no. 6, pp. 131–151, 2023