Scenario Generation for Testing of Autonomous Driving Systems Using Real-World Failure Records

Anjali Parashar; Chuchu Fan

arxiv: 2606.31131 · v1 · pith:B3UQ4RM7new · submitted 2026-06-30 · 💻 cs.AI · cs.RO


Pith reviewed 2026-07-01 05:55 UTC · model grok-4.3

classification 💻 cs.AI cs.RO

keywords scenario generation · autonomous driving systems · LLM · failure records · simulation testing · NHTSA · Metadrive · crash records


The pith

LLM pipeline generates accurate and diverse test scenarios for autonomous driving from real-world failure records.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents a method for creating test scenarios for autonomous driving systems (ADS) by using large language models to process historical crash records written in natural language. The approach aims to replace manual template design and mathematical optimization with direct translation of real failure conditions into simulator inputs. Applied to NHTSA records on the MetaDrive simulator, the method produces scenarios covering four road types and three non-ego vehicle movement types, along with road anomalies such as work zones. The paper reports that the generated scenarios align with the provided testing conditions and uncover system failures within a budget of only twenty scenarios. This matters because it uses existing failure data to make pre-deployment testing more efficient and realistic.

Core claim

The central contribution is a modular LLM-based pipeline that extracts categorical and contextual information from natural-language ADS crash records and translates it into diverse, simulator-compatible scenarios. Applied to NHTSA records for testing on MetaDrive, it generates scenarios that combine 4 road types, 3 non-ego vehicle movement types, and on-road anomalies such as work zones. The paper reports that these scenarios align with the provided testing conditions and reveal failures within a budget of 20 scenarios.
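To make the extraction step concrete, here is a minimal sketch of how an LLM's structured output could be validated before it reaches a simulator. It is not the authors' implementation. The field names (`road_type`, `cp_movement`, `work_zone`) are assumptions modeled on the meta variables in the figure captions, and the allowed values are limited to those named on this page. The paper reports 4 road types and 3 movement types, so these sets are deliberately incomplete. The LLM response is a hard-coded JSON string; no model is called.

```python
import json
from dataclasses import dataclass

# Values named in the figure captions on this page. The paper reports 4 road
# types and 3 movement types; the unnamed ones are deliberately not guessed.
KNOWN_ROAD_TYPES = {"Intersection", "Traffic Circle", "Highway/Freeway"}
KNOWN_CP_MOVEMENTS = {"Proceeding Straight"}


@dataclass(frozen=True)
class MetaVariables:
    """Categorical fields extracted from one crash record (hypothetical schema)."""
    road_type: str
    cp_movement: str
    work_zone: bool


def parse_llm_output(raw: str) -> MetaVariables:
    """Parse and validate a JSON string standing in for an LLM response."""
    data = json.loads(raw)
    road_type = data["road_type"]
    cp_movement = data["cp_movement"]
    work_zone = data["work_zone"]
    if road_type not in KNOWN_ROAD_TYPES:
        raise ValueError(f"unknown road type: {road_type!r}")
    if cp_movement not in KNOWN_CP_MOVEMENTS:
        raise ValueError(f"unknown CP movement: {cp_movement!r}")
    if not isinstance(work_zone, bool):
        raise ValueError("work_zone must be a JSON boolean")
    return MetaVariables(road_type, cp_movement, work_zone)


# Hard-coded stand-in for an LLM response; no model is called here.
example = (
    '{"road_type": "Intersection", '
    '"cp_movement": "Proceeding Straight", "work_zone": false}'
)
meta = parse_llm_output(example)
print(meta)
```

Rejecting out-of-vocabulary values at this boundary is one way to keep extraction errors from silently turning into simulator inputs. It does not check whether an in-vocabulary value is faithful to the source record.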

What carries the argument

Modular LLM-based synthetic scenario generation that translates natural-language failure records into simulator-compatible scenarios.

Load-bearing premise

The large language model can reliably extract and translate categorical and contextual information from natural language failure records into simulator-compatible scenarios without significant errors or loss of fidelity.

What would settle it

Running the 20 generated scenarios in the MetaDrive simulator and checking whether each one matches the conditions in its source NHTSA record, and whether it reveals the reported system failures, would test the claim.

Figures

Figures reproduced from arXiv: 2606.31131 by Anjali Parashar, Chuchu Fan.

**Figure 2.** Trajectory rollout for Cluster 6. Meta variables for this example are: Road Type-Traffic Circle, Work Zone-No. We show an exemplar LLM-paraphrased narrative (Section 4.2.1) at the bottom, and the corresponding trajectory rollouts (top) for the scenario obtained from scenario generation (Section 4.2.2). The scenarios generated by our paradigm can be easily integrated with existing math…

**Figure 3.** Trajectory rollout for Cluster 3. Meta variables for this scenario are: Road Type-Intersection, CP movement-Proceeding Straight, Work Zone-No. The figure shows frames corresponding to the initial gap before fine-tuning (left) and the final gap for the scenario fine-tuned by the fine-tuning agent (right), showing the improvement in fatality made using LLM-based fine-tuning. available data into diverse groups, as shown in [PI…

**Figure 4.** Trajectory rollout for Cluster 2. Meta variables for this scenario: CP movement-Proceeding Straight, Road Type-Intersection, Work Zone-No. While SV does not crash and maintains a safe distance from CP, SV shows oscillatory movement to avoid a crash at the beginning, despite being at a sufficient distance from CP. The paraphrasing agent and scenario design agent adapt to system-specific requirements. The ge…

**Figure 5.** Trajectory rollout for Clusters 4 and 13. Meta variables for this scenario: CP movement-Proceeding Straight, Road Type-Intersection (Cluster 4), Highway/Freeway (Cluster 13), Work Zone-Yes. Appendix A. Discussion of results (Q4) Why do we need a separate paraphrasing agent to generate synthetic narratives? Firstly, narratives may not be available for all kinds of scenario templates. In such cases, the LLM-ba…

**Figure 6.** Prompt used by scenario generation agent for actual schema generation.

Original abstract

To ensure safe on-road behavior, pre-deployment testing and failure discovery of Autonomous Driving Systems (ADS) is crucial. Present-day simulation-based testing methods focus largely on mathematical models for efficient search of optimal scenarios, assuming a fixed scenario representation. On the other hand, real-world testing involves substantial manual effort to design scenario templates for testing. These templates represent distinct failure scenarios consisting of pre-deployment vehicle movements, map types, etc. Historical failure records for ADS are a reliable source of real-world failure conditions, which can be used for scenario generation. In this work, we propose a scenario generation pipeline using categorical and contextual information available from historical records in natural language format. Our approach consists of modular LLM-based synthetic scenario generation, compatible with the testing constraints of a given system. We successfully apply our method to generate a diverse set of scenarios for testing autonomous navigation on the Metadrive simulator using the NHTSA ADS crash records. Our approach results in accurate and diverse scenario generation with a combination of 4 road types and 3 non-ego vehicle movement types, including on-road anomalies in the form of working zones. Generated scenarios align with the provided testing conditions and reveal interesting failures of the system within a limited testing budget of 20 scenarios. Code is available at https://github.com/anjaliParashar/crash2scenario.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The paper gives a practical LLM pipeline to turn NHTSA crash text into MetaDrive scenarios but offers no quantitative check that the extraction step keeps the original details intact.


The paper's main contribution is a modular pipeline that feeds natural-language NHTSA ADS crash records into an LLM to produce MetaDrive test scenarios. They extract road type, non-ego movements, and anomalies such as work zones, then run 20 generated cases that they say match the source conditions and surface some failures.

The practical angle is useful: it starts from real failure records rather than synthetic templates or pure optimization, releases the code, and shows coverage across four road types and three movement categories. That is a straightforward engineering step that others working on data-driven ADS testing could build on.

The soft spot is exactly where the stress-test note points. The central claim is that the generated scenarios are accurate and aligned, yet the abstract and description give no fidelity metrics, no human review of extracted parameters against the source text, and no error analysis on the LLM translation. Without that, it is hard to know whether the failures found actually trace back to the recorded conditions or to distortions introduced during extraction. The diversity and failure-revelation results therefore sit on an unverified step.

This is for people already doing simulation-based ADS validation who want concrete examples of pulling real crash data into a simulator. The idea is clear enough to deserve referee time, even though the current evidence for the accuracy claim is thin. I would send it for review and expect the main request to be for the missing validation numbers.

Referee Report

2 major / 1 minor

Summary. The paper proposes a modular LLM-based pipeline to translate natural-language NHTSA ADS crash records into simulator-compatible scenarios for MetaDrive. It claims to produce accurate and diverse scenarios (4 road types, 3 non-ego movement types, work-zone anomalies) that align with given testing constraints and reveal failures in only 20 scenarios. Code is released at the cited GitHub repository.

Significance. If the extraction fidelity holds, the work supplies a practical bridge between real-world failure records and simulation testing that could complement purely mathematical scenario search or manual template design. Public code release supports reproducibility and is a clear strength.

major comments (2)

[Abstract and results] Abstract and results description: the central claim that scenarios are 'accurate' and that the LLM 'reliably extract[s] and translate[s]' categorical/contextual information rests on an unverified step. No fidelity metric, edit-distance score, inter-annotator agreement, or source-record vs. generated-parameter comparison is reported, leaving the diversity and failure-revelation results without quantitative grounding.
[Method and evaluation] Method and evaluation sections: the pipeline description does not include any error analysis or validation protocol for the LLM extraction of road type, vehicle movement, or anomaly fields. Because this extraction is the load-bearing prerequisite for all downstream claims, its absence prevents assessment of whether the generated scenarios actually preserve the original records.
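The comparison the referee asks for can be sketched in a few lines. The example below computes a raw agreement rate and Cohen's kappa between LLM-extracted values and human-assigned labels for a single field. The label lists are made-up illustrative data, not results from the paper, and the function names are hypothetical. Cohen's kappa is normally reported between two human annotators; using it for LLM-versus-human agreement is an assumption of this sketch.

```python
from collections import Counter


def agreement_rate(extracted, reference):
    """Fraction of positions where the two label sequences match."""
    if len(extracted) != len(reference) or not extracted:
        raise ValueError("sequences must be non-empty and of equal length")
    matches = sum(e == r for e, r in zip(extracted, reference))
    return matches / len(extracted)


def cohens_kappa(a, b):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = len(a)
    p_observed = agreement_rate(a, b)
    counts_a, counts_b = Counter(a), Counter(b)
    labels = counts_a.keys() | counts_b.keys()
    # Chance agreement from the two marginal label distributions.
    p_expected = sum(counts_a[k] * counts_b[k] for k in labels) / (n * n)
    if p_expected == 1.0:
        return 1.0  # both sequences use one identical label throughout
    return (p_observed - p_expected) / (1.0 - p_expected)


# Illustrative labels only; not data from the paper.
llm_road_type = ["Intersection", "Intersection", "Highway/Freeway",
                 "Traffic Circle", "Intersection", "Highway/Freeway"]
human_road_type = ["Intersection", "Highway/Freeway", "Highway/Freeway",
                   "Traffic Circle", "Intersection", "Highway/Freeway"]

rate = agreement_rate(llm_road_type, human_road_type)
kappa = cohens_kappa(llm_road_type, human_road_type)
print(f"agreement={rate:.3f} kappa={kappa:.3f}")
```

Reporting these per field (road type, movement, work zone) rather than as one pooled number would show which part of the extraction is least reliable.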

minor comments (1)

[Conclusion] The GitHub link is helpful; consider adding a short reproducibility statement or example run script in the paper itself.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback, which highlights the need for stronger validation of the LLM extraction step. We address each major comment below and will incorporate revisions to provide quantitative grounding for the fidelity claims.


Referee: [Abstract and results] Abstract and results description: the central claim that scenarios are 'accurate' and that the LLM 'reliably extract[s] and translate[s]' categorical/contextual information rests on an unverified step. No fidelity metric, edit-distance score, inter-annotator agreement, or source-record vs. generated-parameter comparison is reported, leaving the diversity and failure-revelation results without quantitative grounding.

Authors: We agree that the manuscript lacks explicit quantitative metrics (such as fidelity scores or inter-annotator agreement) for the LLM extraction of categorical and contextual fields from NHTSA records. The claims of accuracy rest on the observed alignment of generated scenarios with the reported road types, vehicle movements, and anomalies, plus their ability to surface failures within the 20-scenario budget. To strengthen this, the revised manuscript will add a validation subsection reporting agreement rates from a manual review of extracted parameters against a sample of source records. revision: yes
Referee: [Method and evaluation] Method and evaluation sections: the pipeline description does not include any error analysis or validation protocol for the LLM extraction of road type, vehicle movement, or anomaly fields. Because this extraction is the load-bearing prerequisite for all downstream claims, its absence prevents assessment of whether the generated scenarios actually preserve the original records.

Authors: The referee is correct that no formal error analysis or validation protocol for the extraction of road type, movement, and anomaly fields appears in the Method or Evaluation sections. Our evaluation emphasized downstream simulator outcomes rather than direct extraction fidelity. We will revise the Method section to describe a validation protocol (e.g., sampling records for human verification of extracted fields) and report associated error rates in the revised manuscript. revision: yes
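A validation protocol of the kind described in the response could look roughly like the sketch below: draw a reproducible random sample of record IDs for human verification, then report the observed error rate with a Wilson score interval. The record IDs, sample size, and error count are all hypothetical, and the Wilson interval is one reasonable choice for small samples rather than anything the authors have committed to.

```python
import math
import random


def sample_record_ids(record_ids, k, seed=0):
    """Draw k distinct record IDs reproducibly for manual review."""
    rng = random.Random(seed)
    return sorted(rng.sample(list(record_ids), k))


def wilson_interval(errors, n, z=1.96):
    """Wilson score interval for an error proportion (z=1.96 is about 95%)."""
    if n <= 0 or not 0 <= errors <= n:
        raise ValueError("need n > 0 and 0 <= errors <= n")
    p = errors / n
    denom = 1.0 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, centre - half), min(1.0, centre + half)


# Hypothetical setup: 500 records, 50 of them reviewed by a human.
all_ids = range(500)
review_ids = sample_record_ids(all_ids, k=50, seed=42)

# Hypothetical outcome: 3 of the 50 reviewed extractions were wrong.
low, high = wilson_interval(errors=3, n=50)
print(f"reviewed {len(review_ids)} records, "
      f"error rate 95% interval: [{low:.3f}, {high:.3f}]")
```

Fixing the seed lets a reader re-derive which records were reviewed, and the interval makes clear how much uncertainty remains when only a small sample is checked.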

Circularity Check

0 steps flagged

No significant circularity; applied pipeline with external validation


The paper describes an LLM-based engineering pipeline that extracts categorical information from NHTSA natural-language crash records and generates simulator scenarios for Metadrive. No equations, derivations, fitted parameters renamed as predictions, or self-citation chains appear in the provided text. The central claims rest on empirical application results (diversity across 4 road types, 3 movement types, work zones, and observed failures in 20 scenarios) rather than any reduction to inputs by construction. This matches the default case of a self-contained applied method scored 0-2.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract provides no details on parameters or assumptions beyond the high-level pipeline description.


