arXiv: 2604.22068 · v1 · submitted 2026-04-23 · cs.SE · cs.RO

TRACE: Topology-aware Reconstruction of Accidents in CARLA for AV Evaluation

Nahian Salsabil, Sebastian Elbaum

Pith reviewed 2026-05-09 20:50 UTC · model grok-4.3

keywords: autonomous vehicles, crash reconstruction, CARLA simulator, NHTSA reports, road topology, safety validation, accident scenarios, OpenStreetMap

The pith

TRACE turns NHTSA crash reports into CARLA simulations that keep the real road layouts from maps.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a pipeline called TRACE that automatically converts official descriptions of real car crashes into detailed simulations inside the CARLA driving environment. It pulls the precise road shapes and intersections from open map data for each crash location, uses large language models to determine where each vehicle started and what it was doing before impact, and then builds full driving paths that match the report details. The result is an open collection of 52 varied crash cases that include different collision types, road layouts, and pre-crash actions. A sympathetic reader would care because current AV tests often rely on made-up conflicts or simplified roads that miss the specific geometries where real failures occur.

Core claim

TRACE automates the reconstruction of NHTSA crash reports into high-fidelity CARLA simulations by retrieving site-specific OpenStreetMap data to preserve exact road topology, leveraging Large Language Models to infer vehicles' initial state from road geometry and pre-crash maneuvers, and generating simulation trajectories from semi-structured report data, yielding a benchmark of 52 diverse accident scenarios.

What carries the argument

The TRACE pipeline, which combines OpenStreetMap data retrieval for exact road topology, LLM inference of initial vehicle states, and trajectory generation from report data.
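The data flow through these three components can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the `CrashReport`, `VehicleState`, and `Scenario` types, the stage signatures, and the `reconstruct` function are hypothetical names introduced here. In the real pipeline the three stages would call out to OpenStreetMap, an LLM, and a trajectory generator for CARLA.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Point = Tuple[float, float]


@dataclass
class CrashReport:
    """Hypothetical container for the fields a report would supply."""
    case_id: str
    latitude: float
    longitude: float
    narrative: str


@dataclass
class VehicleState:
    """Hypothetical initial state for one vehicle."""
    vehicle_id: int
    position: Point
    heading_deg: float
    speed_mps: float


@dataclass
class Scenario:
    case_id: str
    road_network: str                  # e.g. an OpenDRIVE document as text
    initial_states: List[VehicleState]
    trajectories: List[List[Point]]


def reconstruct(
    report: CrashReport,
    fetch_map: Callable[[float, float], str],
    infer_states: Callable[[CrashReport, str], List[VehicleState]],
    plan_trajectories: Callable[[CrashReport, List[VehicleState]], List[List[Point]]],
) -> Scenario:
    """Chain the three stages: map retrieval, state inference, trajectories."""
    road_network = fetch_map(report.latitude, report.longitude)
    states = infer_states(report, road_network)
    trajectories = plan_trajectories(report, states)
    return Scenario(report.case_id, road_network, states, trajectories)
```

Passing the stages in as callables keeps the sketch runnable with stand-in functions, and it also makes clear that the LLM step only ever sees the report and the retrieved road network.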

If this is right

AV developers can test against simulations drawn from actual crash sites rather than generic or invented road layouts.
The 52 scenarios cover multiple collision types and road topologies, allowing systematic exposure of weaknesses that appear in real incidents.
An open-source benchmark reduces the need to wait for rare real-world AV failures by recreating known ones in a repeatable simulator.
The method supports scaling to additional reports to grow the set of testable cases over time.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same reconstruction steps could be applied to crash data from other countries or databases beyond NHTSA.
If the simulations prove reliable, patterns across the 52 cases might reveal which road features most often contribute to AV errors.
Testing could extend to edge cases like different weather or lighting by layering those on top of the base reconstructions.
The approach might help regulators define minimum simulation coverage requirements based on real crash distributions.

Load-bearing premise

The vehicle starting positions and paths inferred by the language models, together with the roads taken from maps, are close enough to the original crashes that AV systems will fail in the simulations for the same reasons they would fail in reality.

What would settle it

Compare the behavior and failure points of an AV system run in the TRACE simulations against detailed records or video of the original real-world crashes; large mismatches in collision timing, location, or sequence would show the reconstructions are not faithful.
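One way such a comparison could be scored is sketched below. It assumes, hypothetically, that both the simulated and the recorded crash can be reduced to a first-contact time, a planar location in a shared frame, and the pair of vehicles involved; the `CollisionEvent` type and `collision_mismatch` function are illustrative names, and reducing "sequence" to the first-contact pair is a simplification.

```python
import math
from dataclasses import dataclass
from typing import Dict, Tuple, Union


@dataclass
class CollisionEvent:
    """Hypothetical summary of a first-contact event."""
    time_s: float                     # seconds since scenario start
    location_m: Tuple[float, float]   # planar x, y in metres, shared frame
    first_contact: Tuple[int, int]    # ids of the two vehicles that touch first


def collision_mismatch(
    simulated: CollisionEvent, recorded: CollisionEvent
) -> Dict[str, Union[float, bool]]:
    """Return timing, location, and participant mismatches between two events."""
    dx = simulated.location_m[0] - recorded.location_m[0]
    dy = simulated.location_m[1] - recorded.location_m[1]
    return {
        "time_error_s": abs(simulated.time_s - recorded.time_s),
        "location_error_m": math.hypot(dx, dy),
        # Order of the pair does not matter, so compare as sets.
        "same_vehicles": set(simulated.first_contact) == set(recorded.first_contact),
    }
```

What counts as a "large" mismatch is left open here; any threshold would need to be justified against the precision of the original records.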

Figures

Figures reproduced from arXiv: 2604.22068 by Nahian Salsabil, Sebastian Elbaum.

**Figure 2.** TRACE pipeline. The Extractor parses NHTSA crash reports. The Map Reconstructor retrieves and converts OSM data to OpenDRIVE. The Scene Reconstructor uses an LLM-based State Estimator to infer vehicle trajectories for the Launcher to execute in simulation.

**Figure 1.** Sample scenarios generated by TRACE showing the …

**Figure 3.** Road topology of StateCase=510179 in (a) real-world …

Original abstract

Validating Autonomous Vehicles (AVs) requires exposure to rare, safety-critical scenarios, infrequent in routine driving data. Existing benchmarks address this by generating synthetic conflicts or mapping accident descriptions to abstract road geometries, failing to capture the topological complexity of real-world crashes. We introduce TRACE, a pipeline that automates the reconstruction of NHTSA crash reports into high-fidelity CARLA simulations by (1) retrieving site-specific OpenStreetMap data to preserve exact road topology, (2) leveraging Large Language Models to infer vehicles' initial state from road geometry and pre-crash maneuvers, and (3) generating simulation trajectories from semi-structured report data. Using this pipeline, we curated a benchmark of 52 diverse accident scenarios covering varied collision types, road topologies, and pre-crash maneuvers, providing a challenging open source resource for testing AV systems against real-world failures.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

TRACE gives a workable pipeline for pulling real NHTSA reports into CARLA with actual road topology but skips any check on whether the LLM steps and trajectories match the original crashes.

The main contribution is a three-step pipeline that grabs site-specific OpenStreetMap data to keep the real road layout, uses LLMs to infer starting positions and maneuvers from the report text plus geometry, and then produces CARLA trajectories. They ran this on 52 NHTSA reports to build a benchmark covering varied collisions and topologies, and they released the scenarios openly. That combination of real topology plus automated inference from actual reports is not something the cited prior work on synthetic conflicts or abstract mappings does in the same way. The description of the steps is straightforward and the open release is a practical plus for anyone who wants ready-to-run test cases instead of having to build them from scratch.

The work is aimed at AV validation researchers who need safety-critical scenarios that come from real incidents rather than invented ones. A reader working on simulation-based testing would find the dataset and the automation useful as a starting point even if they end up adding their own checks.

The soft spot is the missing validation. The paper does not report any metrics comparing simulated impact speeds, angles, contact points, or post-crash motion against the original report values, nor any expert review of how faithful the reconstructions are. That leaves the central claim of high-fidelity, challenging scenarios resting on an untested assumption about the LLM inference and trajectory generation. The circularity burden is zero since this is a construction method, not a fitted model.

I would bring this to a reading group as a maybe because the idea is concrete and the output is usable, but the lack of fidelity numbers limits how far you can trust the results without further work. I would not cite it yet for the same reason. It deserves peer review because the approach is grounded in real data and fills a clear gap in AV testing resources; referees can push for the validation experiments that would make the claims stronger.

Referee Report

2 major / 2 minor

Summary. The paper presents TRACE, an automated pipeline that reconstructs NHTSA crash reports into CARLA simulations. It retrieves site-specific OpenStreetMap data to preserve exact road topologies, uses Large Language Models to infer vehicles' initial states from road geometry and pre-crash maneuvers, and generates simulation trajectories from semi-structured report data. The authors produce and release a benchmark of 52 diverse accident scenarios covering varied collision types, road topologies, and pre-crash maneuvers as a resource for testing AV systems against real-world failures.

Significance. If the reconstructions are shown to be faithful, this would be a useful contribution to AV evaluation by supplying topology-preserving simulations derived from actual crash reports, filling a gap left by purely synthetic or abstract-geometry benchmarks and enabling more realistic exposure to safety-critical events.

major comments (2)

[Abstract and Section 3 (Pipeline)] Abstract and pipeline description: The central claim that the CARLA simulations are 'high-fidelity' and suitable for exposing real-world AV failure modes rests on LLM-inferred initial states and generated trajectories, yet no quantitative metrics (impact speed, angle, point of contact, or post-crash motion) are reported comparing the simulations to the original NHTSA report values or any ground-truth data.
[Section 4 (Benchmark)] Benchmark section: The curation of the 52 scenarios is presented without any error analysis, fidelity scores, or expert validation of how well the reconstructed dynamics match the source crashes; this directly affects the claim that the benchmark is 'challenging' for AV testing.
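The kind of quantitative check requested in the first major comment could look like the sketch below. It is an illustration only, assuming that each vehicle's planar velocity and heading at first contact can be logged from the simulation and that a reported reference value exists for comparison; the function names are hypothetical and nothing here is taken from the paper.

```python
import math
from typing import Tuple

Vec2 = Tuple[float, float]


def closing_speed(v1: Vec2, v2: Vec2) -> float:
    """Magnitude of the relative velocity between two vehicles, in m/s."""
    return math.hypot(v1[0] - v2[0], v1[1] - v2[1])


def impact_angle_deg(heading1_deg: float, heading2_deg: float) -> float:
    """Smallest angle between two headings, folded into [0, 180] degrees."""
    diff = abs(heading1_deg - heading2_deg) % 360.0
    return 360.0 - diff if diff > 180.0 else diff


def relative_error(simulated: float, reported: float) -> float:
    """Absolute error of the simulated value as a fraction of the reported one."""
    if reported == 0:
        raise ValueError("reported value must be non-zero for a relative error")
    return abs(simulated - reported) / abs(reported)
```

For example, headings of 350° and 10° give an impact angle of 20°, and a simulated closing speed of 22 m/s against a reported 20 m/s gives a relative error of 0.1.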

minor comments (2)

[Section 3] The description of how semi-structured report data is parsed into simulation parameters could include a concrete example or pseudocode for reproducibility.
[Figures] Figure captions for scenario visualizations should explicitly note which elements (road topology, vehicle positions, trajectories) are derived from OSM versus LLM inference.
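A concrete example of the sort the first minor comment asks for might resemble the following sketch, which turns one semi-structured vehicle record into simulation parameters. Everything in it is assumed for illustration: the record keys (`vehicle_number`, `pre_crash_movement`, `travel_speed_mph`), the maneuver vocabulary, and the `ActorSpec` type are hypothetical and do not reflect the actual NHTSA schema or the TRACE parser.

```python
from dataclasses import dataclass
from typing import Dict, Optional

MPH_TO_MPS = 0.44704  # exact conversion factor from miles per hour to metres per second

# Hypothetical mapping from report phrases to a small maneuver vocabulary.
MANEUVERS = {
    "going straight": "straight",
    "turning left": "turn_left",
    "turning right": "turn_right",
    "stopped in traffic lane": "stopped",
}


@dataclass
class ActorSpec:
    vehicle_id: int
    maneuver: str
    target_speed_mps: Optional[float]  # None when the report gives no usable speed


def parse_vehicle(record: Dict[str, str]) -> ActorSpec:
    """Convert one vehicle record into simulation parameters."""
    phrase = record.get("pre_crash_movement", "").strip().lower()
    maneuver = MANEUVERS.get(phrase, "unknown")

    raw_speed = record.get("travel_speed_mph", "").strip()
    # Accept plain non-negative decimals only; anything else is treated as missing.
    is_number = raw_speed.replace(".", "", 1).isdigit()
    speed = float(raw_speed) * MPH_TO_MPS if is_number else None

    return ActorSpec(int(record["vehicle_number"]), maneuver, speed)
```

Unrecognized phrases and non-numeric speeds are mapped to `"unknown"` and `None` rather than guessed, so a downstream step can decide how to fill them.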

Simulated Author's Rebuttal

2 responses · 1 unresolved

Thank you for the constructive review of our manuscript on TRACE. We address the major comments point-by-point below, acknowledging where additional validation is needed, and outline the revisions we will make.

Referee: Abstract and Section 3 (Pipeline): The central claim that the CARLA simulations are 'high-fidelity' and suitable for exposing real-world AV failure modes rests on LLM-inferred initial states and generated trajectories, yet no quantitative metrics (impact speed, angle, point of contact, or post-crash motion) are reported comparing the simulations to the original NHTSA report values or any ground-truth data.

Authors: We agree that the manuscript would be strengthened by quantitative validation where possible. NHTSA crash reports are primarily narrative and semi-structured, and do not consistently provide precise numerical values for impact speed, angle, point of contact, or post-crash motion across all cases. Our pipeline achieves fidelity primarily through site-specific OSM topology retrieval and LLM inference of initial states and trajectories that are consistent with the described pre-crash maneuvers and road geometry. In the revision, we will update the abstract and Section 3 to clarify these sources of fidelity, add a limitations discussion on the lack of exact dynamic matching, and include qualitative comparisons (e.g., trajectory visualizations aligned to report descriptions) for a representative subset of scenarios where partial details are available. We will also replace the term 'high-fidelity' with 'topology-preserving' in key claims.
revision: partial

Referee: Benchmark section: The curation of the 52 scenarios is presented without any error analysis, fidelity scores, or expert validation of how well the reconstructed dynamics match the source crashes; this directly affects the claim that the benchmark is 'challenging' for AV testing.

Authors: We acknowledge that the current benchmark section does not include systematic error analysis or fidelity scoring. The 52 scenarios were selected to maximize diversity in collision types, road topologies, and pre-crash maneuvers drawn directly from NHTSA reports. In the revised manuscript, we will expand Section 4 with a new error analysis subsection that discusses sources of potential discrepancy (including LLM inference variability) and provides qualitative fidelity assessments via manual review and trajectory-report alignment for sampled scenarios. This will support the claim that the benchmark is challenging by explicitly linking each scenario to documented real-world failure modes.
revision: yes
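LLM inference variability, mentioned in the response above, could be quantified in a simple way by repeating the state inference for the same report and measuring how far the inferred start positions scatter. The sketch below assumes such repeated samples are available as planar coordinates in metres; `position_spread_m` is a hypothetical helper, not part of the described pipeline.

```python
import math
import statistics
from typing import List, Tuple


def position_spread_m(samples: List[Tuple[float, float]]) -> float:
    """Root-mean-square distance of sampled start positions from their centroid."""
    cx = statistics.fmean([x for x, _ in samples])
    cy = statistics.fmean([y for _, y in samples])
    squared = [(x - cx) ** 2 + (y - cy) ** 2 for x, y in samples]
    return math.sqrt(statistics.fmean(squared))
```

A spread near zero would indicate that repeated runs place the vehicle in essentially the same spot, while a spread comparable to a lane width would suggest the inferred state is not stable enough to treat a single run as the reconstruction.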

standing simulated objections not resolved

Full quantitative metrics (e.g., exact impact speeds and angles) cannot be provided for the complete set of 52 scenarios, as the source NHTSA reports lack consistent numerical ground-truth data for these parameters.

Circularity Check

0 steps flagged

No circularity: construction pipeline with no derivations or self-referential reductions

The manuscript describes an automated pipeline that retrieves OSM topology, uses LLMs to infer initial states from report text and geometry, and generates CARLA trajectories. No equations, fitted parameters, or mathematical derivations are present in the provided text. The central output (the 52-scenario benchmark) is produced by the described steps rather than reduced to its inputs by construction. No self-citations, uniqueness theorems, or ansatzes are invoked as load-bearing premises. Lack of quantitative fidelity metrics is an empirical-validation concern, not a circularity issue in any derivation chain.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on the domain assumption that existing tools (CARLA, OSM, LLMs) can be composed to produce faithful crash reconstructions without new physical modeling; no free parameters or invented entities are introduced.

axioms (2)

domain assumption: Large language models can reliably infer plausible vehicle initial states and pre-crash maneuvers from road geometry descriptions and semi-structured crash report text.
This assumption is required for pipeline step (2) and is not independently verified in the abstract.
domain assumption: CARLA simulations driven by OSM-derived road topology plus LLM-generated trajectories are sufficiently representative of real crashes for AV safety evaluation.
This is the load-bearing premise that justifies releasing the 52 scenarios as a challenging benchmark.
