pith. machine review for the scientific record.

arxiv: 2604.22068 · v1 · submitted 2026-04-23 · 💻 cs.SE · cs.RO

Recognition: unknown

TRACE: Topology-aware Reconstruction of Accidents in CARLA for AV Evaluation

Pith reviewed 2026-05-09 20:50 UTC · model grok-4.3

keywords: autonomous vehicles, crash reconstruction, CARLA simulator, NHTSA reports, road topology, safety validation, accident scenarios, OpenStreetMap

The pith

TRACE turns NHTSA crash reports into CARLA simulations that keep the real road layouts from maps.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a pipeline called TRACE that automatically converts official descriptions of real car crashes into detailed simulations inside the CARLA driving environment. It pulls the precise road shapes and intersections from open map data for each crash location, uses large language models to determine where each vehicle started and what it was doing before impact, and then builds full driving paths that match the report details. The result is an open collection of 52 varied crash cases that include different collision types, road layouts, and pre-crash actions. A sympathetic reader would care because current AV tests often rely on made-up conflicts or simplified roads that miss the specific geometries where real failures occur.

Core claim

TRACE automates the reconstruction of NHTSA crash reports into high-fidelity CARLA simulations by retrieving site-specific OpenStreetMap data to preserve exact road topology, leveraging Large Language Models to infer vehicles' initial state from road geometry and pre-crash maneuvers, and generating simulation trajectories from semi-structured report data, yielding a benchmark of 52 diverse accident scenarios.

What carries the argument

The TRACE pipeline, which combines OpenStreetMap data retrieval for exact road topology, LLM inference of initial vehicle states, and trajectory generation from report data.
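
The stages named here (map retrieval, LLM state inference, trajectory generation) can be pictured as a simple composition. The following is a minimal Python sketch, not the authors' code: every class, field, and function name is hypothetical, the LLM is replaced by a caller-supplied function, the map step is a stub, and the coordinates are made up.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class CrashRecord:
    case_id: str
    latitude: float
    longitude: float
    maneuvers: List[str]  # one pre-crash maneuver per vehicle (hypothetical field)


@dataclass
class VehicleState:
    x: float
    y: float
    heading_deg: float
    speed_mps: float


@dataclass
class Scenario:
    case_id: str
    road_network: str
    initial_states: List[VehicleState]


# Stand-in for the LLM step: (road_network, maneuver) -> inferred initial state.
StateEstimator = Callable[[str, str], VehicleState]


def reconstruct_map(record: CrashRecord) -> str:
    # Stub: a real step would fetch OSM data around the site and convert it
    # to OpenDRIVE. Here it only returns a label for the location.
    return f"opendrive-for-{record.latitude:.4f},{record.longitude:.4f}"


def build_scenario(record: CrashRecord, estimate_state: StateEstimator) -> Scenario:
    road = reconstruct_map(record)
    states = [estimate_state(road, maneuver) for maneuver in record.maneuvers]
    return Scenario(record.case_id, road, states)


def fake_estimator(road_network: str, maneuver: str) -> VehicleState:
    # Placeholder logic so the sketch runs without an LLM.
    speed = 0.0 if maneuver == "stopped" else 10.0
    return VehicleState(0.0, 0.0, 0.0, speed)


record = CrashRecord("demo-case", 38.0336, -78.5080, ["going straight", "stopped"])
scenario = build_scenario(record, fake_estimator)
print(scenario)
```

Passing the estimator in as a function keeps the LLM-dependent step swappable, which is also where the fidelity concerns raised later in this review concentrate.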

If this is right

  • AV developers can test against simulations drawn from actual crash sites rather than generic or invented road layouts.
  • The 52 scenarios cover multiple collision types and road topologies, allowing systematic exposure of weaknesses that appear in real incidents.
  • An open-source benchmark reduces the need to wait for rare real-world AV failures by recreating known ones in a repeatable simulator.
  • The method supports scaling to additional reports to grow the set of testable cases over time.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same reconstruction steps could be applied to crash data from other countries or databases beyond NHTSA.
  • If the simulations prove reliable, patterns across the 52 cases might reveal which road features most often contribute to AV errors.
  • Testing could extend to edge cases like different weather or lighting by layering those on top of the base reconstructions.
  • The approach might help regulators define minimum simulation coverage requirements based on real crash distributions.

Load-bearing premise

The vehicle starting positions and paths inferred by the language models, together with the roads taken from maps, are close enough to the original crashes that AV systems will fail in the simulations for the same reasons they would fail in reality.

What would settle it

Compare the behavior and failure points of an AV system run in the TRACE simulations against detailed records or video of the original real-world crashes; large mismatches in collision timing, location, or sequence would show the reconstructions are not faithful.
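
One way to make "large mismatches" concrete is to fix tolerances on collision time and location and flag runs that exceed them. The sketch below is illustrative only: the record fields and the tolerance values are hypothetical, and it assumes a reference collision time and position exist, which the rebuttal further down notes is often not the case for NHTSA reports.

```python
import math
from dataclasses import dataclass
from typing import Tuple


@dataclass
class CollisionEvent:
    time_s: float  # seconds since scenario start
    x_m: float     # collision position in a shared local frame, metres
    y_m: float


def mismatch(sim: CollisionEvent, ref: CollisionEvent) -> Tuple[float, float]:
    """Return (absolute time gap in seconds, position gap in metres)."""
    dt = abs(sim.time_s - ref.time_s)
    dist = math.hypot(sim.x_m - ref.x_m, sim.y_m - ref.y_m)
    return dt, dist


def is_faithful(sim: CollisionEvent, ref: CollisionEvent,
                max_dt_s: float = 1.0, max_dist_m: float = 3.0) -> bool:
    # Tolerances are placeholders, not values proposed by the paper.
    dt, dist = mismatch(sim, ref)
    return dt <= max_dt_s and dist <= max_dist_m
```

A check on collision sequence (which vehicles made contact, in what order) would need a richer event record than this sketch carries.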

Figures

Figures reproduced from arXiv: 2604.22068 by Nahian Salsabil, Sebastian Elbaum.

Figure 2: TRACE pipeline. The Extractor parses NHTSA crash reports. The Map Reconstructor retrieves and converts OSM data to OpenDRIVE. The Scene Reconstructor uses an LLM-based State Estimator to infer vehicle trajectories for the Launcher to execute in simulation.
Figure 1: Sample scenarios generated by TRACE showing the …
Figure 3: Road Topology of StateCase=510179 in (a) real-world …

Original abstract

Validating Autonomous Vehicles (AVs) requires exposure to rare, safety-critical scenarios, infrequent in routine driving data. Existing benchmarks address this by generating synthetic conflicts or mapping accident descriptions to abstract road geometries, failing to capture the topological complexity of real-world crashes. We introduce TRACE, a pipeline that automates the reconstruction of NHTSA crash reports into high-fidelity CARLA simulations by (1) retrieving site-specific OpenStreetMap data to preserve exact road topology, (2) leveraging Large Language Models to infer vehicles' initial state from road geometry and pre-crash maneuvers, and (3) generating simulation trajectories from semi-structured report data. Using this pipeline, we curated a benchmark of 52 diverse accident scenarios covering varied collision types, road topologies, and pre-crash maneuvers, providing a challenging open source resource for testing AV systems against real-world failures.
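
Step (1) presupposes some way to select the map region around a crash site. As an illustration only, the sketch below computes a latitude/longitude bounding box of a given half-width around a point, using a spherical-Earth approximation. The function name and the 200 m default are assumptions, not values from the paper, and the text here does not say how the authors delimit the region.

```python
import math
from typing import Tuple

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius, spherical approximation


def bounding_box(lat_deg: float, lon_deg: float,
                 half_width_m: float = 200.0) -> Tuple[float, float, float, float]:
    """Return (min_lon, min_lat, max_lon, max_lat) around a point.

    Not valid near the poles, where cos(latitude) approaches zero.
    """
    dlat = math.degrees(half_width_m / EARTH_RADIUS_M)
    dlon = math.degrees(
        half_width_m / (EARTH_RADIUS_M * math.cos(math.radians(lat_deg)))
    )
    return (lon_deg - dlon, lat_deg - dlat, lon_deg + dlon, lat_deg + dlat)
```

The tuple is ordered left, bottom, right, top, which matches the order that tools such as `osmium extract --bbox` take; whether TRACE uses that command is not stated here.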

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper presents TRACE, an automated pipeline that reconstructs NHTSA crash reports into CARLA simulations. It retrieves site-specific OpenStreetMap data to preserve exact road topologies, uses Large Language Models to infer vehicles' initial states from road geometry and pre-crash maneuvers, and generates simulation trajectories from semi-structured report data. The authors produce and release a benchmark of 52 diverse accident scenarios covering varied collision types, road topologies, and pre-crash maneuvers as a resource for testing AV systems against real-world failures.

Significance. If the reconstructions are shown to be faithful, this would be a useful contribution to AV evaluation by supplying topology-preserving simulations derived from actual crash reports, filling a gap left by purely synthetic or abstract-geometry benchmarks and enabling more realistic exposure to safety-critical events.

major comments (2)
  1. [Abstract and Section 3 (Pipeline)] Abstract and pipeline description: The central claim that the CARLA simulations are 'high-fidelity' and suitable for exposing real-world AV failure modes rests on LLM-inferred initial states and generated trajectories, yet no quantitative metrics (impact speed, angle, point of contact, or post-crash motion) are reported comparing the simulations to the original NHTSA report values or any ground-truth data.
  2. [Section 4 (Benchmark)] Benchmark section: The curation of the 52 scenarios is presented without any error analysis, fidelity scores, or expert validation of how well the reconstructed dynamics match the source crashes; this directly affects the claim that the benchmark is 'challenging' for AV testing.
minor comments (2)
  1. [Section 3] The description of how semi-structured report data is parsed into simulation parameters could include a concrete example or pseudocode for reproducibility.
  2. [Figures] Figure captions for scenario visualizations should explicitly note which elements (road topology, vehicle positions, trajectories) are derived from OSM versus LLM inference.
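
Minor comment 1 asks for a concrete parsing example. The sketch below shows one shape such an example could take: a dictionary with coded fields is mapped to simulation parameters through lookup tables. The field names, code values, and the speed-limit fallback are invented for illustration; they are neither the NHTSA schema nor the authors' parser.

```python
from dataclasses import dataclass
from typing import Dict, List

# Hypothetical code table; real report codings differ.
MANEUVER_CODES = {
    1: "going_straight",
    2: "turning_left",
    3: "turning_right",
    4: "stopped",
}
MPH_TO_MPS = 0.44704  # exact conversion factor


@dataclass
class VehicleParams:
    vehicle_id: int
    maneuver: str
    target_speed_mps: float


def parse_report(report: Dict) -> List[VehicleParams]:
    params = []
    for vehicle in report["vehicles"]:
        maneuver = MANEUVER_CODES.get(vehicle["maneuver_code"], "unknown")
        speed_mph = vehicle.get("travel_speed_mph")
        if speed_mph is None:
            # Assumption: fall back to the posted limit when speed is missing.
            speed_mph = report["speed_limit_mph"]
        params.append(
            VehicleParams(vehicle["id"], maneuver, speed_mph * MPH_TO_MPS)
        )
    return params
```

Making the fallback explicit matters for the referee's fidelity concern: any parameter filled in by default rather than read from the report is a place where the simulation can diverge from the crash.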

Simulated Author's Rebuttal

2 responses · 1 unresolved

Thank you for the constructive review of our manuscript on TRACE. We address the major comments point-by-point below, acknowledging where additional validation is needed, and outline the revisions we will make.

Point-by-point responses
  1. Referee: Abstract and Section 3 (Pipeline): The central claim that the CARLA simulations are 'high-fidelity' and suitable for exposing real-world AV failure modes rests on LLM-inferred initial states and generated trajectories, yet no quantitative metrics (impact speed, angle, point of contact, or post-crash motion) are reported comparing the simulations to the original NHTSA report values or any ground-truth data.

    Authors: We agree that the manuscript would be strengthened by quantitative validation where possible. NHTSA crash reports are primarily narrative and semi-structured, and do not consistently provide precise numerical values for impact speed, angle, point of contact, or post-crash motion across all cases. Our pipeline achieves fidelity primarily through site-specific OSM topology retrieval and LLM inference of initial states and trajectories that are consistent with the described pre-crash maneuvers and road geometry. In the revision, we will update the abstract and Section 3 to clarify these sources of fidelity, add a limitations discussion on the lack of exact dynamic matching, and include qualitative comparisons (e.g., trajectory visualizations aligned to report descriptions) for a representative subset of scenarios where partial details are available. We will also replace the term 'high-fidelity' with 'topology-preserving' in key claims. revision: partial

  2. Referee: Benchmark section: The curation of the 52 scenarios is presented without any error analysis, fidelity scores, or expert validation of how well the reconstructed dynamics match the source crashes; this directly affects the claim that the benchmark is 'challenging' for AV testing.

    Authors: We acknowledge that the current benchmark section does not include systematic error analysis or fidelity scoring. The 52 scenarios were selected to maximize diversity in collision types, road topologies, and pre-crash maneuvers drawn directly from NHTSA reports. In the revised manuscript, we will expand Section 4 with a new error analysis subsection that discusses sources of potential discrepancy (including LLM inference variability) and provides qualitative fidelity assessments via manual review and trajectory-report alignment for sampled scenarios. This will support the claim that the benchmark is challenging by explicitly linking each scenario to documented real-world failure modes. revision: yes
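
The "LLM inference variability" mentioned in the second response could be quantified by running the state estimator several times on the same report and measuring how far the inferred start positions scatter. A minimal sketch, assuming each run yields an (x, y) start position in metres; the function name is hypothetical and this is not a procedure the authors describe.

```python
import math
from statistics import mean
from typing import List, Tuple

Point = Tuple[float, float]


def position_spread(samples: List[Point]) -> Tuple[Point, float]:
    """Return the centroid of the samples and the largest distance from it."""
    cx = mean(x for x, _ in samples)
    cy = mean(y for _, y in samples)
    radius = max(math.hypot(x - cx, y - cy) for x, y in samples)
    return (cx, cy), radius
```

A spread larger than a lane width would suggest that repeated runs place the vehicle in different lanes, which is the kind of discrepancy an error analysis subsection would need to report.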

standing simulated objections not resolved
  • Full quantitative metrics (e.g., exact impact speeds and angles) cannot be provided for the complete set of 52 scenarios, as the source NHTSA reports lack consistent numerical ground-truth data for these parameters.

Circularity Check

0 steps flagged

No circularity: construction pipeline with no derivations or self-referential reductions

full rationale

The manuscript describes an automated pipeline that retrieves OSM topology, uses LLMs to infer initial states from report text and geometry, and generates CARLA trajectories. No equations, fitted parameters, or mathematical derivations are present in the provided text. The central output (the 52-scenario benchmark) is produced by the described steps rather than reduced to its inputs by construction. No self-citations, uniqueness theorems, or ansatzes are invoked as load-bearing premises. Lack of quantitative fidelity metrics is an empirical-validation concern, not a circularity issue in any derivation chain.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on the domain assumption that existing tools (CARLA, OSM, LLMs) can be composed to produce faithful crash reconstructions without new physical modeling; no free parameters or invented entities are introduced.

axioms (2)
  • domain assumption Large language models can reliably infer plausible vehicle initial states and pre-crash maneuvers from road geometry descriptions and semi-structured crash report text.
    This assumption is required for pipeline step (2) and is not independently verified in the abstract.
  • domain assumption CARLA simulations driven by OSM-derived road topology plus LLM-generated trajectories are sufficiently representative of real crashes for AV safety evaluation.
    This is the load-bearing premise that justifies releasing the 52 scenarios as a challenging benchmark.
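
The first axiom can be probed mechanically, at least in part: an inferred start position that lies far from every lane centerline is implausible regardless of the report text. A minimal sketch, assuming lane centerlines are available as 2D polylines in metres and using a hypothetical 2 m tolerance; none of these names come from the paper.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]


def point_segment_distance(p: Point, a: Point, b: Point) -> float:
    """Shortest distance from point p to the segment a-b."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    seg_len_sq = dx * dx + dy * dy
    if seg_len_sq == 0.0:
        return math.hypot(px - ax, py - ay)  # degenerate segment
    # Projection parameter of p onto the segment, clamped to [0, 1].
    t = max(0.0, min(1.0, ((px - ax) * dx + (py - ay) * dy) / seg_len_sq))
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))


def distance_to_polyline(p: Point, polyline: List[Point]) -> float:
    # Requires at least two points in the polyline.
    return min(
        point_segment_distance(p, a, b) for a, b in zip(polyline, polyline[1:])
    )


def on_road(p: Point, centerlines: List[List[Point]],
            tolerance_m: float = 2.0) -> bool:
    return any(distance_to_polyline(p, line) <= tolerance_m for line in centerlines)
```

Such a check only rules out gross errors; it says nothing about whether the chosen lane or heading matches the crash.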

pith-pipeline@v0.9.0 · 5441 in / 1511 out tokens · 34202 ms · 2026-05-09T20:50:03.528734+00:00 · methodology
