Automating Crash Diagram Generation Using Vision-Language Models: A Case Study on Multi-Lane Roundabouts
Pith reviewed 2026-05-15 14:44 UTC · model grok-4.3
The pith
Vision-language models can generate usable crash diagrams from police reports for multi-lane roundabouts.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that vision-language models, led by GPT-4o, can interpret police crash reports and produce diagrams for multi-lane roundabouts when guided by a three-part structured prompt framework. GPT-4o reached an average score of 6.29 out of 10 on a custom 10-metric evaluation system that assesses semantic accuracy, spatial fidelity, and visual clarity, and it showed stronger alignment between extracted data and generated visuals than the other models tested.
What carries the argument
A three-part structured prompt framework that guides models through interpretation of the crash report, extraction of relevant data, and visual synthesis of the diagram.
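The three stages can be pictured as a sequence of prompts built from one report. The sketch below is only an illustration under stated assumptions: the stage names come from the paper's description, but the function `build_prompts`, the `StagePrompt` type, the instruction wording, and the sample narrative are all hypothetical and are not the paper's actual prompts.

```python
from dataclasses import dataclass

# Stage names follow the paper's description; the instruction text is hypothetical.
STAGES = (
    ("interpretation", "Summarize what happened in the crash narrative below."),
    ("extraction", "List each vehicle, its entry leg, lane, maneuver, and point of impact."),
    ("visual_synthesis", "Draw a crash diagram of the roundabout using the extracted details."),
)


@dataclass
class StagePrompt:
    stage: str
    text: str


def build_prompts(report_text: str) -> list[StagePrompt]:
    """Return one prompt per stage, each embedding the same report text."""
    prompts = []
    for stage, instruction in STAGES:
        text = f"[{stage}]\n{instruction}\n\nReport:\n{report_text.strip()}"
        prompts.append(StagePrompt(stage=stage, text=text))
    return prompts


if __name__ == "__main__":
    # Invented narrative, for illustration only.
    sample = "Vehicle 1 entered from the north leg in the left lane and struck Vehicle 2."
    for prompt in build_prompts(sample):
        print(prompt.text, end="\n\n")
```

In practice the output of each stage would probably feed the next one. The summary here does not say how the paper chains the stages, so the sketch keeps them independent.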
If this is right
- Crash analysis workflows could reduce time spent on manual diagram preparation.
- Diagram consistency may increase across different analysts and locations.
- The prompting approach offers a starting point for applying generative AI to other spatial engineering visualizations.
- Models with better spatial reasoning, like GPT-4o, align extracted crash details more closely with the output diagrams.
- Current performance gaps point to needed improvements in handling complex road geometries.
Where Pith is reading between the lines
- The same prompt structure could extend to diagram generation for other intersection types beyond roundabouts.
- Batch processing of historical crash data might become feasible if linked to existing safety databases.
- The 10-metric evaluation could benchmark future vision-language models on transportation visualization tasks.
- Wider adoption might lower variability in safety reporting and support faster policy analysis.
Load-bearing premise
The custom 10-metric evaluation system supplies an objective, reproducible measure of diagram quality that aligns with expert human judgment, especially for spatial fidelity in complex roundabout geometries.
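To make the premise concrete, here is a minimal sketch of how a score out of 10 could be assembled. It assumes the scheme described in the simulated rebuttal on this page (ten metrics, each rated on a 0-1 scale) and assumes equal weighting, which the paper summary does not confirm. The function names and the demo ratings are hypothetical.

```python
from statistics import mean

N_METRICS = 10  # the paper describes a 10-metric evaluation system


def diagram_score(ratings: list[float]) -> float:
    """Sum ten per-metric ratings (each in [0, 1]) into a score out of 10."""
    if len(ratings) != N_METRICS:
        raise ValueError(f"expected {N_METRICS} ratings, got {len(ratings)}")
    if any(not 0.0 <= r <= 1.0 for r in ratings):
        raise ValueError("each rating must lie in [0, 1]")
    return float(sum(ratings))


def model_average(all_ratings: list[list[float]]) -> float:
    """Average the per-diagram totals across all scored diagrams."""
    return mean(diagram_score(r) for r in all_ratings)


if __name__ == "__main__":
    # Invented ratings for two diagrams, for illustration only.
    demo = [
        [1, 1, 0.5, 0.5, 1, 0, 0.5, 1, 0.5, 1],
        [0.5, 0.5, 0.5, 0, 1, 0, 0.5, 0.5, 0.5, 1],
    ]
    print(model_average(demo))
```

An unweighted sum treats a missing label and a misplaced vehicle as equally serious, which is one reason the weighting question matters for the spatial fidelity claims.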
What would settle it
Independent human experts applying the same 10 metrics to the generated diagrams produce average scores that differ markedly from the reported model results, particularly on spatial fidelity for GPT-4o.
Original abstract
Crash diagrams are essential tools in transportation safety analysis, yet their manual preparation remains time-consuming and prone to human variability. This study investigates the use of Vision-Language Models (VLMs) to automate crash diagram generation from police crash reports, focusing on multilane roundabouts as a challenging test case. A three-part structured prompt framework was developed to guide model reasoning through interpretation, extraction, and visual synthesis, while a 10-metric evaluation system was designed to assess diagram quality in terms of semantic accuracy, spatial fidelity, and visual clarity. Three popular models, including GPT-4o, Gemini-1.5-Flash, and Janus-4o, were tested on 79 crash reports. GPT-4o achieved the highest average performance (6.29 out of 10), followed by Gemini-1.5-Flash (5.28) and Janus-4o (3.64). The analysis revealed GPT-4o's superior spatial reasoning and alignment between extracted and visualized crash data. These results highlight both the promise and current limitations of VLMs in engineering visualization tasks. The study lays the groundwork for integrating generative AI into crash analysis workflows to improve efficiency, consistency, and interpretability.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. This paper investigates automating crash diagram generation from police reports using Vision-Language Models (VLMs), with a focus on multi-lane roundabouts as a challenging case. It proposes a three-part structured prompt framework for interpretation, extraction, and visual synthesis, and introduces a custom 10-metric evaluation system for semantic accuracy, spatial fidelity, and visual clarity. Three models (GPT-4o, Gemini-1.5-Flash, Janus-4o) are tested on 79 reports, with GPT-4o achieving the highest average score of 6.29/10 and showing superior spatial reasoning.
Significance. If the evaluation proves reproducible and aligned with expert judgment, the work could meaningfully advance AI-assisted tools in transportation safety by reducing manual effort and variability in crash diagram preparation. The structured prompt approach and real-world roundabout test cases offer a practical benchmark for VLM capabilities in engineering visualization, highlighting both promise and current limits in spatial tasks.
major comments (2)
- [Evaluation methodology and results] The 10-metric evaluation system (described in the methods and results sections) is load-bearing for all performance claims, including the headline scores (GPT-4o at 6.29/10, Gemini-1.5-Flash at 5.28, Janus-4o at 3.64) and the conclusion of superior spatial reasoning. However, the manuscript provides no definitions or weighting for the metrics, no description of the scoring procedure or who performed the scoring, and no inter-rater reliability statistics or validation against expert human judgments of diagram correctness. This leaves open whether scores reflect actual spatial fidelity or simply prompt compliance.
- [Data and experimental setup] The selection criteria for the 79 crash reports, any blinding procedures, and statistical tests for differences between model scores are not reported. Without these, the comparative claims cannot be assessed for robustness or generalizability beyond the specific sample.
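As a rough illustration of the reliability check the first comment asks for, the sketch below compares two raters' total scores for the same diagrams using Pearson correlation and mean absolute difference. This is one simple option chosen for brevity, not the paper's procedure. An intraclass correlation or a weighted kappa would be a more standard choice for absolute agreement. The function names and the demo scores are hypothetical.

```python
from math import sqrt


def pearson_r(x: list[float], y: list[float]) -> float:
    """Pearson correlation between two raters' scores for the same diagrams."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length score lists with at least 2 items")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined when a rater gives constant scores")
    return sxy / sqrt(sxx * syy)


def mean_abs_diff(x: list[float], y: list[float]) -> float:
    """Average absolute gap between the two raters, in score points."""
    if len(x) != len(y) or not x:
        raise ValueError("need two equal-length, non-empty score lists")
    return sum(abs(a - b) for a, b in zip(x, y)) / len(x)


if __name__ == "__main__":
    # Invented totals (out of 10) from two hypothetical raters.
    rater_a = [6.5, 7.0, 4.0, 5.5, 8.0]
    rater_b = [6.0, 7.5, 3.5, 6.0, 7.0]
    print(round(pearson_r(rater_a, rater_b), 3))
    print(round(mean_abs_diff(rater_a, rater_b), 3))
```

Correlation alone can be high even when one rater scores consistently lower than the other, which is why the absolute gap is reported alongside it.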
minor comments (2)
- Clarify the model name 'Janus-4o' with a citation or description, as it is less commonly referenced than GPT-4o or Gemini-1.5-Flash.
- Ensure all figures of generated diagrams include scale bars or annotations to allow readers to directly assess spatial fidelity claims.
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed comments, which highlight important aspects of reproducibility and methodological transparency. We have prepared point-by-point responses below and will revise the manuscript to incorporate additional details where feasible.
- Referee: [Evaluation methodology and results] The 10-metric evaluation system (described in the methods and results sections) is load-bearing for all performance claims, including the headline scores (GPT-4o at 6.29/10, Gemini-1.5-Flash at 5.28, Janus-4o at 3.64) and the conclusion of superior spatial reasoning. However, the manuscript provides no definitions or weighting for the metrics, no description of the scoring procedure or who performed the scoring, and no inter-rater reliability statistics or validation against expert human judgments of diagram correctness. This leaves open whether scores reflect actual spatial fidelity or simply prompt compliance.
  Authors: We agree that the current manuscript lacks sufficient detail on the evaluation methodology, which is essential for interpreting the reported scores. In the revised version, we will expand the methods section to include: (1) explicit definitions and descriptions for each of the 10 metrics, grouped under semantic accuracy, spatial fidelity, and visual clarity; (2) the scoring procedure, with each metric rated on a 0-1 scale (yielding a total out of 10) and any weighting applied; (3) clarification that scoring was performed by the authors, who have domain expertise in transportation safety; and (4) an explicit statement acknowledging the absence of inter-rater reliability statistics as a limitation, along with plans to incorporate expert validation in follow-up work. These additions will help demonstrate that the scores target actual diagram quality rather than prompt adherence alone.
  Revision: yes
- Referee: [Data and experimental setup] The selection criteria for the 79 crash reports, any blinding procedures, and statistical tests for differences between model scores are not reported. Without these, the comparative claims cannot be assessed for robustness or generalizability beyond the specific sample.
  Authors: We acknowledge the need for greater transparency in the experimental setup. The revised methods section will specify the selection criteria for the 79 reports (drawn from publicly available police crash reports involving multi-lane roundabouts, filtered for sufficient textual detail on vehicle positions and maneuvers). We will also note that no formal blinding procedures were employed, as the evaluation relied on direct author review of model outputs. Finally, we will add statistical comparisons (e.g., one-way ANOVA with post-hoc tests) to assess differences between model scores, reporting p-values and effect sizes in the results. These revisions will allow readers to better evaluate the robustness and scope of the findings.
  Revision: yes
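The statistical comparison mentioned in the second response can be sketched in plain Python. The example computes the one-way ANOVA F statistic and eta squared (an effect size) for independent groups of scores. The function name and the demo scores are hypothetical placeholders, not the paper's data. Turning F into a p-value needs the F distribution, which a library routine such as `scipy.stats.f_oneway` handles.

```python
def one_way_anova(groups: dict[str, list[float]]) -> tuple[float, float]:
    """Return (F statistic, eta squared) for independent groups of scores."""
    k = len(groups)
    n_total = sum(len(g) for g in groups.values())
    if k < 2 or n_total <= k or any(len(g) == 0 for g in groups.values()):
        raise ValueError("need at least two non-empty groups and more scores than groups")

    grand_mean = sum(sum(g) for g in groups.values()) / n_total
    ss_between = 0.0  # variation of group means around the grand mean
    ss_within = 0.0   # variation of scores around their own group mean
    for scores in groups.values():
        group_mean = sum(scores) / len(scores)
        ss_between += len(scores) * (group_mean - grand_mean) ** 2
        ss_within += sum((s - group_mean) ** 2 for s in scores)

    if ss_within == 0:
        raise ValueError("F is undefined when there is no within-group variance")

    f_stat = (ss_between / (k - 1)) / (ss_within / (n_total - k))
    eta_squared = ss_between / (ss_between + ss_within)
    return f_stat, eta_squared


if __name__ == "__main__":
    # Invented placeholder scores (out of 10); NOT results from the paper.
    demo = {
        "model_a": [6.0, 7.0, 6.5, 5.5],
        "model_b": [5.0, 5.5, 6.0, 4.5],
        "model_c": [3.0, 4.0, 3.5, 4.5],
    }
    f_stat, eta_sq = one_way_anova(demo)
    print(f"F = {f_stat:.2f}, eta^2 = {eta_sq:.2f}")
```

Because every model was scored on the same 79 reports, the scores are paired across models. A repeated-measures ANOVA or a Friedman test may therefore suit the design better than the independent-groups version sketched here.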
Circularity Check
No significant circularity; evaluation uses external references
The paper describes an empirical comparison of VLM-generated crash diagrams against human-prepared reference diagrams using a custom 10-metric rubric for semantic accuracy, spatial fidelity, and visual clarity. No mathematical derivations, fitted parameters, predictions, or self-citations are present that reduce any result to the inputs by construction. The performance scores (e.g., GPT-4o at 6.29/10) are computed directly from the external metric applied to model outputs versus references, with no self-definitional loops or load-bearing self-citations. This is a standard empirical setup without circular reduction.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: VLMs can reliably interpret textual crash descriptions and synthesize accurate visual diagrams when given a structured three-part prompt
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean, washburn_uniqueness_aczel (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: "A three-part structured prompt framework was developed to guide model reasoning through interpretation, extraction, and visual synthesis, while a 10-metric evaluation system was designed to assess diagram quality in terms of semantic accuracy, spatial fidelity, and visual clarity."
- IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean, reality_from_one_distinction (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: "GPT-4o achieved the highest average performance (6.29 out of 10)"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.