Automating Crash Diagram Generation Using Vision-Language Models: A Case Study on Multi-Lane Roundabouts

Hao Zhen; Jidong J. Yang; Xiao Lu

arxiv: 2604.15332 · v1 · submitted 2026-03-09 · 💻 cs.HC · cs.AI · cs.CV · cs.SE

Pith reviewed 2026-05-15 14:44 UTC · model grok-4.3

classification 💻 cs.HC · cs.AI · cs.CV · cs.SE

keywords vision-language models · crash diagram generation · multi-lane roundabouts · police crash reports · GPT-4o · transportation safety · spatial reasoning · automation

The pith

Vision-language models can generate usable crash diagrams from police reports for multi-lane roundabouts.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper tests whether vision-language models can automate crash diagram creation from written police reports, using multi-lane roundabouts as a difficult test case. The authors built a three-part prompt structure to guide models through report interpretation, data extraction, and diagram synthesis, paired with a 10-metric scoring system that checks semantic accuracy, spatial fidelity, and visual clarity. On 79 real reports, GPT-4o scored highest at 6.29 out of 10, outperforming Gemini-1.5-Flash and Janus-4o in spatial reasoning and data alignment. A reader would care because manual diagrams take significant time and vary by person, so reliable automation could speed up and standardize safety analysis in transportation engineering.

Core claim

The central claim is that vision-language models, led by GPT-4o, can interpret police crash reports and produce diagrams for multi-lane roundabouts when guided by a three-part structured prompt framework. GPT-4o reaches an average score of 6.29 out of 10 on a custom 10-metric evaluation system that assesses semantic accuracy, spatial fidelity, and visual clarity, and it shows stronger alignment between extracted data and generated visuals than the alternatives tested.

What carries the argument

A three-part structured prompt framework that guides models through interpretation of the crash report, extraction of relevant data, and visual synthesis of the diagram.
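A minimal sketch of how such a three-stage prompt sequence might be assembled is shown here. The stage names follow the paper's description (interpretation, extraction, visual synthesis), but the prompt wording, the `CrashReport` type, and the `build_prompts` helper are hypothetical illustrations, not the authors' prompts.

```python
from dataclasses import dataclass

# Stage order as described in the paper; all prompt wording below is hypothetical.
STAGES = ("interpretation", "extraction", "visual_synthesis")


@dataclass
class CrashReport:
    report_id: str   # hypothetical identifier
    narrative: str   # free-text crash description from the police report


def build_prompts(report: CrashReport) -> dict[str, str]:
    """Return one prompt per stage, keyed by stage name, in application order."""
    return {
        "interpretation": (
            "Read the police crash report below and summarize what happened, "
            "including each vehicle's entry lane, path, and point of conflict.\n\n"
            f"Report {report.report_id}:\n{report.narrative}"
        ),
        "extraction": (
            "From your summary, list structured fields: vehicles, travel "
            "directions, lanes, maneuvers, and impact regions."
        ),
        "visual_synthesis": (
            "Using only the extracted fields, draw a top-down crash diagram of "
            "the multi-lane roundabout showing vehicle paths and the collision point."
        ),
    }


report = CrashReport(
    "R-001",
    "Vehicle 1 in the inner lane exited across Vehicle 2 in the outer lane.",
)
prompts = build_prompts(report)
for stage in STAGES:
    print(f"[{stage}] {prompts[stage][:60]}...")
```

Keeping the stages as separate prompts makes it possible to inspect the extracted fields before any drawing is attempted, which is where mismatches between text and diagram would first show up.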

If this is right

Crash analysis workflows could reduce time spent on manual diagram preparation.
Diagram consistency may increase across different analysts and locations.
The prompting approach offers a starting point for applying generative AI to other spatial engineering visualizations.
Models with better spatial reasoning, like GPT-4o, align extracted crash details more closely with the output diagrams.
Current performance gaps point to needed improvements in handling complex road geometries.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same prompt structure could extend to diagram generation for other intersection types beyond roundabouts.
Batch processing of historical crash data might become feasible if linked to existing safety databases.
The 10-metric evaluation could benchmark future vision-language models on transportation visualization tasks.
Wider adoption might lower variability in safety reporting and support faster policy analysis.

Load-bearing premise

The custom 10-metric evaluation system supplies an objective, reproducible measure of diagram quality that aligns with expert human judgment, especially for spatial fidelity in complex roundabout geometries.

What would settle it

Independent human experts applying the same 10 metrics to the generated diagrams produce average scores that differ markedly from the reported model results, particularly on spatial fidelity for GPT-4o.
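One way to make that comparison concrete is to score the same diagrams twice and summarize the disagreement. The sketch below computes a mean absolute difference and a Pearson correlation between two score lists; the scores are invented placeholders, and the choice of these two summary statistics is an assumption rather than anything the paper specifies.

```python
from statistics import mean


def mean_abs_diff(a: list[float], b: list[float]) -> float:
    """Average absolute gap between paired scores."""
    return mean(abs(x - y) for x, y in zip(a, b))


def pearson(a: list[float], b: list[float]) -> float:
    """Pearson correlation between paired scores."""
    ma, mb = mean(a), mean(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    var_a = sum((x - ma) ** 2 for x in a)
    var_b = sum((y - mb) ** 2 for y in b)
    return cov / (var_a * var_b) ** 0.5


# Invented placeholder totals (out of 10) for five diagrams; not data from the paper.
author_scores = [6.0, 7.5, 5.0, 8.0, 4.5]
expert_scores = [5.5, 7.0, 3.5, 8.0, 4.0]

gap = mean_abs_diff(author_scores, expert_scores)
r = pearson(author_scores, expert_scores)
print(f"mean absolute difference: {gap:.2f}")
print(f"Pearson correlation: {r:.2f}")
```

A large gap or a weak correlation on the spatial-fidelity metrics in particular would indicate that the reported scores do not track expert judgment.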

Figures

Figures reproduced from arXiv: 2604.15332 by Hao Zhen, Jidong J. Yang, Xiao Lu.

**Figure 2.** Aerial view of the study roundabout.

**Figure 3.** New York State DMV damage code diagram for vehicle impact regions.

**Figure 4.** Comparison of crash diagrams generated by Janus-4o, Gemini-1.5-Flash, and GPT-4o against the official …

**Figure 5.** Comparison of crash diagrams generated by Janus-4o (top-left), Gemini-1.5-Flash (top-right), and GPT-4o …

Original abstract

Crash diagrams are essential tools in transportation safety analysis, yet their manual preparation remains time-consuming and prone to human variability. This study investigates the use of Vision-Language Models (VLMs) to automate crash diagram generation from police crash reports, focusing on multilane roundabouts as a challenging test case. A three-part structured prompt framework was developed to guide model reasoning through interpretation, extraction, and visual synthesis, while a 10-metric evaluation system was designed to assess diagram quality in terms of semantic accuracy, spatial fidelity, and visual clarity. Three popular models, including GPT-4o, Gemini-1.5-Flash, and Janus-4o, were tested on 79 crash reports. GPT-4o achieved the highest average performance (6.29 out of 10), followed by Gemini-1.5-Flash (5.28) and Janus-4o (3.64). The analysis revealed GPT-4o's superior spatial reasoning and alignment between extracted and visualized crash data. These results highlight both the promise and current limitations of VLMs in engineering visualization tasks. The study lays the groundwork for integrating generative AI into crash analysis workflows to improve efficiency, consistency, and interpretability.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

VLMs can sketch basic crash diagrams from reports with middling results, but the custom 10-metric scores lack any validation so the performance claims stay tentative.

The paper tests three VLMs on turning 79 police reports into crash diagrams for multi-lane roundabouts. The authors use a three-part prompt to handle interpretation, data extraction, and drawing, then score the outputs on a 10-metric scale for semantic fit, spatial layout, and clarity. GPT-4o leads at 6.29/10, ahead of Gemini-1.5-Flash and Janus-4o.

This is the first reported case study on this exact task, and it gives a practical sense of what current models can do without custom training. The numbers on real reports are straightforward to read and show the models are at least producing recognizable diagrams rather than nonsense. That alone makes the work worth a look for anyone tracking AI in transportation safety.

The main weakness is the evaluation itself. The metric definitions, the scoring rules, and the identity of the scorers are not shown, and there is no check against expert raters or side-by-side comparison with hand-drawn references. Without those steps, the 6.29 figure could reflect how well the model followed the prompt rather than true spatial accuracy. That concern holds up on a close reading of the text.

This is for researchers who want an early benchmark on VLM use in crash analysis or similar visualization tasks. It is not yet strong enough for someone needing production-ready automation. Send it to peer review; the experiment is simple and the domain is useful, but the methods section needs the missing validation details to be taken seriously.

Referee Report

2 major / 2 minor

Summary. This paper investigates automating crash diagram generation from police reports using Vision-Language Models (VLMs), with a focus on multi-lane roundabouts as a challenging case. It proposes a three-part structured prompt framework for interpretation, extraction, and visual synthesis, and introduces a custom 10-metric evaluation system for semantic accuracy, spatial fidelity, and visual clarity. Three models (GPT-4o, Gemini-1.5-Flash, Janus-4o) are tested on 79 reports, with GPT-4o achieving the highest average score of 6.29/10 and showing superior spatial reasoning.

Significance. If the evaluation proves reproducible and aligned with expert judgment, the work could meaningfully advance AI-assisted tools in transportation safety by reducing manual effort and variability in crash diagram preparation. The structured prompt approach and real-world roundabout test cases offer a practical benchmark for VLM capabilities in engineering visualization, highlighting both promise and current limits in spatial tasks.

major comments (2)

[Evaluation methodology and results] The 10-metric evaluation system (described in the methods and results sections) is load-bearing for all performance claims, including the headline scores (GPT-4o at 6.29/10, Gemini-1.5-Flash at 5.28, Janus-4o at 3.64) and the conclusion of superior spatial reasoning. However, the manuscript provides no definitions or weighting for the metrics, no description of the scoring procedure or who performed the scoring, and no inter-rater reliability statistics or validation against expert human judgments of diagram correctness. This leaves open whether scores reflect actual spatial fidelity or simply prompt compliance.
[Data and experimental setup] The selection criteria for the 79 crash reports, any blinding procedures, and statistical tests for differences between model scores are not reported. Without these, the comparative claims cannot be assessed for robustness or generalizability beyond the specific sample.
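The inter-rater reliability statistic requested in the first major comment could take several forms. As one illustration, the sketch below computes Cohen's kappa for two raters who each give a pass/fail (1/0) judgment on the same metric across ten diagrams. The binary rating scheme, the rater data, and the choice of kappa over alternatives such as an intraclass correlation are all assumptions made for this example.

```python
from collections import Counter


def cohens_kappa(rater1: list[int], rater2: list[int]) -> float:
    """Cohen's kappa for two raters assigning categorical labels to the same items."""
    if len(rater1) != len(rater2) or not rater1:
        raise ValueError("raters must label the same non-empty set of items")
    n = len(rater1)
    # Observed agreement: share of items where both raters gave the same label.
    observed = sum(a == b for a, b in zip(rater1, rater2)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    c1, c2 = Counter(rater1), Counter(rater2)
    expected = sum(c1[k] * c2[k] for k in set(c1) | set(c2)) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Invented placeholder ratings for ten diagrams; not data from the paper.
rater_a = [1, 1, 0, 1, 0, 1, 0, 0, 1, 1]
rater_b = [1, 0, 0, 1, 0, 1, 1, 0, 1, 1]

kappa = cohens_kappa(rater_a, rater_b)
print(f"Cohen's kappa: {kappa:.3f}")
```

Kappa corrects raw agreement for the agreement expected by chance, so two raters who agree on 8 of 10 items can still show only moderate reliability when both mark most items as passing.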

minor comments (2)

Clarify the model name 'Janus-4o' with a citation or description, as it is less commonly referenced than GPT-4o or Gemini-1.5-Flash.
Ensure all figures of generated diagrams include scale bars or annotations to allow readers to directly assess spatial fidelity claims.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed comments, which highlight important aspects of reproducibility and methodological transparency. We have prepared point-by-point responses below and will revise the manuscript to incorporate additional details where feasible.

Referee: [Evaluation methodology and results] The 10-metric evaluation system (described in the methods and results sections) is load-bearing for all performance claims, including the headline scores (GPT-4o at 6.29/10, Gemini-1.5-Flash at 5.28, Janus-4o at 3.64) and the conclusion of superior spatial reasoning. However, the manuscript provides no definitions or weighting for the metrics, no description of the scoring procedure or who performed the scoring, and no inter-rater reliability statistics or validation against expert human judgments of diagram correctness. This leaves open whether scores reflect actual spatial fidelity or simply prompt compliance.

Authors: We agree that the current manuscript lacks sufficient detail on the evaluation methodology, which is essential for interpreting the reported scores. In the revised version, we will expand the methods section to include: (1) explicit definitions and descriptions for each of the 10 metrics, grouped under semantic accuracy, spatial fidelity, and visual clarity; (2) the scoring procedure, with each metric rated on a 0-1 scale (yielding a total out of 10) and any weighting applied; (3) clarification that scoring was performed by the authors, who have domain expertise in transportation safety; and (4) an explicit statement acknowledging the absence of inter-rater reliability statistics as a limitation, along with plans to incorporate expert validation in follow-up work. These additions will help demonstrate that the scores target actual diagram quality rather than prompt adherence alone. revision: yes
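The scoring scheme described in this simulated response (ten metrics, each rated from 0 to 1 and summed to a total out of 10) can be sketched as follows. The metric names are placeholders, equal weighting is an assumption, and the example values are invented; none of this is taken from the paper itself.

```python
from statistics import mean

# Placeholder metric names; the paper's actual metric definitions are not given here.
METRICS = [f"metric_{i}" for i in range(1, 11)]


def total_score(metric_scores: dict[str, float]) -> float:
    """Sum ten per-metric scores, each in [0, 1], into a total out of 10."""
    if set(metric_scores) != set(METRICS):
        raise ValueError("expected exactly the ten named metrics")
    for name, value in metric_scores.items():
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be between 0 and 1, got {value}")
    return sum(metric_scores.values())


def average_score(diagrams: list[dict[str, float]]) -> float:
    """Average the per-diagram totals across all scored diagrams."""
    return mean(total_score(d) for d in diagrams)


# Invented placeholder ratings for two diagrams.
diagram_a = {name: 0.5 for name in METRICS}                      # total 5.0
diagram_b = {name: (1.0 if i < 7 else 0.0)                       # total 7.0
             for i, name in enumerate(METRICS)}

print(total_score(diagram_a), total_score(diagram_b))
print(average_score([diagram_a, diagram_b]))
```

Under unequal weighting, `total_score` would multiply each value by its weight before summing, with the weights chosen so that the maximum stays at 10.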
Referee: [Data and experimental setup] The selection criteria for the 79 crash reports, any blinding procedures, and statistical tests for differences between model scores are not reported. Without these, the comparative claims cannot be assessed for robustness or generalizability beyond the specific sample.

Authors: We acknowledge the need for greater transparency in the experimental setup. The revised methods section will specify the selection criteria for the 79 reports (drawn from publicly available police crash reports involving multi-lane roundabouts, filtered for sufficient textual detail on vehicle positions and maneuvers). We will also note that no formal blinding procedures were employed, as the evaluation relied on direct author review of model outputs. Finally, we will add statistical comparisons (e.g., one-way ANOVA with post-hoc tests) to assess differences between model scores, reporting p-values and effect sizes in the results. These revisions will allow readers to better evaluate the robustness and scope of the findings. revision: yes
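For reference, the one-way ANOVA mentioned in this response reduces to comparing between-group and within-group variance. The sketch below computes the F statistic and eta squared (an effect size) in plain Python on invented placeholder scores for three models; it omits the p-value and post-hoc tests, which a statistics library such as SciPy (`scipy.stats.f_oneway`) would normally supply.

```python
from statistics import mean


def one_way_anova(groups: list[list[float]]) -> tuple[float, float]:
    """Return (F statistic, eta squared) for a one-way ANOVA over the groups."""
    all_values = [v for g in groups for v in g]
    grand_mean = mean(all_values)
    # Variation of group means around the grand mean, weighted by group size.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Variation of individual scores around their own group mean.
    ss_within = sum((v - mean(g)) ** 2 for g in groups for v in g)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    eta_squared = ss_between / (ss_between + ss_within)
    return f_stat, eta_squared


# Invented placeholder scores per model; not data from the paper.
janus = [3.0, 4.0, 5.0]
gemini = [4.0, 5.0, 6.0]
gpt4o = [7.0, 8.0, 9.0]

f_stat, eta_sq = one_way_anova([janus, gemini, gpt4o])
print(f"F = {f_stat:.2f}, eta squared = {eta_sq:.4f}")
```

Because all three models would be scored on the same 79 reports, a repeated-measures design may fit the data better than an independent-groups ANOVA; that choice is left open here.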

Circularity Check

0 steps flagged

No significant circularity; evaluation uses external references

The paper describes an empirical comparison of VLM-generated crash diagrams against human-prepared reference diagrams using a custom 10-metric rubric for semantic accuracy, spatial fidelity, and visual clarity. No mathematical derivations, fitted parameters, predictions, or self-citations are present that reduce any result to the inputs by construction. The performance scores (e.g., GPT-4o at 6.29/10) are computed directly from the external metric applied to model outputs versus references, with no self-definitional loops or load-bearing self-citations. This is a standard empirical setup without circular reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the domain assumption that existing VLMs possess sufficient spatial and semantic reasoning for diagram synthesis when guided by the authors' prompt framework; no free parameters are fitted and no new entities are introduced.

axioms (1)

domain assumption: VLMs can reliably interpret textual crash descriptions and synthesize accurate visual diagrams when given a structured three-part prompt.
Invoked as the basis for the entire automation pipeline without independent verification outside the reported experiments.

pith-pipeline@v0.9.0 · 5526 in / 1289 out tokens · 43073 ms · 2026-05-15T14:44:24.298642+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

Paper passage: A three-part structured prompt framework was developed to guide model reasoning through interpretation, extraction, and visual synthesis, while a 10-metric evaluation system was designed to assess diagram quality in terms of semantic accuracy, spatial fidelity, and visual clarity.

IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · reality_from_one_distinction · unclear

Paper passage: GPT-4o achieved the highest average performance (6.29 out of 10)

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

15 extracted references · 15 canonical work pages · 2 internal anchors

[1] D. Fernandez, P. MohajerAnsari, A. Salarpour, and M. Pesé. Avoiding the crash: a vision-language model evaluation of critical traffic scenarios. SAE Int. J. Adv. Curr. Prac. in Mobility, 7:2255–2266, 2025. doi:10.4271/2025-01-8213

[2] S. Jaradat, N. Acharya, S. Shivshankar, T. Alhadidi, and M. Elhenawy. AI for data quality auditing: detecting mislabeled work zone crashes using large language models. Algorithms, 18:317, 2025. doi:10.3390/a18060317

[3] UC Berkeley SafeTREC. Transportation injury mapping system (TIMS). UC Regents, 2025. URL: https://tims.berkeley.edu

[4] PdMagic. Crash magic online. Pd’ Programming, Inc., 2025. URL: https://www.pdmagic.com

[5] AASHTOWare. AASHTOWare safety intersection. American Association of State Highway and Transportation Officials, 2025. URL: https://www.aashtoware.org/products/safety/aashtoware-safety-intersection

[6] H. Zhen and J. Yang. Tab-text: bridging tabular data and natural language for enhanced traffic safety analysis and modeling. Expert Syst. Appl., 290:128450, 2025. doi:10.1016/j.eswa.2025.128450

[7] H. Zhen, Y. Shi, Y. Huang, J. Yang, and N. Liu. Leveraging large language models with chain-of-thought and prompt engineering for traffic crash severity analysis and inference. Computers, 13:232, 2024. doi:10.3390/computers13090232

[8] H. Zhen and J. Yang. Crashsage: a large language model-centered framework for contextual and interpretable traffic crash analysis. Artificial Intelligence for Transportation, 3–4:100030, 2025. doi:10.1016/j.ait.2025.100030

[9] S. Akter, I. Shihab, and A. Sharma. Large language models for crash detection in video: a survey of methods, datasets, and challenges. 2025. arXiv:2507.02074, doi:10.48550/arXiv.2507.02074

[10] X. Cao, T. Zhou, Y. Ma, W. Ye, C. Cui, K. Tang, et al. MapLM: a real-world large-scale vision-language benchmark for map and traffic scene understanding. In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 21819–21830, 2024. doi:10.1109/CVPR52733.2024.02061

[11] H. Ding, Y. Du, and Z. Xia. Urban road anomaly monitoring using vision-language models for enhanced safety management. Appl. Sci., 15:2517, 2025. doi:10.3390/app15052517

[12] OpenAI. GPT-4o system card. Technical report, OpenAI, 2024. URL: https://arxiv.org/pdf/2410.21276

[13] R. Anil, S. Borgeaud, J. Alayrac, J. Yu, R. Soricut, J. Schalkwyk, et al. Gemini: a family of highly capable multimodal models. 2023. arXiv:2312.11805, doi:10.48550/arXiv.2312.11805

[14] J. Chen, Z. Cai, P. Chen, S. Chen, K. Ji, X. Wang, et al. ShareGPT-4o-Image: aligning multimodal models with GPT-4o-level image generation. 2025. arXiv:2506.18095, doi:10.48550/arXiv.2506.18095

[15] A. Medina, J. Bansen, B. Williams, A. Pochowski, L. Rodegerdts, J. Markosian, et al. Reasons for drivers failing to yield at multi-lane roundabout exits: transportation pooled fund study final report. Technical Report FHWA-HRT-23-023, 2023. URL: https://rosap.ntl.bts.gov/view/dot/66498
