
arXiv: 2604.15332 · v1 · submitted 2026-03-09 · 💻 cs.HC · cs.AI · cs.CV · cs.SE

Automating Crash Diagram Generation Using Vision-Language Models: A Case Study on Multi-Lane Roundabouts

Pith reviewed 2026-05-15 14:44 UTC · model grok-4.3

classification 💻 cs.HC · cs.AI · cs.CV · cs.SE
keywords vision-language models · crash diagram generation · multi-lane roundabouts · police crash reports · GPT-4o · transportation safety · spatial reasoning · automation

The pith

Vision-language models can generate usable crash diagrams from police reports for multi-lane roundabouts.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper tests whether vision-language models can automate crash diagram creation from written police reports, using multi-lane roundabouts as a difficult test case. The authors built a three-part prompt structure to guide models through report interpretation, data extraction, and diagram synthesis, paired with a 10-metric scoring system that checks semantic accuracy, spatial fidelity, and visual clarity. On 79 real reports, GPT-4o scored highest at 6.29 out of 10, outperforming Gemini-1.5-Flash and Janus-4o in spatial reasoning and data alignment. A reader would care because manual diagrams take significant time and vary by person, so reliable automation could speed up and standardize safety analysis in transportation engineering.

Core claim

The central claim is that vision-language models, led by GPT-4o, can interpret police crash reports and produce diagrams for multi-lane roundabouts when guided by a three-part structured prompt framework. GPT-4o reaches an average score of 6.29 out of 10 on a custom 10-metric evaluation system that assesses semantic accuracy, spatial fidelity, and visual clarity, and it shows stronger alignment between extracted data and generated visuals than the other models tested.

What carries the argument

A three-part structured prompt framework that guides models through interpretation of the crash report, extraction of relevant data, and visual synthesis of the diagram.
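As a rough illustration of what such a framework could look like in code, the sketch below assembles a single prompt from three stages. Only the stage names (interpretation, extraction, visual synthesis) come from the paper as summarized here; the `PromptStage` class, the `build_prompt` function, the instruction wording, and the sample report sentence are hypothetical and are not the authors' prompt text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PromptStage:
    name: str         # stage label taken from the paper's description
    instruction: str  # hypothetical wording, not the authors' prompt text


# Stage names follow the paper; the instruction text is illustrative only.
STAGES = (
    PromptStage("interpretation", "Read the crash report and summarize what happened."),
    PromptStage("extraction", "List each vehicle, its lane, its maneuver, and the point of impact."),
    PromptStage("visual synthesis", "Draw the crash diagram on the roundabout layout using the extracted data."),
)


def build_prompt(report_text: str) -> str:
    """Join the three stages and the report into a single prompt string."""
    parts = [
        f"Step {i}. {stage.name.title()}: {stage.instruction}"
        for i, stage in enumerate(STAGES, start=1)
    ]
    parts.append("Crash report:\n" + report_text.strip())
    return "\n\n".join(parts)


# Made-up report sentence, used only to exercise the function.
prompt = build_prompt("Vehicle 1 failed to yield while entering the roundabout.")
```

Keeping the stages as data rather than one hard-coded string would make it easy to reword or reorder a stage and compare the effect on output quality.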

If this is right

  • Crash analysis workflows could reduce time spent on manual diagram preparation.
  • Diagram consistency may increase across different analysts and locations.
  • The prompting approach offers a starting point for applying generative AI to other spatial engineering visualizations.
  • Models with better spatial reasoning, like GPT-4o, align extracted crash details more closely with the output diagrams.
  • Current performance gaps point to needed improvements in handling complex road geometries.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same prompt structure could extend to diagram generation for other intersection types beyond roundabouts.
  • Batch processing of historical crash data might become feasible if linked to existing safety databases.
  • The 10-metric evaluation could benchmark future vision-language models on transportation visualization tasks.
  • Wider adoption might lower variability in safety reporting and support faster policy analysis.

Load-bearing premise

The custom 10-metric evaluation system supplies an objective, reproducible measure of diagram quality that aligns with expert human judgment, especially for spatial fidelity in complex roundabout geometries.
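To make the premise concrete, here is a minimal sketch of how a 10-metric rubric could be aggregated into a score out of 10. It assumes each metric is rated from 0 to 1 with equal weights, which is what the simulated rebuttal further down this page describes. The three group names come from the paper; the individual metric names, their split across groups, and the toy ratings are hypothetical.

```python
from statistics import mean

# Hypothetical metric names; the paper's actual ten metrics are not listed on this page.
# Only the three group names come from the source.
RUBRIC = {
    "semantic accuracy": ["vehicle_count", "vehicle_types", "maneuvers", "collision_type"],
    "spatial fidelity": ["lane_position", "travel_direction", "impact_point"],
    "visual clarity": ["labels", "symbols", "layout"],
}


def total_score(ratings: dict[str, float]) -> float:
    """Sum ten per-metric ratings, each in [0, 1], into a score out of 10 (equal weights assumed)."""
    expected = [m for metrics in RUBRIC.values() for m in metrics]
    missing = set(expected) - set(ratings)
    if missing:
        raise ValueError(f"missing ratings: {sorted(missing)}")
    for name in expected:
        if not 0.0 <= ratings[name] <= 1.0:
            raise ValueError(f"{name} must be in [0, 1]")
    return sum(ratings[name] for name in expected)


def group_means(ratings: dict[str, float]) -> dict[str, float]:
    """Average rating per group, to show where a diagram loses points."""
    return {group: mean(ratings[m] for m in metrics) for group, metrics in RUBRIC.items()}


# Toy ratings for one diagram (made up for illustration, not data from the paper).
example = {m: 0.5 for metrics in RUBRIC.values() for m in metrics}
example["lane_position"] = 1.0
```

Reporting the per-group means alongside the total would show whether a model's score comes from spatial fidelity or from the other two groups.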

What would settle it

Independent human experts applying the same 10 metrics to the generated diagrams produce average scores that differ markedly from the reported model results, particularly on spatial fidelity for GPT-4o.
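One simple way to run that check is to compare the original per-diagram scores with scores from an independent rater for the same diagrams. The sketch below is only an assumption about how such a comparison might be set up: the function name `compare_raters` and the toy scores are hypothetical, and it covers mean-level differences only, not a full reliability analysis.

```python
from statistics import mean


def compare_raters(original: list[float], independent: list[float]) -> dict[str, float]:
    """Compare two sets of per-diagram scores for the same diagrams, in the same order."""
    if len(original) != len(independent) or not original:
        raise ValueError("need two non-empty score lists of equal length")
    diffs = [b - a for a, b in zip(original, independent)]
    return {
        "mean_original": mean(original),
        "mean_independent": mean(independent),
        # Signed bias of the independent rater relative to the original scores.
        "mean_difference": mean(diffs),
        # Typical size of disagreement, ignoring direction.
        "mean_abs_difference": mean(abs(d) for d in diffs),
    }


# Toy scores out of 10 for four diagrams (made up, not data from the paper).
result = compare_raters([6.0, 7.0, 5.0, 6.0], [5.0, 6.0, 5.0, 4.0])
```

A large signed difference would suggest systematic over- or under-scoring, while a large absolute difference with a small signed one would suggest the raters disagree diagram by diagram.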

Figures

Figures reproduced from arXiv: 2604.15332 by Hao Zhen, Jidong J. Yang, Xiao Lu.

Figure 1: Crash diagram generation workflow using drone-based aerial imagery and vision-language models.
Figure 2: Aerial view of the study roundabout.
Figure 3: New York state DMV damage code diagram for vehicle impact regions.
Figure 4: Comparison of crash diagrams generated by Janus-4o, Gemini-1.5-Flash, and GPT-4o against the official …
Figure 5: Comparison of crash diagrams generated by Janus-4o (top-left), Gemini-1.5-Flash (top-right), and GPT-4o …
Original abstract

Crash diagrams are essential tools in transportation safety analysis, yet their manual preparation remains time-consuming and prone to human variability. This study investigates the use of Vision-Language Models (VLMs) to automate crash diagram generation from police crash reports, focusing on multilane roundabouts as a challenging test case. A three-part structured prompt framework was developed to guide model reasoning through interpretation, extraction, and visual synthesis, while a 10-metric evaluation system was designed to assess diagram quality in terms of semantic accuracy, spatial fidelity, and visual clarity. Three popular models, including GPT-4o, Gemini-1.5-Flash, and Janus-4o, were tested on 79 crash reports. GPT-4o achieved the highest average performance (6.29 out of 10), followed by Gemini-1.5-Flash (5.28) and Janus-4o (3.64). The analysis revealed GPT-4o's superior spatial reasoning and alignment between extracted and visualized crash data. These results highlight both the promise and current limitations of VLMs in engineering visualization tasks. The study lays the groundwork for integrating generative AI into crash analysis workflows to improve efficiency, consistency, and interpretability.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. This paper investigates automating crash diagram generation from police reports using Vision-Language Models (VLMs), with a focus on multi-lane roundabouts as a challenging case. It proposes a three-part structured prompt framework for interpretation, extraction, and visual synthesis, and introduces a custom 10-metric evaluation system for semantic accuracy, spatial fidelity, and visual clarity. Three models (GPT-4o, Gemini-1.5-Flash, Janus-4o) are tested on 79 reports; GPT-4o achieves the highest average score (6.29/10) and is reported to show superior spatial reasoning.

Significance. If the evaluation proves reproducible and aligned with expert judgment, the work could meaningfully advance AI-assisted tools in transportation safety by reducing manual effort and variability in crash diagram preparation. The structured prompt approach and real-world roundabout test cases offer a practical benchmark for VLM capabilities in engineering visualization, highlighting both promise and current limits in spatial tasks.

major comments (2)
  1. [Evaluation methodology and results] The 10-metric evaluation system (described in the methods and results sections) is load-bearing for all performance claims, including the headline scores (GPT-4o at 6.29/10, Gemini-1.5-Flash at 5.28, Janus-4o at 3.64) and the conclusion of superior spatial reasoning. However, the manuscript provides no definitions or weighting for the metrics, no description of the scoring procedure or who performed the scoring, and no inter-rater reliability statistics or validation against expert human judgments of diagram correctness. This leaves open whether scores reflect actual spatial fidelity or simply prompt compliance.
  2. [Data and experimental setup] The selection criteria for the 79 crash reports, any blinding procedures, and statistical tests for differences between model scores are not reported. Without these, the comparative claims cannot be assessed for robustness or generalizability beyond the specific sample.
minor comments (2)
  1. Clarify the model name 'Janus-4o' with a citation or description, as it is less commonly referenced than GPT-4o or Gemini-1.5-Flash.
  2. Ensure all figures of generated diagrams include scale bars or annotations to allow readers to directly assess spatial fidelity claims.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed comments, which highlight important aspects of reproducibility and methodological transparency. We have prepared point-by-point responses below and will revise the manuscript to incorporate additional details where feasible.

  1. Referee: [Evaluation methodology and results] The 10-metric evaluation system (described in the methods and results sections) is load-bearing for all performance claims, including the headline scores (GPT-4o at 6.29/10, Gemini-1.5-Flash at 5.28, Janus-4o at 3.64) and the conclusion of superior spatial reasoning. However, the manuscript provides no definitions or weighting for the metrics, no description of the scoring procedure or who performed the scoring, and no inter-rater reliability statistics or validation against expert human judgments of diagram correctness. This leaves open whether scores reflect actual spatial fidelity or simply prompt compliance.

    Authors: We agree that the current manuscript lacks sufficient detail on the evaluation methodology, which is essential for interpreting the reported scores. In the revised version, we will expand the methods section to include: (1) explicit definitions and descriptions for each of the 10 metrics, grouped under semantic accuracy, spatial fidelity, and visual clarity; (2) the scoring procedure, with each metric rated on a 0-1 scale (yielding a total out of 10) and any weighting applied; (3) clarification that scoring was performed by the authors, who have domain expertise in transportation safety; and (4) an explicit statement acknowledging the absence of inter-rater reliability statistics as a limitation, along with plans to incorporate expert validation in follow-up work. These additions will help demonstrate that the scores target actual diagram quality rather than prompt adherence alone. revision: yes

  2. Referee: [Data and experimental setup] The selection criteria for the 79 crash reports, any blinding procedures, and statistical tests for differences between model scores are not reported. Without these, the comparative claims cannot be assessed for robustness or generalizability beyond the specific sample.

    Authors: We acknowledge the need for greater transparency in the experimental setup. The revised methods section will specify the selection criteria for the 79 reports (drawn from publicly available police crash reports involving multi-lane roundabouts, filtered for sufficient textual detail on vehicle positions and maneuvers). We will also note that no formal blinding procedures were employed, as the evaluation relied on direct author review of model outputs. Finally, we will add statistical comparisons (e.g., one-way ANOVA with post-hoc tests) to assess differences between model scores, reporting p-values and effect sizes in the results. These revisions will allow readers to better evaluate the robustness and scope of the findings. revision: yes
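The statistical comparison mentioned in the second response can be sketched as follows. This is a plain-Python computation of the one-way ANOVA F statistic, written under the assumption that each model's scores are treated as an independent group; the function name and the toy scores are hypothetical, and nothing here reproduces the paper's data. It does not compute a p-value or post-hoc tests; a library routine such as SciPy's `scipy.stats.f_oneway` returns the p-value as well.

```python
from statistics import mean


def one_way_anova_f(groups: list[list[float]]) -> tuple[float, int, int]:
    """Return (F statistic, between-group df, within-group df) for a one-way ANOVA."""
    k = len(groups)
    n = sum(len(g) for g in groups)
    if k < 2 or n <= k:
        raise ValueError("need at least two groups and more observations than groups")
    grand_mean = mean(x for g in groups for x in g)
    # Variation of group means around the grand mean, weighted by group size.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Variation of individual scores around their own group mean.
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    if ss_within == 0:
        raise ValueError("no within-group variation; F is undefined")
    df_between, df_within = k - 1, n - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Toy scores for three models on three reports each (made up, not the paper's data).
toy_scores = [[6.0, 7.0, 8.0], [5.0, 6.0, 7.0], [3.0, 4.0, 5.0]]
f_stat, df_b, df_w = one_way_anova_f(toy_scores)
```

Because all three models were scored on the same 79 reports, a paired or repeated-measures design may be a better fit than independent groups; the sketch only illustrates the test the response names.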

Circularity Check

0 steps flagged

No significant circularity; evaluation uses external references


The paper describes an empirical comparison of VLM-generated crash diagrams against human-prepared reference diagrams using a custom 10-metric rubric for semantic accuracy, spatial fidelity, and visual clarity. No mathematical derivations, fitted parameters, predictions, or self-citations are present that reduce any result to the inputs by construction. The performance scores (e.g., GPT-4o at 6.29/10) are computed directly from the external metric applied to model outputs versus references, with no self-definitional loops or load-bearing self-citations. This is a standard empirical setup without circular reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that existing VLMs possess sufficient spatial and semantic reasoning for diagram synthesis when guided by the authors' prompt framework; no free parameters are fitted and no new entities are introduced.

axioms (1)
  • Domain assumption: VLMs can reliably interpret textual crash descriptions and synthesize accurate visual diagrams when given a structured three-part prompt.
    Invoked as the basis for the entire automation pipeline without independent verification outside the reported experiments.



