DriveSafe: A Framework for Risk Detection and Safety Suggestions in Driving Scenarios

Avijit Dasgupta; C. V. Jawahar; Sainithin Artham; Shankar Gangisetty

arxiv: 2605.16892 · v1 · pith:IZFSFNTTnew · submitted 2026-05-16 · 💻 cs.CV · cs.AI · cs.CL

Sainithin Artham, Shankar Gangisetty, Avijit Dasgupta, C. V. Jawahar

Pith reviewed 2026-05-19 21:36 UTC · model grok-4.3

classification 💻 cs.CV · cs.AI · cs.CL

keywords driving risk detection · spatially grounded captions · multimodal large language models · autonomous vehicles · safety suggestions · DRAMA benchmark · scene understanding

The pith

DriveSafe improves driving risk assessment by conditioning it on explicit language-based scene representations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that zero-shot multimodal large language models underperform in fine-grained, spatially grounded risk assessment for driving compared to domain-specific methods. DriveSafe addresses this by first generating spatially grounded captions that include motion, spatial, and depth cues. These descriptions then feed into risk assessment to identify hazards and suggest safety actions. A lightweight adapter is fine-tuned on caption-risk pairings to add domain knowledge. This leads to state-of-the-art results on the DRAMA benchmark, suggesting a path for more reliable situational awareness in autonomous vehicles.

Core claim

Our method first generates spatially grounded captions enriched with multimodal context, including motion, spatial, and depth cues. These captions are then used for downstream risk assessment, explicitly identifying hazardous objects, their locations, and the unsafe behaviors they imply, followed by actionable safety suggestions. To further improve performance, we employ caption-risk pairings to fine-tune a lightweight adapter module, efficiently injecting domain-specific knowledge into the base LLM. By conditioning risk assessment on explicit language-based scene representations, DriveSafe achieves significant gains over both zero-shot MLLMs and prior domain-specific baselines.
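The two-stage flow described above (caption first, then risk assessment conditioned on that caption) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the prompt wording, the `RiskReport` container, and the helpers `build_caption_prompt`, `build_risk_prompt`, and `assess` are all hypothetical, and `llm` stands in for any text-in, text-out model call.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class RiskReport:
    caption: str    # stage-1 output: language description of the scene
    risk_text: str  # stage-2 output: risk assessment plus safety suggestion


def build_caption_prompt(cues: List[str]) -> str:
    """Join per-modality cue strings into one captioning prompt (hypothetical wording)."""
    lines = [f"- {cue}" for cue in cues]
    return "Describe the driving scene using these cues:\n" + "\n".join(lines)


def build_risk_prompt(caption: str) -> str:
    """Wrap the generated caption in a risk-assessment prompt (hypothetical wording)."""
    return (
        "Scene description:\n" + caption + "\n"
        "Name the risky object, its location, the unsafe behavior, "
        "and one safety suggestion."
    )


def assess(cues: List[str], llm: Callable[[str], str]) -> RiskReport:
    """Run both stages; the second call sees only the caption, not the raw cues."""
    caption = llm(build_caption_prompt(cues))
    risk_text = llm(build_risk_prompt(caption))
    return RiskReport(caption=caption, risk_text=risk_text)
```

The point of the sketch is the dependency structure: because the second call receives only the caption, any hazard the caption omits is invisible to the risk stage, which is why caption fidelity matters for the overall claim.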

What carries the argument

Spatially grounded captions enriched with motion, spatial, and depth cues that serve as explicit language-based scene representations for risk assessment and safety suggestion generation.

If this is right

Significant gains over zero-shot MLLMs and prior domain-specific baselines in risk assessment.
State-of-the-art performance on the DRAMA benchmark for driving scenarios.
Validation of key design choices through ablation studies on caption generation and adapter fine-tuning.
Actionable safety suggestions generated after identifying hazardous objects and unsafe behaviors.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar language-mediated intermediate representations could enhance interpretability in other vision-based safety systems such as surveillance or robotics.
Optimizing the caption generation for lower latency could enable real-time deployment in moving vehicles.
Pairing this method with direct sensor inputs might create hybrid systems that combine linguistic clarity with raw data precision.

Load-bearing premise

That generating spatially grounded captions enriched with multimodal context will provide sufficient and accurate information to enable superior risk assessment compared to direct zero-shot use of MLLMs.

What would settle it

A direct comparison on the DRAMA benchmark showing that DriveSafe does not outperform zero-shot MLLMs or prior baselines would falsify the central performance claim.

Figures

Figures reproduced from arXiv: 2605.16892 by Avijit Dasgupta, C. V. Jawahar, Sainithin Artham, Shankar Gangisetty.

**Figure 1.** Previous works in driving scenarios [9], [10], [11] primarily address risk perception but fall short of offering actionable safety guidance. Similarly, general-purpose MLLMs [12], [13], [14] are still unreliable in this regard. In contrast, our approach, DriveSafe, integrates risk assessment with clear, human-understandable safety suggestions.

… as RAIN [6] emphasize risk-aware trajectory prediction by highl…

**Figure 2.** Prompt template P used to guide the LLM in generating structured, spatially grounded summaries of driving scenes.

… assessment based on driving behaviors that violate safety commonsense [26]. B. MLLMs in Autonomous Driving. MLLMs have recently garnered significant interest for their ability to analyze non-textual modalities, such as images and point clouds, through language-based reasoning [27], [28], [29], …

**Figure 3.** Our proposed DriveSafe framework for the caption generation and safety suggestion task in driving. We first derive …

**Figure 4.** Risk-aware prompt template consistent with notation. The LLM $F_\theta$ maps $C_v$ to $(\hat{r}, C_r, \hat{K}, \hat{b})$.

… $F_\theta$. The model then fuses these modalities to produce a geometry-aware description of the video: $C_v = F_\theta(P(X_v))$ (2), where $C_v$ denotes the generated caption for the sequence. B. Risk Assessment and Safety Suggestion. Zero-Shot: Given the geometric-aware caption $C_v$, we prompt the LLM $F_\theta$ using a structured template as …

**Figure 5.** Distribution of driving decision categories in the curated test set. Each bar corresponds to a safety suggestion category, showing the aggregated frequency of its representative keywords.

… as LCP [9], VTS [10], LLaVA-v1.5 [18] and HoP [11]. In addition, we also compare with several open source MLLMs such as Qwen2.5 [12], LLaVA-NeXT [13], and VideoLLaMA 3 [14] to assess their performance on risk prediction, …

**Figure 6.** Qualitative comparison of DriveSafe-ZeroShot, DriveSafe-Finetuned, and Qwen2.5-VL [12] on three driving scenarios from the DRAMA dataset [9]. Risky-object grounding is shown with bounding boxes, and each model's output with text highlighting, while generated captions and safety suggestions are marked as correct (green) or incorrect (red).

On an NVIDIA A6000 GPU, LLaMA-Adapter 3.1 (8B) achieves a per-token l…

**Figure 7.** Qualitative comparison of safety suggestions across Ground-truth, Qwen2.5-VL, and DriveSafe-Finetuned in different challenging driving scenarios.

**Figure 8.** Top-to-bottom sequence with DriveSafe-Finetuned and Ground-Truth predictions; safety suggestions appear top-left.

… than LLaVA-NeXT [13], underscoring its stronger temporal grounding and video understanding. In the LLM-wise comparison (VLM fixed to Qwen2.5-VL), LLaMA-3.1 provides consistently strong alignment, while DeepSeek [44] yields a 55% lower METEOR but a 13% higher CLAIR score. Since CLAIR [39] relie…

Original abstract

Comprehensive situational awareness is essential for autonomous vehicles operating in safety-critical environments, as it enables the identification and mitigation of potential risks. Although recent Multimodal Large Language Models (MLLMs) have shown promise on general vision-language tasks, our findings indicate that zero-shot MLLMs still underperform compared to domain-specific methods in fine-grained, spatially grounded risk assessment. To address this gap, we propose DriveSafe, a framework for risk-aware scene understanding that leverages structured natural language descriptions. Specifically, our method first generates spatially grounded captions enriched with multimodal context, including motion, spatial, and depth cues. These captions are then used for downstream risk assessment, explicitly identifying hazardous objects, their locations, and the unsafe behaviors they imply, followed by actionable safety suggestions. To further improve performance, we employ caption-risk pairings to fine-tune a lightweight adapter module, efficiently injecting domain-specific knowledge into the base LLM. By conditioning risk assessment on explicit language-based scene representations, DriveSafe achieves significant gains over both zero-shot MLLMs and prior domain-specific baselines. Exhaustive experiments on the DRAMA benchmark demonstrate state-of-the-art performance, while ablation studies validate the effectiveness of our key design choices. Project page: https://cvit.iiit.ac.in/research/projects/cvit-projects/drivesafe

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

DriveSafe routes driving scenes through enriched captions before risk assessment and adapter tuning, but the SOTA claims rest on unshown numbers.

The main point is that this paper builds a pipeline where driving images first get turned into spatially grounded captions that include motion, depth, and location cues, then those captions feed into risk detection and safety suggestions, with a lightweight adapter fine-tuned on the caption-risk pairs to add domain knowledge. That chaining step is the concrete addition over plain zero-shot MLLM use in driving scenes. It targets a practical gap where general models miss fine-grained hazards, and the approach keeps the language representation explicit so downstream steps can point to specific objects and behaviors. The ablation studies mentioned are a plus for checking which parts matter. The abstract also notes experiments on the DRAMA benchmark, which at least gives a concrete testbed.

The soft spots sit mostly in the evidence. The text asserts significant gains and state-of-the-art results without listing any accuracy numbers, baselines, or error bars, so it is difficult to judge how large the improvement actually is or whether the caption step itself drives it rather than the adapter or prompt choices. The stress-test concern about captions possibly omitting key risk details is reasonable to raise, because the paper does not appear to include a direct check that the generated descriptions preserve all hazardous elements from the raw visuals. If that step loses information, the claimed advantage over direct MLLM use could shrink.

This work is aimed at researchers applying multimodal models to autonomous driving safety rather than those seeking broad theoretical advances. A reader already working on DRAMA or similar driving datasets could extract useful implementation ideas from the pipeline. I would send it to peer review so the full results section and any caption-quality checks can be examined properly, but the authors should be asked to put the quantitative comparisons in the abstract or early results so the gains are visible without digging.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes the DriveSafe framework for risk detection and safety suggestions in driving scenarios. It generates spatially grounded captions enriched with motion, spatial, and depth cues to create explicit language-based scene representations. These captions are then used for risk assessment to identify hazardous objects, their locations, and unsafe behaviors, followed by actionable safety suggestions. A lightweight adapter is fine-tuned using caption-risk pairings to efficiently adapt the base LLM with domain knowledge. The work claims significant gains over zero-shot MLLMs and prior baselines, demonstrating state-of-the-art performance on the DRAMA benchmark via exhaustive experiments and ablation studies.

Significance. If the empirical claims hold, the framework provides a practical method to improve MLLM performance on fine-grained risk assessment tasks in autonomous driving by using interpretable language intermediaries and parameter-efficient adaptation. This could have implications for safety-critical applications where explicit reasoning is beneficial.

major comments (2)

[§3.2] The caption generation step is presented as key to providing sufficient information for superior risk assessment, but there is no direct validation (e.g., human evaluation or comparison metrics) showing that the enriched captions preserve all risk-relevant details without omissions or distortions of hazardous objects or behaviors. This assumption is load-bearing for attributing gains to the language representation rather than other factors like prompt engineering or fine-tuning.
[§4 Experiments] While the abstract asserts state-of-the-art performance and significant gains, the provided description lacks specific quantitative metrics, error bars, or detailed baseline comparisons. The experimental setup information is insufficient to fully evaluate the central claim of superiority over zero-shot MLLMs and domain-specific methods.
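
One crude automatic proxy for the caption-fidelity check raised in the first major comment is to measure how many annotated hazard terms a generated caption actually mentions. The sketch below is an illustration only and not something the paper reports: `hazard_coverage` is a hypothetical helper, the matching is plain case-insensitive substring search, and it would complement rather than replace a human evaluation.

```python
from typing import Iterable


def hazard_coverage(caption: str, hazards: Iterable[str]) -> float:
    """Return the fraction of hazard terms found in the caption.

    Matching is case-insensitive substring search, so synonyms and
    paraphrases are missed; treat the score as a lower bound.
    """
    terms = [h.lower() for h in hazards]
    if not terms:
        return 1.0  # nothing to cover, so nothing was omitted
    text = caption.lower()
    hits = sum(1 for term in terms if term in text)
    return hits / len(terms)
```

For instance, a caption mentioning a pedestrian but not an annotated cyclist would score 0.5 against the term list `["pedestrian", "cyclist"]`, flagging a possible omission before the risk stage ever runs.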

minor comments (2)

[Abstract] The abstract could benefit from including at least one key quantitative result to support the claims of significant gains and SOTA performance.
[Overall] Ensure all figures and tables are clearly labeled and referenced in the text for better readability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments. We address each major comment point by point below, indicating where revisions will be made to strengthen the manuscript.

Referee: [§3.2] The caption generation step is presented as key to providing sufficient information for superior risk assessment, but there is no direct validation (e.g., human evaluation or comparison metrics) showing that the enriched captions preserve all risk-relevant details without omissions or distortions of hazardous objects or behaviors. This assumption is load-bearing for attributing gains to the language representation rather than other factors like prompt engineering or fine-tuning.

Authors: We agree that explicit validation of caption quality would provide stronger support for attributing performance gains to the enriched language representations. The manuscript currently relies on ablation studies and end-to-end performance improvements on the DRAMA benchmark to demonstrate the value of the captions. In the revised version, we will add a human evaluation study in which annotators rate the captions for completeness and accuracy with respect to hazardous objects, locations, motions, and unsafe behaviors. This addition will help isolate the contribution of the language intermediary from other factors such as fine-tuning. revision: yes
Referee: [§4 Experiments] While the abstract asserts state-of-the-art performance and significant gains, the provided description lacks specific quantitative metrics, error bars, or detailed baseline comparisons. The experimental setup information is insufficient to fully evaluate the central claim of superiority over zero-shot MLLMs and domain-specific methods.

Authors: The full manuscript contains tables reporting quantitative results, baseline comparisons, and ablation studies on the DRAMA benchmark. To improve clarity and address the referee's concern, we will expand the experimental section with explicit numerical results, error bars from repeated runs where available, and a more detailed account of the evaluation protocol, hyperparameters, and baseline implementations. These additions will make the superiority claims easier to verify without altering the existing experimental findings. revision: yes
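
The error bars from repeated runs mentioned in this response are typically reported as a mean with a sample standard deviation. A minimal sketch, assuming one scalar metric per run; `summarize_runs` is a hypothetical helper and no values from the paper are used here.

```python
from statistics import mean, stdev
from typing import Sequence, Tuple


def summarize_runs(scores: Sequence[float]) -> Tuple[float, float]:
    """Return (mean, sample standard deviation) of per-run metric scores."""
    if len(scores) < 2:
        # A spread estimate needs at least two independent runs.
        raise ValueError("need at least two runs to report an error bar")
    return mean(scores), stdev(scores)
```

Reporting both numbers per method lets a reader see whether a gap between two systems exceeds run-to-run variation.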

Circularity Check

0 steps flagged

No circularity: empirical framework with independent evaluation

The DriveSafe paper presents an applied ML pipeline that generates spatially grounded captions from multimodal inputs and fine-tunes a lightweight adapter on caption-risk pairs before performing risk assessment. No equations, first-principles derivations, or predictions are claimed; performance is measured directly via exhaustive experiments on the external DRAMA benchmark. The central claim rests on empirical gains from explicit language conditioning rather than any reduction of outputs to fitted inputs or self-citation chains. Self-citations, if present, are not load-bearing for the method itself. This is a standard non-circular empirical contribution.

Axiom & Free-Parameter Ledger

1 free parameter · 2 axioms · 0 invented entities

Based on abstract only; the approach rests on domain assumptions about MLLM limitations and the value of language representations rather than new invented entities or many free parameters.

free parameters (1)

lightweight adapter parameters
The adapter module is fine-tuned on caption-risk pairings, implying learned parameters specific to the domain knowledge injection step.

axioms (2)

domain assumption: Zero-shot MLLMs underperform compared to domain-specific methods in fine-grained, spatially grounded risk assessment
Explicitly stated as a finding that motivates the framework in the abstract.
domain assumption: Spatially grounded captions enriched with motion, spatial, and depth cues can effectively support downstream risk assessment and safety suggestions
Core premise of the method description for generating and using the captions.

Reference graph

Works this paper leans on

44 extracted references · 44 canonical work pages

[1] I. Sikora, “Risk assessment, modelling and proactive safety management system in aviation: a literature review,” in Transportation Systems with International Participation, 2015
[2] Y. Voskanyan, I. Shikina, F. Kidalov, D. Davidov, and T. Abrosimova, “Risk management in the healthcare safety management system,” Journal of Digital Science, 2021
[3] F. Vicentini, M. Askarpour, M. G. Rossi, and D. Mandrioli, “Safety assessment of collaborative robotics through automated formal verification,” IEEE Transactions on Robotics, 2019
[4] World Health Organization, “Road traffic injuries fact sheet,” 2024
[5] Insurance Institute for Highway Safety, “Fatality statistics: State-by-state,” 2023
[6] J. Li, F. Yang, H. Ma, S. Malla, M. Tomizuka, and C. Choi, “Rain: Reinforced hybrid attention inference network for motion forecasting,” in ICCV, 2021
[7] X. Ma, J. Li, M. J. Kochenderfer, D. Isele, and K. Fujimura, “Reinforcement learning for autonomous driving with latent state inference and spatial-temporal relationships,” in ICRA, 2021
[8] Z. Zhang, A. Tawari, S. Martin, and D. Crandall, “Interaction graphs for object importance estimation in on-road driving videos,” in ICRA, 2020
[9] S. Malla, C. Choi, I. Dwivedi, J. H. Choi, and J. Li, “Drama: Joint risk localization and captioning in driving,” in WACV, 2023
[10] D. Bolya, C.-Y. Fu, X. Dai, P. Zhang, C. Feichtenhofer, and J. Hoffman, “Token merging: Your vit but faster,” arXiv, 2022
[11] H. Zhou, Z. Gao, M. Ye, Z. Chen, Q. Chen, T. Cao, and H. Qi, “Hints of prompt: Enhancing visual representation for multimodal llms in autonomous driving,” arXiv, 2024
[12] S. Bai, K. Chen, X. Liu, J. Wang, W. Ge, S. Song, K. Dang, P. Wang, S. Wang, J. Tang et al., “Qwen2.5-vl technical report,” arXiv, 2025
[13] F. Li, R. Zhang, H. Zhang, Y. Zhang, B. Li, W. Li, Z. Ma, and C. Li, “Llava-next-interleave: Tackling multi-image, video, and 3d in large multimodal models,” arXiv, 2024
[14] B. Zhang, K. Li, Z. Cheng, Z. Hu, Y. Yuan, G. Chen, S. Leng, Y. Jiang, H. Zhang, X. Li et al., “Videollama 3: Frontier multimodal foundation models for image and video understanding,” arXiv, 2025
[15] E. Sachdeva, N. Agarwal, S. Chundi, S. Roelofs, J. Li, M. Kochenderfer, C. Choi, and B. Dariush, “Rank2tell: A multimodal driving dataset for joint importance ranking and reasoning,” in WACV, 2024
[16] Y. Ma, A. Abdelraouf, R. Gupta, Z. Wang, and K. Han, “Video token sparsification for efficient multimodal llms in autonomous driving,” arXiv, 2024
[17] C. Parikh, D. Rawat, R. R. T., T. Ghosh, and R. K. Sarvadevabhatla, “Roadsocial: A diverse videoqa dataset and benchmark for road event understanding from social video narratives,” CVPR, 2025
[18] H. Liu, C. Li, Y. Li, and Y. J. Lee, “Improved baselines with visual instruction tuning,” in CVPR, 2024
[19] D. Wang, W. Fu, Q. Song, and J. Zhou, “Potential risk assessment for safe driving of autonomous vehicles under occluded vision,” Scientific Reports, 2022
[20] M. Aslantas, F. K. Gündogdu, and S. Moslem, “Evaluating the potential risks posed by autonomous vehicles by using a decomposed fuzzy multi-criteria decision-making model,” Transportation Engineering, 2025
[21] M. Gao, A. Tawari, and S. Martin, “Goal-oriented object importance estimation in on-road driving videos,” in ICRA, 2019
[22] E. Ohn-Bar and M. M. Trivedi, “Are all objects equal? deep spatio-temporal importance prediction in driving videos,” Pattern Recognition, 2017
[23] Z. Zhang, A. Tawari, S. Martin, and D. J. Crandall, “Interaction graphs for object importance estimation in on-road driving videos,” ICRA, 2020
[24] C. Li, S. H. Chan, and Y.-T. Chen, “Who make drivers stop? towards driver-centric risk assessment: Risk object identification via causal inference,” IROS, 2020
[25] T. Wu, E. Sachdeva, K. Akash, X. Wu, T. Misu, and J. Ortiz, “Toward an adaptive situational awareness support system for urban driving,” IV Symposium, 2022
[26] Z. Pang, Z. Chen, J. Lu, B. Sun, T. Gong, X. Feng, Y. Wang, S. Yang, and Y. Cao, “Risk assessment method for autonomous vehicles violating safety common sense based on driving behavior,” IEEE Access, 2025
[27] C. Sima, K. Renz, K. Chitta, L. Chen, H. Zhang, C. Xie, P. Luo, A. Geiger, and H. Li, “Drivelm: Driving with graph visual question answering,” arXiv, 2023
[28] Y. Zhou, L. Huang, Q. Bu, J. Zeng, T. Li, H. Qiu, H. Zhu, M. Guo, Y. Qiao, and H. Li, “Embodied understanding of driving scenarios,” in ECCV, 2024
[29] J. Mao, Y. Qian, J. Ye, H. Zhao, and Y. Wang, “Gpt-driver: Learning to drive with gpt,” arXiv, 2023
[30] X. Ding, J. Han, H. Xu, X. Liang, W. Zhang, and X. Li, “Holistic autonomous driving understanding by bird’s-eye-view injected multi-modal large models,” 2024
[31] X. Ding, J. Han, H. Xu, W. Zhang, and X. Li, “Hilm-d: Enhancing mllms with multi-scale high-resolution details for autonomous driving,” IJCV, 2025
[32] J. Fan, J. Wu, J. Gao, J. Yu, Y. Wang, H. Chu, and B. Gao, “Mllm-sul: Multimodal large language model for semantic scene understanding and localization in traffic scenarios,” arXiv, 2024
[33] H.-k. Chiu, R. Hachiuma, C.-Y. Wang, S. F. Smith, Y.-C. F. Wang, and M.-H. Chen, “V2v-llm: Vehicle-to-vehicle cooperative autonomous driving with multi-modal large language models,” arXiv, 2025
[34] K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu, “Bleu: a method for automatic evaluation of machine translation,” in ACL, 2002
[35] S. Banerjee and A. Lavie, “Meteor: An automatic metric for mt evaluation with improved correlation with human judgments,” in ACL, 2005
[36] C.-Y. Lin, “Rouge: A package for automatic evaluation of summaries,” in ACL, 2004
[37] R. Vedantam, C. Lawrence Zitnick, and D. Parikh, “Cider: Consensus-based image description evaluation,” in CVPR, 2015
[38] P. Anderson, B. Fernando, M. Johnson, and S. Gould, “Spice: Semantic propositional image caption evaluation,” in ECCV, 2016
[39] D. Chan, S. Petryk, J. E. Gonzalez, T. Darrell, and J. Canny, “Clair: Evaluating image captions with large language models,” arXiv, 2023
[40] V. Dat, N. Bao, and P. Hung, “Hybridnets: End-to-end perception network,” Pattern Recognition and Image Analysis, 2025
[41] L. Yang, B. Kang, Z. Huang, Z. Zhao, X. Xu, J. Feng, and H. Zhao, “Depth anything v2,” arXiv, 2024
[42] R. Zhang, J. Han, C. Liu, P. Gao, A. Zhou, X. Hu, S. Yan, P. Lu, H. Li, and Y. Qiao, “Llama-adapter: Efficient fine-tuning of language models with zero-init attention,” in ICLR, 2024
[43] A. Dubey, A. Jauhri, A. Pandey, A. Kadian, A. Al-Dahle, A. Letman, A. Mathur, A. Schelten, A. Yang, A. Fan et al., “The llama 3 herd of models,” arXiv, 2024
[44] A. Liu, B. Feng, B. Xue, B. Wang, B. Wu, C. Lu, C. Zhao, C. Deng, C. Zhang, C. Ruan et al., “Deepseek-v3 technical report,” arXiv, 2024

[1] [1]

Risk assessment, modelling and proactive safety manage- ment system in aviation: a literature review,

I. Sikora, “Risk assessment, modelling and proactive safety manage- ment system in aviation: a literature review,” inTransportation Systems with International Participation, 2015

work page 2015

[2] [2]

Risk management in the healthcare safety management system,

Y . V oskanyan, I. Shikina, F. Kidalov, D. Davidov, and T. Abrosimova, “Risk management in the healthcare safety management system,” Journal of Digital Science, 2021

work page 2021

[3] [3]

Safety assessment of collaborative robotics through automated formal verifi- cation,

F. Vicentini, M. Askarpour, M. G. Rossi, and D. Mandrioli, “Safety assessment of collaborative robotics through automated formal verifi- cation,”IEEE Transactions on Robotics, 2019

work page 2019

[4] [4]

Road traffic injuries fact sheet,

World Health Organization, “Road traffic injuries fact sheet,” 2024

work page 2024

[5] [5]

Fatality statistics: State-by- state,

Insurance Institute for Highway Safety, “Fatality statistics: State-by- state,” 2023

work page 2023

[6] [6]

Rain: Reinforced hybrid attention inference network for motion forecasting,

J. Li, F. Yang, H. Ma, S. Malla, M. Tomizuka, and C. Choi, “Rain: Reinforced hybrid attention inference network for motion forecasting,” inICCV, 2021

work page 2021

[7] [7]

Rein- forcement learning for autonomous driving with latent state inference and spatial-temporal relationships,

X. Ma, J. Li, M. J. Kochenderfer, D. Isele, and K. Fujimura, “Rein- forcement learning for autonomous driving with latent state inference and spatial-temporal relationships,” inICRA, 2021

work page 2021

[8] [8]

Interaction graphs for object importance estimation in on-road driving videos,

Z. Zhang, A. Tawari, S. Martin, and D. Crandall, “Interaction graphs for object importance estimation in on-road driving videos,” inICRA, 2020

work page 2020

[9] [9]

Drama: Joint risk localization and captioning in driving,

S. Malla, C. Choi, I. Dwivedi, J. H. Choi, and J. Li, “Drama: Joint risk localization and captioning in driving,” inWACV, 2023

work page 2023

[10] [10]

Token merging: Your vit but faster,

D. Bolya, C.-Y . Fu, X. Dai, P. Zhang, C. Feichtenhofer, and J. Hoff- man, “Token merging: Your vit but faster,”arXiv, 2022

work page 2022

[11] [11]

Hints of prompt: Enhancing visual representation for multimodal llms in autonomous driving,

H. Zhou, Z. Gao, M. Ye, Z. Chen, Q. Chen, T. Cao, and H. Qi, “Hints of prompt: Enhancing visual representation for multimodal llms in autonomous driving,”arXiv, 2024

work page 2024

[12] [12]

Qwen2. 5-vl technical report,

S. Bai, K. Chen, X. Liu, J. Wang, W. Ge, S. Song, K. Dang, P. Wang, S. Wang, J. Tanget al., “Qwen2. 5-vl technical report,”arXiv, 2025

work page 2025

[13] [13]

Llava-next-interleave: Tackling multi-image, video, and 3d in large multimodal models,

F. Li, R. Zhang, H. Zhang, Y . Zhang, B. Li, W. Li, Z. Ma, and C. Li, “Llava-next-interleave: Tackling multi-image, video, and 3d in large multimodal models,”arXiv, 2024

work page 2024

[14] [14]

Videollama 3: Frontier multimodal foundation models for image and video understanding,

B. Zhang, K. Li, Z. Cheng, Z. Hu, Y . Yuan, G. Chen, S. Leng, Y . Jiang, H. Zhang, X. Liet al., “Videollama 3: Frontier multimodal foundation models for image and video understanding,”arXiv, 2025

work page 2025

[15] E. Sachdeva, N. Agarwal, S. Chundi, S. Roelofs, J. Li, M. Kochenderfer, C. Choi, and B. Dariush, “Rank2tell: A multimodal driving dataset for joint importance ranking and reasoning,” in WACV, 2024.

[16] Y. Ma, A. Abdelraouf, R. Gupta, Z. Wang, and K. Han, “Video token sparsification for efficient multimodal llms in autonomous driving,” arXiv, 2024.

[17] C. Parikh, D. Rawat, R. R. T., T. Ghosh, and R. K. Sarvadevabhatla, “Roadsocial: A diverse videoqa dataset and benchmark for road event understanding from social video narratives,” in CVPR, 2025.

[18] H. Liu, C. Li, Y. Li, and Y. J. Lee, “Improved baselines with visual instruction tuning,” in CVPR, 2024.

[19] D. Wang, W. Fu, Q. Song, and J. Zhou, “Potential risk assessment for safe driving of autonomous vehicles under occluded vision,” Scientific Reports, 2022.

[20] M. Aslantas, F. K. Gündogdu, and S. Moslem, “Evaluating the potential risks posed by autonomous vehicles by using a decomposed fuzzy multi-criteria decision-making model,” Transportation Engineering, 2025.

[21] M. Gao, A. Tawari, and S. Martin, “Goal-oriented object importance estimation in on-road driving videos,” in ICRA, 2019.

[22] E. Ohn-Bar and M. M. Trivedi, “Are all objects equal? deep spatio-temporal importance prediction in driving videos,” Pattern Recognition, 2017.

[23] Z. Zhang, A. Tawari, S. Martin, and D. J. Crandall, “Interaction graphs for object importance estimation in on-road driving videos,” in ICRA, 2020.

[24] C. Li, S. H. Chan, and Y.-T. Chen, “Who make drivers stop? towards driver-centric risk assessment: Risk object identification via causal inference,” in IROS, 2020.

[25] T. Wu, E. Sachdeva, K. Akash, X. Wu, T. Misu, and J. Ortiz, “Toward an adaptive situational awareness support system for urban driving,” in IV Symposium, 2022.

[26] Z. Pang, Z. Chen, J. Lu, B. Sun, T. Gong, X. Feng, Y. Wang, S. Yang, and Y. Cao, “Risk assessment method for autonomous vehicles violating safety common sense based on driving behavior,” IEEE Access, 2025.

[27] C. Sima, K. Renz, K. Chitta, L. Chen, H. Zhang, C. Xie, P. Luo, A. Geiger, and H. Li, “Drivelm: Driving with graph visual question answering,” arXiv, 2023.

[28] Y. Zhou, L. Huang, Q. Bu, J. Zeng, T. Li, H. Qiu, H. Zhu, M. Guo, Y. Qiao, and H. Li, “Embodied understanding of driving scenarios,” in ECCV, 2024.

[29] J. Mao, Y. Qian, J. Ye, H. Zhao, and Y. Wang, “Gpt-driver: Learning to drive with gpt,” arXiv, 2023.

[30] X. Ding, J. Han, H. Xu, X. Liang, W. Zhang, and X. Li, “Holistic autonomous driving understanding by bird’s-eye-view injected multi-modal large models,” 2024.

[31] X. Ding, J. Han, H. Xu, W. Zhang, and X. Li, “Hilm-d: Enhancing mllms with multi-scale high-resolution details for autonomous driving,” IJCV, 2025.

[32] J. Fan, J. Wu, J. Gao, J. Yu, Y. Wang, H. Chu, and B. Gao, “Mllm-sul: Multimodal large language model for semantic scene understanding and localization in traffic scenarios,” arXiv, 2024.

[33] H.-k. Chiu, R. Hachiuma, C.-Y. Wang, S. F. Smith, Y.-C. F. Wang, and M.-H. Chen, “V2v-llm: Vehicle-to-vehicle cooperative autonomous driving with multi-modal large language models,” arXiv, 2025.

[34] K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu, “Bleu: a method for automatic evaluation of machine translation,” in ACL, 2002.

[35] S. Banerjee and A. Lavie, “Meteor: An automatic metric for mt evaluation with improved correlation with human judgments,” in ACL, 2005.

[36] C.-Y. Lin, “Rouge: A package for automatic evaluation of summaries,” in ACL, 2004.

[37] R. Vedantam, C. Lawrence Zitnick, and D. Parikh, “Cider: Consensus-based image description evaluation,” in CVPR, 2015.

[38] P. Anderson, B. Fernando, M. Johnson, and S. Gould, “Spice: Semantic propositional image caption evaluation,” in ECCV, 2016.

[39] D. Chan, S. Petryk, J. E. Gonzalez, T. Darrell, and J. Canny, “Clair: Evaluating image captions with large language models,” arXiv, 2023.

[40] V. Dat, N. Bao, and P. Hung, “Hybridnets: End-to-end perception network,” Pattern Recognition and Image Analysis, 2025.

[41] L. Yang, B. Kang, Z. Huang, Z. Zhao, X. Xu, J. Feng, and H. Zhao, “Depth anything v2,” arXiv, 2024.

[42] R. Zhang, J. Han, C. Liu, P. Gao, A. Zhou, X. Hu, S. Yan, P. Lu, H. Li, and Y. Qiao, “Llama-adapter: Efficient fine-tuning of language models with zero-init attention,” in ICLR, 2024.

[43] A. Dubey, A. Jauhri, A. Pandey, A. Kadian, A. Al-Dahle, A. Letman, A. Mathur, A. Schelten, A. Yang, A. Fan et al., “The llama 3 herd of models,” arXiv, 2024.

[44] A. Liu, B. Feng, B. Xue, B. Wang, B. Wu, C. Lu, C. Zhao, C. Deng, C. Zhang, C. Ruan et al., “Deepseek-v3 technical report,” arXiv, 2024.