
arxiv: 2509.25944 · v2 · submitted 2025-09-30 · 💻 cs.AI

NuRisk: A Visual Question Answering Dataset for Agent-Level Risk Assessment in Autonomous Driving

Pith reviewed 2026-05-18 12:41 UTC · model grok-4.3

classification 💻 cs.AI
keywords visual question answering · autonomous driving · risk assessment · spatio-temporal reasoning · vision-language models · agent-level annotations · nuScenes · Waymo

The pith

NuRisk supplies sequential driving images with 1.1 million agent-level risk scores to test whether vision-language models can track how dangers evolve over time.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents NuRisk, a visual question answering dataset built from real nuScenes and Waymo recordings plus simulated safety-critical cases. It supplies bird's-eye-view image sequences together with numerical risk values attached to every agent at every time step. Standard vision-language models reach only 33 percent accuracy on the resulting questions and operate at high latency because they do not reason explicitly about risk changes across frames. A fine-tuned seven-billion-parameter model raises accuracy to 41 percent while cutting latency by 75 percent, yet the still-modest score demonstrates that the underlying spatio-temporal reasoning task remains difficult. The dataset therefore functions as a new benchmark for measuring progress on agent-level risk assessment in autonomous driving.

Core claim

NuRisk comprises 2.9K scenarios and 1.1M agent-level samples drawn from nuScenes, Waymo, and CommonRoad data; each sample pairs bird's-eye-view sequential images with quantitative, agent-level risk annotations that enable explicit spatio-temporal reasoning about evolving risks. Benchmark experiments show that existing VLMs achieve at most 33 percent accuracy at high latency because they lack this reasoning, whereas a fine-tuned 7B VLM reaches 41 percent accuracy and 75 percent lower latency while exhibiting the spatio-temporal capabilities absent in proprietary models.
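The headline figures are easier to compare when converted to absolute gain, relative gain, and a speedup factor. The short sketch below does that arithmetic using only the numbers quoted above. It assumes the 75 percent latency reduction is measured against the same baseline as the 33 percent accuracy figure, which this summary does not state explicitly; the variable names are illustrative.

```python
# Headline numbers as quoted in the summary above.
baseline_acc = 0.33       # best accuracy among existing VLMs
finetuned_acc = 0.41      # accuracy of the fine-tuned 7B model
latency_reduction = 0.75  # fractional latency reduction after fine-tuning

# Absolute gain in percentage points and gain relative to the baseline.
abs_gain_points = (finetuned_acc - baseline_acc) * 100
rel_gain = (finetuned_acc - baseline_acc) / baseline_acc

# A 75% reduction leaves 25% of the original latency, i.e. a 4x speedup.
speedup = 1.0 / (1.0 - latency_reduction)

print(f"absolute gain: {abs_gain_points:.1f} percentage points")
print(f"relative gain: {rel_gain:.1%}")
print(f"speedup factor: {speedup:.1f}x")
```

Read this way, the fine-tuned model gains 8 percentage points (roughly a quarter over the baseline) while still answering fewer than half of the questions correctly.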

What carries the argument

The NuRisk dataset of sequential bird's-eye-view images paired with per-agent quantitative risk annotations over time.

If this is right

  • Existing vision-language models cannot perform the explicit spatio-temporal reasoning required to assess how agent risks change across video frames in driving scenes.
  • A fine-tuned 7B model can exceed the accuracy and speed of much larger proprietary models on this specific task.
  • The remaining gap to high accuracy indicates that agent-level risk assessment in dynamic traffic is still an open challenge for current methods.
  • NuRisk supplies a concrete test bed that future models must pass to demonstrate improved spatio-temporal understanding in autonomous driving.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same annotation pipeline could be applied to additional real-world datasets to test whether the learned risk patterns generalize beyond the three sources used here.
  • Pairing the dataset with reinforcement-learning loops might allow agents to be trained to select actions that reduce predicted risk over the next several seconds.
  • Adding explicit uncertainty estimates to the risk labels could help models distinguish between clear high-risk situations and genuinely ambiguous ones.

Load-bearing premise

The quantitative agent-level risk annotations generated from nuScenes, Waymo, and CommonRoad data accurately capture the true evolving risk that a human driver or AV should perceive.

What would settle it

Collect independent risk ratings from human drivers or safety experts on a held-out subset of the same scenarios and measure agreement with the dataset's numerical annotations; substantial disagreement would show that the labels do not reflect perceived risk.
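One way such an agreement check could be run is with a rank correlation between the dataset's numerical risk values and the human ratings. The sketch below implements Spearman's rank correlation with the standard library only. The choice of Spearman is an assumption made here for illustration, not something the paper or this review prescribes, and the sample values are made up rather than drawn from NuRisk.

```python
from statistics import mean

def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the two rank lists."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5  # undefined if either input is constant

# Made-up illustrative values, NOT data from the paper.
dataset_risk = [0.1, 0.4, 0.35, 0.8, 0.9]  # hypothetical numerical annotations
human_rating = [1, 3, 2, 4, 5]             # hypothetical expert ratings (1 to 5)
rho = spearman(dataset_risk, human_rating)
print(f"Spearman rho = {rho:.2f}")
```

A rank-based measure is used here because human ratings are typically ordinal, so only the ordering of scenarios needs to match, not the numerical scale.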

Figures

Figures reproduced from arXiv: 2509.25944 by Johannes Betz, Mattia Piccinini, Roberto Brusnicki, Yuan Gao, Yuchen Zhang.

Figure 1: Overview of NuRisk: Existing VLM-based risk assessment is …
Figure 2: Framework of NuRisk. Multi-modal inputs are processed into BEV scenes and risk metrics to enable conversation-based VQA with chain-of-thought …
Figure 3: Dataset statistics and risk distribution of NuRisk.
Figure 4: NuRisk VLM Agent Fine-tuning Architecture.
Figure 5: Performance comparison between the best proprietary VLMs and …
Original abstract

Understanding risk in autonomous driving requires not only perception and prediction, but also high-level reasoning about agent behavior and context. Current Vision Language Model (VLM)-based methods primarily ground agents in static images and provide qualitative judgments, lacking the spatio-temporal reasoning needed to capture how risks evolve over time. To address this gap, we propose NuRisk, a comprehensive Visual Question Answering (VQA) dataset comprising 2.9K scenarios and 1.1M agent-level samples, built on real-world data from nuScenes and Waymo, completed with safety-critical scenarios from the CommonRoad simulator. The dataset provides Bird's-eye view (BEV) based sequential images with quantitative, agent-level risk annotations, enabling spatio-temporal reasoning. We benchmark well-known VLMs across different prompting techniques and find that they fail to perform explicit spatio-temporal reasoning, resulting in a peak accuracy of 33% at high latency. To address these shortcomings, our fine-tuned 7B VLM agent improves accuracy to 41% and reduces latency by 75%, demonstrating explicit spatio-temporal reasoning capabilities that proprietary models lacked. While this represents a significant step forward, the modest accuracy underscores the profound challenge of the task, establishing NuRisk as a critical benchmark for advancing spatio-temporal reasoning in autonomous driving. More information can be found at https://github.com/TUM-AVS/NuRisk.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces NuRisk, a VQA dataset comprising 2.9K scenarios and 1.1M agent-level samples drawn from nuScenes, Waymo, and CommonRoad. It supplies BEV sequential images paired with quantitative risk annotations to support spatio-temporal reasoning about agent behavior. Benchmarking shows existing VLMs reach at most 33% accuracy with high latency; fine-tuning a 7B VLM raises accuracy to 41% and cuts latency by 75%, which the authors interpret as evidence of explicit spatio-temporal reasoning that proprietary models lack.

Significance. If the risk annotations prove to be a reliable proxy for evolving agent-level risk, NuRisk would constitute a useful large-scale benchmark for VLM-based risk assessment in autonomous driving. The scale, inclusion of both real-world and safety-critical simulated scenarios, and the reported fine-tuning gains would help quantify current limitations and motivate further work on temporal reasoning. The modest absolute accuracy also usefully underscores the difficulty of the task.

major comments (2)
  1. [Dataset construction and risk annotation methodology] The central empirical claims rest on the quantitative risk annotations being faithful proxies for true evolving risk that a human driver or AV should perceive. The manuscript provides no external validation of these labels (e.g., correlation with human expert risk ratings, retrospective near-miss/collision analysis, or inter-annotator agreement on edge cases). Without such checks, the observed 33% to 41% accuracy lift could reflect overfitting to annotation heuristics rather than acquisition of genuine spatio-temporal reasoning.
  2. [Benchmarking and fine-tuning experiments] The abstract and results sections report a 33% to 41% accuracy improvement and a 75% latency reduction after fine-tuning, yet supply no error bars, statistical significance tests, or controlled baseline comparisons (e.g., against the same 7B model with different prompting or against larger proprietary models under identical conditions). This weakens the strength of the claim that the fine-tuned model demonstrates capabilities proprietary models lacked.
minor comments (2)
  1. [Experimental setup] Clarify the exact prompting strategies and temperature settings used for the proprietary-model baselines so that the post-hoc comparisons can be reproduced.
  2. [Dataset statistics] The GitHub link is provided, but the manuscript should include a brief description of the data split (train/val/test) and any filtering criteria applied to the 1.1M samples.
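Major comment 2 above asks for error bars and significance tests. As a rough illustration of what that reporting could look like, the sketch below computes a mean and sample standard deviation per model across repeated evaluation runs, plus a paired t-statistic on the per-run differences. The paired t-test is one possible choice assumed here for illustration, and the run accuracies are made-up placeholders, not results from the paper.

```python
import math
from statistics import mean, stdev

def error_bar(runs):
    """Mean and sample standard deviation across repeated evaluation runs."""
    return mean(runs), stdev(runs)

def paired_t_statistic(a, b):
    """Paired t-statistic and degrees of freedom for matched runs a[i], b[i]."""
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    standard_error = stdev(diffs) / math.sqrt(n)
    return mean(diffs) / standard_error, n - 1

# Made-up placeholder accuracies for five matched runs, NOT results from the paper.
model_a_runs = [0.57, 0.60, 0.57, 0.60, 0.58]
model_b_runs = [0.50, 0.50, 0.50, 0.50, 0.50]

mean_a, sd_a = error_bar(model_a_runs)
t_stat, dof = paired_t_statistic(model_a_runs, model_b_runs)
print(f"model A: {mean_a:.3f} +/- {sd_a:.3f}")
print(f"paired t = {t_stat:.2f} with {dof} degrees of freedom")
```

The t-statistic would then be compared against a t-distribution with the returned degrees of freedom to obtain a p-value; that lookup is omitted to keep the sketch dependency-free.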

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment point by point below and indicate the planned revisions.

Point-by-point responses
  1. Referee: [Dataset construction and risk annotation methodology] The central empirical claims rest on the quantitative risk annotations being faithful proxies for true evolving risk that a human driver or AV should perceive. The manuscript provides no external validation of these labels (e.g., correlation with human expert risk ratings, retrospective near-miss/collision analysis, or inter-annotator agreement on edge cases). Without such checks, the observed 33% to 41% accuracy lift could reflect overfitting to annotation heuristics rather than acquisition of genuine spatio-temporal reasoning.

    Authors: We appreciate the referee highlighting the importance of validating the risk annotations. The annotations are derived from established quantitative metrics in the autonomous driving literature, including time-to-collision, minimum distance, and trajectory overlap scores computed directly from ground-truth data in nuScenes, Waymo, and CommonRoad. These metrics are deterministic and designed to capture evolving spatio-temporal risk. We agree that additional external validation, such as correlation with human expert ratings, would strengthen the work. In the revised manuscript, we will expand the dataset construction section with a more detailed description of the annotation pipeline and add a dedicated limitations subsection discussing potential heuristic biases. A human validation study is planned as future work. revision: partial

  2. Referee: [Benchmarking and fine-tuning experiments] The abstract and results sections report a 33% to 41% accuracy improvement and a 75% latency reduction after fine-tuning, yet supply no error bars, statistical significance tests, or controlled baseline comparisons (e.g., against the same 7B model with different prompting or against larger proprietary models under identical conditions). This weakens the strength of the claim that the fine-tuned model demonstrates capabilities proprietary models lacked.

    Authors: We agree that including statistical measures and clearer baselines would strengthen the experimental section. In the revised manuscript, we will report error bars from multiple evaluation runs and include statistical significance tests (e.g., paired t-tests) for the accuracy and latency improvements. Our original benchmarking already covered multiple VLMs and prompting strategies, with direct comparisons to proprietary models under consistent conditions. To further address the concern, we will add controlled ablations using the base 7B model with varied prompting and clarify the experimental setup to better support the interpretation of improved spatio-temporal reasoning. revision: yes
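The first author response above names time-to-collision and minimum distance among the metrics behind the annotations. To make those two quantities concrete, the sketch below computes both for a pair of agents under a constant-velocity assumption, treating each agent as a disc. This is a textbook formulation chosen for illustration; the function names, the disc model, and the constant-velocity assumption are hypothetical and are not taken from the NuRisk pipeline.

```python
import math

def closest_approach(rel_pos, rel_vel):
    """Time (>= 0) and distance of closest approach under constant velocity.

    rel_pos and rel_vel are the other agent's position and velocity
    relative to the ego vehicle, as (x, y) tuples.
    """
    px, py = rel_pos
    vx, vy = rel_vel
    speed_sq = vx * vx + vy * vy
    # With no relative motion the distance never changes, so use t = 0.
    t = 0.0 if speed_sq == 0 else max(0.0, -(px * vx + py * vy) / speed_sq)
    return t, math.hypot(px + vx * t, py + vy * t)

def time_to_collision(rel_pos, rel_vel, radius):
    """First time two discs with combined radius `radius` touch, or None."""
    px, py = rel_pos
    vx, vy = rel_vel
    # Solve |rel_pos + rel_vel * t|^2 = radius^2, a quadratic in t.
    a = vx * vx + vy * vy
    b = 2 * (px * vx + py * vy)
    c = px * px + py * py - radius * radius
    if c <= 0:
        return 0.0   # already overlapping
    if a == 0:
        return None  # no relative motion
    disc = b * b - 4 * a * c
    if disc < 0 or b >= 0:
        return None  # paths miss, or the agents are moving apart
    return (-b - math.sqrt(disc)) / (2 * a)

# Illustrative case: an agent 20 m ahead, closing at 5 m/s, combined radius 2 m.
ttc = time_to_collision((20, 0), (-5, 0), 2)
t_star, d_min = closest_approach((20, 0), (-5, 0))
print(f"TTC = {ttc:.1f} s, closest approach {d_min:.1f} m at t = {t_star:.1f} s")
```

Because both quantities are deterministic functions of the ground-truth trajectories, labels built from them are reproducible, which is the property the authors emphasize; whether they match perceived risk is the separate question the referee raises.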

Circularity Check

0 steps flagged

No significant circularity: empirical dataset and benchmark paper

Full rationale

This is a dataset construction and VLM benchmarking paper with no mathematical derivation, first-principles equations, fitted parameters, or predictions that reduce to inputs by construction. The NuRisk dataset is assembled from external sources (nuScenes, Waymo, CommonRoad) with agent-level risk annotations generated via trajectory-based heuristics; reported accuracies (33% baseline to 41% fine-tuned) are direct empirical measurements on held-out VQA samples. No self-citation load-bearing steps, uniqueness theorems, or ansatzes are invoked to justify core claims. The work is self-contained against external benchmarks and does not exhibit any of the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

No mathematical axioms or free parameters are invoked; the work rests on the domain assumption that simulator-augmented real-world data plus rule-based risk labeling produces ground truth suitable for training and evaluating spatio-temporal reasoning.

axioms (1)
  • domain assumption: Risk annotations derived from nuScenes, Waymo, and CommonRoad accurately reflect agent-level spatio-temporal risk.
    The central benchmark results depend on these labels being reliable; the abstract does not describe the labeling procedure or validation.

pith-pipeline@v0.9.0 · 5789 in / 1234 out tokens · 29567 ms · 2026-05-18T12:41:13.225055+00:00 · methodology


Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. V2X-QA: A Comprehensive Reasoning Dataset and Benchmark for Multimodal Large Language Models in Autonomous Driving Across Ego, Infrastructure, and Cooperative Views

    cs.RO 2026-04 conditional novelty 8.0

    V2X-QA provides a view-decoupled benchmark showing infrastructure views aid macroscopic traffic understanding while cooperative reasoning requires explicit cross-view alignment, with V2X-MoE as a routing-based baselin...

  2. Can VLMs Unlock Semantic Anomaly Detection? A Framework for Structured Reasoning

    cs.CV 2025-10 conditional novelty 6.0

    SAVANT reformulates semantic anomaly detection as layered consistency verification, raising VLM recall by 18.5% on real driving images and enabling a fine-tuned 7B open model to reach 90.8% recall and 93.8% accuracy.

  3. Can VLMs Unlock Semantic Anomaly Detection? A Framework for Structured Reasoning

    cs.CV 2025-10 unverdicted novelty 5.0

    SAVANT boosts VLM recall for semantic anomaly detection in driving images by 18.5% via structured reasoning and enables fine-tuning a 7B open model to 90.8% recall and 93.8% accuracy.
