How Well Do Vision-Language Models Understand Sequential Driving Scenes? A Sensitivity Study

· 2026 · cs.CV · arXiv 2604.06750

1 Pith paper cites this work. Polarity classification is still indexing.

abstract

Vision-Language Models (VLMs) are increasingly proposed for autonomous driving tasks, yet their performance on sequential driving scenes remains poorly characterized, particularly regarding how input configurations affect their capabilities. We introduce VENUSS (VLM Evaluation oN Understanding Sequential Scenes), a framework for systematic sensitivity analysis of VLM performance on sequential driving scenes, establishing baselines for future research. Building upon existing datasets, VENUSS extracts temporal sequences from driving videos and generates structured evaluations across custom categories. By comparing 25+ existing VLMs across 2,600+ scenarios, we show that even top models achieve only 57% accuracy, falling short of human performance under similar constraints (65%) and exposing significant capability gaps. Our analysis shows that VLMs excel at static object detection but struggle to understand vehicle dynamics and temporal relations. VENUSS offers the first systematic sensitivity analysis of VLMs focused on how input image configurations (resolution, frame count, temporal intervals, spatial layouts, and presentation modes) affect performance on sequential driving scenes. Supplementary material is available at https://TUM-AVS.github.io/VENUSS/.

representative citing papers

TPS-Drive: Task-Guided Representation Purification for VLM-based Autonomous Driving

cs.RO · 2026-05-26 · unverdicted · novelty 7.0

TPS-Drive uses an agent-centric tokenizer supervised by a frozen 3D detection head to purify VLM spatial representations, enabling better scene forecasting and lower collision rates on nuScenes and NAVSIM benchmarks.

citing papers explorer

Showing 1 of 1 citing paper.

TPS-Drive: Task-Guided Representation Purification for VLM-based Autonomous Driving

cs.RO · 2026-05-26 · unverdicted · none · ref 19 · internal anchor
