pith. machine review for the scientific record.

arxiv: 2604.06750 · v1 · submitted 2026-04-08 · 💻 cs.CV

Recognition: no theorem link

How Well Do Vision-Language Models Understand Sequential Driving Scenes? A Sensitivity Study

Authors on Pith · no claims yet

Pith reviewed 2026-05-10 18:41 UTC · model grok-4.3

classification 💻 cs.CV
keywords vision-language models · sequential driving scenes · temporal understanding · vehicle dynamics · sensitivity analysis · benchmark evaluation · autonomous driving

The pith

Vision-language models reach only 57% accuracy on sequential driving scenes and fall short of human performance.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Vision-language models are increasingly proposed for autonomous driving tasks, but their handling of sequences of driving scenes has not been systematically measured. The paper introduces the VENUSS framework, which extracts temporal sequences from driving videos and tests how input configurations affect results across more than 25 models and 2,600 scenarios. Top models achieve 57% accuracy, below the 65% humans reach under comparable constraints, succeeding at static object detection but failing on vehicle dynamics and temporal relations. The work provides baselines and shows that factors such as frame count, resolution, and temporal spacing shift performance.

Core claim

Even the leading vision-language models attain no more than 57 percent accuracy when tested on sequential driving scenes, compared with 65 percent for humans, and they demonstrate clear strengths in static object detection alongside pronounced weaknesses in grasping vehicle dynamics and temporal relations. The VENUSS framework provides the first systematic way to analyze sensitivity to input image configurations including resolution, frame count, temporal intervals, spatial layouts, and presentation modes.

What carries the argument

The VENUSS framework, which extracts temporal sequences from driving videos and generates structured evaluations across custom categories to test VLM sensitivity to input configurations.
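
To make the sensitivity analysis concrete, here is a minimal sketch of what such a configuration sweep could look like. This is not the authors' code: the names (InputConfig, configuration_grid, sweep, model.answer, scenario.render) are invented for illustration, and the parameter ranges mirror the dimensions reported in Figure 4.

```python
# Hypothetical sketch of a VENUSS-style input-configuration sweep.
# Names and interfaces are invented; ranges mirror Fig. 4 of the paper.
from dataclasses import dataclass
from itertools import product

@dataclass(frozen=True)
class InputConfig:
    resolution_level: int  # 1-6, per Fig. 4(a)
    interval_ms: int       # inter-frame spacing in ms, per Fig. 4(b)
    num_frames: int        # 1-10 frames per sequence, per Fig. 4(c)
    mode: str              # "batch", "collage", or "separate", per Fig. 4(d)

def configuration_grid():
    """Enumerate input configurations (a full Cartesian grid is shown
    for simplicity; the paper may vary one factor at a time)."""
    for res, step, n, mode in product(
        range(1, 7),                       # resolution levels
        range(100, 1001, 100),             # temporal intervals (ms)
        range(1, 11),                      # frame counts
        ("batch", "collage", "separate"),  # presentation modes
    ):
        yield InputConfig(res, step, n, mode)

def sweep(model, scenarios):
    """Accuracy per configuration. `model.answer(frames, question)` and
    `scenario.render(cfg)` are placeholder interfaces, not the paper's API."""
    results = {}
    for cfg in configuration_grid():
        correct = 0
        for sc in scenarios:
            frames, question, answer = sc.render(cfg)
            correct += model.answer(frames, question) == answer
        results[cfg] = correct / len(scenarios)
    return results
```

The per-model "optimal input configuration" reported in Figure 1 would then simply be the argmax of `results` for that model.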

If this is right

  • VLMs succeed at static object detection but fail to track vehicle dynamics and temporal relations across frames.
  • Accuracy changes with input settings such as the number of frames, their spacing, and image resolution.
  • Human performance at 65 percent sets a target that current models do not meet under the tested conditions.
  • The framework supplies a repeatable method for measuring future improvements in sequential scene understanding.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The identified gaps suggest that adding explicit motion-prediction modules could improve VLM reliability in driving.
  • The same sensitivity-testing approach could be applied to sequential tasks in robotics or video surveillance.
  • Training on datasets that emphasize dynamics rather than static scenes might close part of the performance difference.

Load-bearing premise

That the custom-generated questions and sequences extracted from existing driving videos provide an unbiased and representative test of sequential understanding without introducing artifacts from the extraction or question-generation process.

What would settle it

A new model or input configuration that achieves over 65 percent accuracy on the same set of sequential driving scenarios under the human-comparable constraints would falsify the claim of significant capability gaps.
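
For scale, a hedged sketch of what "over 65 percent" would have to survive statistically: treating the roughly 2,600 scenarios as independent trials (an assumption the paper does not state), an exact one-sided binomial test separates a real improvement over the human baseline from noise.

```python
# Hedged sketch: is a new model's accuracy significantly above the 65%
# human baseline? Assumes the ~2,600 scenarios are independent trials,
# an assumption the paper does not make explicit.
from scipy.stats import binomtest

def beats_human_baseline(n_correct: int, n_scenarios: int = 2600,
                         human_acc: float = 0.65, alpha: float = 0.05) -> bool:
    """One-sided exact binomial test of accuracy > human_acc."""
    test = binomtest(n_correct, n_scenarios, p=human_acc,
                     alternative="greater")
    return test.pvalue < alpha

# Example: 68% observed accuracy on 2,600 scenarios clears the bar.
print(beats_human_baseline(int(0.68 * 2600)))  # True at alpha = 0.05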

Figures

Figures reproduced from arXiv: 2604.06750 by Johannes Betz, Mattia Piccinini, Roberto Brusnicki.

Figure 1. VENUSS framework overview. Starting from driving datasets, VENUSS generates structured evaluation data with controlled variations in image count, timing, resolution, layout, and presentation mode. It evaluates both VLMs and humans on identical tasks across custom categories, identifies optimal input configurations per model, and establishes performance baselines.
Figure 2. Visualization of all CoVLA textual descriptions automatically categorized by VENUSS. The framework identified seven categories from the natural language descriptions: motion types (blue), velocity descriptors (yellow), directional behaviors (orange), acceleration patterns (purple), following behavior (dark green), traffic light conditions (light red), and road curvature detection (gray).
Figure 3. Simplified example of the human evaluation questionnaire with the seven-category format (Sec. III-C). The interface presents sequential driving images with temporal intervals and seven multiple-choice questions for the corresponding seven categories. The final answer key (bottom right) concatenates the responses for comparison with ground truth annotations.
Figure 4. Performance analysis across four dimensions as described in Sec. IV-A: (a) performance by resolution level (1-6), (b) performance by time interval (100 ms increments, 1-10), (c) performance by number of images (1-10), and (d) performance by presentation mode (b = batch, c = collage, s = separate). Horizontal lines show human baselines: green for collage-based evaluation, blue for GIF-based evaluation.
Figure 5. Performance comparison across different grid formats for the image sequences. Specific grid layouts demonstrate better performance for temporal understanding tasks.
read the original abstract

Vision-Language Models (VLMs) are increasingly proposed for autonomous driving tasks, yet their performance on sequential driving scenes remains poorly characterized, particularly regarding how input configurations affect their capabilities. We introduce VENUSS (VLM Evaluation oN Understanding Sequential Scenes), a framework for systematic sensitivity analysis of VLM performance on sequential driving scenes, establishing baselines for future research. Building upon existing datasets, VENUSS extracts temporal sequences from driving videos, and generates structured evaluations across custom categories. By comparing 25+ existing VLMs across 2,600+ scenarios, we reveal how even top models achieve only 57% accuracy, not matching human performance in similar constraints (65%) and exposing significant capability gaps. Our analysis shows that VLMs excel with static object detection but struggle with understanding the vehicle dynamics and temporal relations. VENUSS offers the first systematic sensitivity analysis of VLMs focused on how input image configurations - resolution, frame count, temporal intervals, spatial layouts, and presentation modes - affect performance on sequential driving scenes. Supplementary material available at https://V3NU55.github.io

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces the VENUSS framework for systematic sensitivity analysis of vision-language models (VLMs) on sequential driving scenes. It extracts temporal sequences from existing driving videos, generates structured custom questions across categories, and benchmarks 25+ VLMs on 2,600+ scenarios. Key results show top models achieving 57% accuracy (vs. 65% for humans under similar constraints), with VLMs performing well on static object detection but struggling with vehicle dynamics and temporal relations. The work further analyzes how input configurations (resolution, frame count, temporal intervals, spatial layouts, presentation modes) affect performance.

Significance. If the benchmark holds, this large-scale empirical study (25+ models, 2600+ scenarios) provides useful baselines and identifies concrete capability gaps relevant to autonomous driving applications. The sensitivity analysis on input factors is a positive contribution that can guide practical VLM usage. The direct comparison to human performance under matched constraints adds interpretability, though the overall significance is tempered by the need to confirm benchmark robustness.

major comments (2)
  1. Abstract and evaluation setup: the headline result (top VLMs at 57% accuracy vs. humans at 65%) and the claim of specific deficits in dynamics/temporal relations are presented without error bars, statistical tests, or inter-annotator agreement metrics on the generated questions. This makes it difficult to assess whether the reported gap and category-specific conclusions are robust or could be influenced by question-generation artifacts.
  2. VENUSS framework description (sequence extraction and question generation): no validation, ablations on extraction parameters (temporal intervals, frame selection), or controls for single-frame solvability / language-model priors are reported. Since the central claims about capability gaps and sensitivity to temporal relations rest on these custom sequences and questions being an unbiased probe, the absence of such checks is load-bearing for the interpretation of the 57% result and the dynamics/temporal deficits.
minor comments (2)
  1. The abstract mentions 'Supplementary material available at https://V3NU55.github.io' but does not summarize what additional data, code, or question examples are provided there; including a brief description would improve reproducibility.
  2. Prompt sensitivity is flagged in the reader's assessment but not explicitly controlled or ablated in the reported experiments; a short note on prompt variations would strengthen the sensitivity analysis section.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive review of our manuscript on the VENUSS benchmark. We have carefully considered the major comments and outline our responses below, including planned revisions to address concerns about statistical robustness and framework validation.

read point-by-point responses
  1. Referee: Abstract and evaluation setup: the headline result (top VLMs at 57% accuracy vs. humans at 65%) and the claim of specific deficits in dynamics/temporal relations are presented without error bars, statistical tests, or inter-annotator agreement metrics on the generated questions. This makes it difficult to assess whether the reported gap and category-specific conclusions are robust or could be influenced by question-generation artifacts.

    Authors: We agree that error bars, statistical tests, and additional details on question quality would strengthen the presentation of our results. In the revised manuscript, we will report confidence intervals or standard errors for the accuracy metrics based on the 2,600+ scenarios. We will also include statistical significance tests (such as McNemar's test for paired comparisons; a sketch appears after these responses) to evaluate the differences between top VLMs, other models, and the human baseline, as well as across question categories. For the generated questions, we will expand the methods section to describe the structured generation pipeline in detail, including any automated validation steps and manual quality checks performed on a subset of questions. While traditional inter-annotator agreement metrics do not directly apply because the questions are derived programmatically from existing dataset annotations rather than independent human labeling, these additions will help demonstrate that the reported gaps and category-specific findings are not driven by generation artifacts. revision: yes

  2. Referee: VENUSS framework description (sequence extraction and question generation): no validation, ablations on extraction parameters (temporal intervals, frame selection), or controls for single-frame solvability / language-model priors are reported. Since the central claims about capability gaps and sensitivity to temporal relations rest on these custom sequences and questions being an unbiased probe, the absence of such checks is load-bearing for the interpretation of the 57% result and the dynamics/temporal deficits.

    Authors: We acknowledge that explicit validation and controls for the sequence extraction and question generation components would better support the interpretation of our findings. In the revision, we will add ablations varying key extraction parameters such as temporal intervals and frame selection criteria, reporting their effects on overall accuracy and category performance. We will also introduce controls for single-frame solvability by evaluating selected VLMs on individual frames from the sequences and comparing results to the full temporal setting. To address potential language-model priors, we will include experiments with shuffled or non-sequential frame presentations (see the second sketch after these responses). These new analyses will be incorporated into the results and discussion sections to show that the observed weaknesses in vehicle dynamics and temporal relations reflect genuine capability gaps rather than benchmark construction issues. revision: yes
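
A minimal sketch of the exact McNemar test proposed in the first response, applied to paired per-scenario correctness of two systems (for example, a VLM against the human baseline). The function name and data layout are ours, not the authors':

```python
# Exact McNemar test on paired boolean correctness vectors. Only the
# discordant pairs matter: scenarios one system got right and the other
# got wrong. Under H0 (equal accuracy) each discordant pair favors
# either system with probability 1/2. Layout is illustrative only.
from scipy.stats import binomtest

def mcnemar_exact(correct_a, correct_b):
    """Two-sided exact McNemar p-value for two systems on the same items."""
    a_only = sum(a and not b for a, b in zip(correct_a, correct_b))
    b_only = sum(b and not a for a, b in zip(correct_a, correct_b))
    n_discordant = a_only + b_only
    if n_discordant == 0:
        return 1.0  # identical behavior, no evidence of a difference
    return binomtest(a_only, n_discordant, p=0.5).pvalue
```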
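
And a hedged sketch of the controls promised in the second response; the model and scenario interfaces are invented for illustration. If ordered sequences do not outperform shuffled or single-frame inputs on a category, that category's questions are likely answerable without temporal understanding:

```python
# Hedged sketch of the promised temporal controls. `model.answer` and
# the scenario attributes (frames, question, answer) are invented, not
# interfaces from the paper.
import random

def temporal_controls(model, scenarios, seed=0):
    """Accuracy with ordered, shuffled, and single-frame inputs."""
    rng = random.Random(seed)
    scores = {"ordered": 0, "shuffled": 0, "single": 0}
    for sc in scenarios:
        shuffled = sc.frames[:]
        rng.shuffle(shuffled)
        scores["ordered"] += model.answer(sc.frames, sc.question) == sc.answer
        scores["shuffled"] += model.answer(shuffled, sc.question) == sc.answer
        scores["single"] += model.answer([sc.frames[-1]], sc.question) == sc.answer
    n = len(scenarios)
    return {k: v / n for k, v in scores.items()}
```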

Circularity Check

0 steps flagged

No circularity: purely empirical benchmarking with direct measurements

full rationale

This paper introduces the VENUSS framework for sensitivity analysis of VLMs on sequential driving scenes. It extracts temporal sequences from existing driving video datasets, generates custom question categories, and reports accuracy of 25+ models across 2,600+ scenarios against human baselines. There are no mathematical derivations, first-principles predictions, fitted parameters, or self-referential definitions. All results are direct empirical measurements; the claims about capability gaps in dynamics and temporal relations depend on the benchmark's construction, but the outputs are not reduced to the inputs by construction. Self-citations are absent from load-bearing steps, and the study is self-contained as an external evaluation protocol.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The work is an empirical sensitivity study that relies on existing driving video datasets and off-the-shelf VLMs; no new physical or mathematical axioms are introduced.


