arxiv: 2604.06750 · v2 · pith:4FJ3YBJJnew · submitted 2026-04-08 · 💻 cs.CV

How Well Do Vision-Language Models Understand Sequential Driving Scenes? A Sensitivity Study

Pith reviewed 2026-05-21 09:49 UTC · model grok-4.3

classification 💻 cs.CV
keywords vision-language models · sequential driving scenes · sensitivity analysis · temporal understanding · autonomous driving · model evaluation · vehicle dynamics

The pith

Even the strongest vision-language models reach only 57% accuracy on sequential driving scenes, below the 65% human baseline.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces VENUSS, a framework that extracts temporal sequences from driving videos and runs structured tests of how well vision-language models (VLMs) handle them. It evaluates more than 25 existing models across more than 2,600 scenarios and finds consistent shortfalls, especially on questions about vehicle motion or changes across frames. Models handle static object detection reasonably well but are clearly weaker on vehicle dynamics and temporal relations. The work also maps how input choices such as image resolution, number of frames, and layout affect results. This matters because VLMs are increasingly proposed for self-driving tasks that require sequence understanding.

Core claim

Building on existing driving datasets, VENUSS extracts temporal sequences from videos and generates custom evaluation categories to test VLMs. Across more than 25 models and more than 2,600 scenarios, the strongest models reach 57% accuracy, short of the 65% human baseline measured under similar constraints. Performance is stronger on static object detection but drops on questions about vehicle dynamics and temporal relations. The framework further shows that input configurations (resolution, frame count, temporal intervals, spatial layouts, and presentation modes) produce measurable differences in performance.

What carries the argument

VENUSS, a sensitivity-analysis framework that extracts temporal sequences from driving videos and produces structured evaluations across custom categories to measure how input configurations affect VLM performance.
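To make the idea of "input configurations" concrete, the following is a minimal sketch of how such a configuration space could be enumerated. The value ranges are taken from the axes named in the Figure 4 caption (resolution levels 1-6, time intervals in 100 ms increments 1-10, image counts 1-10, presentation modes b/c/s). The `InputConfig` class, the `enumerate_configs` function, and the full-factorial enumeration are illustrative assumptions, not the authors' implementation; the paper may well sample or vary one dimension at a time instead.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class InputConfig:
    """One hypothetical input configuration for a VLM query."""
    resolution_level: int  # 1-6, per the Figure 4(a) axis
    interval_ms: int       # 100-1000 ms in 100 ms steps, per Figure 4(b)
    num_images: int        # 1-10, per Figure 4(c)
    mode: str              # "b"=batch, "c"=collage, "s"=separate, per Figure 4(d)


def enumerate_configs():
    """Full-factorial grid over the four dimensions (an assumption)."""
    resolutions = range(1, 7)
    intervals = range(100, 1001, 100)
    counts = range(1, 11)
    modes = ("b", "c", "s")
    return [
        InputConfig(r, t, n, m)
        for r, t, n, m in product(resolutions, intervals, counts, modes)
    ]


configs = enumerate_configs()
print(len(configs))  # 6 * 10 * 10 * 3 = 1800
```

A full grid of this size multiplied by 25+ models is expensive, which is one reason a real study might restrict itself to a subset of these combinations.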

If this is right

  • Current VLMs are unlikely to support reliable decision making in dynamic driving environments without further advances in temporal reasoning.
  • Performance on sequential driving tasks can be improved or degraded by changing resolution, frame count, or spatial arrangement of inputs.
  • Evaluation benchmarks focused on static scenes will miss the largest capability gaps in models intended for driving.
  • The VENUSS categories provide a concrete way to track progress on vehicle-dynamics and timing understanding in future models.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Models that close the gap on VENUSS dynamics questions may also show gains on other time-sensitive vision tasks such as action recognition in video.
  • Hybrid systems that pair VLMs with explicit motion trackers could bypass some of the temporal weaknesses shown here.
  • If the performance patterns hold across additional driving datasets, the results would point to a general architectural limit rather than a data-specific issue.

Load-bearing premise

The extracted temporal sequences and custom evaluation categories from existing driving datasets accurately measure genuine sequential understanding rather than dataset-specific artifacts or annotation biases.

What would settle it

Run the same VENUSS scenarios on a new model or human group and observe whether accuracy on dynamics and temporal questions rises above 65% while static detection stays flat, or whether removing time intervals from the inputs leaves scores unchanged.

Figures

Figures reproduced from arXiv: 2604.06750 by Johannes Betz, Mattia Piccinini, Roberto Brusnicki.

Figure 1: VENUSS framework overview. Starting from driving datasets, VENUSS generates structured evaluation data with controlled variations in image count, timing, resolution, layout, and presentation mode. It evaluates both VLMs and humans on identical tasks across custom categories, identifies optimal input configurations per model, and establishes performance baselines.

Figure 2: Visualization of all CoVLA textual descriptions automatically categorized by VENUSS. The framework identified seven categories from the natural language descriptions: motion types (blue), velocity descriptors (yellow), directional behaviors (orange), acceleration patterns (purple), following behavior (dark green), traffic light conditions (light red), and road curvature detection (gray).

Figure 3: Simplified example of the human evaluation questionnaire with the seven-category format (Sec. III-C). The interface presents sequential driving images with temporal intervals, and seven multiple-choice questions for the corresponding seven categories. The final answer key (bottom right) concatenates the responses for comparison with ground truth annotations.

Figure 4: Performance analysis across 4 dimensions as described in Sec. IV-A. (a) Performance by resolution level (1-6), (b) Performance by time interval (100ms increments, 1-10), (c) Performance by number of images (1-10), and (d) Performance by presentation mode (b=batch, c=collage, s=separate). Horizontal lines show human baselines: green for collage-based evaluation, blue for GIF-based evaluation. Box-and-whisker plo…

Figure 5: Performance comparison across different grid formats for the image sequences. Specific grid layouts demonstrate better performance for temporal understanding tasks.
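The answer-key comparison described in the Figure 3 caption can be sketched as follows. This is a hedged illustration only: the category identifiers, the one-letter-per-category string encoding, and both function names are assumptions made for the example, since the page does not specify how the answer key is actually encoded or scored.

```python
# Short identifiers for the seven categories listed in the Figure 2 caption
# (the identifiers themselves are hypothetical).
CATEGORIES = (
    "motion", "velocity", "direction", "acceleration",
    "following", "traffic_light", "curvature",
)


def score_answer_key(predicted, truth):
    """Compare two concatenated answer keys, one character per category."""
    if len(predicted) != len(CATEGORIES) or len(truth) != len(CATEGORIES):
        raise ValueError("answer keys must have one entry per category")
    return {cat: p == t for cat, p, t in zip(CATEGORIES, predicted, truth)}


def per_category_accuracy(pairs):
    """Average correctness per category over (predicted, truth) pairs."""
    if not pairs:
        raise ValueError("need at least one scenario")
    totals = {cat: 0 for cat in CATEGORIES}
    for predicted, truth in pairs:
        for cat, ok in score_answer_key(predicted, truth).items():
            totals[cat] += ok
    return {cat: totals[cat] / len(pairs) for cat in CATEGORIES}


# Made-up answer keys for two scenarios, NOT data from the paper.
pairs = [("ABCABCA", "ABCABCB"), ("BBCABCB", "ABCABCB")]
accuracy = per_category_accuracy(pairs)
print(accuracy["motion"], accuracy["curvature"], accuracy["velocity"])
```

Scoring per category rather than per key is what lets a study separate static cues (for example traffic light conditions) from dynamics such as acceleration patterns.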
Abstract

Vision-Language Models (VLMs) are increasingly proposed for autonomous driving tasks, yet their performance on sequential driving scenes remains poorly characterized, particularly regarding how input configurations affect their capabilities. We introduce VENUSS (VLM Evaluation oN Understanding Sequential Scenes), a framework for systematic sensitivity analysis of VLM performance on sequential driving scenes, establishing baselines for future research. Building upon existing datasets, VENUSS extracts temporal sequences from driving videos, and generates structured evaluations across custom categories. By comparing 25+ existing VLMs across 2,600+ scenarios, we reveal how even top models achieve only 57% accuracy, not matching human performance under similar constraints (65%) and exposing significant capability gaps. Our analysis shows that VLMs excel with static object detection but struggle with understanding vehicle dynamics and temporal relations. VENUSS offers the first systematic sensitivity analysis of VLMs focused on how input image configurations - resolution, frame count, temporal intervals, spatial layouts, and presentation modes - affect performance on sequential driving scenes. Supplementary material available at https://TUM-AVS.github.io/VENUSS/.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces the VENUSS framework for evaluating vision-language models (VLMs) on their understanding of sequential driving scenes. It builds on existing driving datasets to extract temporal sequences and conducts a large-scale evaluation of over 25 VLMs across more than 2,600 scenarios in custom categories such as vehicle dynamics and temporal relations. The central claims are that top-performing VLMs achieve only 57% accuracy, falling short of human performance at 65%, and that VLMs perform better on static object detection than on dynamic and temporal aspects. The paper also includes a sensitivity study on how input configurations like resolution, frame count, temporal intervals, spatial layouts, and presentation modes affect VLM performance.

Significance. If the empirical results hold, this study would be significant for the computer vision and autonomous driving communities: it establishes baselines for VLM capabilities in sequential scene understanding and identifies specific weaknesses in handling temporal information. The systematic sensitivity analysis of input configurations provides actionable insights for improving VLM applications in driving. Credit is due for the scale of the evaluation (25+ models, 2,600+ scenarios) and for making supplementary material available, which aids reproducibility.

major comments (2)
  1. [Results and Evaluation sections] The accuracy figures (e.g., 57% for top models and category-wise breakdowns) are reported without error bars, statistical significance tests, or details on exact prompt templates. This is load-bearing for the central performance-gap claim relative to the 65% human baseline.
  2. [Human Baseline subsection] Details on human baseline collection (exact frame count, resolution, and presentation mode) are insufficient to confirm the input conditions were matched to those used for VLMs. This directly affects the validity of the 57% vs. 65% gap as evidence of capability differences.
minor comments (2)
  1. [Abstract] The abstract references supplementary material at https://TUM-AVS.github.io/VENUSS/ but the main text would benefit from explicit cross-references to specific tables or figures that present the sensitivity analysis outcomes.
  2. [Dataset and Category Definition] Clarify with concrete examples how the custom categories (vehicle dynamics, temporal relations) are defined and annotated to distinguish them from static cues that may correlate with labels.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their detailed and constructive comments on our manuscript. We address each of the major comments below and outline the revisions we will make to improve the clarity and rigor of our evaluation.

  1. Referee: [Results and Evaluation sections] The accuracy figures (e.g., 57% for top models and category-wise breakdowns) are reported without error bars, statistical significance tests, or details on exact prompt templates. This is load-bearing for the central performance-gap claim relative to the 65% human baseline.

    Authors: We acknowledge the importance of providing error bars, statistical tests, and prompt details to support our claims. In the revised manuscript, we will add the exact prompt templates to the supplementary material and describe them in the main text. Additionally, we will include error bars computed via bootstrapping over the scenarios and report results of statistical significance tests (such as Wilcoxon signed-rank tests) comparing VLM performances to the human baseline. These additions will be incorporated into the Results and Evaluation sections. revision: yes

  2. Referee: [Human Baseline subsection] Details on human baseline collection (exact frame count, resolution, and presentation mode) are insufficient to confirm the input conditions were matched to those used for VLMs. This directly affects the validity of the 57% vs. 65% gap as evidence of capability differences.

    Authors: Thank you for pointing this out. We will revise the Human Baseline subsection to provide precise details on the collection process, confirming that the human participants were shown the same sequential frames with identical frame counts, resolutions, and presentation modes as those used in the VLM evaluations. Any deviations, if present, will be explicitly noted along with their rationale. This will ensure the comparison is fair and the performance gap is interpretable. revision: yes
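The bootstrapped error bars promised in the first response could look roughly like the sketch below. It is a plain percentile bootstrap over per-scenario outcomes, written with the standard library only; the function name, the resample count, the fixed seed, and the synthetic outcome vector are all assumptions for illustration and do not reproduce the authors' procedure or data.

```python
import random


def bootstrap_accuracy_ci(correct, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for mean accuracy.

    `correct` is a list of 0/1 outcomes, one per scenario.
    Returns (point_estimate, lower_bound, upper_bound).
    """
    if not correct:
        raise ValueError("need at least one scenario outcome")
    rng = random.Random(seed)
    n = len(correct)
    means = []
    for _ in range(n_resamples):
        # Resample scenarios with replacement and recompute accuracy.
        sample = rng.choices(correct, k=n)
        means.append(sum(sample) / n)
    means.sort()
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return sum(correct) / n, lower, upper


# Synthetic outcomes (1 = correct answer), NOT results from the paper.
outcomes = [1] * 57 + [0] * 43
acc, lower, upper = bootstrap_accuracy_ci(outcomes)
print(f"accuracy={acc:.2f}, 95% CI=({lower:.2f}, {upper:.2f})")
```

Whether the 57% versus 65% gap is meaningful depends on how wide such intervals are for the real number of scenarios. A paired test on per-scenario outcomes, such as the Wilcoxon signed-rank test the authors mention, would complement the interval.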

Circularity Check

0 steps flagged

No circularity in empirical benchmark study

The paper introduces VENUSS as an empirical evaluation framework that extracts temporal sequences from existing driving datasets and directly measures VLM accuracy (57% top models) against human baselines (65%) across 2,600+ scenarios. No mathematical derivations, fitted parameters renamed as predictions, self-citation load-bearing arguments, or uniqueness theorems are present; all headline results are direct comparisons on held-out sequences using custom categories, making the study self-contained against external benchmarks and human labels with no reduction of outputs to inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claims rest on the assumption that existing driving datasets contain representative sequential scenes and that the authors' custom categories isolate temporal understanding without introducing new biases. No free parameters or invented physical entities are introduced; the only new construct is the evaluation framework itself.

axioms (1)
  • domain assumption: Existing driving video datasets contain temporal sequences that are sufficient to test VLM sequential understanding when reorganized into the authors' categories.
    Invoked when the paper states it builds upon existing datasets to extract sequences.
invented entities (1)
  • VENUSS framework (no independent evidence)
    purpose: Systematic sensitivity analysis of VLMs on sequential driving scenes
    Newly defined evaluation pipeline that varies input configurations and generates structured tests.

pith-pipeline@v0.9.0 · 5722 in / 1280 out tokens · 40306 ms · 2026-05-21T09:49:23.084404+00:00 · methodology

