How Well Do Vision-Language Models Understand Sequential Driving Scenes? A Sensitivity Study
Pith reviewed 2026-05-21 09:49 UTC · model grok-4.3
The pith
Even the strongest vision-language models reach only 57% accuracy on sequential driving scenes, below the 65% human baseline.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Building on existing driving datasets, VENUSS extracts temporal sequences from videos and generates custom evaluation categories to test VLMs. Across more than 25 models and more than 2,600 scenarios, the strongest models reach 57% accuracy, short of the 65% human baseline measured under comparable conditions. Performance is stronger on static object detection tasks but drops for questions about vehicle dynamics and temporal relations. The framework further shows that input configurations, including resolution, frame count, temporal intervals, spatial layouts, and presentation modes, produce measurable differences in output quality.
What carries the argument
VENUSS, a sensitivity-analysis framework that extracts temporal sequences from driving videos and produces structured evaluations across custom categories to measure how input configurations affect VLM performance.
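To make the idea of an input-configuration sweep concrete, the following is a minimal sketch of how the five dimensions named above could be enumerated and how a frame count and temporal interval map to frame indices. The class `InputConfig`, the helpers `frame_indices` and `config_grid`, and every value in the grid are hypothetical illustrations; the paper's actual settings and code are not given in this review.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class InputConfig:
    # Field names and values are illustrative assumptions, not the paper's settings.
    resolution: int    # longer image side in pixels
    frame_count: int   # number of frames shown to the model
    interval_s: float  # seconds between consecutive frames
    layout: str        # e.g. "grid" or "row"
    mode: str          # e.g. "collage" (one image) or "separate" (one image per frame)


def frame_indices(fps: float, start_s: float, frame_count: int, interval_s: float) -> list[int]:
    """Return the video frame indices of a sequence that starts at start_s."""
    return [round((start_s + k * interval_s) * fps) for k in range(frame_count)]


def config_grid(resolutions, frame_counts, intervals, layouts, modes) -> list[InputConfig]:
    """Enumerate the Cartesian product of the five input dimensions."""
    return [InputConfig(*combo)
            for combo in product(resolutions, frame_counts, intervals, layouts, modes)]


grid = config_grid([448, 896], [2, 4], [0.5, 1.0], ["grid", "row"], ["collage", "separate"])
print(len(grid))                         # 32 configurations
print(frame_indices(10.0, 2.0, 4, 0.5))  # [20, 25, 30, 35]
```

A full Cartesian grid grows quickly, so a real study might vary one dimension at a time around a default configuration instead.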
If this is right
- Current VLMs are unlikely to support reliable decision making in dynamic driving environments without further advances in temporal reasoning.
- Performance on sequential driving tasks can be improved or degraded by changing resolution, frame count, or spatial arrangement of inputs.
- Evaluation benchmarks focused on static scenes will miss the largest capability gaps in models intended for driving.
- The VENUSS categories provide a concrete way to track progress on vehicle-dynamics and timing understanding in future models.
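Tracking progress per category, as the last point suggests, amounts to a grouped accuracy computation. The sketch below assumes each evaluated scenario yields a `(category, is_correct)` pair; the function name, the category labels, and the records are made up for illustration and are not results from the paper.

```python
from collections import defaultdict


def accuracy_by_category(records):
    """records: iterable of (category, is_correct) pairs; returns {category: accuracy}."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for category, is_correct in records:
        totals[category] += 1
        hits[category] += int(is_correct)
    return {category: hits[category] / totals[category] for category in totals}


# Made-up records for illustration only; category names are assumed labels.
records = [
    ("static_objects", True), ("static_objects", True),
    ("static_objects", False), ("static_objects", True),
    ("vehicle_dynamics", True), ("vehicle_dynamics", False),
    ("temporal_relations", False), ("temporal_relations", True),
]
print(accuracy_by_category(records))
# {'static_objects': 0.75, 'vehicle_dynamics': 0.5, 'temporal_relations': 0.5}
```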
Where Pith is reading between the lines
- Models that close the gap on VENUSS dynamics questions may also show gains on other time-sensitive vision tasks such as action recognition in video.
- Hybrid systems that pair VLMs with explicit motion trackers could bypass some of the temporal weaknesses shown here.
- If the performance patterns hold across additional driving datasets, the results would point to a general architectural limit rather than a data-specific issue.
Load-bearing premise
The extracted temporal sequences and custom evaluation categories from existing driving datasets accurately measure genuine sequential understanding rather than dataset-specific artifacts or annotation biases.
What would settle it
Run the same VENUSS scenarios on a new model or human group and observe whether accuracy on dynamics and temporal questions rises above 65% while static detection stays flat, or whether removing time intervals from the inputs leaves scores unchanged.
Original abstract
Vision-Language Models (VLMs) are increasingly proposed for autonomous driving tasks, yet their performance on sequential driving scenes remains poorly characterized, particularly regarding how input configurations affect their capabilities. We introduce VENUSS (VLM Evaluation oN Understanding Sequential Scenes), a framework for systematic sensitivity analysis of VLM performance on sequential driving scenes, establishing baselines for future research. Building upon existing datasets, VENUSS extracts temporal sequences from driving videos, and generates structured evaluations across custom categories. By comparing 25+ existing VLMs across 2,600+ scenarios, we reveal how even top models achieve only 57% accuracy, not matching human performance under similar constraints (65%) and exposing significant capability gaps. Our analysis shows that VLMs excel with static object detection but struggle with understanding vehicle dynamics and temporal relations. VENUSS offers the first systematic sensitivity analysis of VLMs focused on how input image configurations - resolution, frame count, temporal intervals, spatial layouts, and presentation modes - affect performance on sequential driving scenes. Supplementary material available at https://TUM-AVS.github.io/VENUSS/.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces the VENUSS framework for evaluating vision-language models (VLMs) on their understanding of sequential driving scenes. It builds on existing driving datasets to extract temporal sequences and conducts a large-scale evaluation of over 25 VLMs across more than 2,600 scenarios in custom categories such as vehicle dynamics and temporal relations. The central claims are that top-performing VLMs achieve only 57% accuracy, falling short of human performance at 65%, and that VLMs perform better on static object detection than on dynamic and temporal aspects. The paper also includes a sensitivity study on how input configurations like resolution, frame count, temporal intervals, spatial layouts, and presentation modes affect VLM performance.
Significance. If the empirical results hold, this study would be significant for the computer vision and autonomous driving communities by establishing baselines for VLM capabilities in sequential scene understanding and identifying specific weaknesses in handling temporal information. The systematic sensitivity analysis on input configurations provides actionable insights for improving VLM applications in driving. Credit is due for the scale of the evaluation (25+ models, 2,600+ scenarios) and for making supplementary material available, which aids reproducibility.
major comments (2)
- [Results and Evaluation sections] The accuracy figures (e.g., 57% for top models and category-wise breakdowns) are reported without error bars, statistical significance tests, or details on exact prompt templates. This is load-bearing for the central performance-gap claim relative to the 65% human baseline.
- [Human Baseline subsection] Details on human baseline collection (exact frame count, resolution, and presentation mode) are insufficient to confirm the input conditions were matched to those used for VLMs. This directly affects the validity of the 57% vs. 65% gap as evidence of capability differences.
minor comments (2)
- [Abstract] The abstract references supplementary material at https://TUM-AVS.github.io/VENUSS/ but the main text would benefit from explicit cross-references to specific tables or figures that present the sensitivity analysis outcomes.
- [Dataset and Category Definition] Clarify with concrete examples how the custom categories (vehicle dynamics, temporal relations) are defined and annotated to distinguish them from static cues that may correlate with labels.
Simulated Author's Rebuttal
We thank the referee for their detailed and constructive comments on our manuscript. We address each of the major comments below and outline the revisions we will make to improve the clarity and rigor of our evaluation.
- Referee: [Results and Evaluation sections] The accuracy figures (e.g., 57% for top models and category-wise breakdowns) are reported without error bars, statistical significance tests, or details on exact prompt templates. This is load-bearing for the central performance-gap claim relative to the 65% human baseline.
  Authors: We acknowledge the importance of providing error bars, statistical tests, and prompt details to support our claims. In the revised manuscript, we will add the exact prompt templates to the supplementary material and describe them in the main text. Additionally, we will include error bars computed via bootstrapping over the scenarios and report results of statistical significance tests (such as Wilcoxon signed-rank tests) comparing VLM performances to the human baseline. These additions will be incorporated into the Results and Evaluation sections.
  Revision: yes
- Referee: [Human Baseline subsection] Details on human baseline collection (exact frame count, resolution, and presentation mode) are insufficient to confirm the input conditions were matched to those used for VLMs. This directly affects the validity of the 57% vs. 65% gap as evidence of capability differences.
  Authors: Thank you for pointing this out. We will revise the Human Baseline subsection to provide precise details on the collection process, confirming that the human participants were shown the same sequential frames with identical frame counts, resolutions, and presentation modes as those used in the VLM evaluations. Any deviations, if present, will be explicitly noted along with their rationale. This will ensure the comparison is fair and the performance gap is interpretable.
  Revision: yes
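The bootstrapped error bars promised in the first response can be sketched as a percentile bootstrap over per-scenario outcomes. This is a minimal illustration under stated assumptions, not the authors' procedure: the function name, the resample count, and the synthetic 0/1 outcomes are all hypothetical, and the Wilcoxon signed-rank test mentioned in the rebuttal is not covered here.

```python
import random


def bootstrap_accuracy_ci(correct, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for mean accuracy over per-scenario 0/1 outcomes."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(correct)
    means = []
    for _ in range(n_resamples):
        # Resample scenarios with replacement and record the resampled accuracy.
        sample = [correct[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return sum(correct) / n, lower, upper


# Synthetic outcomes for illustration only (57 correct out of 100 scenarios);
# these are not data from the paper.
outcomes = [1] * 57 + [0] * 43
acc, lower, upper = bootstrap_accuracy_ci(outcomes)
print(f"accuracy={acc:.2f}, 95% CI=[{lower:.2f}, {upper:.2f}]")
```

Resampling at the scenario level treats scenarios as independent; if several questions share one video clip, resampling whole clips would be the more conservative choice.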
Circularity Check
No circularity in empirical benchmark study
The paper introduces VENUSS as an empirical evaluation framework that extracts temporal sequences from existing driving datasets and directly measures VLM accuracy (57% top models) against human baselines (65%) across 2,600+ scenarios. No mathematical derivations, fitted parameters renamed as predictions, self-citation load-bearing arguments, or uniqueness theorems are present; all headline results are direct comparisons on held-out sequences using custom categories, making the study self-contained against external benchmarks and human labels with no reduction of outputs to inputs by construction.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Existing driving video datasets contain temporal sequences that are sufficient to test VLM sequential understanding when reorganized into the authors' categories.
invented entities (1)
- VENUSS framework: no independent evidence