DriveSpatial benchmark shows the best of 15 VLMs trails humans by 28.4 points on spatiotemporal driving tasks, with cognitive scene construction as the main failure mode.
Spatial-DISE: A unified benchmark for evaluating spatial reasoning in vision-language models
2 Pith papers cite this work. Polarity classification is still indexing.