DRIVESPATIAL: A Benchmark for Spatiotemporal Intelligence in VLMs for Autonomous Driving

Anh Nguyen; Chase Rainwater; Duc Minh Nguyen; Duy Minh Ho Nguyen; Gladys Gawugah; Hao Vo; Khoa Vo; Ngan Le; Nghi D. Q. Bui; Ngo Xuan Cuong

REVIEW · 2 major objections · 2 minor · 111 references

DriveSpatial benchmark shows vision-language models trail humans by 28.4 points on spatiotemporal driving tasks, limited by cognitive scene construction.

Reviewed by Pith at T0; open to challenge. T0 means a machine referee read the full paper against a public rubric.

T0 review · grok-4.3

2026-06-30 16:37 UTC pith:FKTQIMTM

load-bearing objection: DriveSpatial builds a scene-graph benchmark to test multi-view temporal reasoning in VLMs and reports a 28-point human gap, but the abstract gives almost no supporting details on construction or stats.

arxiv 2605.23176 v2 pith:FKTQIMTM submitted 2026-05-22 cs.CV

Hao Vo, Khoa Vo, Phu Loc Nguyen, Sieu Tran, Duc Minh Nguyen, Ngo Xuan Cuong, Gladys Gawugah, Sreevenkata Anjani Tishita Godavarthi, Chase Rainwater, Nghi D. Q. Bui, Anh Nguyen, Duy Minh Ho Nguyen, Ngan Le

classification cs.CV

keywords: vision-language models, autonomous driving, spatiotemporal reasoning, scene construction, multi-view understanding, temporal reasoning, benchmark, cognitive scene construction

verification ladder: T0 review · T1 audit · T2 compute · T3 formal · T4 reserved

The pith

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper creates DriveSpatial to check if vision-language models can build a unified understanding of a driving scene from several camera views, keep track of objects over time, and reason about their positions and interactions. Questions come from a scene graph that records object details, how they relate in space, how they interact, which cameras see them, and how things change over time. When 15 different models are tested, even the best one falls 28.4 points short of human performance, and the biggest problem is constructing the scene in the first place. This indicates that today's models do not yet have the kind of scene-building skill required for safe autonomous driving decisions.

Core claim

DriveSpatial evaluates four abilities in VLMs: Cognitive Scene Construction, Multi-view Relational Understanding, Temporal Reasoning, and Generalization using 15.6K QA pairs. The benchmark is built on a dynamic multi-relational scene graph encoding object states, spatial relations, interactions, camera visibility, and temporal correspondences. Results show the strongest VLM trails humans by 28.4 points with Cognitive Scene Construction as the key bottleneck, suggesting current VLMs lack the scene-construction ability needed for reliable spatiotemporal driving intelligence. Explicit BEV grounding improves performance while language-only prompting does not.

What carries the argument

A dynamic multi-relational scene graph that encodes object states, spatial relations, interactions, camera visibility, and temporal correspondences to generate QA pairs enforcing genuine cross-view and spatiotemporal reasoning.
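To make the ingredients of such a graph concrete, here is a minimal sketch of one way the five kinds of information could be stored. It is not the authors' implementation: the class names, field names, and the `"temporal"`/`"spatial"` edge labels are all hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class ObjectState:
    """One object at one timestep. All field names are hypothetical."""
    track_id: str            # identity shared across timesteps
    timestep: int
    category: str
    position_xy: tuple       # (x, y) in an assumed ego-centric BEV frame, metres
    cameras: frozenset       # names of the cameras that see the object


@dataclass
class SceneGraph:
    """Nodes are (track_id, timestep) keys; edges are typed tuples."""
    nodes: dict = field(default_factory=dict)
    edges: list = field(default_factory=list)  # (kind, src_key, dst_key, label)

    def add_object(self, state):
        key = (state.track_id, state.timestep)
        self.nodes[key] = state
        # Temporal correspondence: link to the same track one step earlier.
        prev = (state.track_id, state.timestep - 1)
        if prev in self.nodes:
            self.edges.append(("temporal", prev, key, "same_object"))
        return key

    def add_relation(self, kind, src, dst, label):
        """Add a spatial or interaction edge between two existing nodes."""
        if src not in self.nodes or dst not in self.nodes:
            raise KeyError("both endpoints must already be in the graph")
        self.edges.append((kind, src, dst, label))

    def relations(self, kind):
        return [e for e in self.edges if e[0] == kind]


# Illustrative data only; camera names and coordinates are made up.
graph = SceneGraph()
car_t0 = graph.add_object(ObjectState("car_1", 0, "car", (10.0, 2.0), frozenset({"CAM_FRONT"})))
car_t1 = graph.add_object(ObjectState("car_1", 1, "car", (9.0, 2.0), frozenset({"CAM_FRONT"})))
ped_t0 = graph.add_object(ObjectState("ped_1", 0, "pedestrian", (-5.0, 1.0), frozenset({"CAM_BACK"})))
graph.add_relation("spatial", car_t0, ped_t0, "in_front_of")
```

A rule-based question generator could then walk `graph.relations("spatial")` or `graph.relations("temporal")` to produce question templates, which is the role the paper assigns to its scene graph.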

Load-bearing premise

The generated QA pairs require models to perform actual cross-view and spatiotemporal reasoning rather than relying on statistical shortcuts or single-view cues.

What would settle it

A VLM reaching near-human scores on DriveSpatial while still failing to maintain object continuity or spatial relations in a closed-loop driving simulation would show the benchmark does not test the claimed ability.

If this is right

Language-only prompting proves insufficient for these tasks.
Explicit BEV grounding consistently improves VLM performance on the benchmark.
Cognitive Scene Construction remains the primary performance bottleneck compared to the other three abilities.
The 15.6K QA pairs cover 20 tasks drawn from five large-scale autonomous driving datasets.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Models may need training approaches that explicitly reward building internal scene representations instead of surface-level pattern matching.
The same construction limits could appear in other multi-view temporal tasks such as robotic manipulation or surveillance analysis.
Releasing the scene-graph pipeline makes it possible to test whether the identified gap persists when new datasets or question types are added.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit.

Desk Editor's Note

The main point is that this paper introduces DriveSpatial, a benchmark of 15.6K QA pairs drawn from five AD datasets and generated via a dynamic multi-relational scene graph. It targets four abilities—cognitive scene construction, multi-view relational understanding, temporal reasoning, and generalization—across 20 tasks, then evaluates 15 VLMs and finds the best model 28.4 points behind humans, with scene construction as the clear weak area. Explicit BEV grounding helps while language-only prompts do not.

What stands out is the attempt to move past single-view or static benchmarks by using the scene graph to encode states, relations, visibility, and temporal links. That setup is meant to force genuine cross-view and spatiotemporal reasoning instead of shortcuts. The scale and the human-model comparison are concrete, and the diagnostics on prompting give a practical signal.

The soft spot is the absence of any real evidence on how the QA pairs were verified, how the splits were chosen, whether the gap is statistically reliable, or what error patterns look like. The abstract states the numbers but supplies none of the usual checks that would let a reader confirm the scene graph actually prevents easy solutions. Without those, the central claim rests on trust rather than shown rigor.

This is for groups working on VLMs for driving or on spatiotemporal benchmarks. A reader who needs task definitions or a quick model ranking would find it useful. It deserves peer review because the problem it targets is real and the generation approach is a reasonable step, but the current write-up needs the missing methodological pieces before the results can be taken as settled.

Referee Report

2 major / 2 minor

Summary. The paper introduces DriveSpatial, a benchmark of 15.6K human-verified QA pairs spanning 20 tasks drawn from five large-scale autonomous driving datasets. It targets four core abilities in VLMs—Cognitive Scene Construction, Multi-view Relational Understanding, Temporal Reasoning, and Generalization—by constructing questions from a dynamic multi-relational scene graph that encodes object states, spatial relations, interactions, camera visibility, and temporal correspondences. Evaluation of 15 representative VLMs shows the strongest model trails human performance by 28.4 points, with Cognitive Scene Construction identified as the primary bottleneck; language-only prompting is shown to be insufficient while explicit BEV grounding improves results. The authors conclude that current VLMs lack the scene-construction capacity required for reliable spatiotemporal driving intelligence and will release the benchmark and construction pipeline.

Significance. If the evaluation results and scene-graph construction hold under scrutiny, the work is significant for the autonomous-driving and VLM communities. It moves beyond existing single-view or static benchmarks by enforcing cross-view and temporal reasoning, and the release of the dataset plus pipeline constitutes a concrete contribution that can support reproducible follow-up research. The identification of scene construction as the dominant failure mode supplies a falsifiable direction for model improvement.

major comments (2)

Abstract and §4 (Evaluation): the reported 28.4-point human-model gap and the claim that Cognitive Scene Construction is the key bottleneck are presented without accompanying details on data splits, statistical significance testing, per-task error analysis, or inter-annotator agreement for the human-verified QA pairs. These omissions make it impossible to verify that the gap is robust rather than an artifact of a particular split or annotation procedure.
§3 (Benchmark Construction): the central assumption that the dynamic multi-relational scene graph forces genuine cross-view and spatiotemporal reasoning (rather than permitting shortcut solutions) is stated but not accompanied by an explicit validation experiment, such as an ablation that removes temporal correspondences or visibility constraints and measures the resulting change in VLM performance.
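The significance check requested in the first major comment could take the form of a paired bootstrap over questions. The sketch below is one possible version, not a procedure described in the paper: the function name, the per-question 0/1 correctness encoding, and the synthetic data are assumptions made for illustration.

```python
import random


def bootstrap_gap_ci(human_correct, model_correct, n_resamples=2000, alpha=0.05, seed=0):
    """Paired percentile-bootstrap CI for the accuracy gap (human - model), in points.

    Both inputs are per-question 0/1 correctness lists over the same questions.
    """
    if len(human_correct) != len(model_correct) or not human_correct:
        raise ValueError("inputs must be non-empty and aligned per question")
    n = len(human_correct)
    rng = random.Random(seed)
    # Per-question differences keep the pairing between human and model answers.
    diffs = [h - m for h, m in zip(human_correct, model_correct)]
    point = 100.0 * sum(diffs) / n
    gaps = []
    for _ in range(n_resamples):
        resample = [diffs[rng.randrange(n)] for _ in range(n)]
        gaps.append(100.0 * sum(resample) / n)
    gaps.sort()
    lower = gaps[int((alpha / 2) * n_resamples)]
    upper = gaps[int((1 - alpha / 2) * n_resamples) - 1]
    return point, lower, upper


# Synthetic example, NOT the paper's data: 100 questions,
# humans answer 90 correctly and the model answers 60 correctly.
human = [1] * 90 + [0] * 10
model = [1] * 60 + [0] * 40
point, lower, upper = bootstrap_gap_ci(human, model)
```

If the lower bound of the interval stays above zero, the gap is unlikely to be an artifact of which questions happened to be sampled; it says nothing about split selection or annotation quality, which the comment raises separately.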

minor comments (2)

§2 (Related Work): several prior AD-VLM benchmarks are cited; a concise table comparing task coverage, number of QA pairs, and use of multi-view/temporal graphs would improve readability.
Figure 1 and §3.2: the caption and surrounding text should explicitly state the total number of unique scene graphs and the distribution of QA pairs across the five source datasets.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on the evaluation details and benchmark validation. We address each major comment below and will revise the manuscript accordingly to strengthen the claims.

Referee: Abstract and §4 (Evaluation): the reported 28.4-point human-model gap and the claim that Cognitive Scene Construction is the key bottleneck are presented without accompanying details on data splits, statistical significance testing, per-task error analysis, or inter-annotator agreement for the human-verified QA pairs. These omissions make it impossible to verify that the gap is robust rather than an artifact of a particular split or annotation procedure.

Authors: We agree these details are necessary to establish robustness. In the revised manuscript we will expand §4 (and add an appendix section) to include: explicit train/validation/test splits across the five source datasets; statistical significance testing (bootstrap resampling with 95% CI and paired tests) on the 28.4-point gap and per-ability scores; a full per-task error breakdown; and inter-annotator agreement statistics (Cohen’s κ and raw agreement) for the human verification step. These additions will directly support the reported gap and the identification of Cognitive Scene Construction as the bottleneck. Revision: yes.

Referee: §3 (Benchmark Construction): the central assumption that the dynamic multi-relational scene graph forces genuine cross-view and spatiotemporal reasoning (rather than permitting shortcut solutions) is stated but not accompanied by an explicit validation experiment, such as an ablation that removes temporal correspondences or visibility constraints and measures the resulting change in VLM performance.

Authors: The scene-graph construction explicitly encodes temporal correspondences and camera visibility to block shortcuts, as detailed in §3. We nevertheless recognize that an empirical ablation would provide stronger evidence. We will generate two controlled variants of the benchmark—one with temporal correspondences removed and one with visibility constraints removed—and report VLM performance deltas on these variants in the revised §3. This will quantify how much the enforced constraints affect model scores versus potential shortcuts. Revision: yes.
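For readers unfamiliar with the agreement statistic named in the first response, Cohen's κ compares observed agreement with the agreement expected by chance: κ = (p_o − p_e) / (1 − p_e). The following is a minimal sketch for two annotators; the function name and the "ok"/"bad" verdict labels are hypothetical, and the data are made up.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("inputs must be non-empty and aligned per item")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each annotator's marginal label frequencies.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1.0 - expected)


# Made-up verification verdicts for four QA pairs.
annotator_1 = ["ok", "ok", "bad", "ok"]
annotator_2 = ["ok", "bad", "bad", "ok"]
kappa = cohens_kappa(annotator_1, annotator_2)  # observed 0.75, chance 0.5
```

Raw agreement here is 0.75, but κ is lower because half of that agreement would be expected by chance given how often each annotator uses each label, which is why reporting both figures is informative.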

Circularity Check

0 steps flagged

No significant circularity

The paper constructs a benchmark from existing AD datasets via a dynamic multi-relational scene graph and evaluates 15 VLMs on generated QA pairs. No equations, fitted parameters, predictions, or derivations are present that could reduce to inputs by construction. No self-citations are invoked as load-bearing support for uniqueness or ansatzes. The central claims rest on the benchmark construction and empirical results, which are independent of any prior author work referenced in the provided text.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on the assumption that the scene graph accurately captures the required relations and that human verification produces reliable QA pairs measuring the intended abilities.

axioms (2)

domain assumption: The dynamic multi-relational scene graph encodes object states, spatial relations, interactions, camera visibility, and temporal correspondences sufficiently to generate questions requiring genuine spatiotemporal reasoning.
Invoked to justify that the 15.6K QA pairs test the four abilities without shortcuts.
domain assumption: Human verification of the QA pairs ensures they are correct and enforce the intended reasoning.
Stated as the basis for benchmark quality.

pith-pipeline@v0.9.1-grok · 5840 in / 1351 out tokens · 42570 ms · 2026-06-30T16:37:34.962733+00:00 · methodology

Original abstract

Spatiotemporal intelligence in autonomous driving (AD) requires an agent to integrate multi-view observations into a coherent scene representation, maintain object continuity across viewpoints and time, and reason about spatial relations, interactions, and future dynamics. However, existing AD vision-language benchmarks largely focus on single-view, static, ego-centric, or single-source question answering, leaving it unclear whether current Vision-Language Models (VLMs) can truly construct and reason over dynamic driving scenes. We introduce DriveSpatial, a benchmark of 15.6K human-verified QA pairs across 20 tasks from five large-scale AD datasets. DriveSpatial evaluates four abilities: Cognitive Scene Construction, Multi-view Relational Understanding, Temporal Reasoning, and Generalization. Unlike prior benchmarks, DriveSpatial is generated from a dynamic multi-relational scene graph that encodes object states, spatial relations, interactions, camera visibility, and temporal correspondences, enabling QA pairs that enforce genuine cross-view and spatiotemporal reasoning. Evaluating 15 representative VLMs reveals a substantial human-model gap: the strongest model trails humans by 28.4 points, with Cognitive Scene Construction emerging as the key bottleneck. Further diagnostics show that language-only prompting is insufficient, while explicit BEV grounding consistently improves performance. These results suggest that current VLMs lack the scene-construction ability needed for reliable spatiotemporal driving intelligence. DriveSpatial and its construction pipeline will be released to support future research.

Figures

Figures reproduced from arXiv: 2605.23176 by Anh Nguyen, Chase Rainwater, Duc Minh Nguyen, Duy Minh Ho Nguyen, Gladys Gawugah, Hao Vo, Khoa Vo, Ngan Le, Nghi D. Q. Bui, Ngo Xuan Cuong, Phu Loc Nguyen, Sieu Tran, Sreevenkata Anjani Tishita Godavarthi.

**Figure 1.** We present DRIVESPATIAL: A spatiotemporal intelligence evaluation benchmark for Autonomous Driving that mirrors human navigation cognition. (I, Top) In driving scenarios, humans gather observations from multiple viewpoints to mentally construct an internal representation (Cognitive Scene Construction), infer spatial relationships between objects (Multi-view Relational Understanding), and connect these pe…

**Figure 2.** Representative question samples from DRIVESPATIAL across nine selected tasks (out of 20). Each cell shows a multiple-choice question with its visual input and answer options; correct answers are bold. Tasks are grouped by spatiotemporal ability: Const., Unders., Reas.
Spatial and Spatiotemporal Intelligence in VLMs. A growing body of work probes whether VLMs possess genuine spatial intelligence. General-…

**Figure 3.** DRIVESPATIAL statistics. (Left) Sunburst view of the 20 tasks under abilities Const., Unders. and Reas. (Right) Scene-level diversity distribution (Gen.).
… relationships across viewpoints, Reas. asks whether it can leverage temporal context to infer dynamics and anticipate future events, and Gen. measures whether these abilities remain reliable across datasets and driving conditions. Task Taxonomy & S…

**Figure 4.** DRIVESPATIAL construction pipeline. (1) standardize five AV datasets into a unified schema; (2) complete scene-level metadata; (3) construct a dynamic multi-relational graph; and (4) apply 20 rule-based algorithms to generate QA pairs. To ensure quality, human-in-the-loop is applied.
… $\mathrm{cam}(v_i^t) \cap \mathrm{cam}(v_j^t) = \emptyset$ for pairwise relation queries. These constraints prevent the answer from being recovered from …
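The disjoint-camera condition quoted with Figure 4 can be read as a filter on candidate object pairs: a pairwise relation question is only kept when no single camera sees both objects at that timestep. The sketch below illustrates that reading under stated assumptions; the function names, the per-timestep visibility dictionary, and the camera names are hypothetical, and the paper's actual rule set is not shown in the excerpt.

```python
def is_cross_view_pair(cams_i, cams_j):
    """True when both objects are visible somewhere but share no camera."""
    cams_i, cams_j = set(cams_i), set(cams_j)
    # Objects seen by no camera are excluded rather than treated as disjoint.
    return bool(cams_i) and bool(cams_j) and cams_i.isdisjoint(cams_j)


def cross_view_pairs(visibility):
    """List object pairs at one timestep that require combining camera views.

    `visibility` maps an object id to the camera names that see it.
    """
    ids = sorted(visibility)
    return [
        (a, b)
        for index, a in enumerate(ids)
        for b in ids[index + 1:]
        if is_cross_view_pair(visibility[a], visibility[b])
    ]


# Illustrative visibility at a single timestep (made-up objects and cameras).
visibility = {
    "car": {"CAM_FRONT"},
    "bus": {"CAM_FRONT", "CAM_FRONT_LEFT"},
    "ped": {"CAM_BACK"},
}
pairs = cross_view_pairs(visibility)
```

In this toy example the car and bus share `CAM_FRONT`, so a question about their relation could be answered from one image and is filtered out, while both pairs involving the pedestrian are kept.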

**Figure 5.** Per-task comparison against human performance. (left …

**Figure 6.** Breakdown of VLM performance for testing …

Reference graph

Works this paper leans on

111 extracted references · 111 canonical work pages · 21 internal anchors

[1]

Studies in spatial learning

Edward C Tolman, Benbow F Ritchie, and Donald Kalish. Studies in spatial learning. ii. place learning versus response learning.Journal of experimental psychology, 36(3):221, 1946

work page 1946
[2]

Cognitive maps in rats and men.Psychological review, 55(4):189, 1948

Edward C Tolman. Cognitive maps in rats and men.Psychological review, 55(4):189, 1948

work page 1948
[3]

The hippocampus and context revisited

Lynn Nadel. The hippocampus and context revisited. 2008

work page 2008
[4]

The effect of vehicle navigation systems on the formation of cognitive maps

Gary E Burnett and Kate Lee. The effect of vehicle navigation systems on the formation of cognitive maps. InInternational conference of traffic and transport psychology, 2005

work page 2005
[5]

Omnidrive: A holistic vision-language dataset for autonomous driving with counterfactual reasoning

Shihao Wang, Zhiding Yu, Xiaohui Jiang, Shiyi Lan, Min Shi, Nadine Chang, Jan Kautz, Ying Li, and Jose M Alvarez. Omnidrive: A holistic vision-language dataset for autonomous driving with counterfactual reasoning. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 22442–22452, 2025

work page 2025
[6]

Is ego status all you need for open-loop end-to-end autonomous driving? InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14864–14873, 2024

Zhiqi Li, Zhiding Yu, Shiyi Lan, Jiahan Li, Jan Kautz, Tong Lu, and Jose M Alvarez. Is ego status all you need for open-loop end-to-end autonomous driving? InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14864–14873, 2024

work page 2024
[7]

Holistic autonomous driving understanding by bird’s-eye-view injected multi-modal large models

Xinpeng Ding, Jianhua Han, Hang Xu, Xiaodan Liang, Wei Zhang, and Xiaomeng Li. Holistic autonomous driving understanding by bird’s-eye-view injected multi-modal large models. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13668–13677, 2024

work page 2024
[8]

Are vlms ready for autonomous driving? an empirical study from the reliability, data and metric perspectives

Shaoyuan Xie, Lingdong Kong, Yuhao Dong, Chonghao Sima, Wenwei Zhang, Qi Alfred Chen, Ziwei Liu, and Liang Pan. Are vlms ready for autonomous driving? an empirical study from the reliability, data and metric perspectives. InProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 6585–6597, October 2025

work page 2025
[9]

Vlm-ad: End-to-end autonomous driving through vision-language model supervision.Conference on Robot Learning (CoRL), 2024

Yi Xu, Yuxin Hu, Zaiwei Zhang, Gregory P Meyer, Siva Karthik Mustikovela, Siddhartha Srinivasa, Eric M Wolff, and Xin Huang. Vlm-ad: End-to-end autonomous driving through vision-language model supervision.Conference on Robot Learning (CoRL), 2024

work page 2024
[10]

Robotron-drive: All-in-one large multimodal model for autonomous driving

Zhijian Huang, Chengjian Feng, Feng Yan, Baihui Xiao, Zequn Jie, Yujie Zhong, Xiaodan Liang, and Lin Ma. Robotron-drive: All-in-one large multimodal model for autonomous driving. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 8011–8021, 2025

work page 2025
[11]

Fastdrivevla: Efficient end-to-end driving via plug-and-play reconstruction- based token pruning.the Association for the Advancement of Artificial Intelligence (AAAI), 2026

Jiajun Cao, Qizhe Zhang, Peidong Jia, Xuhui Zhao, Bo Lan, Xiaoan Zhang, Zhuo Li, Xiaobao Wei, Sixiang Chen, Liyun Li, et al. Fastdrivevla: Efficient end-to-end driving via plug-and-play reconstruction- based token pruning.the Association for the Advancement of Artificial Intelligence (AAAI), 2026

work page 2026
[12]

Drivegpt4: Interpretable end-to-end autonomous driving via large language model

Zhenhua Xu, Yujia Zhang, Enze Xie, Zhen Zhao, Yong Guo, Kwan-Yee K Wong, Zhenguo Li, and Hengshuang Zhao. Drivegpt4: Interpretable end-to-end autonomous driving via large language model. IEEE Robotics and Automation Letters, 2024

work page 2024
[13]

Covla: Comprehensive vision-language-action dataset for autonomous driving

Hidehisa Arai, Keita Miwa, Kento Sasaki, Kohei Watanabe, Yu Yamaguchi, Shunsuke Aoki, and Issei Yamamoto. Covla: Comprehensive vision-language-action dataset for autonomous driving. In2025 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), pages 1933–1943. IEEE, 2025

work page 1933
[14]

Emma: End-to-end multimodal model for autonomous driving.TMLR, 2025

Jyh-Jing Hwang, Runsheng Xu, Hubert Lin, Wei-Chih Hung, Jingwei Ji, Kristy Choi, Di Huang, Tong He, Paul Covington, Benjamin Sapp, et al. Emma: End-to-end multimodal model for autonomous driving.TMLR, 2025

work page 2025
[15]

Drivelm: Driving with graph visual question answering

Chonghao Sima, Katrin Renz, Kashyap Chitta, Li Chen, Hanxue Zhang, Chengen Xie, Jens Beißwenger, Ping Luo, Andreas Geiger, and Hongyang Li. Drivelm: Driving with graph visual question answering. InEuropean conference on computer vision, pages 256–274. Springer, 2024

work page 2024
[16]

DriveVLM: The convergence of autonomous driving and large vision-language models

Xiaoyu Tian, Junru Gu, Bailin Li, Yicheng Liu, Yang Wang, Zhiyong Zhao, Kun Zhan, Peng Jia, XianPeng Lang, and Hang Zhao. DriveVLM: The convergence of autonomous driving and large vision-language models. In8th Annual Conference on Robot Learning, 2024. 10

work page 2024
[17]

Futuresightdrive: Thinking visually with spatio-temporal cot for autonomous driving

Shuang Zeng, Xinyuan Chang, Mengwei Xie, Xinran Liu, Yifan Bai, Zheng Pan, Mu Xu, and Xing Wei. Futuresightdrive: Thinking visually with spatio-temporal cot for autonomous driving. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025

work page 2025
[18]

arXiv preprint arXiv:2504.03164 , year=

Kexin Tian, Jingrui Mao, Yunlong Zhang, Jiwan Jiang, Yang Zhou, and Zhengzhong Tu. Nuscenes- spatialqa: A spatial understanding and reasoning benchmark for vision-language models in autonomous driving.arXiv preprint arXiv:2504.03164, 2025

work page arXiv 2025
[19]

Maplm: A real-world large-scale vision-language benchmark for map and traffic scene understanding

Xu Cao, Tong Zhou, Yunsheng Ma, Wenqian Ye, Can Cui, Kun Tang, Zhipeng Cao, Kaizhao Liang, Ziran Wang, James M Rehg, et al. Maplm: A real-world large-scale vision-language benchmark for map and traffic scene understanding. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 21819–21830, 2024

work page 2024
[20]

arXiv preprint arXiv:2509.06266 (2025) 2

Mohsen Gholami, Ahmad Rezaei, Zhou Weimin, Sitong Mao, Shunbo Zhou, Yong Zhang, and Moham- mad Akbari. Spatial reasoning with vision-language models in ego-centric multi-view scenes.arXiv preprint arXiv:2509.06266, 2025

work page arXiv 2025
[21]

Surds: Benchmarking spatial understanding and reasoning in driving scenarios with vision language models

Xianda Guo, Ruijun Zhang, Yiqun Duan, Yuhang He, Dujun Nie, Wenke Huang, Chenming Zhang, Shuai Liu, Hao Zhao, and Long Chen. Surds: Benchmarking spatial understanding and reasoning in driving scenarios with vision language models. InNeurIPS, 2025

work page 2025
[22]

Automated evaluation of large vision-language models on self-driving corner cases

Kai Chen, Yanze Li, Wenhua Zhang, Yanxin Liu, Pengxiang Li, Ruiyuan Gao, Lanqing Hong, Meng Tian, Xinhai Zhao, Zhenguo Li, et al. Automated evaluation of large vision-language models on self-driving corner cases. In2025 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), pages 7817–7826. IEEE, 2025

work page 2025
[23]

Lingoqa: Visual question answering for autonomous driving

Ana-Maria Marcu, Long Chen, Jan Hünermann, Alice Karnsund, Benoit Hanotte, Prajwal Chidananda, Saurabh Nair, Vijay Badrinarayanan, Alex Kendall, Jamie Shotton, et al. Lingoqa: Visual question answering for autonomous driving. InEuropean Conference on Computer Vision, pages 252–269. Springer, 2024

work page 2024
[24]

Towards physics- informed spatial intelligence with human priors: An autonomous driving pilot study.International Conference on Learning Representations (ICLR), 2025

Guanlin Wu, Boyan Su, Yang Zhao, Pu Wang, Yichen Lin, and Hao Frank Yang. Towards physics- informed spatial intelligence with human priors: An autonomous driving pilot study.International Conference on Learning Representations (ICLR), 2025

work page 2025
[25]

Stsbench: A spatio-temporal scenario benchmark for multi-modal large language models in autonomous driving.Conference and Workshop on Neural Information Processing Systems, 2025

Christian Fruhwirth-Reisinger, Dušan Mali´c, Wei Lin, David Schinagl, Samuel Schulter, and Horst Possegger. Stsbench: A spatio-temporal scenario benchmark for multi-modal large language models in autonomous driving.Conference and Workshop on Neural Information Processing Systems, 2025

work page 2025
[26]

Nuscenes-qa: A multi-modal visual question answering benchmark for autonomous driving scenario

Tianwen Qian, Jingjing Chen, Linhai Zhuo, Yang Jiao, and Yu-Gang Jiang. Nuscenes-qa: A multi-modal visual question answering benchmark for autonomous driving scenario. InProceedings of the AAAI Conference on Artificial Intelligence, volume 38, pages 4542–4550, 2024

work page 2024
[27]

nuscenes: A multimodal dataset for autonomous driving

Holger Caesar, Varun Bankiti, Alex H Lang, Sourabh V ora, Venice Erin Liong, Qiang Xu, Anush Krish- nan, Yu Pan, Giancarlo Baldan, and Oscar Beijbom. nuscenes: A multimodal dataset for autonomous driving. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 11621–11631, 2020

work page 2020
[28]

Argoverse 2: Next generation datasets for self-driving perception and forecasting.Conference on Neural Information Processing Systems, 202

Benjamin Wilson, William Qi, Tanmay Agarwal, John Lambert, Jagjeet Singh, Siddhesh Khandelwal, Bowen Pan, Ratnesh Kumar, Andrew Hartnett, Jhony Kaesemodel Pontes, et al. Argoverse 2: Next generation datasets for self-driving perception and forecasting.Conference on Neural Information Processing Systems, 202

work page
[29]

Man truckscenes: A multimodal dataset for autonomous trucking in diverse conditions.Advances in Neural Information Processing Systems, 37:62062–62082, 2024

Felix Fent, Fabian Kuttenreich, Florian Ruch, Farija Rizwin, Stefan Juergens, Lorenz Lechermann, Christian Nissler, Andrea Perl, Ulrich V oll, Min Yan, et al. Man truckscenes: A multimodal dataset for autonomous trucking in diverse conditions.Advances in Neural Information Processing Systems, 37:62062–62082, 2024

work page 2024
[30]

Scalability in perception for autonomous driving: Waymo open dataset

Pei Sun, Henrik Kretzschmar, Xerxes Dotiwalla, Aurelien Chouard, Vijaysai Patnaik, Paul Tsui, James Guo, Yin Zhou, Yuning Chai, Benjamin Caine, et al. Scalability in perception for autonomous driving: Waymo open dataset. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 2446–2454, 2020. 11

work page 2020
[31]

One million scenes for autonomous driving: Once dataset

Jiageng Mao, Minzhe Niu, Chenhan Jiang, Hanxue Liang, Jingheng Chen, Xiaodan Liang, Yamin Li, Chaoqiang Ye, Wei Zhang, Zhenguo Li, et al. One million scenes for autonomous driving: Once dataset. Conference and Workshop on Neural Information Processing Systems, 2021

work page 2021
[32]

Vision language models in autonomous driving: A survey and outlook.arXiv preprint arXiv:2310.14414, 2023

Xingcheng Zhou, Mingyu Liu, Ekim Yurtsever, Bare Luka Zagar, Walter Zimmer, Hu Cao, and Alois C Knoll. Vision language models in autonomous driving: A survey and outlook.arXiv preprint arXiv:2310.14414, 2023

work page arXiv 2023
[33]

Available: http://arxiv.org/abs/2311.12320

Can Cui, Yunsheng Ma, Xu Cao, Wenqian Ye, Yang Zhou, Kaizhao Liang, Jintai Chen, Juanwu Lu, Zichong Yang, et al. A survey on multimodal large language models for autonomous driving.arXiv preprint arXiv:2311.12320, 2023

work page arXiv 2023
[34]

Qwen3-VL Technical Report

Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, et al. Qwen3-vl technical report.arXiv preprint arXiv:2511.21631, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[35]

Llama 2: Open Foundation and Fine-Tuned Chat Models

Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models.arXiv preprint arXiv:2307.09288, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[36]

GPT-4 Technical Report

Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[37]

InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency

Weiyun Wang, Zhangwei Gao, Lixin Gu, Hengjun Pu, Long Cui, Xingguang Wei, Zhaoyang Liu, Linglin Jing, Shenglong Ye, Jie Shao, et al. Internvl3. 5: Advancing open-source multimodal models in versatility, reasoning, and efficiency.arXiv preprint arXiv:2508.18265, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[38]

Kimi-VL Technical Report

Kimi Team, Angang Du, Bohong Yin, Bowei Xing, Bowen Qu, Bowen Wang, Cheng Chen, Chenlin Zhang, Chenzhuang Du, Chu Wei, et al. Kimi-vl technical report.arXiv preprint arXiv:2504.07491, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[39]

Llava-onevision: Easy visual task transfer.TMLR, 2024

Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Peiyuan Zhang, Yanwei Li, Ziwei Liu, et al. Llava-onevision: Easy visual task transfer.TMLR, 2024

work page 2024
[40]

DeepSeek-VL2: Mixture-of-Experts Vision-Language Models for Advanced Multimodal Understanding

Zhiyu Wu, Xiaokang Chen, Zizheng Pan, Xingchao Liu, Wen Liu, Damai Dai, Huazuo Gao, Yiyang Ma, Chengyue Wu, Bingxuan Wang, et al. Deepseek-vl2: Mixture-of-experts vision-language models for advanced multimodal understanding. arXiv preprint arXiv:2412.10302, 2024

[41]

Gemini: A Family of Highly Capable Multimodal Models

Gemini Team, Rohan Anil, Sebastian Borgeaud, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, Katie Millican, et al. Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805, 2023

[42]

Henasy: Learning to assemble scene-entities for interpretable egocentric video-language model

Khoa Vo, Thinh Phan, Kashu Yamazaki, Minh Tran, and Ngan Le. Henasy: Learning to assemble scene-entities for interpretable egocentric video-language model. Advances in Neural Information Processing Systems, 37:86483–86499, 2024

[43]

Directed-tokens: A robust multi-modality alignment approach to large language-vision models

Thanh-Dat Truong, Huu-Thien Tran, Tran Thai Son, Bhiksha Raj, and Khoa Luu. Directed-tokens: A robust multi-modality alignment approach to large language-vision models. arXiv preprint arXiv:2508.14264, 2025

[44]

GAIA-1: A Generative World Model for Autonomous Driving

Anthony Hu, Lloyd Russell, Hudson Yeo, Zak Murez, George Fedoseev, Alex Kendall, Jamie Shotton, and Gianluca Corrado. Gaia-1: A generative world model for autonomous driving. arXiv preprint arXiv:2309.17080, 2023

[45]

Driving into the future: Multiview visual forecasting and planning with world model for autonomous driving

Yuqi Wang, Jiawei He, Lue Fan, Hongxin Li, Yuntao Chen, and Zhaoxiang Zhang. Driving into the future: Multiview visual forecasting and planning with world model for autonomous driving. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024

[46]

DriveDreamer: Towards Real-world-driven World Models for Autonomous Driving

Xiaofeng Wang, Zheng Zhu, Guan Huang, Xinze Chen, Jiagang Zhu, and Jiwen Lu. Drivedreamer: Towards real-world-driven world models for autonomous driving. arXiv preprint arXiv:2309.09777, 2023

[47]

Dolphins: Multimodal language model for driving

Yingzi Ma, Yulong Cao, Jiachen Sun, Marco Pavone, and Chaowei Xiao. Dolphins: Multimodal language model for driving. In European Conference on Computer Vision, pages 403–420. Springer, 2024

[48]

Reason2drive: Towards interpretable and chain-based reasoning for autonomous driving

Ming Nie, Renyuan Peng, Chunwei Wang, Xinyue Cai, Jianhua Han, Hang Xu, and Li Zhang. Reason2drive: Towards interpretable and chain-based reasoning for autonomous driving. In European Conference on Computer Vision, pages 292–308. Springer, 2024

[49]

Visual spatial reasoning

Fangyu Liu, Guy Emerson, and Nigel Collier. Visual spatial reasoning. Transactions of the Association for Computational Linguistics, 11:635–651, 2023

[50]

An empirical analysis on spatial reasoning capabilities of large multimodal models

Fatemeh Shiri, Xiao-Yu Guo, Mona Golestan Far, Xin Yu, Gholamreza Haffari, and Yuan-Fang Li. An empirical analysis on spatial reasoning capabilities of large multimodal models. arXiv preprint arXiv:2411.06048, 2024

[51]

Embspatial-bench: Benchmarking spatial understanding for embodied tasks with large vision-language models

Mengfei Du, Binhao Wu, Zejun Li, Xuanjing Huang, and Zhongyu Wei. Embspatial-bench: Benchmarking spatial understanding for embodied tasks with large vision-language models. arXiv preprint arXiv:2406.05756, 2024

[52]

Spatialrgpt: Grounded spatial reasoning in vision-language models

An-Chieh Cheng, Hongxu Yin, Yang Fu, Qiushan Guo, Ruihan Yang, Jan Kautz, Xiaolong Wang, and Sifei Liu. Spatialrgpt: Grounded spatial reasoning in vision-language models. Advances in Neural Information Processing Systems, 37:135062–135093, 2024

[53]

Spatialvlm: Endowing vision-language models with spatial reasoning capabilities

Boyuan Chen, Zhuo Xu, Sean Kirmani, Brian Ichter, Dorsa Sadigh, Leonidas Guibas, and Fei Xia. Spatialvlm: Endowing vision-language models with spatial reasoning capabilities. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14455–14465, 2024

[54]

Spatialbot: Precise spatial understanding with vision language models

Wenxiao Cai, Iaroslav Ponomarenko, Jianhao Yuan, Xiaoqi Li, Wankou Yang, Hao Dong, and Bo Zhao. Spatialbot: Precise spatial understanding with vision language models. In 2025 IEEE International Conference on Robotics and Automation (ICRA), pages 9490–9498. IEEE, 2025

[55]

Spatialllm: A compound 3d-informed design towards spatially-intelligent large multimodal models

Wufei Ma, Luoxin Ye, Celso M de Melo, Alan Yuille, and Jieneng Chen. Spatialllm: A compound 3d-informed design towards spatially-intelligent large multimodal models. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 17249–17260, 2025

[56]

Robospatial: Teaching spatial understanding to 2d and 3d vision-language models for robotics

Chan Hee Song, Valts Blukis, Jonathan Tremblay, Stephen Tyree, Yu Su, and Stan Birchfield. Robospatial: Teaching spatial understanding to 2d and 3d vision-language models for robotics. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 15768–15780, 2025

[57]

Gemini Robotics: Bringing AI into the Physical World

Gemini Robotics Team, Saminda Abeyruwan, Joshua Ainslie, Jean-Baptiste Alayrac, Montserrat Gonzalez Arenas, Travis Armstrong, Ashwin Balakrishna, Robert Baruch, Maria Bauza, Michiel Blokzijl, et al. Gemini robotics: Bringing ai into the physical world. arXiv preprint arXiv:2503.20020, 2025

[58]

Spatialcot: Advancing spatial reasoning through coordinate alignment and chain-of-thought for embodied task planning

Yuecheng Liu, Dafeng Chi, Shiguang Wu, Zhanguang Zhang, Yaochen Hu, Lingfeng Zhang, Yingxue Zhang, Shuang Wu, Tongtong Cao, Guowei Huang, et al. Spatialcot: Advancing spatial reasoning through coordinate alignment and chain-of-thought for embodied task planning. arXiv preprint arXiv:2501.10074, 2025

[60]

Omnispatial: Towards comprehensive spatial reasoning benchmark for vision language models

Mengdi Jia, Zekun Qi, Shaochen Zhang, Wenyao Zhang, Xinqiang Yu, Jiawei He, He Wang, and Li Yi. Omnispatial: Towards comprehensive spatial reasoning benchmark for vision language models. arXiv preprint arXiv:2506.03135, 2025

[61]

Hssbench: Benchmarking humanities and social sciences ability for multimodal large language models

Zhaolu Kang, Junhao Gong, Jiaxu Yan, Wanke Xia, Yian Wang, Ziwen Wang, Huaxuan Ding, Zhuo Cheng, Wenhao Cao, Zhiyuan Feng, et al. Hssbench: Benchmarking humanities and social sciences ability for multimodal large language models. arXiv preprint arXiv:2506.03922, 2025

[62]

Spatial-dise: A unified benchmark for evaluating spatial reasoning in vision-language models

Xinmiao Huang, Qisong He, Zhenglin Huang, Boxuan Wang, Zhuoyun Li, Guangliang Cheng, and Yi Dong. Spatial-dise: A unified benchmark for evaluating spatial reasoning in vision-language models. arXiv preprint arXiv:2510.13394, 2025

[63]

Mind the Gap: Benchmarking Spatial Reasoning in Vision-Language Models

Ilias Stogiannidis, Steven McDonagh, and Sotirios A Tsaftaris. Mind the gap: Benchmarking spatial reasoning in vision-language models. arXiv preprint arXiv:2503.19707, 2025

[65]

Mmsi-video-bench: A holistic benchmark for video-based spatial intelligence

Jingli Lin, Runsen Xu, Shaohao Zhu, Sihan Yang, Peizhou Cao, Yunlong Ran, Miao Hu, and Chenming Zhu. Mmsi-video-bench: A holistic benchmark for video-based spatial intelligence. arXiv preprint arXiv:2512.10863, 2025

[66]

Cambrian-S: Towards Spatial Supersensing in Video

Shusheng Yang, Jihan Yang, Pinzhi Huang, Ellis Brown, Zihao Yang, Yue Yu, Shengbang Tong, and Zihan Zhen. Cambrian-s: Towards spatial supersensing in video. arXiv preprint arXiv:2511.04670, 2025

[67]

Agqa: A benchmark for compositional spatio-temporal reasoning

Madeleine Grunde-McLaughlin, Ranjay Krishna, and Maneesh Agrawala. Agqa: A benchmark for compositional spatio-temporal reasoning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021

[68]

Next-qa: Next phase of question-answering to explaining temporal actions

Junbin Xiao, Xindi Shang, Angela Yao, and Tat-Seng Chua. Next-qa: Next phase of question-answering to explaining temporal actions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021

[69]

Visuospatial perspective taking in multimodal language models

Jonathan Prunty, Seraphina Zhang, Patrick Quinn, Jianxun Lian, Xing Xie, and Lucy Cheke. Visuospatial perspective taking in multimodal language models. arXiv preprint arXiv:2603.23510, 2026

[70]

Egocentric bias in vision-language models

Maijunxian Wang, Yijiang Li, Bingyang Wang, Tianwei Zhao, Ran Ji, Qingying Gao, Emmy Liu, and Hokin Deng. Egocentric bias in vision-language models. arXiv preprint arXiv:2602.15892, 2026

[71]

Allocentric perceiver: Disentangling allocentric reasoning from egocentric visual priors via frame instantiation

Hengyi Wang, Ruiqiang Zhang, Chang Liu, Guanjie Wang, Zehua Ma, Han Fang, and Weiming Zhang. Allocentric perceiver: Disentangling allocentric reasoning from egocentric visual priors via frame instantiation. arXiv preprint arXiv:2602.05789, 2026

[72]

Keep it sympl: Symbolic projective layout for allocentric spatial reasoning in vision-language models

Jaeyun Jang, Seunghui Shin, Taeho Park, and Hyoseok Hwang. Keep it sympl: Symbolic projective layout for allocentric spatial reasoning in vision-language models. arXiv preprint arXiv:2602.19117, 2026

[73]

Capture: Evaluating spatial reasoning in vision language models via occluded object counting

Atin Pothiraj, Elias Stengel-Eskin, Jaemin Cho, and Mohit Bansal. Capture: Evaluating spatial reasoning in vision language models via occluded object counting. arXiv preprint arXiv:2504.15485, 2025

[74]

Beyond the visible: Benchmarking occlusion perception in multimodal large language models

Zhaochen Liu, Kaiwen Gao, Shuyi Liang, Bin Xiao, Limeng Qiao, Lin Ma, and Tingting Jiang. Beyond the visible: Benchmarking occlusion perception in multimodal large language models. arXiv preprint arXiv:2508.04059, 2025

[75]

Mind over space: Can multimodal large language models mentally navigate?

Qihui Zhu, Shouwei Ruan, Xiao Yang, Hao Jiang, Yao Huang, Shiji Zhao, Hanwei Fan, Hang Su, and Xingxing Wei. Mind over space: Can multimodal large language models mentally navigate? arXiv preprint arXiv:2603.21577, 2026

[76]

Video2layout: Recall and reconstruct metric-grounded cognitive map for spatial reasoning

Yibin Huang, Wang Xu, Wanyue Zhang, Helu Zhi, Jingjing Huang, Yangbin Xu, Yangang Sun, and Conghui Zhu. Video2layout: Recall and reconstruct metric-grounded cognitive map for spatial reasoning. arXiv preprint arXiv:2511.16160, 2025

[77]

Embodied-r: Collaborative framework for activating embodied spatial reasoning in foundation models via reinforcement learning

Baining Zhao, Ziyou Wang, Jianjie Fang, Chen Gao, Fanhang Man, Jinqiang Cui, Xin Wang, and Xinlei Chen. Embodied-r: Collaborative framework for activating embodied spatial reasoning in foundation models via reinforcement learning. arXiv preprint arXiv:2504.12680, 2025

[78]

Talk2car: Taking control of your self-driving car

Thierry Deruyttere, Simon Vandenhende, Dusan Grujicic, Luc Van Gool, and Marie Francine Moens. Talk2car: Taking control of your self-driving car. In Proceedings of the 2019 conference on empirical methods in natural language processing and the 9th international joint conference on natural language processing (EMNLP-IJCNLP), pages 2088–2098, 2019

[79]

Textual explanations for self-driving vehicles

Jinkyu Kim, Anna Rohrbach, Trevor Darrell, John Canny, and Zeynep Akata. Textual explanations for self-driving vehicles. In Proceedings of the European conference on computer vision (ECCV), pages 563–578, 2018

[80]

Nuscenes-mqa: Integrated evaluation of captions and qa for autonomous driving datasets using markup annotations

Yuichi Inoue, Yuki Yada, Kotaro Tanahashi, and Yu Yamaguchi. Nuscenes-mqa: Integrated evaluation of captions and qa for autonomous driving datasets using markup annotations. arXiv preprint arXiv:2312.06352, 2023

[81]

Stride-qa: Visual question answering dataset for spatiotemporal reasoning in urban driving scenes

Keishi Ishihara, Kento Sasaki, Tsubasa Takahashi, Daiki Shiono, and Yu Yamaguchi. Stride-qa: Visual question answering dataset for spatiotemporal reasoning in urban driving scenes. arXiv preprint arXiv:2508.10427, 2025

[82]

Bench2drive: Towards multi-ability benchmarking of closed-loop end-to-end autonomous driving

Xiaosong Jia, Zhenjie Yang, Qifeng Li, Zhiyuan Zhang, and Junchi Yan. Bench2drive: Towards multi-ability benchmarking of closed-loop end-to-end autonomous driving. arXiv preprint arXiv:2406.03877, 2024

[1]

Studies in spatial learning

Edward C Tolman, Benbow F Ritchie, and Donald Kalish. Studies in spatial learning. ii. place learning versus response learning. Journal of experimental psychology, 36(3):221, 1946

[2]

Cognitive maps in rats and men

Edward C Tolman. Cognitive maps in rats and men. Psychological review, 55(4):189, 1948

[3]

The hippocampus and context revisited

Lynn Nadel. The hippocampus and context revisited. 2008

[4]

The effect of vehicle navigation systems on the formation of cognitive maps

Gary E Burnett and Kate Lee. The effect of vehicle navigation systems on the formation of cognitive maps. In International conference of traffic and transport psychology, 2005

[5]

Omnidrive: A holistic vision-language dataset for autonomous driving with counterfactual reasoning

Shihao Wang, Zhiding Yu, Xiaohui Jiang, Shiyi Lan, Min Shi, Nadine Chang, Jan Kautz, Ying Li, and Jose M Alvarez. Omnidrive: A holistic vision-language dataset for autonomous driving with counterfactual reasoning. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 22442–22452, 2025

[6]

Is ego status all you need for open-loop end-to-end autonomous driving?

Zhiqi Li, Zhiding Yu, Shiyi Lan, Jiahan Li, Jan Kautz, Tong Lu, and Jose M Alvarez. Is ego status all you need for open-loop end-to-end autonomous driving? In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14864–14873, 2024

[7]

Holistic autonomous driving understanding by bird’s-eye-view injected multi-modal large models

Xinpeng Ding, Jianhua Han, Hang Xu, Xiaodan Liang, Wei Zhang, and Xiaomeng Li. Holistic autonomous driving understanding by bird’s-eye-view injected multi-modal large models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13668–13677, 2024

[8]

Are vlms ready for autonomous driving? an empirical study from the reliability, data and metric perspectives

Shaoyuan Xie, Lingdong Kong, Yuhao Dong, Chonghao Sima, Wenwei Zhang, Qi Alfred Chen, Ziwei Liu, and Liang Pan. Are vlms ready for autonomous driving? an empirical study from the reliability, data and metric perspectives. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 6585–6597, October 2025

[9]

Vlm-ad: End-to-end autonomous driving through vision-language model supervision

Yi Xu, Yuxin Hu, Zaiwei Zhang, Gregory P Meyer, Siva Karthik Mustikovela, Siddhartha Srinivasa, Eric M Wolff, and Xin Huang. Vlm-ad: End-to-end autonomous driving through vision-language model supervision. Conference on Robot Learning (CoRL), 2024

[10]

Robotron-drive: All-in-one large multimodal model for autonomous driving

Zhijian Huang, Chengjian Feng, Feng Yan, Baihui Xiao, Zequn Jie, Yujie Zhong, Xiaodan Liang, and Lin Ma. Robotron-drive: All-in-one large multimodal model for autonomous driving. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 8011–8021, 2025

[11]

Fastdrivevla: Efficient end-to-end driving via plug-and-play reconstruction-based token pruning

Jiajun Cao, Qizhe Zhang, Peidong Jia, Xuhui Zhao, Bo Lan, Xiaoan Zhang, Zhuo Li, Xiaobao Wei, Sixiang Chen, Liyun Li, et al. Fastdrivevla: Efficient end-to-end driving via plug-and-play reconstruction-based token pruning. the Association for the Advancement of Artificial Intelligence (AAAI), 2026

[12]

Drivegpt4: Interpretable end-to-end autonomous driving via large language model

Zhenhua Xu, Yujia Zhang, Enze Xie, Zhen Zhao, Yong Guo, Kwan-Yee K Wong, Zhenguo Li, and Hengshuang Zhao. Drivegpt4: Interpretable end-to-end autonomous driving via large language model. IEEE Robotics and Automation Letters, 2024

[13]

Covla: Comprehensive vision-language-action dataset for autonomous driving

Hidehisa Arai, Keita Miwa, Kento Sasaki, Kohei Watanabe, Yu Yamaguchi, Shunsuke Aoki, and Issei Yamamoto. Covla: Comprehensive vision-language-action dataset for autonomous driving. In 2025 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), pages 1933–1943. IEEE, 2025

[14]

Emma: End-to-end multimodal model for autonomous driving

Jyh-Jing Hwang, Runsheng Xu, Hubert Lin, Wei-Chih Hung, Jingwei Ji, Kristy Choi, Di Huang, Tong He, Paul Covington, Benjamin Sapp, et al. Emma: End-to-end multimodal model for autonomous driving. TMLR, 2025

[15]

Drivelm: Driving with graph visual question answering

Chonghao Sima, Katrin Renz, Kashyap Chitta, Li Chen, Hanxue Zhang, Chengen Xie, Jens Beißwenger, Ping Luo, Andreas Geiger, and Hongyang Li. Drivelm: Driving with graph visual question answering. In European conference on computer vision, pages 256–274. Springer, 2024

[16]

DriveVLM: The convergence of autonomous driving and large vision-language models

Xiaoyu Tian, Junru Gu, Bailin Li, Yicheng Liu, Yang Wang, Zhiyong Zhao, Kun Zhan, Peng Jia, XianPeng Lang, and Hang Zhao. DriveVLM: The convergence of autonomous driving and large vision-language models. In 8th Annual Conference on Robot Learning, 2024

[17]

Futuresightdrive: Thinking visually with spatio-temporal cot for autonomous driving

Shuang Zeng, Xinyuan Chang, Mengwei Xie, Xinran Liu, Yifan Bai, Zheng Pan, Mu Xu, and Xing Wei. Futuresightdrive: Thinking visually with spatio-temporal cot for autonomous driving. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025

[18]

Nuscenes-spatialqa: A spatial understanding and reasoning benchmark for vision-language models in autonomous driving

Kexin Tian, Jingrui Mao, Yunlong Zhang, Jiwan Jiang, Yang Zhou, and Zhengzhong Tu. Nuscenes-spatialqa: A spatial understanding and reasoning benchmark for vision-language models in autonomous driving. arXiv preprint arXiv:2504.03164, 2025

[19]

Maplm: A real-world large-scale vision-language benchmark for map and traffic scene understanding

Xu Cao, Tong Zhou, Yunsheng Ma, Wenqian Ye, Can Cui, Kun Tang, Zhipeng Cao, Kaizhao Liang, Ziran Wang, James M Rehg, et al. Maplm: A real-world large-scale vision-language benchmark for map and traffic scene understanding. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 21819–21830, 2024

[20]

Spatial reasoning with vision-language models in ego-centric multi-view scenes

Mohsen Gholami, Ahmad Rezaei, Zhou Weimin, Sitong Mao, Shunbo Zhou, Yong Zhang, and Mohammad Akbari. Spatial reasoning with vision-language models in ego-centric multi-view scenes. arXiv preprint arXiv:2509.06266, 2025

[21]

Surds: Benchmarking spatial understanding and reasoning in driving scenarios with vision language models

Xianda Guo, Ruijun Zhang, Yiqun Duan, Yuhang He, Dujun Nie, Wenke Huang, Chenming Zhang, Shuai Liu, Hao Zhao, and Long Chen. Surds: Benchmarking spatial understanding and reasoning in driving scenarios with vision language models. In NeurIPS, 2025

[22]

Automated evaluation of large vision-language models on self-driving corner cases

Kai Chen, Yanze Li, Wenhua Zhang, Yanxin Liu, Pengxiang Li, Ruiyuan Gao, Lanqing Hong, Meng Tian, Xinhai Zhao, Zhenguo Li, et al. Automated evaluation of large vision-language models on self-driving corner cases. In 2025 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), pages 7817–7826. IEEE, 2025

[23]

Lingoqa: Visual question answering for autonomous driving

Ana-Maria Marcu, Long Chen, Jan Hünermann, Alice Karnsund, Benoit Hanotte, Prajwal Chidananda, Saurabh Nair, Vijay Badrinarayanan, Alex Kendall, Jamie Shotton, et al. Lingoqa: Visual question answering for autonomous driving. In European Conference on Computer Vision, pages 252–269. Springer, 2024

[24]

Towards physics-informed spatial intelligence with human priors: An autonomous driving pilot study

Guanlin Wu, Boyan Su, Yang Zhao, Pu Wang, Yichen Lin, and Hao Frank Yang. Towards physics-informed spatial intelligence with human priors: An autonomous driving pilot study. International Conference on Learning Representations (ICLR), 2025

[25]

Stsbench: A spatio-temporal scenario benchmark for multi-modal large language models in autonomous driving

Christian Fruhwirth-Reisinger, Dušan Malić, Wei Lin, David Schinagl, Samuel Schulter, and Horst Possegger. Stsbench: A spatio-temporal scenario benchmark for multi-modal large language models in autonomous driving. Conference and Workshop on Neural Information Processing Systems, 2025

[26]

Nuscenes-qa: A multi-modal visual question answering benchmark for autonomous driving scenario

Tianwen Qian, Jingjing Chen, Linhai Zhuo, Yang Jiao, and Yu-Gang Jiang. Nuscenes-qa: A multi-modal visual question answering benchmark for autonomous driving scenario. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, pages 4542–4550, 2024

[27]

nuscenes: A multimodal dataset for autonomous driving

Holger Caesar, Varun Bankiti, Alex H Lang, Sourabh Vora, Venice Erin Liong, Qiang Xu, Anush Krishnan, Yu Pan, Giancarlo Baldan, and Oscar Beijbom. nuscenes: A multimodal dataset for autonomous driving. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 11621–11631, 2020

[28]

Argoverse 2: Next generation datasets for self-driving perception and forecasting

Benjamin Wilson, William Qi, Tanmay Agarwal, John Lambert, Jagjeet Singh, Siddhesh Khandelwal, Bowen Pan, Ratnesh Kumar, Andrew Hartnett, Jhony Kaesemodel Pontes, et al. Argoverse 2: Next generation datasets for self-driving perception and forecasting. Conference on Neural Information Processing Systems, 202

[29]

Man truckscenes: A multimodal dataset for autonomous trucking in diverse conditions

Felix Fent, Fabian Kuttenreich, Florian Ruch, Farija Rizwin, Stefan Juergens, Lorenz Lechermann, Christian Nissler, Andrea Perl, Ulrich Voll, Min Yan, et al. Man truckscenes: A multimodal dataset for autonomous trucking in diverse conditions. Advances in Neural Information Processing Systems, 37:62062–62082, 2024

[30]

Scalability in perception for autonomous driving: Waymo open dataset

Pei Sun, Henrik Kretzschmar, Xerxes Dotiwalla, Aurelien Chouard, Vijaysai Patnaik, Paul Tsui, James Guo, Yin Zhou, Yuning Chai, Benjamin Caine, et al. Scalability in perception for autonomous driving: Waymo open dataset. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 2446–2454, 2020

[31]

One million scenes for autonomous driving: Once dataset

Jiageng Mao, Minzhe Niu, Chenhan Jiang, Hanxue Liang, Jingheng Chen, Xiaodan Liang, Yamin Li, Chaoqiang Ye, Wei Zhang, Zhenguo Li, et al. One million scenes for autonomous driving: Once dataset. Conference and Workshop on Neural Information Processing Systems, 2021

[32]

Vision language models in autonomous driving: A survey and outlook

Xingcheng Zhou, Mingyu Liu, Ekim Yurtsever, Bare Luka Zagar, Walter Zimmer, Hu Cao, and Alois C Knoll. Vision language models in autonomous driving: A survey and outlook. arXiv preprint arXiv:2310.14414, 2023

[33]

A survey on multimodal large language models for autonomous driving

Can Cui, Yunsheng Ma, Xu Cao, Wenqian Ye, Yang Zhou, Kaizhao Liang, Jintai Chen, Juanwu Lu, Zichong Yang, et al. A survey on multimodal large language models for autonomous driving. arXiv preprint arXiv:2311.12320, 2023

[34] [34]

Qwen3-VL Technical Report

Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, et al. Qwen3-vl technical report.arXiv preprint arXiv:2511.21631, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[35] [35]

Llama 2: Open Foundation and Fine-Tuned Chat Models

Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models.arXiv preprint arXiv:2307.09288, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[36] [36]

GPT-4 Technical Report

Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[37] [37]

InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency

Weiyun Wang, Zhangwei Gao, Lixin Gu, Hengjun Pu, Long Cui, Xingguang Wei, Zhaoyang Liu, Linglin Jing, Shenglong Ye, Jie Shao, et al. Internvl3. 5: Advancing open-source multimodal models in versatility, reasoning, and efficiency.arXiv preprint arXiv:2508.18265, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[38] [38]

Kimi-VL Technical Report

Kimi Team, Angang Du, Bohong Yin, Bowei Xing, Bowen Qu, Bowen Wang, Cheng Chen, Chenlin Zhang, Chenzhuang Du, Chu Wei, et al. Kimi-vl technical report.arXiv preprint arXiv:2504.07491, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[39] [39]

Llava-onevision: Easy visual task transfer.TMLR, 2024

Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Peiyuan Zhang, Yanwei Li, Ziwei Liu, et al. Llava-onevision: Easy visual task transfer.TMLR, 2024

work page 2024

[40] [40]

DeepSeek-VL2: Mixture-of-Experts Vision-Language Models for Advanced Multimodal Understanding

Zhiyu Wu, Xiaokang Chen, Zizheng Pan, Xingchao Liu, Wen Liu, Damai Dai, Huazuo Gao, Yiyang Ma, Chengyue Wu, Bingxuan Wang, et al. Deepseek-vl2: Mixture-of-experts vision-language models for advanced multimodal understanding.arXiv preprint arXiv:2412.10302, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[41] [41]

Gemini: A Family of Highly Capable Multimodal Models

Gemini Team, Rohan Anil, Sebastian Borgeaud, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, Katie Millican, et al. Gemini: a family of highly capable multimodal models.arXiv preprint arXiv:2312.11805, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[42] [42]

Henasy: Learning to assemble scene-entities for interpretable egocentric video-language model.Advances in Neural Information Processing Systems, 37:86483–86499, 2024

Khoa V o, Thinh Phan, Kashu Yamazaki, Minh Tran, and Ngan Le. Henasy: Learning to assemble scene-entities for interpretable egocentric video-language model.Advances in Neural Information Processing Systems, 37:86483–86499, 2024

work page 2024

[43] [43]

Directed-tokens: A robust multi-modality alignment approach to large language-vision models.arXiv preprint arXiv:2508.14264, 2025

Thanh-Dat Truong, Huu-Thien Tran, Tran Thai Son, Bhiksha Raj, and Khoa Luu. Directed- tokens: A robust multi-modality alignment approach to large language-vision models.arXiv preprint arXiv:2508.14264, 2025

work page arXiv 2025

[44] [44]

GAIA-1: A Generative World Model for Autonomous Driving

Anthony Hu, Lloyd Russell, Hudson Yeo, Zak Murez, George Fedoseev, Alex Kendall, Jamie Shotton, and Gianluca Corrado. Gaia-1: A generative world model for autonomous driving.arXiv preprint arXiv:2309.17080, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[45] [45]

Driving into the future: Multiview visual forecasting and planning with world model for autonomous driving

Yuqi Wang, Jiawei He, Lue Fan, Hongxin Li, Yuntao Chen, and Zhaoxiang Zhang. Driving into the future: Multiview visual forecasting and planning with world model for autonomous driving. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024

work page 2024

[46] [46]

DriveDreamer: Towards Real-world-driven World Models for Autonomous Driving

Xiaofeng Wang, Zheng Zhu, Guan Huang, Xinze Chen, Jiagang Zhu, and Jiwen Lu. Drivedreamer: Towards real-world-driven world models for autonomous driving.arXiv preprint arXiv:2309.09777, 2023. 12

work page Pith review arXiv 2023

[47] [47]

Dolphins: Multimodal language model for driving

Yingzi Ma, Yulong Cao, Jiachen Sun, Marco Pavone, and Chaowei Xiao. Dolphins: Multimodal language model for driving. InEuropean Conference on Computer Vision, pages 403–420. Springer, 2024

work page 2024

[48] [48]

Reason2drive: Towards interpretable and chain-based reasoning for autonomous driving

Ming Nie, Renyuan Peng, Chunwei Wang, Xinyue Cai, Jianhua Han, Hang Xu, and Li Zhang. Reason2drive: Towards interpretable and chain-based reasoning for autonomous driving. In European Conference on Computer Vision, pages 292–308. Springer, 2024

[49] [49]

Visual spatial reasoning. Transactions of the Association for Computational Linguistics, 11:635–651, 2023

Fangyu Liu, Guy Emerson, and Nigel Collier. Visual spatial reasoning. Transactions of the Association for Computational Linguistics, 11:635–651, 2023

[50] [50]

An empirical analysis on spatial reasoning capabilities of large multimodal models

Fatemeh Shiri, Xiao-Yu Guo, Mona Golestan Far, Xin Yu, Gholamreza Haffari, and Yuan-Fang Li. An empirical analysis on spatial reasoning capabilities of large multimodal models. arXiv preprint arXiv:2411.06048, 2024

[51] [51]

Embspatial-bench: Benchmarking spatial understanding for embodied tasks with large vision-language models

Mengfei Du, Binhao Wu, Zejun Li, Xuanjing Huang, and Zhongyu Wei. Embspatial-bench: Benchmarking spatial understanding for embodied tasks with large vision-language models. arXiv preprint arXiv:2406.05756, 2024

[52] [52]

Spatialrgpt: Grounded spatial reasoning in vision-language models. Advances in Neural Information Processing Systems, 37:135062–135093, 2024

An-Chieh Cheng, Hongxu Yin, Yang Fu, Qiushan Guo, Ruihan Yang, Jan Kautz, Xiaolong Wang, and Sifei Liu. Spatialrgpt: Grounded spatial reasoning in vision-language models. Advances in Neural Information Processing Systems, 37:135062–135093, 2024

[53] [53]

Spatialvlm: Endowing vision-language models with spatial reasoning capabilities

Boyuan Chen, Zhuo Xu, Sean Kirmani, Brian Ichter, Dorsa Sadigh, Leonidas Guibas, and Fei Xia. Spatialvlm: Endowing vision-language models with spatial reasoning capabilities. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14455–14465, 2024

[54] [54]

Spatialbot: Precise spatial understanding with vision language models

Wenxiao Cai, Iaroslav Ponomarenko, Jianhao Yuan, Xiaoqi Li, Wankou Yang, Hao Dong, and Bo Zhao. Spatialbot: Precise spatial understanding with vision language models. In 2025 IEEE International Conference on Robotics and Automation (ICRA), pages 9490–9498. IEEE, 2025

[55] [55]

Spatialllm: A compound 3d-informed design towards spatially-intelligent large multimodal models

Wufei Ma, Luoxin Ye, Celso M de Melo, Alan Yuille, and Jieneng Chen. Spatialllm: A compound 3d-informed design towards spatially-intelligent large multimodal models. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 17249–17260, 2025

[56] [56]

Robospatial: Teaching spatial understanding to 2d and 3d vision-language models for robotics

Chan Hee Song, Valts Blukis, Jonathan Tremblay, Stephen Tyree, Yu Su, and Stan Birchfield. Robospatial: Teaching spatial understanding to 2d and 3d vision-language models for robotics. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 15768–15780, 2025

[57] [57]

Gemini Robotics: Bringing AI into the Physical World

Gemini Robotics Team, Saminda Abeyruwan, Joshua Ainslie, Jean-Baptiste Alayrac, Montserrat Gonzalez Arenas, Travis Armstrong, Ashwin Balakrishna, Robert Baruch, Maria Bauza, Michiel Blokzijl, et al. Gemini robotics: Bringing ai into the physical world. arXiv preprint arXiv:2503.20020, 2025

[58] [58]

Spatialcot: Advancing spatial reasoning through coordinate alignment and chain-of-thought for embodied task planning

Yuecheng Liu, Dafeng Chi, Shiguang Wu, Zhanguang Zhang, Yaochen Hu, Lingfeng Zhang, Yingxue Zhang, Shuang Wu, Tongtong Cao, Guowei Huang, et al. Spatialcot: Advancing spatial reasoning through coordinate alignment and chain-of-thought for embodied task planning. arXiv preprint arXiv:2501.10074, 2025

[59] [60]

Omnispatial: Towards comprehensive spatial reasoning benchmark for vision language models

Mengdi Jia, Zekun Qi, Shaochen Zhang, Wenyao Zhang, Xinqiang Yu, Jiawei He, He Wang, and Li Yi. Omnispatial: Towards comprehensive spatial reasoning benchmark for vision language models. arXiv preprint arXiv:2506.03135, 2025

[60] [61]

Hssbench: Benchmarking humanities and social sciences ability for multimodal large language models

Zhaolu Kang, Junhao Gong, Jiaxu Yan, Wanke Xia, Yian Wang, Ziwen Wang, Huaxuan Ding, Zhuo Cheng, Wenhao Cao, Zhiyuan Feng, et al. Hssbench: Benchmarking humanities and social sciences ability for multimodal large language models. arXiv preprint arXiv:2506.03922, 2025

[61] [62]

Spatial-dise: A unified benchmark for evaluating spatial reasoning in vision-language models

Xinmiao Huang, Qisong He, Zhenglin Huang, Boxuan Wang, Zhuoyun Li, Guangliang Cheng, and Yi Dong. Spatial-dise: A unified benchmark for evaluating spatial reasoning in vision-language models. arXiv preprint arXiv:2510.13394, 2025

[62] [63]

Mind the Gap: Benchmarking Spatial Reasoning in Vision-Language Models

Ilias Stogiannidis, Steven McDonagh, and Sotirios A Tsaftaris. Mind the gap: Benchmarking spatial reasoning in vision-language models. arXiv preprint arXiv:2503.19707, 2025

[63] [65]

Mmsi-video-bench: A holistic benchmark for video-based spatial intelligence

Jingli Lin, Runsen Xu, Shaohao Zhu, Sihan Yang, Peizhou Cao, Yunlong Ran, Miao Hu, and Chenming Zhu. Mmsi-video-bench: A holistic benchmark for video-based spatial intelligence. arXiv preprint arXiv:2512.10863, 2025

[64] [66]

Cambrian-S: Towards Spatial Supersensing in Video

Shusheng Yang, Jihan Yang, Pinzhi Huang, Ellis Brown, Zihao Yang, Yue Yu, Shengbang Tong, and Zihan Zhen. Cambrian-s: Towards spatial supersensing in video. arXiv preprint arXiv:2511.04670, 2025

[65] [67]

Agqa: A benchmark for compositional spatio-temporal reasoning

Madeleine Grunde-McLaughlin, Ranjay Krishna, and Maneesh Agrawala. Agqa: A benchmark for compositional spatio-temporal reasoning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021

[66] [68]

Next-qa: Next phase of question-answering to explaining temporal actions

Junbin Xiao, Xindi Shang, Angela Yao, and Tat-Seng Chua. Next-qa: Next phase of question-answering to explaining temporal actions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021

[67] [69]

Visuospatial perspective taking in multimodal language models. arXiv preprint arXiv:2603.23510, 2026

Jonathan Prunty, Seraphina Zhang, Patrick Quinn, Jianxun Lian, Xing Xie, and Lucy Cheke. Visuospatial perspective taking in multimodal language models. arXiv preprint arXiv:2603.23510, 2026

[68] [70]

Egocentric bias in vision-language models. arXiv preprint arXiv:2602.15892, 2026

Maijunxian Wang, Yijiang Li, Bingyang Wang, Tianwei Zhao, Ran Ji, Qingying Gao, Emmy Liu, and Hokin Deng. Egocentric bias in vision-language models. arXiv preprint arXiv:2602.15892, 2026

[69] [71]

Allocentric perceiver: Disentangling allocentric reasoning from egocentric visual priors via frame instantiation. arXiv preprint arXiv:2602.05789, 2026

Hengyi Wang, Ruiqiang Zhang, Chang Liu, Guanjie Wang, Zehua Ma, Han Fang, and Weiming Zhang. Allocentric perceiver: Disentangling allocentric reasoning from egocentric visual priors via frame instantiation. arXiv preprint arXiv:2602.05789, 2026

[70] [72]

Keep it sympl: Symbolic projective layout for allocentric spatial reasoning in vision-language models. arXiv preprint arXiv:2602.19117, 2026

Jaeyun Jang, Seunghui Shin, Taeho Park, and Hyoseok Hwang. Keep it sympl: Symbolic projective layout for allocentric spatial reasoning in vision-language models. arXiv preprint arXiv:2602.19117, 2026

[71] [73]

Capture: Evaluating spatial reasoning in vision language models via occluded object counting. arXiv preprint arXiv:2504.15485, 2025

Atin Pothiraj, Elias Stengel-Eskin, Jaemin Cho, and Mohit Bansal. Capture: Evaluating spatial reasoning in vision language models via occluded object counting. arXiv preprint arXiv:2504.15485, 2025

[72] [74]

Beyond the visible: Benchmarking occlusion perception in multimodal large language models. arXiv preprint arXiv:2508.04059, 2025

Zhaochen Liu, Kaiwen Gao, Shuyi Liang, Bin Xiao, Limeng Qiao, Lin Ma, and Tingting Jiang. Beyond the visible: Benchmarking occlusion perception in multimodal large language models. arXiv preprint arXiv:2508.04059, 2025

[73] [75]

Mind over space: Can multimodal large language models mentally navigate? arXiv preprint arXiv:2603.21577, 2026

Qihui Zhu, Shouwei Ruan, Xiao Yang, Hao Jiang, Yao Huang, Shiji Zhao, Hanwei Fan, Hang Su, and Xingxing Wei. Mind over space: Can multimodal large language models mentally navigate? arXiv preprint arXiv:2603.21577, 2026

[74] [76]

Video2layout: Recall and reconstruct metric-grounded cognitive map for spatial reasoning. arXiv preprint arXiv:2511.16160, 2025

Yibin Huang, Wang Xu, Wanyue Zhang, Helu Zhi, Jingjing Huang, Yangbin Xu, Yangang Sun, and Conghui Zhu. Video2layout: Recall and reconstruct metric-grounded cognitive map for spatial reasoning. arXiv preprint arXiv:2511.16160, 2025

[75] [77]

Embodied-r: Collaborative framework for activating embodied spatial reasoning in foundation models via reinforcement learning. arXiv preprint arXiv:2504.12680, 2025

Baining Zhao, Ziyou Wang, Jianjie Fang, Chen Gao, Fanhang Man, Jinqiang Cui, Xin Wang, and Xinlei Chen. Embodied-r: Collaborative framework for activating embodied spatial reasoning in foundation models via reinforcement learning. arXiv preprint arXiv:2504.12680, 2025

[76] [78]

Talk2car: Taking control of your self-driving car

Thierry Deruyttere, Simon Vandenhende, Dusan Grujicic, Luc Van Gool, and Marie Francine Moens. Talk2car: Taking control of your self-driving car. In Proceedings of the 2019 conference on empirical methods in natural language processing and the 9th international joint conference on natural language processing (EMNLP-IJCNLP), pages 2088–2098, 2019

[77] [79]

Textual explanations for self-driving vehicles

Jinkyu Kim, Anna Rohrbach, Trevor Darrell, John Canny, and Zeynep Akata. Textual explanations for self-driving vehicles. In Proceedings of the European conference on computer vision (ECCV), pages 563–578, 2018

[78] [80]

Nuscenes-mqa: Integrated evaluation of captions and qa for autonomous driving datasets using markup annotations. arXiv preprint arXiv:2312.06352, 2023

Yuichi Inoue, Yuki Yada, Kotaro Tanahashi, and Yu Yamaguchi. Nuscenes-mqa: Integrated evaluation of captions and qa for autonomous driving datasets using markup annotations. arXiv preprint arXiv:2312.06352, 2023

[79] [81]

Stride-qa: Visual question answering dataset for spatiotemporal reasoning in urban driving scenes. arXiv preprint arXiv:2508.10427, 2025

Keishi Ishihara, Kento Sasaki, Tsubasa Takahashi, Daiki Shiono, and Yu Yamaguchi. Stride-qa: Visual question answering dataset for spatiotemporal reasoning in urban driving scenes. arXiv preprint arXiv:2508.10427, 2025

[80] [82]

Bench2drive: Towards multi-ability benchmarking of closed-loop end-to-end autonomous driving. arXiv preprint arXiv:2406.03877, 2024

Xiaosong Jia, Zhenjie Yang, Qifeng Li, Zhiyuan Zhang, and Junchi Yan. Bench2drive: Towards multi-ability benchmarking of closed-loop end-to-end autonomous driving. arXiv preprint arXiv:2406.03877, 2024
