
arxiv: 2412.14171 · v2 · pith:BY5ENMHMnew · submitted 2024-12-18 · 💻 cs.CV

Thinking in Space: How Multimodal Large Language Models See, Remember, and Recall Spaces

Pith reviewed 2026-05-22 09:23 UTC · model grok-4.3

classification 💻 cs.CV
keywords multimodal large language models · visual-spatial intelligence · cognitive maps · spatial reasoning · VSI-Bench · video understanding · world models

The pith

Multimodal large language models show some ability to think in space from video but remain below human level: spatial reasoning is the main limit, and generating cognitive maps helps on distance tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether MLLMs trained on large-scale video data can remember and reason about physical spaces from sequential views, as humans do. It introduces VSI-Bench, a benchmark of more than 5,000 question-answer pairs, and shows that models reach competitive performance yet fall short of humans. Standard language-based reasoning techniques fail to raise scores, but having the model generate an explicit cognitive map while answering improves results on distance questions. This indicates that some spatial awareness arises inside these models even without direct supervision for it.

Core claim

MLLMs exhibit competitive though subhuman visual-spatial intelligence on VSI-Bench; spatial reasoning remains the primary bottleneck, local world models and spatial awareness emerge, linguistic reasoning techniques fail to improve performance, and explicitly generating cognitive maps enhances spatial distance ability.

What carries the argument

VSI-Bench, a video-based visual-spatial intelligence benchmark of over 5,000 question-answer pairs, together with explicit cognitive map generation as a reasoning step that improves distance judgments.
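
As a rough illustration of why an explicit map could help with distance questions, the sketch below treats a cognitive map as object names placed on a top-down 2D grid and answers a "which object is closest" question by Euclidean distance. The grid representation, the object layout, and the helper names `distance` and `closest_object` are assumptions made for illustration; they are not taken from the paper's prompts or code.

```python
import math

# Hypothetical cognitive map: object name -> (x, y) cell on a top-down grid.
# The layout is invented for this example.
cognitive_map = {
    "sofa": (2, 3),
    "table": (5, 3),
    "lamp": (2, 8),
    "door": (9, 9),
}

def distance(a, b):
    """Euclidean distance between two grid cells."""
    return math.hypot(a[0] - b[0], a[1] - b[1])

def closest_object(cmap, target, candidates):
    """Return the candidate whose map position is nearest to the target."""
    return min(candidates, key=lambda name: distance(cmap[name], cmap[target]))

answer = closest_object(cognitive_map, "sofa", ["table", "lamp", "door"])
print(answer)
```

Once object positions are written down, the distance comparison becomes simple arithmetic; the hard part, which this sketch does not address, is producing an accurate map from video in the first place.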

If this is right

  • Spatial reasoning, not perception or memory, sets the main limit on MLLM performance for space-related tasks.
  • Local world models and spatial awareness form inside MLLMs from video training alone.
  • Chain-of-thought, self-consistency, and tree-of-thoughts produce no gains on these spatial questions.
  • Forcing explicit cognitive map output specifically boosts accuracy on spatial distance items.
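
For readers unfamiliar with self-consistency, one of the linguistic techniques listed above, a minimal sketch follows: several reasoning paths are sampled and the most frequent final answer is kept. The function name `self_consistency_vote` and the sample answers are hypothetical; real use would draw the samples from a model, which is omitted here.

```python
from collections import Counter

def self_consistency_vote(sampled_answers):
    """Return the most frequent final answer among sampled reasoning paths.

    Ties resolve to the answer that appeared first in the input.
    """
    if not sampled_answers:
        raise ValueError("need at least one sampled answer")
    counts = Counter(sampled_answers)
    return counts.most_common(1)[0][0]

# Made-up final answers from five sampled reasoning paths.
samples = ["B", "A", "B", "C", "B"]
final = self_consistency_vote(samples)
print(final)
```

Voting can only help when the sampled paths disagree because of noise; if every path makes the same spatial error, the majority answer is still wrong, which would be consistent with the reported lack of gains.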

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Future model designs could add dedicated modules for maintaining and updating spatial maps to close the remaining gap with humans.
  • The same video training that produces these abilities might already support downstream uses such as navigation planning or scene reconstruction.
  • New tests could check whether the observed spatial awareness transfers to physical robots or to environments never seen during pretraining.

Load-bearing premise

The VSI-Bench questions and evaluation protocol accurately measure human-like visual-spatial intelligence rather than surface-level pattern matching or dataset artifacts.

What would settle it

Human participants scoring only marginally above the best MLLMs on the full VSI-Bench, or MLLMs maintaining high scores on distance questions even when prevented from generating cognitive maps.

Original abstract

Humans possess the visual-spatial intelligence to remember spaces from sequential visual observations. However, can Multimodal Large Language Models (MLLMs) trained on million-scale video datasets also "think in space" from videos? We present a novel video-based visual-spatial intelligence benchmark (VSI-Bench) of over 5,000 question-answer pairs, and find that MLLMs exhibit competitive - though subhuman - visual-spatial intelligence. We probe models to express how they think in space both linguistically and visually and find that while spatial reasoning capabilities remain the primary bottleneck for MLLMs to reach higher benchmark performance, local world models and spatial awareness do emerge within these models. Notably, prevailing linguistic reasoning techniques (e.g., chain-of-thought, self-consistency, tree-of-thoughts) fail to improve performance, whereas explicitly generating cognitive maps during question-answering enhances MLLMs' spatial distance ability.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces VSI-Bench, a new video-based benchmark with over 5,000 QA pairs, to evaluate visual-spatial intelligence in MLLMs. It reports that current MLLMs achieve competitive but subhuman performance on the benchmark, identifies spatial reasoning as the primary bottleneck, shows that linguistic techniques such as chain-of-thought fail to help, and finds that explicitly generating cognitive maps during inference improves performance on spatial distance tasks.

Significance. If the benchmark accurately isolates sequential spatial reasoning, the work supplies a useful new evaluation resource and empirical evidence that explicit cognitive-map generation can mitigate a key limitation in MLLMs. The new data collection and the contrast between failed linguistic interventions and successful map generation are concrete strengths that could inform future model design for better world modeling.

major comments (3)
  1. [§3] §3 (VSI-Bench construction): the paper provides no details on question validation, inter-annotator agreement, or explicit controls that would rule out solutions via frame-level object statistics, temporal co-occurrence patterns, or linguistic priors rather than genuine sequential spatial reconstruction.
  2. [§5] §5 (Experiments and results): performance gaps and the reported benefit of cognitive-map generation are presented without statistical significance tests, without ablations that hold video length and model scale constant, and without human baselines collected under matched conditions.
  3. [§6] §6 (Analysis of reasoning strategies): the claim that spatial reasoning remains the primary bottleneck rests on the assumption that VSI-Bench items cannot be solved by surface cues; no diagnostic experiments or error analysis are supplied to support this assumption.
minor comments (2)
  1. [Figures] Figure captions and axis labels in several result plots omit units or confidence intervals, reducing readability.
  2. [Related Work] A small number of recent works on spatial reasoning benchmarks for vision-language models are missing from the related-work section.
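
Major comment 2 asks for significance tests on paired comparisons. One possible form, assuming per-question correctness is recorded for the same questions with and without cognitive-map generation, is an exact McNemar test. This is a technique chosen here for illustration, not a test the paper reports; the function name `mcnemar_exact` and the data are made up.

```python
from math import comb

def mcnemar_exact(baseline_correct, treated_correct):
    """Exact two-sided McNemar test on paired binary outcomes.

    Returns (b, c, p_value), where b counts questions only the baseline
    got right and c counts questions only the treated condition got right.
    """
    if len(baseline_correct) != len(treated_correct):
        raise ValueError("paired lists must have equal length")
    pairs = list(zip(baseline_correct, treated_correct))
    # Only discordant pairs carry information about a difference.
    b = sum(1 for x, y in pairs if x and not y)
    c = sum(1 for x, y in pairs if y and not x)
    n = b + c
    if n == 0:
        return b, c, 1.0
    # Two-sided binomial tail under the null that b and c are equally likely.
    tail = sum(comb(n, i) for i in range(min(b, c) + 1)) / 2 ** n
    return b, c, min(1.0, 2 * tail)

# Invented per-question correctness (1 = correct) for ten questions.
baseline = [1, 0, 0, 1, 0, 0, 1, 0, 0, 0]
with_map = [1, 1, 1, 1, 1, 0, 1, 1, 1, 0]
b, c, p_value = mcnemar_exact(baseline, with_map)
print(b, c, p_value)
```

The test ignores questions where both conditions agree, so it directly targets whether the intervention flips more answers from wrong to right than the reverse.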

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their detailed and constructive comments, which have helped us identify areas where the manuscript can be strengthened. We address each major comment below and describe the revisions we will incorporate.

Point-by-point responses
  1. Referee: [§3] §3 (VSI-Bench construction): the paper provides no details on question validation, inter-annotator agreement, or explicit controls that would rule out solutions via frame-level object statistics, temporal co-occurrence patterns, or linguistic priors rather than genuine sequential spatial reconstruction.

    Authors: We appreciate the referee's emphasis on rigorous benchmark validation. The original manuscript outlines the data collection pipeline in Section 3 but does not provide sufficient specifics on validation steps. In the revised manuscript, we will expand this section to report: (i) the multi-stage question validation process involving independent annotators, (ii) inter-annotator agreement statistics such as Cohen's kappa, and (iii) explicit filtering and control measures (e.g., adversarial question design and statistical checks) intended to prevent solutions based on single-frame cues, temporal co-occurrences, or linguistic shortcuts. These additions will more clearly establish that successful performance requires sequential spatial reconstruction. revision: yes

  2. Referee: [§5] §5 (Experiments and results): performance gaps and the reported benefit of cognitive-map generation are presented without statistical significance tests, without ablations that hold video length and model scale constant, and without human baselines collected under matched conditions.

    Authors: We agree that stronger statistical support and controlled comparisons would improve the results section. In the revision we will: (1) include statistical significance tests (paired t-tests or appropriate non-parametric equivalents) for key performance differences, (2) add or clarify ablations that hold video length and model scale fixed where feasible, and (3) provide additional details on the human baseline collection protocol, including the number of participants, instructions given, and how conditions were matched to model evaluations. These changes will allow readers to better assess the reliability of the reported gaps and the benefit of cognitive-map generation. revision: yes

  3. Referee: [§6] §6 (Analysis of reasoning strategies): the claim that spatial reasoning remains the primary bottleneck rests on the assumption that VSI-Bench items cannot be solved by surface cues; no diagnostic experiments or error analysis are supplied to support this assumption.

    Authors: We acknowledge that the current analysis would benefit from more direct evidence that surface cues are insufficient. We will add a dedicated subsection presenting diagnostic experiments (e.g., performance on variants where spatial relations are perturbed while preserving low-level statistics) together with a systematic error analysis that categorizes model failures according to whether they stem from spatial mis-reasoning versus other factors. This will provide empirical support for identifying spatial reasoning as the primary bottleneck. revision: yes
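
The first response mentions reporting inter-annotator agreement with Cohen's kappa. As a minimal sketch of that statistic for two annotators, assuming each annotator assigns one categorical label per question, the code below computes observed agreement corrected for chance agreement. The label names and data are invented and do not come from VSI-Bench.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Fraction of items on which the annotators agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected by chance from each annotator's label frequencies.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1 - expected)

# Hypothetical validity judgments on six benchmark questions.
annotator_1 = ["valid", "valid", "invalid", "valid", "invalid", "valid"]
annotator_2 = ["valid", "invalid", "invalid", "valid", "invalid", "valid"]
kappa = cohens_kappa(annotator_1, annotator_2)
print(round(kappa, 3))
```

Here the annotators agree on five of six items while chance agreement is 0.5, so kappa is noticeably lower than the raw agreement rate; that gap is the reason kappa is usually preferred over raw agreement.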

Circularity Check

0 steps flagged

Empirical benchmark study with no load-bearing circularity

Full rationale

The paper introduces VSI-Bench (a new video-based visual-spatial intelligence benchmark with over 5,000 QA pairs) and reports empirical evaluations of MLLMs on it, including probes for linguistic vs. visual thinking and the effect of explicit cognitive map generation. No equations, fitted parameters, or derivation chains exist that reduce the reported performance numbers or intervention benefits to prior self-citations by construction. Central claims rest on fresh data collection and controlled experiments rather than self-referential definitions or uniqueness theorems imported from the authors' prior work. Any self-citations present are incidental and non-load-bearing for the benchmark results.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Claims rest on the unverified assumption that the new benchmark faithfully captures visual-spatial intelligence and that model outputs during probing reflect genuine internal spatial representations rather than prompt artifacts.

axioms (1)
  • domain assumption: The constructed VSI-Bench questions validly measure visual-spatial intelligence comparable to human performance.
    This premise is required to interpret subhuman model scores and the cognitive-map improvement as evidence about model capabilities.

pith-pipeline@v0.9.0 · 5706 in / 1170 out tokens · 30397 ms · 2026-05-22T09:23:14.427608+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • Foundation.DimensionForcing · dimension_forced · unclear

    Relation between the paper passage and the cited Recognition theorem: unclear.

    We present a novel video-based visual-spatial intelligence benchmark (VSI-Bench) of over 5,000 question-answer pairs... MLLMs exhibit competitive - though subhuman - visual-spatial intelligence... explicitly generating cognitive maps during question-answering enhances MLLMs' spatial distance ability.

What do these tags mean?

  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 24 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. SpaceDG: Benchmarking Spatial Intelligence under Visual Degradation

    cs.CV 2026-05 unverdicted novelty 7.0

    SpaceDG introduces the first large-scale degradation-aware spatial reasoning dataset using 3D Gaussian Splatting synthesis, showing that visual degradations impair MLLM performance but finetuning on the data improves ...

  2. Ablate-to-Validate: Are Vision-Language Models Really Using Continuous Thought Tokens?

    cs.CV 2026-05 unverdicted novelty 7.0

    The Token Replacement Test shows VLMs keep most accuracy gains even after corrupting or replacing continuous thought token content, indicating the tokens are not used as information bottlenecks.

  3. SaaS-Bench: Can Computer-Use Agents Leverage Real-World SaaS to Solve Professional Workflows?

    cs.AI 2026-05 unverdicted novelty 7.0

    SaaS-Bench provides 106 realistic professional tasks across 23 deployable SaaS platforms to evaluate LLM-based agents, finding that even the strongest models complete fewer than 4% of tasks end-to-end.

  4. EgoTL: Egocentric Think-Aloud Chains for Long-Horizon Tasks

    cs.CV 2026-04 unverdicted novelty 7.0

    EgoTL provides a new egocentric dataset with think-aloud chains and metric labels that benchmarks VLMs on long-horizon tasks and improves their planning, reasoning, and spatial grounding after finetuning.

  5. Video-Holmes: Can MLLM Think Like Holmes for Complex Video Reasoning?

    cs.CV 2025-05 conditional novelty 7.0

    Video-Holmes benchmark shows top MLLMs achieve at most 45% accuracy on tasks needing integration of multiple clues from suspense films, unlike existing perception-focused tests.

  6. SpaceR: Reinforcing MLLMs in Video Spatial Reasoning

    cs.CV 2025-04 unverdicted novelty 7.0

    SpaceR uses a new verifiable dataset and map-imagination-augmented RLVR to reach SOTA spatial reasoning accuracy in MLLMs, exceeding GPT-4o on VSI-Bench.

  7. Video-R1: Reinforcing Video Reasoning in MLLMs

    cs.CV 2025-03 conditional novelty 7.0

    Video-R1 uses temporal-aware RL and mixed datasets to boost video reasoning in MLLMs, with a 7B model reaching 37.1% on VSI-Bench and surpassing GPT-4o.

  8. Flat-Pack Bench: Evaluating Spatio-Temporal Understanding in Large Vision-Language Models through Furniture Assembly

    cs.CV 2026-05 unverdicted novelty 6.0

    Flat-Pack Bench is a new evaluation suite that shows state-of-the-art LVLMs perform poorly on nuanced spatio-temporal reasoning required for furniture assembly videos.

  9. SpaceMind++: Toward Allocentric Cognitive Maps for Spatially Grounded Video MLLMs

    cs.CV 2026-05 unverdicted novelty 6.0

    SpaceMind++ adds an explicit voxelized allocentric cognitive map and coordinate-guided fusion to video MLLMs, claiming SOTA on VSI-Bench and improved out-of-distribution generalization on three other 3D benchmarks.

  10. Video-ToC: Video Tree-of-Cue Reasoning

    cs.CV 2026-04 unverdicted novelty 6.0

    Video-ToC adds tree-guided cue localization, demand-based RL rewards, and automated datasets to video LLMs, reporting better results than prior methods on six understanding benchmarks plus a hallucination test.

  11. InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency

    cs.CV 2025-08 unverdicted novelty 6.0

    InternVL3.5 advances open-source multimodal models with Cascade RL for +16% reasoning gains and ViR for 4x inference speedup, with the 241B model reaching SOTA among open-source MLLMs on multimodal, reasoning, and age...

  12. Embodied-R1: Reinforced Embodied Reasoning for General Robotic Manipulation

    cs.RO 2025-08 conditional novelty 6.0

    Embodied-R1 uses a pointing-centric representation and reinforced fine-tuning on a 200K dataset to achieve state-of-the-art results on embodied benchmarks plus 56.2% success in SIMPLEREnv and 87.5% on real XArm tasks ...

  13. Reinforcing Spatial Reasoning in Vision-Language Models with Interwoven Thinking and Visual Drawing

    cs.CV 2025-06 unverdicted novelty 6.0

    VILASR integrates visual drawing operations with reasoning in LVLMs via cold-start synthetic training, reflective rejection sampling, and reinforcement learning, yielding an 18.4% average gain on spatial reasoning benchmarks.

  14. Spatial-MLLM: Boosting MLLM Capabilities in Visual-based Spatial Intelligence

    cs.CV 2025-05 unverdicted novelty 6.0

    Spatial-MLLM adds a 3D spatial encoder initialized from a visual geometry model and space-aware frame sampling to MLLMs to improve spatial understanding and reasoning from purely 2D visual inputs.

  15. Spatial-MLLM: Boosting MLLM Capabilities in Visual-based Spatial Intelligence

    cs.CV 2025-05 unverdicted novelty 6.0

    Spatial-MLLM boosts MLLM spatial intelligence from 2D inputs via dual encoders initialized from geometry models plus space-aware sampling, claiming state-of-the-art results.

  16. Grounded Reinforcement Learning for Visual Reasoning

    cs.CV 2025-05 unverdicted novelty 6.0

    ViGoRL introduces visually grounded RL that anchors reasoning steps to image coordinates and uses multi-turn zooming to outperform standard RL and supervised baselines on spatial and GUI reasoning benchmarks.

  17. VLM-3R: Vision-Language Models Augmented with Instruction-Aligned 3D Reconstruction

    cs.CV 2025-05 unverdicted novelty 6.0

    VLM-3R augments VLMs with implicit 3D tokens from monocular video via geometry encoding and 200K+ 3D reconstructive QA pairs, plus a new 138K-pair temporal benchmark, to support spatial and embodied reasoning.

  18. InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models

    cs.CV 2025-04 conditional novelty 6.0

    InternVL3-78B sets a new open-source SOTA of 72.2 on MMMU via native joint multimodal pre-training, V2PE, MPO, and test-time scaling while remaining competitive with proprietary models.

  19. MAG-3D: Multi-Agent Grounded Reasoning for 3D Understanding

    cs.CV 2026-04 unverdicted novelty 5.0

    MAG-3D is a training-free multi-agent framework that coordinates planning, grounding, and coding agents with off-the-shelf VLMs to achieve grounded 3D reasoning and state-of-the-art benchmark results.

  20. SpatialVLA: Exploring Spatial Representations for Visual-Language-Action Model

    cs.RO 2025-01 unverdicted novelty 5.0

    SpatialVLA adds 3D-aware position encoding and adaptive discretized action grids to visual-language-action models, enabling strong zero-shot performance and fine-tuning on new robot setups after pre-training on 1.1 mi...

  21. LASAR: Towards Spatio-temporal Reasoning with Latent Cognitive Map

    cs.CV 2026-05 unverdicted novelty 4.0

    LASAR pairs a dual-memory system with spatio-temporal contrastive learning to induce latent cognitive maps, reporting 2-3.5% zero-shot gains on VLN-CE and VSI-Bench plus high map self-consistency.

  22. EasyVideoR1: Easier RL for Video Understanding

    cs.CV 2026-04 unverdicted novelty 4.0

    EasyVideoR1 delivers an optimized RL pipeline for video understanding in large vision-language models, achieving 1.47x throughput gains and aligned results on 22 benchmarks.

  23. Vision-EKIPL: External Knowledge-Infused Policy Learning for Visual Reasoning

    cs.CV 2025-06 unverdicted novelty 4.0

    Vision-EKIPL injects high-quality actions from external models into RL training to expand exploration and raise the reasoning ceiling of MLLMs, reporting up to 5% gains on the Reason-RFT-CoT benchmark.

  24. Multimodal Chain-of-Thought Reasoning: A Comprehensive Survey

    cs.CV 2025-03 unverdicted novelty 2.0

    The paper provides the first comprehensive survey of multimodal chain-of-thought reasoning, including foundational concepts, a taxonomy of methodologies, application analyses, challenges, and future directions.

Reference graph

Works this paper leans on

122 extracted references · 122 canonical work pages · cited by 23 Pith papers · 20 internal anchors

  1. [1]

    Flamingo: a visual language model for few-shot learning

    Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, An- toine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, et al. Flamingo: a visual language model for few-shot learning. NeurIPS, 2022. 2

  2. [2]

    Working memory

    Alan Baddeley. Working memory. Science, 255(5044): 556–559, 1992. 2

  3. [3]

    Qwen Technical Report

    Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, et al. Qwen technical report. arXiv preprint arXiv:2309.16609, 2023. 2, 8

  4. [4]

    Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond

    Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. Qwen-vl: A frontier large vision-language model with versatile abilities. arXiv preprint arXiv:2308.12966 ,

  5. [5]

    ARKitscenes - a diverse real-world dataset for 3d indoor scene understanding using mobile RGB-d data

    Gilad Baruch, Zhuoyuan Chen, Afshin Dehghan, Tal Dimry, Yuri Feigin, Peter Fu, Thomas Gebauer, Brandon Joffe, Daniel Kurz, Arik Schwartz, and Elad Shulman. ARKitscenes - a diverse real-world dataset for 3d indoor scene understanding using mobile RGB-d data. InNeurIPS,

  6. [6]

    Large linguistic models: Analyzing theoretical linguistic abilities of llms

    Gašper Beguš, Maksymilian D ˛ abkowski, and Ryan Rhodes. Large linguistic models: Analyzing theoretical linguistic abilities of llms. arXiv preprint arXiv:2305.00948 , 2023. 2

  7. [7]

    Rt-2: Vision-language-action models transfer web knowledge to robotic control

    Anthony Brohan, Noah Brown, Justice Carbajal, Yevgen Chebotar, Xi Chen, Krzysztof Choromanski, Tianli Ding, Danny Driess, Avinava Dubey, Chelsea Finn, et al. Rt-2: Vision-language-action models transfer web knowledge to robotic control. In CoRL, 2023. 2

  8. [8]

    Rt-1: Robotics transformer for real-world con- trol at scale

    Anthony Brohan, Noah Brown, Justice Carbajal, Yev- gen Chebotar, Joseph Dabis, Chelsea Finn, Keerthana Gopalakrishnan, Karol Hausman, Alex Herzog, Jasmine Hsu, et al. Rt-1: Robotics transformer for real-world con- trol at scale. In RSS, 2023. 2

  9. [9]

    Language models are few-shot learners

    Tom Brown, Benjamin Mann, Nick Ryder, Melanie Sub- biah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakan- tan, Pranav Shyam, Girish Sastry, Amanda Askell, Sand- hini Agarwal, Ariel Herbert-V oss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel Ziegler, Jeffrey Wu, Clemens Winter, Chris Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, S...

  10. [10]

    Spa- tialbot: Precise spatial understanding with vision language models

    Wenxiao Cai, Yaroslav Ponomarenko, Jianhao Yuan, Xi- aoqi Li, Wankou Yang, Hao Dong, and Bo Zhao. Spa- tialbot: Precise spatial understanding with vision language models. arXiv preprint arXiv:2406.13642, 2024. 8

  11. [11]

    Spatial and object visualization cognitive styles: Validation studies in 3800 individuals

    Christopher F Chabris, Thomas E Jerde, Anita W Woolley, Margaret E Gerbasi, Jonathon P Schuldt, Sean L Bennett, J Richard Hackman, and Stephen M Kosslyn. Spatial and object visualization cognitive styles: Validation studies in 3800 individuals. Group brain technical report , 2:1–20,

  12. [12]

    Hadzic, Taran Kota, Jimming He, Cristobal Eyzaguirre, Zane Du- rante, Manling Li, Jiajun Wu, and Fei-Fei Li

    Keshigeyan Chandrasegaran, Agrim Gupta, Lea M. Hadzic, Taran Kota, Jimming He, Cristobal Eyzaguirre, Zane Du- rante, Manling Li, Jiajun Wu, and Fei-Fei Li. Hourvideo: 1-hour video-language understanding. In NeurIPS, 2024. 2, 17

  13. [13]

    Spatialvlm: Endow- ing vision-language models with spatial reasoning capabil- ities

    Boyuan Chen, Zhuo Xu, Sean Kirmani, Brain Ichter, Dorsa Sadigh, Leonidas Guibas, and Fei Xia. Spatialvlm: Endow- ing vision-language models with spatial reasoning capabil- ities. In CVPR, 2024. 8

  14. [14]

    How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites

    Zhe Chen, Weiyun Wang, Hao Tian, Shenglong Ye, Zhang- wei Gao, Erfei Cui, Wenwen Tong, Kongzhi Hu, Jiapeng Luo, Zheng Ma, et al. How far are we to gpt-4v? clos- ing the gap to commercial multimodal models with open- source suites. arXiv preprint arXiv:2404.16821, 2024. 4

  15. [15]

    Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks

    Zhe Chen, Jiannan Wu, Wenhai Wang, Weijie Su, Guo Chen, Sen Xing, Muyan Zhong, Qinglong Zhang, Xizhou Zhu, Lewei Lu, et al. Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks. In CVPR, 2024. 2

  16. [16]

    Spatialrgpt: Grounded spatial reasoning in vision-language models

    An-Chieh Cheng, Hongxu Yin, Yang Fu, Qiushan Guo, Ruihan Yang, Jan Kautz, Xiaolong Wang, and Sifei Liu. Spatialrgpt: Grounded spatial reasoning in vision-language models. In NeurIPS, 2024. 8

  17. [17]

    Spatially-aware transformers for embodied agents

    Junmo Cho, Jaesik Yoon, and Sungjin Ahn. Spatially-aware transformers for embodied agents. In ICLR, 2023. 8

  18. [18]

    Clark and Allan Paivio

    James M. Clark and Allan Paivio. Dual coding theory and education. Educational Psychology Review, 3(3):149–210,

  19. [19]

    Scannet: Richly-annotated 3d reconstructions of indoor scenes

    Angela Dai, Angel X Chang, Manolis Savva, Maciej Hal- ber, Thomas Funkhouser, and Matthias Nießner. Scannet: Richly-annotated 3d reconstructions of indoor scenes. In CVPR, 2017. 3, 13

  20. [20]

    Milton J. Dehn. Working Memory and Academic Learning: Assessment and Intervention. John Wiley & Sons, 2011. 3

  21. [21]

    Palm-e: An embodied multimodal language model

    Danny Driess, Fei Xia, Mehdi SM Sajjadi, Corey Lynch, Aakanksha Chowdhery, Brian Ichter, Ayzaan Wahid, Jonathan Tompson, Quan Vuong, Tianhe Yu, et al. Palm-e: An embodied multimodal language model. In ICML, 2023. 2, 8

  22. [22]

    The pascal visual object classes (voc) challenge

    Mark Everingham, Luc Van Gool, Christopher KI Williams, John Winn, and Andrew Zisserman. The pascal visual object classes (voc) challenge. IJCV, 2010. 4

  23. [23]

    Mmbench-video: A long-form multi-shot benchmark for holistic video under- standing

    Xinyu Fang, Kangrui Mao, Haodong Duan, Xiangyu Zhao, Yining Li, Dahua Lin, and Kai Chen. Mmbench-video: A long-form multi-shot benchmark for holistic video under- standing. In NeurIPS, 2024. 8

  24. [24]

    Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis

    Chaoyou Fu, Yuhan Dai, Yondong Luo, Lei Li, Shuhuai Ren, Renrui Zhang, Zihan Wang, Chenyu Zhou, Yunhang Shen, Mengdan Zhang, et al. Video-mme: The first-ever comprehensive evaluation benchmark of multi-modal llms in video analysis. arXiv preprint arXiv:2405.21075, 2024. 4, 7, 8, 16

  25. [25]

    Self- explanation prompting improves dialogue understanding in large language models

    Haoyu Gao, Ting-En Lin, Hangyu Li, Min Yang, Yuchuan Wu, Wentao Ma, Fei Huang, and Yongbin Li. Self- explanation prompting improves dialogue understanding in large language models. In COLING, 2024. 5

  26. [26]

    Frames of Mind: The Theory of Multi- ple Intelligences

    Howard Gardner. Frames of Mind: The Theory of Multi- ple Intelligences. Basic Books, tenth-anniversary edition, second paperback edition edition, 1983. 2

  27. [27]

    Ego4d: Around the world in 3,000 hours of egocentric video

    Kristen Grauman, Andrew Westbury, Eugene Byrne, Zachary Chavis, Antonino Furnari, Rohit Girdhar, Jackson Hamburger, Hao Jiang, Miao Liu, Xingyu Liu, et al. Ego4d: Around the world in 3,000 hours of egocentric video. In CVPR, 2022. 2

  28. [28]

    A real-world webagent with planning, long context under- standing, and program synthesis

    Izzeddin Gur, Hiroki Furuta, Austin Huang, Mustafa Saf- dari, Yutaka Matsuo, Douglas Eck, and Aleksandra Faust. A real-world webagent with planning, long context under- standing, and program synthesis. In ICLR, 2024. 2

  29. [29]

    Videoespresso: A large-scale chain-of-thought dataset for fine-grained video reasoning via core frame selection

    Songhao Han, Wei Huang, Hairong Shi, Le Zhuo, Xiu Su, Shifeng Zhang, Xu Zhou, Xiaojuan Qi, Yue Liao, and Si Liu. Videoespresso: A large-scale chain-of-thought dataset for fine-grained video reasoning via core frame selection. arXiv preprint arXiv:2411.14794, 2024. 8

  30. [30]

    Masked autoencoders are scal- able vision learners

    Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, and Ross Girshick. Masked autoencoders are scal- able vision learners. In CVPR, 2022. 8

  31. [31]

    Mea- suring massive multitask language understanding

    Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Mea- suring massive multitask language understanding. In ICLR,

  32. [32]

    Can large language models explain themselves? a study of llm-generated self- explanations

    Shiyuan Huang, Siddarth Mamidanna, Shreedhar Jangam, Yilun Zhou, and Leilani H Gilpin. Can large language models explain themselves? a study of llm-generated self- explanations. arXiv preprint arXiv:2310.11207, 2023. 5

  33. [33]

    Language models as zero-shot planners: Ex- tracting actionable knowledge for embodied agents

    Wenlong Huang, Pieter Abbeel, Deepak Pathak, and Igor Mordatch. Language models as zero-shot planners: Ex- tracting actionable knowledge for embodied agents. In ICML, 2022. 2, 6

  34. [34]

    GPT-4o System Card

    Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perel- man, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Weli- hinda, Alan Hayes, Alec Radford, et al. Gpt-4o system card. arXiv preprint arXiv:2410.21276, 2024. 2, 4, 8

  35. [35]

    SWE-bench: Can language models resolve real-world github issues? In ICLR, 2024

    Carlos E Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik R Narasimhan. SWE-bench: Can language models resolve real-world github issues? In ICLR, 2024. 2, 6

  36. [36]

    Language models with rationality

    Nora Kassner, Oyvind Tafjord, Ashish Sabharwal, Kyle Richardson, Hinrich Schütze, and Peter Clark. Language models with rationality. In EMNLP, 2023. 2

  37. [37]

    Openvla: An open-source vision-language-action model

    Moo Jin Kim, Karl Pertsch, Siddharth Karamcheti, Ted Xiao, Ashwin Balakrishna, Suraj Nair, Rafael Rafailov, Ethan Foster, Grace Lam, Pannag Sanketi, et al. Openvla: An open-source vision-language-action model. In CoRL,

  38. [38]

    Large language models are zero-shot reasoners

    Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. Large language models are zero-shot reasoners. In NeurIPS, 2022. 7, 15

  39. [39]

    Seed-bench: Benchmarking multimodal large language models

    Bohao Li, Yuying Ge, Yixiao Ge, Guangzhi Wang, Rui Wang, Ruimao Zhang, and Ying Shan. Seed-bench: Benchmarking multimodal large language models. In CVPR,

  40. [40]

    LLaVA-OneVision: Easy Visual Task Transfer

    Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Yanwei Li, Ziwei Liu, and Chunyuan Li. Llava-onevision: Easy visual task transfer. arXiv preprint arXiv:2408.03326, 2024. 4, 8

  41. [41]

    Topviewrs: Vision-language models as top-view spatial reasoners

    Chengzu Li, Caiqi Zhang, Han Zhou, Nigel Collier, Anna Korhonen, and Ivan Vulić. Topviewrs: Vision-language models as top-view spatial reasoners. arXiv preprint arXiv:2406.02537, 2024. 8

  42. [42]

    Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models

    Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In ICML, 2023. 2

  43. [43]

    Mvbench: A comprehensive multi-modal video understanding benchmark

    Kunchang Li, Yali Wang, Yinan He, Yizhuo Li, Yi Wang, Yi Liu, Zun Wang, Jilan Xu, Guo Chen, Ping Luo, et al. Mvbench: A comprehensive multi-modal video understanding benchmark. In CVPR, 2024. 8

  44. [44]

    Vitatecs: A diagnostic dataset for temporal concept understanding of video-language models

    Shicheng Li, Lei Li, Shuhuai Ren, Yuanxin Liu, Yi Liu, Rundong Gao, Xu Sun, and Lu Hou. Vitatecs: A diagnostic dataset for temporal concept understanding of video-language models. arXiv preprint arXiv:2311.17404, 2023. 8

  45. [45]

    Vila: On pre-training for visual language models

    Ji Lin, Hongxu Yin, Wei Ping, Pavlo Molchanov, Mohammad Shoeybi, and Song Han. Vila: On pre-training for visual language models. In CVPR, 2024. 4

  46. [46]

    Microsoft coco: Common objects in context

    Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In ECCV, 2014. 4

  47. [47]

    Multi-modal situated reasoning in 3d scenes

    Xiongkun Linghu, Jiangyong Huang, Xuesong Niu, Xiaojian Shawn Ma, Baoxiong Jia, and Siyuan Huang. Multi-modal situated reasoning in 3d scenes. Advances in Neural Information Processing Systems, 37:140903–140936,

  48. [48]

    Coarse correspondence elicit 3d spacetime understanding in multimodal language model

    Benlin Liu, Yuhao Dong, Yiqin Wang, Yongming Rao, Yansong Tang, Wei-Chiu Ma, and Ranjay Krishna. Coarse correspondence elicit 3d spacetime understanding in multimodal language model. arXiv preprint arXiv:2408.00754,

  49. [49]

    Visual instruction tuning

    Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. NeurIPS, 2024. 2, 17

  50. [50]

    World Model on Million-Length Video And Language With Blockwise RingAttention

    Hao Liu, Wilson Yan, Matei Zaharia, and Pieter Abbeel. World model on million-length video and language with ringattention. arXiv preprint arXiv:2402.08268, 2024. 8

  51. [51]

    TempCompass: Do video LLMs really understand videos?

    Yuanxin Liu, Shicheng Li, Yi Liu, Yuxiang Wang, Shuhuai Ren, Lei Li, Sishuo Chen, Xu Sun, and Lu Hou. TempCompass: Do video LLMs really understand videos? In Findings of ACL, 2024. 8

  52. [52]

    Mmbench: Is your multi-modal model an all-around player?

    Yuan Liu, Haodong Duan, Yuanhan Zhang, Bo Li, Songyang Zhang, Wangbo Zhao, Yike Yuan, Jiaqi Wang, Conghui He, Ziwei Liu, et al. Mmbench: Is your multi-modal model an all-around player? In ECCV, 2025. 8

  53. [53]

    Faithful chain-of-thought reasoning

    Qing Lyu, Shreya Havaldar, Adam Stein, Li Zhang, Delip Rao, Eric Wong, Marianna Apidianaki, and Chris Callison-Burch. Faithful chain-of-thought reasoning. In ACL, 2023. 5

  54. [54]

    Openeqa: Embodied question answering in the era of foundation models

    Arjun Majumdar, Anurag Ajay, Xiaohan Zhang, Pranav Putta, Sriram Yenamandra, Mikael Henaff, Sneha Silwal, Paul Mcvay, Oleksandr Maksymets, Sergio Arnaud, et al. Openeqa: Embodied question answering in the era of foundation models. In CVPR, 2024. 8, 17

  55. [55]

    Egoschema: A diagnostic benchmark for very long-form video language understanding

    Karttikeya Mangalam, Raiymbek Akshulakov, and Jitendra Malik. Egoschema: A diagnostic benchmark for very long-form video language understanding. NeurIPS, 2023. 2, 8, 16

  56. [56]

    Julia McAfoose and Bernhard T. Baune. Exploring visual–spatial working memory: A critical review of concepts and models. Neuropsychology Review, 2009. 3

  57. [57]

    Individual differences in navigation: an introductory overview

    Chiara Meneghetti, Laura Miola, Tommaso Feraco, Veronica Muffato, and Tommaso Feraco Miola. Individual differences in navigation: an introductory overview. Prime archives in psychology, 2022. 2

  58. [58]

    Evaluating cognitive maps and planning in large language models with cogeval

    Ida Momennejad, Hosein Hasanbeig, Felipe Vieira Frujeri, Hiteshi Sharma, Nebojsa Jojic, Hamid Palangi, Robert Ness, and Jonathan Larson. Evaluating cognitive maps and planning in large language models with cogeval. NeurIPS,

  59. [59]

    Embodiedgpt: Vision-language pre-training via embodied chain of thought

    Yao Mu, Qinglong Zhang, Mengkang Hu, Wenhai Wang, Mingyu Ding, Jun Jin, Bin Wang, Jifeng Dai, Yu Qiao, and Ping Luo. Embodiedgpt: Vision-language pre-training via embodied chain of thought. NeurIPS, 2024. 8

  60. [60]

    The Hippocampus and Context Revisited

    Lynn Nadel. The Hippocampus and Context Revisited. Oxford University Press, 2008. 7

  61. [61]

    A Comprehensive Overview of Large Language Models

    Humza Naveed, Asad Ullah Khan, Shi Qiu, Muhammad Saqib, Saeed Anwar, Muhammad Usman, Naveed Akhtar, Nick Barnes, and Ajmal Mian. A comprehensive overview of large language models. arXiv preprint arXiv:2307.06435, 2023. 2

  62. [62]

    Spatial Cognition

    Nora S. Newcombe. Spatial Cognition. MIT Press, 2024. https://oecs.mit.edu/pub/or750iar. 2, 7

  63. [63]

    Video-bench: A comprehensive benchmark and toolkit for evaluating video-based large language models

    Munan Ning, Bin Zhu, Yujia Xie, Bin Lin, Jiaxi Cui, Lu Yuan, Dongdong Chen, and Li Yuan. Video-bench: A comprehensive benchmark and toolkit for evaluating video-based large language models. arXiv preprint arXiv:2311.16103, 2023. 8

  64. [64]

    Open X-Embodiment: Robotic Learning Datasets and RT-X Models

    Abby O’Neill, Abdul Rehman, Abhinav Gupta, Abhiram Maddukuri, Abhishek Gupta, Abhishek Padalkar, Abraham Lee, Acorn Pooley, Agrim Gupta, Ajay Mandlekar, et al. Open x-embodiment: Robotic learning datasets and rt-x models. arXiv preprint arXiv:2310.08864, 2023. 2, 8

  65. [65]

    Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy V. Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel HAZIZA, Francisco Massa, Alaaeldin El-Nouby, Mido Assran, Nicolas Ballas, Wojciech Galuba, Russell Howes, Po-Yao Huang, Shang-Wen Li, Ishan Misra, Michael Rabbat, Vasu Sharma, Gabriel Synnaeve, Hu Xu, Herve Jegou, Julien Mairal, Patric...

  66. [66]

    On measuring faithfulness or self-consistency of natural language explanations

    Letitia Parcalabescu and Anette Frank. On measuring faithfulness or self-consistency of natural language explanations. In ACL, 2024. 5

  67. [67]

    Improving language understanding by generative pre-training

    Alec Radford. Improving language understanding by generative pre-training. OpenAI Blog, 2018. 2, 8

  68. [68]

    Language models are unsupervised multitask learners

    Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9,

  69. [69]

    Learning transferable visual models from natural language supervision

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In ICML, 2021. 8

  70. [70]

    Does spatial cognition emerge in frontier models?

    Santhosh Kumar Ramakrishnan, Erik Wijmans, Philipp Kraehenbuehl, and Vladlen Koltun. Does spatial cognition emerge in frontier models? arXiv preprint arXiv:2410.06468, 2024. 8

  71. [71]

    "Why should I trust you?" Explaining the predictions of any classifier

    Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. "Why should I trust you?" Explaining the predictions of any classifier. In KDD, 2016. 5

  72. [72]

    Grounding natural language instructions: Can large language models capture spatial information?

    Julia Rozanova, Deborah Ferreira, Krishna Dubba, Weiwei Cheng, Dell Zhang, and Andre Freitas. Grounding natural language instructions: Can large language models capture spatial information? arXiv preprint arXiv:2109.08634,

  73. [73]

    Gerard Salton and Michael J. McGill. Introduction to Modern Information Retrieval. McGraw-Hill, Inc., USA, 1986. 4

  74. [74]

    Mental rotation: effects of dimensionality of objects and type of task

    Shenna Shepard and Douglas Metzler. Mental rotation: effects of dimensionality of objects and type of task. Journal of experimental psychology: Human perception and performance, 14(1):3, 1988. 2

  75. [75]

    Vipergpt: Visual inference via python execution for reasoning

    Dídac Surís, Sachit Menon, and Carl Vondrick. Vipergpt: Visual inference via python execution for reasoning. In ICCV, 2023. 6

  76. [76]

    Sparkle: Mastering basic spatial capabilities in vision language models elicits generalization to composite spatial reasoning

    Yihong Tang, Ao Qu, Zhaokai Wang, Dingyi Zhuang, Zhaofeng Wu, Wei Ma, Shenhao Wang, Yunhan Zheng, Zhan Zhao, and Jinhua Zhao. Sparkle: Mastering basic spatial capabilities in vision language models elicits generalization to composite spatial reasoning. arXiv preprint arXiv:2410.16162, 2024. 8

  77. [77]

    Gemini: A Family of Highly Capable Multimodal Models

    Gemini Team, Rohan Anil, Sebastian Borgeaud, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, Katie Millican, et al. Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805, 2023. 2, 8

  78. [78]

    Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context

    Gemini Team, Petko Georgiev, Ving Ian Lei, Ryan Burnell, Libin Bai, Anmol Gulati, Garrett Tanzer, Damien Vincent, Zhufeng Pan, Shibo Wang, et al. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context. arXiv preprint arXiv:2403.05530, 2024. 2, 4, 5, 8, 17

  79. [79]

    Drivevlm: The convergence of autonomous driving and large vision-language models

    Xiaoyu Tian, Junru Gu, Bailin Li, Yicheng Liu, Yang Wang, Zhiyong Zhao, Kun Zhan, Peng Jia, Xianpeng Lang, and Hang Zhao. Drivevlm: The convergence of autonomous driving and large vision-language models. In CoRL, 2024. 2

  80. [80]

    E. C. Tolman. Cognitive maps in rats and men. Psychological Review, 55(4):189–208, 1948. 2, 7
