pith. machine review for the scientific record.

arxiv: 2605.09449 · v1 · submitted 2026-05-10 · 💻 cs.CV

Recognition: 2 theorem links · Lean Theorem

SpaceMind++: Toward Allocentric Cognitive Maps for Spatially Grounded Video MLLMs

Authors on Pith: no claims yet

Pith reviewed 2026-05-12 04:41 UTC · model grok-4.3

classification 💻 cs.CV
keywords allocentric cognitive map · voxelized representation · video MLLM · spatial reasoning · 3D fusion · coordinate-guided fusion · object permanence · egocentric to allocentric

The pith

SpaceMind++ builds a voxelized allocentric cognitive map from RGB video and fuses it back into pretrained video MLLMs via coordinate-guided iterative fusion for consistent 3D spatial reasoning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Current video MLLMs process visual input in an egocentric, view-dependent way that fragments spatial information across frames. SpaceMind++ constructs an explicit voxelized cognitive map that reorganizes these observations into a shared, metric 3D world representation. This map preserves object permanence and spatial topology even when the camera viewpoint changes. A new Coordinate-Guided Deep Iterative Fusion step then injects the map-level knowledge back into the model's original 2D visual features using coordinate embeddings and 3D rotary positional encoding. The result is stronger spatial reasoning without breaking the pretrained model's native token interface.
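
To make the map-construction step concrete, here is a minimal sketch of how per-frame features lifted into world coordinates might be pooled into a voxel grid. The grid resolution, metric bounds, and mean-pooling rule are illustrative assumptions, not the paper's reported configuration.

```python
# Minimal sketch: pooling per-frame patch features into a voxelized
# allocentric map. Grid size, metric bounds, and mean pooling are
# illustrative assumptions, not SpaceMind++'s actual settings.
import torch

def build_voxel_map(points_world, feats, grid_size=32, bounds=(-4.0, 4.0)):
    """points_world: (N, 3) points in a shared world frame (e.g. from a
    geometry backbone such as VGGT or DUSt3R); feats: (N, C) patch
    features lifted from the 2D encoder. Returns a (grid_size^3, C) map."""
    lo, hi = bounds
    # Quantize metric coordinates into voxel indices.
    idx = ((points_world - lo) / (hi - lo) * grid_size).long()
    idx = idx.clamp(0, grid_size - 1)
    flat = (idx[:, 0] * grid_size + idx[:, 1]) * grid_size + idx[:, 2]

    C = feats.shape[1]
    vox_sum = torch.zeros(grid_size ** 3, C).index_add_(0, flat, feats)
    counts = torch.zeros(grid_size ** 3).index_add_(0, flat, torch.ones(len(flat)))
    # Mean-pool features that landed in the same voxel; empty voxels stay zero.
    return vox_sum / counts.clamp(min=1).unsqueeze(1)
```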

Core claim

SpaceMind++ explicitly builds a voxelized cognitive map from RGB videos that reorganizes fragmented egocentric observations into a shared 3D metric representation, enabling the model to preserve object permanence and spatial topology across changing viewpoints. It then relays this allocentric knowledge back into the pretrained video MLLM's 2D visual features through Coordinate-Guided Deep Iterative Fusion, guided by coordinate embeddings and 3D Rotary Positional Encoding.

What carries the argument

The voxelized allocentric cognitive map, which converts egocentric video frames into a persistent world-centered 3D metric memory, combined with Coordinate-Guided Deep Iterative Fusion that grounds semantic interactions in metric space using coordinate embeddings and 3D Rotary Positional Encoding.
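
For readers unfamiliar with 3D Rotary Positional Encoding, one standard construction (assumed here; the paper's exact factorization may differ) splits the channel dimension into three groups and applies ordinary 1D RoPE along each spatial axis:

```python
# Sketch of a 3D rotary positional encoding: split channels into three
# groups and rotate each group by one spatial axis. The per-axis split
# and frequency schedule are assumptions, not the paper's exact recipe.
import torch

def rope_1d(x, pos, base=10000.0):
    """x: (..., D) with D even; pos: (...,) coordinate along one axis."""
    D = x.shape[-1]
    freqs = base ** (-torch.arange(0, D, 2, dtype=x.dtype) / D)   # (D/2,)
    ang = pos.unsqueeze(-1) * freqs                               # (..., D/2)
    cos, sin = ang.cos(), ang.sin()
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

def rope_3d(x, xyz):
    """x: (N, D) token features with D divisible by 6; xyz: (N, 3) voxel
    or metric coordinates. Each third of the channels encodes one axis."""
    D = x.shape[-1]
    parts = [rope_1d(x[:, i * D // 3:(i + 1) * D // 3], xyz[:, i]) for i in range(3)]
    return torch.cat(parts, dim=-1)
```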

If this is right

  • Achieves new state-of-the-art performance on VSI-Bench for video spatial understanding tasks.
  • Demonstrates superior out-of-distribution generalization on SPBench, SITE-Bench, and SPAR-Bench in unseen 3D environments.
  • Maintains object permanence and spatial topology across viewpoint changes without altering the native visual-token interface.
  • Grounds semantic interactions in explicit metric 3D space through coordinate embeddings and 3D rotary positional encoding.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same fusion approach could support incremental map updates for longer or dynamic video sequences.
  • Explicit allocentric maps may transfer to embodied tasks such as robot navigation or 3D scene manipulation.
  • The coordinate-guided mechanism suggests a general way to inject metric structure into other pretrained multimodal models.

Load-bearing premise

A voxelized cognitive map extracted from RGB video can be fused back into a pretrained MLLM's 2D visual features via coordinate guidance without eroding the model's original visual or language capabilities or creating new spatial inconsistencies.
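
One concrete way to read this premise (a minimal sketch under assumed design choices, not the paper's implementation) is a fusion block whose update is gated and residual, so the pretrained visual tokens pass through unchanged at initialization:

```python
# Sketch: gated cross-attention fusion of map features into pretrained
# 2D visual tokens. Zero-initializing the gate preserves the original
# token stream exactly at step 0; this is an assumed design in the
# spirit of, not identical to, Coordinate-Guided Deep Iterative Fusion.
import torch
import torch.nn as nn

class GatedMapFusion(nn.Module):
    def __init__(self, dim, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.gate = nn.Parameter(torch.zeros(1))  # starts closed

    def forward(self, vis_tokens, map_tokens):
        """vis_tokens: (B, T, D) pretrained 2D visual features (queries);
        map_tokens: (B, M, D) voxel-map features, e.g. already carrying
        coordinate embeddings / 3D RoPE (keys and values)."""
        delta, _ = self.attn(vis_tokens, map_tokens, map_tokens)
        # Residual update: at init the gate is 0, so the output equals
        # vis_tokens, leaving the native visual-token interface intact.
        return vis_tokens + torch.tanh(self.gate) * delta
```

Under this reading, "not eroding the original capabilities" becomes a checkable property: with the gate closed the model is exactly the pretrained MLLM, and any degradation must be introduced as training opens the gate.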

What would settle it

A controlled video sequence with known 3D ground-truth trajectories where the model incorrectly reports object locations or relations after a large viewpoint shift despite having built the cognitive map.
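
As a sketch of such a probe (a hypothetical harness; the `model.locate` query API, the query format, and the 0.5 m tolerance are all assumptions, not anything the paper specifies):

```python
# Hypothetical probe for object permanence across a viewpoint shift.
# `model.locate(frames, obj)` is an assumed query API returning a
# predicted 3D position; the tolerance is an arbitrary choice.
import numpy as np

def permanence_probe(model, frames, gt_positions, obj, shift_idx, tol=0.5):
    """frames: full video; gt_positions: dict obj -> (3,) world coords;
    shift_idx: frame index where the large viewpoint change occurs."""
    before = np.asarray(model.locate(frames[:shift_idx], obj))
    after = np.asarray(model.locate(frames, obj))  # sees the full shift
    gt = np.asarray(gt_positions[obj])
    err_before = np.linalg.norm(before - gt)
    err_after = np.linalg.norm(after - gt)
    # The claim fails if the map was built yet localization degrades
    # sharply once the viewpoint changes.
    return {"err_before_m": err_before, "err_after_m": err_after,
            "consistent": err_after <= max(err_before, tol)}
```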

Figures

Figures reproduced from arXiv: 2605.09449 by Bo Gu, Lingyun Li, Zhenyuan Chen, Zhikang Zhang, Zhuoyi Song, Zizhuang Wei.

Figure 1
Figure 1: Left: biological motivation. Mammalian spatial cognition separates semantic identity and geometric localization through ventral (red area) and dorsal visual streams (red area), and integrates them into an allocentric cognitive map (cyan area) for spatial awareness and reasoning. Right: model architecture. SpaceMind++ extracts semantic and spatial features from video, organizes them into a voxelized allocen… view at source ↗
Figure 2
Figure 2: Detailed components of SpaceMind++. The map constructor transforms patch-level visual… view at source ↗
Figure 3
Figure 3: Statistics of the SpaceMind-900K datasets. Training data. We train SpaceMind++ on a spatial instruction-tuning corpus containing approximately 900k QA samples. The corpus integrates four 3D reasoning sources, including ViCA-322K [15], VLM-3R-data [13], SQA3D-train [35], and VSI-590K-Video [67]. As shown in… view at source ↗
Original abstract

Recent multimodal large language models (MLLMs) have made remarkable progress in visual understanding and language-based reasoning, yet they lack a persistent world-centered representation for spatially consistent reasoning in 3D environments. Inspired by the mammalian dual-stream system, where semantic and spatial cues are processed separately and integrated into an allocentric cognitive map, we propose SpaceMind++, a video MLLM architecture that explicitly builds a voxelized cognitive map from RGB videos. This map reorganizes fragmented egocentric observations into a shared 3D metric representation, enabling the model to preserve object permanence and spatial topology across changing viewpoints. To make this allocentric representation usable by a pretrained video MLLM without disrupting its native visual-token interface, we introduce Coordinate-Guided Deep Iterative Fusion, a new mechanism that relays map-level spatial knowledge back into the original 2D visual features. This fusion is explicitly guided by coordinate embeddings and 3D Rotary Positional Encoding, which ground semantic interactions in metric 3D space, resembling the entorhinal binding of sensory features to metric space. Extensive experiments show that SpaceMind++ achieves new state-of-the-art performance on VSI-Bench. Furthermore, it demonstrates superior out-of-distribution generalization on SPBench, SITE-Bench, and SPAR-Bench, underscoring its robustness in unseen 3D environments.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it: the pith above is the substance; this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes SpaceMind++, a video MLLM architecture that builds an explicit voxelized allocentric cognitive map from monocular RGB video to reorganize egocentric observations into a shared 3D metric representation. It introduces Coordinate-Guided Deep Iterative Fusion, which uses coordinate embeddings and 3D Rotary Positional Encoding to inject map-level spatial knowledge back into the pretrained model's native 2D visual features. The paper claims this yields new state-of-the-art results on VSI-Bench together with superior out-of-distribution generalization on SPBench, SITE-Bench, and SPAR-Bench.

Significance. If the fusion operator can be shown to preserve the original visual feature manifold and non-spatial reasoning capabilities while adding allocentric spatial consistency, the work would constitute a substantive advance toward spatially grounded video MLLMs. The neuroscience-inspired separation of semantic and spatial streams and the explicit construction of a persistent 3D metric map from video address a recognized limitation in current MLLMs; the coordinate-guided iterative fusion mechanism is a concrete technical contribution that could influence subsequent architectures.

major comments (2)
  1. [Abstract, §5] Abstract and §5 (Experiments): The abstract asserts new SOTA performance on VSI-Bench and superior OOD generalization on three additional benchmarks, yet the provided text contains no numerical results, baseline comparisons, ablation tables, or error analysis. Without these data it is impossible to determine the magnitude of the claimed gains or to isolate the contribution of the voxelized map from possible side-effects of the fusion operator.
  2. [§4.3] §4.3 (Coordinate-Guided Deep Iterative Fusion): The text states that the mechanism 'relays map-level spatial knowledge back into the original 2D visual features' without disrupting the native visual-token interface. No quantitative verification is supplied (e.g., before/after performance on non-spatial VQA tasks, embedding-distribution statistics, or attention-pattern comparisons) to confirm that iterative coordinate-guided updates leave the pretrained visual manifold unchanged. This assumption is load-bearing for the central claim that spatial improvements are obtained without collateral degradation.
minor comments (2)
  1. [Abstract] The abstract would be strengthened by the inclusion of one or two key quantitative results (e.g., absolute accuracy deltas on VSI-Bench) to support the SOTA claim.
  2. [§3, §4] Notation for the voxel grid resolution, coordinate embedding dimension, and number of iterative fusion steps should be introduced once in §3 or §4 and used consistently thereafter.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed review. The comments highlight important aspects of clarity and empirical support that we will address in the revision. Below we respond point-by-point to the major comments.

Point-by-point responses
  1. Referee: [Abstract, §5] Abstract and §5 (Experiments): The abstract asserts new SOTA performance on VSI-Bench and superior OOD generalization on three additional benchmarks, yet the provided text contains no numerical results, baseline comparisons, ablation tables, or error analysis. Without these data it is impossible to determine the magnitude of the claimed gains or to isolate the contribution of the voxelized map from possible side-effects of the fusion operator.

    Authors: We agree that the abstract and experiments section must contain concrete numerical evidence to substantiate the performance claims. The full manuscript includes detailed results, tables, and comparisons in §5, but we acknowledge that the current presentation does not foreground them sufficiently for immediate evaluation. In the revised version we will (i) insert key quantitative results (e.g., accuracy deltas on VSI-Bench and the three OOD benchmarks) directly into the abstract, (ii) ensure §5 opens with a consolidated main-results table that includes all baselines, and (iii) expand the ablation and error-analysis subsections to isolate the contribution of the voxelized map versus the fusion operator. These changes will make the magnitude of the gains and the source of improvements transparent. revision: yes

  2. Referee: [§4.3] §4.3 (Coordinate-Guided Deep Iterative Fusion): The text states that the mechanism 'relays map-level spatial knowledge back into the original 2D visual features' without disrupting the native visual-token interface. No quantitative verification is supplied (e.g., before/after performance on non-spatial VQA tasks, embedding-distribution statistics, or attention-pattern comparisons) to confirm that iterative coordinate-guided updates leave the pretrained visual manifold unchanged. This assumption is load-bearing for the central claim that spatial improvements are obtained without collateral degradation.

    Authors: We concur that explicit quantitative verification of manifold preservation is necessary to support the central claim. The current manuscript relies on architectural design arguments and indirect evidence from downstream spatial tasks, but does not report the requested controls. In the revision we will add a dedicated subsection under §4.3 (or a new appendix) that includes: (a) before/after accuracy on a suite of non-spatial VQA benchmarks, (b) cosine-similarity and distributional statistics (mean, variance, KL divergence) of visual embeddings before and after fusion, and (c) qualitative attention-map comparisons on representative frames. These measurements will directly test whether the coordinate-guided updates leave the pretrained visual manifold intact. revision: yes
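
For concreteness, the drift statistics promised in (b) could be computed along these lines; the specific metrics below (per-token cosine similarity, mean and variance shift) are an assumed minimal instantiation of what the rebuttal lists, not the authors' evaluation code:

```python
# Sketch of the promised embedding-drift controls: compare visual
# embeddings before vs. after fusion. The chosen statistics are an
# assumed minimal instantiation of what the rebuttal describes.
import torch
import torch.nn.functional as F

def embedding_drift_stats(before, after):
    """before, after: (N, D) visual embeddings pre/post fusion."""
    cos = F.cosine_similarity(before, after, dim=-1)  # per-token drift
    return {
        "cosine_mean": cos.mean().item(),
        "cosine_min": cos.min().item(),
        "mean_shift": (after.mean(0) - before.mean(0)).norm().item(),
        "var_ratio": (after.var(0) / before.var(0).clamp(min=1e-8)).mean().item(),
    }
```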

Circularity Check

0 steps flagged

No circularity: empirical architecture proposal with no reductive derivations

Full rationale

The paper proposes SpaceMind++ as an empirical architecture that constructs a voxelized allocentric cognitive map from monocular RGB video and integrates it via a new Coordinate-Guided Deep Iterative Fusion mechanism using coordinate embeddings and 3D RoPE. No equations, first-principles derivations, fitted parameters renamed as predictions, or self-citation chains appear in the provided text. Claims of SOTA performance and OOD generalization rest on benchmark experiments rather than any step that reduces by construction to the inputs. The central fusion step is presented as an explicit design choice, not a mathematical necessity derived from prior self-referential results. This matches the default expectation of a self-contained empirical contribution.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The proposal rests on the untested assumption that an explicit voxelized allocentric map can be constructed from monocular RGB video and successfully reinjected into a frozen pretrained MLLM. No free parameters or invented physical entities are described in the abstract.

axioms (1)
  • domain assumption: Semantic and spatial cues are processed separately in the mammalian dual-stream system and integrated into an allocentric cognitive map
    Explicitly stated as the biological inspiration for the architecture.
invented entities (1)
  • Voxelized cognitive map (no independent evidence)
    purpose: Reorganize fragmented egocentric observations into a shared 3D metric representation that preserves object permanence and spatial topology
    Core new representation introduced by the paper

pith-pipeline@v0.9.0 · 5558 in / 1295 out tokens · 33788 ms · 2026-05-12T04:41:14.568342+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

77 extracted references · 77 canonical work pages · 17 internal anchors

  1. [1]

    Flamingo: A visual language model for few-shot learning

    Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, and others. Flamingo: A visual language model for few-shot learning. In Advances in Neural Information Processing Systems, 2022

  2. [2]

    Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond

    Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. Qwen-VL: A frontier large vision-language model with versatile abilities. arXiv, abs/2308.12966, 2023

  3. [3]

    Qwen2.5-VL Technical Report

    Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, and others. Qwen2.5-VL technical report. arXiv, abs/2502.13923, 2025

  4. [4]

    Seed1.5-VL Technical Report

    ByteDance Seed et al. Seed1.5-VL technical report. arXiv, abs/2505.07062, 2025

  5. [5]

    SpatialBot: Precise spatial understanding with vision language models

    Wenxiao Cai, Iaroslav Ponomarenko, Jianhao Yuan, Xiaoqi Li, Wankou Yang, Hao Dong, and Bo Zhao. SpatialBot: Precise spatial understanding with vision language models. In IEEE International Conference on Robotics and Automation, 2025

  6. [6]

    Scaling spatial intelligence with multimodal foundation models

    Zhongang Cai, Ruisi Wang, Chenyang Gu, Fanyi Pu, Junxiang Xu, Yubo Wang, Wanqi Yin, Zhitao Yang, Chen Wei, Qingping Sun, Tongxi Zhou, Jiaqi Li, Hui En Pang, Oscar Qian, Yukun Wei, Zhiqian Lin, Xuanke Shi, Kewang Deng, Xiaoyang Han, Zukai Chen, Xiangyu Fan, Hanming Deng, Lewei Lu, Liang Pan, Bo Li, Ziwei Liu, Quan Wang, Dahua Lin, and Lei Yang. Scaling spatial intelligence with multimodal foundation models. arXiv, abs/2511.13719, 2025

  7. [7]

    Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling

    Zhe Chen, Weiyun Wang, Yue Cao, Yangzhou Liu, Zhangwei Gao, Erfei Cui, Jinguo Zhu, Shenglong Ye, Hao Tian, Zhaoyang Liu, Lixin Gu, Xuehui Wang, et al. Expanding performance boundaries of open-source multimodal models with model, data, and test-time scaling. arXiv, abs/2412.05271, 2024

  8. [8]

    InternVL: Scaling up vision foundation models and aligning for generic visual-linguistic tasks

    Zhe Chen, Jiannan Wu, Wenhai Wang, Weijie Su, Guo Chen, Sen Xing, Muyan Zhong, Qinglong Zhang, Xizhou Zhu, Lewei Lu, Bin Li, Ping Luo, Tong Lu, Yu Qiao, and Jifeng Dai. InternVL: Scaling up vision foundation models and aligning for generic visual-linguistic tasks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024

  9. [9]

    EgoMind: Activating Spatial Cognition through Linguistic Reasoning in MLLMs

    Zhenghao Chen, Huiqun Wang, and Di Huang. EgoMind: Activating spatial cognition through linguistic reasoning in MLLMs. arXiv, abs/2604.03318, 2026

  10. [10]

    SpatialRGPT: Grounded spatial reasoning in vision-language models

    An-Chieh Cheng, Hongxu Yin, Yang Fu, Qiushan Guo, Ruihan Yang, Jan Kautz, Xiaolong Wang, and Sifei Liu. SpatialRGPT: Grounded spatial reasoning in vision-language models. In Advances in Neural Information Processing Systems, 2024

  11. [11]

    Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

    Gheorghe Comanici, David Bieber, Mike Schaekermann, Panupong Pasupat, Noveen Sachdeva, Inderjit Dhillon, Michael Blistein, Ori Ram, Dan Zhang, Evan Rosen, et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities. arXiv, abs/2507.06261, 2025

  12. [12]

    The cognitive map in humans: Spatial navigation and beyond

    Russell A. Epstein, Eva Zita Patai, Joshua B. Julian, and Hugo J. Spiers. The cognitive map in humans: Spatial navigation and beyond. Nature Neuroscience, 20(11):1504–1513, 2017. doi: 10.1038/nn.4656

  13. [13]

    VLM-3r: Vision-language models augmented with instruction-aligned 3d reconstruction

    Zhiwen Fan, Jian Zhang, Renjie Li, Junge Zhang, Runjin Chen, Hezhen Hu, Kevin Wang, Huaizhi Qu, Dilin Wang, Zhicheng Yan, Hongyu Xu, Justin Theiss, Tianlong Chen, Jiachen Li, Zhengzhong Tu, Zhangyang Wang, and Rakesh Ranjan. VLM-3r: Vision-language models augmented with instruction-aligned 3d reconstruction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

  14. [14]

    Video-R1: Reinforcing Video Reasoning in MLLMs

    Kaituo Feng, Kaixiong Gong, Bohao Li, Zonghao Guo, Yibing Wang, Tianshuo Peng, Benyou Wang, and Xiangyu Yue. Video-R1: Reinforcing video reasoning in MLLMs. arXiv, abs/2503.21776, 2025

  15. [15]

    Towards visuospatial cognition via hierarchical fusion of visual experts

    Qing Feng. Towards visuospatial cognition via hierarchical fusion of semantic and spatial representations. arXiv, abs/2505.12363, 2025

  16. [16]

    Scene-LLM: Extending language model for 3d visual understanding and reasoning

    Rao Fu, Jingyu Liu, Xilun Chen, Yixin Nie, and Wenhan Xiong. Scene-LLM: Extending language model for 3d visual understanding and reasoning. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2025

  17. [17]

    Map2thought: Explicit 3d spatial reasoning via metric cognitive maps

    Xiangjun Gao, Zhensong Zhang, Dave Zhenyu Chen, Songcen Xu, Long Quan, Eduardo Pérez-Pellitero, and Youngkyoon Jang. Map2thought: Explicit 3d spatial reasoning via metric cognitive maps. arXiv, abs/2601.11442, 2026

  18. [18]

    Separate visual pathways for perception and action

    Melvyn A. Goodale and A. David Milner. Separate visual pathways for perception and action. Trends in Neurosciences, 15(1):20–25, 1992. doi: 10.1016/0166-2236(92)90344-8

  19. [19]

    Gemini 3 pro: The frontier of vision AI

    Google DeepMind. Gemini 3 pro: The frontier of vision AI, 2025. Accessed: 2026-03-21

  20. [20]

    Cognitive mapping and planning for visual navigation

    Saurabh Gupta, Varun Tolani, James Davidson, Sergey Levine, Rahul Sukthankar, and Jitendra Malik. Cognitive mapping and planning for visual navigation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2616–2625, 2017

  21. [21]

    Cog3dmap: Multi-view vision-language reasoning with 3d cognitive maps

    Chanyoung Gwak, Yoonwoo Jeong, Byungwoo Jeon, Hyunseok Lee, Jinwoo Shin, and Minsu Cho. Cog3dmap: Multi-view vision-language reasoning with 3d cognitive maps. arXiv, abs/2603.23023, 2026

  22. [22]

    Microstructure of a spatial map in the entorhinal cortex

    Torkel Hafting, Marianne Fyhn, Sturla Molden, May-Britt Moser, and Edvard I. Moser. Microstructure of a spatial map in the entorhinal cortex. Nature, 436(7052):801–806, 2005. doi: 10.1038/nature03721

  23. [23]

    3d-LLM: Injecting the 3d world into large language models

    Yining Hong, Haoyu Zhen, Peihao Chen, Shuhong Zheng, Yilun Du, Zhenfang Chen, and Chuang Gan. 3d-LLM: Injecting the 3d world into large language models. In Advances in Neural Information Processing Systems, 2023

  24. [24]

    LoRA: Low-Rank Adaptation of Large Language Models

    J. Edward Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, and Weizhu Chen. LoRA: Low-rank adaptation of large language models. arXiv, abs/2106.09685, 2021

  25. [25]

    3DLLM-Mem: Long-term spatial-temporal memory for embodied 3d large language model

    Wenbo Hu, Yining Hong, Yanjun Wang, Leison Gao, Zibu Wei, Xingcheng Yao, Nanyun Peng, Yonatan Bitton, Idan Szpektor, and Kai-Wei Chang. 3dllm-mem: Long-term spatial-temporal memory for embodied 3d large language model. arXiv, abs/2505.22657, 2025

  26. [26]

    Video2layout: Recall and reconstruct metric-grounded cognitive map for spatial reasoning

    Yibin Huang, Wang Xu, Wanyue Zhang, Helu Zhi, Jingjing Huang, Yangbin Xu, Yangang Sun, Conghui Zhu, and Tiejun Zhao. Video2layout: Recall and reconstruct metric-grounded cognitive map for spatial reasoning. arXiv, abs/2511.16160, 2025

  27. [27]

    The spatial semantic hierarchy

    Benjamin Kuipers. The spatial semantic hierarchy. Artificial Intelligence, 119(1–2):191–233, 2000

  28. [28]

    MASt3r: Grounding image matching in 3d

    Vincent Leroy, Yohann Cabon, and Jerome Revaud. MASt3r: Grounding image matching in 3d. arXiv, abs/2406.09756, 2024

  29. [29]

    LLaVA-OneVision: Easy Visual Task Transfer

    Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Peiyuan Li, Yanwei Liu, and Chunyuan Li. LLaVA-OneVision: Easy visual task transfer. arXiv, abs/2408.03326, 2024

  30. [30]

    SpatialLadder: Progressive training for spatial reasoning in vision-language models

    Hongxing Li, Dingming Li, Zixuan Wang, Yuchen Yan, Hang Wu, Wenqi Zhang, Yongliang Shen, Weiming Lu, Jun Xiao, and Yueting Zhuang. SpatialLadder: Progressive training for spatial reasoning in vision-language models. arXiv, abs/2510.08531, 2025

  31. [31]

    BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models

    Junnan Li, Dongxu Li, Silvio Savarese, and Steven C. H. Hoi. BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In International Conference on Machine Learning, volume 202, pages 19730–19742, 2023

  32. [32]

    Video-LLaVA: Learning United Visual Representation by Alignment Before Projection

    Bin Lin, Bin Zhu, Yang Ye, Munan Ning, Peng Jin, and Li Yuan. Video-LLaVA: Learning united visual representation by alignment before projection. arXiv, abs/2311.10122, 2024

  33. [33]

    LLaVA-NeXT: Improved reasoning, OCR, and world knowledge

    Haotian Liu, Chunyuan Li, Yuheng Li, Bo Li, Yuanhan Zhang, Sheng Shen, and Yong Jae Lee. LLaVA-NeXT: Improved reasoning, OCR, and world knowledge. https://llava-vl.github.io/blog/2024-01-30-llava-next/, 2024

  34. [34]

    Decoupled weight decay regularization

    Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In International Conference on Learning Representations, 2019

  35. [35]

    SQA3d: Situated question answering in 3d scenes

    Xiaojian Ma, Silong Yong, Zilong Zheng, Qing Li, Yitao Liang, Song-Chun Zhu, and Siyuan Huang. SQA3d: Situated question answering in 3d scenes. In International Conference on Learning Representations, 2023

  36. [36]

    SpatialLM: Training large language models for structured indoor modeling

    Yongsen Mao, Junhao Zhong, Chuan Fang, Jia Zheng, Rui Tang, Hao Zhu, Ping Tan, and Zihan Zhou. SpatialLM: Training large language models for structured indoor modeling. arXiv, abs/2506.07491, 2025

  37. [37]

    Kimi-VL Technical Report

    Moonshot AI et al. Kimi-VL technical report. arXiv, abs/2504.07491, 2025

  38. [38]

    The Hippocampus as a Cognitive Map

    John O'Keefe and Lynn Nadel. The Hippocampus as a Cognitive Map. Clarendon Press, Oxford, 1978

  39. [39]

    GPT-4 Technical Report

    OpenAI, Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, et al. GPT-4 technical report. arXiv, abs/2303.08774, 2023

  40. [40]

    GPT-4o System Card

    OpenAI, Aaron Hurst, Adam Lerer, Adam P. Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, et al. GPT-4o system card. arXiv, abs/2410.21276, 2024

  41. [41]

    SpaceR: Reinforcing MLLMs in video spatial reasoning

    Kun Ouyang, Yuanxin Liu, Haoning Wu, Yi Liu, Hao Zhou, Jie Zhou, Fandong Meng, and Xu Sun. SpaceR: Reinforcing MLLMs in video spatial reasoning. arXiv, abs/2504.01805, 2025

  42. [42]

    Learning transferable visual models from natural language supervision

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pages 8748–8763, 2021

  43. [43]

    Spatial view cells and the representation of place in the primate hippocampus

    Edmund T. Rolls. Spatial view cells and the representation of place in the primate hippocampus. Hippocampus, 9(4):467–480, 1999

  44. [44]

    Spatial view cells in the primate hippocampus

    Edmund T. Rolls, Richard G. Robertson, and Philippe Georges-Francois. Spatial view cells in the primate hippocampus. European Journal of Neuroscience, 9(8):1789–1794, 1997. doi: 10.1111/j.1460-9568.1997.tb01538.x

  45. [45]

    From reactive to cognitive: Brain-inspired spatial intelligence for embodied agents

    Shouwei Ruan, Liyuan Wang, Caixin Kang, Qihui Zhu, Songming Liu, Xingxing Wei, and Hang Su. From reactive to cognitive: Brain-inspired spatial intelligence for embodied agents. arXiv, 2025

  46. [46]

    Structure-from-motion revisited

    Johannes L. Schönberger and Jan-Michael Frahm. Structure-from-motion revisited. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4104–4113, 2016

  47. [47]

    Pixelwise view selection for unstructured multi-view stereo

    Johannes L. Schönberger, Enliang Zheng, Marc Pollefeys, and Jan-Michael Frahm. Pixelwise view selection for unstructured multi-view stereo. In European Conference on Computer Vision, pages 501–518, 2016

  48. [48]

    OpenAI GPT-5 System Card

    Aman Singh et al. OpenAI GPT-5 system card. arXiv, abs/2601.03267, 2026

  49. [49]

    Mind the gap: Benchmarking spatial reasoning in vision-language models

    Ilias Stogiannidis, Steven McDonagh, and Sotirios A. Tsaftaris. Mind the gap: Benchmarking spatial reasoning in vision-language models. arXiv, abs/2503.19707, 2025

  50. [50]

    RoFormer: Enhanced Transformer with Rotary Position Embedding

    Jianlin Su, Yu Lu, Shengfeng Pan, Ahmed Murtadha, Bo Wen, and Yunfeng Liu. RoFormer: Enhanced transformer with rotary position embedding. arXiv, abs/2104.09864, 2021. doi: 10.48550/arXiv.2104.09864

  51. [51]

    Cognitive maps in rats and men

    Edward C. Tolman. Cognitive maps in rats and men. Psychological Review, 55(4):189–208, 1948. doi: 10.1037/h0061626

  53. [53]

    Two cortical visual systems

    Leslie G. Ungerleider and Mortimer Mishkin. Two cortical visual systems. In David J. Ingle, Melvyn A. Goodale, and Richard J. W. Mansfield, editors, Analysis of Visual Behavior, pages 549–586. MIT Press, Cambridge, MA, 1982

  54. [54]

    PatchmatchNet: Learned multi-view patchmatch stereo

    Fangjinhua Wang, Silvano Galliani, Christoph Vogel, Pablo Speciale, and Marc Pollefeys. PatchmatchNet: Learned multi-view patchmatch stereo. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14194–14203, 2021

  55. [55]

    VGGT: Visual geometry grounded transformer

    Jianyuan Wang, Minghao Chen, Nikita Karaev, Andrea Vedaldi, Christian Rupprecht, and David Novotný. VGGT: Visual geometry grounded transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5294–5306, 2025

  56. [56]

    CUT3r: Continuous 3d perception model with persistent state

    Qianqian Wang, Yifei Zhang, Aleksander Holynski, Alexei A. Efros, and Angjoo Kanazawa. CUT3r: Continuous 3d perception model with persistent state. arXiv, abs/2501.12387, 2025

  57. [57]

    DUSt3r: Geometric 3d vision made easy

    Shuzhe Wang, Vincent Leroy, Yohann Cabon, Boris Chidlovskii, and Jerome Revaud. DUSt3r: Geometric 3d vision made easy. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 20697–20709, 2024

  58. [58]

    SITE: Towards spatial intelligence thorough evaluation

    Wenqi Wang, Reuben Tan, Pengyue Zhu, Jianwei Yang, Zhengyuan Yang, Lijuan Wang, Andrey Kolobov, Jianfeng Gao, and Boqing Gong. SITE: Towards spatial intelligence thorough evaluation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 2025

  59. [59]

    The Tolman-Eichenbaum machine: Unifying space and relational memory through generalization in the hippocampal formation

    James C. R. Whittington, Timothy H. Muller, Shirley Mark, Guifen Chen, Caswell Barry, Neil Burgess, and Timothy E. J. Behrens. The Tolman-Eichenbaum machine: Unifying space and relational memory through generalization in the hippocampal formation. Cell, 183(5):1249–1263, 2020. doi: 10.1016/j.cell.2020.10.024

  60. [60]

    Relating transformers to models and neural representations of the hippocampal formation

    James C. R. Whittington, Joseph Warren, and Timothy E. J. Behrens. Relating transformers to models and neural representations of the hippocampal formation. arXiv, abs/2112.04035, 2021

  61. [61]

    How to build a cognitive map: Insights from models of the hippocampal formation

    James C. R. Whittington, David McCaffary, Jacob J. W. Bakermans, and Timothy E. J. Behrens. How to build a cognitive map: Insights from models of the hippocampal formation. Nature Neuroscience, 25(10):1257–1272, 2022. doi: 10.1038/s41593-022-01153-y

  62. [62]

    Spatial-MLLM: Boosting MLLM capabilities in visual-based spatial intelligence

    Diankun Wu, Fangfu Liu, Yi-Hsin Hung, and Yueqi Duan. Spatial-MLLM: Boosting MLLM capabilities in visual-based spatial intelligence. In Advances in Neural Information Processing Systems, 2025

  63. [63]

    Reinforcing spatial reasoning in vision-language models with interwoven thinking and visual drawing

    Junfei Wu, Jian Guan, Kaituo Feng, Qiang Liu, Shu Wu, Liang Wang, Wei Wu, and Tieniu Tan. Reinforcing spatial reasoning in vision-language models with interwoven thinking and visual drawing. arXiv, abs/2506.09965, 2025

  64. [64]

    Grok 4 model card

    xAI. Grok 4 model card. https://data.x.ai/2025-08-20-grok-4-model-card.pdf. Accessed: 2026-05-06

  66. [66]

    Qwen3 Technical Report

    An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, Chujie Zheng, Dayiheng Liu, Fan Zhou, and others. Qwen3 technical report. arXiv, abs/2505.09388, 2025

  67. [67]

    Thinking in space: How multimodal large language models see, remember, and recall spaces

    Jihan Yang, Shusheng Yang, Anjali W. Gupta, Rilyn Han, Li Fei-Fei, and Saining Xie. Thinking in space: How multimodal large language models see, remember, and recall spaces. arXiv, abs/2412.14171, 2024

  68. [68]

    Visual spatial tuning

    Rui Yang, Ziyu Zhu, Yanwei Li, Jingjia Huang, Shen Yan, Siyuan Zhou, Zhe Liu, Xiangtai Li, Shuangye Li, Wenqian Wang, Yi Lin, and Hengshuang Zhao. Visual spatial tuning. arXiv, abs/2511.05491, 2025

  69. [69]

    Cambrian-S: Towards spatial supersensing in video

    Shusheng Yang, Jihan Yang, Pinzhi Huang, Ellis Brown, Zihao Yang, Yue Yu, Shengbang Tong, Zihan Zheng, Yifan Xu, Muhan Wang, Daohan Lu, Rob Fergus, Yann LeCun, Li Fei-Fei, and Saining Xie. Cambrian-s: Towards spatial supersensing in video. arXiv, abs/2511.04670, 2025

  70. [70]

    MVSNet: Depth inference for unstructured multi-view stereo

    Yao Yao, Zixin Luo, Shiwei Li, Tian Fang, and Long Quan. MVSNet: Depth inference for unstructured multi-view stereo. In European Conference on Computer Vision, pages 767–783, 2018

  71. [71]

    How far are VLMs from visual spatial intelligence? A benchmark-driven perspective

    Songsong Yu, Yuxin Chen, Hao Ju, Lianjie Jia, Fuxi Zhang, Shaofei Huang, Yuhan Wu, Rundi Cui, Binghao Ran, Zaibin Zhang, Zhedong Zheng, Zhipeng Zhang, Yifan Wang, Lin Song, Lijun Wang, Yanwei Li, Ying Shan, and Huchuan Lu. How far are VLMs from visual spatial intelligence? A benchmark-driven perspective. arXiv, abs/2509.18905, 2025

  72. [72]

    From flatland to space: Teaching vision-language models to perceive and reason in 3d

    Jiahui Zhang, Yurui Chen, Yanpeng Zhou, Yueming Xu, Ze Huang, Jilin Mei, Junhui Chen, Yu-Jie Yuan, Xinyue Cai, Guowei Huang, Xingyue Quan, Hang Xu, and Li Zhang. From flatland to space: Teaching vision-language models to perceive and reason in 3d. arXiv, abs/2503.22976, 2025

  73. [73]

    Theory of Space: Can Foundation Models Construct Spatial Beliefs through Active Exploration?

    Pingyue Zhang, Zihan Huang, Yue Wang, Jieyu Zhang, Letian Xue, Zihan Wang, Qineng Wang, Keshigeyan Chandrasegaran, Ruohan Zhang, Yejin Choi, Ranjay Krishna, Jiajun Wu, Li Fei-Fei, and Manling Li. Theory of space: Can foundation models construct spatial beliefs through active exploration? In ICLR, 2026. arXiv, abs/2602.07055

  74. [74]

    SpaceMind: Camera-guided modality fusion for spatial reasoning in vision-language models

    Ruosen Zhao, Zhikang Zhang, Jialei Xu, Jiahao Chang, Dong Chen, Lingyun Li, Weijian Sun, and Zizhuang Wei. SpaceMind: Camera-guided modality fusion for spatial reasoning in vision-language models. arXiv, abs/2511.23075, 2025

  75. [75]

    RoboRefer: Towards spatial referring with reasoning in vision-language models for robotics

    Enshen Zhou, Jingkun An, Cheng Chi, Yi Han, Shanyu Rong, Chi Zhang, Pengwei Wang, Zhongyuan Wang, Tiejun Huang, Lu Sheng, and Shanghang Zhang. RoboRefer: Towards spatial referring with reasoning in vision-language models for robotics. arXiv, abs/2506.04308, 2025

  76. [76]

    LLaVA-3d: A simple yet effective pathway to empowering LMMs with 3d awareness

    Chenming Zhu, Tai Wang, Wenwei Zhang, Jiangmiao Pang, and Xihui Liu. LLaVA-3d: A simple yet effective pathway to empowering LMMs with 3d awareness. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 2025

  77. [77]

    InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models

    Jinguo Zhu, Weiyun Wang, Zhe Chen, Zhaoyang Liu, Shenglong Ye, Lixin Gu, Hao Tian, Yuchen Duan, Weijie Su, Jie Shao, Zhangwei Gao, Erfei Cui, Xuehui Wang, Yue Cao, Yangzhou Liu, et al. InternVL3: Exploring advanced training and test-time recipes for open-source multimodal models. arXiv, abs/2504.10479, 2025