pith. machine review for the scientific record.

arxiv: 2605.09449 · v1 · submitted 2026-05-10 · 💻 cs.CV

Recognition: 2 theorem links · Lean Theorem

SpaceMind++: Toward Allocentric Cognitive Maps for Spatially Grounded Video MLLMs

Authors on Pith: no claims yet

Pith reviewed 2026-05-12 04:41 UTC · model grok-4.3

classification 💻 cs.CV
keywords allocentric cognitive map · voxelized representation · video MLLM · spatial reasoning · 3D fusion · coordinate-guided fusion · object permanence · egocentric to allocentric

The pith

SpaceMind++ builds a voxelized allocentric cognitive map from RGB video and fuses it back into pretrained video MLLMs via coordinate-guided iterative fusion for consistent 3D spatial reasoning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Current video MLLMs process visual input in an egocentric, view-dependent way that fragments spatial information across frames. SpaceMind++ constructs an explicit voxelized cognitive map that reorganizes these observations into a shared, metric 3D world representation. This map preserves object permanence and spatial topology even when the camera viewpoint changes. A new Coordinate-Guided Deep Iterative Fusion step then injects the map-level knowledge back into the model's original 2D visual features using coordinate embeddings and 3D rotary positional encoding. The result is stronger spatial reasoning without breaking the pretrained model's native token interface.
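
To make the map-construction step concrete, here is a minimal sketch of how per-frame features lifted into world coordinates might be pooled into a voxel grid. The grid resolution, metric bounds, and mean-pooling rule are illustrative assumptions, not the paper's reported configuration.

```python
# Minimal sketch: pooling per-frame patch features into a voxelized
# allocentric map. Grid size, metric bounds, and mean pooling are
# illustrative assumptions, not SpaceMind++'s actual settings.
import torch

def build_voxel_map(points_world, feats, grid_size=32, bounds=(-4.0, 4.0)):
    """points_world: (N, 3) points in a shared world frame (e.g. from a
    geometry backbone such as VGGT or DUSt3R); feats: (N, C) patch
    features lifted from the 2D encoder. Returns a (grid_size^3, C) map."""
    lo, hi = bounds
    # Quantize metric coordinates into voxel indices.
    idx = ((points_world - lo) / (hi - lo) * grid_size).long()
    idx = idx.clamp(0, grid_size - 1)
    flat = (idx[:, 0] * grid_size + idx[:, 1]) * grid_size + idx[:, 2]

    C = feats.shape[1]
    vox_sum = torch.zeros(grid_size ** 3, C).index_add_(0, flat, feats)
    counts = torch.zeros(grid_size ** 3).index_add_(0, flat, torch.ones(len(flat)))
    # Mean-pool features that landed in the same voxel; empty voxels stay zero.
    return vox_sum / counts.clamp(min=1).unsqueeze(1)
```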

Core claim

SpaceMind++ explicitly builds a voxelized cognitive map from RGB videos that reorganizes fragmented egocentric observations into a shared 3D metric representation, enabling the model to preserve object permanence and spatial topology across changing viewpoints. It then relays this allocentric knowledge back into the pretrained video MLLM's 2D visual features through Coordinate-Guided Deep Iterative Fusion, guided by coordinate embeddings and 3D Rotary Positional Encoding.

What carries the argument

The voxelized allocentric cognitive map, which converts egocentric video frames into a persistent world-centered 3D metric memory, combined with Coordinate-Guided Deep Iterative Fusion that grounds semantic interactions in metric space using coordinate embeddings and 3D Rotary Positional Encoding.
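
For readers unfamiliar with 3D Rotary Positional Encoding, one standard construction (assumed here; the paper's exact factorization may differ) splits the channel dimension into three groups and applies ordinary 1D RoPE along each spatial axis:

```python
# Sketch of a 3D rotary positional encoding: split channels into three
# groups and rotate each group by one spatial axis. The per-axis split
# and frequency schedule are assumptions, not the paper's exact recipe.
import torch

def rope_1d(x, pos, base=10000.0):
    """x: (..., D) with D even; pos: (...,) coordinate along one axis."""
    D = x.shape[-1]
    freqs = base ** (-torch.arange(0, D, 2, dtype=x.dtype) / D)   # (D/2,)
    ang = pos.unsqueeze(-1) * freqs                               # (..., D/2)
    cos, sin = ang.cos(), ang.sin()
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

def rope_3d(x, xyz):
    """x: (N, D) token features with D divisible by 6; xyz: (N, 3) voxel
    or metric coordinates. Each third of the channels encodes one axis."""
    D = x.shape[-1]
    parts = [rope_1d(x[:, i * D // 3:(i + 1) * D // 3], xyz[:, i]) for i in range(3)]
    return torch.cat(parts, dim=-1)
```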

If this is right

  • Achieves new state-of-the-art performance on VSI-Bench for video spatial understanding tasks.
  • Demonstrates superior out-of-distribution generalization on SPBench, SITE-Bench, and SPAR-Bench in unseen 3D environments.
  • Maintains object permanence and spatial topology across viewpoint changes without altering the native visual-token interface.
  • Grounds semantic interactions in explicit metric 3D space through coordinate embeddings and 3D rotary positional encoding.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same fusion approach could support incremental map updates for longer or dynamic video sequences.
  • Explicit allocentric maps may transfer to embodied tasks such as robot navigation or 3D scene manipulation.
  • The coordinate-guided mechanism suggests a general way to inject metric structure into other pretrained multimodal models.

Load-bearing premise

A voxelized cognitive map extracted from RGB video can be fused back into a pretrained MLLM's 2D visual features via coordinate guidance without eroding the model's original visual or language capabilities or creating new spatial inconsistencies.
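
One concrete way to read this premise (a minimal sketch under assumed design choices, not the paper's implementation) is a fusion block whose update is gated and residual, so the pretrained visual tokens pass through unchanged at initialization:

```python
# Sketch: gated cross-attention fusion of map features into pretrained
# 2D visual tokens. Zero-initializing the gate preserves the original
# token stream exactly at step 0; this is an assumed design in the
# spirit of, not identical to, Coordinate-Guided Deep Iterative Fusion.
import torch
import torch.nn as nn

class GatedMapFusion(nn.Module):
    def __init__(self, dim, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.gate = nn.Parameter(torch.zeros(1))  # starts closed

    def forward(self, vis_tokens, map_tokens):
        """vis_tokens: (B, T, D) pretrained 2D visual features (queries);
        map_tokens: (B, M, D) voxel-map features, e.g. already carrying
        coordinate embeddings / 3D RoPE (keys and values)."""
        delta, _ = self.attn(vis_tokens, map_tokens, map_tokens)
        # Residual update: at init the gate is 0, so the output equals
        # vis_tokens, leaving the native visual-token interface intact.
        return vis_tokens + torch.tanh(self.gate) * delta
```

Under this reading, "not eroding the original capabilities" becomes a checkable property: with the gate closed the model is exactly the pretrained MLLM, and any degradation must be introduced as training opens the gate.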

What would settle it

A controlled video sequence with known 3D ground-truth trajectories where the model incorrectly reports object locations or relations after a large viewpoint shift despite having built the cognitive map.
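
As a sketch of such a probe (a hypothetical harness; the `model.locate` query API, the query format, and the 0.5 m tolerance are all assumptions, not anything the paper specifies):

```python
# Hypothetical probe for object permanence across a viewpoint shift.
# `model.locate(frames, obj)` is an assumed query API returning a
# predicted 3D position; the tolerance is an arbitrary choice.
import numpy as np

def permanence_probe(model, frames, gt_positions, obj, shift_idx, tol=0.5):
    """frames: full video; gt_positions: dict obj -> (3,) world coords;
    shift_idx: frame index where the large viewpoint change occurs."""
    before = np.asarray(model.locate(frames[:shift_idx], obj))
    after = np.asarray(model.locate(frames, obj))  # sees the full shift
    gt = np.asarray(gt_positions[obj])
    err_before = np.linalg.norm(before - gt)
    err_after = np.linalg.norm(after - gt)
    # The claim fails if the map was built yet localization degrades
    # sharply once the viewpoint changes.
    return {"err_before_m": err_before, "err_after_m": err_after,
            "consistent": err_after <= max(err_before, tol)}
```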

Figures

Figures reproduced from arXiv: 2605.09449 by Bo Gu, Lingyun Li, Zhenyuan Chen, Zhikang Zhang, Zhuoyi Song, Zizhuang Wei.

Figure 1
Figure 1: Left: biological motivation. Mammalian spatial cognition separates semantic identity and geometric localization through ventral (red area) and dorsal visual streams (red area), and integrates them into an allocentric cognitive map (cyan area) for spatial awareness and reasoning. Right: model architecture. SpaceMind++ extracts semantic and spatial features from video, organizes them into a voxelized allocen… view at source ↗
Figure 2
Figure 2: Detailed components of SpaceMind++. The map constructor transforms patch-level visual… view at source ↗
Figure 3
Figure 3: Statistics of the SpaceMind-900K datasets. Training data. We train SpaceMind++ on a spatial instruction-tuning corpus containing approximately 900k QA samples. The corpus integrates four 3D reasoning sources, including ViCA-322K [15], VLM-3R-data [13], SQA3D-train [35], and VSI-590K-Video [67]. As shown in… view at source ↗
Original abstract

Recent multimodal large language models (MLLMs) have made remarkable progress in visual understanding and language-based reasoning, yet they lack a persistent world-centered representation for spatially consistent reasoning in 3D environments. Inspired by the mammalian dual-stream system, where semantic and spatial cues are processed separately and integrated into an allocentric cognitive map, we propose SpaceMind++, a video MLLM architecture that explicitly builds a voxelized cognitive map from RGB videos. This map reorganizes fragmented egocentric observations into a shared 3D metric representation, enabling the model to preserve object permanence and spatial topology across changing viewpoints. To make this allocentric representation usable by a pretrained video MLLM without disrupting its native visual-token interface, we introduce Coordinate-Guided Deep Iterative Fusion, a new mechanism that relays map-level spatial knowledge back into the original 2D visual features. This fusion is explicitly guided by coordinate embeddings and 3D Rotary Positional Encoding, which ground semantic interactions in metric 3D space, resembling the entorhinal binding of sensory features to metric space. Extensive experiments show that SpaceMind++ achieves new state-of-the-art performance on VSI-Bench. Furthermore, it demonstrates superior out-of-distribution generalization on SPBench, SITE-Bench, and SPAR-Bench, underscoring its robustness in unseen 3D environments.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it: the pith above is the substance; this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes SpaceMind++, a video MLLM architecture that builds an explicit voxelized allocentric cognitive map from monocular RGB video to reorganize egocentric observations into a shared 3D metric representation. It introduces Coordinate-Guided Deep Iterative Fusion, which uses coordinate embeddings and 3D Rotary Positional Encoding to inject map-level spatial knowledge back into the pretrained model's native 2D visual features. The paper claims this yields new state-of-the-art results on VSI-Bench together with superior out-of-distribution generalization on SPBench, SITE-Bench, and SPAR-Bench.

Significance. If the fusion operator can be shown to preserve the original visual feature manifold and non-spatial reasoning capabilities while adding allocentric spatial consistency, the work would constitute a substantive advance toward spatially grounded video MLLMs. The neuroscience-inspired separation of semantic and spatial streams and the explicit construction of a persistent 3D metric map from video address a recognized limitation in current MLLMs; the coordinate-guided iterative fusion mechanism is a concrete technical contribution that could influence subsequent architectures.

major comments (2)
  1. [Abstract, §5] Abstract and §5 (Experiments): The abstract asserts new SOTA performance on VSI-Bench and superior OOD generalization on three additional benchmarks, yet the provided text contains no numerical results, baseline comparisons, ablation tables, or error analysis. Without these data it is impossible to determine the magnitude of the claimed gains or to isolate the contribution of the voxelized map from possible side-effects of the fusion operator.
  2. [§4.3] §4.3 (Coordinate-Guided Deep Iterative Fusion): The text states that the mechanism 'relays map-level spatial knowledge back into the original 2D visual features' without disrupting the native visual-token interface. No quantitative verification is supplied (e.g., before/after performance on non-spatial VQA tasks, embedding-distribution statistics, or attention-pattern comparisons) to confirm that iterative coordinate-guided updates leave the pretrained visual manifold unchanged. This assumption is load-bearing for the central claim that spatial improvements are obtained without collateral degradation.
minor comments (2)
  1. [Abstract] The abstract would be strengthened by the inclusion of one or two key quantitative results (e.g., absolute accuracy deltas on VSI-Bench) to support the SOTA claim.
  2. [§3, §4] Notation for the voxel grid resolution, coordinate embedding dimension, and number of iterative fusion steps should be introduced once in §3 or §4 and used consistently thereafter.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed review. The comments highlight important aspects of clarity and empirical support that we will address in the revision. Below we respond point-by-point to the major comments.

Point-by-point responses
  1. Referee: [Abstract, §5] Abstract and §5 (Experiments): The abstract asserts new SOTA performance on VSI-Bench and superior OOD generalization on three additional benchmarks, yet the provided text contains no numerical results, baseline comparisons, ablation tables, or error analysis. Without these data it is impossible to determine the magnitude of the claimed gains or to isolate the contribution of the voxelized map from possible side-effects of the fusion operator.

    Authors: We agree that the abstract and experiments section must contain concrete numerical evidence to substantiate the performance claims. The full manuscript includes detailed results, tables, and comparisons in §5, but we acknowledge that the current presentation does not foreground them sufficiently for immediate evaluation. In the revised version we will (i) insert key quantitative results (e.g., accuracy deltas on VSI-Bench and the three OOD benchmarks) directly into the abstract, (ii) ensure §5 opens with a consolidated main-results table that includes all baselines, and (iii) expand the ablation and error-analysis subsections to isolate the contribution of the voxelized map versus the fusion operator. These changes will make the magnitude of the gains and the source of improvements transparent. revision: yes

  2. Referee: [§4.3] §4.3 (Coordinate-Guided Deep Iterative Fusion): The text states that the mechanism 'relays map-level spatial knowledge back into the original 2D visual features' without disrupting the native visual-token interface. No quantitative verification is supplied (e.g., before/after performance on non-spatial VQA tasks, embedding-distribution statistics, or attention-pattern comparisons) to confirm that iterative coordinate-guided updates leave the pretrained visual manifold unchanged. This assumption is load-bearing for the central claim that spatial improvements are obtained without collateral degradation.

    Authors: We concur that explicit quantitative verification of manifold preservation is necessary to support the central claim. The current manuscript relies on architectural design arguments and indirect evidence from downstream spatial tasks, but does not report the requested controls. In the revision we will add a dedicated subsection under §4.3 (or a new appendix) that includes: (a) before/after accuracy on a suite of non-spatial VQA benchmarks, (b) cosine-similarity and distributional statistics (mean, variance, KL divergence) of visual embeddings before and after fusion, and (c) qualitative attention-map comparisons on representative frames. These measurements will directly test whether the coordinate-guided updates leave the pretrained visual manifold intact. revision: yes
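
For concreteness, the drift statistics promised in (b) could be computed along these lines; the specific metrics below (per-token cosine similarity, mean and variance shift) are an assumed minimal instantiation of what the rebuttal lists, not the authors' evaluation code:

```python
# Sketch of the promised embedding-drift controls: compare visual
# embeddings before vs. after fusion. The chosen statistics are an
# assumed minimal instantiation of what the rebuttal describes.
import torch
import torch.nn.functional as F

def embedding_drift_stats(before, after):
    """before, after: (N, D) visual embeddings pre/post fusion."""
    cos = F.cosine_similarity(before, after, dim=-1)  # per-token drift
    return {
        "cosine_mean": cos.mean().item(),
        "cosine_min": cos.min().item(),
        "mean_shift": (after.mean(0) - before.mean(0)).norm().item(),
        "var_ratio": (after.var(0) / before.var(0).clamp(min=1e-8)).mean().item(),
    }
```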

Circularity Check

0 steps flagged

No circularity: empirical architecture proposal with no reductive derivations

Full rationale

The paper proposes SpaceMind++ as an empirical architecture that constructs a voxelized allocentric cognitive map from monocular RGB video and integrates it via a new Coordinate-Guided Deep Iterative Fusion mechanism using coordinate embeddings and 3D RoPE. No equations, first-principles derivations, fitted parameters renamed as predictions, or self-citation chains appear in the provided text. Claims of SOTA performance and OOD generalization rest on benchmark experiments rather than any step that reduces by construction to the inputs. The central fusion step is presented as an explicit design choice, not a mathematical necessity derived from prior self-referential results. This matches the default expectation of a self-contained empirical contribution.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The proposal rests on the untested assumption that an explicit voxelized allocentric map can be constructed from monocular RGB video and successfully reinjected into a frozen pretrained MLLM. No free parameters or invented physical entities are described in the abstract.

axioms (1)
  • domain assumption: Semantic and spatial cues are processed separately in the mammalian dual-stream system and integrated into an allocentric cognitive map
    Explicitly stated as the biological inspiration for the architecture.
invented entities (1)
  • Voxelized cognitive map (no independent evidence)
    purpose: Reorganize fragmented egocentric observations into a shared 3D metric representation that preserves object permanence and spatial topology
    Core new representation introduced by the paper

pith-pipeline@v0.9.0 · 5558 in / 1295 out tokens · 33788 ms · 2026-05-12T04:41:14.568342+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

77 extracted references · 77 canonical work pages · 17 internal anchors

  1. [1]

    Flamingo: A visual language model for few-shot learning

    Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, and others. Flamingo: A visual language model for few-shot learning. In Advances in Neural Information Processing Systems, 2022

  2. [2]

    Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond

    Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. Qwen-VL: A frontier large vision-language model with versatile abilities. arXiv, abs/2308.12966, 2023

  3. [3]

    Qwen2.5-VL Technical Report

    Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, and others. Qwen2.5-VL technical report. arXiv, abs/2502.13923, 2025

  4. [4]

    Seed1.5-VL Technical Report

    ByteDance Seed et al. Seed1.5-VL technical report. arXiv, abs/2505.07062, 2025

  5. [5]

    SpatialBot: Precise spatial understanding with vision language models

    Wenxiao Cai, Iaroslav Ponomarenko, Jianhao Yuan, Xiaoqi Li, Wankou Yang, Hao Dong, and Bo Zhao. SpatialBot: Precise spatial understanding with vision language models. In IEEE International Conference on Robotics and Automation, 2025

  6. [6]

    Scaling spatial intelligence with multimodal foundation models

    Zhongang Cai, Ruisi Wang, Chenyang Gu, Fanyi Pu, Junxiang Xu, Yubo Wang, Wanqi Yin, Zhitao Yang, Chen Wei, Qingping Sun, Tongxi Zhou, Jiaqi Li, Hui En Pang, Oscar Qian, Yukun Wei, Zhiqian Lin, Xuanke Shi, Kewang Deng, Xiaoyang Han, Zukai Chen, Xiangyu Fan, Hanming Deng, Lewei Lu, Liang Pan, Bo Li, Ziwei Liu, Quan Wang, Dahua Lin, and Lei Yang. Scaling spatial intelligence with multimodal foundation models. arXiv, abs/2511.13719, 2025

  7. [7]

    Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling

    Zhe Chen, Weiyun Wang, Yue Cao, Yangzhou Liu, Zhangwei Gao, Erfei Cui, Jinguo Zhu, Shenglong Ye, Hao Tian, Zhaoyang Liu, Lixin Gu, Xuehui Wang, et al. Expanding performance boundaries of open-source multimodal models with model, data, and test-time scaling. arXiv, abs/2412.05271, 2024

  8. [8]

    InternVL: Scaling up vision foundation models and aligning for generic visual-linguistic tasks

    Zhe Chen, Jiannan Wu, Wenhai Wang, Weijie Su, Guo Chen, Sen Xing, Muyan Zhong, Qinglong Zhang, Xizhou Zhu, Lewei Lu, Bin Li, Ping Luo, Tong Lu, Yu Qiao, and Jifeng Dai. InternVL: Scaling up vision foundation models and aligning for generic visual-linguistic tasks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024

  9. [9]

    EgoMind: Activating Spatial Cognition through Linguistic Reasoning in MLLMs

    Zhenghao Chen, Huiqun Wang, and Di Huang. EgoMind: Activating spatial cognition through linguistic reasoning in MLLMs. arXiv, abs/2604.03318, 2026

  10. [10]

    SpatialRGPT: Grounded spatial reasoning in vision-language models

    An-Chieh Cheng, Hongxu Yin, Yang Fu, Qiushan Guo, Ruihan Yang, Jan Kautz, Xiaolong Wang, and Sifei Liu. SpatialRGPT: Grounded spatial reasoning in vision-language models. In Advances in Neural Information Processing Systems, 2024

  11. [11]

    Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

    Gheorghe Comanici, David Bieber, Mike Schaekermann, Panupong Pasupat, Noveen Sachdeva, Inderjit Dhillon, Michael Blistein, Ori Ram, Dan Zhang, Evan Rosen, et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities. arXiv, abs/2507.06261, 2025

  12. [12]

    The cognitive map in humans: Spatial navigation and beyond

    Russell A. Epstein, Eva Zita Patai, Joshua B. Julian, and Hugo J. Spiers. The cognitive map in humans: Spatial navigation and beyond. Nature Neuroscience, 20(11):1504–1513, 2017. doi: 10.1038/nn.4656

  13. [13]

    VLM-3r: Vision-language models augmented with instruction-aligned 3d reconstruction

    Zhiwen Fan, Jian Zhang, Renjie Li, Junge Zhang, Runjin Chen, Hezhen Hu, Kevin Wang, Huaizhi Qu, Dilin Wang, Zhicheng Yan, Hongyu Xu, Justin Theiss, Tianlong Chen, Jiachen Li, Zhengzhong Tu, Zhangyang Wang, and Rakesh Ranjan. VLM-3r: Vision-language models augmented with instruction-aligned 3d reconstruction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

  14. [14]

    Video-R1: Reinforcing Video Reasoning in MLLMs

    Kaituo Feng, Kaixiong Gong, Bohao Li, Zonghao Guo, Yibing Wang, Tianshuo Peng, Benyou Wang, and Xiangyu Yue. Video-R1: Reinforcing video reasoning in MLLMs. arXiv, abs/2503.21776, 2025

  15. [15]

    Towards visuospatial cognition via hierarchical fusion of visual experts

    Qing Feng. Towards visuospatial cognition via hierarchical fusion of semantic and spatial representations. arXiv, abs/2505.12363, 2025

  16. [16]

    Scene-LLM: Extending language model for 3d visual understanding and reasoning

    Rao Fu, Jingyu Liu, Xilun Chen, Yixin Nie, and Wenhan Xiong. Scene-LLM: Extending language model for 3d visual understanding and reasoning. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2025

  17. [17]

    Map2thought: Explicit 3d spatial reasoning via metric cognitive maps

    Xiangjun Gao, Zhensong Zhang, Dave Zhenyu Chen, Songcen Xu, Long Quan, Eduardo Pérez-Pellitero, and Youngkyoon Jang. Map2thought: Explicit 3d spatial reasoning via metric cognitive maps. arXiv, abs/2601.11442, 2026

  18. [18]

    Separate visual pathways for perception and action

    Melvyn A. Goodale and A. David Milner. Separate visual pathways for perception and action. Trends in Neurosciences, 15(1):20–25, 1992. doi: 10.1016/0166-2236(92)90344-8

  19. [19]

    Gemini 3 pro: The frontier of vision AI

    Google DeepMind. Gemini 3 pro: The frontier of vision AI, 2025. Accessed: 2026-03-21

  20. [20]

    Cognitive mapping and planning for visual navigation

    Saurabh Gupta, Varun Tolani, James Davidson, Sergey Levine, Rahul Sukthankar, and Jitendra Malik. Cognitive mapping and planning for visual navigation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2616–2625, 2017

  21. [21]

    Cog3dmap: Multi-view vision-language reasoning with 3d cognitive maps

    Chanyoung Gwak, Yoonwoo Jeong, Byungwoo Jeon, Hyunseok Lee, Jinwoo Shin, and Minsu Cho. Cog3dmap: Multi-view vision-language reasoning with 3d cognitive maps. arXiv, abs/2603.23023, 2026

  22. [22]

    Microstructure of a spatial map in the entorhinal cortex

    Torkel Hafting, Marianne Fyhn, Sturla Molden, May-Britt Moser, and Edvard I. Moser. Microstructure of a spatial map in the entorhinal cortex. Nature, 436(7052):801–806, 2005. doi: 10.1038/nature03721

  23. [23]

    3d-LLM: Injecting the 3d world into large language models

    Yining Hong, Haoyu Zhen, Peihao Chen, Shuhong Zheng, Yilun Du, Zhenfang Chen, and Chuang Gan. 3d-LLM: Injecting the 3d world into large language models. In Advances in Neural Information Processing Systems, 2023

  24. [24]

    LoRA: Low-Rank Adaptation of Large Language Models

    J. Edward Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, and Weizhu Chen. LoRA: Low-rank adaptation of large language models. arXiv, abs/2106.09685, 2021

  25. [25]

    3DLLM-Mem: Long-term spatial-temporal memory for embodied 3d large language model

    Wenbo Hu, Yining Hong, Yanjun Wang, Leison Gao, Zibu Wei, Xingcheng Yao, Nanyun Peng, Yonatan Bitton, Idan Szpektor, and Kai-Wei Chang. 3dllm-mem: Long-term spatial-temporal memory for embodied 3d large language model. arXiv, abs/2505.22657, 2025

  26. [26]

    Video2layout: Recall and reconstruct metric-grounded cognitive map for spatial reasoning

    Yibin Huang, Wang Xu, Wanyue Zhang, Helu Zhi, Jingjing Huang, Yangbin Xu, Yangang Sun, Conghui Zhu, and Tiejun Zhao. Video2layout: Recall and reconstruct metric-grounded cognitive map for spatial reasoning. arXiv, abs/2511.16160, 2025

  27. [27]

    The spatial semantic hierarchy

    Benjamin Kuipers. The spatial semantic hierarchy. Artificial Intelligence, 119(1–2):191–233, 2000

  28. [28]

    MASt3r: Grounding image matching in 3d

    Vincent Leroy, Yohann Cabon, and Jerome Revaud. MASt3r: Grounding image matching in 3d. arXiv, abs/2406.09756, 2024

  29. [29]

    LLaVA-OneVision: Easy Visual Task Transfer

    Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Peiyuan Li, Yanwei Liu, and Chunyuan Li. LLaVA-OneVision: Easy visual task transfer. arXiv, abs/2408.03326, 2024

  30. [30]

    SpatialLadder: Progressive training for spatial reasoning in vision-language models

    Hongxing Li, Dingming Li, Zixuan Wang, Yuchen Yan, Hang Wu, Wenqi Zhang, Yongliang Shen, Weiming Lu, Jun Xiao, and Yueting Zhuang. SpatialLadder: Progressive training for spatial reasoning in vision-language models. arXiv, abs/2510.08531, 2025

  31. [31]

    BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models

    Junnan Li, Dongxu Li, Silvio Savarese, and Steven C. H. Hoi. BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In International Conference on Machine Learning, volume 202, pages 19730–19742, 2023

  32. [32]

    Video-LLaVA: Learning United Visual Representation by Alignment Before Projection

    Bin Lin, Bin Zhu, Yang Ye, Munan Ning, Peng Jin, and Li Yuan. Video-LLaVA: Learning united visual representation by alignment before projection. arXiv, abs/2311.10122, 2024

  33. [33]

    LLaVA-NeXT: Improved reasoning, OCR, and world knowledge

    Haotian Liu, Chunyuan Li, Yuheng Li, Bo Li, Yuanhan Zhang, Sheng Shen, and Yong Jae Lee. LLaVA-NeXT: Improved reasoning, OCR, and world knowledge. https://llava-vl.github.io/blog/2024-01-30-llava-next/, 2024

  34. [34]

    Decoupled weight decay regularization

    Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In International Conference on Learning Representations, 2019

  35. [35]

    SQA3d: Situated question answering in 3d scenes

    Xiaojian Ma, Silong Yong, Zilong Zheng, Qing Li, Yitao Liang, Song-Chun Zhu, and Siyuan Huang. SQA3d: Situated question answering in 3d scenes. In International Conference on Learning Representations, 2023

  36. [36]

    SpatialLM: Training large language models for structured indoor modeling

    Yongsen Mao, Junhao Zhong, Chuan Fang, Jia Zheng, Rui Tang, Hao Zhu, Ping Tan, and Zihan Zhou. SpatialLM: Training large language models for structured indoor modeling. arXiv, abs/2506.07491, 2025

  37. [37]

    Kimi-VL Technical Report

    Moonshot AI et al. Kimi-VL technical report. arXiv, abs/2504.07491, 2025

  38. [38]

    The Hippocampus as a Cognitive Map

    John O'Keefe and Lynn Nadel. The Hippocampus as a Cognitive Map. Clarendon Press, Oxford, 1978

  39. [39]

    GPT-4 Technical Report

    OpenAI, Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, et al. GPT-4 technical report. arXiv, abs/2303.08774, 2023

  40. [40]

    GPT-4o System Card

    OpenAI, Aaron Hurst, Adam Lerer, Adam P. Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, et al. GPT-4o system card. arXiv, abs/2410.21276, 2024

  41. [41]

    SpaceR: Reinforcing MLLMs in video spatial reasoning

    Kun Ouyang, Yuanxin Liu, Haoning Wu, Yi Liu, Hao Zhou, Jie Zhou, Fandong Meng, and Xu Sun. SpaceR: Reinforcing MLLMs in video spatial reasoning. arXiv, abs/2504.01805, 2025

  42. [42]

    Learning transferable visual models from natural language supervision

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pages 8748–8763, 2021

  43. [43]

    Spatial view cells and the representation of place in the primate hippocampus

    Edmund T. Rolls. Spatial view cells and the representation of place in the primate hippocampus. Hippocampus, 9(4):467–480, 1999

  44. [44]

    Spatial view cells in the primate hippocampus

    Edmund T. Rolls, Richard G. Robertson, and Philippe Georges-Francois. Spatial view cells in the primate hippocampus. European Journal of Neuroscience, 9(8):1789–1794, 1997. doi: 10.1111/j.1460-9568.1997.tb01538.x

  45. [45]

    From reactive to cognitive: Brain-inspired spatial intelligence for embodied agents

    Shouwei Ruan, Liyuan Wang, Caixin Kang, Qihui Zhu, Songming Liu, Xingxing Wei, and Hang Su. From reactive to cognitive: Brain-inspired spatial intelligence for embodied agents. arXiv, 2025

  46. [46]

    Structure-from-motion revisited

    Johannes L. Schönberger and Jan-Michael Frahm. Structure-from-motion revisited. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4104–4113, 2016

  47. [47]

    Pixelwise view selection for unstructured multi-view stereo

    Johannes L. Schönberger, Enliang Zheng, Marc Pollefeys, and Jan-Michael Frahm. Pixelwise view selection for unstructured multi-view stereo. In European Conference on Computer Vision, pages 501–518, 2016

  48. [48]

    OpenAI GPT-5 System Card

    Aman Singh et al. OpenAI GPT-5 system card. arXiv, abs/2601.03267, 2026

  49. [49]

    Mind the gap: Benchmarking spatial reasoning in vision-language models

    Ilias Stogiannidis, Steven McDonagh, and Sotirios A. Tsaftaris. Mind the gap: Benchmarking spatial reasoning in vision-language models. arXiv, abs/2503.19707, 2025

  50. [50]

    RoFormer: Enhanced Transformer with Rotary Position Embedding

    Jianlin Su, Yu Lu, Shengfeng Pan, Ahmed Murtadha, Bo Wen, and Yunfeng Liu. RoFormer: Enhanced transformer with rotary position embedding. arXiv, abs/2104.09864, 2021. doi: 10.48550/arXiv.2104.09864

  51. [51]

    Cognitive maps in rats and men

    Edward C. Tolman. Cognitive maps in rats and men. Psychological Review, 55(4):189–208, 1948. doi: 10.1037/h0061626

  53. [53]

    Two cortical visual systems

    Leslie G. Ungerleider and Mortimer Mishkin. Two cortical visual systems. In David J. Ingle, Melvyn A. Goodale, and Richard J. W. Mansfield, editors, Analysis of Visual Behavior, pages 549–586. MIT Press, Cambridge, MA, 1982

  54. [54]

    PatchmatchNet: Learned multi-view patchmatch stereo

    Fangjinhua Wang, Silvano Galliani, Christoph Vogel, Pablo Speciale, and Marc Pollefeys. PatchmatchNet: Learned multi-view patchmatch stereo. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14194–14203, 2021

  55. [55]

    VGGT: Visual geometry grounded transformer

    Jianyuan Wang, Minghao Chen, Nikita Karaev, Andrea Vedaldi, Christian Rupprecht, and David Novotný. VGGT: Visual geometry grounded transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5294–5306, 2025

  56. [56]

    CUT3r: Continuous 3d perception model with persistent state

    Qianqian Wang, Yifei Zhang, Aleksander Holynski, Alexei A. Efros, and Angjoo Kanazawa. CUT3r: Continuous 3d perception model with persistent state. arXiv, abs/2501.12387, 2025

  57. [57]

    DUSt3r: Geometric 3d vision made easy

    Shuzhe Wang, Vincent Leroy, Yohann Cabon, Boris Chidlovskii, and Jerome Revaud. DUSt3r: Geometric 3d vision made easy. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 20697–20709, 2024

  58. [58]

    SITE: Towards spatial intelligence thorough evaluation

    Wenqi Wang, Reuben Tan, Pengyue Zhu, Jianwei Yang, Zhengyuan Yang, Lijuan Wang, Andrey Kolobov, Jianfeng Gao, and Boqing Gong. SITE: Towards spatial intelligence thorough evaluation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 2025

  59. [59]

    The Tolman-Eichenbaum machine: Unifying space and relational memory through generalization in the hippocampal formation

    James C. R. Whittington, Timothy H. Muller, Shirley Mark, Guifen Chen, Caswell Barry, Neil Burgess, and Timothy E. J. Behrens. The Tolman-Eichenbaum machine: Unifying space and relational memory through generalization in the hippocampal formation. Cell, 183(5):1249–1263, 2020. doi: 10.1016/j.cell.2020.10.024

  60. [60]

    Relating transformers to models and neural representations of the hippocampal formation

    James C. R. Whittington, Joseph Warren, and Timothy E. J. Behrens. Relating transformers to models and neural representations of the hippocampal formation. arXiv, abs/2112.04035, 2021

  61. [61]

    How to build a cognitive map: Insights from models of the hippocampal formation

    James C. R. Whittington, David McCaffary, Jacob J. W. Bakermans, and Timothy E. J. Behrens. How to build a cognitive map: Insights from models of the hippocampal formation. Nature Neuroscience, 25(10):1257–1272, 2022. doi: 10.1038/s41593-022-01153-y

  62. [62]

    Spatial-MLLM: Boosting MLLM capabilities in visual-based spatial intelligence

    Diankun Wu, Fangfu Liu, Yi-Hsin Hung, and Yueqi Duan. Spatial-MLLM: Boosting MLLM capabilities in visual-based spatial intelligence. In Advances in Neural Information Processing Systems, 2025

  63. [63]

    Reinforcing spatial reasoning in vision-language models with interwoven thinking and visual drawing

    Junfei Wu, Jian Guan, Kaituo Feng, Qiang Liu, Shu Wu, Liang Wang, Wei Wu, and Tieniu Tan. Reinforcing spatial reasoning in vision-language models with interwoven thinking and visual drawing. arXiv, abs/2506.09965, 2025

  64. [64]

    Grok 4 model card

    xAI. Grok 4 model card. https://data.x.ai/2025-08-20-grok-4-model-card.pdf. Accessed: 2026-05-06

  66. [66]

    Qwen3 Technical Report

    An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, Chujie Zheng, Dayiheng Liu, Fan Zhou, and others. Qwen3 technical report. arXiv, abs/2505.09388, 2025

  67. [67]

    Thinking in space: How multimodal large language models see, remember, and recall spaces

    Jihan Yang, Shusheng Yang, Anjali W. Gupta, Rilyn Han, Li Fei-Fei, and Saining Xie. Thinking in space: How multimodal large language models see, remember, and recall spaces. arXiv, abs/2412.14171, 2024

  68. [68]

    Visual spatial tuning

    Rui Yang, Ziyu Zhu, Yanwei Li, Jingjia Huang, Shen Yan, Siyuan Zhou, Zhe Liu, Xiangtai Li, Shuangye Li, Wenqian Wang, Yi Lin, and Hengshuang Zhao. Visual spatial tuning. arXiv, abs/2511.05491, 2025

  69. [69]

    Cambrian-S: Towards spatial supersensing in video

    Shusheng Yang, Jihan Yang, Pinzhi Huang, Ellis Brown, Zihao Yang, Yue Yu, Shengbang Tong, Zihan Zheng, Yifan Xu, Muhan Wang, Daohan Lu, Rob Fergus, Yann LeCun, Li Fei-Fei, and Saining Xie. Cambrian-s: Towards spatial supersensing in video. arXiv, abs/2511.04670, 2025

  70. [70]

    MVSNet: Depth inference for unstructured multi-view stereo

    Yao Yao, Zixin Luo, Shiwei Li, Tian Fang, and Long Quan. MVSNet: Depth inference for unstructured multi-view stereo. In European Conference on Computer Vision, pages 767–783, 2018

  71. [71]

    How far are VLMs from visual spatial intelligence? A benchmark-driven perspective

    Songsong Yu, Yuxin Chen, Hao Ju, Lianjie Jia, Fuxi Zhang, Shaofei Huang, Yuhan Wu, Rundi Cui, Binghao Ran, Zaibin Zhang, Zhedong Zheng, Zhipeng Zhang, Yifan Wang, Lin Song, Lijun Wang, Yanwei Li, Ying Shan, and Huchuan Lu. How far are VLMs from visual spatial intelligence? A benchmark-driven perspective. arXiv, abs/2509.18905, 2025

  72. [72]

    From flatland to space: Teaching vision-language models to perceive and reason in 3d

    Jiahui Zhang, Yurui Chen, Yanpeng Zhou, Yueming Xu, Ze Huang, Jilin Mei, Junhui Chen, Yu-Jie Yuan, Xinyue Cai, Guowei Huang, Xingyue Quan, Hang Xu, and Li Zhang. From flatland to space: Teaching vision-language models to perceive and reason in 3d. arXiv, abs/2503.22976, 2025

  73. [73]

    Theory of Space: Can Foundation Models Construct Spatial Beliefs through Active Exploration?

    Pingyue Zhang, Zihan Huang, Yue Wang, Jieyu Zhang, Letian Xue, Zihan Wang, Qineng Wang, Keshigeyan Chandrasegaran, Ruohan Zhang, Yejin Choi, Ranjay Krishna, Jiajun Wu, Li Fei-Fei, and Manling Li. Theory of space: Can foundation models construct spatial beliefs through active exploration? In ICLR, 2026. arXiv, abs/2602.07055

  74. [74]

    SpaceMind: Camera-guided modality fusion for spatial reasoning in vision-language models

    Ruosen Zhao, Zhikang Zhang, Jialei Xu, Jiahao Chang, Dong Chen, Lingyun Li, Weijian Sun, and Zizhuang Wei. SpaceMind: Camera-guided modality fusion for spatial reasoning in vision-language models. arXiv, abs/2511.23075, 2025

  75. [75]

    RoboRefer: Towards spatial referring with reasoning in vision-language models for robotics

    Enshen Zhou, Jingkun An, Cheng Chi, Yi Han, Shanyu Rong, Chi Zhang, Pengwei Wang, Zhongyuan Wang, Tiejun Huang, Lu Sheng, and Shanghang Zhang. RoboRefer: Towards spatial referring with reasoning in vision-language models for robotics. arXiv, abs/2506.04308, 2025

  76. [76]

    LLaVA-3d: A simple yet effective pathway to empowering LMMs with 3d awareness

    Chenming Zhu, Tai Wang, Wenwei Zhang, Jiangmiao Pang, and Xihui Liu. LLaVA-3d: A simple yet effective pathway to empowering LMMs with 3d awareness. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 2025

  77. [77]

    InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models

    Jinguo Zhu, Weiyun Wang, Zhe Chen, Zhaoyang Liu, Shenglong Ye, Lixin Gu, Hao Tian, Yuchen Duan, Weijie Su, Jie Shao, Zhangwei Gao, Erfei Cui, Xuehui Wang, Yue Cao, Yangzhou Liu, et al. InternVL3: Exploring advanced training and test-time recipes for open-source multimodal models. arXiv, abs/2504.10479, 2025