Recognition: 2 theorem links
SpaceMind++: Toward Allocentric Cognitive Maps for Spatially Grounded Video MLLMs
Pith reviewed 2026-05-12 04:41 UTC · model grok-4.3
The pith
SpaceMind++ builds a voxelized allocentric cognitive map from RGB video and fuses it back into pretrained video MLLMs via coordinate-guided iterative fusion for consistent 3D spatial reasoning.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
SpaceMind++ explicitly builds a voxelized cognitive map from RGB videos that reorganizes fragmented egocentric observations into a shared 3D metric representation, enabling the model to preserve object permanence and spatial topology across changing viewpoints. It then relays this allocentric knowledge back into the pretrained video MLLM's 2D visual features through Coordinate-Guided Deep Iterative Fusion, guided by coordinate embeddings and 3D Rotary Positional Encoding.
What carries the argument
The voxelized allocentric cognitive map, which converts egocentric video frames into a persistent world-centered 3D metric memory, combined with Coordinate-Guided Deep Iterative Fusion that grounds semantic interactions in metric space using coordinate embeddings and 3D Rotary Positional Encoding.
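The map-building step this describes can be sketched in a few lines. The sketch below is a minimal illustration under assumed inputs (pixels back-projected into world coordinates, e.g. from depth and camera pose, plus per-pixel semantic features); the paper's actual grid resolution, feature backbone, and pooling rule are not specified in the text above.

```python
import numpy as np

def build_voxel_map(points_world, features, voxel_size=0.2):
    """Pool per-pixel 2D features into a world-centered (allocentric) voxel grid.

    points_world: (N, 3) pixel locations back-projected into world coordinates
    features:     (N, D) semantic features for the same pixels
    Returns {voxel index (i, j, k): mean feature vector}.
    """
    sums, counts = {}, {}
    idx = np.floor(points_world / voxel_size).astype(int)
    for key, feat in zip(map(tuple, idx), features):
        if key in sums:
            sums[key] += feat
            counts[key] += 1
        else:
            sums[key] = feat.astype(np.float64).copy()
            counts[key] = 1
    return {k: v / counts[k] for k, v in sums.items()}

# Observations of the same world location made from different frames fall
# into the same voxel, which is what makes the memory viewpoint-invariant.
pts = np.array([[0.05, 0.05, 0.05], [0.15, 0.12, 0.01], [1.0, 1.0, 1.0]])
feats = np.array([[1.0, 0.0], [0.0, 1.0], [4.0, 4.0]])
vmap = build_voxel_map(pts, feats)
```

Mean pooling is only one plausible update rule; an attention-based or recency-weighted write would fit the same interface.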
If this is right
- Achieves new state-of-the-art performance on VSI-Bench for video spatial understanding tasks.
- Demonstrates superior out-of-distribution generalization on SPBench, SITE-Bench, and SPAR-Bench in unseen 3D environments.
- Maintains object permanence and spatial topology across viewpoint changes without altering the native visual-token interface.
- Grounds semantic interactions in explicit metric 3D space through coordinate embeddings and 3D rotary positional encoding.
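The exact form of the 3D Rotary Positional Encoding is not given in the text above; one common construction splits the channel dimension into three groups and applies standard 1D RoPE along each metric coordinate. The sketch below illustrates that pattern and RoPE's key property (inner products depend only on relative position), not the authors' parameterization.

```python
import numpy as np

def rope_1d(x, pos, base=10000.0):
    """Standard rotary embedding along one axis.
    x: (..., d) with d even; pos: scalar or (...,) coordinate values."""
    d = x.shape[-1]
    inv_freq = base ** (-np.arange(0, d, 2) / d)      # (d/2,) frequencies
    angles = np.asarray(pos)[..., None] * inv_freq    # (..., d/2)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = np.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin              # rotate each channel pair
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

def rope_3d(x, xyz):
    """Split channels into three groups, rotating each by one coordinate.
    x: (..., d) with d divisible by 6; xyz: (..., 3) metric coordinates."""
    d = x.shape[-1] // 3
    parts = [rope_1d(x[..., i * d:(i + 1) * d], xyz[..., i]) for i in range(3)]
    return np.concatenate(parts, axis=-1)

# Demo vectors: the relative-position property checked below is what lets
# attention scores depend only on coordinate differences, not absolute pose.
v = np.array([1.0, 2.0, 3.0, 4.0])
w = np.array([0.5, -1.0, 2.0, 0.0])
```

Because each channel pair is rotated, the encoding also preserves token norms, so it perturbs feature magnitudes less than additive positional schemes.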
Where Pith is reading between the lines
- The same fusion approach could support incremental map updates for longer or dynamic video sequences.
- Explicit allocentric maps may transfer to embodied tasks such as robot navigation or 3D scene manipulation.
- The coordinate-guided mechanism suggests a general way to inject metric structure into other pretrained multimodal models.
Load-bearing premise
A voxelized cognitive map extracted from RGB video can be fused back into a pretrained MLLM's 2D visual features via coordinate guidance without eroding the model's original visual or language capabilities or creating new spatial inconsistencies.
What would settle it
A controlled video sequence with known 3D ground-truth trajectories where the model incorrectly reports object locations or relations after a large viewpoint shift despite having built the cognitive map.
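A harness for this settling experiment could look like the following sketch. The function and object names are hypothetical; it assumes the model can be queried for world-frame object coordinates after the viewpoint shift.

```python
import numpy as np

def permanence_errors(gt_positions, reported, tol=0.25):
    """Check allocentric consistency: an object's reported world-frame
    position should match ground truth regardless of viewpoint.

    gt_positions: {object_id: (3,) ground-truth world coordinates}
    reported:     {object_id: (3,) model-reported coordinates after the shift}
    tol:          metric tolerance in the same units as the coordinates
    Returns (per-object errors, objects violating the tolerance).
    """
    errors, violations = {}, []
    for obj, gt in gt_positions.items():
        err = float(np.linalg.norm(np.asarray(reported[obj]) - np.asarray(gt)))
        errors[obj] = err
        if err > tol:
            violations.append(obj)
    return errors, violations

gt = {"chair": (1.0, 0.0, 0.5), "lamp": (3.0, 2.0, 1.0)}
# Hypothetical model outputs after a large viewpoint change:
rep = {"chair": (1.1, 0.05, 0.5), "lamp": (2.0, 2.0, 1.0)}
errs, bad = permanence_errors(gt, rep)
```

A single object in `bad` after a viewpoint shift, despite the cognitive map being built, would be the falsifying observation described above.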
Original abstract
Recent multimodal large language models (MLLMs) have made remarkable progress in visual understanding and language-based reasoning, yet they lack a persistent world-centered representation for spatially consistent reasoning in 3D environments. Inspired by the mammalian dual-stream system, where semantic and spatial cues are processed separately and integrated into an allocentric cognitive map, we propose SpaceMind++, a video MLLM architecture that explicitly builds a voxelized cognitive map from RGB videos. This map reorganizes fragmented egocentric observations into a shared 3D metric representation, enabling the model to preserve object permanence and spatial topology across changing viewpoints. To make this allocentric representation usable by a pretrained video MLLM without disrupting its native visual-token interface, we introduce Coordinate-Guided Deep Iterative Fusion, a new mechanism that relays map-level spatial knowledge back into the original 2D visual features. This fusion is explicitly guided by coordinate embeddings and 3D Rotary Positional Encoding, which ground semantic interactions in metric 3D space, resembling the entorhinal binding of sensory features to metric space. Extensive experiments show that SpaceMind++ achieves new state-of-the-art performance on VSI-Bench. Furthermore, it demonstrates superior out-of-distribution generalization on SPBench, SITE-Bench, and SPAR-Bench, underscoring its robustness in unseen 3D environments.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes SpaceMind++, a video MLLM architecture that builds an explicit voxelized allocentric cognitive map from monocular RGB video to reorganize egocentric observations into a shared 3D metric representation. It introduces Coordinate-Guided Deep Iterative Fusion, which uses coordinate embeddings and 3D Rotary Positional Encoding to inject map-level spatial knowledge back into the pretrained model's native 2D visual features. The paper claims this yields new state-of-the-art results on VSI-Bench together with superior out-of-distribution generalization on SPBench, SITE-Bench, and SPAR-Bench.
Significance. If the fusion operator can be shown to preserve the original visual feature manifold and non-spatial reasoning capabilities while adding allocentric spatial consistency, the work would constitute a substantive advance toward spatially grounded video MLLMs. The neuroscience-inspired separation of semantic and spatial streams and the explicit construction of a persistent 3D metric map from video address a recognized limitation in current MLLMs; the coordinate-guided iterative fusion mechanism is a concrete technical contribution that could influence subsequent architectures.
major comments (2)
- [Abstract, §5] Abstract and §5 (Experiments): The abstract asserts new SOTA performance on VSI-Bench and superior OOD generalization on three additional benchmarks, yet the provided text contains no numerical results, baseline comparisons, ablation tables, or error analysis. Without these data it is impossible to determine the magnitude of the claimed gains or to isolate the contribution of the voxelized map from possible side-effects of the fusion operator.
- [§4.3] §4.3 (Coordinate-Guided Deep Iterative Fusion): The text states that the mechanism 'relays map-level spatial knowledge back into the original 2D visual features' without disrupting the native visual-token interface. No quantitative verification is supplied (e.g., before/after performance on non-spatial VQA tasks, embedding-distribution statistics, or attention-pattern comparisons) to confirm that iterative coordinate-guided updates leave the pretrained visual manifold unchanged. This assumption is load-bearing for the central claim that spatial improvements are obtained without collateral degradation.
minor comments (2)
- [Abstract] The abstract would be strengthened by the inclusion of one or two key quantitative results (e.g., absolute accuracy deltas on VSI-Bench) to support the SOTA claim.
- [§3, §4] Notation for the voxel grid resolution, coordinate embedding dimension, and number of iterative fusion steps should be introduced once in §3 or §4 and used consistently thereafter.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed review. The comments highlight important aspects of clarity and empirical support that we will address in the revision. Below we respond point-by-point to the major comments.
Point-by-point responses
-
Referee: [Abstract, §5] Abstract and §5 (Experiments): The abstract asserts new SOTA performance on VSI-Bench and superior OOD generalization on three additional benchmarks, yet the provided text contains no numerical results, baseline comparisons, ablation tables, or error analysis. Without these data it is impossible to determine the magnitude of the claimed gains or to isolate the contribution of the voxelized map from possible side-effects of the fusion operator.
Authors: We agree that the abstract and experiments section must contain concrete numerical evidence to substantiate the performance claims. The full manuscript includes detailed results, tables, and comparisons in §5, but we acknowledge that the current presentation does not foreground them sufficiently for immediate evaluation. In the revised version we will (i) insert key quantitative results (e.g., accuracy deltas on VSI-Bench and the three OOD benchmarks) directly into the abstract, (ii) ensure §5 opens with a consolidated main-results table that includes all baselines, and (iii) expand the ablation and error-analysis subsections to isolate the contribution of the voxelized map versus the fusion operator. These changes will make the magnitude of the gains and the source of improvements transparent. revision: yes
-
Referee: [§4.3] §4.3 (Coordinate-Guided Deep Iterative Fusion): The text states that the mechanism 'relays map-level spatial knowledge back into the original 2D visual features' without disrupting the native visual-token interface. No quantitative verification is supplied (e.g., before/after performance on non-spatial VQA tasks, embedding-distribution statistics, or attention-pattern comparisons) to confirm that iterative coordinate-guided updates leave the pretrained visual manifold unchanged. This assumption is load-bearing for the central claim that spatial improvements are obtained without collateral degradation.
Authors: We concur that explicit quantitative verification of manifold preservation is necessary to support the central claim. The current manuscript relies on architectural design arguments and indirect evidence from downstream spatial tasks, but does not report the requested controls. In the revision we will add a dedicated subsection under §4.3 (or a new appendix) that includes: (a) before/after accuracy on a suite of non-spatial VQA benchmarks, (b) cosine-similarity and distributional statistics (mean, variance, KL divergence) of visual embeddings before and after fusion, and (c) qualitative attention-map comparisons on representative frames. These measurements will directly test whether the coordinate-guided updates leave the pretrained visual manifold intact. revision: yes
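The embedding statistics promised in the rebuttal, per-token cosine similarity plus a divergence between embedding distributions before and after fusion, could be computed as in the sketch below. This is an illustrative protocol (histogram-based KL over embedding norms), not the authors' specified measurement.

```python
import numpy as np

def manifold_shift_stats(before, after, bins=50):
    """Compare visual token embeddings before and after fusion.

    before, after: (N, D) embeddings of the same tokens.
    Returns (mean per-token cosine similarity, histogram-based KL divergence
    between the two embedding-norm distributions).
    """
    # Per-token cosine similarity (1.0 means fusion left the token unchanged).
    num = np.sum(before * after, axis=1)
    den = np.linalg.norm(before, axis=1) * np.linalg.norm(after, axis=1)
    mean_cos = float(np.mean(num / np.maximum(den, 1e-12)))

    # KL divergence between norm distributions on shared histogram bins,
    # with a small additive smoothing so empty bins stay finite.
    nb, na = np.linalg.norm(before, axis=1), np.linalg.norm(after, axis=1)
    lo, hi = min(nb.min(), na.min()), max(nb.max(), na.max())
    p, _ = np.histogram(nb, bins=bins, range=(lo, hi))
    q, _ = np.histogram(na, bins=bins, range=(lo, hi))
    p = (p + 1e-9) / (p + 1e-9).sum()
    q = (q + 1e-9) / (q + 1e-9).sum()
    kl = float(np.sum(p * np.log(p / q)))
    return mean_cos, kl

rng = np.random.default_rng(0)
emb = rng.normal(size=(256, 64))
same_cos, same_kl = manifold_shift_stats(emb, emb)  # identical inputs
```

Identical inputs give cosine 1.0 and KL 0.0; a fusion operator that preserves the pretrained manifold should stay close to that baseline on non-spatial inputs.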
Circularity Check
No circularity: empirical architecture proposal with no reductive derivations
full rationale
The paper proposes SpaceMind++ as an empirical architecture that constructs a voxelized allocentric cognitive map from monocular RGB video and integrates it via a new Coordinate-Guided Deep Iterative Fusion mechanism using coordinate embeddings and 3D RoPE. No equations, first-principles derivations, fitted parameters renamed as predictions, or self-citation chains appear in the provided text. Claims of SOTA performance and OOD generalization rest on benchmark experiments rather than any step that reduces by construction to the inputs. The central fusion step is presented as an explicit design choice, not a mathematical necessity derived from prior self-referential results. This matches the default expectation of a self-contained empirical contribution.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Semantic and spatial cues are processed separately in the mammalian dual-stream system and integrated into an allocentric cognitive map
invented entities (1)
-
Voxelized cognitive map
no independent evidence
Lean theorems connected to this paper
-
IndisputableMonolith/Foundation/AlexanderDuality.lean · alexander_duality_circle_linking · echoes
ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.
projects frame-level 2D semantic features into a voxelized map organized in 3D space... persistent voxelized allocentric cognitive map... metric 3D space
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
Flamingo: A visual language model for few-shot learning
Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, and others. Flamingo: A visual language model for few-shot learning. In Advances in Neural Information Processing Systems, 2022
work page 2022
-
[2]
Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond
Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. Qwen-VL: A frontier large vision-language model with versatile abilities. arXiv, abs/2308.12966, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[3]
Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, and others. Qwen2.5-VL technical report. arXiv, abs/2502.13923, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[4]
ByteDance Seed et al. Seed1.5-VL technical report. arXiv, abs/2505.07062, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[5]
SpatialBot: Precise spatial understanding with vision language models
Wenxiao Cai, Iaroslav Ponomarenko, Jianhao Yuan, Xiaoqi Li, Wankou Yang, Hao Dong, and Bo Zhao. SpatialBot: Precise spatial understanding with vision language models. In IEEE International Conference on Robotics and Automation, 2025
work page 2025
-
[6]
Scaling spatial intelligence with multimodal foundation models
Zhongang Cai, Ruisi Wang, Chenyang Gu, Fanyi Pu, Junxiang Xu, Yubo Wang, Wanqi Yin, Zhitao Yang, Chen Wei, Qingping Sun, Tongxi Zhou, Jiaqi Li, Hui En Pang, Oscar Qian, Yukun Wei, Zhiqian Lin, Xuanke Shi, Kewang Deng, Xiaoyang Han, Zukai Chen, Xiangyu Fan, Hanming Deng, Lewei Lu, Liang Pan, Bo Li, Ziwei Liu, Quan Wang, Dahua Lin, and Lei Yang. Scaling spatial intelligence with multimodal foundation models. arXiv, abs/2511.13719, 2025
-
[7]
Zhe Chen, Weiyun Wang, Yue Cao, Yangzhou Liu, Zhangwei Gao, Erfei Cui, Jinguo Zhu, Shenglong Ye, Hao Tian, Zhaoyang Liu, Lixin Gu, Xuehui Wang, et al. Expanding performance boundaries of open-source multimodal models with model, data, and test-time scaling. arXiv, abs/2412.05271, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[8]
InternVL: Scaling up vision foundation models and aligning for generic visual-linguistic tasks
Zhe Chen, Jiannan Wu, Wenhai Wang, Weijie Su, Guo Chen, Sen Xing, Muyan Zhong, Qinglong Zhang, Xizhou Zhu, Lewei Lu, Bin Li, Ping Luo, Tong Lu, Yu Qiao, and Jifeng Dai. InternVL: Scaling up vision foundation models and aligning for generic visual-linguistic tasks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024
work page 2024
-
[9]
EgoMind: Activating Spatial Cognition through Linguistic Reasoning in MLLMs
Zhenghao Chen, Huiqun Wang, and Di Huang. EgoMind: Activating spatial cognition through linguistic reasoning in MLLMs. arXiv, abs/2604.03318, 2026
work page internal anchor Pith review Pith/arXiv arXiv 2026
-
[10]
SpatialRGPT: Grounded spatial reasoning in vision-language models
An-Chieh Cheng, Hongxu Yin, Yang Fu, Qiushan Guo, Ruihan Yang, Jan Kautz, Xiaolong Wang, and Sifei Liu. SpatialRGPT: Grounded spatial reasoning in vision-language models. In Advances in Neural Information Processing Systems, 2024
work page 2024
-
[11]
Gheorghe Comanici, David Bieber, Mike Schaekermann, Panupong Pasupat, Noveen Sachdeva, Inderjit Dhillon, Michael Blistein, Ori Ram, Dan Zhang, Evan Rosen, et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities. arXiv, abs/2507.06261, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[12]
The cognitive map in humans: Spatial navigation and beyond
Russell A. Epstein, Eva Zita Patai, Joshua B. Julian, and Hugo J. Spiers. The cognitive map in humans: Spatial navigation and beyond. Nature Neuroscience, 20(11):1504–1513, 2017. doi: 10.1038/nn.4656
-
[13]
VLM-3r: Vision-language models augmented with instruction-aligned 3d reconstruction
Zhiwen Fan, Jian Zhang, Renjie Li, Junge Zhang, Runjin Chen, Hezhen Hu, Kevin Wang, Huaizhi Qu, Dilin Wang, Zhicheng Yan, Hongyu Xu, Justin Theiss, Tianlong Chen, Jiachen Li, Zhengzhong Tu, Zhangyang Wang, and Rakesh Ranjan. VLM-3r: Vision-language models augmented with instruction-aligned 3d reconstruction. In Proceedings of the IEEE/CVF Conference on C...
work page 2026
-
[14]
Video-R1: Reinforcing Video Reasoning in MLLMs
Kaituo Feng, Kaixiong Gong, Bohao Li, Zonghao Guo, Yibing Wang, Tianshuo Peng, Benyou Wang, and Xiangyu Yue. Video-R1: Reinforcing video reasoning in MLLMs. arXiv, abs/2503.21776, 2025
work page internal anchor Pith review arXiv 2025
-
[15]
Qing Feng. Towards visuospatial cognition via hierarchical fusion of semantic and spatial representations. arXiv, abs/2505.12363, 2025
-
[16]
Scene-LLM: Extending language model for 3d visual understanding and reasoning
Rao Fu, Jingyu Liu, Xilun Chen, Yixin Nie, and Wenhan Xiong. Scene-LLM: Extending language model for 3d visual understanding and reasoning. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2025
work page 2025
-
[17]
Map2thought: Explicit 3d spatial reasoning via metric cognitive maps
Xiangjun Gao, Zhensong Zhang, Dave Zhenyu Chen, Songcen Xu, Long Quan, Eduardo Pérez-Pellitero, and Youngkyoon Jang. Map2thought: Explicit 3d spatial reasoning via metric cognitive maps. arXiv, abs/2601.11442, 2026
-
[18]
Melvyn A. Goodale and A. David Milner. Separate visual pathways for perception and action. Trends in Neurosciences, 15(1):20–25, 1992. doi: 10.1016/0166-2236(92)90344-8
-
[19]
Gemini 3 pro: The frontier of vision AI, 2025
Google DeepMind. Gemini 3 pro: The frontier of vision AI, 2025. Accessed: 2026-03-21
work page 2025
-
[20]
Cognitive mapping and planning for visual navigation
Saurabh Gupta, Varun Tolani, James Davidson, Sergey Levine, Rahul Sukthankar, and Jitendra Malik. Cognitive mapping and planning for visual navigation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2616–2625, 2017
work page 2017
-
[21]
Cog3dmap: Multi-view vision-language reasoning with 3d cognitive maps
Chanyoung Gwak, Yoonwoo Jeong, Byungwoo Jeon, Hyunseok Lee, Jinwoo Shin, and Minsu Cho. Cog3dmap: Multi-view vision-language reasoning with 3d cognitive maps. arXiv, abs/2603.23023, 2026
-
[22]
Torkel Hafting, Marianne Fyhn, Sturla Molden, May-Britt Moser, and Edvard I. Moser. Microstructure of a spatial map in the entorhinal cortex. Nature, 436(7052):801–806, 2005. doi: 10.1038/nature03721
-
[23]
3d-LLM: Injecting the 3d world into large language models
Yining Hong, Haoyu Zhen, Peihao Chen, Shuhong Zheng, Yilun Du, Zhenfang Chen, and Chuang Gan. 3d-LLM: Injecting the 3d world into large language models. In Advances in Neural Information Processing Systems, 2023
work page 2023
-
[24]
LoRA: Low-Rank Adaptation of Large Language Models
Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, and Weizhu Chen. LoRA: Low-rank adaptation of large language models. arXiv, abs/2106.09685, 2021
work page internal anchor Pith review Pith/arXiv arXiv 2021
-
[25]
3DLLM-Mem: Long-term spatial-temporal memory for embodied 3d large language model
Wenbo Hu, Yining Hong, Yanjun Wang, Leison Gao, Zibu Wei, Xingcheng Yao, Nanyun Peng, Yonatan Bitton, Idan Szpektor, and Kai-Wei Chang. 3DLLM-Mem: Long-term spatial-temporal memory for embodied 3d large language model. arXiv, abs/2505.22657, 2025
-
[26]
Yibin Huang, Wang Xu, Wanyue Zhang, Helu Zhi, Jingjing Huang, Yangbin Xu, Yangang Sun, Conghui Zhu, and Tiejun Zhao. Video2layout: Recall and reconstruct metric-grounded cognitive map for spatial reasoning. arXiv, abs/2511.16160, 2025
-
[27]
The spatial semantic hierarchy
Benjamin Kuipers. The spatial semantic hierarchy. Artificial Intelligence, 119(1–2):191–233, 2000
work page 2000
-
[28]
MASt3r: Grounding image matching in 3d
Vincent Leroy, Yohann Cabon, and Jerome Revaud. MASt3r: Grounding image matching in 3d. arXiv, abs/2406.09756, 2024
-
[29]
LLaVA-OneVision: Easy Visual Task Transfer
Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Peiyuan Li, Yanwei Liu, and Chunyuan Li. LLaVA-OneVision: Easy visual task transfer. arXiv, abs/2408.03326, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[30]
Hongxing Li, Dingming Li, Zixuan Wang, Yuchen Yan, Hang Wu, Wenqi Zhang, Yongliang Shen, Weiming Lu, Jun Xiao, and Yueting Zhuang. SpatialLadder: Progressive training for spatial reasoning in vision-language models. arXiv, abs/2510.08531, 2025
-
[31]
Junnan Li, Dongxu Li, Silvio Savarese, and Steven C. H. Hoi. BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In International Conference on Machine Learning, volume 202, pages 19730–19742, 2023
work page 2023
-
[32]
Video-LLaVA: Learning United Visual Representation by Alignment Before Projection
Bin Lin, Bin Zhu, Yang Ye, Munan Ning, Peng Jin, and Li Yuan. Video-LLaVA: Learning united visual representation by alignment before projection. arXiv, abs/2311.10122, 2024
work page internal anchor Pith review arXiv 2024
-
[33]
LLaVA-NeXT: Improved reasoning, OCR, and world knowledge
Haotian Liu, Chunyuan Li, Yuheng Li, Bo Li, Yuanhan Zhang, Sheng Shen, and Yong Jae Lee. LLaVA-NeXT: Improved reasoning, OCR, and world knowledge. https://llava-vl.github.io/blog/2024-01-30-llava-next/, 2024
work page 2024
-
[34]
Decoupled weight decay regularization
Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In International Conference on Learning Representations, 2019
work page 2019
-
[35]
SQA3d: Situated question answering in 3d scenes
Xiaojian Ma, Silong Yong, Zilong Zheng, Qing Li, Yitao Liang, Song-Chun Zhu, and Siyuan Huang. SQA3d: Situated question answering in 3d scenes. In International Conference on Learning Representations, 2023
work page 2023
-
[36]
SpatialLM: Training large language models for structured indoor modeling
Yongsen Mao, Junhao Zhong, Chuan Fang, Jia Zheng, Rui Tang, Hao Zhu, Ping Tan, and Zihan Zhou. SpatialLM: Training large language models for structured indoor modeling. arXiv, abs/2506.07491, 2025
-
[37]
Moonshot AI et al. Kimi-VL technical report. arXiv, abs/2504.07491, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[38]
John O’Keefe and Lynn Nadel. The Hippocampus as a Cognitive Map. Clarendon Press, Oxford, 1978
work page 1978
-
[39]
OpenAI, Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, et al. GPT-4 technical report. arXiv, abs/2303.08774, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[40]
OpenAI, Aaron Hurst, Adam Lerer, Adam P. Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, et al. GPT-4o system card. arXiv, abs/2410.21276, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[41]
SpaceR: Reinforcing MLLMs in video spatial reasoning
Kun Ouyang, Yuanxin Liu, Haoning Wu, Yi Liu, Hao Zhou, Jie Zhou, Fandong Meng, and Xu Sun. SpaceR: Reinforcing MLLMs in video spatial reasoning. arXiv, abs/2504.01805, 2025
-
[42]
Learning transferable visual models from natural language supervision
Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pages 8748–8763, 2021
work page 2021
-
[43]
Edmund T. Rolls. Spatial view cells and the representation of place in the primate hippocampus. Hippocampus, 9(4):467–480, 1999
work page 1999
-
[44]
Edmund T. Rolls, Richard G. Robertson, and Philippe Georges-Francois. Spatial view cells in the primate hippocampus. European Journal of Neuroscience, 9(8):1789–1794, 1997. doi: 10.1111/j.1460-9568.1997.tb01538.x
-
[45]
From reactive to cognitive: Brain-inspired spatial intelligence for embodied agents
Shouwei Ruan, Liyuan Wang, Caixin Kang, Qihui Zhu, Songming Liu, Xingxing Wei, and Hang Su. From reactive to cognitive: Brain-inspired spatial intelligence for embodied agents. arXiv, 2025
work page 2025
-
[46]
Structure-from-motion revisited
Johannes L. Schönberger and Jan-Michael Frahm. Structure-from-motion revisited. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4104–4113, 2016
work page 2016
-
[47]
Pixelwise view selection for unstructured multi-view stereo
Johannes L. Schönberger, Enliang Zheng, Marc Pollefeys, and Jan-Michael Frahm. Pixelwise view selection for unstructured multi-view stereo. In European Conference on Computer Vision, pages 501–518, 2016
work page 2016
-
[48]
Aman Singh et al. OpenAI GPT-5 system card. arXiv, abs/2601.03267, 2026
work page internal anchor Pith review Pith/arXiv arXiv 2026
- [49]
-
[50]
RoFormer: Enhanced Transformer with Rotary Position Embedding
Jianlin Su, Yu Lu, Shengfeng Pan, Ahmed Murtadha, Bo Wen, and Yunfeng Liu. RoFormer: Enhanced transformer with rotary position embedding. arXiv, abs/2104.09864, 2021. doi: 10.48550/arXiv.2104.09864
work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2104.09864 2021
-
[51]
Edward C. Tolman. Cognitive maps in rats and men. Psychological Review, 55(4):189–208, 1948. doi: 10.1037/h0061626
-
[53]
Two cortical visual systems
Leslie G. Ungerleider and Mortimer Mishkin. Two cortical visual systems. In David J. Ingle, Melvyn A. Goodale, and Richard J. W. Mansfield, editors, Analysis of Visual Behavior, pages 549–586. MIT Press, Cambridge, MA, 1982
work page 1982
-
[54]
PatchmatchNet: Learned multi-view patchmatch stereo
Fangjinhua Wang, Silvano Galliani, Christoph Vogel, Pablo Speciale, and Marc Pollefeys. PatchmatchNet: Learned multi-view patchmatch stereo. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14194–14203, 2021
work page 2021
-
[55]
VGGT: Visual geometry grounded transformer
Jianyuan Wang, Minghao Chen, Nikita Karaev, Andrea Vedaldi, Christian Rupprecht, and David Novotný. VGGT: Visual geometry grounded transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5294–5306, 2025
work page 2025
-
[56]
CUT3r: Continuous 3d perception model with persistent state
Qianqian Wang, Yifei Zhang, Aleksander Holynski, Alexei A. Efros, and Angjoo Kanazawa. CUT3r: Continuous 3d perception model with persistent state. arXiv, abs/2501.12387, 2025
-
[57]
DUSt3r: Geometric 3d vision made easy
Shuzhe Wang, Vincent Leroy, Yohann Cabon, Boris Chidlovskii, and Jerome Revaud. DUSt3r: Geometric 3d vision made easy. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 20697–20709, 2024
work page 2024
-
[58]
SITE: Towards spatial intelligence thorough evaluation
Wenqi Wang, Reuben Tan, Pengyue Zhu, Jianwei Yang, Zhengyuan Yang, Lijuan Wang, Andrey Kolobov, Jianfeng Gao, and Boqing Gong. SITE: Towards spatial intelligence thorough evaluation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 2025
work page 2025
-
[59]
James C. R. Whittington, Timothy H. Muller, Shirley Mark, Guifen Chen, Caswell Barry, Neil Burgess, and Timothy E. J. Behrens. The tolman-eichenbaum machine: Unifying space and relational memory through generalization in the hippocampal formation. Cell, 183(5):1249–1263, 2020. doi: 10.1016/j.cell.2020.10.024
- [60]
-
[61]
James C. R. Whittington, David McCaffary, Jacob J. W. Bakermans, and Timothy E. J. Behrens. How to build a cognitive map: Insights from models of the hippocampal formation. Nature Neuroscience, 25(10):1257–1272, 2022. doi: 10.1038/s41593-022-01153-y
-
[62]
Spatial-MLLM: Boosting MLLM capabilities in visual-based spatial intelligence
Diankun Wu, Fangfu Liu, Yi-Hsin Hung, and Yueqi Duan. Spatial-MLLM: Boosting MLLM capabilities in visual-based spatial intelligence. In Advances in Neural Information Processing Systems, 2025
work page 2025
-
[63]
Junfei Wu, Jian Guan, Kaituo Feng, Qiang Liu, Shu Wu, Liang Wang, Wei Wu, and Tieniu Tan. Reinforcing spatial reasoning in vision-language models with interwoven thinking and visual drawing. arXiv, abs/2506.09965, 2025
-
[64]
xAI. Grok 4 model card. https://data.x.ai/2025-08-20-grok-4-model-card.pdf, 2025. Accessed: 2026-05-06
work page 2025
-
[66]
An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, Chujie Zheng, Dayiheng Liu, Fan Zhou, and others. Qwen3 technical report. arXiv, abs/2505.09388, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[67]
Jihan Yang, Shusheng Yang, Anjali W. Gupta, Rilyn Han, Li Fei-Fei, and Saining Xie. Thinking in space: How multimodal large language models see, remember, and recall spaces. arXiv, abs/2412.14171, 2024
-
[68]
Visual spatial tuning
Rui Yang, Ziyu Zhu, Yanwei Li, Jingjia Huang, Shen Yan, Siyuan Zhou, Zhe Liu, Xiangtai Li, Shuangye Li, Wenqian Wang, Yi Lin, and Hengshuang Zhao. Visual spatial tuning. arXiv, abs/2511.05491, 2025
-
[69]
Cambrian-S: Towards Spatial Supersensing in Video
Shusheng Yang, Jihan Yang, Pinzhi Huang, Ellis Brown, Zihao Yang, Yue Yu, Shengbang Tong, Zihan Zheng, Yifan Xu, Muhan Wang, Daohan Lu, Rob Fergus, Yann LeCun, Li Fei-Fei, and Saining Xie. Cambrian-S: Towards spatial supersensing in video. arXiv, abs/2511.04670, 2025
-
[70]
MVSNet: Depth inference for unstructured multi-view stereo
Yao Yao, Zixin Luo, Shiwei Li, Tian Fang, and Long Quan. MVSNet: Depth inference for unstructured multi-view stereo. In European Conference on Computer Vision, pages 767–783, 2018
work page 2018
-
[71]
Songsong Yu, Yuxin Chen, Hao Ju, Lianjie Jia, Fuxi Zhang, Shaofei Huang, Yuhan Wu, Rundi Cui, Binghao Ran, Zaibin Zhang, Zhedong Zheng, Zhipeng Zhang, Yifan Wang, Lin Song, Lijun Wang, Yanwei Li, Ying Shan, and Huchuan Lu. How far are VLMs from visual spatial intelligence? A benchmark-driven perspective. arXiv, abs/2509.18905, 2025
-
[72]
Jiahui Zhang, Yurui Chen, Yanpeng Zhou, Yueming Xu, Ze Huang, Jilin Mei, Junhui Chen, Yu-Jie Yuan, Xinyue Cai, Guowei Huang, Xingyue Quan, Hang Xu, and Li Zhang. From flatland to space: Teaching vision-language models to perceive and reason in 3d. arXiv, abs/2503.22976, 2025
-
[73]
Pingyue Zhang, Zihan Huang, Yue Wang, Jieyu Zhang, Letian Xue, Zihan Wang, Qineng Wang, Keshigeyan Chandrasegaran, Ruohan Zhang, Yejin Choi, Ranjay Krishna, Jiajun Wu, Li Fei-Fei, and Manling Li. Theory of space: Can foundation models construct spatial beliefs through active exploration? arXiv, abs/2602.07055, 2026
-
[74]
Ruosen Zhao, Zhikang Zhang, Jialei Xu, Jiahao Chang, Dong Chen, Lingyun Li, Weijian Sun, and Zizhuang Wei. SpaceMind: Camera-guided modality fusion for spatial reasoning in vision-language models. arXiv, abs/2511.23075, 2025
-
[75]
Enshen Zhou, Jingkun An, Cheng Chi, Yi Han, Shanyu Rong, Chi Zhang, Pengwei Wang, Zhongyuan Wang, Tiejun Huang, Lu Sheng, and Shanghang Zhang. RoboRefer: Towards spatial referring with reasoning in vision-language models for robotics. arXiv, abs/2506.04308, 2025
-
[76]
LLaVA-3d: A simple yet effective pathway to empowering LMMs with 3d awareness
Chenming Zhu, Tai Wang, Wenwei Zhang, Jiangmiao Pang, and Xihui Liu. LLaVA-3d: A simple yet effective pathway to empowering LMMs with 3d awareness. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 2025
work page 2025
-
[77]
InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models
Jinguo Zhu, Weiyun Wang, Zhe Chen, Zhaoyang Liu, Shenglong Ye, Lixin Gu, Hao Tian, Yuchen Duan, Weijie Su, Jie Shao, Zhangwei Gao, Erfei Cui, Xuehui Wang, Yue Cao, Yangzhou Liu, et al. InternVL3: Exploring advanced training and test-time recipes for open-source multimodal models. arXiv, abs/2504.10479, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025