SPACENUM: Revisiting Spatial Numerical Understanding in VLMs

Bingyang Wang; Han Liu; Haoran Lu; Huifeixin Chen; Jianshu Zhang; Letian Xue; Yijiang Li

arxiv: 2605.23898 · v1 · pith:VESSBQKWnew · submitted 2026-05-22 · 💻 cs.AI

Pith reviewed 2026-05-25 03:54 UTC · model grok-4.3

classification 💻 cs.AI

keywords Vision-Language Models · Spatial Numerical Understanding · Coordinate Grounding · Embodied AI · Spatial Reasoning · Task Evaluation · Num2Space · Space2Num

The pith

Vision-language models largely fail to ground numerical values in spatial perception, performing near random on bidirectional mapping tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a framework called SpaceNum to test whether VLMs can map between visual spatial structures and numerical representations in both dynamic exploration and static layout settings. It formulates two tasks: Num2Space, which maps numbers to visual outcomes, and Space2Num, which maps visual observations to numbers. Across multiple models, performance stays close to random, and the analysis points to reliance on shallow visual cues rather than stable coordinate representations. If correct, this means VLMs deployed in embodied settings cannot reliably produce or interpret spatial coordinates for actions or navigation. The work also tests two interventions, explicit reasoning and fine-tuning, and finds marginal gains from the former and partial gains from the latter.

Core claim

Current VLMs fail to ground numbers in spatial meaning across dynamic transitions and static layouts, relying heavily on shallow spatial cues, struggling to build stable coordinate-aware representations, and failing to abstract structured spatial layouts from visual observations. Explicit reasoning yields only marginal gains, while tuning partially improves understanding and transfers to external spatial benchmarks.

What carries the argument

The SpaceNum framework with its bidirectional Num2Space and Space2Num tasks that test mapping between vision-side spatial structure and language-side numerical representations.
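
A minimal sketch of how the two task directions could be scored, assuming a multiple-choice format for Num2Space and a distance tolerance for Space2Num. The record types, field names, and the 0.5 tolerance are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass
from math import dist
from typing import Sequence, Tuple


@dataclass(frozen=True)
class Space2NumItem:
    """Hypothetical record: the model sees an observation and outputs coordinates."""
    target_xy: Tuple[float, float]     # ground-truth coordinates of the target object
    predicted_xy: Tuple[float, float]  # coordinates parsed from the model's answer


@dataclass(frozen=True)
class Num2SpaceItem:
    """Hypothetical record: the model sees numbers and picks a visual outcome."""
    correct_option: int  # index of the candidate image that matches the numbers
    chosen_option: int   # index the model selected


def space2num_accuracy(items: Sequence[Space2NumItem], tolerance: float = 0.5) -> float:
    """Fraction of predictions within `tolerance` (Euclidean distance) of the target."""
    hits = sum(dist(i.target_xy, i.predicted_xy) <= tolerance for i in items)
    return hits / len(items)


def num2space_accuracy(items: Sequence[Num2SpaceItem]) -> float:
    """Fraction of items where the chosen option is the correct one."""
    hits = sum(i.correct_option == i.chosen_option for i in items)
    return hits / len(items)
```

Scoring the two directions separately keeps a model that is strong in one direction from masking a failure in the other.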

If this is right

Explicit chain-of-thought reasoning provides only marginal improvements on spatial numerical tasks.
Fine-tuning on the tasks can partially lift performance and transfer to other spatial reasoning benchmarks.
Models perform poorly on both dynamic transitions during spatial exploration and static layouts in reasoning.
Current VLMs cannot reliably produce action magnitudes or spatial coordinates grounded in perception.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Training corpora for VLMs may lack sufficient examples that force explicit coordinate-to-number alignment.
Applications in robotics could benefit from hybrid systems that add external coordinate tracking modules.
New benchmarks focused on blocking superficial cues might be needed to drive progress in spatial grounding.

Load-bearing premise

The Num2Space and Space2Num tasks measure genuine spatial numerical understanding rather than superficial visual or linguistic patterns unrelated to coordinate grounding.

What would settle it

A controlled experiment where a VLM achieves substantially above-chance accuracy on both tasks after interventions that block access to shallow visual patterns such as object counts or text overlays.
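
One way to make "substantially above-chance" concrete is an exact one-sided binomial test against the guessing rate. The sketch below assumes each item is scored right or wrong and that the chance rate is known (for example 0.25 for four options); the function name is illustrative, and this is not a procedure described in the paper.

```python
from math import comb


def p_value_above_chance(correct: int, total: int, chance: float) -> float:
    """Exact one-sided binomial p-value: P(X >= correct) under pure guessing.

    `chance` is the assumed probability of a correct guess on a single item.
    """
    return sum(
        comb(total, k) * chance**k * (1 - chance) ** (total - k)
        for k in range(correct, total + 1)
    )
```

A small p-value says the observed accuracy is unlikely under guessing. It does not show that the model used coordinate grounding, which is why the cue-blocking controls still matter.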

Figures

Figures reproduced from arXiv: 2605.23898 by Bingyang Wang, Han Liu, Haoran Lu, Huifeixin Chen, Jianshu Zhang, Letian Xue, Yijiang Li.

**Figure 1.** Overview of SPACENUM. We study spatial numerical understanding under two settings: numbers as dynamic transition in spatial exploration (left) and numbers as static layout in spatial understanding (right). We further investigate the mapping between vision-side space and language-side numbers via two tasks: NUM2SPACE, which maps numbers to visual outcomes (top), and SPACE2NUM, which maps visual inputs to nu…

**Figure 2.** Dataset statistics.

SPACE2NUM. The model is given an observation o and is required to infer the numerical coordinates p of a target object under the reference coordinate system. This task requires grounding visual spatial structure into numerical representations.

**Figure 3.** Structured analysis of model errors across spatial scenarios. Left: larger models tend to…

**Figure 4.** Additional analyses under dynamic transitions. Top left: blind testing by masking visual…

**Figure 5.** Visual-side interventions. Left: adding anchors for dynamic transitions and reducing objects…

**Figure 6.** Representation-side interventions. Left: changing numerical representations in dynamic…

**Figure 7.** Tuning analysis for spatial numerical understanding. Left: transfer patterns across different…

Original abstract

Vision-Language Models (VLMs) are increasingly deployed in embodied environments, where they need produce numerical outputs such as action magnitudes and spatial coordinates. Although these numbers appear meaningful, it remains unclear whether these numerical outputs are genuinely grounded in spatial perception. Therefore, in this work, we revisit spatial numerical understanding through SpaceNum, a unified framework that captures two complementary settings: numbers as dynamic transitions during spatial exploration, and numbers as static layouts in spatial reasoning. We formulate two bidirectional tasks, Num2Space and Space2Num, to evaluate how well VLMs map between vision-side spatial structure and language-side numerical representations. We systematically study whether current VLMs truly understand numerical values in spatial settings. Across dynamic transitions and static layouts, we find that models largely fail to ground numbers in spatial meaning and often perform close to random guess. Through error analysis, reasoning trace analysis, and controlled interventions, we show that current VLMs rely heavily on shallow spatial cues, struggle to build stable coordinate-aware representations, and fail to abstract structured spatial layouts from visual observations. We further show that explicit reasoning provides only marginal gains, while tuning can partially improve spatial numerical understanding and transfer to external spatial reasoning benchmarks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

SpaceNum adds bidirectional tasks that probe VLM spatial-number mapping, but the reported failures need tighter controls before the grounding deficit claim lands.

The new element is the SpaceNum setup with Num2Space and Space2Num tasks that run in both directions. The paper runs these on current VLMs, reports near-random results, and uses error analysis plus interventions to argue that models lean on shallow cues instead of building coordinate-aware representations. It also checks that chain-of-thought adds little and that some tuning helps on the tasks and transfers to other spatial benchmarks. That package is a reasonable addition to the VLM evaluation toolkit for embodied settings where coordinate output matters. The work is scoped tightly to one capability, which keeps the claims proportionate.

The main soft spot is the absence of concrete details on dataset construction, model selection, and how the interventions actually removed superficial correlations. The stress-test note is on point here: if residual visual or prompt statistics still align with the target numbers, the jump from low performance to “failure to abstract structured spatial layouts” stays under-supported. No math or fitted parameters are involved, so the circularity burden is low, but the empirical claim still needs verifiable ablations to hold.

This paper is for researchers who build or test VLMs for robotics and spatial reasoning. A reader who wants to see fresh task designs and some initial negative results on grounding will find it useful even if the controls need tightening. It deserves a serious referee so the methods can be examined in full.

Referee Report

3 major / 2 minor

Summary. The paper introduces the SpaceNum framework with bidirectional Num2Space and Space2Num tasks to evaluate whether VLMs map between visual spatial structures and numerical representations in dynamic transitions and static layouts. It reports near-random performance across models, attributes this to reliance on shallow spatial cues rather than coordinate-aware representations, shows marginal benefits from explicit reasoning, and partial gains from tuning that transfer to other spatial benchmarks.

Significance. If the central empirical claims survive rigorous controls for task artifacts and full methodological disclosure, the work would document a concrete limitation in current VLMs for embodied settings that require numerical spatial grounding, providing diagnostic evidence via error and intervention analyses that could guide future architectural or training interventions.

major comments (3)

[Abstract] Abstract: the claim of 'systematic evaluation, error analysis, and interventions' showing near-random performance is presented without any enumeration of the specific VLMs tested, dataset sizes or construction protocol, statistical tests, or control conditions, leaving the mapping from observed failure to 'failure to abstract structured spatial layouts' unverifiable.
[§3] Task formulation (Num2Space/Space2Num): the manuscript does not report explicit ablation or verification that superficial cues (object counts, prompt-length statistics, or language-model priors over number words) have been removed from the image-generation and prompt-construction pipelines; without such checks the near-random result cannot be unambiguously attributed to absence of coordinate grounding rather than task artifacts.
[§4] Experimental section: no details are supplied on model-selection criteria, number of models or runs, baseline comparisons that isolate spatial-numerical grounding from general visual or numerical competence, or the precise form of the 'controlled interventions,' all of which are load-bearing for the claim that VLMs 'rely heavily on shallow spatial cues.'
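
The cue check requested in the [§3] comment could be sketched as a cue-only baseline: predict the answer from a single superficial feature (for example object count or prompt length) and compare that with always guessing the most frequent answer. The function names and toy data below are hypothetical, and the estimate is in-sample, so it is an optimistic upper bound on how much a cue leaks.

```python
from collections import Counter, defaultdict
from typing import Hashable, Sequence


def majority_baseline(labels: Sequence[Hashable]) -> float:
    """Accuracy of always predicting the most frequent label."""
    return Counter(labels).most_common(1)[0][1] / len(labels)


def cue_only_accuracy(cues: Sequence[Hashable], labels: Sequence[Hashable]) -> float:
    """Accuracy of predicting, for each cue value, its most frequent label.

    No image or coordinate information is used, only the superficial cue.
    """
    by_cue = defaultdict(Counter)
    for cue, label in zip(cues, labels):
        by_cue[cue][label] += 1
    hits = sum(counter.most_common(1)[0][1] for counter in by_cue.values())
    return hits / len(labels)


# Toy data (hypothetical): object count per image and the correct option.
object_counts = [2, 2, 3, 3]
answers = ["A", "A", "B", "B"]
leak = cue_only_accuracy(object_counts, answers) - majority_baseline(answers)
```

A large positive `leak` means the cue alone predicts the answer, so low or high model accuracy on those items says little about coordinate grounding.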

minor comments (2)

[Abstract] Abstract: 'need produce' should read 'need to produce.'
[Throughout] Ensure consistent capitalization and first-use definition of 'VLM' / 'VLMs' and 'SpaceNum.'

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the careful reading and constructive suggestions. The comments highlight areas where additional methodological transparency will strengthen the paper. We address each major comment below and will revise the manuscript accordingly.

Referee: [Abstract] Abstract: the claim of 'systematic evaluation, error analysis, and interventions' showing near-random performance is presented without any enumeration of the specific VLMs tested, dataset sizes or construction protocol, statistical tests, or control conditions, leaving the mapping from observed failure to 'failure to abstract structured spatial layouts' unverifiable.

Authors: We agree the abstract is concise and omits enumeration of these details. The main text (Sections 3–4) contains the full list of evaluated VLMs, task construction protocol, dataset sizes, and the error/intervention analyses. In revision we will expand the abstract with a brief enumeration of the models tested and key dataset statistics while preserving length constraints, and will add explicit forward references to the statistical tests and controls. revision: yes
Referee: [§3] Task formulation (Num2Space/Space2Num): the manuscript does not report explicit ablation or verification that superficial cues (object counts, prompt-length statistics, or language-model priors over number words) have been removed from the image-generation and prompt-construction pipelines; without such checks the near-random result cannot be unambiguously attributed to absence of coordinate grounding rather than task artifacts.

Authors: This observation is correct; the original submission did not include dedicated ablations isolating object counts, prompt-length statistics, or LM priors over number words. While task construction in §3 varied spatial configurations to reduce such cues, we will add explicit ablation experiments in the revised §3 that systematically vary or control these factors and report the resulting performance to confirm the attribution to coordinate grounding. revision: yes
Referee: [§4] Experimental section: no details are supplied on model-selection criteria, number of models or runs, baseline comparisons that isolate spatial-numerical grounding from general visual or numerical competence, or the precise form of the 'controlled interventions,' all of which are load-bearing for the claim that VLMs 'rely heavily on shallow spatial cues.'

Authors: We acknowledge the need for greater detail. The revised experimental section will specify model-selection criteria, the exact number of models and runs, baseline comparisons that separate spatial-numerical grounding from general visual/numerical competence, and a precise description of each controlled intervention (including how they were implemented and what they isolate). These additions will directly support the claim regarding reliance on shallow cues. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical benchmarking study with no derivations or fitted predictions

full rationale

The paper is an empirical evaluation introducing Num2Space and Space2Num tasks to benchmark VLMs on spatial-numerical mapping. No mathematical derivation chain, first-principles predictions, or parameter fitting is present. Claims rest on observed model performance, error analysis, and interventions rather than any step that reduces by construction to its own inputs or self-citations. This matches the default expectation for non-circular empirical work.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim depends on the validity of the new tasks as measures of spatial grounding; no free parameters or invented entities are introduced.

axioms (1)

Domain assumption: The bidirectional tasks Num2Space and Space2Num accurately isolate genuine spatial numerical understanding from superficial cues.
The paper's conclusion that models fail to ground numbers rests on these tasks being faithful proxies.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/AlexanderDuality.lean · alexander_duality_circle_linking · unclear
Paper passage: "We formulate two bidirectional tasks, Num2Space and Space2Num, to evaluate how well VLMs map between vision-side spatial structure and language-side numerical representations."

IndisputableMonolith/Foundation/ArithmeticFromLogic.lean · embed_injective · unclear
Paper passage: "current VLMs rely heavily on shallow spatial cues, struggle to build stable coordinate-aware representations"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.


[1] [1]

Qwen2.5-VL Technical Report

Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, Humen Zhong, Yuanzhi Zhu, Mingkun Yang, Zhaohai Li, Jianqiang Wan, Pengfei Wang, Wei Ding, Zheren Fu, Yiheng Xu, Jiabo Ye, Xi Zhang, Tianbao Xie, Zesen Cheng, Hang Zhang, Zhibo Yang, Haiyang Xu, and Junyang Lin. Qwen2.5-vl technical report. a...

work page internal anchor Pith review Pith/arXiv arXiv 2025

[2] [2]

Blenderkit: Online asset library for blender

BlenderKit. Blenderkit: Online asset library for blender. https://www.blenderkit.com/, 2023

work page 2023

[3] [3]

Spatialvlm: Endowing vision-language models with spatial reasoning capabilities

Boyuan Chen, Zhuo Xu, Sean Kirmani, Brain Ichter, Dorsa Sadigh, Leonidas Guibas, and Fei Xia. Spatialvlm: Endowing vision-language models with spatial reasoning capabilities. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14455–14465, 2024

work page 2024

[4] [4]

Spacetools: Tool-augmented spatial reasoning via double interactive rl.arXiv preprint arXiv:2512.04069, 2025

Siyi Chen, Mikaela Angelina Uy, Chan Hee Song, Faisal Ladhak, Adithyavairavan Murali, Qing Qu, Stan Birchfield, Valts Blukis, and Jonathan Tremblay. Spacetools: Tool-augmented spatial reasoning via double interactive rl.arXiv preprint arXiv:2512.04069, 2025

work page arXiv 2025

[5] [5]

Spatialrgpt: Grounded spatial reasoning in vision-language models

An-Chieh Cheng, Hongxu Yin, Yang Fu, Qiushan Guo, Ruihan Yang, Jan Kautz, Xiaolong Wang, and Sifei Liu. Spatialrgpt: Grounded spatial reasoning in vision-language models. Advances in Neural Information Processing Systems, 37:135062–135093, 2024

work page 2024

[6] [6]

Instructblip: Towards general-purpose vision-language models with instruction tuning.Advances in neural information processing systems, 36:49250–49267, 2023

Wenliang Dai, Junnan Li, Dongxu Li, Anthony Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale N Fung, and Steven Hoi. Instructblip: Towards general-purpose vision-language models with instruction tuning.Advances in neural information processing systems, 36:49250–49267, 2023

work page 2023

[7] [7]

Mm-spatial: Exploring 3d spatial understanding in multimodal llms

Erik Daxberger, Nina Wenzel, David Griffiths, Haiming Gang, Justin Lazarow, Gefen Kohavi, Kai Kang, Marcin Eichner, Yinfei Yang, Afshin Dehghan, et al. Mm-spatial: Exploring 3d spatial understanding in multimodal llms. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 7395–7408, 2025

work page 2025

[8] [8]

Google DeepMind. Gemma 3. https://deepmind.google/models/gemma/gemma-3/,

work page

[9] [9]

Accessed: 2026-05-01

work page 2026

[10] [10]

Embspatial-bench: Benchmarking spatial understanding for embodied tasks with large vision-language models

Mengfei Du, Binhao Wu, Zejun Li, Xuan-Jing Huang, and Zhongyu Wei. Embspatial-bench: Benchmarking spatial understanding for embodied tasks with large vision-language models. InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (V olume 2: Short Papers), pages 346–355, 2024

work page 2024

[11] [11]

Video2layout: Recall and reconstruct metric-grounded cognitive map for spatial reasoning

Yibin Huang, Wang Xu, Wanyue Zhang, Helu Zhi, Jingjing Huang, Yangbin Xu, Yangang Sun, Conghui Zhu, and Tiejun Zhao. Video2layout: Recall and reconstruct metric-grounded cognitive map for spatial reasoning.arXiv preprint arXiv:2511.16160, 2025

work page arXiv 2025

[12] Mengdi Jia, Zekun Qi, Shaochen Zhang, Wenyao Zhang, Xinqiang Yu, Jiawei He, He Wang, and Li Yi. Omnispatial: Towards comprehensive spatial reasoning benchmark for vision language models. arXiv preprint arXiv:2506.03135, 2025.

[13] Amita Kamath, Jack Hessel, and Kai-Wei Chang. What’s “up” with vision-language models? investigating their struggle with spatial reasoning. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 9161–9175, 2023.

[14] Eric Kolve, Roozbeh Mottaghi, Winson Han, Eli VanderBilt, Luca Weihs, Alvaro Herrasti, Matt Deitke, Kiana Ehsani, Daniel Gordon, Yuke Zhu, et al. Ai2-thor: An interactive 3d environment for visual ai. arXiv preprint arXiv:1712.05474, 2017.

[15] Chengzu Li, Wenshan Wu, Huanyu Zhang, Yan Xia, Shaoguang Mao, Li Dong, Ivan Vulić, and Furu Wei. Imagine while reasoning in space: Multimodal visualization-of-thought. arXiv preprint arXiv:2501.07542, 2025.

[16] Yuan-Hong Liao, Rafid Mahmood, Sanja Fidler, and David Acuna. Reasoning paths with reference objects elicit quantitative spatial reasoning in large vision-language models. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 17028–17047, 2024.

[17] Fangyu Liu, Guy Emerson, and Nigel Collier. Visual spatial reasoning. Transactions of the Association for Computational Linguistics, 11:635–651, 2023.

[18] Shiyin Lu, Yang Li, Yu Xia, Yuwei Hu, Shanshan Zhao, Yanqing Ma, Zhichao Wei, Yinglun Li, Lunhao Duan, Jianshan Zhao, et al. Ovis2.5 technical report. arXiv preprint arXiv:2508.11737, 2025.

[19] Chenyang Ma, Kai Lu, Ta-Ying Cheng, Niki Trigoni, and Andrew Markham. Spatialpin: Enhancing spatial reasoning capabilities of vision-language models through prompting and interacting 3d priors. Advances in neural information processing systems, 37:68803–68832, 2024.

[20] Wufei Ma, Yu-Cheng Chou, Qihao Liu, Xingrui Wang, Celso de Melo, Jianwen Xie, and Alan Yuille. Spatialreasoner: Towards explicit and generalizable 3d spatial reasoning. arXiv preprint arXiv:2504.20024, 2025.

[21] NVIDIA. Cosmos-reason2: Open reasoning vision-language models for physical ai. https://huggingface.co/collections/nvidia/cosmos-reason2, 2026. Accessed: 2026-05-01.

[22] NVIDIA Corporation. Nvidia isaac sim. https://developer.nvidia.com/isaac-sim, 2023.

[23] Renjie Pi, Jianshu Zhang, Jipeng Zhang, Rui Pan, Zhekai Chen, and Tong Zhang. Image textualization: An automatic framework for creating accurate and detailed image descriptions. arXiv preprint arXiv:2406.07502, 2024.

[24] Navid Rajabi and Jana Kosecka. Gsr-bench: A benchmark for grounded spatial reasoning evaluation via multimodal llms. arXiv preprint arXiv:2406.13246, 2024.

[25] Arijit Ray, Jiafei Duan, Ellis Brown, Reuben Tan, Dina Bashkirova, Rose Hendrix, Kiana Ehsani, Aniruddha Kembhavi, Bryan A Plummer, Ranjay Krishna, et al. Sat: Dynamic spatial aptitude training for multimodal language models. arXiv preprint arXiv:2412.07755, 2024.

[26] Qwen Team. Qwen3 technical report, 2025. URL https://arxiv.org/abs/2505.09388.

[27] Weiyun Wang, Zhangwei Gao, Lixin Gu, Hengjun Pu, Long Cui, Xingguang Wei, Zhaoyang Liu, Linglin Jing, Shenglong Ye, Jie Shao, et al. Internvl3.5: Advancing open-source multimodal models in versatility, reasoning, and efficiency. arXiv preprint arXiv:2508.18265, 2025.

[28] Zixuan Wang, Huang Fang, Shaoan Wang, Yuanfei Luo, Heng Dong, Wei Li, and Yiming Gan. Hydra-nav: Object navigation via adaptive dual-process reasoning. arXiv preprint arXiv:2602.09972, 2026.

[29] Haoning Wu, Xiao Huang, Yaohui Chen, Ya Zhang, Yanfeng Wang, and Weidi Xie. Spatialscore: Towards unified evaluation for multimodal spatial understanding. arXiv e-prints, pages arXiv–2505, 2025.

[30] Zelin Xu, Yupu Zhang, Saugat Adhikari, Saiful Islam, Tingsong Xiao, Zibo Liu, Shigang Chen, Da Yan, and Zhe Jiang. Earthspatialbench: Benchmarking spatial reasoning capabilities of multimodal llms on earth imagery. arXiv preprint arXiv:2602.15918, 2026.

[31] Jihan Yang, Shusheng Yang, Anjali W Gupta, Rilyn Han, Li Fei-Fei, and Saining Xie. Thinking in space: How multimodal large language models see, remember, and recall spaces. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 10632–10643, 2025.

[32] Yuncong Yang, Jiageng Liu, Zheyuan Zhang, Siyuan Zhou, Reuben Tan, Jianwei Yang, Yilun Du, and Chuang Gan. Mindjourney: Test-time scaling with world models for spatial reasoning. arXiv preprint arXiv:2507.12508, 2025.

[33] Baiqiao Yin, Qineng Wang, Pingyue Zhang, Jianshu Zhang, Kangrui Wang, Zihan Wang, Jieyu Zhang, Keshigeyan Chandrasegaran, Han Liu, Ranjay Krishna, et al. Spatial mental modeling from limited views. In Structural Priors for Vision Workshop at ICCV’25, 2025.

[34] Weichen Zhang, Zile Zhou, Xin Zeng, Xuchen Liu, Jianjie Fang, Chen Gao, Yong Li, Jinqiang Cui, Xinlei Chen, and Xiao-Ping Zhang. Open3d-vqa: A benchmark for comprehensive spatial reasoning with multimodal large language model in open space. arXiv preprint arXiv:2503.11094, 2025.