Pith · machine review for the scientific record

arxiv: 2602.11635 · v2 · submitted 2026-02-12 · 💻 cs.AI

Recognition: 2 theorem links · Lean Theorem

Do MLLMs Really Understand Space? A Mathematical Reasoning Evaluation

Authors on Pith · no claims yet

Pith reviewed 2026-05-16 03:35 UTC · model grok-4.3

classification 💻 cs.AI
keywords multimodal large language models · spatial reasoning · mathematical reasoning · benchmarks · AI evaluation · geometric problems · deductive reasoning · training datasets

The pith

Multimodal models lag humans by more than 35 points on mathematical spatial reasoning tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that current multimodal large language models have a clear weakness in mathematical spatial reasoning, defined as parsing and manipulating two- and three-dimensional relations. Humans solve the same textbook problems with over 95 percent accuracy while most models stay below 60 percent. The authors built MathSpatial-Bench with 2,000 quality-controlled problems across three categories and eleven subtypes, plus an 8,000-problem training corpus with verified solutions. Results from 16 leading models confirm spatial reasoning as a persistent bottleneck, especially on abstract deduction, and demonstrate that training on the corpus produces consistent gains.

Core claim

Multimodal large language models exhibit a fundamental limitation in mathematical spatial reasoning. On the MathSpatial-Bench of 2,000 problems spanning three categories and eleven subtypes, even GPT-5 trails human performance by more than 35 percentage points, with the largest shortfalls on abstract deduction tasks. Training on the MathSpatial-Corpus of 8,000 problems equipped with verified solutions and structured traces yields consistent improvements across model families.

What carries the argument

MathSpatial-Bench, a set of 2,000 problems in three categories and eleven subtypes that uses multi-stage quality control including geometric consistency checks to isolate spatial reasoning from perceptual noise.
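
The paper describes its geometric consistency checking only at the pipeline level. As a purely illustrative sketch of what such a filter could look like (the problem schema, field names, and tolerances below are assumptions, not the authors' implementation), a curation pass might reject problems whose stated measurements cannot coexist:

```python
import math

# Hypothetical problem record; this schema is assumed for illustration and is
# not the format used by MathSpatial.
problem = {
    "id": "demo-001",
    "shape": "triangle",
    "sides": [3.0, 4.0, 5.0],            # side lengths stated in the problem
    "angles_deg": [36.87, 53.13, 90.0],   # interior angles stated in the problem
}

def triangle_is_consistent(sides, angles_deg, tol_deg=0.1):
    """Return True only if the stated lengths and angles can describe one triangle."""
    a, b, c = sorted(sides)
    # Triangle inequality: the two shorter sides must strictly exceed the longest.
    if a + b <= c:
        return False
    # Interior angles must sum to 180 degrees (within tolerance).
    if abs(sum(angles_deg) - 180.0) > tol_deg:
        return False
    # Law of cosines: the angle opposite the longest side must match the largest stated angle.
    opposite_longest = math.degrees(math.acos((a**2 + b**2 - c**2) / (2 * a * b)))
    return abs(opposite_longest - max(angles_deg)) <= tol_deg

if problem["shape"] == "triangle":
    verdict = "keep" if triangle_is_consistent(problem["sides"], problem["angles_deg"]) else "discard"
    print(f"{problem['id']}: {verdict}")
```

A real curation pass would also have to compare the stated constraints against the rendered diagram; that cross-modal check is the harder step and the one that actually separates spatial reasoning from perceptual noise.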

If this is right

  • Spatial reasoning constitutes a core bottleneck that current MLLMs must overcome for broader mathematical competence.
  • Training on dedicated spatial-reasoning data produces measurable gains across different model families.
  • Abstract deduction tasks expose the largest performance deficits compared with other spatial subtypes.
  • The dataset supplies both an evaluation standard and a training resource for future model development.
  • Closing the gap with human performance would require targeted advances beyond general scaling.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Current architectures may depend more on statistical correlations than on genuine internal spatial representations.
  • Dedicated spatial modules or training regimes could narrow the remaining gap with human-level performance.
  • The identified weakness may constrain downstream applications such as automated geometry solvers or robotic planning.

Load-bearing premise

The multi-stage quality controls successfully separate spatial reasoning ability from perceptual shortcuts or pattern matching in the problems.

What would settle it

An MLLM reaching 95 percent accuracy on MathSpatial-Bench without any exposure to the training corpus or similar spatial problems would show the performance gap is not a fundamental architectural limit.
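
That settling condition presupposes a fixed scoring procedure. A minimal sketch of how per-category accuracy and the human gap could be tallied on MathSpatial-Bench (the category sizes come from Figure 4 of the paper; the correct-answer counts and the 95% human baseline treated as a point value are placeholders for illustration, not reported results):

```python
# Category sizes are taken from Figure 4 of the paper; the "correct" counts are
# invented placeholders, not results reported by the authors.
categories = {
    "Holistic Recognition": {"total": 518, "correct": 380},
    "Generative Inference": {"total": 636, "correct": 350},
    "Abstract Deduction":   {"total": 846, "correct": 310},
}
HUMAN_BASELINE = 0.95  # the paper reports humans exceed 95% accuracy

total = sum(c["total"] for c in categories.values())     # 2,000 problems overall
correct = sum(c["correct"] for c in categories.values())
overall = correct / total

for name, c in categories.items():
    print(f"{name:22s} {c['correct'] / c['total']:6.1%}  (n={c['total']})")
print(f"{'Overall':22s} {overall:6.1%}  (n={total})")
print(f"Gap to the human baseline: {(HUMAN_BASELINE - overall) * 100:.1f} points")
```

With these placeholder counts the overall accuracy lands near 52 percent, a gap of roughly 43 points, which is the shape of comparison the paper reports for GPT-5 and the other benchmarked models.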

Figures

Figures reproduced from arXiv: 2602.11635 by Ao Ma, Jianjie Cheng, Jian Liang, Lijun Sheng, Lingxiao He, Meng Wang, Nicu Sebe, Peijie Wang, Qianlong Xie, Ran He, Run Ling, Shuo Lu, Siru Jiang, Wei Feng, Xingxing Wang, Yihua Shao, Yinuo Xu, Yongcan Yu, Yongguan Hu.

Figure 1. Left: On MathSpatial-Bench, humans achieve over 95% accuracy while most MLLMs remain below 60%. Right: Three core challenges of spatial reasoning and the design of MathSpatial to address them.
Figure 2. MathSpatial source data construction pipeline: Data Collection and Curation → Standardization → Geometric Consistency Checking → Solution Verification.
Figure 3. Selected examples demonstrating the diverse problems…
Figure 4. MathSpatial-Bench distribution and composition: 518 problems in Holistic Recognition, 636 in Generative Inference, and 846 in Abstract Deduction, with the detailed distribution across all subcategories.
Figure 5. Fine-grained error analysis on MathSpatial-Bench. (a) Error frequency distribution for baselines across 6 subcategories. (b) Overall error rate breakdown by failure mode.
read the original abstract

Multimodal large language models (MLLMs) have achieved strong performance on perception-oriented tasks, yet their ability to perform mathematical spatial reasoning, defined as the capacity to parse and manipulate two- and three-dimensional relations, remains unclear. Humans easily solve textbook-style spatial reasoning problems with over 95% accuracy, but we find that most leading MLLMs fail to reach even 60% on the same tasks. This striking gap highlights spatial reasoning as a fundamental weakness of current models. To investigate this gap, we present MathSpatial, the first large-scale and systematic dataset resource dedicated to mathematical spatial reasoning in MLLMs. MathSpatial provides two complementary subsets: (i) MathSpatial-Bench, a rigorously curated evaluation set of 2,000 problems spanning 3 categories and 11 subtypes, designed to isolate spatial reasoning from perceptual noise; and (ii) MathSpatial-Corpus, a training set of 8,000 problems equipped with verified solutions and structured reasoning traces. All problems are sourced from authentic educational materials and undergo multi-stage quality control including deduplication, geometric consistency checking, and cross-validated solution verification. Benchmarking 16 leading MLLMs on MathSpatial-Bench reveals that spatial reasoning remains a fundamental bottleneck: even GPT-5 lags behind human performance by over 35 percentage points, with particularly poor results on abstract deduction tasks. We further show that training on MathSpatial-Corpus yields consistent improvements across model families, demonstrating the dataset's practical value for advancing spatial reasoning capabilities. MathSpatial is publicly available at https://shuolucs.github.io/MathSpatial.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces MathSpatial, a new dataset resource for mathematical spatial reasoning in MLLMs consisting of MathSpatial-Bench (a 2,000-problem evaluation set spanning 3 categories and 11 subtypes) and MathSpatial-Corpus (an 8,000-problem training set with verified solutions). It benchmarks 16 leading MLLMs on the bench, reports that even GPT-5 lags human performance by over 35 percentage points with especially weak results on abstract deduction, and shows consistent gains from training on the corpus.

Significance. If the curation successfully isolates spatial reasoning, the work is significant: it supplies the first large-scale, education-sourced benchmark and training resource for this capability, quantifies a clear performance gap, and demonstrates that targeted data can improve results across model families.

major comments (3)
  1. [§3] §3 (Dataset Construction): the multi-stage quality control (deduplication, geometric consistency checking, cross-validated solution verification) is described only at a high level; no quantitative metrics (inter-annotator agreement, post-verification error rates, or ablation showing isolation from perceptual factors) are reported, leaving the central isolation claim without direct empirical support. A sketch of one such agreement metric follows the minor comments.
  2. [§4.1] §4.1 (Benchmarking): overall accuracy figures are given for 16 models, but per-subtype breakdowns and statistical significance tests for the abstract-deduction category are missing, so the claim of “particularly poor results” on those tasks cannot be fully evaluated from the presented data.
  3. [§4.2] §4.2 (Training Experiments): gains from fine-tuning on MathSpatial-Corpus are shown, yet no controls for dataset size, content overlap with existing pre-training corpora, or comparison to generic math data are included, weakening attribution of improvements specifically to spatial-reasoning enhancement.
minor comments (2)
  1. [Abstract] Abstract and §1: the model referred to as “GPT-5” should be clarified (version, access date, or whether it is a stand-in) for reproducibility.
  2. [§4] Figure captions and Table 1: ensure all subtype labels match the 11 subtypes listed in §3.1 and that axis labels include units or scale information.
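
As flagged in major comment 1, the quality-control claims would be easier to weigh with a standard agreement statistic. A minimal sketch of Cohen's kappa over two hypothetical verification passes (the labels below are invented for illustration; the paper reports no such annotations):

```python
from collections import Counter

# Hypothetical accept/reject decisions from two independent solution verifiers;
# these labels are invented for illustration, not data from the paper.
verifier_a = ["accept", "accept", "reject", "accept", "reject", "accept"]
verifier_b = ["accept", "reject", "reject", "accept", "reject", "accept"]

def cohens_kappa(rater_a, rater_b):
    """Chance-corrected agreement between two raters over the same items."""
    n = len(rater_a)
    observed = sum(x == y for x, y in zip(rater_a, rater_b)) / n
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a | counts_b) / (n * n)
    return (observed - expected) / (1 - expected)

print(f"Cohen's kappa = {cohens_kappa(verifier_a, verifier_b):.2f}")
```

Reporting a figure like this per subtype, alongside post-verification error rates, would give the isolation claim the empirical footing the report asks for.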

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim depends on the validity of the dataset as a pure measure of spatial reasoning, which is assumed after quality controls but not independently verified in the abstract.

axioms (1)
  • domain assumption: The selected problems from educational materials test mathematical spatial reasoning independently of perceptual abilities.
    Invoked in the description of MathSpatial-Bench to isolate the capability.

pith-pipeline@v0.9.0 · 5675 in / 1132 out tokens · 156560 ms · 2026-05-16T03:35:19.177346+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

44 extracted references · 44 canonical work pages · 6 internal anchors

  1. [1]

    Anthropic. 2024. Claude 3.5 Sonnet. https://www.anthropic.com/news/claude-3-5-sonnet

  2. [2]

    Anthropic. 2025. Claude 3.7 Sonnet. https://www.anthropic.com/news/claude-3-7-sonnet

  3. [3]

    Anthropic. 2025. System Card: Claude Opus 4 & Claude Sonnet 4. https://www.anthropic.com/claude-4-system-card

  4. [4]

    Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. 2025. Qwen2.5-VL technical report. arXiv preprint arXiv:2502.13923 (2025)

  5. [5]

    Wenxiao Cai, Iaroslav Ponomarenko, Jianhao Yuan, Xiaoqi Li, Wankou Yang, Hao Dong, and Bo Zhao. 2025. SpatialBot: Precise spatial understanding with vision language models. In 2025 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 9490–9498

  6. [6]

    Boyuan Chen, Zhuo Xu, Sean Kirmani, Brain Ichter, Dorsa Sadigh, Leonidas Guibas, and Fei Xia. 2024. SpatialVLM: Endowing vision-language models with spatial reasoning capabilities. In Proc. CVPR. 14455–14465

  7. [7]

    Zhe Chen, Weiyun Wang, Yue Cao, Yangzhou Liu, Zhangwei Gao, Erfei Cui, Jinguo Zhu, Shenglong Ye, Hao Tian, Zhaoyang Liu, et al. 2024. Expanding performance boundaries of open-source multimodal models with model, data, and test-time scaling. arXiv preprint arXiv:2412.05271 (2024)

  8. [8]

    An-Chieh Cheng, Hongxu Yin, Yang Fu, Qiushan Guo, Ruihan Yang, Jan Kautz, Xiaolong Wang, and Sifei Liu. 2024. SpatialRGPT: Grounded spatial reasoning in vision-language models. In Proc. NeurIPS, Vol. 37. 135062–135093

  9. [9]

    Erik Daxberger, Nina Wenzel, David Griffiths, Haiming Gang, Justin Lazarow, Gefen Kohavi, Kai Kang, Marcin Eichner, Yinfei Yang, Afshin Dehghan, et al

  10. [10]

    MM-Spatial: Exploring 3D spatial understanding in multimodal LLMs. In Proc. ICCV. 7395–7408

  11. [11]

    Mengfei Du, Binhao Wu, Zejun Li, Xuan-Jing Huang, and Zhongyu Wei. 2024. EmbSpatial-Bench: Benchmarking spatial understanding for embodied tasks with large vision-language models. In Proc. ACL. 346–355

  12. [12]

    Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, et al. 2024. The Llama 3 herd of models. arXiv e-prints (2024), arXiv–2407

  13. [13]

    Jie Feng, Jinwei Zeng, Qingyue Long, Hongyi Chen, Jie Zhao, Yanxin Xi, Zhilun Zhou, Yuan Yuan, Shengyuan Wang, Qingbin Zeng, et al. 2025. A survey of large language model-powered spatial intelligence across scales: Advances in embodied agents, smart cities, and earth science. arXiv preprint arXiv:2504.09848 (2025)

  14. [14]

    Xingyu Fu, Yushi Hu, Bangzheng Li, Yu Feng, Haoyu Wang, Xudong Lin, Dan Roth, Noah A Smith, Wei-Chiu Ma, and Ranjay Krishna. 2024. BLINK: Multimodal large language models can see but not perceive. In Proc. ECCV. Springer, 148–166

  15. [15]

    Google Gemini Team. 2025. Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities. Technical Report. DeepMind / Google. https://arxiv.org/abs/2507.06261

  16. [16]

    Wenyi Hong, Wenmeng Yu, Xiaotao Gu, Guo Wang, Guobing Gan, Haomiao Tang, Jiale Cheng, Ji Qi, Junhui Ji, Lihang Pan, et al. 2025. GLM-4.1V-Thinking: Towards versatile multimodal reasoning with scalable reinforcement learning. arXiv e-prints (2025), arXiv–2507

  17. [17]

    Mengdi Jia, Zekun Qi, Shaochen Zhang, Wenyao Zhang, Xinqiang Yu, Jiawei He, He Wang, and Li Yi. 2025. OmniSpatial: Towards Comprehensive Spatial Reasoning Benchmark for Vision Language Models. arXiv preprint arXiv:2506.03135 (2025)

  18. [18]

    Jiayi Kuang, Ying Shen, Jingyou Xie, Haohao Luo, Zhe Xu, Ronghao Li, Yinghui Li, Xianfeng Cheng, Xika Lin, and Yu Han. 2025. Natural language understanding and inference with MLLM in visual question answering: A survey. Comput. Surveys 57, 8 (2025), 1–36

  19. [19]

    Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. 2023. BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In Proc. ICML. PMLR, 19730–19742

  20. [20]

    Linjie Li, Mahtab Bigverdi, Jiawei Gu, Zixian Ma, Yinuo Yang, Ziang Li, Yejin Choi, and Ranjay Krishna. 2025. Unfolding Spatial Cognition: Evaluating Multimodal Models on Visual Simulations. arXiv preprint arXiv:2506.04633 (2025)

  21. [21]

    Wufei Ma, Haoyu Chen, Guofeng Zhang, Yu-Cheng Chou, Jieneng Chen, Celso de Melo, and Alan Yuille. 2025. 3DSRBench: A comprehensive 3D spatial reasoning benchmark. In Proc. CVPR. 6924–6934

  22. [22]

    OpenAI. 2024. Hello GPT-4o. https://openai.com/index/hello-gpt-4o/

  23. [23]

    OpenAI. 2025. GPT-5 System Card. Technical report. OpenAI. Accessed: 2025-08-10

  24. [24]

    OpenAI. 2025. Introducing GPT-4.1 in the API. https://openai.com/index/gpt-4-1/

  25. [25]

    Long Phan, Alice Gatti, Ziwen Han, Nathaniel Li, Josephina Hu, Hugh Zhang, Chen Bo Calvin Zhang, Mohamed Shaaban, John Ling, Sean Shi, et al. 2025. Humanity's last exam. arXiv preprint arXiv:2501.14249 (2025)

  26. [26]

    Arijit Ray, Jiafei Duan, Ellis Brown, Reuben Tan, Dina Bashkirova, Rose Hendrix, Kiana Ehsani, Aniruddha Kembhavi, Bryan A Plummer, Ranjay Krishna, et al

  27. [27]

    SAT: Dynamic Spatial Aptitude Training for Multimodal Language Models. arXiv preprint arXiv:2412.07755 (2024)

  28. [28]

    Sara Sarto, Marcella Cornia, Rita Cucchiara, et al. 2025. Image Captioning Evaluation in the Age of Multimodal LLMs: Challenges and Future Perspectives. In IJCAI

  29. [29]

    Ilias Stogiannidis, Steven McDonagh, and Sotirios A Tsaftaris. 2025. Mind the gap: Benchmarking spatial reasoning in vision-language models. arXiv preprint arXiv:2503.19707 (2025)

  30. [30]

    Emilia Szymańska, Mihai Dusmanu, Jan-Willem Buurlage, Mahdi Rad, and Marc Pollefeys. 2024. Space3d-bench: Spatial 3d question answering benchmark. In Proc. ECCV. Springer, 68–85

  31. [31]

    Kexian Tang, Junyao Gao, Yanhong Zeng, Haodong Duan, Yanan Sun, Zhening Xing, Wenran Liu, Kaifeng Lyu, and Kai Chen. 2025. LEGO-Puzzles: How Good Are MLLMs at Multi-Step Spatial Reasoning? arXiv preprint arXiv:2503.19990 (2025)

  32. [32]

    Kexin Tian, Jingrui Mao, Yunlong Zhang, Jiwan Jiang, Yang Zhou, and Zhengzhong Tu. 2025. NuScenes-SpatialQA: A spatial understanding and reasoning benchmark for vision-language models in autonomous driving. arXiv preprint arXiv:2504.03164 (2025)

  33. [33]

    Jiayu Wang, Yifei Ming, Zhenmei Shi, Vibhav Vineet, Xin Wang, Sharon Li, and Neel Joshi. 2024. Is a picture worth a thousand words? Delving into spatial reasoning for vision language models. In Proc. NeurIPS

  34. [34]

    Peijie Wang, Chao Yang, Zhong-Zhi Li, Fei Yin, Dekang Ran, Mi Tian, Zhilong Ji, Jinfeng Bai, and Cheng-Lin Liu. 2025. SOLIDGEO: Measuring Multimodal Spatial Math Reasoning in Solid Geometry. In Proc. NeurIPS

  35. [35]

    Siting Wang, Luoyang Sun, Cheng Deng, Kun Shao, Minnan Pei, Zheng Tian, Haifeng Zhang, and Jun Wang. 2025. SpatialViz-Bench: Automatically generated spatial visualization reasoning tasks for MLLMs. arXiv preprint arXiv:2507.07610 (2025)

  36. [36]

    Diankun Wu, Fangfu Liu, Yi-Hsin Hung, and Yueqi Duan. 2025. Spatial-MLLM: Boosting MLLM capabilities in visual-based spatial intelligence. arXiv preprint arXiv:2505.23747 (2025)

  37. [37]

    Jihan Yang, Shusheng Yang, Anjali W Gupta, Rilyn Han, Li Fei-Fei, and Saining Xie

  38. [38]

    Thinking in space: How multimodal large language models see, remember, and recall spaces. In Proc. CVPR. 10632–10643

  39. [39]

    Sihan Yang, Runsen Xu, Yiman Xie, Sizhe Yang, Mo Li, Jingli Lin, Chenming Zhu, Xiaochen Chen, Haodong Duan, Xiangyu Yue, et al. 2025. MMSI-Bench: A Benchmark for Multi-Image Spatial Intelligence. arXiv preprint arXiv:2505.23764 (2025)

  40. [40]

    Shaokai Ye, Haozhe Qi, Alexander Mathis, and Mackenzie W Mathis. 2025. LLaVAction: Evaluating and training multi-modal large language models for action recognition. arXiv preprint arXiv:2503.18712 (2025)

  41. [41]

    Yifu Yuan, Haiqin Cui, Yibin Chen, Zibin Dong, Fei Ni, Longxin Kou, Jinyi Liu, Pengyi Li, Yan Zheng, and Jianye Hao. 2025. From Seeing to Doing: Bridging Reasoning and Decision for Robotic Manipulation. arXiv preprint arXiv:2505.08548 (2025)

  42. [42]

    Jirong Zha, Yuxuan Fan, Xiao Yang, Chen Gao, and Xinlei Chen. 2025. How to enable LLM with 3D capacity? A survey of spatial reasoning in LLM. In IJCAI

  43. [43]

    Jiaxin Zhang, Zhong-Zhi Li, Ming-Liang Zhang, Fei Yin, Cheng-Lin Liu, and Yashar Moshfeghi. 2024. GeoEval: Benchmark for evaluating LLMs and multi-modal models on geometry problem-solving. In Proc. ACL. 1258–1276

  44. [44]

    Jinguo Zhu, Weiyun Wang, Zhe Chen, Zhaoyang Liu, Shenglong Ye, Lixin Gu, Hao Tian, Yuchen Duan, Weijie Su, Jie Shao, et al. 2025. InternVL3: Exploring advanced training and test-time recipes for open-source multimodal models. arXiv preprint arXiv:2504.10479 (2025)