Pith · machine review for the scientific record

arxiv: 2602.11635 · v2 · submitted 2026-02-12 · 💻 cs.AI

Recognition: 2 theorem links · Lean Theorem

Do MLLMs Really Understand Space? A Mathematical Reasoning Evaluation

Authors on Pith · no claims yet

Pith reviewed 2026-05-16 03:35 UTC · model grok-4.3

classification 💻 cs.AI
keywords multimodal large language models · spatial reasoning · mathematical reasoning · benchmarks · AI evaluation · geometric problems · deductive reasoning · training datasets

The pith

Multimodal models lag humans by more than 35 points on mathematical spatial reasoning tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that current multimodal large language models have a clear weakness in mathematical spatial reasoning, defined as parsing and manipulating two- and three-dimensional relations. Humans solve the same textbook problems with over 95 percent accuracy while most models stay below 60 percent. The authors built MathSpatial-Bench with 2,000 quality-controlled problems across three categories and eleven subtypes, plus an 8,000-problem training corpus with verified solutions. Results from 16 leading models confirm spatial reasoning as a persistent bottleneck, especially on abstract deduction, and demonstrate that training on the corpus produces consistent gains.

Core claim

Multimodal large language models exhibit a fundamental limitation in mathematical spatial reasoning. On the MathSpatial-Bench of 2,000 problems spanning three categories and eleven subtypes, even GPT-5 trails human performance by more than 35 percentage points, with the largest shortfalls on abstract deduction tasks. Training on the MathSpatial-Corpus of 8,000 problems equipped with verified solutions and structured traces yields consistent improvements across model families.

What carries the argument

MathSpatial-Bench, a set of 2,000 problems in three categories and eleven subtypes that uses multi-stage quality control including geometric consistency checks to isolate spatial reasoning from perceptual noise.
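
The paper describes its geometric consistency checking only at the pipeline level. As a purely illustrative sketch of what such a filter could look like (the problem schema, field names, and tolerances below are assumptions, not the authors' implementation), a curation pass might reject problems whose stated measurements cannot coexist:

```python
import math

# Hypothetical problem record; this schema is assumed for illustration and is
# not the format used by MathSpatial.
problem = {
    "id": "demo-001",
    "shape": "triangle",
    "sides": [3.0, 4.0, 5.0],            # side lengths stated in the problem
    "angles_deg": [36.87, 53.13, 90.0],   # interior angles stated in the problem
}

def triangle_is_consistent(sides, angles_deg, tol_deg=0.1):
    """Return True only if the stated lengths and angles can describe one triangle."""
    a, b, c = sorted(sides)
    # Triangle inequality: the two shorter sides must strictly exceed the longest.
    if a + b <= c:
        return False
    # Interior angles must sum to 180 degrees (within tolerance).
    if abs(sum(angles_deg) - 180.0) > tol_deg:
        return False
    # Law of cosines: the angle opposite the longest side must match the largest stated angle.
    opposite_longest = math.degrees(math.acos((a**2 + b**2 - c**2) / (2 * a * b)))
    return abs(opposite_longest - max(angles_deg)) <= tol_deg

if problem["shape"] == "triangle":
    verdict = "keep" if triangle_is_consistent(problem["sides"], problem["angles_deg"]) else "discard"
    print(f"{problem['id']}: {verdict}")
```

A real curation pass would also have to compare the stated constraints against the rendered diagram; that cross-modal check is the harder step and the one that actually separates spatial reasoning from perceptual noise.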

If this is right

  • Spatial reasoning constitutes a core bottleneck that current MLLMs must overcome for broader mathematical competence.
  • Training on dedicated spatial-reasoning data produces measurable gains across different model families.
  • Abstract deduction tasks expose the largest performance deficits compared with other spatial subtypes.
  • The dataset supplies both an evaluation standard and a training resource for future model development.
  • Closing the gap with human performance would require targeted advances beyond general scaling.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Current architectures may depend more on statistical correlations than on genuine internal spatial representations.
  • Dedicated spatial modules or training regimes could narrow the remaining gap with human-level performance.
  • The identified weakness may constrain downstream applications such as automated geometry solvers or robotic planning.

Load-bearing premise

The multi-stage quality controls successfully separate spatial reasoning ability from perceptual shortcuts or pattern matching in the problems.

What would settle it

An MLLM reaching 95 percent accuracy on MathSpatial-Bench without any exposure to the training corpus or similar spatial problems would show the performance gap is not a fundamental architectural limit.
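
That settling condition presupposes a fixed scoring procedure. A minimal sketch of how per-category accuracy and the human gap could be tallied on MathSpatial-Bench (the category sizes come from Figure 4 of the paper; the correct-answer counts and the 95% human baseline treated as a point value are placeholders for illustration, not reported results):

```python
# Category sizes are taken from Figure 4 of the paper; the "correct" counts are
# invented placeholders, not results reported by the authors.
categories = {
    "Holistic Recognition": {"total": 518, "correct": 380},
    "Generative Inference": {"total": 636, "correct": 350},
    "Abstract Deduction":   {"total": 846, "correct": 310},
}
HUMAN_BASELINE = 0.95  # the paper reports humans exceed 95% accuracy

total = sum(c["total"] for c in categories.values())     # 2,000 problems overall
correct = sum(c["correct"] for c in categories.values())
overall = correct / total

for name, c in categories.items():
    print(f"{name:22s} {c['correct'] / c['total']:6.1%}  (n={c['total']})")
print(f"{'Overall':22s} {overall:6.1%}  (n={total})")
print(f"Gap to the human baseline: {(HUMAN_BASELINE - overall) * 100:.1f} points")
```

With these placeholder counts the overall accuracy lands near 52 percent, a gap of roughly 43 points, which is the shape of comparison the paper reports for GPT-5 and the other benchmarked models.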

Figures

Figures reproduced from arXiv: 2602.11635 by Ao Ma, Jianjie Cheng, Jian Liang, Lijun Sheng, Lingxiao He, Meng Wang, Nicu Sebe, Peijie Wang, Qianlong Xie, Ran He, Run Ling, Shuo Lu, Siru Jiang, Wei Feng, Xingxing Wang, Yihua Shao, Yinuo Xu, Yongcan Yu, Yongguan Hu.

Figure 1. Left: On MathSpatial-Bench, humans achieve over 95% accuracy while most MLLMs remain below 60%. Right: Three core challenges of spatial reasoning and the design of MathSpatial to address them.
Figure 2. MathSpatial source data construction pipeline: Data Collection and Curation → Standardization → Geometric Consistency Checking → Solution Verification.
Figure 3. Selected examples demonstrating the diverse problems…
Figure 4. MathSpatial-Bench distribution and composition: 518 problems in Holistic Recognition, 636 in Generative Inference, and 846 in Abstract Deduction, with the detailed distribution across all subcategories.
Figure 5. Fine-grained error analysis on MathSpatial-Bench. (a) Error frequency distribution for baselines across 6 subcategories. (b) Overall error rate breakdown by failure mode.
read the original abstract

Multimodal large language models (MLLMs) have achieved strong performance on perception-oriented tasks, yet their ability to perform mathematical spatial reasoning, defined as the capacity to parse and manipulate two- and three-dimensional relations, remains unclear. Humans easily solve textbook-style spatial reasoning problems with over 95% accuracy, but we find that most leading MLLMs fail to reach even 60% on the same tasks. This striking gap highlights spatial reasoning as a fundamental weakness of current models. To investigate this gap, we present MathSpatial, the first large-scale and systematic dataset resource dedicated to mathematical spatial reasoning in MLLMs. MathSpatial provides two complementary subsets: (i) MathSpatial-Bench, a rigorously curated evaluation set of 2,000 problems spanning 3 categories and 11 subtypes, designed to isolate spatial reasoning from perceptual noise; and (ii) MathSpatial-Corpus, a training set of 8,000 problems equipped with verified solutions and structured reasoning traces. All problems are sourced from authentic educational materials and undergo multi-stage quality control including deduplication, geometric consistency checking, and cross-validated solution verification. Benchmarking 16 leading MLLMs on MathSpatial-Bench reveals that spatial reasoning remains a fundamental bottleneck: even GPT-5 lags behind human performance by over 35 percentage points, with particularly poor results on abstract deduction tasks. We further show that training on MathSpatial-Corpus yields consistent improvements across model families, demonstrating the dataset's practical value for advancing spatial reasoning capabilities. MathSpatial is publicly available at https://shuolucs.github.io/MathSpatial.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces MathSpatial, a new dataset resource for mathematical spatial reasoning in MLLMs consisting of MathSpatial-Bench (a 2,000-problem evaluation set spanning 3 categories and 11 subtypes) and MathSpatial-Corpus (an 8,000-problem training set with verified solutions). It benchmarks 16 leading MLLMs on the bench, reports that even GPT-5 lags human performance by over 35 percentage points with especially weak results on abstract deduction, and shows consistent gains from training on the corpus.

Significance. If the curation successfully isolates spatial reasoning, the work is significant: it supplies the first large-scale, education-sourced benchmark and training resource for this capability, quantifies a clear performance gap, and demonstrates that targeted data can improve results across model families.

major comments (3)
  1. [§3] §3 (Dataset Construction): the multi-stage quality control (deduplication, geometric consistency checking, cross-validated solution verification) is described only at a high level; no quantitative metrics (inter-annotator agreement, post-verification error rates, or ablation showing isolation from perceptual factors) are reported, leaving the central isolation claim without direct empirical support. A sketch of one such agreement metric follows the minor comments.
  2. [§4.1] §4.1 (Benchmarking): overall accuracy figures are given for 16 models, but per-subtype breakdowns and statistical significance tests for the abstract-deduction category are missing, so the claim of “particularly poor results” on those tasks cannot be fully evaluated from the presented data.
  3. [§4.2] §4.2 (Training Experiments): gains from fine-tuning on MathSpatial-Corpus are shown, yet no controls for dataset size, content overlap with existing pre-training corpora, or comparison to generic math data are included, weakening attribution of improvements specifically to spatial-reasoning enhancement.
minor comments (2)
  1. [Abstract] Abstract and §1: the model referred to as “GPT-5” should be clarified (version, access date, or whether it is a stand-in) for reproducibility.
  2. [§4] Figure captions and Table 1: ensure all subtype labels match the 11 subtypes listed in §3.1 and that axis labels include units or scale information.
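
As flagged in major comment 1, the quality-control claims would be easier to weigh with a standard agreement statistic. A minimal sketch of Cohen's kappa over two hypothetical verification passes (the labels below are invented for illustration; the paper reports no such annotations):

```python
from collections import Counter

# Hypothetical accept/reject decisions from two independent solution verifiers;
# these labels are invented for illustration, not data from the paper.
verifier_a = ["accept", "accept", "reject", "accept", "reject", "accept"]
verifier_b = ["accept", "reject", "reject", "accept", "reject", "accept"]

def cohens_kappa(rater_a, rater_b):
    """Chance-corrected agreement between two raters over the same items."""
    n = len(rater_a)
    observed = sum(x == y for x, y in zip(rater_a, rater_b)) / n
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a | counts_b) / (n * n)
    return (observed - expected) / (1 - expected)

print(f"Cohen's kappa = {cohens_kappa(verifier_a, verifier_b):.2f}")
```

Reporting a figure like this per subtype, alongside post-verification error rates, would give the isolation claim the empirical footing the report asks for.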

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim depends on the validity of the dataset as a pure measure of spatial reasoning, which is assumed after quality controls but not independently verified in the abstract.

axioms (1)
  • domain assumption: The selected problems from educational materials test mathematical spatial reasoning independently of perceptual abilities.
    Invoked in the description of MathSpatial-Bench to isolate the capability.

pith-pipeline@v0.9.0 · 5675 in / 1132 out tokens · 156560 ms · 2026-05-16T03:35:19.177346+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

44 extracted references · 44 canonical work pages · 6 internal anchors

  1. [1]

    Anthropic. 2024. Claude 3.5 Sonnet. https://www.anthropic.com/news/claude-3-5-sonnet

  2. [2]

    Anthropic. 2025. Claude 3.7 Sonnet. https://www.anthropic.com/news/claude-3-7-sonnet

  3. [3]

    Anthropic. 2025. System Card: Claude Opus 4 & Claude Sonnet 4. https://www.anthropic.com/claude-4-system-card

  4. [4]

    Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. 2025. Qwen2.5-VL technical report. arXiv preprint arXiv:2502.13923 (2025)

  5. [5]

    Wenxiao Cai, Iaroslav Ponomarenko, Jianhao Yuan, Xiaoqi Li, Wankou Yang, Hao Dong, and Bo Zhao. 2025. SpatialBot: Precise spatial understanding with vision language models. In 2025 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 9490–9498

  6. [6]

    Boyuan Chen, Zhuo Xu, Sean Kirmani, Brain Ichter, Dorsa Sadigh, Leonidas Guibas, and Fei Xia. 2024. SpatialVLM: Endowing vision-language models with spatial reasoning capabilities. In Proc. CVPR. 14455–14465

  7. [7]

    Zhe Chen, Weiyun Wang, Yue Cao, Yangzhou Liu, Zhangwei Gao, Erfei Cui, Jinguo Zhu, Shenglong Ye, Hao Tian, Zhaoyang Liu, et al. 2024. Expanding performance boundaries of open-source multimodal models with model, data, and test-time scaling. arXiv preprint arXiv:2412.05271 (2024)

  8. [8]

    An-Chieh Cheng, Hongxu Yin, Yang Fu, Qiushan Guo, Ruihan Yang, Jan Kautz, Xiaolong Wang, and Sifei Liu. 2024. SpatialRGPT: Grounded spatial reasoning in vision-language models. In Proc. NeurIPS, Vol. 37. 135062–135093

  9. [9]

    Erik Daxberger, Nina Wenzel, David Griffiths, Haiming Gang, Justin Lazarow, Gefen Kohavi, Kai Kang, Marcin Eichner, Yinfei Yang, Afshin Dehghan, et al

  10. [10]

    MM-Spatial: Exploring 3D spatial understanding in multimodal LLMs. In Proc. ICCV. 7395–7408

  11. [11]

    Mengfei Du, Binhao Wu, Zejun Li, Xuan-Jing Huang, and Zhongyu Wei. 2024. EmbSpatial-Bench: Benchmarking spatial understanding for embodied tasks with large vision-language models. In Proc. ACL. 346–355

  12. [12]

    Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, et al. 2024. The Llama 3 herd of models. arXiv e-prints (2024), arXiv–2407

  13. [13]

    Jie Feng, Jinwei Zeng, Qingyue Long, Hongyi Chen, Jie Zhao, Yanxin Xi, Zhilun Zhou, Yuan Yuan, Shengyuan Wang, Qingbin Zeng, et al. 2025. A survey of large language model-powered spatial intelligence across scales: Advances in embodied agents, smart cities, and earth science. arXiv preprint arXiv:2504.09848 (2025)

  14. [14]

    Xingyu Fu, Yushi Hu, Bangzheng Li, Yu Feng, Haoyu Wang, Xudong Lin, Dan Roth, Noah A Smith, Wei-Chiu Ma, and Ranjay Krishna. 2024. BLINK: Multimodal large language models can see but not perceive. In Proc. ECCV. Springer, 148–166

  15. [15]

    Google Gemini Team. 2025. Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities. Technical Report. DeepMind / Google. https://arxiv.org/abs/2507.06261

  16. [16]

    Wenyi Hong, Wenmeng Yu, Xiaotao Gu, Guo Wang, Guobing Gan, Haomiao Tang, Jiale Cheng, Ji Qi, Junhui Ji, Lihang Pan, et al. 2025. GLM-4.1V-Thinking: Towards versatile multimodal reasoning with scalable reinforcement learning. arXiv e-prints (2025), arXiv–2507

  17. [17]

    Mengdi Jia, Zekun Qi, Shaochen Zhang, Wenyao Zhang, Xinqiang Yu, Jiawei He, He Wang, and Li Yi. 2025. OmniSpatial: Towards Comprehensive Spatial Reasoning Benchmark for Vision Language Models. arXiv preprint arXiv:2506.03135 (2025)

  18. [18]

    Jiayi Kuang, Ying Shen, Jingyou Xie, Haohao Luo, Zhe Xu, Ronghao Li, Yinghui Li, Xianfeng Cheng, Xika Lin, and Yu Han. 2025. Natural language understanding and inference with MLLM in visual question answering: A survey. Comput. Surveys 57, 8 (2025), 1–36

  19. [19]

    Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. 2023. BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In Proc. ICML. PMLR, 19730–19742

  20. [20]

    Linjie Li, Mahtab Bigverdi, Jiawei Gu, Zixian Ma, Yinuo Yang, Ziang Li, Yejin Choi, and Ranjay Krishna. 2025. Unfolding Spatial Cognition: Evaluating Multimodal Models on Visual Simulations. arXiv preprint arXiv:2506.04633 (2025)

  21. [21]

    Wufei Ma, Haoyu Chen, Guofeng Zhang, Yu-Cheng Chou, Jieneng Chen, Celso de Melo, and Alan Yuille. 2025. 3DSRBench: A comprehensive 3D spatial reasoning benchmark. In Proc. CVPR. 6924–6934

  22. [22]

    OpenAI. 2024. Hello GPT-4o. https://openai.com/index/hello-gpt-4o/

  23. [23]

    OpenAI. 2025. GPT-5 System Card. Technical report. OpenAI. Accessed: 2025-08-10

  24. [24]

    OpenAI. 2025. Introducing GPT-4.1 in the API. https://openai.com/index/gpt-4-1/

  25. [25]

    Long Phan, Alice Gatti, Ziwen Han, Nathaniel Li, Josephina Hu, Hugh Zhang, Chen Bo Calvin Zhang, Mohamed Shaaban, John Ling, Sean Shi, et al. 2025. Humanity's last exam. arXiv preprint arXiv:2501.14249 (2025)

  26. [26]

    Arijit Ray, Jiafei Duan, Ellis Brown, Reuben Tan, Dina Bashkirova, Rose Hendrix, Kiana Ehsani, Aniruddha Kembhavi, Bryan A Plummer, Ranjay Krishna, et al

  27. [27]

    SAT: Dynamic Spatial Aptitude Training for Multimodal Language Models. arXiv preprint arXiv:2412.07755 (2024)

  28. [28]

    Sara Sarto, Marcella Cornia, Rita Cucchiara, et al. 2025. Image Captioning Evaluation in the Age of Multimodal LLMs: Challenges and Future Perspectives. In IJCAI

  29. [29]

    Ilias Stogiannidis, Steven McDonagh, and Sotirios A Tsaftaris. 2025. Mind the gap: Benchmarking spatial reasoning in vision-language models. arXiv preprint arXiv:2503.19707 (2025)

  30. [30]

    Emilia Szymańska, Mihai Dusmanu, Jan-Willem Buurlage, Mahdi Rad, and Marc Pollefeys. 2024. Space3d-bench: Spatial 3d question answering benchmark. In Proc. ECCV. Springer, 68–85

  31. [31]

    Kexian Tang, Junyao Gao, Yanhong Zeng, Haodong Duan, Yanan Sun, Zhening Xing, Wenran Liu, Kaifeng Lyu, and Kai Chen. 2025. LEGO-Puzzles: How Good Are MLLMs at Multi-Step Spatial Reasoning? arXiv preprint arXiv:2503.19990 (2025)

  32. [32]

    Kexin Tian, Jingrui Mao, Yunlong Zhang, Jiwan Jiang, Yang Zhou, and Zhengzhong Tu. 2025. NuScenes-SpatialQA: A spatial understanding and reasoning benchmark for vision-language models in autonomous driving. arXiv preprint arXiv:2504.03164 (2025)

  33. [33]

    Jiayu Wang, Yifei Ming, Zhenmei Shi, Vibhav Vineet, Xin Wang, Sharon Li, and Neel Joshi. 2024. Is a picture worth a thousand words? Delving into spatial reasoning for vision language models. In Proc. NeurIPS

  34. [34]

    Peijie Wang, Chao Yang, Zhong-Zhi Li, Fei Yin, Dekang Ran, Mi Tian, Zhilong Ji, Jinfeng Bai, and Cheng-Lin Liu. 2025. SOLIDGEO: Measuring Multimodal Spatial Math Reasoning in Solid Geometry. In Proc. NeurIPS

  35. [35]

    Siting Wang, Luoyang Sun, Cheng Deng, Kun Shao, Minnan Pei, Zheng Tian, Haifeng Zhang, and Jun Wang. 2025. SpatialViz-Bench: Automatically generated spatial visualization reasoning tasks for MLLMs. arXiv preprint arXiv:2507.07610 (2025)

  36. [36]

    Diankun Wu, Fangfu Liu, Yi-Hsin Hung, and Yueqi Duan. 2025. Spatial-MLLM: Boosting MLLM capabilities in visual-based spatial intelligence. arXiv preprint arXiv:2505.23747 (2025)

  37. [37]

    Jihan Yang, Shusheng Yang, Anjali W Gupta, Rilyn Han, Li Fei-Fei, and Saining Xie

  38. [38]

    Thinking in space: How multimodal large language models see, remember, and recall spaces. In Proc. CVPR. 10632–10643

  39. [39]

    Sihan Yang, Runsen Xu, Yiman Xie, Sizhe Yang, Mo Li, Jingli Lin, Chenming Zhu, Xiaochen Chen, Haodong Duan, Xiangyu Yue, et al. 2025. MMSI-Bench: A Benchmark for Multi-Image Spatial Intelligence. arXiv preprint arXiv:2505.23764 (2025)

  40. [40]

    Shaokai Ye, Haozhe Qi, Alexander Mathis, and Mackenzie W Mathis. 2025. LLaVAction: Evaluating and training multi-modal large language models for action recognition. arXiv preprint arXiv:2503.18712 (2025)

  41. [41]

    Yifu Yuan, Haiqin Cui, Yibin Chen, Zibin Dong, Fei Ni, Longxin Kou, Jinyi Liu, Pengyi Li, Yan Zheng, and Jianye Hao. 2025. From Seeing to Doing: Bridging Reasoning and Decision for Robotic Manipulation. arXiv preprint arXiv:2505.08548 (2025)

  42. [42]

    Jirong Zha, Yuxuan Fan, Xiao Yang, Chen Gao, and Xinlei Chen. 2025. How to enable LLM with 3D capacity? A survey of spatial reasoning in LLM. In IJCAI

  43. [43]

    Jiaxin Zhang, Zhong-Zhi Li, Ming-Liang Zhang, Fei Yin, Cheng-Lin Liu, and Yashar Moshfeghi. 2024. GeoEval: Benchmark for evaluating LLMs and multi-modal models on geometry problem-solving. In Proc. ACL. 1258–1276

  44. [44]

    Jinguo Zhu, Weiyun Wang, Zhe Chen, Zhaoyang Liu, Shenglong Ye, Lixin Gu, Hao Tian, Yuchen Duan, Weijie Su, Jie Shao, et al. 2025. InternVL3: Exploring advanced training and test-time recipes for open-source multimodal models. arXiv preprint arXiv:2504.10479 (2025)