Do MLLMs Really Understand Space? A Mathematical Reasoning Evaluation

· 2026 · cs.AI · arXiv 2602.11635

1 Pith paper cite this work. Polarity classification is still indexing.

1 Pith paper citing it

open full Pith review browse 1 citing papers arXiv PDF

abstract

Multimodal large language models (MLLMs) have achieved strong performance on perception-oriented tasks, yet their ability to perform mathematical spatial reasoning, defined as the capacity to parse and manipulate two- and three-dimensional relations, remains unclear. Humans easily solve textbook-style spatial reasoning problems with over 95\% accuracy, but we find that most leading MLLMs fail to reach even 60\% on the same tasks. This striking gap highlights spatial reasoning as a fundamental weakness of current models. To investigate this gap, we present \emph{MathSpatial}, the first large-scale and systematic dataset resource dedicated to mathematical spatial reasoning in MLLMs. \emph{MathSpatial} provides two complementary subsets: (i)~\emph{MathSpatial-Bench}, a rigorously curated evaluation set of 2{,}000 problems spanning 3 categories and 11 subtypes, designed to isolate spatial reasoning from perceptual noise; and (ii)~\emph{MathSpatial-Corpus}, a training set of 8{,}000 problems equipped with verified solutions and structured reasoning traces. All problems are sourced from authentic educational materials and undergo multi-stage quality control including deduplication, geometric consistency checking, and cross-validated solution verification. Benchmarking 16 leading MLLMs on \emph{MathSpatial-Bench} reveals that spatial reasoning remains a fundamental bottleneck: even GPT-5 lags behind human performance by over 35 percentage points, with particularly poor results on abstract deduction tasks. We further show that training on \emph{MathSpatial-Corpus} yields consistent improvements across model families, demonstrating the dataset's practical value for advancing spatial reasoning capabilities. \emph{MathSpatial} is publicly available at https://shuolucs.github.io/MathSpatial.

representative citing papers

WorldCoder-Bench: Benchmarking Physically Grounded 3D World Synthesis

cs.AI · 2026-06-01 · unverdicted · novelty 7.0

Introduces WorldCoder-Bench and StateProbe for evaluating LLM-generated physically grounded 3D browser worlds, with frontier models reaching at most 27.8% verification coverage.

citing papers explorer

Showing 1 of 1 citing paper.

WorldCoder-Bench: Benchmarking Physically Grounded 3D World Synthesis cs.AI · 2026-06-01 · unverdicted · none · ref 23 · internal anchor
Introduces WorldCoder-Bench and StateProbe for evaluating LLM-generated physically grounded 3D browser worlds, with frontier models reaching at most 27.8% verification coverage.

Do MLLMs Really Understand Space? A Mathematical Reasoning Evaluation

fields

years

verdicts

representative citing papers

citing papers explorer