Spatial reasoning in multimodal large language models: A survey of tasks, benchmarks and methods

Weichen Liu, Qiyao Xue, Haoming Wang, Xiangyu Yin, Boyuan Yang, Wei Gao · 2025 · arXiv 2511.15722

8 Pith papers cite this work. Polarity classification is still indexing.

8 Pith papers citing it

read on arXiv browse 8 citing papers

citation-role summary

background 4

citation-polarity summary

background 4

representative citing papers

VGenST-Bench: A Benchmark for Spatio-Temporal Reasoning via Active Video Synthesis

cs.CV · 2026-05-21 · unverdicted · novelty 7.0

VGenST-Bench is a new video benchmark for MLLM spatio-temporal reasoning built via generative synthesis, a multi-agent pipeline with human oversight, a 3x2x2 taxonomy, and hierarchical tasks separating perception from reasoning.

MPDocBench-Parse: Benchmarking Practical Multi-page Document Parsing

cs.AI · 2026-05-21 · unverdicted · novelty 7.0

MPDocBench-Parse provides a 3,246-page benchmark and evaluation protocol for multi-page document parsing that tests text/table/formula extraction, merging, figure handling, reading order, and heading hierarchy.

ArchSIBench: Benchmarking the Architectural Spatial Intelligence of Vision-Language Models

cs.CV · 2026-05-20 · unverdicted · novelty 7.0

ArchSIBench is a new benchmark dataset and evaluation suite that measures vision-language models on architectural spatial intelligence across 17 subtasks, showing most models lag human baselines especially in transformation and configuration.

ViSRA: A Video-based Spatial Reasoning Agent for Multi-modal Large Language Models

cs.CV · 2026-05-11 · unverdicted · novelty 7.0

ViSRA boosts MLLM 3D spatial reasoning performance by up to 28.9% on unseen tasks via a plug-and-play video-based agent that extracts explicit spatial cues from expert models without any post-training.

OralMLLM-Bench: Evaluating Cognitive Capabilities of Multimodal Large Language Models in Dental Practice

cs.CL · 2026-05-02 · unverdicted · novelty 7.0 · 2 refs

OralMLLM-Bench reveals performance gaps between multimodal large language models and clinicians on cognitive tasks for dental radiographic analysis across periapical, panoramic, and cephalometric images.

SpaMEM: Benchmarking Dynamic Spatial Reasoning via Perception-Memory Integration in Embodied Environments

cs.CV · 2026-04-24 · unverdicted · novelty 7.0

SpaMEM benchmark shows multimodal LLMs succeed at spatial tasks with text histories but sharply fail at long-horizon belief maintenance from raw visual streams alone.

PanoWorld: Towards Spatial Supersensing in 360$^\circ$ Panorama World

cs.CV · 2026-05-13 · unverdicted · novelty 6.0 · 2 refs

PanoWorld adds spherical spatial cross-attention and pano-native training data to MLLMs for improved spatial reasoning on ERP panoramas, outperforming baselines on new and existing benchmarks.

Uncovering and Shaping the Latent Representation of 3D Scene Topology in Vision-Language Models

cs.CV · 2026-05-08 · unverdicted · novelty 6.0

VLMs possess a latent 3D scene topology subspace corresponding to Laplacian eigenmaps that can be causally shaped via Dirichlet energy regularization to improve spatial task performance by up to 12.1%.

citing papers explorer

Showing 8 of 8 citing papers.

VGenST-Bench: A Benchmark for Spatio-Temporal Reasoning via Active Video Synthesis cs.CV · 2026-05-21 · unverdicted · none · ref 52
VGenST-Bench is a new video benchmark for MLLM spatio-temporal reasoning built via generative synthesis, a multi-agent pipeline with human oversight, a 3x2x2 taxonomy, and hierarchical tasks separating perception from reasoning.
MPDocBench-Parse: Benchmarking Practical Multi-page Document Parsing cs.AI · 2026-05-21 · unverdicted · none · ref 65
MPDocBench-Parse provides a 3,246-page benchmark and evaluation protocol for multi-page document parsing that tests text/table/formula extraction, merging, figure handling, reading order, and heading hierarchy.
ArchSIBench: Benchmarking the Architectural Spatial Intelligence of Vision-Language Models cs.CV · 2026-05-20 · unverdicted · none · ref 22
ArchSIBench is a new benchmark dataset and evaluation suite that measures vision-language models on architectural spatial intelligence across 17 subtasks, showing most models lag human baselines especially in transformation and configuration.
ViSRA: A Video-based Spatial Reasoning Agent for Multi-modal Large Language Models cs.CV · 2026-05-11 · unverdicted · none · ref 30
ViSRA boosts MLLM 3D spatial reasoning performance by up to 28.9% on unseen tasks via a plug-and-play video-based agent that extracts explicit spatial cues from expert models without any post-training.
OralMLLM-Bench: Evaluating Cognitive Capabilities of Multimodal Large Language Models in Dental Practice cs.CL · 2026-05-02 · unverdicted · none · ref 2 · 2 links
OralMLLM-Bench reveals performance gaps between multimodal large language models and clinicians on cognitive tasks for dental radiographic analysis across periapical, panoramic, and cephalometric images.
SpaMEM: Benchmarking Dynamic Spatial Reasoning via Perception-Memory Integration in Embodied Environments cs.CV · 2026-04-24 · unverdicted · none · ref 45
SpaMEM benchmark shows multimodal LLMs succeed at spatial tasks with text histories but sharply fail at long-horizon belief maintenance from raw visual streams alone.
PanoWorld: Towards Spatial Supersensing in 360$^\circ$ Panorama World cs.CV · 2026-05-13 · unverdicted · none · ref 29 · 2 links
PanoWorld adds spherical spatial cross-attention and pano-native training data to MLLMs for improved spatial reasoning on ERP panoramas, outperforming baselines on new and existing benchmarks.
Uncovering and Shaping the Latent Representation of 3D Scene Topology in Vision-Language Models cs.CV · 2026-05-08 · unverdicted · none · ref 10
VLMs possess a latent 3D scene topology subspace corresponding to Laplacian eigenmaps that can be causally shaped via Dirichlet energy regularization to improve spatial task performance by up to 12.1%.

Spatial reasoning in multimodal large language models: A survey of tasks, benchmarks and methods

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer