MMSI-Bench: A Benchmark for Multi-Image Spatial Intelligence

Chenming Zhu; Dahua Lin; Haodong Duan; Jiangmiao Pang; Jingli Lin; Mo Li; Runsen Xu; Sihan Yang; Sizhe Yang; Tai Wang

arxiv: 2505.23764 · v3 · pith:JRN2MVLBnew · submitted 2025-05-29 · 💻 cs.CV · cs.CL

MMSI-Bench: A Benchmark for Multi-Image Spatial Intelligence

Sihan Yang , Runsen Xu , Yiman Xie , Sizhe Yang , Mo Li , Jingli Lin , Chenming Zhu , Xiaochen Chen

show 5 more authors

Haodong Duan Xiangyu Yue Dahua Lin Tai Wang Jiangmiao Pang

This is my paper

classification 💻 cs.CV cs.CL

keywords reasoningspatialerrorsintelligencemmsi-benchmulti-imagebenchmarkchallenging

0 comments

read the original abstract

Spatial intelligence is essential for multimodal large language models (MLLMs) operating in the complex physical world. Existing benchmarks, however, probe only single-image relations and thus fail to assess the multi-image spatial reasoning that real-world deployments demand. We introduce MMSI-Bench, a VQA benchmark dedicated to multi-image spatial intelligence. Six 3D-vision researchers spent more than 300 hours meticulously crafting 1,000 challenging, unambiguous multiple-choice questions from over 120,000 images, each paired with carefully designed distractors and a stepwise reasoning process. We conduct extensive experiments and evaluate 37 open-source and proprietary MLLMs, observing a wide gap: the strongest open-source model attains roughly 30% accuracy and OpenAI's GPT-5 reasoning model reaches 40%, while humans score 97%. These results underscore the challenging nature of MMSI-Bench and the substantial headroom for future research. Leveraging the annotated reasoning processes, we also provide an automated error analysis pipeline that diagnoses four dominant failure modes, including (1) grounding errors, (2) overlap-matching and scene-reconstruction errors, (3) situation-transformation reasoning errors, and (4) spatial-logic errors, offering insights for advancing spatial intelligence. Project page: https://runsenxu.com/projects/MMSI_Bench .

This paper has not been read by Pith yet.

discussion (0)

Forward citations

Cited by 19 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

DRIVESPATIAL: A Benchmark for Spatiotemporal Intelligence in VLMs for Autonomous Driving
cs.CV 2026-05 unverdicted novelty 7.0

DriveSpatial benchmark shows the best of 15 VLMs trails humans by 28.4 points on spatiotemporal driving tasks, with cognitive scene construction as the main failure mode.
Ablate-to-Validate: Are Vision-Language Models Really Using Continuous Thought Tokens?
cs.CV 2026-05 unverdicted novelty 7.0

The Token Replacement Test shows VLMs keep most accuracy gains even after corrupting or replacing continuous thought token content, indicating the tokens are not used as information bottlenecks.
WildRoadBench: A Wild Aerial Road-Damage Grounding Benchmark for Vision-Language Models and Autonomous Agents
cs.CV 2026-05 accept novelty 7.0

WildRoadBench provides a professionally annotated UAV corpus and dual-track protocol showing frontier VLMs and LLM agents achieve limited performance on wild aerial road-damage grounding under unified metrics.
SiMing-Bench: Evaluating Procedural Correctness from Continuous Interactions in Clinical Skill Videos
cs.CV 2026-04 unverdicted novelty 7.0

SiMing-Bench shows current MLLMs have weak agreement with physicians on procedural correctness in clinical videos, with intermediate step judgments remaining poor even when overall scores look acceptable.
SCP: Spatial Causal Prediction in Video
cs.CV 2026-03 unverdicted novelty 7.0

SCP defines a new benchmark task for predicting spatial causal outcomes beyond direct observation and shows that 23 leading models lag far behind humans on it.
4D-RGPT: Toward Region-level 4D Understanding via Perceptual Distillation
cs.CV 2025-12 unverdicted novelty 7.0

4D-RGPT uses perceptual 4D distillation to boost region-level 4D perception in multimodal LLMs and reports gains on existing and new video QA benchmarks.
Cambrian-P: Pose-Grounded Video Understanding
cs.CV 2026-05 unverdicted novelty 6.0

Cambrian-P adds per-frame camera pose tokens and a regression head to video MLLMs, delivering 4.5-6.5% gains on spatial benchmarks, generalization to other video QA tasks, and SOTA streaming pose estimation on ScanNet.
MiniCPM-o 4.5: Towards Real-Time Full-Duplex Omni-Modal Interaction
cs.CL 2026-04 unverdicted novelty 6.0

MiniCPM-o 4.5 uses the Omni-Flow streaming framework to deliver real-time full-duplex omni-modal interaction with proactive behavior in a 9B model that approaches Gemini 2.5 Flash performance.
World2VLM: Distilling World Model Imagination into VLMs for Dynamic Spatial Reasoning
cs.CV 2026-04 unverdicted novelty 6.0

Distilling view-consistent future views and action-outcome supervision from a generative world model into a VLM via two-stage post-training improves dynamic spatial reasoning on SAT-Real, VSI-Bench and similar benchma...
Let Geometry GUIDE: Layer-wise Unrolling of Geometric Priors in Multimodal LLMs
cs.CV 2026-04 unverdicted novelty 6.0

GUIDE unrolls multi-granularity geometric priors layer-wise into early MLLM layers with gating to improve spatial reasoning and perception.
Decoding the Pulse of Reasoning VLMs in Multi-Image Understanding Tasks
cs.CV 2026-03 unverdicted novelty 6.0

PulseFocus improves multi-image reasoning in VLMs by interleaving planning and attention-gated focus blocks during chain-of-thought, achieving gains on BLINK and MuirBench.
Do MLLMs Really Understand Space? A Mathematical Reasoning Evaluation
cs.AI 2026-02 unverdicted novelty 6.0

MLLMs show a large gap in spatial mathematical reasoning compared to humans, and a new 10,000-problem dataset helps narrow it through training.
A Progressive Training Strategy for Vision-Language Models to Counteract Spatio-Temporal Hallucinations in Embodied Reasoning
cs.AI 2026-04 unverdicted novelty 5.0

A progressive training framework using spatiotemporal chain-of-thought data reduces the forward-backward temporal query performance gap in VLMs from over 70% to 6.53%.
OpenSpatial: A Principled Data Engine for Empowering Spatial Intelligence
cs.CL 2026-04 unverdicted novelty 5.0

OpenSpatial supplies a principled open-source data engine and 3-million-sample dataset that raises spatial-reasoning model performance by an average of 19 percent on benchmarks.
Seed1.8 Model Card: Towards Generalized Real-World Agency
cs.AI 2026-03 unverdicted novelty 5.0

Seed1.8 is a new foundation model that adds unified agentic capabilities for search, code execution, and GUI interaction to existing LLM and vision strengths.
JoyAI-Image: Awaking Spatial Intelligence in Unified Multimodal Understanding and Generation
cs.GR 2026-05 unverdicted novelty 4.0

JoyAI-Image unifies visual understanding, generation, and editing in one model and claims stronger spatial intelligence through bidirectional perception-generation loops.
JoyAI-Image: Awaking Spatial Intelligence in Unified Multimodal Understanding and Generation
cs.GR 2026-05 unverdicted novelty 4.0

JoyAI-Image unifies visual understanding and generation via an MLLM-MMDiT architecture with spatial training signals to reach competitive benchmark performance and stronger spatial intelligence.
XEmbodied: A Foundation Model with Enhanced Geometric and Physical Cues for Large-Scale Embodied Environments
cs.CV 2026-04 unverdicted novelty 4.0

XEmbodied is a foundation model that integrates 3D geometric and physical signals into VLMs using a 3D Adapter and Efficient Image-Embodied Adapter, plus progressive curriculum and RL post-training, to improve spatial...
Aligning Perception, Reasoning, Modeling and Interaction: A Survey on Physical AI
cs.AI 2025-10 unverdicted novelty 4.0

A survey of physical AI that distinguishes theoretical physics reasoning from applied understanding and synthesizes advances in symbolic reasoning, embodied systems, and generative models to advocate for physics-groun...