arxiv: 2506.14512 · v4 · submitted 2025-06-17 · 💻 cs.CV

SIRI-Bench: Challenging VLMs' Spatial Intelligence through Complex Reasoning Tasks

Zijian Song , Xiaoxin Lin , Qiuming Huang , Sihan Qin , Guangrun Wang , Liang Lin This is my paper

classification 💻 cs.CV

keywords reasoningspatialvlmssiri-benchcomplexintelligencestructuraltasks

0 comments

read the original abstract

Large Language Models (LLMs) have undergone rapid progress, largely attributed to reinforcement learning on complex reasoning tasks. In contrast, while spatial intelligence is fundamental for Vision-Language Models (VLMs) in real-world interaction, the systematic study of their complex spatial reasoning remains underexplored. To bridge this gap, we introduce SIRI-Bench, a benchmark designed to evaluate VLMs' structural spatial intelligence through spatial-grounded reasoning tasks. SIRI-Bench comprises 9,000 video-question-answer triplets, where each problem is embedded in a realistic 3D scene. The benchmark is carefully designed so that solving each problem requires both spatial comprehension and structural reasoning. To facilitate large-scale data synthesis, we develop an Automatic Scene Creation Engine that employs collaborative LLM agents to translate abstract mathematical problems into faithful 3D scenes. Experimental results reveal that state-of-the-art VLMs struggle significantly on SIRI-Bench, underscoring the challenge of structural spatial reasoning. We hope that our study will bring researchers' attention to spatially grounded reasoning and advance VLMs in visual problem-solving.

This paper has not been read by Pith yet.

SIRI-Bench: Challenging VLMs' Spatial Intelligence through Complex Reasoning Tasks

discussion (0)