SpatialStack: Layered Geometry-Language Fusion for 3D VLM Spatial Reasoning

2026 · cs.CV · arXiv 2603.27437

1 Pith paper cites this work. Polarity classification is still indexing.


abstract

Large vision-language models (VLMs) still struggle with reliable 3D spatial reasoning, a core capability for embodied and physical AI systems. This limitation arises from their inability to capture fine-grained 3D geometry and spatial relationships. While recent efforts have introduced multi-view geometry transformers into VLMs, they typically fuse only the deep-layer features from vision and geometry encoders, discarding rich hierarchical signals and creating a fundamental bottleneck for spatial understanding. To overcome this, we propose SpatialStack, a general hierarchical fusion framework that progressively aligns vision, geometry, and language representations across the model hierarchy. Moving beyond conventional late-stage vision-geometry fusion, SpatialStack stacks and synchronizes multi-level geometric features with the language backbone, enabling the model to capture both local geometric precision and global contextual semantics. Building upon this framework, we develop VLM-SpatialStack, a model that achieves state-of-the-art performance on multiple 3D spatial reasoning benchmarks. Extensive experiments and ablations demonstrate that our multi-level fusion strategy consistently enhances 3D understanding and generalizes robustly across diverse spatial reasoning tasks, establishing SpatialStack as an effective and extensible design paradigm for vision-language-geometry integration in next-generation multimodal physical AI systems.
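The contrast the abstract draws between late-stage fusion and multi-level fusion can be illustrated with a toy sketch. This is not the authors' implementation: the additive injection, the `fuse_at` mapping, the layer stand-in `run_layer`, and all sizes are assumptions made purely for illustration, and the abstract does not specify how the features are actually combined.

```python
import math
import random

random.seed(0)
D_MODEL, D_GEO, N_LAYERS = 4, 3, 4  # toy sizes, chosen only for illustration

def rand_matrix(rows, cols, scale=0.1):
    """Random rows x cols matrix as nested lists."""
    return [[random.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]

def vec_mat(vec, mat):
    """Row vector (length = rows of mat) times matrix, giving a list of length cols."""
    return [sum(v * row[j] for v, row in zip(vec, mat)) for j in range(len(mat[0]))]

def add(a, b):
    """Element-wise sum of two equal-length vectors."""
    return [x + y for x, y in zip(a, b)]

def run_layer(hidden, weight):
    """Hypothetical stand-in for one language-backbone layer (residual + tanh)."""
    return add(hidden, [math.tanh(x) for x in vec_mat(hidden, weight)])

def fuse_multilevel(hidden, geo_feats, projections, layer_weights, fuse_at):
    """Inject projected geometry features before the layers named in fuse_at.

    fuse_at maps a backbone layer index to a geometry feature level (assumed scheme).
    """
    for depth, weight in enumerate(layer_weights):
        if depth in fuse_at:
            level = fuse_at[depth]
            hidden = add(hidden, vec_mat(geo_feats[level], projections[level]))
        hidden = run_layer(hidden, weight)
    return hidden

layer_weights = [rand_matrix(D_MODEL, D_MODEL) for _ in range(N_LAYERS)]
geo_feats = [[random.gauss(0.0, 1.0) for _ in range(D_GEO)] for _ in range(3)]
projections = [rand_matrix(D_GEO, D_MODEL) for _ in range(3)]
token = [random.gauss(0.0, 1.0) for _ in range(D_MODEL)]

# Multi-level: shallow, middle, and deep geometry levels enter at successive layers.
stacked = fuse_multilevel(token, geo_feats, projections, layer_weights, {0: 0, 1: 1, 2: 2})
# Assumed baseline: only the deepest geometry level is fused, once, at the input.
deep_only = fuse_multilevel(token, geo_feats, projections, layer_weights, {0: 2})
print(len(stacked), len(deep_only))
```

In this sketch the only difference between the two calls is the `fuse_at` mapping, which makes it easy to see that the multi-level variant exposes the backbone to every geometry level, while the baseline sees just one.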

representative citing papers

Learning Geometric Representations from Videos for Spatial Intelligent Multimodal Large Language Models

cs.CV · 2026-06-04 · unverdicted · novelty 5.0

GeoVR distills camera pose, depth, scale, and multi-scale 3D features from pre-trained models into MLLMs via video supervision to improve spatial reasoning.

citing papers explorer


Learning Geometric Representations from Videos for Spatial Intelligent Multimodal Large Language Models

cs.CV · 2026-06-04 · unverdicted · none · ref 45 · internal anchor

GeoVR distills camera pose, depth, scale, and multi-scale 3D features from pre-trained models into MLLMs via video supervision to improve spatial reasoning.
