GeoAlign: Geometric Feature Realignment for MLLM Spatial Reasoning

2026 · cs.CV · arXiv 2604.12630

1 Pith paper cites this work. Polarity classification is still indexing.

abstract

Multimodal large language models (MLLMs) have exhibited remarkable performance in various visual tasks, yet still struggle with spatial reasoning. Recent efforts mitigate this by injecting geometric features from 3D foundation models, but rely on static single-layer extractions. We identify that such an approach induces a task misalignment bias: the geometric features naturally evolve towards 3D pretraining objectives, which may contradict the heterogeneous spatial demands of MLLMs, rendering any single layer fundamentally insufficient. To resolve this, we propose GeoAlign, a novel framework that dynamically aggregates multi-layer geometric features to realign with the actual demands. GeoAlign constructs a hierarchical geometric feature bank and leverages the MLLM's original visual tokens as content-aware queries to perform layer-wise sparse routing, adaptively fetching the suitable geometric features for each patch. Extensive experiments on VSI-Bench, ScanQA, and SQA3D demonstrate that our compact 4B model effectively achieves state-of-the-art performance, even outperforming larger existing MLLMs.
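
The layer-wise sparse routing described in the abstract can be illustrated with a minimal sketch. The abstract does not give the scoring function, any learned projections, or the number of layers kept per patch, so everything here is an assumption for illustration only: the scaled dot-product score, the top-k selection with `k=2`, the softmax over selected layers, and the hypothetical names `route_patch` and `route_all`. It is not the authors' implementation.

```python
import math

def route_patch(query, layer_feats, k=2):
    """Fuse one patch's geometric features from its top-k scoring layers.

    query: list[float], the MLLM visual token for this patch (length D).
    layer_feats: L lists of length D, one geometric feature per layer.
    Returns (fused_feature, weights); weights has L entries, L - k of them zero.
    Assumption: scoring and weighting choices are illustrative, not from the paper.
    """
    d = len(query)
    # Scaled dot-product score between the query and each layer's feature.
    scores = [sum(q * f for q, f in zip(query, feat)) / math.sqrt(d)
              for feat in layer_feats]
    # Sparse routing: keep only the k highest-scoring layers.
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    # Softmax restricted to the selected layers (max-subtracted for stability).
    m = max(scores[i] for i in top)
    exps = {i: math.exp(scores[i] - m) for i in top}
    z = sum(exps.values())
    weights = [exps[i] / z if i in exps else 0.0 for i in range(len(scores))]
    # Weighted sum of the selected layers' features.
    fused = [sum(w * feat[j] for w, feat in zip(weights, layer_feats))
             for j in range(d)]
    return fused, weights

def route_all(visual_tokens, feature_bank, k=2):
    """Route every patch; feature_bank[l][p] is layer l's feature for patch p."""
    results = []
    for p, query in enumerate(visual_tokens):
        layer_feats = [layer[p] for layer in feature_bank]
        results.append(route_patch(query, layer_feats, k))
    return results
```

Because the routing is computed per patch, two patches in the same image can draw on different layers of the feature bank, which is the adaptive behaviour the abstract contrasts with static single-layer extraction.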

representative citing papers

Learning Geometric Representations from Videos for Spatial Intelligent Multimodal Large Language Models

cs.CV · 2026-06-04 · unverdicted · novelty 5.0

GeoVR distills camera pose, depth, scale, and multi-scale 3D features from pre-trained models into MLLMs via video supervision to improve spatial reasoning.

citing papers explorer

Showing 1 of 1 citing paper.

Learning Geometric Representations from Videos for Spatial Intelligent Multimodal Large Language Models

cs.CV · 2026-06-04 · unverdicted · none · ref 25 · internal anchor

GeoVR distills camera pose, depth, scale, and multi-scale 3D features from pre-trained models into MLLMs via video supervision to improve spatial reasoning.
