
arxiv: 2604.12630 · v1 · submitted 2026-04-14 · 💻 cs.CV · cs.CL

GeoAlign: Geometric Feature Realignment for MLLM Spatial Reasoning

Pith reviewed 2026-05-10 15:54 UTC · model grok-4.3

classification 💻 cs.CV cs.CL
keywords geometric feature alignment · multimodal large language models · spatial reasoning · multi-layer feature aggregation · visual token queries · 3D feature injection · task misalignment · sparse routing

The pith

Dynamically routing multi-layer geometric features with visual tokens as queries realigns them to MLLM spatial needs

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

MLLMs struggle with spatial reasoning even when geometric features from 3D foundation models are injected. The root problem, according to the paper, is that a single fixed layer of such a model carries biases from its original pretraining objectives, which can clash with the varied spatial questions the language model must answer. GeoAlign builds a bank of features drawn from multiple layers and lets each visual token from the MLLM act as a query that sparsely routes to the layers best suited to its patch. With this adaptive selection, a 4B-parameter model reaches state-of-the-art results on three benchmarks (VSI-Bench, ScanQA, and SQA3D) and surpasses larger MLLMs.

Core claim

Static single-layer geometric feature extraction induces a task misalignment bias because the features naturally evolve toward 3D pretraining objectives that may contradict the heterogeneous spatial demands of MLLMs, rendering any single layer fundamentally insufficient. GeoAlign resolves the mismatch by constructing a hierarchical geometric feature bank and leveraging the MLLM's original visual tokens as content-aware queries to perform layer-wise sparse routing that adaptively fetches suitable geometric features for each patch.

What carries the argument

Hierarchical geometric feature bank queried by visual tokens for layer-wise sparse routing
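
The routing idea can be illustrated with a minimal sketch for a single patch. This is not the authors' implementation: the per-layer key vectors, the scaled dot-product scoring, the softmax over the kept layers, and the `top_k` value are all assumptions chosen for illustration, and `route_patch` is a hypothetical name.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def route_patch(query, layer_keys, layer_feats, top_k=2):
    """Fuse geometric features for one patch (illustrative only).

    query:       the MLLM visual token for this patch
    layer_keys:  one key vector per bank layer (assumed, hypothetical)
    layer_feats: this patch's geometric feature from each bank layer
    top_k:       number of layers kept per patch (assumed value)
    """
    scale = math.sqrt(len(query))
    scores = [dot(query, key) / scale for key in layer_keys]
    # Sparse step: keep only the top_k highest-scoring layers.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    kept = order[:top_k]
    kept_weights = softmax([scores[i] for i in kept])
    weights = [0.0] * len(scores)
    for i, w in zip(kept, kept_weights):
        weights[i] = w
    # Weighted sum of the selected layers' features.
    dim = len(layer_feats[0])
    fused = [
        sum(weights[l] * layer_feats[l][d] for l in range(len(weights)))
        for d in range(dim)
    ]
    return fused, weights

# Toy inputs: three bank layers, two-dimensional features.
keys = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
feats = [[1.0, 1.0], [3.0, 3.0], [5.0, 5.0]]
fused, weights = route_patch([1.0, 0.0], keys, feats, top_k=2)
print(weights, fused)
```

In this toy case the third layer scores lowest and receives zero weight, so the fused feature is a blend of the first two layers only. A real system would operate on batched tensors and learn the keys or projections end to end.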

If this is right

  • A compact 4B MLLM with GeoAlign reaches state-of-the-art results on VSI-Bench, ScanQA, and SQA3D.
  • The same model outperforms larger existing MLLMs on spatial reasoning benchmarks.
  • Layer-wise sparse routing supplies geometric features matched to each patch without large added overhead.
  • The approach shows that content-aware selection can overcome objective mismatch between injected features and downstream multimodal tasks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same query-driven routing idea could be applied when other mismatched pretrained extractors are injected into vision-language models.
  • Further tests on models below 4B parameters would show whether the benefit scales down or requires a minimum capacity to learn useful routing.
  • If the selected layers vary systematically with spatial question type, the routing decisions themselves could become an interpretable signal for spatial understanding.
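
One way to probe the last point is to average routing weights per question type and compare the resulting layer profiles. The sketch below assumes routing weights have already been logged per visual token; the record layout, the function name `mean_routing_by_type`, and the question-type labels are hypothetical and do not come from the paper.

```python
def mean_routing_by_type(records):
    """Average per-layer routing weights for each question type.

    records: iterable of (question_type, token_weights), where
    token_weights holds one per-layer weight list per visual token.
    """
    sums = {}
    counts = {}
    for qtype, token_weights in records:
        for weights in token_weights:
            if qtype not in sums:
                sums[qtype] = [0.0] * len(weights)
                counts[qtype] = 0
            for layer, value in enumerate(weights):
                sums[qtype][layer] += value
            counts[qtype] += 1
    return {q: [s / counts[q] for s in sums[q]] for q in sums}

# Made-up logged weights for a three-layer bank.
records = [
    ("relative_distance", [[0.8, 0.2, 0.0], [0.6, 0.4, 0.0]]),
    ("object_count", [[0.0, 0.5, 0.5]]),
]
profiles = mean_routing_by_type(records)
for qtype, profile in profiles.items():
    print(qtype, [round(p, 3) for p in profile])
```

If the profiles differ consistently across question types on real data, that would support reading the routing weights as an interpretable signal. The numbers above are invented purely to exercise the function.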

Load-bearing premise

That the spatial-reasoning shortfall is caused mainly by fixed single-layer extraction and that letting visual tokens choose among layers will correct the mismatch without new biases or high cost.

What would settle it

An ablation that keeps single-layer geometric features but otherwise matches the model size and training, then measures whether performance on VSI-Bench, ScanQA, and SQA3D remains substantially below the dynamic-routing version.

Figures

Figures reproduced from arXiv: 2604.12630 by Guanglu Wan, Limeng Qiao, Tingting Jiang, Zhaochen Liu.

Figure 1: Task misalignment bias. Feature evolution progressively aligns with the pretraining tasks, thus many geometric features valuable for spatial reasoning tasks are distributed in the preceding layers.
… representations (Hong et al., 2023; Zheng et al., 2025b; Zhu et al., 2025a), such as point clouds or depth maps. While effective on 3D question-answering benchmarks, the rigid dependency on specialized 3D data …
Figure 2: Overview of the GeoAlign framework. We augment the 2D visual features with aggregated geometric features, which are adaptively selected and fused from a hierarchical feature bank built upon the 3D geometry encoder. In this dynamic routing mechanism, the original visual tokens act as content-aware queries, ensuring that the injected geometric features properly align with diverse spatial reasoning demands.
Figure 3: Qualitative visualization of the dynamic routing mechanism. We visualize the mean routing weights (α, presented as percentages) assigned to each layer of the geometric feature bank across the entire visual sequence. For distinct visual inputs, the routing distributions exhibit significant variations.
… Among these, the latter 12 layers achieve the best performance. This comparison suggests that early stages …
Original abstract

Multimodal large language models (MLLMs) have exhibited remarkable performance in various visual tasks, yet still struggle with spatial reasoning. Recent efforts mitigate this by injecting geometric features from 3D foundation models, but rely on static single-layer extractions. We identify that such an approach induces a task misalignment bias: the geometric features naturally evolve towards 3D pretraining objectives, which may contradict the heterogeneous spatial demands of MLLMs, rendering any single layer fundamentally insufficient. To resolve this, we propose GeoAlign, a novel framework that dynamically aggregates multi-layer geometric features to realign with the actual demands. GeoAlign constructs a hierarchical geometric feature bank and leverages the MLLM's original visual tokens as content-aware queries to perform layer-wise sparse routing, adaptively fetching the suitable geometric features for each patch. Extensive experiments on VSI-Bench, ScanQA, and SQA3D demonstrate that our compact 4B model effectively achieves state-of-the-art performance, even outperforming larger existing MLLMs.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that static single-layer geometric feature extraction from 3D foundation models induces a task misalignment bias in MLLMs for spatial reasoning, as features evolve toward pretraining objectives that conflict with heterogeneous downstream demands. To address this, GeoAlign constructs a hierarchical geometric feature bank and uses the MLLM's visual tokens as content-aware queries for layer-wise sparse routing to dynamically aggregate suitable multi-layer features. Experiments on VSI-Bench, ScanQA, and SQA3D show the resulting 4B model achieving SOTA performance and outperforming larger MLLMs.

Significance. If the necessity of the dynamic, query-driven realignment is verified, the work provides an efficient, parameter-light approach to aligning pretrained geometric features with MLLM spatial tasks, potentially enabling stronger performance from compact models without scaling model size. The content-aware routing design is a concrete engineering contribution that could apply to other cross-model feature alignment settings.

major comments (2)
  1. [§4 (Experiments) and Table 1] No ablation isolates the proposed layer-wise sparse routing from simpler static multi-layer aggregation (e.g., mean pooling, concatenation, or learned projection across layers). The reported SOTA gains on VSI-Bench, ScanQA, and SQA3D could therefore result from accessing multiple layers rather than the content-aware routing mechanism, leaving the central diagnosis of 'task misalignment bias' and the value of the realignment unverified.
  2. [§3.2 (Layer-wise Sparse Routing)] The claim that any single layer is 'fundamentally insufficient' due to misalignment bias is load-bearing, yet the manuscript provides no direct measurement or controlled test of this bias (e.g., per-layer performance on spatial subtasks or feature-task alignment metrics) before introducing the dynamic aggregator.
minor comments (2)
  1. [Abstract and §1] The abstract and §1 do not name the specific 4B MLLM backbone or 3D foundation model used to extract geometric features, which is needed for reproducibility.
  2. Implementation details (routing threshold, bank construction, training hyperparameters) and error analysis or statistical significance tests on the benchmark results are absent.
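
For concreteness, the static baselines named in major comment 1 can be sketched as follows. These are generic input-independent aggregators, not code from the paper; the function names are hypothetical, and `static_weighted_sum` is a simplified stand-in for a learned projection, with its weights fixed by hand here rather than trained.

```python
def mean_pool(layer_feats):
    """Average one patch's features across all bank layers."""
    n_layers = len(layer_feats)
    return [sum(column) / n_layers for column in zip(*layer_feats)]

def concatenate(layer_feats):
    """Stack all layers' features into one long vector."""
    return [value for feat in layer_feats for value in feat]

def static_weighted_sum(layer_feats, layer_weights):
    """Mix layers with weights that do not depend on the input.

    In a trained baseline these weights would be learned once and
    shared by every patch; here they are supplied by the caller.
    """
    return [
        sum(w * v for w, v in zip(layer_weights, column))
        for column in zip(*layer_feats)
    ]

# Toy features for one patch from a three-layer bank.
feats = [[1.0, 1.0], [3.0, 3.0], [5.0, 5.0]]
print(mean_pool(feats))
print(concatenate(feats))
print(static_weighted_sum(feats, [0.5, 0.5, 0.0]))
```

All three give every patch the same mixture of layers regardless of content, which is the property the requested ablation would contrast with query-driven routing.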

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address the major comments point by point below and will revise the manuscript to incorporate additional analyses as needed.

  1. Referee: [§4 (Experiments) and Table 1] No ablation isolates the proposed layer-wise sparse routing from simpler static multi-layer aggregation (e.g., mean pooling, concatenation, or learned projection across layers). The reported SOTA gains on VSI-Bench, ScanQA, and SQA3D could therefore result from accessing multiple layers rather than the content-aware routing mechanism, leaving the central diagnosis of 'task misalignment bias' and the value of the realignment unverified.

    Authors: We agree that directly comparing the proposed content-aware sparse routing against static multi-layer aggregation methods would more rigorously isolate the contribution of the dynamic mechanism. In the revised manuscript, we will add an ablation study including mean pooling, concatenation, and learned projection baselines across layers, with results reported on VSI-Bench, ScanQA, and SQA3D. This will help confirm that the query-driven realignment, rather than multi-layer access alone, drives the observed gains. revision: yes

  2. Referee: [§3.2 (Layer-wise Sparse Routing)] The claim that any single layer is 'fundamentally insufficient' due to misalignment bias is load-bearing, yet the manuscript provides no direct measurement or controlled test of this bias (e.g., per-layer performance on spatial subtasks or feature-task alignment metrics) before introducing the dynamic aggregator.

    Authors: The claim is motivated by the fact that 3D foundation model features evolve toward pretraining objectives that may conflict with heterogeneous MLLM spatial demands. While indirect support comes from the performance gap between single-layer baselines and GeoAlign, we acknowledge that direct measurements would strengthen the argument. In the revision, we will add per-layer performance breakdowns on spatial subtasks along with feature-task alignment metrics to provide more explicit evidence for the insufficiency of any single layer. revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation is self-contained engineering contribution


The paper identifies a task misalignment bias in static single-layer geometric feature extraction from 3D foundation models and proposes GeoAlign to resolve it via dynamic multi-layer aggregation using MLLM visual tokens as content-aware queries for sparse routing. This chain is presented as an independent architectural proposal with a hierarchical feature bank and layer-wise routing mechanism; it does not reduce by construction to fitted parameters, self-definitions, or prior author results. Performance is validated on external benchmarks (VSI-Bench, ScanQA, SQA3D) rather than internal predictions or self-citations, leaving the central claims empirically falsifiable and non-circular.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 2 invented entities

Review is abstract-only; the paper implicitly relies on the domain assumption that multi-layer 3D geometric features contain complementary information useful for MLLM spatial tasks and introduces two new constructs without independent evidence.

axioms (1)
  • domain assumption · Geometric features extracted from 3D foundation models contain information that can be aligned to heterogeneous spatial reasoning demands of MLLMs.
    Invoked to justify moving beyond single-layer extraction.
invented entities (2)
  • Hierarchical geometric feature bank · no independent evidence
    purpose: Store multi-layer geometric features for adaptive selection.
    Core new component of GeoAlign.
  • Layer-wise sparse routing using visual tokens as queries · no independent evidence
    purpose: Dynamically fetch suitable geometric features per patch.
    Central mechanism to resolve misalignment bias.

pith-pipeline@v0.9.0 · 5480 in / 1367 out tokens · 55214 ms · 2026-05-10T15:54:25.363234+00:00 · methodology


Reference graph

Works this paper leans on

3 extracted references · 3 canonical work pages · 2 internal anchors

  1. [1]

    In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5828–5839

    Scannet: Richly-annotated 3D reconstructions of indoor scenes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5828–5839. Jiajun Deng, Tianyu He, Li Jiang, Tianyu Wang, Feras Dayoub, and Ian Reid. 2025. 3D-LLaVA: Towards generalist 3D LMMs with omni superpoint transformer. In Proceedings of the IEEE/CVF Computer ...

  2. [2]

    GPT-4o System Card

    3D-LLM: Injecting the 3D world into large language models. Advances in Neural Information Processing Systems, 36:20482–20494. Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, and 1 others. 2024. GPT-4o system card. arXiv preprint arXiv:2410.21276. Amita Kamath, Jack Hes...

  3. [3]

    Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context

    BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In Proceedings of the International Conference on Machine Learning, pages 19730–19742. PMLR. Ji Lin, Hongxu Yin, Wei Ping, Pavlo Molchanov, Mohammad Shoeybi, and Song Han. 2024. VILA: On pre-training for visual language models. In Proceedings of t...