arxiv: 2505.23747 · v2 · pith:HS5S2APJnew · submitted 2025-05-29 · 💻 cs.CV · cs.AI · cs.LG

Spatial-MLLM: Boosting MLLM Capabilities in Visual-based Spatial Intelligence

Pith reviewed 2026-05-22 00:54 UTC · model grok-4.3

classification 💻 cs.CV · cs.AI · cs.LG
keywords Multimodal Large Language Models · Spatial Reasoning · Dual-Encoder Architecture · Visual Geometry Foundation Model · Space-Aware Frame Sampling · 2D Observations · 3D Structure Features

The pith

Spatial-MLLM boosts spatial reasoning in MLLMs by extracting 3D structure features from 2D inputs using a dual-encoder setup.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to improve the ability of Multimodal Large Language Models to understand spatial relationships when given only ordinary two-dimensional images or videos. It does this by introducing a dual-encoder design that combines regular semantic features with three-dimensional structural features taken from a visual geometry foundation model. A special strategy for choosing important frames in videos helps the model focus on what matters for spatial tasks even with token limits. This matters because many practical uses like analyzing camera footage do not have access to full 3D scans or depth sensors. If successful, it shows that strong spatial intelligence can be added to existing models without changing the input data requirements.

Core claim

The central claim is that a dual-encoder architecture, with a 2D visual encoder for semantics and a 3D spatial encoder initialized from a feed-forward visual geometry foundation model for structure, allows MLLMs to perform enhanced spatial understanding and reasoning from purely 2D observations. Combined with space-aware frame sampling at inference and training on a multi-source dataset via supervised fine-tuning and GRPO, this leads to state-of-the-art results on visual-based spatial tasks.

What carries the argument

Dual-encoder architecture pairing a pretrained 2D semantic visual encoder with a 3D spatial encoder initialized from a visual geometry foundation model backbone, plus space-aware frame sampling strategy.
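
As a rough illustration of the fusion step, the sketch below projects a semantic token stream and a geometric token stream to a common width and adds them token by token. The additive fusion, the bias-free linear projections, and the names `linear`, `fuse`, `w_sem`, and `w_geo` are assumptions made for this example; the paper's abstract states only that a connector integrates both feature sets into unified visual tokens.

```python
import random


def linear(tokens, weight):
    """Bias-free linear map: (n, d_in) tokens times (d_in, d_out) weight."""
    return [
        [sum(x * w for x, w in zip(tok, col)) for col in zip(*weight)]
        for tok in tokens
    ]


def fuse(sem_tokens, geo_tokens, w_sem, w_geo):
    """Project both streams to a shared width and add them token by token.

    Assumption: additive fusion. The real connector may differ.
    """
    if len(sem_tokens) != len(geo_tokens):
        raise ValueError("both encoders must yield the same number of tokens")
    sem = linear(sem_tokens, w_sem)
    geo = linear(geo_tokens, w_geo)
    return [[a + b for a, b in zip(s, g)] for s, g in zip(sem, geo)]


# Toy dimensions (hypothetical): 4 tokens, 3-d semantic, 5-d geometric, 2-d output.
rng = random.Random(0)
n_tokens, d_sem, d_geo, d_llm = 4, 3, 5, 2
sem_tokens = [[rng.random() for _ in range(d_sem)] for _ in range(n_tokens)]
geo_tokens = [[rng.random() for _ in range(d_geo)] for _ in range(n_tokens)]
w_sem = [[rng.random() for _ in range(d_llm)] for _ in range(d_sem)]
w_geo = [[rng.random() for _ in range(d_llm)] for _ in range(d_geo)]

unified = fuse(sem_tokens, geo_tokens, w_sem, w_geo)
print(len(unified), len(unified[0]))  # token count and fused width
```

Whatever the fusion operator is, the two encoders must emit spatially aligned tokens for a per-token combination like this to make sense, which is why the sketch rejects mismatched token counts.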

If this is right

  • The approach achieves state-of-the-art performance on a range of visual-based spatial understanding and reasoning tasks.
  • It functions using only 2D observations without needing additional 3D or 2.5D data.
  • Space-aware frame sampling ensures focus on spatially informative frames under limited token budgets for video inputs.
  • Training on a constructed multi-source dataset with supervised fine-tuning and GRPO enhances the spatial capabilities.
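
GRPO (Group Relative Policy Optimization) scores each sampled response relative to the other responses drawn for the same prompt. The sketch below shows only that normalization step. The reward values, the `eps` guard, and the choice of population standard deviation are assumptions for illustration; the reward design used in the paper is not described on this page.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of responses to the same prompt.

    Each advantage is (reward - group mean) / (group std + eps).
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four sampled answers to one spatial question.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
print([round(a, 3) for a in advantages])
```

Because the baseline is the group mean, no separate value network is needed; a group whose responses all earn the same reward yields zero advantage for every response.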

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • This design may allow similar spatial boosts in other multimodal models by swapping in different geometry pretraining sources.
  • Applications in autonomous driving or robotic vision could benefit from running on standard video without extra sensors.
  • Future work might test if the same encoder helps in generating 3D descriptions or planning from 2D scenes.

Load-bearing premise

The 3D spatial encoder, when initialized from the visual geometry foundation model, will successfully extract usable 3D structure features from 2D image and video inputs.

What would settle it

Replace the 3D spatial encoder with a standard encoder lacking the geometry prior while holding every other component fixed. If performance on spatial reasoning benchmarks does not drop substantially, the central claim is falsified; a large drop would support it.

Figures

Figures reproduced from arXiv: 2505.23747 by Diankun Wu, Fangfu Liu, Yi-Hsin Hung, Yueqi Duan.

Figure 1: We propose Spatial-MLLM, a method that significantly enhances the visual-based spatial intelligence of existing video MLLMs. As shown, Spatial-MLLM is capable of understanding and reasoning about the underlying scene from video input, achieving state-of-the-art performance across a wide range of tasks.
Figure 2: Overview of Spatial-MLLM. Our model is composed of a 2D visual encoder E2D, a spatial encoder ESpatial, which is initialized from a feed-forward visual geometry foundation model, a connector, and a large language model backbone. At inference time, we incorporate a space-aware frame sampling strategy to select spatially informative frames when the number of input frames is limited due to GPU memory constrai…
Figure 3: Basic statistics of our constructed Spatial-MLLM-120K dataset.
    Supervised Fine-tuning. Leveraging the constructed Spatial-MLLM-120k dataset, we first perform supervised fine-tuning (SFT) on our model. Since both E2D and Espatial are pre-trained on large-scale image-text and pixel-point pairs, respectively, we freeze them to preserve their ability to extract rich semantic and structural information. We jointl…
Figure 4: Visualization of Training Curves in the SFT and RL Stages. For the SFT stage, we present the mean token accuracy and loss curves. For the RL stage, we show the dynamics of completion length and reward.
    … (on ScanQA and SQA3D) achieve better performance than Spatial-MLLM. However, 3D-LLaVA requires additional point cloud input, and Video-3D-LLM depends on depth maps. Despite not relying on any additional 3D o…
Figure 5: Illustration of the prompts used in the SFT and GRPO stages. We use the default system prompt of Qwen2.5-VL [14] (i.e., "You are a helpful assistant") for both stages. In the SFT stage, the user prompt consists of a question and a type template. In the GRPO stage, the user prompt includes a question, a question post string, and a type template.
Figure 6: Visualization of different frame sampling strategies. For clarity of visualization, we set Nm = 128 and Nk = 8 in this example. Uniform frame sampling often overlooks transient regions that appear briefly in the video. Furthermore, when the camera remains static for extended periods, this strategy tends to yield redundant viewpoints. In contrast, our proposed space-aware frame sampling strategy achieves mo…
Figure 7: Qualitative example on VSI-Bench [18].
    Video Question Spatial-MLLM "<think> Let me think about this for a moment. The images show a living room with various pieces of furniture. As I scan through them, I notice a few clearly identifiable sofas. First, in the initial image, there's a partially visible sofa near the window. It appears to be a traditional design, likely made of fabric and leather. Next, in th…
Figure 8: Qualitative example on VSI-Bench [18].
Figure 9: Qualitative example on VSI-Bench [18].
    Video Question "What will be the first-time appearance order of the following categories in the video: basket, pillow, toilet, tv? Options: A. basket, pillow, toilet, tv B. pillow, tv, toilet, basket C. toilet, pillow, tv, basket D. toilet, basket, pillow, tv" Spatial-MLLM "<think>Let me think about this carefully. The question asks for the first appearance order of …
Figure 10: Qualitative example on VSI-Bench [18].
Original abstract

Recent advancements in Multimodal Large Language Models (MLLMs) have significantly enhanced performance on 2D visual tasks. However, improving their spatial intelligence remains a challenge. Existing 3D MLLMs always rely on additional 3D or 2.5D data to incorporate spatial awareness, restricting their utility in scenarios with only 2D inputs, such as images or videos. In this paper, we present Spatial-MLLM, a novel framework for visual-based spatial reasoning from purely 2D observations. Unlike conventional video MLLMs which rely on CLIP-based visual encoders optimized for semantic understanding, our key insight is to unleash the strong structure prior from the feed-forward visual geometry foundation model. Specifically, we propose a dual-encoder architecture: a pretrained 2D visual encoder to extract semantic features, and a 3D spatial encoder, initialized from the backbone of the visual geometry model, to extract 3D structure features. A connector then integrates both features into unified visual tokens for enhanced spatial understanding. Furthermore, we propose a space-aware frame sampling strategy at inference time, which selects the spatially informative frames of a video sequence, ensuring that even under limited token length, the model focuses on frames critical for spatial reasoning. Beyond architecture improvements, we construct a training dataset from multiple sources and train the model on it using supervised fine-tuning and GRPO. Extensive experiments on various real-world datasets demonstrate that Spatial-MLLM achieves state-of-the-art performance in a wide range of visual-based spatial understanding and reasoning tasks. Project page: https://diankun-wu.github.io/Spatial-MLLM/.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces Spatial-MLLM, a multimodal LLM framework for visual-based spatial reasoning and understanding that operates exclusively on 2D image and video inputs. It proposes a dual-encoder architecture (pretrained 2D semantic encoder plus a 3D spatial encoder initialized from a feed-forward visual geometry foundation model backbone), a connector to fuse the features, a space-aware frame sampling strategy for inference, and training via supervised fine-tuning followed by GRPO on a multi-source constructed dataset. The central claim is that this yields state-of-the-art performance across a range of real-world spatial tasks without requiring explicit 3D or 2.5D data.

Significance. If the empirical results hold after proper verification, the work would be moderately significant for the MLLM and spatial AI community. It offers a practical route to inject geometric priors into video MLLMs by reusing existing visual geometry backbones rather than training from scratch or requiring 3D supervision, which could broaden applicability to standard 2D-only settings. The combination of architecture, sampling, and GRPO training constitutes a concrete recipe worth testing on additional benchmarks.

major comments (2)
  1. [Abstract and §3] (dual-encoder architecture): The load-bearing claim that the 3D spatial encoder, initialized from the visual geometry foundation model, successfully extracts usable 3D structure features (depth ordering, relative pose, layout) directly from 2D inputs is not isolated or verified. No feature probing, visualization, or controlled ablation (e.g., swapping the 3D encoder for a standard CLIP-style encoder while keeping all other components fixed) is described, leaving open the possibility that gains arise primarily from the training dataset, SFT+GRPO, or space-aware sampling instead.
  2. [Experimental section] (results and ablations): The manuscript asserts SOTA performance on multiple spatial benchmarks, yet the description provides no quantitative baseline tables with exact numbers, ablation breakdowns for each proposed component, dataset statistics, or error bars/statistical tests. This absence prevents assessment of whether the reported improvements are reliable, substantial, or reproducible, directly undermining the central empirical claim.
minor comments (2)
  1. [Figure 2] (architecture diagram): The connector module and how the two encoder outputs are projected into unified tokens could be illustrated with explicit dimension annotations to improve clarity.
  2. [§4.2] (space-aware frame sampling): The inference-time sampling strategy would benefit from a short pseudocode listing or explicit selection criterion formula rather than a purely textual description.
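
One way to write down the selection criterion requested in the last minor comment is as a greedy maximum-coverage pick: repeatedly choose the candidate frame that reveals the most scene content not yet seen. This is an assumed formulation for illustration, not the authors' stated procedure. The representation of each frame as a set of visible voxel ids and the name `greedy_coverage_sampling` are hypothetical.

```python
def greedy_coverage_sampling(frame_voxels, num_keep):
    """Pick up to num_keep frames that together cover the most voxels.

    frame_voxels: list of sets, one per candidate frame, holding the ids of
    scene voxels visible in that frame (assumed to come from some geometry
    estimate). Returns the chosen frame indices in temporal order.
    """
    covered = set()
    chosen = []
    remaining = set(range(len(frame_voxels)))
    while remaining and len(chosen) < num_keep:
        # Ties go to the earliest frame because max() keeps the first maximum.
        best = max(sorted(remaining), key=lambda i: len(frame_voxels[i] - covered))
        if not frame_voxels[best] - covered:
            break  # no remaining frame adds new coverage, so stop early
        chosen.append(best)
        covered |= frame_voxels[best]
        remaining.remove(best)
    return sorted(chosen)


# Toy example: frames 0, 1 and 4 see the same region (a static camera),
# while frames 2 and 3 each reveal something new.
frames = [{1, 2, 3}, {1, 2, 3}, {3, 4}, {5}, {1, 2, 3}]
print(greedy_coverage_sampling(frames, num_keep=3))  # → [0, 2, 3]
```

In this toy case, evenly spaced sampling of three frames would pick frames 0, 2 and 4 and miss voxel 5, while the greedy pick skips the redundant static views. The sketch returns fewer than `num_keep` frames when nothing new can be covered; a real pipeline would probably pad the selection.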

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below and describe the changes we will make in the revised manuscript.

  1. Referee: [Abstract and §3] (dual-encoder architecture): The load-bearing claim that the 3D spatial encoder, initialized from the visual geometry foundation model, successfully extracts usable 3D structure features (depth ordering, relative pose, layout) directly from 2D inputs is not isolated or verified. No feature probing, visualization, or controlled ablation (e.g., swapping the 3D encoder for a standard CLIP-style encoder while keeping all other components fixed) is described, leaving open the possibility that gains arise primarily from the training dataset, SFT+GRPO, or space-aware sampling instead.

    Authors: We agree that the manuscript would benefit from more direct evidence isolating the contribution of the 3D spatial encoder. The existing ablations compare the full Spatial-MLLM against variants without the spatial encoder, but do not include a controlled swap with a CLIP-style encoder or feature visualizations. In the revision we will add a new ablation that replaces the 3D spatial encoder with a standard visual encoder while holding all other components fixed, together with qualitative visualizations of the extracted features to illustrate the 3D structural information being captured from 2D inputs. revision: yes

  2. Referee: [Experimental section] (results and ablations): The manuscript asserts SOTA performance on multiple spatial benchmarks, yet the description provides no quantitative baseline tables with exact numbers, ablation breakdowns for each proposed component, dataset statistics, or error bars/statistical tests. This absence prevents assessment of whether the reported improvements are reliable, substantial, or reproducible, directly undermining the central empirical claim.

    Authors: We acknowledge that the current experimental section does not present results at the level of detail required for full reproducibility assessment. We will expand the section to include complete quantitative tables reporting exact numbers for all baselines and our model, component-wise ablation breakdowns, full dataset statistics (including source composition and size), and error bars obtained from multiple runs together with appropriate statistical tests. revision: yes
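
The error bars and statistical tests promised in the second response could take many forms. A minimal sketch of one common choice, a paired bootstrap over per-question outcomes, follows. The function name, the 0/1 outcome encoding, and the toy inputs are all hypothetical; none of the numbers are results from the paper.

```python
import random


def paired_bootstrap_delta(correct_a, correct_b, n_resamples=2000, seed=0):
    """Accuracy gap (A minus B) with a 95% percentile bootstrap interval.

    correct_a, correct_b: equal-length lists of 0/1 outcomes for the same
    questions, so each resample keeps the pairing between the two systems.
    """
    if len(correct_a) != len(correct_b) or not correct_a:
        raise ValueError("need two equal-length, non-empty outcome lists")
    rng = random.Random(seed)
    n = len(correct_a)
    diffs = [a - b for a, b in zip(correct_a, correct_b)]
    observed = sum(diffs) / n
    # Resample questions with replacement and recompute the mean gap.
    samples = sorted(
        sum(diffs[rng.randrange(n)] for _ in range(n)) / n
        for _ in range(n_resamples)
    )
    low = samples[int(0.025 * n_resamples)]
    high = samples[int(0.975 * n_resamples) - 1]
    return observed, (low, high)


# Toy 0/1 outcomes, made up for illustration (not measured results).
full_model = [1, 1, 0, 1, 1, 0, 1, 1]
ablated = [1, 0, 0, 1, 0, 0, 1, 1]
delta, (low, high) = paired_bootstrap_delta(full_model, ablated)
print(f"accuracy gap: {delta:.3f}, 95% interval: [{low:.3f}, {high:.3f}]")
```

An interval that excludes zero would suggest the gap between the full model and an ablated variant is not just question-sampling noise. Variance across training seeds is a separate source of uncertainty that this sketch does not capture.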

Circularity Check

0 steps flagged

No circularity: empirical architecture with external benchmarks

The paper presents an empirical architecture (dual-encoder with 3D spatial encoder initialized from a visual geometry backbone) and training recipe (SFT + GRPO on a constructed dataset plus space-aware sampling). No equations, derivations, or first-principles predictions appear in the provided text. Central claims rest on performance numbers from external real-world datasets rather than quantities defined inside the paper or self-citation chains. This matches the default case of a self-contained empirical result against external benchmarks, warranting a score of 0 with no steps flagged.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that a visual geometry foundation model supplies transferable 3D structure features when its backbone is used as a 3D spatial encoder on ordinary 2D inputs, and that selecting spatially informative frames improves reasoning under token limits.

axioms (1)
  • domain assumption: A pretrained visual geometry foundation model encodes useful 3D structure priors that can be extracted from 2D observations.
    The paper initializes the 3D spatial encoder from the backbone of such a model (abstract description of dual-encoder architecture).

pith-pipeline@v0.9.0 · 5839 in / 1316 out tokens · 54274 ms · 2026-05-22T00:54:21.457046+00:00 · methodology

Forward citations

Cited by 31 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. PinpointQA: A Dataset and Benchmark for Small Object-Centric Spatial Understanding in Indoor Videos

    cs.CV 2026-04 unverdicted novelty 8.0

    PinpointQA is the first benchmark dataset for small object-centric spatial understanding in indoor videos, with four tasks showing MLLM capability gaps that improve via supervised fine-tuning.

  2. SpaceDG: Benchmarking Spatial Intelligence under Visual Degradation

    cs.CV 2026-05 unverdicted novelty 7.0

    SpaceDG introduces the first large-scale degradation-aware spatial reasoning dataset using 3D Gaussian Splatting synthesis, showing that visual degradations impair MLLM performance but finetuning on the data improves ...

  3. CaMo: Camera Motion Grounded Evaluation and Training for Vision-Language Models

    cs.CV 2026-05 unverdicted novelty 7.0

    Proposes Spatial Narrative Score (SNS) evaluation for VLMs' camera motion understanding and introduces CaMo model achieving consistent performance on SNS and direct QA.

  4. ViSRA: A Video-based Spatial Reasoning Agent for Multi-modal Large Language Models

    cs.CV 2026-05 unverdicted novelty 7.0

    ViSRA boosts MLLM 3D spatial reasoning performance by up to 28.9% on unseen tasks via a plug-and-play video-based agent that extracts explicit spatial cues from expert models without any post-training.

  5. Exploring Spatial Intelligence from a Generative Perspective

    cs.CV 2026-04 unverdicted novelty 7.0

    Fine-tuning multimodal models on a new synthetic spatial benchmark improves generative spatial compliance on real and synthetic tasks and transfers to better spatial understanding.

  6. Why MLLMs Struggle to Determine Object Orientations

    cs.CV 2026-04 accept novelty 7.0

    Orientation information is recoverable from MLLM visual encoder embeddings via linear regression, contradicting the hypothesis that failures originate in the encoders.

  7. Any 3D Scene is Worth 1K Tokens: 3D-Grounded Representation for Scene Generation at Scale

    cs.CV 2026-04 unverdicted novelty 7.0

    A 3D-grounded autoencoder and diffusion transformer allow direct generation of 3D scenes in an implicit latent space using a fixed 1K-token representation for arbitrary views and resolutions.

  8. PinpointQA: A Dataset and Benchmark for Small Object-Centric Spatial Understanding in Indoor Videos

    cs.CV 2026-04 unverdicted novelty 7.0

    PinpointQA is the first benchmark dataset for small object-centric spatial understanding in indoor videos, with four progressive tasks built from ScanNet data.

  9. SCP: Spatial Causal Prediction in Video

    cs.CV 2026-03 unverdicted novelty 7.0

    SCP defines a new benchmark task for predicting spatial causal outcomes beyond direct observation and shows that 23 leading models lag far behind humans on it.

  10. SpatialBench: Benchmarking Multimodal Large Language Models for Spatial Cognition

    cs.AI 2025-11 unverdicted novelty 7.0

    SpatialBench creates a five-level framework and 15-task benchmark to measure hierarchical spatial reasoning in MLLMs, finding strong basic perception but weak symbolic reasoning, causal inference, and planning.

  11. AVA-Bench: Atomic Visual Ability Benchmark for Vision Foundation Models

    cs.CV 2025-06 unverdicted novelty 7.0

    AVA-Bench evaluates vision foundation models by disentangling 14 atomic visual abilities with aligned training-test distributions to reveal precise ability fingerprints.

  12. GeoWeaver: Grounding Visual Tokens with Geometric Evidence before Scene Reasoning

    cs.CV 2026-05 unverdicted novelty 6.0

    GeoWeaver performs token-adaptive geometric grounding on visual tokens from a multi-level bank prior to language modeling to support better spatio-temporal reasoning.

  13. See What I Mean: Aligning Vision and Language Representations for Video Fine-grained Object Understanding

    cs.CV 2026-05 unverdicted novelty 6.0

    SWIM aligns cross-attention maps from object nouns to ground-truth masks during training on the new NL-Refer dataset to enable text-only fine-grained video object understanding in MLLMs.

  14. Unlocking Dense Metric Depth Estimation in VLMs

    cs.CV 2026-05 unverdicted novelty 6.0

    DepthVLM attaches a depth head to VLMs for native dense metric depth prediction alongside language outputs using a two-stage unified training schedule and a new indoor-outdoor benchmark.

  15. Unlocking Dense Metric Depth Estimation in VLMs

    cs.CV 2026-05 unverdicted novelty 6.0

    DepthVLM converts a standard VLM into a dense metric depth predictor by attaching a lightweight head and training under unified vision-text supervision, outperforming prior VLMs and some pure vision models on a new in...

  16. SpatialForge: Bootstrapping 3D-Aware Spatial Reasoning from Open-World 2D Images

    cs.CV 2026-05 unverdicted novelty 6.0

    SpatialForge synthesizes 10 million spatial QA pairs from in-the-wild 2D images to train VLMs for better depth ordering, layout, and viewpoint-dependent reasoning.

  17. SpatialFusion: Endowing Unified Image Generation with Intrinsic 3D Geometric Awareness

    cs.CV 2026-04 unverdicted novelty 6.0

    SpatialFusion internalizes 3D geometric awareness into unified image generation models by pairing an MLLM with a spatial transformer that produces depth maps to constrain diffusion generation.

  18. ReplicateAnyScene: Zero-Shot Video-to-3D Composition via Textual-Visual-Spatial Alignment

    cs.CV 2026-04 unverdicted novelty 6.0

    ReplicateAnyScene performs fully automated zero-shot video-to-compositional-3D reconstruction by cascading alignments of generic priors from vision foundation models across textual, visual, and spatial dimensions.

  19. Let Geometry GUIDE: Layer-wise Unrolling of Geometric Priors in Multimodal LLMs

    cs.CV 2026-04 unverdicted novelty 6.0

    GUIDE unrolls multi-granularity geometric priors layer-wise into early MLLM layers with gating to improve spatial reasoning and perception.

  20. EgoMind: Activating Spatial Cognition through Linguistic Reasoning in MLLMs

    cs.CV 2026-04 unverdicted novelty 6.0

    EgoMind activates spatial cognition in MLLMs via linguistic Role-Play Caption and Progressive Spatial Analysis, reaching competitive results on VSI-Bench, SPAR-Bench, SITE-Bench and SPBench with only 5K SFT and 20K RL...

  21. Feeling the Space: Egomotion-Aware Video Representation for Efficient and Accurate 3D Scene Understanding

    cs.CV 2026-03 unverdicted novelty 6.0

    Motion-MLLM integrates IMU egomotion data into MLLMs using cascaded filtering and asymmetric fusion to ground visual content in physical trajectories for scale-aware 3D understanding, achieving competitive accuracy at...

  22. Do MLLMs Really Understand Space? A Mathematical Reasoning Evaluation

    cs.AI 2026-02 unverdicted novelty 6.0

    MLLMs show a large gap in spatial mathematical reasoning compared to humans, and a new 10,000-problem dataset helps narrow it through training.

  23. Vision-aligned Latent Reasoning for Multi-modal Large Language Model

    cs.CV 2026-02 unverdicted novelty 6.0

    VaLR generates vision-aligned latent tokens before each reasoning step to preserve perceptual cues, improving VSI-Bench accuracy from 33.0% to 52.9%.

  24. SpaceDrive: Infusing Spatial Awareness into VLM-based Autonomous Driving

    cs.CV 2025-12 conditional novelty 6.0

    SpaceDrive integrates 3D positional encodings derived from depth and ego-states into VLMs, replacing digit tokens to improve spatial reasoning and trajectory regression in autonomous driving.

  25. Geometry Forcing: Marrying Video Diffusion and 3D Representation for Consistent World Modeling

    cs.CV 2025-07 unverdicted novelty 6.0

    Geometry Forcing aligns video diffusion representations with geometric foundation model features via angular cosine and scale regression objectives to improve 3D consistency in generated videos.

  26. VLM-3R: Vision-Language Models Augmented with Instruction-Aligned 3D Reconstruction

    cs.CV 2025-05 unverdicted novelty 6.0

    VLM-3R augments VLMs with implicit 3D tokens from monocular video via geometry encoding and 200K+ 3D reconstructive QA pairs, plus a new 138K-pair temporal benchmark, to support spatial and embodied reasoning.

  27. Thinking with Novel Views: A Systematic Analysis of Generative-Augmented Spatial Intelligence

    cs.CV 2026-05 unverdicted novelty 5.0

    Integrating generative novel-view synthesis into LMM reasoning loops improves accuracy on spatial subtasks by 1.3 to 3.9 percentage points across multiple models and tasks.

  28. SpatialImaginer: Towards Adaptive Visual Imagination for Spatial Reasoning

    cs.CV 2026-04 unverdicted novelty 5.0

    SpatialImaginer integrates visual imagination with textual chain-of-thought to improve spatial reasoning robustness in multimodal large language models.

  29. OpenSpatial: A Principled Data Engine for Empowering Spatial Intelligence

    cs.CL 2026-04 unverdicted novelty 5.0

    OpenSpatial supplies a principled open-source data engine and 3-million-sample dataset that raises spatial-reasoning model performance by an average of 19 percent on benchmarks.

  30. LASAR: Towards Spatio-temporal Reasoning with Latent Cognitive Map

    cs.CV 2026-05 unverdicted novelty 4.0

    LASAR pairs a dual-memory system with spatio-temporal contrastive learning to induce latent cognitive maps, reporting 2-3.5% zero-shot gains on VLN-CE and VSI-Bench plus high map self-consistency.

  31. OpenWorldLib: A Unified Codebase and Definition of Advanced World Models

    cs.CV 2026-04 unverdicted novelty 4.0

    OpenWorldLib offers a standardized codebase and definition for world models that combine perception, interaction, and memory to understand and predict the world.

Reference graph

Works this paper leans on

71 extracted references · 71 canonical work pages · cited by 29 Pith papers · 22 internal anchors

  1. [1]

    Flamingo: a visual language model for few-shot learning,

    J.-B. Alayrac, J. Donahue, P. Luc, A. Miech, I. Barr, Y. Hasson, K. Lenc, A. Mensch, K. Millican, M. Reynolds, et al., “Flamingo: a visual language model for few-shot learning,” NeurIPS, 2022

  2. [2]

    Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models,

    J. Li, D. Li, S. Savarese, and S. Hoi, “Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models,” in ICML, 2023

  3. [3]

    Visual instruction tuning,

    H. Liu, C. Li, Q. Wu, and Y. J. Lee, “Visual instruction tuning,” NeurIPS, 2024

  4. [4]

    Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context

    G. Team, P. Georgiev, V. I. Lei, R. Burnell, L. Bai, A. Gulati, G. Tanzer, D. Vincent, Z. Pan, S. Wang, et al., “Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context,” arXiv preprint arXiv:2403.05530, 2024

  5. [5]

    GPT-4o System Card

    A. Hurst, A. Lerer, A. P. Goucher, A. Perelman, A. Ramesh, A. Clark, A. Ostrow, A. Welihinda, A. Hayes, A. Radford, et al., “Gpt-4o system card,” arXiv preprint arXiv:2410.21276, 2024

  6. [6]

    LLaVA-OneVision: Easy Visual Task Transfer

    B. Li, Y. Zhang, D. Guo, R. Zhang, F. Li, H. Zhang, K. Zhang, Y. Li, Z. Liu, and C. Li, “Llava-onevision: Easy visual task transfer,” arXiv preprint arXiv:2408.03326, 2024

  7. [7]

    Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks,

    Z. Chen, J. Wu, W. Wang, W. Su, G. Chen, S. Xing, M. Zhong, Q. Zhang, X. Zhu, L. Lu, et al., “Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks,” in CVPR, 2024

  8. [8]

    Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond

    J. Bai, S. Bai, S. Yang, S. Wang, S. Tan, P. Wang, J. Lin, C. Zhou, and J. Zhou, “Qwen-vl: A frontier large vision-language model with versatile abilities,” arXiv preprint arXiv:2308.12966, 2023

  9. [9]

    Video-LLaVA: Learning United Visual Representation by Alignment Before Projection

    B. Lin, B. Zhu, Y. Ye, M. Ning, P. Jin, and L. Yuan, “Video-llava: Learning united visual representation by alignment before projection,” arXiv preprint arXiv:2311.10122, 2023

  10. [10]

    VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs

    Z. Cheng, S. Leng, H. Zhang, Y. Xin, X. Li, G. Chen, Y. Zhu, W. Zhang, Z. Luo, D. Zhao, et al., “Videollama 2: Advancing spatial-temporal modeling and audio understanding in video-llms,” arXiv preprint arXiv:2406.07476, 2024

  11. [11]

    Streaming long video understanding with large language models,

    R. Qian, X. Dong, P. Zhang, Y. Zang, S. Ding, D. Lin, and J. Wang, “Streaming long video understanding with large language models,” arXiv preprint arXiv:2405.16009, 2024

  12. [12]

    LLaVA-Video: Video Instruction Tuning With Synthetic Data

    Y. Zhang, J. Wu, W. Li, B. Li, Z. Ma, Z. Liu, and C. Li, “Video instruction tuning with synthetic data,” ArXiv, vol. abs/2410.02713, 2024

  13. [13]

    Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution

    P. Wang, S. Bai, S. Tan, S. Wang, Z. Fan, J. Bai, K.-Y. Chen, X. Liu, J. Wang, W. Ge, Y. Fan, K. Dang, M. Du, X. Ren, R. Men, D. Liu, C. Zhou, J. Zhou, and J. Lin, “Qwen2-vl: Enhancing vision-language model’s perception of the world at any resolution,” ArXiv, vol. abs/2409.12191, 2024

  14. [14]

    Qwen2.5-VL Technical Report

    S. Bai, K. Chen, X. Liu, J. Wang, W. Ge, S. Song, K. Dang, P. Wang, S. Wang, J. Tang, H. Zhong, Y. Zhu, M. Yang, Z. Li, J. Wan, P. Wang, W. Ding, Z. Fu, Y. Xu, J. Ye, X. Zhang, T. Xie, Z. Cheng, H. Zhang, Z. Yang, H. Xu, and J. Lin, “Qwen2.5-vl technical report,” ArXiv, vol. abs/2502.13923, 2025

  15. [15]

    Audiogpt: Understanding and generating speech, music, sound, and talking head,

    R. Huang, M. Li, D. Yang, J. Shi, X. Chang, Z. Ye, Y. Wu, Z. Hong, J.-B. Huang, J. Liu, Y. Ren, Z. Zhao, and S. Watanabe, “Audiogpt: Understanding and generating speech, music, sound, and talking head,” ArXiv, vol. abs/2304.12995, 2023

  16. [16]

    SALMONN: Towards Generic Hearing Abilities for Large Language Models

    C. Tang, W. Yu, G. Sun, X. Chen, T. Tan, W. Li, L. Lu, Z. Ma, and C. Zhang, “Salmonn: Towards generic hearing abilities for large language models,” ArXiv, vol. abs/2310.13289, 2023

  17. [17]

    Ola: Pushing the Frontiers of Omni-Modal Language Model with Progressive Modality Alignment

    Z. Liu, Y. Dong, J. Wang, Z. Liu, W. Hu, J. Lu, and Y. Rao, “Ola: Pushing the frontiers of omni-modal language model with progressive modality alignment,” ArXiv, vol. abs/2502.04328, 2025

  18. [18]

    Thinking in Space: How Multimodal Large Language Models See, Remember, and Recall Spaces

    J. Yang, S. Yang, A. W. Gupta, R. Han, F.-F. Li, and S. Xie, “Thinking in space: How multimodal large language models see, remember, and recall spaces,” ArXiv, vol. abs/2412.14171, 2024

  19. [19]

    Spatialvlm: Endowing vision-language models with spatial reasoning capabilities

    B. Chen, Z. Xu, S. Kirmani, B. Ichter, D. Driess, P. Florence, D. Sadigh, L. J. Guibas, and F. Xia, “Spatialvlm: Endowing vision-language models with spatial reasoning capabilities,” 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 14455–14465, 2024

  20. [21]

    3d-llava: Towards generalist 3d lmms with omni superpoint transformer

    J. Deng, T. He, L. Jiang, T. Wang, F. Dayoub, and I. Reid, “3d-llava: Towards generalist 3d lmms with omni superpoint transformer,” ArXiv, vol. abs/2501.01163, 2025

  21. [22]

    Chat-scene: Bridging 3d scene and large language models with object identifiers

    H. Huang, Z. Wang, R. Huang, L. Liu, X. Cheng, Y. Zhao, T. Jin, and Z. Zhao, “Chat-scene: Bridging 3d scene and large language models with object identifiers,” in Neural Information Processing Systems, 2023

  22. [23]

    Ll3da: Visual interactive instruction tuning for omni-3d understanding reasoning and planning

    S. Chen, X. Chen, C. Zhang, M. Li, G. Yu, H. Fei, H. Zhu, J. Fan, and T. Chen, “Ll3da: Visual interactive instruction tuning for omni-3d understanding reasoning and planning,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 26428–26438, 2024

  23. [24]

    Llava-3d: A simple yet effective pathway to empowering lmms with 3d-awareness

    C. Zhu, T. Wang, W. Zhang, J. Pang, and X. Liu, “Llava-3d: A simple yet effective pathway to empowering lmms with 3d-awareness,” arXiv preprint arXiv:2409.18125, 2024

  24. [25]

    Video-3d llm: Learning position-aware video representation for 3d scene understanding

    D. Zheng, S. Huang, and L. Wang, “Video-3d llm: Learning position-aware video representation for 3d scene understanding,” ArXiv, vol. abs/2412.00493, 2024

  25. [26]

    Datacomp: In search of the next generation of multimodal datasets

    S. Y. Gadre, G. Ilharco, A. Fang, J. Hayase, G. Smyrnis, T. Nguyen, R. Marten, M. Wortsman, D. Ghosh, J. Zhang, E. Orgad, R. Entezari, G. Daras, S. Pratt, V. Ramanujan, Y. Bitton, K. Marathe, S. Mussmann, R. Vencu, M. Cherti, R. Krishna, P. W. Koh, O. Saukh, A. J. Ratner, S. Song, H. Hajishirzi, A. Farhadi, R. Beaumont, S. Oh, A. G. Dimakis, J. Jitsev,...

  26. [27]

    Learning transferable visual models from natural language supervision

    A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, G. Krueger, and I. Sutskever, “Learning transferable visual models from natural language supervision,” in International Conference on Machine Learning, 2021

  27. [28]

    Tulip: Towards unified language-image pretraining

    Z. Tang, L. Lian, S. Eisape, X. Wang, R. Herzig, A. Yala, A. Suhr, T. Darrell, and D. M. Chan, “Tulip: Towards unified language-image pretraining,” ArXiv, vol. abs/2503.15485, 2025

  28. [29]

    Beyond semantics: Rediscovering spatial awareness in vision-language models

    J. Qi, J. Liu, H. Tang, and Z. Zhu, “Beyond semantics: Rediscovering spatial awareness in vision-language models,” ArXiv, vol. abs/2503.17349, 2025

  29. [30]

    Long-clip: Unlocking the long-text capability of clip

    B. Zhang, P. Zhang, X. wen Dong, Y. Zang, and J. Wang, “Long-clip: Unlocking the long-text capability of clip,” in European Conference on Computer Vision, 2024

  30. [31]

    Dust3r: Geometric 3d vision made easy

    S. Wang, V. Leroy, Y. Cabon, B. Chidlovskii, and J. Revaud, “Dust3r: Geometric 3d vision made easy,” 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 20697–20709, 2023

  31. [32]

    Vggt: Visual geometry grounded transformer

    J. Wang, M. Chen, N. Karaev, A. Vedaldi, C. Rupprecht, and D. Novotny, “Vggt: Visual geometry grounded transformer,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025

  32. [33]

    Megasam: Accurate, fast, and robust structure and motion from casual dynamic videos

    Z. Li, R. Tucker, F. Cole, Q. Wang, L. Jin, V. Ye, A. Kanazawa, A. Holynski, and N. Snavely, “Megasam: Accurate, fast, and robust structure and motion from casual dynamic videos,” ArXiv, vol. abs/2412.04463, 2024

  33. [34]

    DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

    DeepSeek-AI, D. Guo, D. Yang, H. Zhang, J.-M. Song, R. Zhang, R. Xu, Q. Zhu, S. Ma, P. Wang, X. Bi, X. Zhang, X. Yu, Y. Wu, Z. F. Wu, Z. Gou, Z. Shao, Z. Li, Z. Gao, A. Liu, B. Xue, B.-L. Wang, B. Wu, B. Feng, C. Lu, C. Zhao, C. Deng, C. Zhang, C. Ruan, D. Dai, D. Chen, D.-L. Ji, E. Li, F. Lin, F. Dai, F. Luo, G. Hao, G. Chen, G. Li, H. Zhang, H. Bao, H....

  34. [35]

    DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

    Z. Shao, P. Wang, Q. Zhu, R. Xu, J.-M. Song, M. Zhang, Y. K. Li, Y. Wu, and D. Guo, “Deepseekmath: Pushing the limits of mathematical reasoning in open language models,” ArXiv, vol. abs/2402.03300, 2024

  35. [36]

    Chain-of-Thought Prompting Elicits Reasoning in Large Language Models

    J. Wei, X. Wang, D. Schuurmans, M. Bosma, E. H. Chi, F. Xia, Q. Le, and D. Zhou, “Chain of thought prompting elicits reasoning in large language models,” ArXiv, vol. abs/2201.11903, 2022

  36. [37]

    Scanqa: 3d question answering for spatial scene understanding

    D. Azuma, T. Miyanishi, S. Kurita, and M. Kawanabe, “Scanqa: 3d question answering for spatial scene understanding,” 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 19107–19117, 2021

  37. [38]

    Sqa3d: Situated question answering in 3d scenes

    X. Ma, S. Yong, Z. Zheng, Q. Li, Y. Liang, S.-C. Zhu, and S. Huang, “Sqa3d: Situated question answering in 3d scenes,” ArXiv, vol. abs/2210.07474, 2022

  38. [39]

    Improved baselines with visual instruction tuning

    H. Liu, C. Li, Y. Li, and Y. J. Lee, “Improved baselines with visual instruction tuning,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 26296–26306, 2024

  39. [40]

    MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models

    D. Zhu, J. Chen, X. Shen, X. Li, and M. Elhoseiny, “Minigpt-4: Enhancing vision-language understanding with advanced large language models,” arXiv preprint arXiv:2304.10592, 2023

  40. [41]

    PandaGPT: One Model To Instruction-Follow Them All

    Y. Su, T. Lan, H. Li, J. Xu, Y. Wang, and D. Cai, “Pandagpt: One model to instruction-follow them all,” arXiv preprint arXiv:2305.16355, 2023

  41. [42]

    Detgpt: Detect what you need via reasoning

    R. Pi, J. Gao, S. Diao, R. Pan, H. Dong, J. Zhang, L. Yao, J. Han, H. Xu, L. Kong, et al., “Detgpt: Detect what you need via reasoning,” arXiv preprint arXiv:2305.14167, 2023

  42. [43]

    VideoChat: Chat-Centric Video Understanding

    K. Li, Y. He, Y. Wang, Y. Li, W. Wang, P. Luo, Y. Wang, L. Wang, and Y. Qiao, “Videochat: Chat-centric video understanding,” arXiv preprint arXiv:2305.06355, 2023

  43. [44]

    Grounded 3d-llm with referent tokens

    Y. Chen, S. Yang, H. Huang, T. Wang, R. Xu, R. Lyu, D. Lin, and J. Pang, “Grounded 3d-llm with referent tokens,” arXiv preprint arXiv:2405.10370, 2024

  44. [45]

    Chat-3d: Data-efficiently tuning large language model for universal dialogue of 3d scenes

    Z. Wang, H. Huang, Y. Zhao, Z. Zhang, and Z. Zhao, “Chat-3d: Data-efficiently tuning large language model for universal dialogue of 3d scenes,” arXiv preprint arXiv:2308.08769, 2023

  45. [47]

    Chat-scene: Bridging 3d scene and large language models with object identifiers

    H. Huang, Y. Chen, Z. Wang, R. Huang, R. Xu, T. Wang, L. Liu, X. Cheng, Y. Zhao, J. Pang, et al., “Chat-scene: Bridging 3d scene and large language models with object identifiers,” in The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024

  46. [48]

    3d-llm: Injecting the 3d world into large language models

    Y. Hong, H. Zhen, P. Chen, S. Zheng, Y. Du, Z. Chen, and C. Gan, “3d-llm: Injecting the 3d world into large language models,” Advances in Neural Information Processing Systems, vol. 36, pp. 20482–20494, 2023

  47. [50]

    Gpt4scene: Understand 3d scenes from videos with vision-language models

    Z. Qi, Z. Zhang, Y. Fang, J. Wang, and H. Zhao, “Gpt4scene: Understand 3d scenes from videos with vision-language models,” arXiv preprint arXiv:2501.01428, 2025

  48. [51]

    Oryx mllm: On-demand spatial-temporal understanding at arbitrary resolution

    Z. Liu, Y. Dong, Z. Liu, W. Hu, J. Lu, and Y. Rao, “Oryx mllm: On-demand spatial-temporal understanding at arbitrary resolution,” arXiv preprint arXiv:2409.12961, 2024

  49. [52]

    Videoagent: Long-form video understanding with large language model as agent

    X. Wang, Y. Zhang, O. Zohar, and S. Yeung-Levy, “Videoagent: Long-form video understanding with large language model as agent,” in European Conference on Computer Vision, pp. 58–76, Springer, 2024

  50. [53]

    Sti-bench: Are mllms ready for precise spatial-temporal world understanding?

    Y. Li, Y. Zhang, T. Lin, X. Liu, W. Cai, Z. Liu, and B. Zhao, “Sti-bench: Are mllms ready for precise spatial-temporal world understanding?,” arXiv preprint arXiv:2503.23765, 2025

  51. [54]

    St-think: How multimodal large language models reason about 4d worlds from ego-centric videos

    P. Wu, Y. Liu, M. Liu, and J. Shen, “St-think: How multimodal large language models reason about 4d worlds from ego-centric videos,” arXiv preprint arXiv:2503.12542, 2025

  52. [55]

    Vlm4d: Towards spatiotemporal awareness in vision language models

    S. Zhou, A. Vilesov, X. He, Z. Wan, S. Zhang, A. N. D. C. D. Chen, and X. E. W. A. Kadambi, “Vlm4d: Towards spatiotemporal awareness in vision language models,”

  53. [56]

    Attention is all you need

    A. Vaswani, N. M. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, “Attention is all you need,” in Neural Information Processing Systems, 2017

  54. [57]

    Vision Transformers Need Registers

    T. Darcet, M. Oquab, J. Mairal, and P. Bojanowski, “Vision transformers need registers,” ArXiv, vol. abs/2309.16588, 2023

  55. [58]

    Transfer between modalities with metaqueries

    X. Pan, S. N. Shukla, A. Singh, Z. Zhao, S. K. Mishra, J. Wang, Z. Xu, J. Chen, K. Li, F. Juefei-Xu, J. Hou, and S. Xie, “Transfer between modalities with metaqueries,” 2025

  56. [59]

    An analysis of approximations for maximizing submodular set functions—i

    G. L. Nemhauser, L. A. Wolsey, and M. L. Fisher, “An analysis of approximations for maximizing submodular set functions—i,” Mathematical Programming, vol. 14, no. 1, pp. 265–294, 1978

  57. [60]

    D. S. Hochbaum, Approximating covering and packing problems: set cover, vertex cover, independent set, and related problems, pp. 94–143. USA: PWS Publishing Co., 1996

  58. [61]

    Scannet: Richly-annotated 3d reconstructions of indoor scenes

    A. Dai, A. X. Chang, M. Savva, M. Halber, T. A. Funkhouser, and M. Nießner, “Scannet: Richly-annotated 3d reconstructions of indoor scenes,” 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2432–2443, 2017

  59. [62]

    Adam: A Method for Stochastic Optimization

    D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” CoRR, vol. abs/1412.6980, 2014

  60. [63]

    LongVILA: Scaling Long-Context Visual Language Models for Long Videos

    F. Xue, Y. Chen, D. Li, Q. Hu, L. Zhu, X. Li, Y. Fang, H. Tang, S. Yang, Z. Liu, E. He, H. Yin, P. Molchanov, J. Kautz, L. Fan, Y. Zhu, Y. Lu, and S. Han, “Longvila: Scaling long-context visual language models for long videos,” ArXiv, vol. abs/2408.10188, 2024

  61. [64]

    Vila: On pre-training for visual language models

    J. Lin, H. Yin, W. Ping, Y. Lu, P. Molchanov, A. Tao, H. Mao, J. Kautz, M. Shoeybi, and S. Han, “Vila: On pre-training for visual language models,” 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 26679–26689, 2023

  62. [65]

    Long Context Transfer from Language to Vision

    P. Zhang, K. Zhang, B. Li, G. Zeng, J. Yang, Y. Zhang, Z. Wang, H. Tan, C. Li, and Z. Liu, “Long context transfer from language to vision,” ArXiv, vol. abs/2406.16852, 2024

  63. [66]

    Scannet++: A high-fidelity dataset of 3d indoor scenes

    C. Yeshwanth, Y.-C. Liu, M. Nießner, and A. Dai, “Scannet++: A high-fidelity dataset of 3d indoor scenes,” 2023 IEEE/CVF International Conference on Computer Vision (ICCV), pp. 12–22, 2023

  64. [67]

    ARKitscenes - a diverse real-world dataset for 3d indoor scene understanding using mobile RGB-d data

    G. Baruch, Z. Chen, A. Dehghan, T. Dimry, Y. Feigin, P. Fu, T. Gebauer, B. Joffe, D. Kurz, A. Schwartz, and E. Shulman, “ARKitscenes - a diverse real-world dataset for 3d indoor scene understanding using mobile RGB-d data,” in Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 1), 2021

  65. [68]

    An Embodied Generalist Agent in 3D World

    J. Huang, S. Yong, X. Ma, X. Linghu, P. Li, Y. Wang, Q. Li, S.-C. Zhu, B. Jia, and S. Huang, “An embodied generalist agent in 3d world,” ArXiv, vol. abs/2311.12871, 2023

  66. [69]

    3d-vista: Pre-trained transformer for 3d vision and text alignment

    Z. Zhu, X. Ma, Y. Chen, Z. Deng, S. Huang, and Q. Li, “3d-vista: Pre-trained transformer for 3d vision and text alignment,” 2023 IEEE/CVF International Conference on Computer Vision (ICCV), pp. 2899–2909, 2023

  67. [70]

    3d-llm: Injecting the 3d world into large language models

    Y. Hong, H. Zhen, P. Chen, S. Zheng, Y. Du, Z. Chen, and C. Gan, “3d-llm: Injecting the 3d world into large language models,” NeurIPS, 2023

  68. [71]

    Open3D: A Modern Library for 3D Data Processing

    Q.-Y. Zhou, J. Park, and V. Koltun, “Open3d: A modern library for 3d data processing,” ArXiv, vol. abs/1801.09847, 2018

  69. [72]

    Indoor segmentation and support inference from rgbd images

    N. Silberman, D. Hoiem, P. Kohli, and R. Fergus, “Indoor segmentation and support inference from rgbd images,” in European Conference on Computer Vision, 2012

  70. [73]

    Perceptual organization and recognition of indoor scenes from rgb-d images

    S. Gupta, P. Arbeláez, and J. Malik, “Perceptual organization and recognition of indoor scenes from rgb-d images,” 2013 IEEE Conference on Computer Vision and Pattern Recognition, pp. 564–571, 2013

  71. [74]

    Scene-llm: Extending language model for 3d visual understanding and reasoning

    R. Fu, J. Liu, X. Chen, Y. Nie, and W. Xiong, “Scene-llm: Extending language model for 3d visual understanding and reasoning,” ArXiv, vol. abs/2403.11401, 2024