Spatial-MLLM: Boosting MLLM Capabilities in Visual-based Spatial Intelligence
Pith reviewed 2026-05-22 00:54 UTC · model grok-4.3
The pith
Spatial-MLLM boosts spatial reasoning in MLLMs by extracting 3D structure features from 2D inputs using a dual-encoder setup.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that a dual-encoder architecture, with a 2D visual encoder for semantics and a 3D spatial encoder initialized from a feed-forward visual geometry foundation model for structure, allows MLLMs to perform enhanced spatial understanding and reasoning from purely 2D observations. Combined with space-aware frame sampling at inference and training on a multi-source dataset via supervised fine-tuning and GRPO, this leads to state-of-the-art results on visual-based spatial tasks.
What carries the argument
A dual-encoder architecture that pairs a pretrained 2D semantic visual encoder with a 3D spatial encoder initialized from the backbone of a visual geometry foundation model, plus a space-aware frame sampling strategy.
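As a rough illustration of the data flow, the sketch below fuses per-token semantic and spatial features into unified visual tokens. Everything specific here is an assumption made for illustration: the feature widths, the random stand-in projections, and the additive fusion are hypothetical, since the abstract says only that a connector integrates both feature streams.

```python
import random


def linear(in_dim, out_dim, rng):
    """Random (out_dim x in_dim) weight matrix standing in for a learned projection."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)] for _ in range(out_dim)]


def project(weights, vec):
    """Apply a weight matrix to one feature vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in weights]


def fuse(semantic_tokens, spatial_tokens, w_sem, w_spa):
    """Project both streams to a shared width and add them token by token.

    Additive fusion is an assumption; the source does not specify the connector.
    """
    if len(semantic_tokens) != len(spatial_tokens):
        raise ValueError("both encoders must yield the same number of tokens")
    fused = []
    for sem, spa in zip(semantic_tokens, spatial_tokens):
        a = project(w_sem, sem)
        b = project(w_spa, spa)
        fused.append([x + y for x, y in zip(a, b)])
    return fused


rng = random.Random(0)
SEM_DIM, SPA_DIM, LLM_DIM, NUM_TOKENS = 8, 6, 4, 3  # hypothetical sizes

w_sem = linear(SEM_DIM, LLM_DIM, rng)
w_spa = linear(SPA_DIM, LLM_DIM, rng)

# Random vectors stand in for the outputs of the 2D and 3D encoders.
semantic_tokens = [[rng.random() for _ in range(SEM_DIM)] for _ in range(NUM_TOKENS)]
spatial_tokens = [[rng.random() for _ in range(SPA_DIM)] for _ in range(NUM_TOKENS)]

visual_tokens = fuse(semantic_tokens, spatial_tokens, w_sem, w_spa)
print(len(visual_tokens), len(visual_tokens[0]))  # prints "3 4"
```

The point of the sketch is only the shape contract: two encoders with different feature widths yield one token sequence at the language model's width.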
If this is right
- The approach achieves state-of-the-art performance on a range of visual-based spatial understanding and reasoning tasks.
- It functions using only 2D observations without needing additional 3D or 2.5D data.
- Space-aware frame sampling ensures focus on spatially informative frames under limited token budgets for video inputs.
- Training on a constructed multi-source dataset with supervised fine-tuning and GRPO enhances the spatial capabilities.
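GRPO, named in the last bullet, scores each sampled response relative to the other responses drawn for the same prompt. The sketch below shows that group-relative advantage on its own. The binary rewards, the group size, and the use of the population standard deviation are assumptions for illustration; this summary does not describe the paper's reward design.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward by the mean and standard deviation of its group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps keeps the division finite when every response gets the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four responses sampled for one question.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
print(advantages)
```

Responses that beat their group's average receive a positive advantage and the rest a negative one, so no separate value model is needed to form a baseline.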
Where Pith is reading between the lines
- This design may allow similar spatial boosts in other multimodal models by swapping in different geometry pretraining sources.
- Applications in autonomous driving or robotic vision could benefit from running on standard video without extra sensors.
- Future work might test if the same encoder helps in generating 3D descriptions or planning from 2D scenes.
Load-bearing premise
The 3D spatial encoder, when initialized from the visual geometry foundation model, will successfully extract usable 3D structure features from 2D image and video inputs.
What would settle it
Replacing the 3D spatial encoder with a standard encoder that lacks the geometry prior, while holding everything else fixed, and observing no meaningful drop in performance on spatial reasoning benchmarks would falsify the central claim. A large drop would instead support it.
Original abstract
Recent advancements in Multimodal Large Language Models (MLLMs) have significantly enhanced performance on 2D visual tasks. However, improving their spatial intelligence remains a challenge. Existing 3D MLLMs always rely on additional 3D or 2.5D data to incorporate spatial awareness, restricting their utility in scenarios with only 2D inputs, such as images or videos. In this paper, we present Spatial-MLLM, a novel framework for visual-based spatial reasoning from purely 2D observations. Unlike conventional video MLLMs which rely on CLIP-based visual encoders optimized for semantic understanding, our key insight is to unleash the strong structure prior from the feed-forward visual geometry foundation model. Specifically, we propose a dual-encoder architecture: a pretrained 2D visual encoder to extract semantic features, and a 3D spatial encoder, initialized from the backbone of the visual geometry model, to extract 3D structure features. A connector then integrates both features into unified visual tokens for enhanced spatial understanding. Furthermore, we propose a space-aware frame sampling strategy at inference time, which selects the spatially informative frames of a video sequence, ensuring that even under limited token length, the model focuses on frames critical for spatial reasoning. Beyond architecture improvements, we construct a training dataset from multiple sources and train the model on it using supervised fine-tuning and GRPO. Extensive experiments on various real-world datasets demonstrate that Spatial-MLLM achieves state-of-the-art performance in a wide range of visual-based spatial understanding and reasoning tasks. Project page: https://diankun-wu.github.io/Spatial-MLLM/.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Spatial-MLLM, a multimodal LLM framework for visual-based spatial reasoning and understanding that operates exclusively on 2D image and video inputs. It proposes a dual-encoder architecture (pretrained 2D semantic encoder plus a 3D spatial encoder initialized from a feed-forward visual geometry foundation model backbone), a connector to fuse the features, a space-aware frame sampling strategy for inference, and training via supervised fine-tuning followed by GRPO on a multi-source constructed dataset. The central claim is that this yields state-of-the-art performance across a range of real-world spatial tasks without requiring explicit 3D or 2.5D data.
Significance. If the empirical results hold after proper verification, the work would be moderately significant for the MLLM and spatial AI community. It offers a practical route to inject geometric priors into video MLLMs by reusing existing visual geometry backbones rather than training from scratch or requiring 3D supervision, which could broaden applicability to standard 2D-only settings. The combination of architecture, sampling, and GRPO training constitutes a concrete recipe worth testing on additional benchmarks.
major comments (2)
- [Abstract and §3] Abstract and §3 (dual-encoder architecture): The load-bearing claim that the 3D spatial encoder, initialized from the visual geometry foundation model, successfully extracts usable 3D structure features (depth ordering, relative pose, layout) directly from 2D inputs is not isolated or verified. No feature probing, visualization, or controlled ablation (e.g., swapping the 3D encoder for a standard CLIP-style encoder while keeping all other components fixed) is described, leaving open the possibility that gains arise primarily from the training dataset, SFT+GRPO, or space-aware sampling instead.
- [Experimental section] Experimental section (results and ablations): The manuscript asserts SOTA performance on multiple spatial benchmarks, yet the description provides no quantitative baseline tables with exact numbers, ablation breakdowns for each proposed component, dataset statistics, or error bars/statistical tests. This absence prevents assessment of whether the reported improvements are reliable, substantial, or reproducible, directly undermining the central empirical claim.
minor comments (2)
- [Figure 1] Figure 1 (architecture diagram): The connector module and how the two encoder outputs are projected into unified tokens could be illustrated with explicit dimension annotations to improve clarity.
- [§4.2] §4.2 (space-aware frame sampling): The inference-time sampling strategy would benefit from a short pseudocode listing or explicit selection criterion formula rather than a purely textual description.
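To make concrete what an explicit selection criterion for space-aware frame sampling might look like, here is a minimal sketch that treats frame selection as greedy maximum coverage. This is an assumed formulation, not the paper's confirmed procedure: the per-frame voxel sets, the `select_frames` name, and the greedy rule are all hypothetical stand-ins for "select the spatially informative frames under a limited token budget".

```python
def select_frames(frame_voxels, budget):
    """Greedy maximum coverage over frames.

    frame_voxels[i] is the set of scene-voxel ids visible in frame i
    (hypothetical input, e.g. derived from predicted depth and camera pose).
    Returns at most `budget` frame indices in temporal order.
    """
    covered = set()
    chosen = []
    remaining = set(range(len(frame_voxels)))
    while remaining and len(chosen) < budget:
        # Pick the frame that adds the most unseen voxels; ties go to the earliest frame.
        best = max(sorted(remaining), key=lambda i: len(frame_voxels[i] - covered))
        if not frame_voxels[best] - covered:
            break  # no remaining frame adds new coverage
        chosen.append(best)
        covered |= frame_voxels[best]
        remaining.remove(best)
    return sorted(chosen)


# Toy example: five frames, each seeing a few voxel ids.
frame_voxels = [
    {1, 2, 3},
    {2, 3},
    {4, 5},
    {1, 2, 3, 4},
    {6},
]
print(select_frames(frame_voxels, budget=2))  # prints [2, 3]
```

With a budget of two, frame 3 is taken first because it covers four voxels, then frame 2 because it adds voxel 5. Frames 0 and 1 are never chosen since they add nothing new.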
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address each major comment below and describe the changes we will make in the revised manuscript.
- Referee: [Abstract and §3] Abstract and §3 (dual-encoder architecture): The load-bearing claim that the 3D spatial encoder, initialized from the visual geometry foundation model, successfully extracts usable 3D structure features (depth ordering, relative pose, layout) directly from 2D inputs is not isolated or verified. No feature probing, visualization, or controlled ablation (e.g., swapping the 3D encoder for a standard CLIP-style encoder while keeping all other components fixed) is described, leaving open the possibility that gains arise primarily from the training dataset, SFT+GRPO, or space-aware sampling instead.
  Authors: We agree that the manuscript would benefit from more direct evidence isolating the contribution of the 3D spatial encoder. The existing ablations compare the full Spatial-MLLM against variants without the spatial encoder, but do not include a controlled swap with a CLIP-style encoder or feature visualizations. In the revision we will add a new ablation that replaces the 3D spatial encoder with a standard visual encoder while holding all other components fixed, together with qualitative visualizations of the extracted features to illustrate the 3D structural information being captured from 2D inputs.
  Revision: yes
- Referee: [Experimental section] Experimental section (results and ablations): The manuscript asserts SOTA performance on multiple spatial benchmarks, yet the description provides no quantitative baseline tables with exact numbers, ablation breakdowns for each proposed component, dataset statistics, or error bars/statistical tests. This absence prevents assessment of whether the reported improvements are reliable, substantial, or reproducible, directly undermining the central empirical claim.
  Authors: We acknowledge that the current experimental section does not present results at the level of detail required for full reproducibility assessment. We will expand the section to include complete quantitative tables reporting exact numbers for all baselines and our model, component-wise ablation breakdowns, full dataset statistics (including source composition and size), and error bars obtained from multiple runs together with appropriate statistical tests.
  Revision: yes
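One common way to attach uncertainty to a single benchmark accuracy is a percentile bootstrap over per-question correctness. This is a different technique from the multi-run error bars the authors promise, and it is shown only as a sketch. The data below are made up for illustration and are not results from the paper.

```python
import random


def bootstrap_ci(correct, num_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for mean accuracy.

    correct: list of 0/1 outcomes, one per benchmark question.
    """
    rng = random.Random(seed)
    n = len(correct)
    # Resample the questions with replacement and record each resample's accuracy.
    means = sorted(
        sum(rng.choice(correct) for _ in range(n)) / n
        for _ in range(num_resamples)
    )
    lower = means[int((alpha / 2) * num_resamples)]
    upper = means[int((1 - alpha / 2) * num_resamples) - 1]
    return lower, upper


# Hypothetical outcomes: 70 of 100 questions answered correctly.
correct = [1] * 70 + [0] * 30
accuracy = sum(correct) / len(correct)
lo, hi = bootstrap_ci(correct)
print(f"accuracy={accuracy:.2f}, 95% CI=({lo:.2f}, {hi:.2f})")
```

An interval of this kind gives a reader some basis for judging whether a gap between two models on the same benchmark is larger than sampling noise.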
Circularity Check
No circularity: empirical architecture with external benchmarks
The paper presents an empirical architecture (a dual encoder whose 3D spatial encoder is initialized from a visual geometry backbone) and a training recipe (SFT + GRPO on a constructed dataset, plus space-aware sampling). No equations, derivations, or first-principles predictions appear in the provided text. The central claims rest on performance numbers from external real-world datasets rather than on quantities defined inside the paper or on self-citation chains. This matches the default case of a self-contained empirical result evaluated against external benchmarks, warranting a score of 0 with no circular steps flagged.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: A pretrained visual geometry foundation model encodes useful 3D structure priors that can be extracted from 2D observations.
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/AlexanderDuality.lean · alexander_duality_circle_linking · relation: unclear
  Paper passage: "dual-encoder architecture: a pretrained 2D visual encoder to extract semantic features, and a 3D spatial encoder, initialized from the backbone of the visual geometry model, to extract 3D structure features"
- IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · relation: unclear
  Paper passage: "achieves state-of-the-art performance in a wide range of visual-based spatial understanding and reasoning tasks using only 2D observations"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 31 Pith papers
- PinpointQA: A Dataset and Benchmark for Small Object-Centric Spatial Understanding in Indoor Videos
  PinpointQA is the first benchmark dataset for small object-centric spatial understanding in indoor videos, with four tasks showing MLLM capability gaps that improve via supervised fine-tuning.
- SpaceDG: Benchmarking Spatial Intelligence under Visual Degradation
  SpaceDG introduces the first large-scale degradation-aware spatial reasoning dataset using 3D Gaussian Splatting synthesis, showing that visual degradations impair MLLM performance but finetuning on the data improves ...
- CaMo: Camera Motion Grounded Evaluation and Training for Vision-Language Models
  Proposes Spatial Narrative Score (SNS) evaluation for VLMs' camera motion understanding and introduces CaMo model achieving consistent performance on SNS and direct QA.
- ViSRA: A Video-based Spatial Reasoning Agent for Multi-modal Large Language Models
  ViSRA boosts MLLM 3D spatial reasoning performance by up to 28.9% on unseen tasks via a plug-and-play video-based agent that extracts explicit spatial cues from expert models without any post-training.
- Exploring Spatial Intelligence from a Generative Perspective
  Fine-tuning multimodal models on a new synthetic spatial benchmark improves generative spatial compliance on real and synthetic tasks and transfers to better spatial understanding.
- Why MLLMs Struggle to Determine Object Orientations
  Orientation information is recoverable from MLLM visual encoder embeddings via linear regression, contradicting the hypothesis that failures originate in the encoders.
- Any 3D Scene is Worth 1K Tokens: 3D-Grounded Representation for Scene Generation at Scale
  A 3D-grounded autoencoder and diffusion transformer allow direct generation of 3D scenes in an implicit latent space using a fixed 1K-token representation for arbitrary views and resolutions.
- PinpointQA: A Dataset and Benchmark for Small Object-Centric Spatial Understanding in Indoor Videos
  PinpointQA is the first benchmark dataset for small object-centric spatial understanding in indoor videos, with four progressive tasks built from ScanNet data.
- SCP: Spatial Causal Prediction in Video
  SCP defines a new benchmark task for predicting spatial causal outcomes beyond direct observation and shows that 23 leading models lag far behind humans on it.
- SpatialBench: Benchmarking Multimodal Large Language Models for Spatial Cognition
  SpatialBench creates a five-level framework and 15-task benchmark to measure hierarchical spatial reasoning in MLLMs, finding strong basic perception but weak symbolic reasoning, causal inference, and planning.
- AVA-Bench: Atomic Visual Ability Benchmark for Vision Foundation Models
  AVA-Bench evaluates vision foundation models by disentangling 14 atomic visual abilities with aligned training-test distributions to reveal precise ability fingerprints.
- GeoWeaver: Grounding Visual Tokens with Geometric Evidence before Scene Reasoning
  GeoWeaver performs token-adaptive geometric grounding on visual tokens from a multi-level bank prior to language modeling to support better spatio-temporal reasoning.
- See What I Mean: Aligning Vision and Language Representations for Video Fine-grained Object Understanding
  SWIM aligns cross-attention maps from object nouns to ground-truth masks during training on the new NL-Refer dataset to enable text-only fine-grained video object understanding in MLLMs.
- Unlocking Dense Metric Depth Estimation in VLMs
  DepthVLM attaches a depth head to VLMs for native dense metric depth prediction alongside language outputs using a two-stage unified training schedule and a new indoor-outdoor benchmark.
- Unlocking Dense Metric Depth Estimation in VLMs
  DepthVLM converts a standard VLM into a dense metric depth predictor by attaching a lightweight head and training under unified vision-text supervision, outperforming prior VLMs and some pure vision models on a new in...
- SpatialForge: Bootstrapping 3D-Aware Spatial Reasoning from Open-World 2D Images
  SpatialForge synthesizes 10 million spatial QA pairs from in-the-wild 2D images to train VLMs for better depth ordering, layout, and viewpoint-dependent reasoning.
- SpatialFusion: Endowing Unified Image Generation with Intrinsic 3D Geometric Awareness
  SpatialFusion internalizes 3D geometric awareness into unified image generation models by pairing an MLLM with a spatial transformer that produces depth maps to constrain diffusion generation.
- ReplicateAnyScene: Zero-Shot Video-to-3D Composition via Textual-Visual-Spatial Alignment
  ReplicateAnyScene performs fully automated zero-shot video-to-compositional-3D reconstruction by cascading alignments of generic priors from vision foundation models across textual, visual, and spatial dimensions.
- Let Geometry GUIDE: Layer-wise Unrolling of Geometric Priors in Multimodal LLMs
  GUIDE unrolls multi-granularity geometric priors layer-wise into early MLLM layers with gating to improve spatial reasoning and perception.
- EgoMind: Activating Spatial Cognition through Linguistic Reasoning in MLLMs
  EgoMind activates spatial cognition in MLLMs via linguistic Role-Play Caption and Progressive Spatial Analysis, reaching competitive results on VSI-Bench, SPAR-Bench, SITE-Bench and SPBench with only 5K SFT and 20K RL...
- Feeling the Space: Egomotion-Aware Video Representation for Efficient and Accurate 3D Scene Understanding
  Motion-MLLM integrates IMU egomotion data into MLLMs using cascaded filtering and asymmetric fusion to ground visual content in physical trajectories for scale-aware 3D understanding, achieving competitive accuracy at...
- Do MLLMs Really Understand Space? A Mathematical Reasoning Evaluation
  MLLMs show a large gap in spatial mathematical reasoning compared to humans, and a new 10,000-problem dataset helps narrow it through training.
- Vision-aligned Latent Reasoning for Multi-modal Large Language Model
  VaLR generates vision-aligned latent tokens before each reasoning step to preserve perceptual cues, improving VSI-Bench accuracy from 33.0% to 52.9%.
- SpaceDrive: Infusing Spatial Awareness into VLM-based Autonomous Driving
  SpaceDrive integrates 3D positional encodings derived from depth and ego-states into VLMs, replacing digit tokens to improve spatial reasoning and trajectory regression in autonomous driving.
- Geometry Forcing: Marrying Video Diffusion and 3D Representation for Consistent World Modeling
  Geometry Forcing aligns video diffusion representations with geometric foundation model features via angular cosine and scale regression objectives to improve 3D consistency in generated videos.
- VLM-3R: Vision-Language Models Augmented with Instruction-Aligned 3D Reconstruction
  VLM-3R augments VLMs with implicit 3D tokens from monocular video via geometry encoding and 200K+ 3D reconstructive QA pairs, plus a new 138K-pair temporal benchmark, to support spatial and embodied reasoning.
- Thinking with Novel Views: A Systematic Analysis of Generative-Augmented Spatial Intelligence
  Integrating generative novel-view synthesis into LMM reasoning loops improves accuracy on spatial subtasks by 1.3 to 3.9 percentage points across multiple models and tasks.
- SpatialImaginer: Towards Adaptive Visual Imagination for Spatial Reasoning
  SpatialImaginer integrates visual imagination with textual chain-of-thought to improve spatial reasoning robustness in multimodal large language models.
- OpenSpatial: A Principled Data Engine for Empowering Spatial Intelligence
  OpenSpatial supplies a principled open-source data engine and 3-million-sample dataset that raises spatial-reasoning model performance by an average of 19 percent on benchmarks.
- LASAR: Towards Spatio-temporal Reasoning with Latent Cognitive Map
  LASAR pairs a dual-memory system with spatio-temporal contrastive learning to induce latent cognitive maps, reporting 2-3.5% zero-shot gains on VLN-CE and VSI-Bench plus high map self-consistency.
- OpenWorldLib: A Unified Codebase and Definition of Advanced World Models
  OpenWorldLib offers a standardized codebase and definition for world models that combine perception, interaction, and memory to understand and predict the world.
Reference graph
Works this paper leans on
-
[1]
Flamingo: a visual language model for few-shot learning,
J.-B. Alayrac, J. Donahue, P. Luc, A. Miech, I. Barr, Y . Hasson, K. Lenc, A. Mensch, K. Millican, M. Reynolds, et al., “Flamingo: a visual language model for few-shot learning,”NeurIPS, 2022
work page 2022
-
[2]
J. Li, D. Li, S. Savarese, and S. Hoi, “Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models,” inICML, 2023
work page 2023
-
[3]
H. Liu, C. Li, Q. Wu, and Y . J. Lee, “Visual instruction tuning,”NeurIPS, 2024
work page 2024
-
[4]
Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context
G. Team, P. Georgiev, V . I. Lei, R. Burnell, L. Bai, A. Gulati, G. Tanzer, D. Vincent, Z. Pan, S. Wang,et al., “Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context,”arXiv preprint arXiv:2403.05530, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[5]
A. Hurst, A. Lerer, A. P. Goucher, A. Perelman, A. Ramesh, A. Clark, A. Ostrow, A. Welihinda, A. Hayes, A. Radford, et al., “Gpt-4o system card,”arXiv preprint arXiv:2410.21276, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[6]
LLaVA-OneVision: Easy Visual Task Transfer
B. Li, Y . Zhang, D. Guo, R. Zhang, F. Li, H. Zhang, K. Zhang, Y . Li, Z. Liu, and C. Li, “Llava-onevision: Easy visual task transfer,”arXiv preprint arXiv:2408.03326, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[7]
Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks,
Z. Chen, J. Wu, W. Wang, W. Su, G. Chen, S. Xing, M. Zhong, Q. Zhang, X. Zhu, L. Lu,et al., “Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks,” inCVPR, 2024
work page 2024
-
[8]
Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond
J. Bai, S. Bai, S. Yang, S. Wang, S. Tan, P. Wang, J. Lin, C. Zhou, and J. Zhou, “Qwen-vl: A frontier large vision-language model with versatile abilities,”arXiv preprint arXiv:2308.12966, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[9]
Video-LLaVA: Learning United Visual Representation by Alignment Before Projection
B. Lin, B. Zhu, Y . Ye, M. Ning, P. Jin, and L. Yuan, “Video-llava: Learning united visual representation by alignment before projection,”arXiv preprint arXiv:2311.10122, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[10]
VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs
Z. Cheng, S. Leng, H. Zhang, Y . Xin, X. Li, G. Chen, Y . Zhu, W. Zhang, Z. Luo, D. Zhao, et al., “Videollama 2: Advancing spatial-temporal modeling and audio understanding in video-llms,” arXiv preprint arXiv:2406.07476, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[11]
Streaming long video understanding with large language models,
R. Qian, X. Dong, P. Zhang, Y . Zang, S. Ding, D. Lin, and J. Wang, “Streaming long video understanding with large language models,”arXiv preprint arXiv:2405.16009, 2024
-
[12]
LLaVA-Video: Video Instruction Tuning With Synthetic Data
Y . Zhang, J. Wu, W. Li, B. Li, Z. Ma, Z. Liu, and C. Li, “Video instruction tuning with synthetic data,” ArXiv, vol. abs/2410.02713, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[13]
Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution
P. Wang, S. Bai, S. Tan, S. Wang, Z. Fan, J. Bai, K.-Y . Chen, X. Liu, J. Wang, W. Ge, Y . Fan, K. Dang, M. Du, X. Ren, R. Men, D. Liu, C. Zhou, J. Zhou, and J. Lin, “Qwen2-vl: Enhancing vision-language model’s perception of the world at any resolution,”ArXiv, vol. abs/2409.12191, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[14]
S. Bai, K. Chen, X. Liu, J. Wang, W. Ge, S. Song, K. Dang, P. Wang, S. Wang, J. Tang, H. Zhong, Y . Zhu, M. Yang, Z. Li, J. Wan, P. Wang, W. Ding, Z. Fu, Y . Xu, J. Ye, X. Zhang, T. Xie, Z. Cheng, H. Zhang, Z. Yang, H. Xu, and J. Lin, “Qwen2.5-vl technical report,”ArXiv, vol. abs/2502.13923, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[15]
Audiogpt: Understanding and generating speech, music, sound, and talking head,
R. Huang, M. Li, D. Yang, J. Shi, X. Chang, Z. Ye, Y . Wu, Z. Hong, J.-B. Huang, J. Liu, Y . Ren, Z. Zhao, and S. Watanabe, “Audiogpt: Understanding and generating speech, music, sound, and talking head,” ArXiv, vol. abs/2304.12995, 2023
-
[16]
SALMONN: Towards Generic Hearing Abilities for Large Language Models
C. Tang, W. Yu, G. Sun, X. Chen, T. Tan, W. Li, L. Lu, Z. Ma, and C. Zhang, “Salmonn: Towards generic hearing abilities for large language models,”ArXiv, vol. abs/2310.13289, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[17]
Llavanext: Improved reasoning, ocr, and world knowledge, 2024a
Z. Liu, Y . Dong, J. Wang, Z. Liu, W. Hu, J. Lu, and Y . Rao, “Ola: Pushing the frontiers of omni-modal language model with progressive modality alignment,”ArXiv, vol. abs/2502.04328, 2025
-
[18]
Thinking in Space: How Multimodal Large Language Models See, Remember, and Recall Spaces
J. Yang, S. Yang, A. W. Gupta, R. Han, F.-F. Li, and S. Xie, “Thinking in space: How multimodal large language models see, remember, and recall spaces,”ArXiv, vol. abs/2412.14171, 2024
work page Pith review arXiv 2024
-
[19]
Spatialvlm: Endowing vision-language models with spatial reasoning capabilities,
B. Chen, Z. Xu, S. Kirmani, B. Ichter, D. Driess, P. Florence, D. Sadigh, L. J. Guibas, and F. Xia, “Spatialvlm: Endowing vision-language models with spatial reasoning capabilities,” 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 14455–14465, 2024
work page 2024
-
[21]
3d-llava: Towards generalist 3d lmms with omni superpoint transformer,
J. Deng, T. He, L. Jiang, T. Wang, F. Dayoub, and I. Reid, “3d-llava: Towards generalist 3d lmms with omni superpoint transformer,”ArXiv, vol. abs/2501.01163, 2025
-
[22]
Chat-scene: Bridging 3d scene and large language models with object identifiers,
H. Huang, Z. Wang, R. Huang, L. Liu, X. Cheng, Y . Zhao, T. Jin, and Z. Zhao, “Chat-scene: Bridging 3d scene and large language models with object identifiers,” inNeural Information Processing Systems, 2023
work page 2023
-
[23]
Ll3da: Visual interactive instruction tuning for omni-3d understanding reasoning and planning,
S. Chen, X. Chen, C. Zhang, M. Li, G. Yu, H. Fei, H. Zhu, J. Fan, and T. Chen, “Ll3da: Visual interactive instruction tuning for omni-3d understanding reasoning and planning,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 26428–26438, 2024
work page 2024
-
[24]
C. Zhu, T. Wang, W. Zhang, J. Pang, and X. Liu, “Llava-3d: A simple yet effective pathway to empowering lmms with 3d-awareness,”arXiv preprint arXiv:2409.18125, 2024. 18
-
[25]
Video-3d llm: Learning position-aware video representation for 3d scene understanding,
D. Zheng, S. Huang, and L. Wang, “Video-3d llm: Learning position-aware video representation for 3d scene understanding,”ArXiv, vol. abs/2412.00493, 2024
-
[26]
Datacomp: In search of the next generation of multimodal datasets
S. Y . Gadre, G. Ilharco, A. Fang, J. Hayase, G. Smyrnis, T. Nguyen, R. Marten, M. Wortsman, D. Ghosh, J. Zhang, E. Orgad, R. Entezari, G. Daras, S. Pratt, V . Ramanujan, Y . Bitton, K. Marathe, S. Mussmann, R. Vencu, M. Cherti, R. Krishna, P. W. Koh, O. Saukh, A. J. Ratner, S. Song, H. Hajishirzi, A. Farhadi, R. Beaumont, S. Oh, A. G. Dimakis, J. Jitsev,...
-
[27]
Learning transferable visual models from natural language supervision,
A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, G. Krueger, and I. Sutskever, “Learning transferable visual models from natural language supervision,” inInternational Conference on Machine Learning, 2021
work page 2021
-
[28]
Tulip: Towards unified language-image pretraining,
Z. Tang, L. Lian, S. Eisape, X. Wang, R. Herzig, A. Yala, A. Suhr, T. Darrell, and D. M. Chan, “Tulip: Towards unified language-image pretraining,”ArXiv, vol. abs/2503.15485, 2025
-
[29]
Beyond semantics: Rediscovering spatial awareness in vision-language models,
J. Qi, J. Liu, H. Tang, and Z. Zhu, “Beyond semantics: Rediscovering spatial awareness in vision-language models,”ArXiv, vol. abs/2503.17349, 2025
-
[30]
Long-clip: Unlocking the long-text capability of clip,
B. Zhang, P. Zhang, X. wen Dong, Y . Zang, and J. Wang, “Long-clip: Unlocking the long-text capability of clip,” inEuropean Conference on Computer Vision, 2024
work page 2024
-
[31]
Dust3r: Geometric 3d vision made easy,
S. Wang, V . Leroy, Y . Cabon, B. Chidlovskii, and J. Revaud, “Dust3r: Geometric 3d vision made easy,” 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) , pp. 20697–20709, 2023
work page 2024
-
[32]
Vggt: Visual geometry grounded transformer,
J. Wang, M. Chen, N. Karaev, A. Vedaldi, C. Rupprecht, and D. Novotny, “Vggt: Visual geometry grounded transformer,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025
work page 2025
-
[33]
Megasam: Accurate, fast, and robust structure and motion from casual dynamic videos
Z. Li, R. Tucker, F. Cole, Q. Wang, L. Jin, V . Ye, A. Kanazawa, A. Holynski, and N. Snavely, “Megasam: Accurate, fast, and robust structure and motion from casual dynamic videos,”ArXiv, vol. abs/2412.04463, 2024
-
[34]
DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning
DeepSeek-AI, D. Guo, D. Yang, H. Zhang, J.-M. Song, R. Zhang, R. Xu, Q. Zhu, S. Ma, P. Wang, X. Bi, X. Zhang, X. Yu, Y . Wu, Z. F. Wu, Z. Gou, Z. Shao, Z. Li, Z. Gao, A. Liu, B. Xue, B.-L. Wang, B. Wu, B. Feng, C. Lu, C. Zhao, C. Deng, C. Zhang, C. Ruan, D. Dai, D. Chen, D.-L. Ji, E. Li, F. Lin, F. Dai, F. Luo, G. Hao, G. Chen, G. Li, H. Zhang, H. Bao, H....
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[35]
DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models
Z. Shao, P. Wang, Q. Zhu, R. Xu, J.-M. Song, M. Zhang, Y . K. Li, Y . Wu, and D. Guo, “Deepseekmath: Pushing the limits of mathematical reasoning in open language models,” ArXiv, vol. abs/2402.03300, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[36]
Chain-of-Thought Prompting Elicits Reasoning in Large Language Models
J. Wei, X. Wang, D. Schuurmans, M. Bosma, E. H. Chi, F. Xia, Q. Le, and D. Zhou, “Chain of thought prompting elicits reasoning in large language models,”ArXiv, vol. abs/2201.11903, 2022
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[37]
Scanqa: 3d question answering for spatial scene understanding,
D. Azuma, T. Miyanishi, S. Kurita, and M. Kawanabe, “Scanqa: 3d question answering for spatial scene understanding,” 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 19107–19117, 2021
work page 2022
-
[38]
Sqa3d: Situated question answering in 3d scenes,
X. Ma, S. Yong, Z. Zheng, Q. Li, Y . Liang, S.-C. Zhu, and S. Huang, “Sqa3d: Situated question answering in 3d scenes,”ArXiv, vol. abs/2210.07474, 2022
-
[39]
Improved baselines with visual instruction tuning,
H. Liu, C. Li, Y . Li, and Y . J. Lee, “Improved baselines with visual instruction tuning,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 26296–26306, 2024
work page 2024
-
[40]
MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models
D. Zhu, J. Chen, X. Shen, X. Li, and M. Elhoseiny, “Minigpt-4: Enhancing vision-language understanding with advanced large language models,”arXiv preprint arXiv:2304.10592, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[41]
PandaGPT: One Model To Instruction-Follow Them All
Y . Su, T. Lan, H. Li, J. Xu, Y . Wang, and D. Cai, “Pandagpt: One model to instruction-follow them all,” arXiv preprint arXiv:2305.16355, 2023. 19
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[42]
Detgpt: Detect what you need via reasoning
R. Pi, J. Gao, S. Diao, R. Pan, H. Dong, J. Zhang, L. Yao, J. Han, H. Xu, L. Kong, et al., “Detgpt: Detect what you need via reasoning,” arXiv preprint arXiv:2305.14167, 2023
-
[43]
VideoChat: Chat-Centric Video Understanding
K. Li, Y. He, Y. Wang, Y. Li, W. Wang, P. Luo, Y. Wang, L. Wang, and Y. Qiao, “Videochat: Chat-centric video understanding,” arXiv preprint arXiv:2305.06355, 2023
-
[44]
Grounded 3d-llm with referent tokens
Y. Chen, S. Yang, H. Huang, T. Wang, R. Xu, R. Lyu, D. Lin, and J. Pang, “Grounded 3d-llm with referent tokens,” arXiv preprint arXiv:2405.10370, 2024
-
[45]
Chat-3d: Data-efficiently tuning large language model for universal dialogue of 3d scenes
Z. Wang, H. Huang, Y. Zhao, Z. Zhang, and Z. Zhao, “Chat-3d: Data-efficiently tuning large language model for universal dialogue of 3d scenes,” arXiv preprint arXiv:2308.08769, 2023
-
[47]
Chat-scene: Bridging 3d scene and large language models with object identifiers
H. Huang, Y. Chen, Z. Wang, R. Huang, R. Xu, T. Wang, L. Liu, X. Cheng, Y. Zhao, J. Pang, et al., “Chat-scene: Bridging 3d scene and large language models with object identifiers,” in The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024
-
[48]
3d-llm: Injecting the 3d world into large language models
Y. Hong, H. Zhen, P. Chen, S. Zheng, Y. Du, Z. Chen, and C. Gan, “3d-llm: Injecting the 3d world into large language models,” Advances in Neural Information Processing Systems, vol. 36, pp. 20482–20494, 2023
-
[50]
Gpt4scene: Understand 3d scenes from videos with vision-language models
Z. Qi, Z. Zhang, Y. Fang, J. Wang, and H. Zhao, “Gpt4scene: Understand 3d scenes from videos with vision-language models,” arXiv preprint arXiv:2501.01428, 2025
-
[51]
Z. Liu, Y. Dong, Z. Liu, W. Hu, J. Lu, and Y. Rao, “Oryx mllm: On-demand spatial-temporal understanding at arbitrary resolution,” arXiv preprint arXiv:2409.12961, 2024
-
[52]
Videoagent: Long-form video understanding with large language model as agent
X. Wang, Y. Zhang, O. Zohar, and S. Yeung-Levy, “Videoagent: Long-form video understanding with large language model as agent,” in European Conference on Computer Vision, pp. 58–76, Springer, 2024
-
[53]
Y. Li, Y. Zhang, T. Lin, X. Liu, W. Cai, Z. Liu, and B. Zhao, “Sti-bench: Are mllms ready for precise spatial-temporal world understanding?,” arXiv preprint arXiv:2503.23765, 2025
-
[54]
St-think: How multimodal large language models reason about 4d worlds from ego-centric videos
P. Wu, Y. Liu, M. Liu, and J. Shen, “St-think: How multimodal large language models reason about 4d worlds from ego-centric videos,” arXiv preprint arXiv:2503.12542, 2025
-
[55]
Vlm4d: Towards spatiotemporal awareness in vision language models
S. Zhou, A. Vilesov, X. He, Z. Wan, S. Zhang, A. N. D. C. D. Chen, and X. E. W. A. Kadambi, “Vlm4d: Towards spatiotemporal awareness in vision language models,”
-
[56]
A. Vaswani, N. M. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, “Attention is all you need,” in Neural Information Processing Systems, 2017
-
[57]
Vision Transformers Need Registers
T. Darcet, M. Oquab, J. Mairal, and P. Bojanowski, “Vision transformers need registers,” ArXiv, vol. abs/2309.16588, 2023
-
[58]
Transfer between modalities with metaqueries
X. Pan, S. N. Shukla, A. Singh, Z. Zhao, S. K. Mishra, J. Wang, Z. Xu, J. Chen, K. Li, F. Juefei-Xu, J. Hou, and S. Xie, “Transfer between modalities with metaqueries,” 2025
-
[59]
An analysis of approximations for maximizing submodular set functions—i
G. L. Nemhauser, L. A. Wolsey, and M. L. Fisher, “An analysis of approximations for maximizing submodular set functions—i,” Mathematical Programming, vol. 14, no. 1, pp. 265–294, 1978
-
[60]
D. S. Hochbaum, Approximating covering and packing problems: set cover, vertex cover, independent set, and related problems, pp. 94–143. USA: PWS Publishing Co., 1996
-
[61]
Scannet: Richly-annotated 3d reconstructions of indoor scenes
A. Dai, A. X. Chang, M. Savva, M. Halber, T. A. Funkhouser, and M. Nießner, “Scannet: Richly-annotated 3d reconstructions of indoor scenes,” 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2432–2443, 2017
-
[62]
Adam: A Method for Stochastic Optimization
D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” CoRR, vol. abs/1412.6980, 2014
-
[63]
LongVILA: Scaling Long-Context Visual Language Models for Long Videos
F. Xue, Y. Chen, D. Li, Q. Hu, L. Zhu, X. Li, Y. Fang, H. Tang, S. Yang, Z. Liu, E. He, H. Yin, P. Molchanov, J. Kautz, L. Fan, Y. Zhu, Y. Lu, and S. Han, “Longvila: Scaling long-context visual language models for long videos,” ArXiv, vol. abs/2408.10188, 2024
-
[64]
Vila: On pre-training for visual language models
J. Lin, H. Yin, W. Ping, Y. Lu, P. Molchanov, A. Tao, H. Mao, J. Kautz, M. Shoeybi, and S. Han, “Vila: On pre-training for visual language models,” 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 26679–26689, 2023
-
[65]
Long Context Transfer from Language to Vision
P. Zhang, K. Zhang, B. Li, G. Zeng, J. Yang, Y. Zhang, Z. Wang, H. Tan, C. Li, and Z. Liu, “Long context transfer from language to vision,” ArXiv, vol. abs/2406.16852, 2024
-
[66]
Scannet++: A high-fidelity dataset of 3d indoor scenes
C. Yeshwanth, Y.-C. Liu, M. Nießner, and A. Dai, “Scannet++: A high-fidelity dataset of 3d indoor scenes,” 2023 IEEE/CVF International Conference on Computer Vision (ICCV), pp. 12–22, 2023
-
[67]
G. Baruch, Z. Chen, A. Dehghan, T. Dimry, Y. Feigin, P. Fu, T. Gebauer, B. Joffe, D. Kurz, A. Schwartz, and E. Shulman, “ARKitscenes - a diverse real-world dataset for 3d indoor scene understanding using mobile RGB-d data,” in Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 1), 2021
-
[68]
An Embodied Generalist Agent in 3D World
J. Huang, S. Yong, X. Ma, X. Linghu, P. Li, Y. Wang, Q. Li, S.-C. Zhu, B. Jia, and S. Huang, “An embodied generalist agent in 3d world,” ArXiv, vol. abs/2311.12871, 2023
-
[69]
3d-vista: Pre-trained transformer for 3d vision and text alignment
Z. Zhu, X. Ma, Y. Chen, Z. Deng, S. Huang, and Q. Li, “3d-vista: Pre-trained transformer for 3d vision and text alignment,” 2023 IEEE/CVF International Conference on Computer Vision (ICCV), pp. 2899–2909, 2023
-
[70]
3d-llm: Injecting the 3d world into large language models
Y. Hong, H. Zhen, P. Chen, S. Zheng, Y. Du, Z. Chen, and C. Gan, “3d-llm: Injecting the 3d world into large language models,” NeurIPS, 2023
-
[71]
Open3D: A Modern Library for 3D Data Processing
Q.-Y. Zhou, J. Park, and V. Koltun, “Open3d: A modern library for 3d data processing,” ArXiv, vol. abs/1801.09847, 2018
-
[72]
Indoor segmentation and support inference from rgbd images
N. Silberman, D. Hoiem, P. Kohli, and R. Fergus, “Indoor segmentation and support inference from rgbd images,” in European Conference on Computer Vision, 2012
-
[73]
Perceptual organization and recognition of indoor scenes from rgb-d images
S. Gupta, P. Arbeláez, and J. Malik, “Perceptual organization and recognition of indoor scenes from rgb-d images,” 2013 IEEE Conference on Computer Vision and Pattern Recognition, pp. 564–571, 2013
-
[74]
R. Fu, J. Liu, X. Chen, Y. Nie, and W. Xiong, “Scene-llm: Extending language model for 3d visual understanding and reasoning,” ArXiv, vol. abs/2403.11401, 2024