pith. sign in

arxiv: 2505.17015 · v2 · pith:YBT6OK6Snew · submitted 2025-05-22 · 💻 cs.CV · cs.CL

Multi-SpatialMLLM: Multi-Frame Spatial Understanding with Multi-Modal Large Language Models

Pith reviewed 2026-05-25 08:35 UTC · model grok-4.3

classification 💻 cs.CV cs.CL
keywords Multi-SpatialMLLMmulti-frame spatial understandingMultiSPA datasetMLLMsdepth perceptionvisual correspondencedynamic perceptionrobotics
0
0 comments X

The pith

A framework trains MLLMs on depth, correspondence and dynamic perception across frames using a 27-million-sample dataset to enable multi-frame spatial reasoning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that multi-modal large language models can be extended from single-image spatial tasks to multi-frame reasoning required for physical-world applications. It does so by integrating three core skills and training them on a large dataset built through a dedicated pipeline. The resulting model records gains on a new uniform benchmark and shows utility as a robotics reward annotator. If correct, this removes a key barrier that has kept MLLMs from handling video-based spatial problems at scale.

Core claim

Integrating depth perception, visual correspondence, and dynamic perception into MLLM training on the MultiSPA dataset of more than 27 million samples from diverse 3D and 4D scenes produces Multi-SpatialMLLM, which delivers significant gains over baselines and proprietary systems while exhibiting multi-task benefits, emergent spatial capabilities, and the ability to annotate multi-frame rewards for robotics.

What carries the argument

The data pipeline that generates the MultiSPA dataset to train the three spatial skills across multiple frames inside an MLLM.

If this is right

  • Multi-frame perception becomes scalable and generalizable across 3D and 4D scenes.
  • Multi-task benefits and emergent spatial capabilities appear in challenging scenarios.
  • The model functions as a multi-frame reward annotator for robotics applications.
  • A single benchmark with uniform metrics now evaluates a wide spectrum of spatial tasks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same skill-integration approach could transfer to other temporal reasoning domains such as action prediction.
  • Reducing reliance on the full 27 million samples while preserving gains would make the method more practical.
  • The benchmark could serve as a shared testbed for comparing future multi-frame MLLM variants.

Load-bearing premise

The MultiSPA dataset of synthetic and collected samples covers the full diversity and complexity of real-world multi-frame spatial tasks without distribution shifts that would invalidate the reported gains.

What would settle it

A sharp drop in performance when the trained model is tested on real-world multi-frame video data collected independently of the described pipeline would falsify the generalizability claim.

Figures

Figures reproduced from arXiv: 2505.17015 by Fu-Jen Chu, Hao Tang, Kevin J. Liang, Matt Feiszli, Runsen Xu, Weiyao Wang, Xiaodong Wang, Xingyu Chen.

Figure 1
Figure 1. Figure 1: We present Multi-SpatialMLLM, a model capable of multi-frame spatial understanding, a capability overlooked by previous spatial understanding research. Multi-SpatialMLLM can support different types of input referencing and outputs for various tasks. 1. Introduction Recent years have witnessed tremendous advances in multi-modal large language models (MLLMs), which have evolved into versatile AI assistants c… view at source ↗
Figure 2
Figure 2. Figure 2: Overlap ratio calculation of image pairs. We maintain all visible points of image i, denoted as Pi , by selecting those whose projected coordinates (u, v) lie within the image bounds and are not occluded: 0 < p C i [2] < Di(u, v). (3) Depth perception data generation. To create depth per￾ception data, we randomly sample images for each scene. For each image Ii , we sample one or two visible points from Pi … view at source ↗
Figure 3
Figure 3. Figure 3: Visualization of rigid body segmentation results. Image pairs sampling. Given a selected point that ap￾pears in T frames, one could form up to T(T −1) 2 image pairs. However, similar to ScanNet, these pairs exhibit a long-tailed distribution of motion magnitudes. We therefore bin the image pairs by the object translation distances and perform balanced sampling for each bin, ensuring diversity across small … view at source ↗
Figure 4
Figure 4. Figure 4: Scalability of Multi-SpatialMLLM. tion, and camera translation direction. Even worse, they ef￾fectively fail entirely on tasks requiring quantitative outputs such as coordinate-based visual correspondence and camera or object movement vectors. By contrast, our Multi-SpatialMLLM significantly im￾proves performance across all tasks, achieving an average 36% gain over the base model. On relatively easier qual… view at source ↗
Figure 5
Figure 5. Figure 5: Demonstrations of Multi-SpatialMLLM in zero-shot robotics tasks. Our model accurately identifies static objects and predicts movement distances, aligning with the ground truth. It exhibits potential for novel applications like multi-frame reward annotation. ditional scalability dimension beyond merely data volume and model capacity—namely, task diversity. 4.3. Emergence of spatial understanding We have sho… view at source ↗
Figure 6
Figure 6. Figure 6: Data samples of depth perception. 15 [PITH_FULL_IMAGE:figures/full_fig_p015_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: Data samples of visual correspondence. human: Image-1: <image> Image-2: <image> Observe and describe the camera's motion from one image to the other. The movement should be relative to the first image. Note that the objects in the images are assumed to be static. Between these two frames, is the camera more to the left or to the right now? gpt: It shows a noticeable move `left` along the horizontal axis. C… view at source ↗
Figure 8
Figure 8. Figure 8: Data samples of camera movement-translation direction. 16 [PITH_FULL_IMAGE:figures/full_fig_p016_8.png] view at source ↗
Figure 9
Figure 9. Figure 9: Data samples of camera movement-orientation direction. human: Image-1: <image> Image-2: <image> Determine the change in camera viewpoint across these two images. The movement should be relative to the first image. Note that the objects in the images are assumed to be static. How many degrees does the camera revolve side-to-side from its original orientation? gpt: The camera's left-right rotation measures a… view at source ↗
Figure 10
Figure 10. Figure 10: Data samples of camera movement-orientation degree. 17 [PITH_FULL_IMAGE:figures/full_fig_p017_10.png] view at source ↗
Figure 11
Figure 11. Figure 11: Data samples of camera movement-translation distance and vector. 18 [PITH_FULL_IMAGE:figures/full_fig_p018_11.png] view at source ↗
Figure 12
Figure 12. Figure 12: Data samples of object perception. 19 [PITH_FULL_IMAGE:figures/full_fig_p019_12.png] view at source ↗
Figure 13
Figure 13. Figure 13: Data samples of object movement-distance. 20 [PITH_FULL_IMAGE:figures/full_fig_p020_13.png] view at source ↗
Figure 14
Figure 14. Figure 14: Data samples of object movement-vector. 21 [PITH_FULL_IMAGE:figures/full_fig_p021_14.png] view at source ↗
read the original abstract

Multi-modal large language models (MLLMs) have rapidly advanced in visual tasks, yet their spatial understanding remains limited to single images, leaving them ill-suited for physical-world applications that require multi-frame reasoning. In this paper, we propose a framework to equip MLLMs with multi-frame spatial understanding by integrating fundamental spatial skills, including depth perception, visual correspondence, and dynamic perception. We design a novel data pipeline and collect the MultiSPA dataset of more than 27 million samples spanning diverse 3D and 4D scenes to enable training. Alongside MultiSPA, we introduce a comprehensive benchmark that tests a wide spectrum of spatial tasks under uniform metrics. Our resulting model, Multi-SpatialMLLM, achieves significant gains over baselines and proprietary systems, demonstrating scalable and generalizable multi-frame perception. We further observe multi-task benefits and emergent spatial capabilities in challenging scenarios, and showcase how our model can serve as a multi-frame reward annotator for robotics.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The paper proposes a framework to extend multi-modal large language models (MLLMs) to multi-frame spatial understanding by integrating depth perception, visual correspondence, and dynamic perception. It introduces a data pipeline to create the MultiSPA dataset (>27M samples across diverse 3D/4D scenes) and an accompanying benchmark, resulting in the Multi-SpatialMLLM model that reports significant gains over baselines and proprietary systems, plus multi-task benefits, emergent capabilities, and use as a robotics reward annotator.

Significance. If the reported gains prove robust and generalizable beyond the training distribution, the work would address a clear limitation in current MLLMs for physical-world applications. The scale of the MultiSPA dataset and the unified benchmark are notable contributions that could support further research in multi-frame perception.

major comments (2)
  1. [Abstract (data pipeline and benchmark description)] The central claim of 'significant gains' demonstrating 'scalable and generalizable multi-frame perception' rests on the MultiSPA dataset and benchmark being representative of real-world conditions. The abstract provides no details on how the synthetic data pipeline accounts for variations such as sensor noise, complex lighting, long-term occlusions, or non-rigid motion, nor does it reference any external real-world multi-frame benchmark for validation.
  2. [Abstract (model and results description)] Without ablations, error bars, or explicit integration details for the three spatial skills (depth, correspondence, dynamic perception), it is impossible to determine whether the performance improvements are additive, whether post-hoc filtering was applied, or whether the gains could be reproduced on held-out real data.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment below and will revise the abstract and related sections to improve clarity on the data pipeline and experimental details.

read point-by-point responses
  1. Referee: [Abstract (data pipeline and benchmark description)] The central claim of 'significant gains' demonstrating 'scalable and generalizable multi-frame perception' rests on the MultiSPA dataset and benchmark being representative of real-world conditions. The abstract provides no details on how the synthetic data pipeline accounts for variations such as sensor noise, complex lighting, long-term occlusions, or non-rigid motion, nor does it reference any external real-world multi-frame benchmark for validation.

    Authors: We agree the abstract is concise and omits these specifics. The full manuscript (Section 3) describes the data pipeline, which draws from multiple 3D/4D sources to incorporate simulated variations in lighting, occlusions, non-rigid motion, and sensor-like noise. The benchmark is designed to test generalizability across diverse scenes. To address the concern directly, we will revise the abstract to briefly note these pipeline elements and the benchmark's scope. We will also add a short discussion referencing comparisons to available real-world multi-frame tasks. revision: yes

  2. Referee: [Abstract (model and results description)] Without ablations, error bars, or explicit integration details for the three spatial skills (depth, correspondence, dynamic perception), it is impossible to determine whether the performance improvements are additive, whether post-hoc filtering was applied, or whether the gains could be reproduced on held-out real data.

    Authors: The abstract summarizes outcomes due to length constraints; the full paper provides the requested details. Section 4 includes ablations on skill integration (showing additive benefits), error bars in all result tables, and explicit model architecture descriptions for combining depth, correspondence, and dynamic perception without post-hoc filtering. Held-out evaluations are reported within the benchmark splits. We will revise the abstract to reference these analyses for better context. revision: partial

Circularity Check

0 steps flagged

No circularity: empirical training on generated dataset with independent benchmark evaluation

full rationale

The paper's central claims rest on collecting/generating the MultiSPA dataset via a described pipeline, training an MLLM, and reporting empirical gains on an introduced benchmark. No equations, fitted parameters renamed as predictions, or self-citation chains are present that would make results equivalent to inputs by construction. The work is self-contained as standard supervised training plus evaluation; benchmark gains are not forced by definition or prior author results. This matches the default expectation for non-circular empirical ML papers.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only view yields no explicit free parameters, axioms, or invented entities beyond the implicit assumption that the three listed spatial skills plus large-scale synthetic data suffice for the task.

pith-pipeline@v0.9.0 · 5722 in / 1066 out tokens · 23194 ms · 2026-05-25T08:35:38.821345+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Forward citations

Cited by 8 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Token Warping Helps MLLMs Look from Nearby Viewpoints

    cs.CV 2026-04 unverdicted novelty 7.0

    Backward token warping in ViT-based MLLMs enables reliable reasoning from nearby viewpoints by preserving semantic coherence better than pixel-wise warping or fine-tuning baselines.

  2. SpatialMosaic: A Multiview VLM Dataset for Partial Visibility

    cs.CV 2025-12 unverdicted novelty 7.0

    SpatialMosaic introduces a 2M-pair multi-view QA dataset and 1M-pair benchmark for MLLMs on spatial reasoning under partial visibility, plus a hybrid baseline that integrates 3D reconstruction models as geometry encoders.

  3. 4D-RGPT: Toward Region-level 4D Understanding via Perceptual Distillation

    cs.CV 2025-12 unverdicted novelty 7.0

    4D-RGPT uses perceptual 4D distillation to boost region-level 4D perception in multimodal LLMs and reports gains on existing and new video QA benchmarks.

  4. Unlocking Dense Metric Depth Estimation in VLMs

    cs.CV 2026-05 unverdicted novelty 6.0

    DepthVLM attaches a depth head to VLMs for native dense metric depth prediction alongside language outputs using a two-stage unified training schedule and a new indoor-outdoor benchmark.

  5. Unlocking Dense Metric Depth Estimation in VLMs

    cs.CV 2026-05 unverdicted novelty 6.0

    DepthVLM converts a standard VLM into a dense metric depth predictor by attaching a lightweight head and training under unified vision-text supervision, outperforming prior VLMs and some pure vision models on a new in...

  6. Cambrian-S: Towards Spatial Supersensing in Video

    cs.CV 2025-11 unverdicted novelty 6.0

    Cambrian-S introduces VSI-SUPER benchmarks for long-horizon spatial recall and counting, shows data scaling yields 30% gains on existing tests, and demonstrates a self-supervised next-latent predictor using surprise o...

  7. From Where Things Are to What They Are For: Benchmarking Spatial-Functional Intelligence in Multimodal LLMs

    cs.CV 2026-05 unverdicted novelty 5.0

    SFI-Bench shows current multimodal LLMs struggle to integrate spatial memory with functional reasoning and external knowledge in video tasks.

  8. OpenSpatial: A Principled Data Engine for Empowering Spatial Intelligence

    cs.CL 2026-04 unverdicted novelty 5.0

    OpenSpatial supplies a principled open-source data engine and 3-million-sample dataset that raises spatial-reasoning model performance by an average of 19 percent on benchmarks.

Reference graph

Works this paper leans on

70 extracted references · 70 canonical work pages · cited by 7 Pith papers · 14 internal anchors

  1. [1]

    Flamingo: a visual language model for few-shot learning

    Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, et al. Flamingo: a visual language model for few-shot learning. In NeurIPS,

  2. [2]

    Claude 3.5 sonnet

    anthropic. Claude 3.5 sonnet. https : / / www . anthropic . com / news / claude - 3 - 5 - sonnet,

  3. [3]

    Qwen Technical Report

    Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xi- aodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, Binyuan Hui, Luo Ji, Mei Li, Junyang Lin, Runji Lin, Day- iheng Liu, Gao Liu, Chengqiang Lu, Keming Lu, Jianxin Ma, Rui Men, Xingzhang Ren, Xuancheng Ren, Chuanqi Tan, Sinan Tan, Jianhong Tu, Peng Wang, Shijie Wang, Wei Wang, Shengguang Wu, Benfe...

  4. [4]

    Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond

    Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jin- gren Zhou. Qwen-vl: A versatile vision-language model for understanding, localization, text reading, and beyond. arXiv:2308.12966, 2023. 2

  5. [5]

    Drivetrack: A benchmark for long-range point tracking in real-world videos

    Arjun Balasingam, Joseph Chandler, Chenning Li, Zhoutong Zhang, and Hari Balakrishnan. Drivetrack: A benchmark for long-range point tracking in real-world videos. In CVPR,

  6. [6]

    Per- ception tokens enhance visual reasoning in multimodal lan- guage models

    Mahtab Bigverdi, Zelun Luo, Cheng-Yu Hsieh, Ethan Shen, Dongping Chen, Linda G Shapiro, and Ranjay Krishna. Per- ception tokens enhance visual reasoning in multimodal lan- guage models. arXiv:2412.03548, 2024. 3

  7. [7]

    Pi0: A vision-language- action flow model for general robot control

    Kevin Black, Noah Brown, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, Lachy Groom, Karol Hausman, Brian Ichter, Szymon Jakubczak, Tim Jones, Liyiming Ke, Sergey Levine, Adrian Li-Bell, Mo- hith Mothukuri, Suraj Nair, Karl Pertsch, Lucy Xiaoyang Shi, James Tanner, Quan Vuong, Anna Walling, Hao- huan Wang, and Ury Zhilinsky. Pi0: A...

  8. [8]

    Spatialbot: Pre- cise spatial understanding with vision language models

    Wenxiao Cai, Iaroslav Ponomarenko, Jianhao Yuan, Xiaoqi Li, Wankou Yang, Hao Dong, and Bo Zhao. Spatialbot: Pre- cise spatial understanding with vision language models. In ICRA, 2025. 3

  9. [9]

    Spatialvlm: Endow- ing vision-language models with spatial reasoning capabili- ties

    Boyuan Chen, Zhuo Xu, Sean Kirmani, Brain Ichter, Dorsa Sadigh, Leonidas Guibas, and Fei Xia. Spatialvlm: Endow- ing vision-language models with spatial reasoning capabili- ties. In CVPR, 2024. 1, 2, 3, 8

  10. [10]

    Sharegpt4v: Improving large multi-modal models with better captions

    Lin Chen, Jinsong Li, Xiaoyi Dong, Pan Zhang, Conghui He, Jiaqi Wang, Feng Zhao, and Dahua Lin. Sharegpt4v: Improving large multi-modal models with better captions. In ECCV, 2024. 2

  11. [11]

    Are We on the Right Way for Evaluating Large Vision-Language Models?

    Lin Chen, Jinsong Li, Xiaoyi Dong, Pan Zhang, Yuhang Zang, Zehui Chen, Haodong Duan, Jiaqi Wang, Yu Qiao, Dahua Lin, et al. Are we on the right way for evaluating large vision-language models? arXiv:2403.20330, 2024. 7

  12. [12]

    How far are we to gpt-4v? closing the gap to commercial multimodal models with open-source suites

    Zhe Chen, Weiyun Wang, Hao Tian, Shenglong Ye, Zhang- wei Gao, Erfei Cui, Wenwen Tong, Kongzhi Hu, Jiapeng Luo, Zheng Ma, et al. How far are we to gpt-4v? closing the gap to commercial multimodal models with open-source suites. Science China Information Sciences , 2024. 2, 4

  13. [13]

    Internvl: Scaling up vision foundation mod- els and aligning for generic visual-linguistic tasks

    Zhe Chen, Jiannan Wu, Wenhai Wang, Weijie Su, Guo Chen, Sen Xing, Muyan Zhong, Qinglong Zhang, Xizhou Zhu, Lewei Lu, et al. Internvl: Scaling up vision foundation mod- els and aligning for generic visual-linguistic tasks. In CVPR,

  14. [14]

    Spatial- rgpt: Grounded spatial reasoning in vision-language models

    An-Chieh Cheng, Hongxu Yin, Yang Fu, Qiushan Guo, Rui- han Yang, Jan Kautz, Xiaolong Wang, and Sifei Liu. Spatial- rgpt: Grounded spatial reasoning in vision-language models. In NeurIPS, 2024. 1, 2, 3, 8

  15. [15]

    sharegpt4o

    Erfei Cui, Yinan He, Zheng Ma, Zhe Chen, Hao Tian, Weiyun Wang, Kunchang Li, Yi Wang, Wenhai Wang, Xizhou Zhu, Lewei Lu, Tong Lu, Yali Wang, Limin Wang, Yu Qiao, and Jifeng Dai. sharegpt4o. https : / / sharegpt4o.github.io/, 2024. 2

  16. [16]

    Scannet: Richly-annotated 3d reconstructions of indoor scenes

    Angela Dai, Angel X Chang, Manolis Savva, Maciej Hal- ber, Thomas Funkhouser, and Matthias Nießner. Scannet: Richly-annotated 3d reconstructions of indoor scenes. In CVPR, 2017. 2, 4, 5, 7, 9

  17. [17]

    Instructblip: Towards general- purpose vision-language models with instruction tuning

    Wenliang Dai, Junnan Li, Dongxu Li, Anthony Meng Huat Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale N Fung, and Steven Hoi. Instructblip: Towards general- purpose vision-language models with instruction tuning. NeurIPS, 20243. 1, 2, 4

  18. [18]

    Gemini 2.0: our new ai model for the agentic era

    Google Deepmind. Gemini 2.0: our new ai model for the agentic era. https://blog.google/technology/ google - deepmind / google - gemini - ai - update - december - 2024 / ceo - message, 2024. 6

  19. [19]

    PaLM-E: An Embodied Multimodal Language Model

    Danny Driess, Fei Xia, Mehdi SM Sajjadi, Corey Lynch, Aakanksha Chowdhery, Brian Ichter, Ayzaan Wahid, Jonathan Tompson, Quan Vuong, Tianhe Yu, et al. Palm-e: An embodied multimodal language model. arXiv:2303.03378, 2023. 2, 3

  20. [20]

    Embspatial-bench: Benchmarking spatial un- derstanding for embodied tasks with large vision-language models

    Mengfei Du, Binhao Wu, Zejun Li, Xuanjing Huang, and Zhongyu Wei. Embspatial-bench: Benchmarking spatial un- derstanding for embodied tasks with large vision-language models. arXiv:2406.05756, 2024. 3

  21. [21]

    Blink: Multimodal large language models can see but not perceive

    Xingyu Fu, Yushi Hu, Bangzheng Li, Yu Feng, Haoyu Wang, Xudong Lin, Dan Roth, Noah A Smith, Wei-Chiu Ma, and Ranjay Krishna. Blink: Multimodal large language models can see but not perceive. In ECCV, 2024. 2, 3, 7

  22. [22]

    Multimodal-gpt: A vision and language model for dialogue with humans

    Tao Gong, Chengqi Lyu, Shilong Zhang, Yudong Wang, Miao Zheng, Qian Zhao, Kuikun Liu, Wenwei Zhang, Ping Luo, and Kai Chen. Multimodal-gpt: A vision and language model for dialogue with humans. arXiv:2305.04790, 2023. 2

  23. [23]

    22 Vizwiz grand challenge: Answering visual questions from blind people

    Danna Gurari, Qing Li, Abigale J Stangl, Anhong Guo, Chi Lin, Kristen Grauman, Jiebo Luo, and Jeffrey P Bigham. 22 Vizwiz grand challenge: Answering visual questions from blind people. In CVPR, 2018. 7

  24. [24]

    Imagebind-llm: Multi-modality instruction tun- ing

    Jiaming Han, Renrui Zhang, Wenqi Shao, Peng Gao, Peng Xu, Han Xiao, Kaipeng Zhang, Chris Liu, Song Wen, Ziyu Guo, et al. Imagebind-llm: Multi-modality instruction tun- ing. arXiv:2309.03905, 2023. 2

  25. [25]

    Hartley and A

    R. Hartley and A. Zisserman. Multiple View Geometry in Computer Vision. Cambridge University Press, 2003. 2

  26. [26]

    Lora: Low-rank adaptation of large language models

    Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen- Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen, et al. Lora: Low-rank adaptation of large language models. ICLR,

  27. [27]

    Metric3d v2: A versatile monocular geomet- ric foundation model for zero-shot metric depth and surface normal estimation

    Mu Hu, Wei Yin, Chi Zhang, Zhipeng Cai, Xiaoxiao Long, Hao Chen, Kaixuan Wang, Gang Yu, Chunhua Shen, and Shaojie Shen. Metric3d v2: A versatile monocular geomet- ric foundation model for zero-shot metric depth and surface normal estimation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2024. 2

  28. [28]

    Language Is Not All You Need: Aligning Perception with Language Models

    Shaohan Huang, Li Dong, Wenhui Wang, Yaru Hao, Saksham Singhal, Shuming Ma, Tengchao Lv, Lei Cui, Owais Khan Mohammed, Qiang Liu, et al. Language is not all you need: Aligning perception with language mod- els. arXiv:2302.14045, 2023. 1, 2

  29. [29]

    Open- clip, 2021

    Gabriel Ilharco, Mitchell Wortsman, Ross Wightman, Cade Gordon, Nicholas Carlini, Rohan Taori, Achal Dave, Vaishaal Shankar, Hongseok Namkoong, John Miller, Han- naneh Hajishirzi, Ali Farhadi, and Ludwig Schmidt. Open- clip, 2021. 2

  30. [30]

    Clevr: A diagnostic dataset for compositional language and elementary visual reasoning

    Justin Johnson, Bharath Hariharan, Laurens Van Der Maaten, Li Fei-Fei, C Lawrence Zitnick, and Ross Girshick. Clevr: A diagnostic dataset for compositional language and elementary visual reasoning. In CVPR, 2017. 3

  31. [31]

    Panoptic studio: A massively multiview system for social motion capture

    Hanbyul Joo, Hao Liu, Lei Tan, Lin Gui, Bart Nabbe, Iain Matthews, Takeo Kanade, Shohei Nobuhara, and Yaser Sheikh. Panoptic studio: A massively multiview system for social motion capture. In Proceedings of the IEEE interna- tional conference on computer vision , 2015. 2, 4, 5, 7, 9

  32. [32]

    Openvla: An open- source vision-language-action model

    Moo Jin Kim, Karl Pertsch, Siddharth Karamcheti, Ted Xiao, Ashwin Balakrishna, Suraj Nair, Rafael Rafailov, Ethan Fos- ter, Grace Lam, Pannag Sanketi, et al. Openvla: An open- source vision-language-action model. In CoRL, 2024. 3

  33. [33]

    Segment any- thing

    Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer White- head, Alexander C Berg, Wan-Yen Lo, et al. Segment any- thing. In ICCV, 2023. 3

  34. [34]

    Tapvid-3d: A benchmark for tracking any point in 3d

    Skanda Koppula, Ignacio Rocco, Yi Yang, Joe Heyward, Jo˜ao Carreira, Andrew Zisserman, Gabriel Brostow, and Carl Doersch. Tapvid-3d: A benchmark for tracking any point in 3d. arXiv preprint arXiv:2407.05921, 2024. 2, 4, 5

  35. [35]

    The open images dataset v4: Unified image classification, object detection, and visual relationship detection at scale

    Alina Kuznetsova, Hassan Rom, Neil Alldrin, Jasper Ui- jlings, Ivan Krasin, Jordi Pont-Tuset, Shahab Kamali, Stefan Popov, Matteo Malloci, Alexander Kolesnikov, et al. The open images dataset v4: Unified image classification, object detection, and visual relationship detection at scale. IJCV,

  36. [36]

    LLaVA-OneVision: Easy Visual Task Transfer

    Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Peiyuan Zhang, Yanwei Li, Zi- wei Liu, et al. Llava-onevision: Easy visual task transfer. arXiv:2408.03326, 2024. 1, 5

  37. [37]

    What Matters in Building Vision-Language-Action Models for Generalist Robots

    Xinghang Li, Peiyan Li, Minghuan Liu, Dong Wang, Jirong Liu, Bingyi Kang, Xiao Ma, Tao Kong, Hanbo Zhang, and Huaping Liu. Towards generalist robot policies: What matters in building vision-language-action models. arXiv preprint arXiv:2412.14058, 2024. 1, 3

  38. [38]

    Evaluating object hallucination in large vision-language models

    Yifan Li, Yifan Du, Kun Zhou, Jinpeng Wang, Wayne Xin Zhao, and Ji-Rong Wen. Evaluating object hallucination in large vision-language models. In EMNLP, 2023. 7

  39. [39]

    Reasoning paths with reference objects elicit quan- titative spatial reasoning in large vision-language models

    Yuan-Hong Liao, Rafid Mahmood, Sanja Fidler, and David Acuna. Reasoning paths with reference objects elicit quan- titative spatial reasoning in large vision-language models. arXiv:2409.09788, 2024. 2, 3

  40. [40]

    Vila: On pre-training for visual language models

    Ji Lin, Hongxu Yin, Wei Ping, Pavlo Molchanov, Moham- mad Shoeybi, and Song Han. Vila: On pre-training for visual language models. In CVPR, 2024. 2, 5

  41. [41]

    Visual spatial reasoning

    Fangyu Liu, Guy Emerson, and Nigel Collier. Visual spatial reasoning. TACL, 2023. 2, 3

  42. [42]

    Visual instruction tuning

    Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. In NeurIPS, 2023. 2

  43. [43]

    Improved baselines with visual instruction tuning

    Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning. In CVPR,

  44. [44]

    MMBench: Is Your Multi-modal Model an All-around Player?

    Yuan Liu, Haodong Duan, Yuanhan Zhang, Bo Li, Songyang Zhang, Wangbo Zhao, Yike Yuan, Jiaqi Wang, Conghui He, Ziwei Liu, Kai Chen, and Dahua Lin. Mmbench: Is your multi-modal model an all-around player? arXiv:2307.06281,

  45. [45]

    Decoupled weight decay regularization

    Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. ICLR, 2019. 6

  46. [46]

    DeepSeek-VL: Towards Real-World Vision-Language Understanding

    Haoyu Lu, Wen Liu, Bo Zhang, Bingxuan Wang, Kai Dong, Bo Liu, Jingxiang Sun, Tongzheng Ren, Zhuoshu Li, Hao Yang, et al. Deepseek-vl: towards real-world vision- language understanding. arXiv:2403.05525, 2024. 2

  47. [47]

    Mathvista: Evaluating mathemat- ical reasoning of foundation models in visual contexts

    Pan Lu, Hritik Bansal, Tony Xia, Jiacheng Liu, Chunyuan Li, Hannaneh Hajishirzi, Hao Cheng, Kai-Wei Chang, Michel Galley, and Jianfeng Gao. Mathvista: Evaluating mathemat- ical reasoning of foundation models in visual contexts. In ICLR, 2024. 7

  48. [48]

    Spatialpin: Enhancing spatial reason- ing capabilities of vision-language models through prompt- ing and interacting 3d priors

    Chenyang Ma, Kai Lu, Ta-Ying Cheng, Niki Trigoni, and Andrew Markham. Spatialpin: Enhancing spatial reason- ing capabilities of vision-language models through prompt- ing and interacting 3d priors. In NeurIPS, 2024. 3

  49. [49]

    Ocr-vqa: Visual question answering by reading text in images

    Anand Mishra, Shashank Shekhar, Ajeet Kumar Singh, and Anirban Chakraborty. Ocr-vqa: Visual question answering by reading text in images. In ICDAR, 2019. 7

  50. [50]

    OpenAI. Gpt-4o. https://openai.com/index/ hello-gpt-4o/, 2024. 1, 4, 6

  51. [51]

    Aria digital twin: A new benchmark dataset for egocentric 3d machine perception

    Xiaqing Pan, Nicholas Charron, Yongqian Yang, Scott Pe- ters, Thomas Whelan, Chen Kong, Omkar Parkhi, Richard Newcombe, and Yuheng Carl Ren. Aria digital twin: A new benchmark dataset for egocentric 3d machine perception. In ICCV, 2023. 2, 4, 5, 7, 9 23

  52. [52]

    Learn- ing transferable visual models from natural language super- vision

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learn- ing transferable visual models from natural language super- vision. In ICML, 2021. 2

  53. [53]

    Does spatial cognition emerge in frontier models? arXiv:2410.06468, 2024

    Santhosh Kumar Ramakrishnan, Erik Wijmans, Philipp Kraehenbuehl, and Vladlen Koltun. Does spatial cognition emerge in frontier models? arXiv:2410.06468, 2024. 3

  54. [54]

    Sat: Spa- tial aptitude training for multimodal language models

    Arijit Ray, Jiafei Duan, Reuben Tan, Dina Bashkirova, Rose Hendrix, Kiana Ehsani, Aniruddha Kembhavi, Bryan A Plummer, Ranjay Krishna, Kuo-Hao Zeng, et al. Sat: Spa- tial aptitude training for multimodal language models. arXiv preprint arXiv:2412.07755, 2024. 3

  55. [55]

    Grounding dino 1.5: Advance the” edge” of open-set object detection, 2024

    Tianhe Ren, Qing Jiang, Shilong Liu, Zhaoyang Zeng, Wen- long Liu, Han Gao, Hongjie Huang, Zhengyu Ma, Xiaoke Jiang, Yihao Chen, et al. Grounding dino 1.5: Advance the” edge” of open-set object detection, 2024. 2

  56. [56]

    Laion-5b: An open large-scale dataset for training next generation image-text models

    Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Worts- man, et al. Laion-5b: An open large-scale dataset for training next generation image-text models. NeurIPS, 2022. 2

  57. [57]

    Laion coco: 600m syn- thetic captions from laion2b-en, 2022

    Christoph Schuhmann, Andreas K ¨opf, Richard Vencu, Theo Coombes, and Romain Beaumont. Laion coco: 600m syn- thetic captions from laion2b-en, 2022. 2

  58. [58]

    An empirical analysis on spatial reason- ing capabilities of large multimodal models

    Fatemeh Shiri, Xiao-Yu Guo, Mona Far, Xin Yu, Reza Haf, and Yuan-Fang Li. An empirical analysis on spatial reason- ing capabilities of large multimodal models. In EMNLP,

  59. [59]

    Drivelm: Driving with graph visual question answering

    Chonghao Sima, Katrin Renz, Kashyap Chitta, Li Chen, Hanxue Zhang, Chengen Xie, Jens Beißwenger, Ping Luo, Andreas Geiger, and Hongyang Li. Drivelm: Driving with graph visual question answering. In ECCV, 2024. 1, 3

  60. [60]

    Robospatial: Teaching spatial understanding to 2d and 3d vision-language models for robotics

    Chan Hee Song, Valts Blukis, Jonathan Tremblay, Stephen Tyree, Yu Su, and Stan Birchfield. Robospatial: Teaching spatial understanding to 2d and 3d vision-language models for robotics. arXiv:2411.16537, 2024. 3

  61. [61]

    Internlm: A multilingual language model with progressively enhanced capabilities

    InternLM Team. Internlm: A multilingual language model with progressively enhanced capabilities. https : / / github.com/InternLM/InternLM, 2023. 1, 2, 5, 6

  62. [62]

    Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs

    Shengbang Tong, Ellis Brown, Penghao Wu, Sanghyun Woo, Manoj Middepogu, Sai Charitha Akula, Jihan Yang, Shusheng Yang, Adithya Iyer, Xichen Pan, et al. Cambrian- 1: A fully open, vision-centric exploration of multimodal llms. arXiv preprint arXiv:2406.16860, 2024. 3

  63. [63]

    LLaMA: Open and Efficient Foundation Language Models

    Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timoth´ee Lacroix, Baptiste Rozi`ere, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models. arXiv:2302.13971, 2023. 2

  64. [64]

    Continuous 3d perception model with persistent state

    Qianqian Wang, Yifei Zhang, Aleksander Holynski, Alexei A Efros, and Angjoo Kanazawa. Continuous 3d perception model with persistent state. arXiv preprint arXiv:2501.12387, 2025. 2

  65. [65]

    Emergent Abilities of Large Language Models

    Jason Wei, Yi Tay, Rishi Bommasani, Colin Raffel, Barret Zoph, Sebastian Borgeaud, Dani Yogatama, Maarten Bosma, Denny Zhou, Donald Metzler, et al. Emergent abilities of large language models. arXiv:2206.07682, 2022. 8

  66. [66]

    Thinking in Space: How Multimodal Large Language Models See, Remember, and Recall Spaces

    Jihan Yang, Shusheng Yang, Anjali W Gupta, Rilyn Han, Li Fei-Fei, and Saining Xie. Thinking in space: How mul- timodal large language models see, remember, and recall spaces. arXiv:2412.14171, 2024. 3

  67. [67]

    Fast3r: Towards 3d reconstruction of 1000+ images in one forward pass

    Jianing Yang, Alexander Sax, Kevin J Liang, Mikael Henaff, Hao Tang, Ang Cao, Joyce Chai, Franziska Meier, and Matt Feiszli. Fast3r: Towards 3d reconstruction of 1000+ images in one forward pass. In Proceedings of the IEEE/CVF Con- ference on Computer Vision and Pattern Recognition , 2025. 2

  68. [68]

    Sphere: A hierarchical evaluation on spatial perception and reasoning for vision- language models

    Wenyu Zhang, Wei En Ng, Lixin Ma, Yuwen Wang, Jungqi Zhao, Boyang Li, and Lu Wang. Sphere: A hierarchical evaluation on spatial perception and reasoning for vision- language models. arXiv preprint arXiv:2412.12693 , 2024. 3

  69. [69]

    MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models

    Deyao Zhu, Jun Chen, Xiaoqian Shen, Xiang Li, and Mohamed Elhoseiny. Minigpt-4: Enhancing vision- language understanding with advanced large language mod- els. arXiv:2304.10592, 2023. 2

  70. [70]

    Towards foundation mod- els for 3d vision: How close are we? arXiv preprint arXiv:2410.10799, 2024

    Yiming Zuo, Karhan Kayan, Maggie Wang, Kevin Jeon, Jia Deng, and Thomas L Griffiths. Towards foundation mod- els for 3d vision: How close are we? arXiv preprint arXiv:2410.10799, 2024. 2, 3 24