VLM3: Vision Language Models Are Native 3D Learners

Vikas Chandra; Yangyang Shi; Yunyang Xiong; Zechun Liu; Zhipeng Cai; Zhuang Liu

arxiv: 2605.30561 · v1 · pith:NYD5ZMD3new · submitted 2026-05-28 · 💻 cs.CV · cs.AI

Pith reviewed 2026-06-29 07:43 UTC · model grok-4.3

keywords: vision language models, 3D understanding, depth estimation, focal length unification, text-based pixel reference, data scaling, VLM3

The pith

Vision language models learn 3D tasks using only focal length unification, text-based pixel references, and data scaling.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper claims that vision language models are inherently suited for 3D understanding and do not require specialized designs. Through large-scale studies, it finds that unifying focal lengths across data, referencing pixels via text, and carefully mixing and scaling training data enable strong performance on 3D tasks. These simple steps allow standard VLMs to handle depth estimation, camera pose, and object 3D understanding at levels matching expert models. The approach avoids complex losses, heavy augmentations, or architecture modifications that are common in traditional 3D vision work.

Core claim

VLMs are native 3D learners: focal length unification, text-based pixel reference, and data mixture and scaling suffice for effective 3D learning, which makes model architecture changes, larger models, heavy data augmentations, and complex losses unnecessary.
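To make "text-based pixel reference" concrete, here is a minimal sketch that assumes the term means naming a pixel by its coordinates inside the prompt text rather than drawing a marker on the image. The function name `pixel_prompt` and the prompt wording are hypothetical; the abstract does not give the paper's actual prompt format.

```python
def pixel_prompt(x: int, y: int, width: int, height: int) -> str:
    """Build a depth question that refers to one pixel purely in text.

    Hypothetical format: the paper's real prompt template is not given
    in the abstract, so this wording is only an illustration.
    """
    # Reject coordinates that fall outside the image.
    if not (0 <= x < width and 0 <= y < height):
        raise ValueError(f"pixel ({x}, {y}) is outside a {width}x{height} image")
    return (
        f"In this {width}x{height} image, how far from the camera, "
        f"in meters, is the point at pixel (x={x}, y={y})?"
    )


if __name__ == "__main__":
    print(pixel_prompt(320, 240, 640, 480))
```

Because the reference lives in the text, the image itself is left untouched and the answer can be supervised as ordinary text tokens.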

What carries the argument

The three enabling factors of focal length unification, text-based pixel reference, and data mixture and scaling that allow standard VLMs to master 3D tasks through text-based training.
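One plausible reading of focal length unification is resizing every image so that its focal length, measured in pixels, matches a single canonical value. The sketch below assumes that reading; the function name `unify_focal_length` and the default target of 1000 pixels are illustrative choices, not values taken from the paper. It relies only on the fact that scaling an image by a factor s also scales its pixel focal length by s.

```python
def unify_focal_length(width: int, height: int, focal_px: float,
                       target_focal_px: float = 1000.0):
    """Return the resized (width, height) and the scale factor that
    brings an image's pixel focal length to target_focal_px.

    Assumption: the target value is hypothetical, chosen for illustration.
    """
    if focal_px <= 0 or target_focal_px <= 0:
        raise ValueError("focal lengths must be positive")
    # Resizing by `scale` multiplies the pixel focal length by `scale`.
    scale = target_focal_px / focal_px
    new_width = max(1, round(width * scale))
    new_height = max(1, round(height * scale))
    return new_width, new_height, scale


if __name__ == "__main__":
    # A 640x480 image with a 500 px focal length is upsampled 2x.
    print(unify_focal_length(640, 480, 500.0))
```

After this step, the same apparent object size corresponds to the same metric depth across cameras, which is presumably why the factor matters for metric depth.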

If this is right

Depth estimation accuracy for VLMs rises from 0.84 to 0.9, the gain reported in the abstract.
Pixel correspondence, camera pose estimation, and object-level 3D understanding reach accuracy levels comparable to expert vision models.
Standard VLM architectures and text-based training suffice without task-specific modifications.
VLM3 provides a scalable method for diverse 3D tasks using the simplest design.
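The abstract does not name the metric behind the 0.84 and 0.9 figures. As an illustration only, the sketch below computes a threshold accuracy that is common in depth estimation: the fraction of points whose predicted-to-true depth ratio stays within a factor of 1.25. Treating the reported numbers as this metric is an assumption, and the name `delta1` is illustrative.

```python
def delta1(pred, gt, threshold: float = 1.25) -> float:
    """Fraction of points with max(pred/gt, gt/pred) below `threshold`.

    Assumption: this is a common depth accuracy metric, not necessarily
    the one used for the figures quoted in the abstract.
    """
    if len(pred) != len(gt):
        raise ValueError("pred and gt must have the same length")
    # Keep only points where both depths are valid (strictly positive).
    pairs = [(p, g) for p, g in zip(pred, gt) if p > 0 and g > 0]
    if not pairs:
        raise ValueError("no valid depth pairs")
    hits = sum(1 for p, g in pairs if max(p / g, g / p) < threshold)
    return hits / len(pairs)


if __name__ == "__main__":
    print(delta1([1.0, 2.0, 3.0, 10.0], [1.1, 2.0, 4.0, 10.5]))  # 0.75
```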

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

This approach may allow 3D capabilities to emerge in general multimodal models without dedicated 3D training pipelines.
It could simplify integration of 3D understanding into applications like autonomous navigation by reusing existing VLM infrastructure.
Testing these factors on even larger VLMs or different data distributions might reveal further performance gains.

Load-bearing premise

The observed performance gains on 3D tasks result solely from focal length unification, text-based pixel reference, and data mixture and scaling, isolated from other training variables.

What would settle it

A controlled experiment that applies focal length unification, text-based pixel reference, and data mixture and scaling to a standard VLM, with all other training variables held fixed, and finds no improvement in depth estimation or other 3D metrics over the baseline.
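Such a controlled experiment can be laid out as a full factorial grid over the three factors while everything else stays fixed. The sketch below only enumerates the run configurations; the factor keys and the contents of the `fixed` settings are hypothetical placeholders, not the paper's actual training setup.

```python
from itertools import product

# The three factors named in the core claim (keys are illustrative).
FACTORS = (
    "focal_length_unification",
    "text_pixel_reference",
    "data_mixture_scaling",
)


def ablation_grid(fixed: dict) -> list:
    """Enumerate all on/off combinations of the three factors.

    `fixed` holds settings shared by every run (hypothetical values),
    so that any metric difference can be traced to the toggled factors.
    """
    runs = []
    for flags in product([False, True], repeat=len(FACTORS)):
        config = dict(fixed)               # identical in every run
        config.update(zip(FACTORS, flags))  # only these vary
        runs.append(config)
    return runs


if __name__ == "__main__":
    shared = {"model_size": "same", "training_steps": "same"}
    for run in ablation_grid(shared):
        print(run)
```

The first run (all factors off) is the baseline and the last (all on) is the full recipe; the six runs in between show how much each factor contributes alone and in pairs.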

Original abstract

Vision Language Models (VLMs) enable a unified model to solve various vision tasks through prompting. They have shown promising performance in semantic understanding. However, 3D understanding still largely relies on expert vision models with complex task-specific designs. The key argument this work wants to make is that VLMs are native 3D learners. Our in-depth large scale study shows that 1) focal length unification, 2) text-based pixel reference and 3) data mixture and scaling, are all you need for effective 3D learning. Model architecture changes, large models, heavy data augmentations, and complex losses including the regression formulation, many of which form the foundation of expert vision models, are actually not necessary conditions. As a result, we propose VLM3, a scalable method with the simplest design that enables standard VLMs to master diverse 3D tasks. VLM3 not only advances the VLM depth estimation accuracy by a large margin (0.84 -> 0.9), but also enables diverse 3D tasks such as pixel correspondence, camera pose estimation and object-level 3D understanding, matching expert vision model accuracy while maintaining standard architectures and text-based training. We believe VLM3 opens up a new paradigm for simple and scalable 3D learning.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note

private letter to a colleague

The paper claims VLMs handle multiple 3D tasks with just focal length unification, text pixel references, and data scaling, but the abstract leaves the isolation of those factors unverified.

Hi,

The main thing to know is that this paper argues standard VLMs are already native 3D learners. The authors say focal length unification, text-based pixel references, and data mixture plus scaling are enough to reach expert-level performance on depth estimation and tasks like pixel correspondence, camera pose, and object-level 3D understanding. They report lifting depth accuracy from 0.84 to 0.9 while keeping the original architecture and avoiding regression losses or heavy augmentations.

What the work does well is lay out a minimal recipe that unifies several 3D capabilities under one prompting and training approach. The simplicity is useful for anyone who wants to avoid building separate expert models for robotics or scene tasks. The claim that many standard 3D design choices are unnecessary is a direct challenge to current practice and worth testing.

The soft spots sit in the evidence. The abstract describes an in-depth large-scale study but supplies no dataset details, ablation tables, or controls for model size and training recipe. The stress-test concern is fair: without matched conditions it is hard to know whether the gains trace to the three listed factors or to other unmentioned changes. If the full paper contains clear isolation experiments, that would fix the gap; otherwise the central result stays hard to trust.

This paper is for VLM researchers who want practical ways to add 3D without new architectures. Readers looking for simple baselines or scaling recipes would find it relevant.

It deserves a serious referee because the claim is concrete and the potential payoff is large if the controls hold. I recommend sending it to review but asking the authors to add explicit ablation results and matched training details in the next version.

Best,

Referee Report

1 major / 0 minor

Summary. The paper claims that Vision Language Models are native 3D learners. Its central argument, based on an in-depth large-scale study, is that focal length unification, text-based pixel reference, and data mixture/scaling are jointly sufficient for effective 3D learning in standard VLMs. It proposes the simple VLM3 method, which reportedly advances depth estimation accuracy from 0.84 to 0.9 and matches expert vision model performance on pixel correspondence, camera pose estimation, and object-level 3D tasks, while showing that architecture changes, larger models, heavy augmentations, and complex losses (including regression) are unnecessary.

Significance. If the empirical isolation of the three factors holds under controlled conditions, the result would be significant: it would indicate that standard VLMs can handle diverse 3D tasks via minimal, text-based adaptations and data strategies, challenging the necessity of specialized 3D architectures and potentially enabling a simpler, more scalable paradigm.

major comments (1)

[Abstract] The claim that focal length unification, text-based pixel reference, and data mixture/scaling are all that is needed (and that architecture changes, model scale, augmentations, and complex losses are not necessary) rests on an asserted 'in-depth large scale study,' yet the manuscript supplies no methodology details, dataset descriptions, ablation tables, or controls that hold model size, training recipes, and other variables fixed across conditions. This is load-bearing for the central claim, as performance gains (e.g., the cited depth improvement) cannot be attributed to the three factors without such isolation.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the thoughtful review and constructive feedback. We address the major comment below and agree that greater clarity on the experimental design is warranted to support the central claims.

Referee: The claim that focal length unification, text-based pixel reference, and data mixture/scaling are all that is needed (and that architecture changes, model scale, augmentations, and complex losses are not necessary) rests on an asserted 'in-depth large scale study,' yet the manuscript supplies no methodology details, dataset descriptions, ablation tables, or controls that hold model size, training recipes, and other variables fixed across conditions. This is load-bearing for the central claim, as performance gains (e.g., the cited depth improvement) cannot be attributed to the three factors without such isolation.

Authors: We agree that the abstract is high-level and that explicit isolation of the three factors requires clear documentation of controls. The full manuscript contains an experimental section with dataset descriptions, training details, and ablation studies; however, these may not sufficiently highlight fixed variables (model size, recipes) or directly attribute gains to focal length unification, text-based referencing, and data scaling. We will revise by expanding the methods/experiments section with a dedicated controlled-study subsection, additional ablation tables that explicitly hold other factors fixed, and clearer result attribution. This addresses the load-bearing concern without altering the core findings.

Revision: yes

Circularity Check

0 steps flagged

No derivation chain; empirical study presents no self-referential reductions

The paper advances an empirical claim based on a large-scale study that focal length unification, text-based pixel reference, and data mixture/scaling suffice for 3D learning in VLMs, rendering architecture changes, model scale, augmentations, and regression losses unnecessary. No equations, derivations, fitted parameters renamed as predictions, or load-bearing self-citations appear in the provided text. The central argument is framed as experimental outcomes rather than a mathematical chain that reduces to its inputs by construction, so no circularity of the enumerated kinds is present.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review yields no explicit free parameters, axioms, or invented entities; the central claim rests on an undescribed large-scale empirical study.

Reference graph

Works this paper leans on

28 extracted references · 20 canonical work pages · 15 internal anchors

[1] Qwen3-VL Technical Report
Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, et al. Qwen3-VL technical report. arXiv preprint arXiv:2511.21631, 2025a. Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, Humen Zhong, Yuanzhi Zhu, Mingkun Yang, Zha...

[2] Depth Pro: Sharp Monocular Metric Depth in Less Than a Second
Aleksei Bochkovskii, Amaël Delaunoy, Hugo Germain, Marcel Santos, Yichao Zhou, Stephan R Richter, and Vladlen Koltun. Depth Pro: Sharp monocular metric depth in less than a second. arXiv preprint arXiv:2410.02073.

[3] SpatialBot: Precise Spatial Understanding with Vision Language Models
Wenxiao Cai, Iaroslav Ponomarenko, Jianhao Yuan, Xiaoqi Li, Wankou Yang, Hao Dong, and Bo Zhao. SpatialBot: Precise spatial understanding with vision language models. arXiv preprint arXiv:2406.13642.

[4] DepthLM: Metric Depth from Vision Language Models
Zhipeng Cai, Ching-Feng Yeh, Hu Xu, Zhuang Liu, Gregory Meyer, Xinjie Lei, Changsheng Zhao, Shang-Wen Li, Vikas Chandra, and Yangyang Shi. DepthLM: Metric depth from vision language models. arXiv preprint arXiv:2509.25413.

[5] Matterport3D: Learning from RGB-D Data in Indoor Environments
Angel Chang, Angela Dai, Thomas Funkhouser, Maciej Halber, Matthias Niessner, Manolis Savva, Shuran Song, Andy Zeng, and Yinda Zhang. Matterport3D: Learning from RGB-D data in indoor environments. arXiv preprint arXiv:1709.06158.

[6] Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities
Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blistein, Ori Ram, Dan Zhang, Evan Rosen, et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities. arXiv preprint arXiv:2507.06261.

[7] VLM-3R: Vision-Language Models Augmented with Instruction-Aligned 3D Reconstruction
Zhiwen Fan, Jian Zhang, Renjie Li, Junge Zhang, Runjin Chen, Hezhen Hu, Kevin Wang, Huaizhi Qu, Shijie Zhou, Dilin Wang, et al. VLM-3R: Vision-language models augmented with instruction-aligned 3D reconstruction. arXiv preprint arXiv:2505.20279.

[8] Seed1.5-VL Technical Report
SpaceLLaVA github contributors. SpaceLLaVA. 2024. https://huggingface.co/remyxai/SpaceLLaVA. Dong Guo, Faming Wu, Feida Zhu, Fuxing Leng, Guang Shi, Haobin Chen, Haoqi Fan, Jian Wang, Jianyu Jiang, Jiawei Wang, et al. Seed1.5-VL technical report. arXiv preprint arXiv:2505.07062.

[9] MapAnything: Universal Feed-Forward Metric 3D Reconstruction
Nikhil Keetha, Norman Müller, Johannes Schönberger, Lorenzo Porzi, Yuchen Zhang, Tobias Fischer, Arno Knapitsch, Duncan Zauss, Ethan Weber, Nelson Antunes, et al. MapAnything: Universal feed-forward metric 3D reconstruction. arXiv preprint arXiv:2509.13414.

[10] Depth Anything 3: Recovering the Visual Space from Any Views
Haotong Lin, Sili Chen, Junhao Liew, Donny Y Chen, Zhenyu Li, Guang Shi, Jiashi Feng, and Bingyi Kang. Depth Anything 3: Recovering the visual space from any views. arXiv preprint arXiv:2511.10647.

[11] Visual-RFT: Visual Reinforcement Fine-Tuning
Ziyu Liu, Zeyi Sun, Yuhang Zang, Xiaoyi Dong, Yuhang Cao, Haodong Duan, Dahua Lin, and Jiaqi Wang. Visual-RFT: Visual reinforcement fine-tuning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 2034–2044.

[12] DINOv2: Learning Robust Visual Features without Supervision
Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. DINOv2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193.

[13] UniDepthV2: Universal Monocular Metric Depth Estimation Made Simpler
Luigi Piccinelli, Christos Sakaridis, Yung-Hsu Yang, Mattia Segu, Siyuan Li, Wim Abbeloos, and Luc Van Gool. UniDepthV2: Universal monocular metric depth estimation made simpler. arXiv preprint arXiv:2502.20110.

[14] Habitat-Matterport 3D Dataset (HM3D): 1000 Large-scale 3D Environments for Embodied AI
Santhosh K Ramakrishnan, Aaron Gokaslan, Erik Wijmans, Oleksandr Maksymets, Alex Clegg, John Turner, Eric Undersander, Wojciech Galuba, Andrew Westbury, Angel X Chang, et al. Habitat-Matterport 3D dataset (HM3D): 1000 large-scale 3D environments for embodied AI. arXiv preprint arXiv:2109.08238.

[15] GIM: Learning Generalizable Image Matcher from Internet Videos
Xuelun Shen, Zhipeng Cai, Wei Yin, Matthias Müller, Zijun Li, Kaixuan Wang, Xiaozhi Chen, and Cheng Wang. GIM: Learning generalizable image matcher from internet videos. arXiv preprint arXiv:2402.11095.

[16] MoGe-2: Accurate Monocular Geometry with Metric Scale and Sharp Details
Ruicheng Wang, Sicheng Xu, Yue Dong, Yu Deng, Jianfeng Xiang, Zelong Lv, Guangzhong Sun, Xin Tong, and Jiaolong Yang. MoGe-2: Accurate monocular geometry with metric scale and sharp details. arXiv preprint arXiv:2507.02546, 2025b. Shuzhe Wang, Vincent Leroy, Yohann Cabon, Boris Chidlovskii, and Jerome Revaud. DUSt3R: Geometric 3D vision made easy. In Procee...

[17] Argoverse 2: Next Generation Datasets for Self-Driving Perception and Forecasting
Benjamin Wilson, William Qi, Tanmay Agarwal, John Lambert, Jagjeet Singh, Siddhesh Khandelwal, Bowen Pan, Ratnesh Kumar, Andrew Hartnett, Jhony Kaesemodel Pontes, et al. Argoverse 2: Next generation datasets for self-driving perception and forecasting. arXiv preprint arXiv:2301.00493.

[18] Multi-SpatialMLLM: Multi-Frame Spatial Understanding with Multi-Modal Large Language Models
Runsen Xu, Weiyao Wang, Hao Tang, Xingyu Chen, Xiaodong Wang, Fu-Jen Chu, Dahua Lin, Matt Feiszli, and Kevin J Liang. Multi-SpatialMLLM: Multi-frame spatial understanding with multi-modal large language models. arXiv preprint arXiv:2505.17015.

[19] On the Generalization Capacities of MLLMs for Spatial Intelligence
Gongjie Zhang, Wenhao Li, Quanhao Qian, Jiuniu Wang, Deli Zhao, Shijian Lu, and Ran Xu. On the generalization capacities of MLLMs for spatial intelligence. arXiv preprint arXiv:2603.06704.

[20] UFM: A Simple Path towards Unified Dense Correspondence with Flow
Yuchen Zhang, Nikhil Keetha, Chenwei Lyu, Bhuvan Jhamb, Yutian Chen, Yuheng Qiu, Jay Karhade, Shreyas Jha, Yaoyu Hu, Deva Ramanan, et al. UFM: A simple path towards unified dense correspondence with flow. arXiv preprint arXiv:2506.09278.

[21] Stereo Magnification: Learning View Synthesis using Multiplane Images
Tinghui Zhou, Richard Tucker, John Flynn, Graham Fyffe, and Noah Snavely. Stereo magnification: Learning view synthesis using multiplane images. arXiv preprint arXiv:1805.09817.

[22] Appendix A: Further Implementation Details

Table 4: Hyper-parameters.

Task                Depth Estimation       Object-level 3D   Pixel correspondence   Camera pose estimation
Learning rate       5.5e-5                 3.5e-4            2e-5                   5e-5
Batch size          1344                   640               2816                   448
Number of samples   32M (10 pixels each)   1M                80M (10 pixels each)   10M

Table 5: Training data statistics. Depth Estimation Datasets, Number of image...

[23] 450K dynamicreplica (Karaev et al., 2023) 1M sail vos3d (Hu et al.,

[24] 350K ScanNet++ (Yeshwanth et al., 2023) 1M MPSD (Antequera et al.,

[25] 13K RealEstate-10K (Zhou et al., 2018) 880K DL3dv-10k (Ling et al.,

[26] 190K Aria Synthetic Environment (Avetisyan et al., 2024) 2M GTA-SFM (Wang and Shen,

[27] 850K UnrealStereo4K (Tosi et al., 2021) 270K MVS Synth (Huang et al.,

[28] to randomly sample image pairs with > 25% covisibility. Similar to previous works (Cai et al., 2025; Lin et al., 2025), we hold out 30 scenes from ScanNet++ to ensure the evaluation data come from unseen scenes.
