arxiv: 2605.15876 · v3 · pith:OOVH4AJHnew · submitted 2026-05-15 · 💻 cs.CV

Unlocking Dense Metric Depth Estimation in VLMs

Pith reviewed 2026-05-21 07:51 UTC · model grok-4.3

classification 💻 cs.CV
keywords: dense depth estimation · vision-language models · metric depth · 3D spatial reasoning · multimodal foundation models · lightweight depth head · unified vision-text supervision

The pith

Attaching a lightweight depth head turns a vision-language model into a native predictor of full-resolution metric depth maps.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that vision-language models can be extended to output dense metric depth without relying on separate vision models for supervision. By adding a simple depth head to the language-model backbone and training it jointly with text and vision data in two stages, the same forward pass produces both language responses and accurate depth. This matters because prior approaches either lost fine geometric detail through distillation or sacrificed efficiency and multimodal fluency. If successful, the result is a single model that reasons about 3D space while still handling captioning, grounding, and other language tasks.

Core claim

DepthVLM transforms a single VLM into a native dense geometry predictor while preserving its multimodal capability. By attaching a lightweight depth head to the LLM backbone and training under a unified vision-text supervision paradigm with a two-stage schedule, DepthVLM generates full-resolution depth maps alongside language outputs in a single forward pass. Experiments demonstrate that this approach outperforms existing VLMs with higher inference efficiency, surpasses leading pure vision models on depth accuracy, and improves complex 3D spatial reasoning.

What carries the argument

A lightweight depth head attached to the LLM backbone that decodes visual features into dense metric depth predictions under a two-stage unified vision-text supervision schedule.
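
One way to picture the two-stage schedule is as a per-stage weighting of a depth loss and a language-modeling loss. The sketch below is illustrative only: the stage names, the choice to freeze the backbone in stage 1, and the loss weights are all assumptions made for the example, not values reported by the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StageConfig:
    """Hypothetical description of one training stage."""
    name: str
    train_backbone: bool  # assumption: whether LLM backbone weights are updated
    depth_weight: float   # weight on the dense depth loss
    text_weight: float    # weight on the language-modeling loss


# Assumed schedule: align the depth head first, then train jointly.
STAGES = (
    StageConfig("stage1_depth_alignment", train_backbone=False,
                depth_weight=1.0, text_weight=0.0),
    StageConfig("stage2_joint", train_backbone=True,
                depth_weight=1.0, text_weight=1.0),
)


def total_loss(stage: StageConfig, depth_loss: float, text_loss: float) -> float:
    """Combine the two supervision signals with the stage's weights."""
    return stage.depth_weight * depth_loss + stage.text_weight * text_loss


if __name__ == "__main__":
    for stage in STAGES:
        print(stage.name, total_loss(stage, depth_loss=0.5, text_loss=2.0))
```

Under these assumed weights, stage 1 optimizes the depth term alone and stage 2 adds the text term, which is the knob a loss-weighting sensitivity study would vary.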

If this is right

  • Depth estimation becomes a native capability of the VLM rather than a post-hoc distillation step.
  • Inference cost stays close to the original VLM because depth and language share the same forward pass.
  • Complex 3D spatial reasoning tasks improve because the model now has direct access to metric geometry.
  • A single model can be used for both 2D vision-language tasks and 3D geometry without switching architectures.
  • Unified indoor-outdoor metric depth benchmarks become feasible in VLM-compatible formats.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same head-attachment pattern could be tested on other geometric outputs such as surface normals or optical flow.
  • If depth quality holds across domains, it reduces the need for separate 3D foundation models in robotics pipelines.
  • Multimodal training schedules that balance text and dense supervision may generalize to other dense prediction tasks.
  • Real-world deployment could benefit from the model's ability to explain depth estimates in natural language.

Load-bearing premise

A simple added depth head plus two-stage joint training is enough to recover accurate dense metric geometry without error buildup from external models or loss of the base model's language abilities.

What would settle it

A controlled ablation showing that DepthVLM's depth accuracy falls below leading pure-vision models once the two-stage schedule or the depth head is removed.

Figures

Figures reproduced from arXiv: 2605.15876 by Hanxun Yu, Jianke Zhu, Lei Ke, Xuan Qu, Yuxin Wang.

Figure 1: Our method serves as a unified foundation model for both low-level dense geometry …
Figure 2: Comparison of prevailing VLMs with our method. (a) Prevailing VLMs are typically supervised solely in the text space, leaving dense 3D geometry out of reach. (b) DepthVLM introduces a unified vision–text supervision paradigm by integrating a lightweight depth head, natively enabling a single VLM backbone to generate dense geometry alongside language responses. (c) While even advanced VLMs such as GPT-5.5 […
Figure 3: Overview of our proposed DepthVLM. We extend the standard VLM architecture with a lightweight DPT-style [43] depth prediction head, and adopt a two-stage training strategy to preserve the backbone’s general VQA capability. In addition, input images are normalized to a unified focal length, eliminating camera-induced ambiguity across heterogeneous dataset domains. […] produce dense metric depth map and language…
Figure 5: Qualitative comparison with others. Our results show finer structural details and improved semantic consistency across diverse scenes. Depth is color-coded from near to far …
Figure 6: VLM evaluation setup for metric depth estimation. A red arrow marker (20 pixels) is drawn on the input image to indicate the query point. The model receives the annotated image along with a text prompt asking for the metric depth (in meters). The zoomed-in region shows the arrow marker in detail. Image Resolution Handling. Since different datasets are captured at varying resolutions, we downscale images wi…
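
The Figure 3 caption says input images are normalized to a unified focal length. The paper's exact procedure is not shown on this page, so the following is a minimal sketch of one common approach under a pinhole-camera assumption: resize the image by the ratio of a canonical focal length to the source focal length. The function name and the canonical value of 1000 pixels are hypothetical.

```python
def normalize_focal(width: int, height: int, focal_px: float,
                    canonical_focal_px: float = 1000.0) -> tuple[int, int, float]:
    """Return the resized (width, height) and the scale factor that maps an
    image with focal length `focal_px` (in pixels) to `canonical_focal_px`.

    Resizing an image by a factor s multiplies its pixel focal length by s,
    so every resized image shares the same focal length while the metric
    depth of each scene point is unchanged.
    """
    if focal_px <= 0 or canonical_focal_px <= 0:
        raise ValueError("focal lengths must be positive")
    scale = canonical_focal_px / focal_px
    new_width = max(1, round(width * scale))
    new_height = max(1, round(height * scale))
    return new_width, new_height, scale


if __name__ == "__main__":
    # A wide-angle image (short focal length) is upsampled.
    print(normalize_focal(1280, 720, focal_px=500.0))
    # A telephoto image (long focal length) is downsampled.
    print(normalize_focal(640, 480, focal_px=2000.0))
```

The alternative of keeping the image fixed and rescaling the depth labels would also remove the focal-length ambiguity; which variant the authors use is not stated here.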
Original abstract

Vision-Language Models (VLMs) excel at 2D tasks such as grounding and captioning, yet remain limited in 3D understanding. A key limitation is their text-only supervision paradigm, which under-constrains fine-grained visual perception and prevents the recovery of dense geometry. Prior methods either distill geometry from external vision models, introducing error accumulation, or enable direct prediction with inefficient per-pixel query or coarse token-level outputs. In this paper, we propose DepthVLM, a simple yet effective framework that transforms a single VLM into a native dense geometry predictor while preserving its multimodal capability. By attaching a lightweight depth head to the LLM backbone and training under a unified vision-text supervision paradigm with a two-stage schedule, DepthVLM generates full-resolution depth maps alongside language outputs in a single forward pass. We further introduce a unified indoor-outdoor metric depth benchmark in a VLM-compatible format. Experiments show that DepthVLM significantly outperforms existing VLMs with higher inference efficiency, surpasses leading pure vision models, and improves complex 3D spatial reasoning, moving toward a truly unified multimodal foundation model. The project page is available at https://depthvlm.github.io/

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript introduces DepthVLM, a framework that augments a base VLM by attaching a lightweight depth head to the LLM backbone. Trained under a two-stage unified vision-text supervision schedule, the model produces full-resolution metric depth maps in a single forward pass alongside language outputs. The authors also release a unified indoor-outdoor metric depth benchmark formatted for VLMs and report that DepthVLM outperforms prior VLMs in both accuracy and efficiency, surpasses leading pure-vision depth estimators, and improves downstream 3D spatial reasoning.

Significance. If the quantitative claims are substantiated, the work would constitute a notable advance toward native dense geometric perception inside VLMs without external distillation or per-pixel querying, potentially enabling more unified multimodal foundation models for tasks that require both language and metric 3D understanding. The new benchmark could also serve as a useful community resource.

major comments (3)
  1. [Abstract and §5] Abstract and §5 (Experiments): the central claim of significant outperformance over existing VLMs and pure-vision models is asserted without any reported numbers, error bars, or ablation tables in the provided abstract; the experimental section must supply these metrics (including language-task retention scores) to make the efficiency and accuracy gains verifiable.
  2. [§3] §3 (Method): the two-stage unified vision-text supervision is presented as sufficient to recover accurate full-resolution metric depth while preserving language capabilities, yet no analysis is given of how scale ambiguity is resolved across indoor/outdoor domains with differing depth ranges or of the trade-off between depth-head training and original VLM language modeling loss.
  3. [§4] §4 (Benchmark): the new unified indoor-outdoor benchmark is introduced as VLM-compatible, but the paper must clarify the exact metric definitions, depth-range normalization, and evaluation protocol to ensure that reported gains are not artifacts of post-hoc dataset choices or inconsistent ground-truth scales.
minor comments (2)
  1. [Figure 1 and §3.2] Figure 1 and §3.2: the diagram of the depth-head attachment would benefit from explicit notation showing how the LLM token features are upsampled to full resolution.
  2. [Related Work] Related-work section: a brief quantitative comparison table with prior VLM depth methods (e.g., token-level vs. dense outputs) would improve readability.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We appreciate the opportunity to clarify and strengthen the manuscript. Below we respond point-by-point to the major comments and indicate the revisions we will make.

  1. Referee: [Abstract and §5] Abstract and §5 (Experiments): the central claim of significant outperformance over existing VLMs and pure-vision models is asserted without any reported numbers, error bars, or ablation tables in the provided abstract; the experimental section must supply these metrics (including language-task retention scores) to make the efficiency and accuracy gains verifiable.

    Authors: We agree that the abstract and experimental section would benefit from explicit quantitative support. In the revised manuscript we will augment the abstract with key metrics (e.g., absolute relative error, RMSE, and inference speed). In §5 we will add complete tables that include mean errors with standard deviations, ablation studies on training stages, and language-task retention scores measured on standard VLM benchmarks before and after depth-head training. These additions will make the efficiency and accuracy claims directly verifiable.
    revision: yes

  2. Referee: [§3] §3 (Method): the two-stage unified vision-text supervision is presented as sufficient to recover accurate full-resolution metric depth while preserving language capabilities, yet no analysis is given of how scale ambiguity is resolved across indoor/outdoor domains with differing depth ranges or of the trade-off between depth-head training and original VLM language modeling loss.

    Authors: The two-stage schedule first aligns the depth head using absolute metric supervision on the mixed indoor-outdoor data, then jointly optimizes with the language modeling objective; absolute depth labels in meters across the unified benchmark inherently resolve scale ambiguity without per-domain normalization. We acknowledge that an explicit analysis of the loss trade-off is currently missing. We will expand §3 with a discussion of how the staged training balances the objectives and will include a brief sensitivity study on loss weighting in the revision.
    revision: partial

  3. Referee: [§4] §4 (Benchmark): the new unified indoor-outdoor benchmark is introduced as VLM-compatible, but the paper must clarify the exact metric definitions, depth-range normalization, and evaluation protocol to ensure that reported gains are not artifacts of post-hoc dataset choices or inconsistent ground-truth scales.

    Authors: We will revise §4 to state the precise metric (absolute depth in meters), describe the depth-range handling (global scaling to a common maximum range while preserving relative indoor/outdoor differences), and detail the full evaluation protocol, including ground-truth alignment steps and any scene filtering criteria. These clarifications will eliminate ambiguity and confirm that reported improvements are not artifacts of inconsistent scaling.
    revision: yes
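
The metrics named in the responses above (absolute relative error and RMSE) have standard definitions in the monocular depth literature. The sketch below computes them over flat lists of predicted and ground-truth depths in meters. The δ < 1.25 accuracy is a common companion metric included here as an assumption; the function name and the `min_depth` validity threshold are hypothetical and do not describe the paper's evaluation protocol.

```python
import math


def depth_metrics(pred: list[float], gt: list[float],
                  min_depth: float = 1e-3) -> dict[str, float]:
    """Compute AbsRel, RMSE, and delta<1.25 accuracy over valid pixels.

    A pixel is valid when its ground-truth depth exceeds `min_depth`
    (an assumed way to mask missing ground truth).
    """
    if len(pred) != len(gt):
        raise ValueError("pred and gt must have the same length")
    pairs = [(p, g) for p, g in zip(pred, gt) if g > min_depth]
    if not pairs:
        raise ValueError("no valid ground-truth pixels")
    n = len(pairs)
    # Mean of |pred - gt| / gt.
    abs_rel = sum(abs(p - g) / g for p, g in pairs) / n
    # Root of the mean squared error, in meters.
    rmse = math.sqrt(sum((p - g) ** 2 for p, g in pairs) / n)
    # Fraction of pixels whose ratio to ground truth is within 1.25x.
    delta1 = sum(1 for p, g in pairs
                 if p > 0 and max(p / g, g / p) < 1.25) / n
    return {"abs_rel": abs_rel, "rmse": rmse, "delta1": delta1}


if __name__ == "__main__":
    # The third pixel has no ground truth (0.0) and is masked out.
    print(depth_metrics([2.0, 3.0, 5.0], [2.0, 4.0, 0.0]))
```

Reporting these per dataset, with the validity mask and depth range stated, is the kind of detail the referee's third comment asks for.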

Circularity Check

0 steps flagged

No significant circularity in DepthVLM method or claims

full rationale

The paper describes an empirical architecture: attach a lightweight depth head to an existing VLM backbone and train it end-to-end under a two-stage unified vision-text supervision schedule against external depth benchmarks. The abstract and method statement present this as a direct engineering choice whose outputs (full-resolution metric depth maps) are produced by standard supervised learning rather than by any internal derivation that reduces to the inputs by construction. No equations are shown that equate a claimed prediction to a fitted hyper-parameter or to a self-cited prior result; no uniqueness theorem or ansatz is imported from the authors' own previous work to force the design. The performance claims are therefore falsifiable experimental outcomes of the training procedure, not tautological restatements of the method itself. The derivation chain remains self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The central claim rests on the empirical effectiveness of the two-stage training schedule and the assumption that the added depth head does not interfere with language capabilities; no explicit free parameters, new axioms, or invented physical entities are introduced in the abstract.

pith-pipeline@v0.9.0 · 5740 in / 1236 out tokens · 25030 ms · 2026-05-21T07:51:18.215335+00:00 · methodology

Reference graph

Works this paper leans on

86 extracted references · 86 canonical work pages · 24 internal anchors

  1. [1]

    Pixtral 12B

    Pravesh Agrawal, Szymon Antoniak, Emma Bou Hanna, Baptiste Bout, Devendra Chaplot, Jessica Chudnovsky, Diogo Costa, Baudouin De Monicault, Saurabh Garg, Theophile Gervet, et al. Pixtral 12b. arXiv preprint arXiv:2410.07073, 2024

  2. [2]

    Scanqa: 3d question answering for spatial scene understanding

    Daichi Azuma, Taiki Miyanishi, Shuhei Kurita, and Motoaki Kawanabe. Scanqa: 3d question answering for spatial scene understanding. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 19129–19139, 2022

  3. [3]

    Qwen3-VL Technical Report

    Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, et al. Qwen3-vl technical report.arXiv preprint arXiv:2511.21631, 2025

  4. [4]

    Adabins: Depth estimation using adaptive bins

    Shariq Farooq Bhat, Ibraheem Alhashim, and Peter Wonka. Adabins: Depth estimation using adaptive bins. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 4009–4018, 2021

  5. [5]

    ZoeDepth: Zero-shot Transfer by Combining Relative and Metric Depth

    Shariq Farooq Bhat, Reiner Birkl, Diana Wofk, Peter Wonka, and Matthias Müller. Zoedepth: Zero-shot transfer by combining relative and metric depth.arXiv preprint arXiv:2302.12288, 2023

  6. [6]

    Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets

    Andreas Blattmann, Tim Dockhorn, Sumith Kulal, Daniel Mendelevitch, Maciej Kilian, Dominik Lorenz, Yam Levi, Zion English, Vikram Voleti, Adam Letts, et al. Stable video diffusion: Scaling latent video diffusion models to large datasets. arXiv preprint arXiv:2311.15127, 2023

  7. [7]

    Depth Pro: Sharp Monocular Metric Depth in Less Than a Second

    Aleksei Bochkovskii, Amaël Delaunoy, Hugo Germain, Marcel Santos, Yichao Zhou, Stephan R Richter, and Vladlen Koltun. Depth pro: Sharp monocular metric depth in less than a second. arXiv preprint arXiv:2410.02073, 2024

  8. [8]

    nuscenes: A multimodal dataset for autonomous driving

    Holger Caesar, Varun Bankiti, Alex H Lang, Sourabh Vora, Venice Erin Liong, Qiang Xu, Anush Krishnan, Yu Pan, Giancarlo Baldan, and Oscar Beijbom. nuscenes: A multimodal dataset for autonomous driving. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 11621–11631, 2020

  9. [9]

    Depthlm: Metric depth from vision language models

    Zhipeng Cai, Ching-Feng Yeh, Hu Xu, Zhuang Liu, Gregory Meyer, Xinjie Lei, Changsheng Zhao, Shang-Wen Li, Vikas Chandra, and Yangyang Shi. Depthlm: Metric depth from vision language models. arXiv preprint arXiv:2509.25413, 2025

  10. [10]

    Matterport3D: Learning from RGB-D Data in Indoor Environments

    Angel Chang, Angela Dai, Thomas Funkhouser, Maciej Halber, Matthias Niessner, Manolis Savva, Shuran Song, Andy Zeng, and Yinda Zhang. Matterport3d: Learning from rgb-d data in indoor environments. arXiv preprint arXiv:1709.06158, 2017

  11. [11]

    Spatialvlm: Endowing vision-language models with spatial reasoning capabilities

    Boyuan Chen, Zhuo Xu, Sean Kirmani, Brian Ichter, Dorsa Sadigh, Leonidas Guibas, and Fei Xia. Spatialvlm: Endowing vision-language models with spatial reasoning capabilities. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14455–14465, 2024

  12. [12]

    Scanrefer: 3d object localization in rgb-d scans using natural language

    Dave Zhenyu Chen, Angel X Chang, and Matthias Nießner. Scanrefer: 3d object localization in rgb-d scans using natural language. InEuropean conference on computer vision, pages 202–221. Springer, 2020

  13. [13]

    Ll3da: Visual interactive instruction tuning for omni-3d understanding reasoning and planning

    Sijin Chen, Xin Chen, Chi Zhang, Mingsheng Li, Gang Yu, Hao Fei, Hongyuan Zhu, Jiayuan Fan, and Tao Chen. Ll3da: Visual interactive instruction tuning for omni-3d understanding reasoning and planning. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 26428–26438, 2024

  14. [14]

    Spatialrgpt: Grounded spatial reasoning in vision-language models

    An-Chieh Cheng, Hongxu Yin, Yang Fu, Qiushan Guo, Ruihan Yang, Jan Kautz, Xiaolong Wang, and Sifei Liu. Spatialrgpt: Grounded spatial reasoning in vision-language models.Advances in Neural Information Processing Systems, 37:135062–135093, 2024

  15. [15]

    Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality

    Wei-Lin Chiang, Zhuohan Li, Ziqing Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang, Joseph E Gonzalez, et al. Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality. See https://vicuna.lmsys.org (accessed 14 April 2023), 2(3):6, 2023

  16. [16]

    Dreamllm: Synergistic multimodal comprehension and creation

    Runpei Dong, Chunrui Han, Yuang Peng, Zekun Qi, Zheng Ge, Jinrong Yang, Liang Zhao, Jianjian Sun, Hongyu Zhou, Haoran Wei, et al. Dreamllm: Synergistic multimodal comprehension and creation.arXiv preprint arXiv:2309.11499, 2023

  17. [17]

    Depth map prediction from a single image using a multi-scale deep network

    David Eigen, Christian Puhrsch, and Rob Fergus. Depth map prediction from a single image using a multi-scale deep network. Advances in neural information processing systems, 27, 2014

  18. [18]

    VLM-3R: Vision-Language Models Augmented with Instruction-Aligned 3D Reconstruction

    Zhiwen Fan, Jian Zhang, Renjie Li, Junge Zhang, Runjin Chen, Hezhen Hu, Kevin Wang, Huaizhi Qu, Dilin Wang, Zhicheng Yan, et al. Vlm-3r: Vision-language models augmented with instruction-aligned 3d reconstruction.arXiv preprint arXiv:2505.20279, 2025

  19. [19]

    Blink: Multimodal large language models can see but not perceive

    Xingyu Fu, Yushi Hu, Bangzheng Li, Yu Feng, Haoyu Wang, Xudong Lin, Dan Roth, Noah A Smith, Wei-Chiu Ma, and Ranjay Krishna. Blink: Multimodal large language models can see but not perceive. In European Conference on Computer Vision, pages 148–166. Springer, 2024

  20. [20]

    Are we ready for autonomous driving? the kitti vision benchmark suite

    Andreas Geiger, Philip Lenz, and Raquel Urtasun. Are we ready for autonomous driving? the kitti vision benchmark suite. In2012 IEEE conference on computer vision and pattern recognition, pages 3354–3361. IEEE, 2012

  21. [21]

    3d packing for self-supervised monocular depth estimation

    Vitor Guizilini, Rares Ambrus, Sudeep Pillai, Allan Raventos, and Adrien Gaidon. 3d packing for self-supervised monocular depth estimation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 2485–2494, 2020

  22. [22]

    Seed1.5-VL Technical Report

    Dong Guo, Faming Wu, Feida Zhu, Fuxing Leng, Guang Shi, Haobin Chen, Haoqi Fan, Jian Wang, Jianyu Jiang, Jiawei Wang, et al. Seed1.5-vl technical report.arXiv preprint arXiv:2505.07062, 2025

  23. [23]

    3d-llm: Injecting the 3d world into large language models

    Yining Hong, Haoyu Zhen, Peihao Chen, Shuhong Zheng, Yilun Du, Zhenfang Chen, and Chuang Gan. 3d-llm: Injecting the 3d world into large language models.Advances in Neural Information Processing Systems, 36:20482–20494, 2023

  24. [24]

    Mu Hu, Wei Yin, Chi Zhang, Zhipeng Cai, Xiaoxiao Long, Hao Chen, Kaixuan Wang, Gang Yu, Chunhua Shen, and Shaojie Shen. Metric3d v2: A versatile monocular geometric foundation model for zero-shot metric depth and surface normal estimation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 46(12):10579–10596, 2024

  25. [25]

    G2vlm: Geometry grounded vision language model with unified 3d reconstruction and spatial reasoning

    Wenbo Hu, Jingli Lin, Yilin Long, Yunlong Ran, Lihan Jiang, Yifan Wang, Chenming Zhu, Runsen Xu, Tai Wang, and Jiangmiao Pang. G2vlm: Geometry grounded vision language model with unified 3d reconstruction and spatial reasoning. arXiv preprint arXiv:2511.21688, 2025

  26. [26]

    Chat-scene: Bridging 3d scene and large language models with object identifiers

    Haifeng Huang, Yilun Chen, Zehan Wang, Rongjie Huang, Runsen Xu, Tai Wang, Luping Liu, Xize Cheng, Yang Zhao, Jiangmiao Pang, et al. Chat-scene: Bridging 3d scene and large language models with object identifiers.Advances in Neural Information Processing Systems, 37:113991–114017, 2024

  27. [27]

    An Embodied Generalist Agent in 3D World

    Jiangyong Huang, Silong Yong, Xiaojian Ma, Xiongkun Linghu, Puhao Li, Yan Wang, Qing Li, Song-Chun Zhu, Baoxiong Jia, and Siyuan Huang. An embodied generalist agent in 3d world. arXiv preprint arXiv:2311.12871, 2023

  28. [28]

    3drs: Mllms need 3d-aware representation supervision for scene understanding

    Xiaohu Huang, Jingjing Wu, Qunyi Xie, and Kai Han. 3drs: Mllms need 3d-aware representation supervision for scene understanding.arXiv preprint arXiv:2506.01946, 2025

  29. [29]

    GPT-4o System Card

    Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, et al. Gpt-4o system card.arXiv preprint arXiv:2410.21276, 2024

  30. [30]

    Streamingassistant: Efficient visual token pruning for accelerating online video understanding

    Xinqi Jin, Hanxun Yu, Bohan Yu, Kebin Liu, Jian Liu, Keda Tao, Yixuan Pei, Huan Wang, Fan Dang, Jiangchuan Liu, et al. Streamingassistant: Efficient visual token pruning for accelerating online video understanding.arXiv preprint arXiv:2512.12560, 2025

  31. [31]

    Evaluation of cnn-based single-image depth estimation methods

    Tobias Koch, Lukas Liebel, Friedrich Fraundorfer, and Marco Korner. Evaluation of cnn-based single-image depth estimation methods. In Proceedings of the European Conference on Computer Vision (ECCV) Workshops, pages 0–0, 2018

  32. [32]

    Evaluating object hallucination in large vision-language models

    Yifan Li, Yifan Du, Kun Zhou, Jinpeng Wang, Xin Zhao, and Ji-Rong Wen. Evaluating object hallucination in large vision-language models. InProceedings of the 2023 conference on empirical methods in natural language processing, pages 292–305, 2023

  33. [33]

    Refinenet: Multi-path refinement networks for high-resolution semantic segmentation

    Guosheng Lin, Anton Milan, Chunhua Shen, and Ian Reid. Refinenet: Multi-path refinement networks for high-resolution semantic segmentation. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 1925–1934, 2017

  34. [34]

    Depth Anything 3: Recovering the Visual Space from Any Views

    Haotong Lin, Sili Chen, Junhao Liew, Donny Y Chen, Zhenyu Li, Guang Shi, Jiashi Feng, and Bingyi Kang. Depth anything 3: Recovering the visual space from any views. arXiv preprint arXiv:2511.10647, 2025

  35. [35]

    DeepSeek-V3 Technical Report

    Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, et al. Deepseek-v3 technical report.arXiv preprint arXiv:2412.19437, 2024

  36. [36]

    Mmbench: Is your multi-modal model an all-around player?

    Yuan Liu, Haodong Duan, Yuanhan Zhang, Bo Li, Songyang Zhang, Wangbo Zhao, Yike Yuan, Jiaqi Wang, Conghui He, Ziwei Liu, et al. Mmbench: Is your multi-modal model an all-around player? In European conference on computer vision, pages 216–233. Springer, 2024

  37. [37]

    OCRBench: On the Hidden Mystery of OCR in Large Multimodal Models

    Yuliang Liu, Zhang Li, Mingxin Huang, Biao Yang, Wenwen Yu, Chunyuan Li, Xucheng Yin, Cheng-lin Liu, Lianwen Jin, and Xiang Bai. Ocrbench: On the hidden mystery of ocr in large multimodal models. arXiv preprint arXiv:2305.07895, 2023

  38. [38]

    Learn to explain: Multimodal reasoning via thought chains for science question answering

    Pan Lu, Swaroop Mishra, Tony Xia, Liang Qiu, Kai-Wei Chang, Song-Chun Zhu, Oyvind Tafjord, Peter Clark, and Ashwin Kalyan. Learn to explain: Multimodal reasoning via thought chains for science question answering. InThe 36th Conference on Neural Information Processing Systems (NeurIPS), 2022

  39. [39]

    Sqa3d: Situated question answering in 3d scenes

    Xiaojian Ma, Silong Yong, Zilong Zheng, Qing Li, Yitao Liang, Song-Chun Zhu, and Siyuan Huang. Sqa3d: Situated question answering in 3d scenes.arXiv preprint arXiv:2210.07474, 2022

  40. [40]

    Openeqa: Embodied question answering in the era of foundation models

    Arjun Majumdar, Anurag Ajay, Xiaohan Zhang, Pranav Putta, Sriram Yenamandra, Mikael Henaff, Sneha Silwal, Paul Mcvay, Oleksandr Maksymets, Sergio Arnaud, et al. Openeqa: Embodied question answering in the era of foundation models. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 16488–16498, 2024

  41. [41]

    OpenAI GPT-5 System Card

    OpenAI. Openai gpt-5 system card.arXiv preprint arXiv:2601.03267, 2025

  42. [42]

    Unidepthv2: Universal monocular metric depth estimation made simpler

    Luigi Piccinelli, Christos Sakaridis, Yung-Hsu Yang, Mattia Segu, Siyuan Li, Wim Abbeloos, and Luc Van Gool. Unidepthv2: Universal monocular metric depth estimation made simpler.IEEE Transactions on Pattern Analysis and Machine Intelligence, 2025

  43. [43]

    Unidepth: Universal monocular metric depth estimation

    Luigi Piccinelli, Yung-Hsu Yang, Christos Sakaridis, Mattia Segu, Siyuan Li, Luc Van Gool, and Fisher Yu. Unidepth: Universal monocular metric depth estimation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10106–10116, 2024

  44. [44]

    Gpt4scene: Understand 3d scenes from videos with vision-language models

    Zhangyang Qi, Zhixiong Zhang, Ye Fang, Jiaqi Wang, and Hengshuang Zhao. Gpt4scene: Understand 3d scenes from videos with vision-language models.arXiv preprint arXiv:2501.01428, 2025

  45. [45]

    Habitat-matterport 3d dataset (HM3d): 1000 large-scale 3d environments for embodied AI

    Santhosh Kumar Ramakrishnan, Aaron Gokaslan, Erik Wijmans, Oleksandr Maksymets, Alexander Clegg, John M Turner, Eric Undersander, Wojciech Galuba, Andrew Westbury, Angel X Chang, Manolis Savva, Yili Zhao, and Dhruv Batra. Habitat-matterport 3d dataset (HM3d): 1000 large-scale 3d environments for embodied AI. InThirty-fifth Conference on Neural Information...

  46. [46]

    Vision transformers for dense prediction

    René Ranftl, Alexey Bochkovskiy, and Vladlen Koltun. Vision transformers for dense prediction. In Proceedings of the IEEE/CVF international conference on computer vision, pages 12179–12188, 2021

  47. [47]

    René Ranftl, Katrin Lasinger, David Hafner, Konrad Schindler, and Vladlen Koltun. Towards robust monocular depth estimation: Mixing datasets for zero-shot cross-dataset transfer.IEEE transactions on pattern analysis and machine intelligence, 44(3):1623–1637, 2020

  48. [48]

    Glamm: Pixel grounding large multimodal model

    Hanoona Rasheed, Muhammad Maaz, Sahal Shaji, Abdelrahman Shaker, Salman Khan, Hisham Cholakkal, Rao M Anwer, Eric Xing, Ming-Hsuan Yang, and Fahad S Khan. Glamm: Pixel grounding large multimodal model. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13009–13018, 2024

  49. [49]

    A multi-view stereo benchmark with high-resolution images and multi-camera videos

    Thomas Schops, Johannes L Schonberger, Silvano Galliani, Torsten Sattler, Konrad Schindler, Marc Pollefeys, and Andreas Geiger. A multi-view stereo benchmark with high-resolution images and multi-camera videos. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3260–3269, 2017

  50. [50]

    Indoor segmentation and support inference from rgbd images

    Nathan Silberman, Derek Hoiem, Pushmeet Kohli, and Rob Fergus. Indoor segmentation and support inference from rgbd images. InEuropean conference on computer vision, pages 746–760. Springer, 2012

  51. [51]

    Sun rgb-d: A rgb-d scene understanding benchmark suite

    Shuran Song, Samuel P Lichtenberg, and Jianxiong Xiao. Sun rgb-d: A rgb-d scene understanding benchmark suite. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 567–576, 2015

  52. [52]

    Scalability in perception for autonomous driving: Waymo open dataset

    Pei Sun, Henrik Kretzschmar, Xerxes Dotiwalla, Aurelien Chouard, Vijaysai Patnaik, Paul Tsui, James Guo, Yin Zhou, Yuning Chai, Benjamin Caine, et al. Scalability in perception for autonomous driving: Waymo open dataset. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 2446–2454, 2020

  53. [53]

    Ufo: A unified approach to fine-grained visual perception via open-ended language interface

    Hao Tang, Chenwei Xie, Haiyang Wang, Xiaoyi Bao, Tingyu Weng, Pandeng Li, Yun Zheng, and Liwei Wang. Ufo: A unified approach to fine-grained visual perception via open-ended language interface. arXiv preprint arXiv:2503.01342, 2025

  54. [54]

    Llama 2: Open Foundation and Fine-Tuned Chat Models

    Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023

  55. [55]

    Wan: Open and Advanced Large-Scale Video Generative Models

    Team Wan, Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianxiao Yang, et al. Wan: Open and advanced large-scale video generative models. arXiv preprint arXiv:2503.20314, 2025

  56. [56]

    Ross3d: Reconstructive visual instruction tuning with 3d-awareness

    Haochen Wang, Yucheng Zhao, Tiancai Wang, Haoqiang Fan, Xiangyu Zhang, and Zhaoxiang Zhang. Ross3d: Reconstructive visual instruction tuning with 3d-awareness. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 9275–9286, 2025

  57. [57]

    Vggt: Visual geometry grounded transformer

    Jianyuan Wang, Minghao Chen, Nikita Karaev, Andrea Vedaldi, Christian Rupprecht, and David Novotny. Vggt: Visual geometry grounded transformer. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 5294–5306, 2025

  58. [58]

    Continuous 3d perception model with persistent state

    Qianqian Wang, Yifei Zhang, Aleksander Holynski, Alexei A Efros, and Angjoo Kanazawa. Continuous 3d perception model with persistent state. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 10510–10522, 2025

  59. [59]

    InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency

    Weiyun Wang, Zhangwei Gao, Lixin Gu, Hengjun Pu, Long Cui, Xingguang Wei, Zhaoyang Liu, Linglin Jing, Shenglong Ye, Jie Shao, et al. Internvl3.5: Advancing open-source multimodal models in versatility, reasoning, and efficiency. arXiv preprint arXiv:2508.18265, 2025

  60. [60]

    $\pi^3$: Permutation-Equivariant Visual Geometry Learning

    Yifan Wang, Jianjun Zhou, Haoyi Zhu, Wenzheng Chang, Yang Zhou, Zizun Li, Junyi Chen, Jiangmiao Pang, Chunhua Shen, and Tong He. $\pi^3$: Permutation-equivariant visual geometry learning. arXiv preprint arXiv:2507.13347, 2025

  61. [61]

    N3d-vlm: Native 3d grounding enables accurate spatial reasoning in vision-language models

    Yuxin Wang, Lei Ke, Boqiang Zhang, Tianyuan Qu, Hanxun Yu, Zhenpeng Huang, Meng Yu, Dan Xu, and Dong Yu. N3d-vlm: Native 3d grounding enables accurate spatial reasoning in vision-language models. arXiv preprint arXiv:2512.16561, 2025

  62. [62]

    Physical adversarial attack meets computer vision: A decade survey

    Hui Wei, Hao Tang, Xuemei Jia, Zhixiang Wang, Hanxun Yu, Zhubo Li, Shin’ichi Satoh, Luc Van Gool, and Zheng Wang. Physical adversarial attack meets computer vision: A decade survey. IEEE Transactions on Pattern Analysis and Machine Intelligence, 46(12):9797–9817, 2024

  63. [63]

    Moiré backdoor attack (mba): A novel trigger for pedestrian detectors in the physical world

    Hui Wei, Hanxun Yu, Kewei Zhang, Zhixiang Wang, Jianke Zhu, and Zheng Wang. Moiré backdoor attack (mba): A novel trigger for pedestrian detectors in the physical world. In Proceedings of the 31st ACM International Conference on Multimedia, pages 8828–8838, 2023

  64. [64]

    Youtu-vl: Unleashing visual potential via unified vision-language supervision

    Zhixiang Wei, Yi Li, Zhehan Kan, Xinghua Jiang, Zuwei Long, Shifeng Liu, Hongze Shen, Wei Liu, Xiaoyu Tan, Haojia Lin, et al. Youtu-vl: Unleashing visual potential via unified vision-language supervision. arXiv preprint arXiv:2601.19798, 2026

  65. [65]

    Argoverse 2: Next Generation Datasets for Self-Driving Perception and Forecasting

    Benjamin Wilson, William Qi, Tanmay Agarwal, John Lambert, Jagjeet Singh, Siddhesh Khandelwal, Bowen Pan, Ratnesh Kumar, Andrew Hartnett, Jhony Kaesemodel Pontes, et al. Argoverse 2: Next generation datasets for self-driving perception and forecasting. arXiv preprint arXiv:2301.00493, 2023

  66. [66]

    Spatial-MLLM: Boosting MLLM Capabilities in Visual-based Spatial Intelligence

    Diankun Wu, Fangfu Liu, Yi-Hsin Hung, and Yueqi Duan. Spatial-mllm: Boosting mllm capabilities in visual-based spatial intelligence. arXiv preprint arXiv:2505.23747, 2025

  67. [67]

    Visionllm v2: An end-to-end generalist multimodal large language model for hundreds of vision-language tasks

    Jiannan Wu, Muyan Zhong, Sen Xing, Zeqiang Lai, Zhaoyang Liu, Zhe Chen, Wenhai Wang, Xizhou Zhu, Lewei Lu, Tong Lu, et al. Visionllm v2: An end-to-end generalist multimodal large language model for hundreds of vision-language tasks. Advances in Neural Information Processing Systems, 37:69925–69975, 2024

  68. [68]

    Generation models know space: Unleashing implicit 3d priors for scene understanding

    Xianjin Wu, Dingkang Liang, Tianrui Feng, Kui Xia, Yumeng Zhang, Xiaofan Li, Xiao Tan, and Xiang Bai. Generation models know space: Unleashing implicit 3d priors for scene understanding. arXiv preprint arXiv:2603.19235, 2026

  69. [69]

    Multi-spatialmllm: Multi-frame spatial understanding with multi-modal large language models

    Runsen Xu, Weiyao Wang, Hao Tang, Xingyu Chen, Xiaodong Wang, Fu-Jen Chu, Dahua Lin, Matt Feiszli, and Kevin J Liang. Multi-spatialmllm: Multi-frame spatial understanding with multi-modal large language models. arXiv preprint arXiv:2505.17015, 2025

  70. [70]

    Omnistream: Mastering perception, reconstruction and action in continuous streams

    Yibin Yan, Jilan Xu, Shangzhe Di, Haoning Wu, and Weidi Xie. Omnistream: Mastering perception, reconstruction and action in continuous streams. arXiv preprint arXiv:2603.12265, 2026

  71. [71]

    Qwen3 Technical Report

    An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025

  72. [72]

    Thinking in Space: How Multimodal Large Language Models See, Remember and Recall Spaces

    Jihan Yang, Shusheng Yang, Anjali Gupta, Rilyn Han, Li Fei-Fei, and Saining Xie. Thinking in Space: How Multimodal Large Language Models See, Remember and Recall Spaces. In CVPR, 2025

  73. [73]

    Depth anything: Unleashing the power of large-scale unlabeled data

    Lihe Yang, Bingyi Kang, Zilong Huang, Xiaogang Xu, Jiashi Feng, and Hengshuang Zhao. Depth anything: Unleashing the power of large-scale unlabeled data. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10371–10381, 2024

  74. [74]

    Depth anything v2

    Lihe Yang, Bingyi Kang, Zilong Huang, Zhen Zhao, Xiaogang Xu, Jiashi Feng, and Hengshuang Zhao. Depth anything v2. Advances in Neural Information Processing Systems, 37:21875–21911, 2024

  75. [75]

    Cambrian-S: Towards Spatial Supersensing in Video

    Shusheng Yang, Jihan Yang, Pinzhi Huang, Ellis Brown, Zihao Yang, Yue Yu, Shengbang Tong, Zihan Zheng, Yifan Xu, Muhan Wang, Daohan Lu, Rob Fergus, Yann LeCun, Li Fei-Fei, and Saining Xie. Cambrian-s: Towards spatial supersensing in video. arXiv preprint arXiv:2511.04670, 2025

  76. [76]

    Scannet++: A high-fidelity dataset of 3d indoor scenes

    Chandan Yeshwanth, Yueh-Cheng Liu, Matthias Nießner, and Angela Dai. Scannet++: A high-fidelity dataset of 3d indoor scenes. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 12–22, 2023

  77. [77]

    Metric3d: Towards zero-shot metric 3d prediction from a single image

    Wei Yin, Chi Zhang, Hao Chen, Zhipeng Cai, Gang Yu, Kaixuan Wang, Xiaozhi Chen, and Chunhua Shen. Metric3d: Towards zero-shot metric 3d prediction from a single image. In Proceedings of the IEEE/CVF international conference on computer vision, pages 9043–9053, 2023

  78. [78]

    Visiontrim: Unified vision token compression for training-free mllm acceleration

    Hanxun Yu, Wentong Li, Xuan Qu, Song Wang, Junbo Chen, and Jianke Zhu. Visiontrim: Unified vision token compression for training-free mllm acceleration. arXiv preprint arXiv:2601.22674, 2026

  79. [79]

    Inst3d-lmm: Instance-aware 3d scene understanding with multi-modal instruction tuning

    Hanxun Yu, Wentong Li, Song Wang, Junbo Chen, and Jianke Zhu. Inst3d-lmm: Instance-aware 3d scene understanding with multi-modal instruction tuning. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 14147–14157, 2025

  80. [80]

    Taskonomy: Disentangling task transfer learning

    Amir R Zamir, Alexander Sax, William Shen, Leonidas J Guibas, Jitendra Malik, and Silvio Savarese. Taskonomy: Disentangling task transfer learning. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3712–3722, 2018

Showing first 80 references.