arxiv: 2410.02073 · v2 · submitted 2024-10-02 · 💻 cs.CV · cs.LG

Depth Pro: Sharp Monocular Metric Depth in Less Than a Second

Pith reviewed 2026-05-14 20:40 UTC · model grok-4.3

classification 💻 cs.CV cs.LG
keywords monocular depth estimation · metric depth · zero-shot depth · vision transformer · boundary accuracy · focal length estimation · high-resolution depth maps

The pith

Depth Pro produces sharp, metric-scale depth maps from single images in 0.3 seconds without any camera metadata.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Depth Pro as a foundation model for zero-shot metric monocular depth estimation. It generates high-resolution depth maps that preserve fine details and absolute scale even when no camera information is supplied. The approach combines an efficient multi-scale vision transformer with training on mixed real and synthetic data to balance metric accuracy and edge sharpness. According to the abstract, the model produces a 2.25-megapixel depth map in 0.3 seconds on a standard GPU and outperforms prior work along multiple dimensions. The work also adds new evaluation measures focused on boundary quality and includes single-image focal length estimation as a supporting capability.

Core claim

Depth Pro synthesizes high-resolution depth maps with unparalleled sharpness and high-frequency details. The predictions are metric, with absolute scale, without relying on the availability of metadata such as camera intrinsics. These characteristics are enabled by an efficient multi-scale vision transformer for dense prediction, a training protocol that combines real and synthetic datasets to achieve high metric accuracy alongside fine boundary tracing, dedicated evaluation metrics for boundary accuracy in estimated depth maps, and state-of-the-art focal length estimation from a single image.

What carries the argument

An efficient multi-scale vision transformer for dense prediction paired with a mixed real-synthetic training protocol that jointly optimizes metric scale and boundary fidelity.

If this is right

  • The model generates 2.25-megapixel depth maps in 0.3 seconds on a standard GPU.
  • Depth estimates remain metric and absolute without camera intrinsics or other metadata.
  • Boundary accuracy in estimated depth maps can be quantified directly with the dedicated evaluation metrics the paper introduces.
  • Single-image focal length estimation reaches state-of-the-art levels as a byproduct.
  • Overall performance exceeds prior monocular depth methods across multiple accuracy dimensions.
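To make the idea of a boundary-accuracy metric concrete, here is a minimal sketch of one possible edge-based score. It is an illustration only and not the paper's definition: the function names, the 15% depth-ratio threshold, and the restriction to horizontally and vertically adjacent pixels are all assumptions made for this example. It marks a pixel pair as a depth edge when the two depths differ by more than the threshold ratio, then scores predicted edges against ground-truth edges with an F1 measure.

```python
def occluding_edges(depth, t=0.15):
    """Return the set of adjacent pixel pairs whose depth ratio exceeds 1 + t.

    `depth` is a 2D list of positive depths; `t` is a hypothetical threshold.
    """
    edges = set()
    h, w = len(depth), len(depth[0])
    for y in range(h):
        for x in range(w):
            # Only look right and down so each pair is visited once.
            for dy, dx in ((0, 1), (1, 0)):
                ny, nx = y + dy, x + dx
                if ny < h and nx < w:
                    a, b = depth[y][x], depth[ny][nx]
                    if max(a, b) / min(a, b) > 1 + t:
                        edges.add(((y, x), (ny, nx)))
    return edges


def boundary_f1(pred, gt, t=0.15):
    """F1 score between predicted and ground-truth depth edges."""
    pred_edges = occluding_edges(pred, t)
    gt_edges = occluding_edges(gt, t)
    true_pos = len(pred_edges & gt_edges)
    if true_pos == 0:
        return 0.0
    precision = true_pos / len(pred_edges)
    recall = true_pos / len(gt_edges)
    return 2 * precision * recall / (precision + recall)


# A sharp step edge versus a smeared version of the same edge.
gt = [[1.0, 1.0, 5.0, 5.0], [1.0, 1.0, 5.0, 5.0]]
blurred = [[1.0, 2.0, 4.0, 5.0], [1.0, 2.0, 4.0, 5.0]]
print(boundary_f1(gt, gt), boundary_f1(blurred, gt))
```

In this toy case the blurred prediction spreads one true edge across three pixel pairs per row, so recall stays perfect while precision drops, which is the kind of failure a boundary-focused metric is meant to expose.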

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Single-image depth systems could now be deployed in settings where camera calibration data is unavailable or unreliable.
  • The emphasis on boundary sharpness suggests the outputs may integrate more cleanly into downstream 3D reconstruction pipelines.
  • Hybrid real-synthetic training may generalize to other dense prediction tasks that require both metric consistency and fine detail preservation.

Load-bearing premise

The training protocol that mixes real and synthetic datasets together with the multi-scale vision transformer succeeds at delivering both accurate absolute scale and sharp boundaries in zero-shot settings without camera intrinsics.

What would settle it

A collection of real-world images with independently measured ground-truth metric depths and focal lengths where Depth Pro produces large scale errors or visibly blurred object boundaries when run without any camera metadata.
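One way to make "large scale errors" measurable is with two metrics widely used in depth estimation: absolute relative error (AbsRel) and the δ1 inlier ratio. The sketch below is a plain-Python illustration on flattened depth lists; the function names and the toy values are assumptions for this example, not taken from the paper.

```python
def abs_rel(pred, gt):
    """Mean of |pred - gt| / gt over pixels with valid (positive) ground truth."""
    pairs = [(p, g) for p, g in zip(pred, gt) if g > 0]
    return sum(abs(p - g) / g for p, g in pairs) / len(pairs)


def delta1(pred, gt, threshold=1.25):
    """Fraction of valid pixels where max(pred/gt, gt/pred) < threshold."""
    pairs = [(p, g) for p, g in zip(pred, gt) if g > 0 and p > 0]
    hits = sum(1 for p, g in pairs if max(p / g, g / p) < threshold)
    return hits / len(pairs)


# Toy ground truth in metres, a close prediction, and one with a global 1.5x scale error.
gt = [1.0, 2.0, 4.0, 8.0]
pred_close = [1.1, 1.9, 4.2, 7.6]
pred_scaled = [1.5 * g for g in gt]

print(abs_rel(pred_close, gt), delta1(pred_close, gt))
print(abs_rel(pred_scaled, gt), delta1(pred_scaled, gt))
```

A prediction with correct structure but a global scale error scores poorly on both metrics, which is why they suit a test of the metric-scale claim rather than of relative depth ordering alone.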

Original abstract

We present a foundation model for zero-shot metric monocular depth estimation. Our model, Depth Pro, synthesizes high-resolution depth maps with unparalleled sharpness and high-frequency details. The predictions are metric, with absolute scale, without relying on the availability of metadata such as camera intrinsics. And the model is fast, producing a 2.25-megapixel depth map in 0.3 seconds on a standard GPU. These characteristics are enabled by a number of technical contributions, including an efficient multi-scale vision transformer for dense prediction, a training protocol that combines real and synthetic datasets to achieve high metric accuracy alongside fine boundary tracing, dedicated evaluation metrics for boundary accuracy in estimated depth maps, and state-of-the-art focal length estimation from a single image. Extensive experiments analyze specific design choices and demonstrate that Depth Pro outperforms prior work along multiple dimensions. We release code and weights at https://github.com/apple/ml-depth-pro

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper presents Depth Pro, a foundation model for zero-shot metric monocular depth estimation. It claims to synthesize high-resolution depth maps with high sharpness and fine details, providing absolute metric scale from a single image without camera intrinsics or metadata. The model runs in 0.3 seconds for 2.25-megapixel outputs on standard GPUs. Key contributions include an efficient multi-scale vision transformer for dense prediction, a training protocol mixing real and synthetic data for metric accuracy and boundary precision, new dedicated metrics for boundary accuracy, and state-of-the-art single-image focal length estimation. Extensive experiments are said to demonstrate outperformance over prior work along multiple dimensions, with code and weights released.

Significance. If the central claims hold, the work would be significant for computer vision applications requiring fast, high-quality metric depth from monocular images without calibration data, such as robotics, AR, and 3D reconstruction. The speed, resolution, and zero-shot metric capability without intrinsics represent a practical advance over prior methods. The open release of code and weights is a clear strength for reproducibility and further research.

major comments (2)
  1. [Experiments (focal length and metric depth evaluation)] The metric-scale claim without camera intrinsics rests on the single-image focal length estimator (described as SOTA in the abstract). No section isolates focal-length prediction error on diverse real-world test sets or quantifies its propagation into absolute depth metrics; if MAE exceeds ~10-15% on out-of-distribution scenes, the reported metric gains would be undermined.
  2. [Method and Experiments sections] The training protocol combining real and synthetic datasets is asserted to achieve both high metric accuracy and fine boundary tracing in zero-shot settings, but the manuscript provides no ablation that separates the contribution of the multi-scale vision transformer from the data mixture or the new boundary metrics.
minor comments (1)
  1. [Abstract] The abstract states outperformance 'along multiple dimensions' but does not preview any quantitative numbers, error bars, or dataset names; this should be added for clarity.
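The concern in major comment 1 can be illustrated with a small calculation. The sketch below assumes that metric depth is obtained by scaling a focal-length-independent canonical prediction linearly by the focal length in pixels; this conversion, the function names, and the example numbers (image width 1536, canonical value 0.5) are assumptions made for illustration and are not stated on this page. Under that assumption, a relative focal-length error carries over one-to-one into a relative depth error.

```python
def metric_depth(canonical_inverse_depth, f_px, width_px):
    """Hypothetical conversion: depth grows linearly with focal length f_px."""
    return f_px / (width_px * canonical_inverse_depth)


def relative_depth_error(f_pred, f_true, canonical_inverse_depth=0.5, width_px=1536):
    """Relative depth error caused only by using f_pred instead of f_true."""
    d_true = metric_depth(canonical_inverse_depth, f_true, width_px)
    d_pred = metric_depth(canonical_inverse_depth, f_pred, width_px)
    return abs(d_pred - d_true) / d_true


def within_delta1(f_pred, f_true, threshold=1.25):
    """True if the focal error alone keeps every pixel inside the delta-1 band."""
    ratio = f_pred / f_true
    return max(ratio, 1.0 / ratio) < threshold


# A 15% focal-length overestimate yields a 15% depth overestimate everywhere.
print(relative_depth_error(1150.0, 1000.0))
print(within_delta1(1150.0, 1000.0), within_delta1(1300.0, 1000.0))
```

If the linear assumption holds, a focal-length error in the 10-15% range would add the same amount to AbsRel while still leaving pixels inside the δ1 < 1.25 band, so the two metrics would react differently to the same failure.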

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on the focal length evaluation and ablation studies. We address each major comment below and will incorporate revisions to strengthen the manuscript.

  1. Referee: [Experiments (focal length and metric depth evaluation)] The metric-scale claim without camera intrinsics rests on the single-image focal length estimator (described as SOTA in the abstract). No section isolates focal-length prediction error on diverse real-world test sets or quantifies its propagation into absolute depth metrics; if MAE exceeds ~10-15% on out-of-distribution scenes, the reported metric gains would be undermined.

    Authors: We agree that isolating the focal length estimator's accuracy and its effect on metric depth is important for validating the zero-shot claims. The current manuscript reports overall depth metrics and states SOTA focal length performance, but does not include a dedicated breakdown. In the revision, we will add a new subsection in the Experiments section reporting focal length MAE and relative error on multiple real-world datasets (NYU, KITTI, ETH3D, and others) and will quantify propagation by recomputing depth metrics using predicted versus ground-truth focal lengths where available. This will directly address potential error accumulation in out-of-distribution scenes. revision: yes

  2. Referee: [Method and Experiments sections] The training protocol combining real and synthetic datasets is asserted to achieve both high metric accuracy and fine boundary tracing in zero-shot settings, but the manuscript provides no ablation that separates the contribution of the multi-scale vision transformer from the data mixture or the new boundary metrics.

    Authors: The manuscript includes targeted experiments on design choices and overall performance, but we acknowledge that it lacks fully disentangled ablations separating the multi-scale ViT architecture, the real+synthetic data mixture, and the boundary-specific losses/metrics. In the revised version, we will expand the Experiments section with additional ablation tables that train and evaluate controlled variants (e.g., single-scale vs. multi-scale, real-only vs. mixed data, with vs. without boundary terms) to clearly attribute the gains in metric accuracy and boundary precision to each component. revision: yes

Circularity Check

0 steps flagged

No significant circularity in derivation chain

The paper trains a multi-scale vision transformer on external real and synthetic datasets to produce metric depth maps, with focal length estimation presented as an auxiliary SOTA component rather than a self-referential fit. No equation or claim reduces a prediction to its own inputs by construction, nor does any load-bearing step rely on a self-citation chain that itself lacks independent verification. Evaluation uses newly proposed boundary metrics on held-out benchmarks, keeping the central zero-shot metric claim externally falsifiable and independent of the model's fitted parameters.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The central claim rests on standard deep learning assumptions for vision transformers and the empirical effectiveness of mixing real and synthetic training data; no free parameters, axioms, or invented entities are explicitly introduced beyond typical model training.

pith-pipeline@v0.9.0 · 5479 in / 1121 out tokens · 46377 ms · 2026-05-14T20:40:37.299863+00:00 · methodology

Forward citations

Cited by 22 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Revisiting Photometric Ambiguity for Accurate Gaussian-Splatting Surface Reconstruction

    cs.CV 2026-05 unverdicted novelty 7.0

    AmbiSuR adds intrinsic photometric disambiguation and a self-indication module to Gaussian Splatting to resolve ambiguities and improve surface reconstruction accuracy.

  2. OA-WAM: Object-Addressable World Action Model for Robust Robot Manipulation

    cs.RO 2026-05 unverdicted novelty 7.0

    OA-WAM uses persistent address vectors and dynamic content vectors in object slots to enable addressable world-action prediction, improving robustness on manipulation benchmarks under scene changes.

  3. Monocular Depth Estimation via Neural Network with Learnable Algebraic Group and Ring Structures

    cs.CV 2026-04 unverdicted novelty 7.0

    LAGRNet embeds learnable algebraic group, ring, and sheaf structures into a neural network to improve accuracy and generalization in monocular depth estimation.

  4. LuMon: A Comprehensive Benchmark and Development Suite with Novel Datasets for Lunar Monocular Depth Estimation

    cs.CV 2026-04 unverdicted novelty 7.0

    A new benchmark with real lunar stereo ground truth and analog data shows that sim-to-real fine-tuned monocular depth models achieve large in-domain gains but minimal generalization to actual lunar images.

  5. Globally Optimal Pose from Orthographic Silhouettes

    cs.CV 2026-04 unverdicted novelty 7.0

    A search-based algorithm achieves globally optimal pose estimation from silhouettes alone by querying precomputed area response surfaces and auxiliary ellipse aspect ratios for any shape.

  6. 3D-Fixer: Coarse-to-Fine In-place Completion for 3D Scenes from a Single Image

    cs.CV 2026-04 unverdicted novelty 7.0

    3D-Fixer performs in-place 3D asset completion from single-view partial point clouds via coarse-to-fine generation with ORFA conditioning, plus a new ARSG-110K dataset, to achieve higher geometric accuracy than MIDI a...

  7. Training a Student Expert via Semi-Supervised Foundation Model Distillation

    cs.CV 2026-04 conditional novelty 7.0

    A semi-supervised framework distills vision foundation models into compact instance segmentation experts that outperform their teachers by up to 11.9 AP on Cityscapes and 8.6 AP on ADE20K while being 11 times smaller.

  8. HairOrbit: Multi-view Aware 3D Hair Modeling from Single Portraits

    cs.CV 2026-04 unverdicted novelty 7.0

    HairOrbit leverages video generation priors and a neural orientation extractor to achieve state-of-the-art strand-level 3D hair reconstruction from single-view portraits in visible and invisible regions.

  9. Multi-Modal Guided Multi-Source Domain Adaptation for Object Detection

    cs.CV 2026-05 unverdicted novelty 6.0

    MS-DePro achieves state-of-the-art performance on multi-source domain adaptation benchmarks for object detection by using depth-guided region proposals and multi-modal alignment of learnable text embeddings.

  10. A Data Efficiency Study of Synthetic Fog for Object Detection Using the Clear2Fog Pipeline

    cs.CV 2026-05 unverdicted novelty 6.0

    Clear2Fog generates realistic synthetic fog from clear scenes, enabling mixed-density training that outperforms full fixed-density data and improves real-world performance by 1.67 mAP after learning-rate adjustment.

  11. GeoQuery: Geometry-Query Diffusion for Sparse-View Reconstruction

    cs.CV 2026-05 unverdicted novelty 6.0

    GeoQuery replaces corrupted rendering features with geometry-aligned proxy queries and restricts cross-view attention to local windows, enabling robust diffusion-based refinement under extreme view sparsity.

  12. Target-depth sensing with metasurface-encoder integrated optoelectronic neural network

    physics.optics 2026-04 unverdicted novelty 6.0

    A metasurface optical encoder compresses depth into 2D images for a shadow ResNet to achieve high accuracy in both target classification and depth estimation on MNIST and vehicle datasets.

  13. MLG-Stereo: ViT Based Stereo Matching with Multi-Stage Local-Global Enhancement

    cs.CV 2026-04 unverdicted novelty 6.0

    MLG-Stereo adds multi-granularity feature extraction, local-global cost volumes, and guided recurrent refinement to ViT stereo matching, yielding competitive results on Middlebury, KITTI-2015, and strong results on KI...

  14. Image Generators are Generalist Vision Learners

    cs.CV 2026-04 unverdicted novelty 6.0

    Image generation pretraining produces generalist vision models that reframe perception tasks as image synthesis and reach SOTA results on segmentation, depth estimation, and other 2D/3D tasks.

  15. Image Generators are Generalist Vision Learners

    cs.CV 2026-04 conditional novelty 6.0

    Image generation pretraining builds generalist vision models that reach SOTA on 2D and 3D perception tasks by reframing them as RGB image outputs.

  16. In Depth We Trust: Reliable Monocular Depth Supervision for Gaussian Splatting

    cs.CV 2026-04 unverdicted novelty 6.0

    A selective regularization framework lets scale-ambiguous monocular depth priors improve Gaussian Splatting geometry and rendering by isolating and supervising only ill-posed regions.

  17. Depth Anything 3: Recovering the Visual Space from Any Views

    cs.CV 2025-11 unverdicted novelty 6.0

    DA3 recovers consistent visual geometry from arbitrary views via a vanilla DINO transformer and depth-ray target, setting new SOTA on a visual geometry benchmark while outperforming DA2 on monocular depth.

  18. The Midas Touch for Metric Depth

    cs.CV 2026-05 unverdicted novelty 5.0

    MTD turns relative depth into metric depth via segment-wise sparse graph optimization and discontinuity-aware geodesic pixel refinement, claiming better accuracy and generalization than prior depth methods.

  19. Sapiens2

    cs.CV 2026-04 unverdicted novelty 5.0

    Sapiens2 improves pretraining, data scale, and architecture over its predecessor to set new state-of-the-art results on human pose estimation, body-part segmentation, normal estimation, and new tasks like pointmap and...

  20. Qwen-Image Technical Report

    cs.CV 2025-08 unverdicted novelty 5.0

    Qwen-Image is a foundation model that reaches state-of-the-art results in image generation and editing by combining a large-scale text-focused data pipeline with curriculum learning and dual semantic-reconstructive en...

  21. Genie Sim PanoRecon: Fast Immersive Scene Generation from Single-View Panorama

    cs.RO 2026-04 unverdicted novelty 4.0

    A feed-forward Gaussian-splatting system reconstructs photo-realistic 3D scenes from single-view panoramas in seconds via cube-map decomposition and depth-aware fusion for robotic simulation use.

  22. Depth-Aware Rover: A Study of Edge AI and Monocular Vision for Real-World Implementation

    cs.CV 2026-04 unverdicted novelty 3.0

    Monocular depth estimation with UniDepthV2 on Raspberry Pi enables cost-effective rover navigation, proving more robust than stereo vision in real-world tests at 0.1 FPS depth and 10 FPS detection.
