arxiv: 2302.12288 · v1 · submitted 2023-02-23 · 💻 cs.CV

ZoeDepth: Zero-shot Transfer by Combining Relative and Metric Depth

Shariq Farooq Bhat, Reiner Birkl, Diana Wofk, Peter Wonka, Matthias Müller

Pith reviewed 2026-05-14 22:09 UTC · model grok-4.3

classification 💻 cs.CV

keywords: depth estimation, monocular depth, zero-shot transfer, relative depth, metric depth, generalization

The pith

A model pre-trained on relative depths from twelve datasets and fine-tuned on metric depth achieves strong zero-shot generalization while preserving scale.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper seeks to overcome the limitation that depth estimation models either generalize across scenes but ignore absolute scale or achieve metric accuracy only within the training distribution. It does this by first training on relative depth predictions from a large collection of datasets and then attaching and fine-tuning lightweight metric-specific heads. A latent classifier decides which head to use for each image at test time. If successful, this produces a single network that works on both indoor and outdoor scenes it has never seen, with metric outputs that remain consistent with physical distances.
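
To make the distinction between relative and metric depth concrete, here is a minimal sketch of what "ignoring absolute scale" means. It assumes a common convention from the relative-depth literature, where a prediction is only defined up to an unknown scale and shift and is aligned to ground truth by least squares before comparison. The function name `align_scale_shift` and the sample values are hypothetical, and the page above does not say which space (depth or inverse depth) the paper aligns in.

```python
def align_scale_shift(pred, target):
    """Least-squares scale s and shift t minimizing sum((s*pred + t - target)^2)."""
    n = len(pred)
    mean_p = sum(pred) / n
    mean_t = sum(target) / n
    var_p = sum((p - mean_p) ** 2 for p in pred)
    if var_p == 0:
        raise ValueError("prediction is constant; scale is undetermined")
    cov = sum((p - mean_p) * (t - mean_t) for p, t in zip(pred, target))
    scale = cov / var_p
    shift = mean_t - scale * mean_p
    return scale, shift


# Hypothetical values: a unitless relative prediction and metric depth in metres.
relative = [0.1, 0.4, 0.7, 1.0]
metric = [1.5, 3.0, 4.5, 6.0]

scale, shift = align_scale_shift(relative, metric)
aligned = [scale * r + shift for r in relative]
```

A relative-depth model is only judged after this alignment, so it never has to commit to metres. A metric model has to produce `metric` directly, which is the gap the fine-tuned heads are meant to close.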

Core claim

Pre-training a depth network on relative depth from twelve different datasets, followed by fine-tuning separate metric bins modules on NYU Depth v2 and KITTI, and routing inputs via a latent classifier, yields the first model that can train jointly on indoor and outdoor data without performance loss and generalizes metric depth to eight unseen datasets.

What carries the argument

The metric bins module, which learns to adjust the centers and widths of depth bins for each domain to produce metric-scale outputs, selected by a latent classifier that identifies the appropriate domain from image features.
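
The mechanism described above can be sketched in a few lines. This is an illustration under stated assumptions, not the paper's implementation: it assumes an adaptive-bins style readout in which positive bin widths are turned into bin centers and the depth is the probability-weighted sum of those centers, and it assumes the classifier simply picks the head with the highest score. The names `HEADS`, `bin_centers`, `pixel_depth`, `route`, and `predict`, the depth ranges, and the widths are all hypothetical.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def bin_centers(widths, d_min, d_max):
    """Turn positive bin widths into centers that tile [d_min, d_max]."""
    total = sum(widths)
    centers, left = [], d_min
    for w in widths:
        span = (d_max - d_min) * w / total
        centers.append(left + span / 2)
        left += span
    return centers


def pixel_depth(bin_logits, centers):
    """Depth for one pixel: expected bin center under the bin probabilities."""
    probs = softmax(bin_logits)
    return sum(p * c for p, c in zip(probs, centers))


# Hypothetical per-domain heads; ranges and widths are illustrative only.
HEADS = {
    "indoor": {"widths": [1, 1, 2, 4], "range": (0.0, 10.0)},
    "outdoor": {"widths": [1, 1, 2, 4], "range": (0.0, 80.0)},
}


def route(domain_logits):
    """Pick the head whose domain score is highest (assumed routing rule)."""
    names = list(HEADS)
    probs = softmax(domain_logits)
    return names[probs.index(max(probs))]


def predict(domain_logits, bin_logits):
    head = HEADS[route(domain_logits)]
    centers = bin_centers(head["widths"], *head["range"])
    return pixel_depth(bin_logits, centers)
```

The same bin scores give very different metric depths depending on which head is chosen, which is why the routing decision matters so much for the final numbers.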

If this is right

Without any pre-training the approach already improves state-of-the-art relative error on the NYU indoor dataset.
Pre-training on twelve datasets then fine-tuning on NYU improves on the prior state of the art by a total of 21 percent in relative absolute error (REL).
The model can be trained jointly on NYU and KITTI with no significant performance drop.
Zero-shot transfer reaches eight previously unseen indoor and outdoor datasets.
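
For readers unfamiliar with the metric behind these claims, the following sketch shows the usual definition of absolute relative error (REL), the mean of |prediction - ground truth| / ground truth over valid pixels, and how a percentage improvement between two REL values is computed. The function names are hypothetical, and the numbers used with them are made up for illustration; they are not results from the paper.

```python
def abs_rel(pred, gt):
    """Mean absolute relative error over pixels with valid (positive) ground truth."""
    pairs = [(p, g) for p, g in zip(pred, gt) if g > 0]
    if not pairs:
        raise ValueError("no valid ground-truth pixels")
    return sum(abs(p - g) / g for p, g in pairs) / len(pairs)


def relative_improvement(baseline, new):
    """Fractional reduction of an error metric relative to a baseline."""
    return (baseline - new) / baseline


# Made-up depths in metres, for illustration only.
rel = abs_rel([2.2, 3.6, 5.0], [2.0, 4.0, 5.0])

# Made-up REL values chosen so that the ratio comes out to 21 percent.
gain = relative_improvement(0.100, 0.079)
```

Note that a "21 percent" improvement in this sense is a ratio of two error values, not a drop of 21 percentage points.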

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The routing mechanism may allow new domains to be added easily by training only a new head and updating the classifier.
This separation of relative and metric learning could apply to other scale-sensitive tasks such as surface normal estimation or camera pose.
Real-world deployment in mixed environments like autonomous driving in cities with indoor navigation would benefit from the domain-agnostic routing.

Load-bearing premise

The latent classifier must correctly identify which metric head to use for each input image, even when the image comes from an unseen domain or lies between domains.

What would settle it

A drop in accuracy on the eight unseen test datasets or on images that are hard to classify as indoor or outdoor would indicate the routing step fails to preserve the claimed performance.

Original abstract

This paper tackles the problem of depth estimation from a single image. Existing work either focuses on generalization performance disregarding metric scale, i.e. relative depth estimation, or state-of-the-art results on specific datasets, i.e. metric depth estimation. We propose the first approach that combines both worlds, leading to a model with excellent generalization performance while maintaining metric scale. Our flagship model, ZoeD-M12-NK, is pre-trained on 12 datasets using relative depth and fine-tuned on two datasets using metric depth. We use a lightweight head with a novel bin adjustment design called metric bins module for each domain. During inference, each input image is automatically routed to the appropriate head using a latent classifier. Our framework admits multiple configurations depending on the datasets used for relative depth pre-training and metric fine-tuning. Without pre-training, we can already significantly improve the state of the art (SOTA) on the NYU Depth v2 indoor dataset. Pre-training on twelve datasets and fine-tuning on the NYU Depth v2 indoor dataset, we can further improve SOTA for a total of 21% in terms of relative absolute error (REL). Finally, ZoeD-M12-NK is the first model that can jointly train on multiple datasets (NYU Depth v2 and KITTI) without a significant drop in performance and achieve unprecedented zero-shot generalization performance to eight unseen datasets from both indoor and outdoor domains. The code and pre-trained models are publicly available at https://github.com/isl-org/ZoeDepth .

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

ZoeDepth combines relative pre-training on 12 datasets with a metric bins module and latent routing to get solid benchmark gains and joint indoor-outdoor training, but the routing step lacks the ablations needed to fully back the zero-shot claims.

The core advance is pre-training a relative depth backbone on a dozen datasets, then attaching lightweight metric heads that use a bins adjustment module for specific domains like NYU and KITTI. At test time a latent classifier routes each image to the matching head. This setup lets them train jointly on NYU and KITTI without a significant performance drop, report a 21% REL improvement over the prior state of the art on NYU when pre-training is followed by NYU fine-tuning, and post stronger zero-shot numbers on eight held-out indoor and outdoor sets. The code release is the practical part that makes the numbers checkable.

Referee Report

2 major / 2 minor

Summary. The paper introduces ZoeDepth, a monocular depth estimation framework that pre-trains on 12 datasets for relative depth and fine-tunes on NYU Depth v2 and KITTI using domain-specific metric bins modules. A latent classifier routes each input image to the appropriate metric head at inference time. Pre-training on the 12 datasets and then fine-tuning on NYU is reported to improve REL by a total of 21% over the prior state of the art. The flagship ZoeD-M12-NK model is trained jointly on NYU and KITTI without a significant performance drop and achieves strong zero-shot generalization to eight unseen indoor and outdoor datasets.

Significance. If the empirical results hold under the routing mechanism, the work provides a practical bridge between relative-depth generalization and metric-scale accuracy, with the first demonstrated joint multi-domain metric training and broad zero-shot transfer. Public code release aids verification of the reported gains.

major comments (2)

Experiments section (zero-shot evaluation tables): no confusion matrix, per-dataset routing accuracy, or forced-wrong-head ablation is reported for the latent classifier on the eight unseen datasets. Because the central claim of domain-agnostic metric performance rests on correct routing to the NYU vs. KITTI metric bins module, the absence of these diagnostics leaves open the possibility that misrouting inflates the reported REL/RMSE numbers on ambiguous inputs.
§3.3 (metric bins module and latent classifier): the training objective and architecture details for the latent classifier are not fully specified (e.g., loss, number of classes, how it is trained jointly or separately). This makes it difficult to assess whether the routing is learned reliably or could be a post-hoc selection effect.
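
The routing diagnostics requested in the first comment are cheap to produce once routing decisions are logged. The sketch below assumes a hypothetical log of `(dataset_name, chosen_head)` pairs, one per test image; the function names and dataset labels are placeholders rather than anything taken from the paper or its code.

```python
from collections import Counter, defaultdict


def routing_table(records):
    """Count how often each head was chosen for each dataset.

    records: iterable of (dataset_name, chosen_head) pairs.
    """
    table = defaultdict(Counter)
    for dataset, head in records:
        table[dataset][head] += 1
    return table


def head_share(table, dataset, head):
    """Fraction of a dataset's images routed to the given head."""
    counts = table[dataset]
    total = sum(counts.values())
    return counts[head] / total if total else 0.0


# Hypothetical routing log for two placeholder datasets.
records = [
    ("indoor_set", "indoor"),
    ("indoor_set", "indoor"),
    ("indoor_set", "outdoor"),
    ("outdoor_set", "outdoor"),
]
table = routing_table(records)
indoor_share = head_share(table, "indoor_set", "indoor")
```

For an unseen dataset there is no ground-truth head label, so a table like this reports how routing is distributed rather than a strict accuracy. Pairing it with a forced-head evaluation would show how much error each routing choice costs.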

minor comments (2)

Figure 2 (architecture diagram): the flow from relative encoder through the latent classifier to the metric heads is not labeled with tensor dimensions or explicit routing logic, making the inference path harder to follow.
Table 1 (NYU results): the baseline comparisons should explicitly state whether the competing methods were also pre-trained on the same 12 relative-depth datasets or only on standard supervised splits.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and positive assessment of the significance of ZoeDepth. We address the two major comments point by point below. Both points identify areas where the manuscript can be strengthened with additional details and experiments, which we will incorporate in the revised version.

Referee: Experiments section (zero-shot evaluation tables): no confusion matrix, per-dataset routing accuracy, or forced-wrong-head ablation is reported for the latent classifier on the eight unseen datasets. Because the central claim of domain-agnostic metric performance rests on correct routing to the NYU vs. KITTI metric bins module, the absence of these diagnostics leaves open the possibility that misrouting inflates the reported REL/RMSE numbers on ambiguous inputs.

Authors: We agree that these diagnostics are important for validating the routing mechanism. In the revised manuscript we will add: (1) a confusion matrix of routing decisions across the eight unseen datasets, (2) per-dataset routing accuracy numbers, and (3) a forced-wrong-head ablation that reports the degradation in REL/RMSE when the model is deliberately routed to the incorrect metric bins module. These additions will directly address the concern that misrouting could be inflating the zero-shot numbers and will make the evidence for correct domain-agnostic routing explicit.
Revision: yes
Referee: §3.3 (metric bins module and latent classifier): the training objective and architecture details for the latent classifier are not fully specified (e.g., loss, number of classes, how it is trained jointly or separately). This makes it difficult to assess whether the routing is learned reliably or could be a post-hoc selection effect.

Authors: We apologize for the incomplete specification in §3.3. The latent classifier is a lightweight two-layer MLP with two output classes (NYU vs. KITTI domain). It is trained jointly with the metric heads using cross-entropy loss on the ground-truth domain labels during the fine-tuning stage; it is not trained separately or applied post-hoc. We will expand §3.3 with the exact architecture, loss function, number of classes, and joint training procedure so that readers can fully reproduce and assess the reliability of the learned routing.
Revision: yes
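
The classifier described in this simulated response can be sketched without any framework. The code below is a minimal illustration assuming a two-layer MLP with a ReLU hidden layer (the activation is an assumption; the response does not name one), two output classes, and a softmax cross-entropy loss on a ground-truth domain label. The weights, the input feature vector, and the function names are hypothetical.

```python
import math


def mlp_logits(x, w1, b1, w2, b2):
    """Two-layer MLP: linear -> ReLU -> linear. Returns one logit per class."""
    hidden = [
        max(0.0, sum(wi * xi for wi, xi in zip(row, x)) + b)
        for row, b in zip(w1, b1)
    ]
    return [
        sum(wi * hi for wi, hi in zip(row, hidden)) + b
        for row, b in zip(w2, b2)
    ]


def cross_entropy(logits, label):
    """Softmax cross-entropy for one example, computed stably via log-sum-exp."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return log_z - logits[label]


# Hypothetical latent feature and weights; class 0 = NYU head, class 1 = KITTI head.
feature = [1.0, -2.0]
w1, b1 = [[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0]
w2, b2 = [[2.0, 0.0], [0.0, 2.0]], [0.0, 0.0]

logits = mlp_logits(feature, w1, b1, w2, b2)
loss_if_nyu = cross_entropy(logits, 0)
loss_if_kitti = cross_entropy(logits, 1)
```

In joint training, a loss like this would be added to the depth loss for each fine-tuning image, with the label given by the dataset the image came from.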

Circularity Check

0 steps flagged

No circularity: empirical training and routing on standard benchmarks

The paper describes a standard supervised pipeline: pre-train a relative-depth backbone on 12 datasets, attach domain-specific metric bins modules, fine-tune on NYU and KITTI, and train a latent classifier to route inputs at inference. All performance numbers (REL, RMSE, zero-shot transfer) are obtained by direct evaluation on held-out test sets; no equation or claimed prediction is shown to equal a fitted parameter or self-citation by construction. The latent classifier is an ordinary learned component whose accuracy is measured on the same benchmarks, not presupposed by the reported metrics. The work is therefore self-contained against external data and does not reduce any central claim to its own inputs.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

This is an empirical deep-learning paper; it relies on standard assumptions that deep backbones learn transferable features from supervised depth labels and that a small classifier can separate indoor/outdoor domains from latent features.

free parameters (1)

number of bins in metric bins module
Chosen per domain head to discretize metric depth ranges; exact count not stated in abstract but affects scale accuracy.

axioms (1)

domain assumption: Standard supervised learning on depth labels produces generalizable features when pre-trained on diverse relative-depth datasets.
Invoked in the pre-training stage on 12 datasets.

Forward citations

Cited by 25 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

LAMP: Localization Aware Multi-camera People Tracking in Metric 3D World
cs.CV 2026-05 unverdicted novelty 7.0

LAMP tracks 3D human motion from moving multi-camera headsets by converting 2D detections to a unified metric 3D world frame via device localization and fitting with an end-to-end spatio-temporal transformer.
DP-SfM: Dual-Pixel Structure-from-Motion without Scale Ambiguity
cs.CV 2026-05 unverdicted novelty 7.0

Dual-pixel defocus blur enables absolute scale estimation in SfM without reference objects or calibration.
Towards Realistic and Consistent Orbital Video Generation via 3D Foundation Priors
cs.CV 2026-04 unverdicted novelty 7.0

A video generation approach conditions a base model with multi-scale 3D latent features and a cross-attention adapter to produce geometrically realistic and consistent orbital videos from one image.
LiftFormer: Lifting and Frame Theory Based Monocular Depth Estimation Using Depth and Edge Oriented Subspace Representation
cs.CV 2026-04 unverdicted novelty 7.0

LiftFormer transforms monocular depth prediction into depth-oriented geometric and edge-aware subspace representations via lifting and frame theory, achieving state-of-the-art results on standard datasets.
EndoVGGT: GNN-Enhanced Depth Estimation for Surgical 3D Reconstruction
cs.CV 2026-03 unverdicted novelty 7.0

EndoVGGT uses a dynamic DeGAT graph attention module to improve depth estimation and non-rigid 3D reconstruction in surgery, reporting 24.6% PSNR and 9.1% SSIM gains on SCARED with zero-shot generalization to new domains.
VGGT-360: Geometry-Consistent Zero-Shot Panoramic Depth Estimation
cs.CV 2026-03 unverdicted novelty 7.0

VGGT-360 delivers geometry-consistent zero-shot panoramic depth by converting panoramas into multi-view 3D reconstructions via VGGT models and three plug-and-play correction modules, then reprojecting the result.
3D-VLA: A 3D Vision-Language-Action Generative World Model
cs.CV 2024-03 unverdicted novelty 7.0

3D-VLA is a new embodied foundation model that uses a 3D LLM plus aligned diffusion models to generate future images and point clouds for improved reasoning and action planning in 3D environments.
Featurising Pixels from Dynamic 3D Scenes with Linear In-Context Learners
cs.CV 2026-04 unverdicted novelty 6.0

LILA learns temporally consistent semantic and geometric pixel features from uncurated videos via linear in-context learning on off-the-shelf depth and motion cues, yielding empirical gains on video object segmentatio...
Last-Layer-Centric Feature Recombination: Unleashing 3D Geometric Knowledge in DINOv3 for Monocular Depth Estimation
cs.CV 2026-04 unverdicted novelty 6.0

Layer analysis of DINOv3 shows non-uniform 3D geometric knowledge concentrated in deeper layers, enabling a last-layer-centric recombination module that improves monocular depth estimation accuracy to state-of-the-art levels.
SS3D: End2End Self-Supervised 3D from Web Videos
cs.CV 2026-04 unverdicted novelty 6.0

SS3D pretrains an end-to-end 3D estimator on filtered YouTube-8M videos via SfM self-supervision, achieving improved zero-shot transfer and fine-tuning over prior baselines.
SS3D: End2End Self-Supervised 3D from Web Videos
cs.CV 2026-04 unverdicted novelty 6.0

SS3D pretrains an end-to-end feed-forward 3D estimator on filtered YouTube-8M videos via SfM self-supervision, MVS filtering, and expert distillation, delivering stronger zero-shot transfer and fine-tuning than prior ...
In Depth We Trust: Reliable Monocular Depth Supervision for Gaussian Splatting
cs.CV 2026-04 unverdicted novelty 6.0

A selective regularization framework lets scale-ambiguous monocular depth priors improve Gaussian Splatting geometry and rendering by isolating and supervising only ill-posed regions.
Stepper: Stepwise Immersive Scene Generation with Multiview Panoramas
cs.CV 2026-03 unverdicted novelty 6.0

Stepper uses stepwise panoramic expansion with a multi-view 360-degree diffusion model and geometry reconstruction to produce high-fidelity, structurally consistent immersive 3D scenes from text.
Depth Anything V2
cs.CV 2024-06 unverdicted novelty 6.0

Depth Anything V2 delivers finer, more robust monocular depth predictions by replacing real labeled images with synthetic data, scaling the teacher model, and using large-scale pseudo-labeled real images for student training.
Pose-Aware Diffusion for 3D Generation
cs.CV 2026-05 unverdicted novelty 5.0

PAD synthesizes 3D geometry in observation space via depth unprojection as anchor to eliminate pose ambiguity in image-to-3D generation.
Learning from the Unseen: Generative Data Augmentation for Geometric-Semantic Accident Anticipation
cs.CV 2026-04 unverdicted novelty 5.0

A generative video synthesis pipeline paired with a semantic graph neural network yields gains in accident anticipation accuracy and lead time on driving datasets, accompanied by a new benchmark release.
SS3D: End2End Self-Supervised 3D from Web Videos
cs.CV 2026-04 unverdicted novelty 5.0

SS3D scales SfM-based self-supervision to ~100M frames from YouTube-8M using a multi-view signal proxy for filtering and a two-stage training schedule, achieving strong zero-shot transfer and better fine-tuning than p...
Enhancing Hazy Wildlife Imagery: AnimalHaze3k and IncepDehazeGan
cs.CV 2026-04 conditional novelty 5.0

A new wildlife-specific hazy image dataset and IncepDehazeGan model that reports state-of-the-art dehazing metrics and more than doubles downstream animal detection performance.
Hierarchical Awareness Adapters with Hybrid Pyramid Feature Fusion for Dense Depth Prediction
cs.CV 2026-04 unverdicted novelty 5.0

A multilevel perceptual CRF model using Swin Transformer, HPF fusion, HA adapters, and dynamic scaling attention achieves state-of-the-art monocular depth estimation on NYU Depth v2, KITTI, and MatterPort3D with reduc...
MoGe-2: Accurate Monocular Geometry with Metric Scale and Sharp Details
cs.CV 2025-07 unverdicted novelty 5.0

MoGe-2 recovers metric-scale 3D point maps with fine details from single images via data refinement and extension of affine-invariant predictions.
SpatialVLA: Exploring Spatial Representations for Visual-Language-Action Model
cs.RO 2025-01 unverdicted novelty 5.0

SpatialVLA adds 3D-aware position encoding and adaptive discretized action grids to visual-language-action models, enabling strong zero-shot performance and fine-tuning on new robot setups after pre-training on 1.1 mi...
AtteConDA: Attention-Based Conflict Suppression in Multi-Condition Diffusion Models and Synthetic Data Augmentation
cs.CV 2026-05 unverdicted novelty 4.0

AtteConDA adds attention-based conflict suppression to multi-condition diffusion models so that generated driving-scene images retain richer structural cues from the original annotations.
ELoG-GS: Dual-Branch Gaussian Splatting with Luminance-Guided Enhancement for Extreme Low-light 3D Reconstruction
cs.CV 2026-04 unverdicted novelty 4.0

ELoG-GS integrates geometry-aware initialization and luminance-guided photometric adaptation into Gaussian Splatting, achieving PSNR 18.66 and SSIM 0.69 on the NTIRE 2026 Track 1 low-light 3D reconstruction benchmark.
Step1X-Edit: A Practical Framework for General Image Editing
cs.CV 2025-04 unverdicted novelty 4.0

Step1X-Edit integrates a multimodal LLM with a diffusion decoder, trained on a custom high-quality dataset, to deliver image editing performance that surpasses open-source baselines and approaches proprietary models o...
ELoG-GS: Dual-Branch Gaussian Splatting with Luminance-Guided Enhancement for Extreme Low-light 3D Reconstruction
cs.CV 2026-04 unverdicted novelty 3.0

ELoG-GS combines learning-based initialization and luminance-guided enhancement inside Gaussian Splatting to raise PSNR to 18.66 and SSIM to 0.69 on the NTIRE 2026 low-light 3D challenge.

Reference graph

Works this paper leans on

52 extracted references · 52 canonical work pages · cited by 22 Pith papers · 1 internal anchor

[1]

Stegun., ia (1972)

Miton Abramowitz. Stegun., ia (1972). handbook of mathe- matical functions. Formulas, Graphs and Mathematical Ta- bles, 2002. 4

work page 1972
[2]

Attention attention everywhere: Monocular depth prediction with skip attention

Ashutosh Agarwal and Chetan Arora. Attention attention everywhere: Monocular depth prediction with skip attention. arXiv preprint arXiv:2210.09071, 2022. 3, 4

work page arXiv 2022
[3]

BEiT: BERT Pre-Training of Image Transformers

Hangbo Bao, Li Dong, and Furu Wei. Beit: BERT pre- training of image transformers. CoRR, abs/2106.08254,

work page internal anchor Pith review Pith/arXiv arXiv
[4]

Unimodal prob- ability distributions for deep ordinal classiﬁcation

Christopher Beckham and Christopher Pal. Unimodal prob- ability distributions for deep ordinal classiﬁcation. In Doina Precup and Yee Whye Teh, editors, Proceedings of the 34th International Conference on Machine Learning , volume 70 of Proceedings of Machine Learning Research , pages 411–

work page
[5]

PMLR, 06–11 Aug 2017. 4, 9

work page 2017
[6]

Adabins: Depth estimation using adaptive bins

Shariq Farooq Bhat, Ibraheem Alhashim, and Peter Wonka. Adabins: Depth estimation using adaptive bins. In Proceed- ings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4009–4018, 2021. 1, 2, 3, 4, 6, 7, 8, 13, 14, 15, 16, 20

work page 2021
[7]

Localbins: Improving depth estimation by learning local dis- tributions

Shariq Farooq Bhat, Ibraheem Alhashim, and Peter Wonka. Localbins: Improving depth estimation by learning local dis- tributions. In European Conference on Computer Vision , pages 480–496. Springer, 2022. 1, 2, 3, 4, 6, 7, 8, 9, 13, 14, 15, 16, 20

work page 2022
[8]

Vir- tual kitti 2, 2020

Yohann Cabon, Naila Murray, and Martin Humenberger. Vir- tual kitti 2, 2020. 5, 12, 13, 15, 17

work page 2020
[9]

Structure- aware residual pyramid network for monocular depth esti- mation

Xiaotian Chen, Xuejin Chen, and Zheng-Jun Zha. Structure- aware residual pyramid network for monocular depth esti- mation. In Proceedings of the Twenty-Eighth International Joint Conference on Artiﬁcial Intelligence, IJCAI-19 , pages 694–700. International Joint Conferences on Artiﬁcial Intel- ligence Organization, 7 2019. 6, 20

work page 2019
[10]

Depth map prediction from a single image using a multi-scale deep net- work

David Eigen, Christian Puhrsch, and Rob Fergus. Depth map prediction from a single image using a multi-scale deep net- work. In NIPS, 2014. 6, 20

work page 2014
[11]

A review of sparse expert models in deep learning

William Fedus, Jeff Dean, and Barret Zoph. A review of sparse expert models in deep learning. arXiv preprint arXiv:2209.01667, 2022. 5

work page arXiv 2022
[12]

Deep ordinal regression net- work for monocular depth estimation

Huan Fu, Mingming Gong, Chaohui Wang, Nematollah Bat- manghelich, and Dacheng Tao. Deep ordinal regression net- work for monocular depth estimation. 2018 IEEE/CVF Con- ference on Computer Vision and Pattern Recognition, pages 2002–2011, 2018. 3, 6, 7, 20

work page 2018
[13]

3d packing for self-supervised monocular depth estimation

Vitor Guizilini, Rares Ambrus, Sudeep Pillai, Allan Raven- tos, and Adrien Gaidon. 3d packing for self-supervised monocular depth estimation. In IEEE Conference on Com- puter Vision and Pattern Recognition (CVPR), 2020. 5, 12, 13, 15, 18

work page 2020
[14]

Detail pre- serving depth estimation from a single image using attention guided networks

Zhixiang Hao, Yu Li, Shaodi You, and Feng Lu. Detail pre- serving depth estimation from a single image using attention guided networks. 2018 International Conference on 3D Vi- sion (3DV), pages 304–313, 2018. 6, 20

work page 2018
[15]

Revisiting single image depth estimation: Toward higher resolution maps with accurate object boundaries

Junjie Hu, Mete Ozay, Yan Zhang, and Takayuki Okatani. Revisiting single image depth estimation: Toward higher resolution maps with accurate object boundaries. 2019 IEEE Winter Conference on Applications of Computer Vision (WACV), pages 1043–1051, 2018. 6, 20

work page 2019
[16]

The apolloscape open dataset for autonomous driving and its application

Xinyu Huang, Peng Wang, Xinjing Cheng, Dingfu Zhou, Qichuan Geng, and Ruigang Yang. The apolloscape open dataset for autonomous driving and its application. IEEE transactions on pattern analysis and machine intelligence , 42(10):2702–2719, 2019. 5, 13

work page 2019
[17]

Depth map decomposition for monocular depth estimation

Jinyoung Jun, Jae-Han Lee, Chul Lee, and Chang-Su Kim. Depth map decomposition for monocular depth estimation. arXiv preprint arXiv:2208.10762, 2022. 2, 3, 6

work page arXiv 2022
[18]

Deep monocular depth estimation via in- tegration of global and local predictions

Youngjung Kim, Hyungjoo Jung, Dongbo Min, and Kwanghoon Sohn. Deep monocular depth estimation via in- tegration of global and local predictions. IEEE transactions on Image Processing, 27(8):4131–4144, 2018. 5, 12, 13, 16, 18

work page 2018
[19]

Evaluation of cnn-based single-image depth estimation methods

Tobias Koch, Lukas Liebel, Friedrich Fraundorfer, and Marco K¨orner. Evaluation of cnn-based single-image depth estimation methods. In Proceedings ECCV 2018 Workshops,

work page 2018
[20]

Deeper depth prediction with fully convolutional residual networks

Iro Laina, Christian Rupprecht, Vasileios Belagiannis, Fed- erico Tombari, and Nassir Navab. Deeper depth prediction with fully convolutional residual networks. 2016 Fourth In- ternational Conference on 3D Vision (3DV), pages 239–248,

work page 2016
[21]

From big to small: Multi-scale local planar guidance for monocular depth estimation.arXiv preprint arXiv:1907.10326, 2019

Jin Han Lee, Myung-Kyu Han, Dong Wook Ko, and Il Hong Suh. From big to small: Multi-scale local planar guidance for monocular depth estimation. arXiv preprint arXiv:1907.10326, 2019. 6, 8, 13, 14, 15, 16, 20

work page arXiv 1907
[22]

Monocular depth es- timation using relative depth maps

Jae-Han Lee and Chang-Su Kim. Monocular depth es- timation using relative depth maps. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9729–9738, 2019. 2

work page 2019
[23]

Depth- assisted real-time 3d object detection for augmented reality

Wonwoo Lee, Nohyoung Park, and Woontack Woo. Depth- assisted real-time 3d object detection for augmented reality. ICAT’11, 2:126–132, 2011. 6, 20

work page 2011
[24]

Monocular depth es- timation with hierarchical fusion of dilated cnns and soft- weighted-sum inference

Bo Li, Yuchao Dai, and Mingyi He. Monocular depth es- timation with hierarchical fusion of dilated cnns and soft- weighted-sum inference. Pattern Recognition, 83:328–339,

work page
[25]

Deep attention-based classiﬁcation network for robust depth prediction

Ruibo Li, Ke Xian, Chunhua Shen, Zhiguo Cao, Hao Lu, and Lingxiao Hang. Deep attention-based classiﬁcation network for robust depth prediction. In C.V . Jawahar, Hongdong Li, Greg Mori, and Konrad Schindler, editors, Computer Vision – ACCV 2018, pages 663–678, Cham, 2019. Springer Inter- national Publishing. 3

work page 2018
[26]

Megadepth: Learning single- view depth prediction from internet photos

Zhengqi Li and Noah Snavely. Megadepth: Learning single- view depth prediction from internet photos. In Computer Vision and Pattern Recognition (CVPR), 2018. 5, 13

work page 2018
[27]

Binsformer: Revisiting adaptive bins for monocular depth estimation

Zhenyu Li, Xuyang Wang, Xianming Liu, and Junjun Jiang. Binsformer: Revisiting adaptive bins for monocular depth estimation. arXiv preprint arXiv:2204.00987, 2022. 1, 2, 3, 4

work page arXiv 2022
[28]

Swin transformer v2: Scaling up capacity and resolution

Ze Liu, Han Hu, Yutong Lin, Zhuliang Yao, Zhenda Xie, Yixuan Wei, Jia Ning, Yue Cao, Zheng Zhang, Li Dong, et al. Swin transformer v2: Scaling up capacity and resolution. In Proceedings of the IEEE/CVF Conference on Computer Vi- sion and Pattern Recognition, pages 12009–12019, 2022. 8, 12 10

work page 2022
[29]

Swin transformer: Hierarchical vision transformer using shifted windows

Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 10012–10022, 2021. 8

work page 2021
[30]

Object scene ﬂow for au- tonomous vehicles

Moritz Menze and Andreas Geiger. Object scene ﬂow for au- tonomous vehicles. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) , June

work page
[31]

Single image depth estimation: An overview

Alican Mertan, Damien Jade Duff, and Gozde Unal. Single image depth estimation: An overview. Digital Signal Pro- cessing, 123:103441, 2022. 1, 2

work page 2022
[32]

Sharpnet: Fast and accurate recovery of occluding contours in monocular depth estimation

Michael Ramamonjisoa and Vincent Lepetit. Sharpnet: Fast and accurate recovery of occluding contours in monocular depth estimation. In Proceedings of the IEEE/CVF Interna- tional Conference on Computer Vision (ICCV) Workshops , Oct 2019. 6, 20

work page 2019
[33]

Vi- sion transformers for dense prediction

Ren ´e Ranftl, Alexey Bochkovskiy, and Vladlen Koltun. Vi- sion transformers for dense prediction. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 12179–12188, October 2021. 2, 3, 5

work page 2021
[34]

Towards robust monocular depth estimation: Mixing datasets for zero-shot cross-dataset transfer

Ren ´e Ranftl, Katrin Lasinger, David Hafner, Konrad Schindler, and Vladlen Koltun. Towards robust monocular depth estimation: Mixing datasets for zero-shot cross-dataset transfer. IEEE Transactions on Pattern Analysis and Ma- chine Intelligence (TPAMI), 2020. 1, 2, 3, 4, 5, 12, 13

work page 2020
[35]

Deep robust single image depth estimation neural network using scene understanding

Haoyu Ren, Mostafa El-Khamy, and Jungwon Lee. Deep robust single image depth estimation neural network using scene understanding. In CVPR Workshops, 2019. 3

work page 2019
[36]

Susskind

Mike Roberts, Jason Ramapuram, Anurag Ranjan, Atulit Kumar, Miguel Angel Bautista, Nathan Paczan, Russ Webb, and Joshua M. Susskind. Hypersim: A photorealistic syn- thetic dataset for holistic indoor scene understanding. In International Conference on Computer Vision (ICCV) 2021,

work page 2021
[37]

Progress and proposals: A case study of monocular depth estimation

Khalil Sarwari, Forrest Laine, and Claire Tomlin. Progress and proposals: A case study of monocular depth estimation. Master’s thesis, EECS Department, University of California, Berkeley, May 2021. 3, 7

work page 2021
[38]

Indoor segmentation and support inference from rgbd images

Nathan Silberman, Derek Hoiem, Pushmeet Kohli, and Rob Fergus. Indoor segmentation and support inference from rgbd images. In Computer Vision – ECCV 2012, pages 746– 760, Berlin, Heidelberg, 2012. Springer Berlin Heidelberg. 2, 5, 6, 12, 13

work page 2012
[39]

S. Song, S. P. Lichtenberg, and J. Xiao. Sun rgb-d: A rgb-d scene understanding benchmark suite. In 2015 IEEE Confer- ence on Computer Vision and Pattern Recognition (CVPR) , pages 567–576, 2015. 5, 12, 13, 14

work page 2015
[40]

Mingxing Tan and Quoc V . Le. Efﬁcientnet: Rethinking model scaling for convolutional neural networks. In Ka- malika Chaudhuri and Ruslan Salakhutdinov, editors, Pro- ceedings of the 36th International Conference on Machine Learning, ICML 2019, 9-15 June 2019, Long Beach, Cali- fornia, USA, volume 97 ofProceedings of Machine Learning Research, pages 6105–...

[41]

Igor Vasiljevic, Nick Kolkin, Shanyi Zhang, Ruotian Luo, Haochen Wang, Falcon Z. Dai, Andrea F. Daniele, Mohammadreza Mostajabi, Steven Basart, Matthew R. Walter, and Gregory Shakhnarovich. DIODE: A Dense Indoor and Outdoor DEpth Dataset. CoRR, abs/1908.00463, 2019. 5, 12, 13, 14, 16, 19

[42]

Chaoyang Wang, Simon Lucey, Federico Perazzi, and Oliver Wang. Web stereo video supervision for depth prediction from dynamic scenes. In 2019 International Conference on 3D Vision (3DV), pages 348–357. IEEE, 2019. 5, 13

[43]

Qiang Wang, Shizhen Zheng, Qingsong Yan, Fei Deng, Kaiyong Zhao, and Xiaowen Chu. Irs: A large naturalistic indoor robotics stereo dataset to train deep models for disparity and surface normal estimation. arXiv preprint arXiv:1912.09678, 2019. 5, 13

[44]

Wenshan Wang, Delong Zhu, Xiangwei Wang, Yaoyu Hu, Yuheng Qiu, Chen Wang, Yafei Hu, Ashish Kapoor, and Sebastian Scherer. Tartanair: A dataset to push the limits of visual slam. In 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 4909–4916. IEEE, 2020. 5, 13

[45]

Ross Wightman. Pytorch image models. https://github.com/rwightman/pytorch-image-models, 2019. 5, 12

[46]

Ke Xian, Chunhua Shen, Zhiguo Cao, Hao Lu, Yang Xiao, Ruibo Li, and Zhenbo Luo. Monocular relative depth perception with web stereo data supervision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 311–320, 2018. 5, 13

[47]

Ke Xian, Jianming Zhang, Oliver Wang, Long Mai, Zhe Lin, and Zhiguo Cao. Structure-guided ranking loss for single image depth prediction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 611–620, 2020. 5, 13

[48]

Yao Yao, Zixin Luo, Shiwei Li, Jingyang Zhang, Yufan Ren, Lei Zhou, Tian Fang, and Long Quan. Blendedmvs: A large-scale dataset for generalized multi-view stereo networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1790–1799, 2020. 5, 13

[49]

Wei Yin, Yifan Liu, Chunhua Shen, and Youliang Yan. Enforcing geometric constraints of virtual normal for depth prediction. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), October 2019. 6, 20

[50]

Wei Yin, Jianming Zhang, Oliver Wang, Simon Niklaus, Long Mai, Simon Chen, and Chunhua Shen. Learning to recover 3d scene shape from a single image. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 204–213, 2021. 3

[51]

Weihao Yuan, Xiaodong Gu, Zuozhuo Dai, Siyu Zhu, and Ping Tan. New crfs: Neural window fully-connected crfs for monocular depth estimation. arXiv preprint arXiv:2203.01502, 2022. 1, 2, 6, 7, 8, 13, 14, 15, 16, 17, 18, 19, 20

A. Appendix

A.1. Datasets Overview

We begin by providing a detailed overview of the properties of the datasets used for metric ...


based on shifted windows. When using the base and tiny variants Swin2-B and Swin2-T, the number of parameters of ZoeDepth drops to 102M and 42M, respectively. We report the results of all the aforementioned models evaluated on NYU Depth V2 in Table 18.

Table columns: Dataset, Domain, Type, Seen in Training?, # Train Samples, # Eval Samples, Eval Depth [m] (Min, Max), Crop ...
