ZoeDepth: Zero-shot Transfer by Combining Relative and Metric Depth
Pith reviewed 2026-05-14 22:09 UTC · model grok-4.3
The pith
A model pre-trained on relative depths from twelve datasets and fine-tuned on metric depth achieves strong zero-shot generalization while preserving scale.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Pre-training a depth network on relative depth from twelve datasets, then fine-tuning a separate metric bins module for each of NYU Depth v2 and KITTI and routing each input through a latent classifier, yields the first model that trains jointly on indoor and outdoor data without a significant performance drop and transfers metric depth zero-shot to eight unseen datasets.
What carries the argument
The metric bins module: a lightweight head per domain that learns to adjust the centers and widths of depth bins to produce metric-scale outputs. A latent classifier selects the head by identifying the appropriate domain from image features.
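The routing-plus-bins idea can be illustrated with a small sketch. This is not the authors' implementation: the function names, the two-domain setup, the toy bin centers, and the probability-weighted depth readout are assumptions made for illustration. The real module also adjusts bin centers per image, whereas the centers here are fixed per domain.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def route(domain_logits, heads):
    """Return the name of the head with the highest classifier score."""
    names = list(heads)  # order must match the order of domain_logits
    best = max(range(len(names)), key=lambda i: domain_logits[i])
    return names[best]

def depth_from_bins(bin_logits, bin_centers):
    """Depth for one pixel as the probability-weighted sum of bin centers."""
    probs = softmax(bin_logits)
    return sum(p * c for p, c in zip(probs, bin_centers))

# Hypothetical per-domain bin centers in metres (toy values, not from the paper).
HEADS = {
    "indoor": [0.5, 2.0, 5.0, 10.0],
    "outdoor": [2.0, 10.0, 40.0, 80.0],
}

def predict_pixel(domain_logits, bin_logits):
    """Route to a head, then read out metric depth from that head's bins."""
    head = route(domain_logits, HEADS)
    return head, depth_from_bins(bin_logits, HEADS[head])
```

With identical bin logits, the same pixel yields a different metric depth depending on which head the router picks, which is why routing errors matter for metric scale.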
If this is right
- Even without pre-training, the approach improves on the state of the art on the NYU Depth v2 indoor dataset.
- Pre-training on twelve datasets and then fine-tuning on NYU Depth v2 improves on the state of the art by a total of 21 percent in relative absolute error (REL).
- The model can be trained jointly on NYU and KITTI with no significant performance drop.
- Zero-shot transfer reaches eight previously unseen indoor and outdoor datasets.
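For readers unfamiliar with the metric, the sketch below computes absolute relative error in its usual form, mean(|pred - gt| / gt) over pixels with valid ground truth, and shows how a percentage improvement over a baseline is obtained. The function names and all numbers are illustrative assumptions, not values from the paper.

```python
def abs_rel(pred, gt):
    """Mean absolute relative error over pixels with valid (positive) depth."""
    pairs = [(p, g) for p, g in zip(pred, gt) if g > 0]
    if not pairs:
        raise ValueError("no valid ground-truth depths")
    return sum(abs(p - g) / g for p, g in pairs) / len(pairs)

def relative_improvement(baseline, new):
    """Fractional reduction of an error metric relative to a baseline."""
    return (baseline - new) / baseline

# Toy depths in metres; the last pixel has no valid ground truth and is skipped.
pred = [1.1, 2.0, 3.6, 0.0]
gt = [1.0, 2.0, 4.0, 0.0]
rel = abs_rel(pred, gt)

# Made-up error values: going from 0.100 to 0.079 is a 21 percent reduction.
gain = relative_improvement(0.100, 0.079)
```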
Where Pith is reading between the lines
- The routing mechanism may make it easy to add new domains by training only a new metric head and updating the classifier.
- This separation of relative and metric learning could apply to other scale-sensitive tasks such as surface normal estimation or camera pose estimation.
- Deployments that mix indoor and outdoor scenes, such as urban autonomous driving combined with indoor navigation, could benefit from the domain-agnostic routing.
Load-bearing premise
The latent classifier must correctly identify which metric head to use for each input image, even when the image comes from an unseen domain or lies between domains.
What would settle it
A drop in accuracy on the eight unseen test datasets or on images that are hard to classify as indoor or outdoor would indicate the routing step fails to preserve the claimed performance.
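One way to probe this is to compare the error under the router's choice with the error when the other head is forced. The sketch below assumes two heads and per-sample predictions from both. The data layout, head names, and numbers are hypothetical.

```python
def mean_abs_rel(pairs):
    """Mean of |pred - gt| / gt over (pred, gt) pairs."""
    return sum(abs(p - g) / g for p, g in pairs) / len(pairs)

def routed_vs_forced_wrong(samples, heads=("indoor", "outdoor")):
    """Return (error with the routed head, error with the other head forced)."""
    routed, wrong = [], []
    for s in samples:
        other = heads[1] if s["routed"] == heads[0] else heads[0]
        routed.append((s["pred"][s["routed"]], s["gt"]))
        wrong.append((s["pred"][other], s["gt"]))
    return mean_abs_rel(routed), mean_abs_rel(wrong)

# Toy samples: ground-truth depth, the head the router chose, and both heads' outputs.
samples = [
    {"gt": 2.0, "routed": "indoor", "pred": {"indoor": 2.2, "outdoor": 5.0}},
    {"gt": 20.0, "routed": "outdoor", "pred": {"indoor": 9.0, "outdoor": 18.0}},
]
rel_routed, rel_wrong = routed_vs_forced_wrong(samples)
```

A large gap between the two numbers suggests the heads are genuinely domain-specific, so routing accuracy on ambiguous images directly bounds metric accuracy.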
Original abstract
This paper tackles the problem of depth estimation from a single image. Existing work either focuses on generalization performance disregarding metric scale, i.e. relative depth estimation, or state-of-the-art results on specific datasets, i.e. metric depth estimation. We propose the first approach that combines both worlds, leading to a model with excellent generalization performance while maintaining metric scale. Our flagship model, ZoeD-M12-NK, is pre-trained on 12 datasets using relative depth and fine-tuned on two datasets using metric depth. We use a lightweight head with a novel bin adjustment design called metric bins module for each domain. During inference, each input image is automatically routed to the appropriate head using a latent classifier. Our framework admits multiple configurations depending on the datasets used for relative depth pre-training and metric fine-tuning. Without pre-training, we can already significantly improve the state of the art (SOTA) on the NYU Depth v2 indoor dataset. Pre-training on twelve datasets and fine-tuning on the NYU Depth v2 indoor dataset, we can further improve SOTA for a total of 21% in terms of relative absolute error (REL). Finally, ZoeD-M12-NK is the first model that can jointly train on multiple datasets (NYU Depth v2 and KITTI) without a significant drop in performance and achieve unprecedented zero-shot generalization performance to eight unseen datasets from both indoor and outdoor domains. The code and pre-trained models are publicly available at https://github.com/isl-org/ZoeDepth .
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces ZoeDepth, a monocular depth estimation framework that pre-trains on 12 datasets for relative depth and fine-tunes on NYU Depth v2 and KITTI using domain-specific metric bins modules. A latent classifier routes each input image to the appropriate metric head at inference time. The flagship ZoeD-M12-NK model reports a 21% REL improvement on NYU, enables joint training on NYU+KITTI without significant performance drop, and achieves strong zero-shot generalization to eight unseen indoor and outdoor datasets.
Significance. If the empirical results hold under the routing mechanism, the work provides a practical bridge between relative-depth generalization and metric-scale accuracy, with the first demonstrated joint multi-domain metric training and broad zero-shot transfer. Public code release aids verification of the reported gains.
major comments (2)
- Experiments section (zero-shot evaluation tables): no confusion matrix, per-dataset routing accuracy, or forced-wrong-head ablation is reported for the latent classifier on the eight unseen datasets. Because the central claim of domain-agnostic metric performance rests on correct routing to the NYU vs. KITTI metric bins module, the absence of these diagnostics leaves open the possibility that misrouting inflates the reported REL/RMSE numbers on ambiguous inputs.
- §3.3 (metric bins module and latent classifier): the training objective and architecture details for the latent classifier are not fully specified (e.g., loss, number of classes, how it is trained jointly or separately). This makes it difficult to assess whether the routing is learned reliably or could be a post-hoc selection effect.
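The routing diagnostics requested in the first major comment could be tabulated along these lines. This is a hedged sketch: the dataset names are placeholders, and the "expected" head for an unseen dataset is assumed to follow its indoor/outdoor domain (NYU head for indoor, KITTI head for outdoor), which the paper may define differently.

```python
from collections import Counter, defaultdict

def routing_table(records):
    """Count chosen heads per dataset.

    records: iterable of (dataset, expected_head, chosen_head) tuples.
    """
    table = defaultdict(Counter)
    for dataset, _expected, chosen in records:
        table[dataset][chosen] += 1
    return {d: dict(c) for d, c in table.items()}

def routing_accuracy(records):
    """Fraction of images per dataset routed to the expected head."""
    hits, totals = Counter(), Counter()
    for dataset, expected, chosen in records:
        totals[dataset] += 1
        hits[dataset] += int(expected == chosen)
    return {d: hits[d] / totals[d] for d in totals}

# Placeholder records for two hypothetical unseen datasets.
records = [
    ("indoor_A", "nyu", "nyu"),
    ("indoor_A", "nyu", "nyu"),
    ("indoor_A", "nyu", "kitti"),
    ("outdoor_B", "kitti", "kitti"),
]
table = routing_table(records)
accuracy = routing_accuracy(records)
```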
minor comments (2)
- Figure 2 (architecture diagram): the flow from relative encoder through the latent classifier to the metric heads is not labeled with tensor dimensions or explicit routing logic, making the inference path harder to follow.
- Table 1 (NYU results): the baseline comparisons should explicitly state whether the competing methods were also pre-trained on the same 12 relative-depth datasets or only on standard supervised splits.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback and positive assessment of the significance of ZoeDepth. We address the two major comments point by point below. Both points identify areas where the manuscript can be strengthened with additional details and experiments, which we will incorporate in the revised version.
- Referee: Experiments section (zero-shot evaluation tables): no confusion matrix, per-dataset routing accuracy, or forced-wrong-head ablation is reported for the latent classifier on the eight unseen datasets. Because the central claim of domain-agnostic metric performance rests on correct routing to the NYU vs. KITTI metric bins module, the absence of these diagnostics leaves open the possibility that misrouting inflates the reported REL/RMSE numbers on ambiguous inputs.
  Authors: We agree that these diagnostics are important for validating the routing mechanism. In the revised manuscript we will add: (1) a confusion matrix of routing decisions across the eight unseen datasets, (2) per-dataset routing accuracy numbers, and (3) a forced-wrong-head ablation that reports the degradation in REL/RMSE when the model is deliberately routed to the incorrect metric bins module. These additions will directly address the concern that misrouting could be inflating the zero-shot numbers and will make the evidence for correct domain-agnostic routing explicit.
  Revision: yes
- Referee: §3.3 (metric bins module and latent classifier): the training objective and architecture details for the latent classifier are not fully specified (e.g., loss, number of classes, how it is trained jointly or separately). This makes it difficult to assess whether the routing is learned reliably or could be a post-hoc selection effect.
  Authors: We apologize for the incomplete specification in §3.3. The latent classifier is a lightweight two-layer MLP with two output classes (NYU vs. KITTI domain). It is trained jointly with the metric heads using cross-entropy loss on the ground-truth domain labels during the fine-tuning stage; it is not trained separately or applied post-hoc. We will expand §3.3 with the exact architecture, loss function, number of classes, and joint training procedure so that readers can fully reproduce and assess the reliability of the learned routing.
  Revision: yes
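The classifier described in this response can be sketched as a forward pass plus a cross-entropy loss. The two-layer MLP, the two classes, and the loss follow the rebuttal text. The feature size, hidden width, initialization, and function names are assumptions, and no training loop is shown.

```python
import math
import random

def linear(x, weights, bias):
    """Dense layer: one output per weight row."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]

def relu(v):
    return [max(0.0, a) for a in v]

def log_softmax(logits):
    """Numerically stable log-softmax."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(a - m) for a in logits))
    return [a - lse for a in logits]

def init_layer(n_out, n_in, rng):
    """Small random weights and zero biases (an arbitrary choice for the sketch)."""
    w = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return w, [0.0] * n_out

def classifier_logits(features, params):
    """Two-layer MLP: linear, ReLU, linear."""
    (w1, b1), (w2, b2) = params
    return linear(relu(linear(features, w1, b1)), w2, b2)

def cross_entropy(logits, label):
    """Negative log-probability of the ground-truth domain label."""
    return -log_softmax(logits)[label]

rng = random.Random(0)
params = (init_layer(8, 4, rng), init_layer(2, 8, rng))  # 4 features, 8 hidden, 2 classes
features = [0.2, -0.1, 0.5, 0.3]  # stand-in for pooled image features
logits = classifier_logits(features, params)
loss = cross_entropy(logits, label=0)  # 0 = NYU, 1 = KITTI
```

In joint training as described, this loss would be added to the depth loss of the metric heads. How the two terms are weighted is not stated in the response.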
Circularity Check
No circularity: empirical training and routing on standard benchmarks
The paper describes a standard supervised pipeline: pre-train a relative-depth backbone on 12 datasets, attach domain-specific metric bins modules, fine-tune on NYU and KITTI, and train a latent classifier to route inputs at inference. All performance numbers (REL, RMSE, zero-shot transfer) are obtained by direct evaluation on held-out test sets; no equation or claimed prediction is shown to equal a fitted parameter or self-citation by construction. The latent classifier is an ordinary learned component whose accuracy is measured on the same benchmarks, not presupposed by the reported metrics. The work is therefore self-contained against external data and does not reduce any central claim to its own inputs.
Axiom & Free-Parameter Ledger
free parameters (1)
- number of bins in metric bins module
axioms (1)
- domain assumption: Standard supervised learning on depth labels produces generalizable features when pre-trained on diverse relative-depth datasets.
Forward citations
Cited by 25 Pith papers
- LAMP: Localization Aware Multi-camera People Tracking in Metric 3D World
  LAMP tracks 3D human motion from moving multi-camera headsets by converting 2D detections to a unified metric 3D world frame via device localization and fitting with an end-to-end spatio-temporal transformer.
- DP-SfM: Dual-Pixel Structure-from-Motion without Scale Ambiguity
  Dual-pixel defocus blur enables absolute scale estimation in SfM without reference objects or calibration.
- Towards Realistic and Consistent Orbital Video Generation via 3D Foundation Priors
  A video generation approach conditions a base model with multi-scale 3D latent features and a cross-attention adapter to produce geometrically realistic and consistent orbital videos from one image.
- LiftFormer: Lifting and Frame Theory Based Monocular Depth Estimation Using Depth and Edge Oriented Subspace Representation
  LiftFormer transforms monocular depth prediction into depth-oriented geometric and edge-aware subspace representations via lifting and frame theory, achieving state-of-the-art results on standard datasets.
- EndoVGGT: GNN-Enhanced Depth Estimation for Surgical 3D Reconstruction
  EndoVGGT uses a dynamic DeGAT graph attention module to improve depth estimation and non-rigid 3D reconstruction in surgery, reporting 24.6% PSNR and 9.1% SSIM gains on SCARED with zero-shot generalization to new domains.
- VGGT-360: Geometry-Consistent Zero-Shot Panoramic Depth Estimation
  VGGT-360 delivers geometry-consistent zero-shot panoramic depth by converting panoramas into multi-view 3D reconstructions via VGGT models and three plug-and-play correction modules, then reprojecting the result.
- 3D-VLA: A 3D Vision-Language-Action Generative World Model
  3D-VLA is a new embodied foundation model that uses a 3D LLM plus aligned diffusion models to generate future images and point clouds for improved reasoning and action planning in 3D environments.
- Featurising Pixels from Dynamic 3D Scenes with Linear In-Context Learners
  LILA learns temporally consistent semantic and geometric pixel features from uncurated videos via linear in-context learning on off-the-shelf depth and motion cues, yielding empirical gains on video object segmentatio...
- Last-Layer-Centric Feature Recombination: Unleashing 3D Geometric Knowledge in DINOv3 for Monocular Depth Estimation
  Layer analysis of DINOv3 shows non-uniform 3D geometric knowledge concentrated in deeper layers, enabling a last-layer-centric recombination module that improves monocular depth estimation accuracy to state-of-the-art levels.
- SS3D: End2End Self-Supervised 3D from Web Videos
  SS3D pretrains an end-to-end 3D estimator on filtered YouTube-8M videos via SfM self-supervision, achieving improved zero-shot transfer and fine-tuning over prior baselines.
- SS3D: End2End Self-Supervised 3D from Web Videos
  SS3D pretrains an end-to-end feed-forward 3D estimator on filtered YouTube-8M videos via SfM self-supervision, MVS filtering, and expert distillation, delivering stronger zero-shot transfer and fine-tuning than prior ...
- In Depth We Trust: Reliable Monocular Depth Supervision for Gaussian Splatting
  A selective regularization framework lets scale-ambiguous monocular depth priors improve Gaussian Splatting geometry and rendering by isolating and supervising only ill-posed regions.
- Stepper: Stepwise Immersive Scene Generation with Multiview Panoramas
  Stepper uses stepwise panoramic expansion with a multi-view 360-degree diffusion model and geometry reconstruction to produce high-fidelity, structurally consistent immersive 3D scenes from text.
- Depth Anything V2
  Depth Anything V2 delivers finer, more robust monocular depth predictions by replacing real labeled images with synthetic data, scaling the teacher model, and using large-scale pseudo-labeled real images for student training.
- Pose-Aware Diffusion for 3D Generation
  PAD synthesizes 3D geometry in observation space via depth unprojection as anchor to eliminate pose ambiguity in image-to-3D generation.
- Learning from the Unseen: Generative Data Augmentation for Geometric-Semantic Accident Anticipation
  A generative video synthesis pipeline paired with a semantic graph neural network yields gains in accident anticipation accuracy and lead time on driving datasets, accompanied by a new benchmark release.
- SS3D: End2End Self-Supervised 3D from Web Videos
  SS3D scales SfM-based self-supervision to ~100M frames from YouTube-8M using a multi-view signal proxy for filtering and a two-stage training schedule, achieving strong zero-shot transfer and better fine-tuning than p...
- Enhancing Hazy Wildlife Imagery: AnimalHaze3k and IncepDehazeGan
  A new wildlife-specific hazy image dataset and IncepDehazeGan model that reports state-of-the-art dehazing metrics and more than doubles downstream animal detection performance.
- Hierarchical Awareness Adapters with Hybrid Pyramid Feature Fusion for Dense Depth Prediction
  A multilevel perceptual CRF model using Swin Transformer, HPF fusion, HA adapters, and dynamic scaling attention achieves state-of-the-art monocular depth estimation on NYU Depth v2, KITTI, and MatterPort3D with reduc...
- MoGe-2: Accurate Monocular Geometry with Metric Scale and Sharp Details
  MoGe-2 recovers metric-scale 3D point maps with fine details from single images via data refinement and extension of affine-invariant predictions.
- SpatialVLA: Exploring Spatial Representations for Visual-Language-Action Model
  SpatialVLA adds 3D-aware position encoding and adaptive discretized action grids to visual-language-action models, enabling strong zero-shot performance and fine-tuning on new robot setups after pre-training on 1.1 mi...
- AtteConDA: Attention-Based Conflict Suppression in Multi-Condition Diffusion Models and Synthetic Data Augmentation
  AtteConDA adds attention-based conflict suppression to multi-condition diffusion models so that generated driving-scene images retain richer structural cues from the original annotations.
- ELoG-GS: Dual-Branch Gaussian Splatting with Luminance-Guided Enhancement for Extreme Low-light 3D Reconstruction
  ELoG-GS integrates geometry-aware initialization and luminance-guided photometric adaptation into Gaussian Splatting, achieving PSNR 18.66 and SSIM 0.69 on the NTIRE 2026 Track 1 low-light 3D reconstruction benchmark.
- Step1X-Edit: A Practical Framework for General Image Editing
  Step1X-Edit integrates a multimodal LLM with a diffusion decoder, trained on a custom high-quality dataset, to deliver image editing performance that surpasses open-source baselines and approaches proprietary models o...
- ELoG-GS: Dual-Branch Gaussian Splatting with Luminance-Guided Enhancement for Extreme Low-light 3D Reconstruction
  ELoG-GS combines learning-based initialization and luminance-guided enhancement inside Gaussian Splatting to raise PSNR to 18.66 and SSIM to 0.69 on the NTIRE 2026 low-light 3D challenge.