RAD: Retrieval-Augmented Monocular Metric Depth Estimation for Underrepresented Classes
Pith reviewed 2026-05-16 02:47 UTC · model grok-4.3
The pith
Retrieval-augmented networks improve monocular depth estimation for underrepresented classes by transferring geometry from similar retrieved scenes.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
RAD approximates the benefits of multi-view stereo in a monocular setting by retrieving semantically similar RGB-D context samples for low-confidence regions and fusing them through a matched cross-attention module that transfers geometric information only at reliable point correspondences.
What carries the argument
The matched cross-attention module that selectively transfers geometric information from retrieved RGB-D neighbors to the input image's uncertain regions.
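As a rough illustration of the idea, the sketch below implements a masked cross-attention step in NumPy: tokens from the input image attend only to context tokens flagged as reliable matches, and tokens with no reliable match keep their original features. The function name `matched_cross_attention`, the boolean `match_mask`, and the residual fallback are assumptions made for illustration; this page does not specify the paper's actual module.

```python
import numpy as np

def matched_cross_attention(queries, keys, values, match_mask):
    """Cross-attention restricted to reliable correspondences (illustrative only).

    queries:    (Nq, d) features of input-image tokens
    keys:       (Nk, d) features of retrieved-context tokens
    values:     (Nk, d) geometry-carrying context features
    match_mask: (Nq, Nk) boolean, True where a correspondence is deemed reliable
    """
    d = queries.shape[-1]
    scores = queries @ keys.T / np.sqrt(d)           # (Nq, Nk) scaled dot products
    scores = np.where(match_mask, scores, -np.inf)   # block unreliable pairs

    has_match = match_mask.any(axis=1)               # queries with >= 1 reliable key
    # Rows that are entirely masked get a dummy finite row to avoid NaNs.
    safe_scores = np.where(has_match[:, None], scores, 0.0)
    safe_scores = safe_scores - safe_scores.max(axis=1, keepdims=True)
    weights = np.exp(safe_scores)
    weights = weights / weights.sum(axis=1, keepdims=True)

    transferred = weights @ values                   # (Nq, d)
    # Assumed fallback: unmatched queries receive nothing from the context.
    return queries + np.where(has_match[:, None], transferred, 0.0)

rng = np.random.default_rng(0)
q = rng.normal(size=(3, 4))
k = rng.normal(size=(5, 4))
v = rng.normal(size=(5, 4))
mask = np.zeros((3, 5), dtype=bool)
mask[0, 2] = True           # query 0 matches context token 2 only
mask[1, [0, 4]] = True      # query 1 matches two context tokens
out = matched_cross_attention(q, k, v, mask)  # query 2 has no match
```

In this toy call, query 0 receives exactly the value of its single matched token, while query 2 passes through unchanged, which is the behaviour that would limit negative transfer when no trustworthy correspondence exists.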
If this is right
- Depth estimates become more reliable for infrequent objects such as specific furniture or uncommon vehicles without requiring extra camera views.
- The framework preserves accuracy on common classes while delivering targeted gains on rare ones.
- It reduces reliance on multi-view capture setups by simulating their geometric cues through database lookup.
- Uncertainty-aware retrieval becomes a practical lever for improving monocular systems in robotics and driving applications.
Where Pith is reading between the lines
- The same retrieval-plus-matched-fusion pattern could help other single-view tasks like semantic segmentation on rare categories.
- Success depends on access to a sufficiently diverse and well-matched RGB-D database at inference time.
- End-to-end learning of the retrieval step itself might further reduce dependence on hand-crafted uncertainty maps.
- Negative transfer could occur if the cross-attention matching fails silently on edge cases not covered by the current evaluations.
Load-bearing premise
Retrieved RGB-D neighbors will reliably supply accurate structural geometry for the input's low-confidence regions without harmful mismatches.
What would settle it
A test on a dataset containing rare classes but with no semantically similar RGB-D retrieval candidates available, showing no error reduction or outright degradation on those classes.
Original abstract
Monocular Metric Depth Estimation (MMDE) is essential for physically intelligent systems, yet accurate depth estimation for underrepresented classes in complex scenes remains a persistent challenge. To address this, we propose RAD, a retrieval-augmented framework that approximates the benefits of multi-view stereo by utilizing retrieved neighbors as structural geometric proxies. Our method first employs an uncertainty-aware retrieval mechanism to identify low-confidence regions in the input and retrieve RGB-D context samples containing semantically similar content. We then process both the input and retrieved context via a dual-stream network and fuse them using a matched cross-attention module, which transfers geometric information only at reliable point correspondences. Evaluations on NYU Depth v2, KITTI, and Cityscapes demonstrate that RAD significantly outperforms state-of-the-art baselines on underrepresented classes, reducing relative absolute error by 29.2% on NYU Depth v2, 13.3% on KITTI, and 7.2% on Cityscapes, while maintaining competitive performance on standard in-domain benchmarks.
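The headline numbers are relative reductions of an error metric. A minimal sketch of that arithmetic follows, assuming the metric is the usual mean of |prediction - ground truth| / ground truth over pixels with valid ground truth. The helper names and the toy depth values are hypothetical and not taken from the paper.

```python
import numpy as np

def abs_rel_error(pred, gt):
    """Mean absolute relative error over pixels with valid (positive) ground truth."""
    pred = np.asarray(pred, dtype=float)
    gt = np.asarray(gt, dtype=float)
    valid = gt > 0
    return float(np.mean(np.abs(pred[valid] - gt[valid]) / gt[valid]))

def relative_reduction(baseline_error, new_error):
    """Fractional reduction of an error relative to a baseline."""
    return (baseline_error - new_error) / baseline_error

# Toy depths in metres (made up for illustration).
gt = np.array([2.0, 4.0, 0.0, 5.0])        # 0.0 marks a pixel without ground truth
baseline = np.array([2.5, 3.0, 1.0, 6.0])
improved = np.array([2.2, 3.6, 1.0, 5.5])

e_base = abs_rel_error(baseline, gt)
e_new = abs_rel_error(improved, gt)
print(f"baseline={e_base:.4f} improved={e_new:.4f} "
      f"reduction={100 * relative_reduction(e_base, e_new):.1f}%")
```

Note that a relative reduction says nothing about the absolute error level: the same percentage can correspond to very different changes in metres depending on the baseline.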
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes RAD, a retrieval-augmented framework for monocular metric depth estimation (MMDE) that addresses challenges with underrepresented classes by retrieving semantically similar RGB-D neighbors as structural proxies, using an uncertainty-aware retrieval step followed by a dual-stream network and matched cross-attention fusion to transfer geometric information only at reliable correspondences. It claims significant improvements over state-of-the-art baselines on underrepresented classes, with relative absolute error reductions of 29.2% on NYU Depth v2, 13.3% on KITTI, and 7.2% on Cityscapes, while remaining competitive on standard in-domain benchmarks.
Significance. If the reported gains prove robust and the retrieval-attention mechanism is shown to transfer valid correspondences without introducing artifacts, the work could meaningfully advance practical MMDE for long-tail scenes in robotics and autonomous driving by approximating multi-view benefits without requiring additional views at inference time.
major comments (3)
- [Abstract] The central performance claims (29.2%/13.3%/7.2% relative absolute error reductions on underrepresented classes) rest on the unverified assumption that retrieved RGB-D neighbors provide accurate metric depth proxies and that matched cross-attention gates out invalid correspondences. No quantitative evidence (correspondence error rates, attention visualizations restricted to low-confidence regions, or an ablation isolating the matching step) is supplied to support this.
- [Method] The uncertainty-aware retrieval mechanism and the exact fusion mechanics (including how low-confidence regions are identified and how cross-attention is restricted to reliable points) are not described in enough detail to allow reproduction, or to verify that the gains arise from geometric transfer rather than from increased model capacity or retrieval artifacts.
- [Experiments] No statistical significance testing, ablation on the retrieval database construction, or breakdown of error by class frequency is reported. All three are needed to substantiate that the improvements are specific to underrepresented classes rather than uniform across the test sets.
minor comments (2)
- [Method] Clarify the size and construction of the retrieval database (e.g., whether it is built from training splits or external sources) to avoid potential data leakage concerns.
- [Experiments] Add attention-map visualizations or correspondence accuracy metrics focused on underrepresented classes to illustrate the matched cross-attention behavior.
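One way to produce the per-class-frequency breakdown requested in the major comments is to bin classes by their share of training pixels and report the mean error per bin. The sketch below does this on hypothetical inputs; the class names, frequencies, bin edges, and error values are invented for illustration and do not come from the paper.

```python
from collections import defaultdict

def error_by_frequency_bin(class_freq, class_error, bin_edges):
    """Group classes into frequency bins and average their errors.

    class_freq:  {class_name: fraction of training pixels}
    class_error: {class_name: error on the test set}
    bin_edges:   ascending upper edges; a class lands in the first bin whose
                 edge is >= its frequency
    """
    bins = defaultdict(list)
    for name, freq in class_freq.items():
        for edge in bin_edges:
            if freq <= edge:
                bins[edge].append(class_error[name])
                break
        else:
            raise ValueError(f"frequency {freq} of {name!r} exceeds all bin edges")
    return {edge: sum(errs) / len(errs) for edge, errs in sorted(bins.items())}

# Hypothetical numbers, for illustration only.
freq = {"wall": 0.30, "floor": 0.20, "chair": 0.04, "piano": 0.002, "crib": 0.001}
err = {"wall": 0.05, "floor": 0.06, "chair": 0.09, "piano": 0.20, "crib": 0.24}
report = error_by_frequency_bin(freq, err, bin_edges=[0.01, 0.1, 1.0])
for edge, mean_err in report.items():
    print(f"bin with upper frequency edge {edge}: mean error {mean_err:.3f}")
```

Reporting the same table for the baseline and for RAD would show whether the gains concentrate in the low-frequency bins or are spread evenly.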
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address each major comment point-by-point below and agree that the suggested additions will improve clarity and substantiation of the claims.
- Referee: [Abstract] The central performance claims (29.2%/13.3%/7.2% relative absolute error reductions on underrepresented classes) rest on the unverified assumption that retrieved RGB-D neighbors provide accurate metric depth proxies and that matched cross-attention gates out invalid correspondences. No quantitative evidence (correspondence error rates, attention visualizations restricted to low-confidence regions, or an ablation isolating the matching step) is supplied to support this.
  Authors: We agree that direct quantitative support for the mechanism strengthens the central claims. In the revised manuscript we will add (i) correspondence error rates measured on a held-out validation set of retrieved RGB-D pairs, (ii) attention visualizations masked to low-confidence regions, and (iii) an ablation that isolates the matched cross-attention module (comparing it against unmasked and random-attention variants). These results will appear in the Experiments section and supplementary material.
  Revision: yes
- Referee: [Method] The uncertainty-aware retrieval mechanism and the exact fusion mechanics (including how low-confidence regions are identified and how cross-attention is restricted to reliable points) are not described in enough detail to allow reproduction, or to verify that the gains arise from geometric transfer rather than from increased model capacity or retrieval artifacts.
  Authors: We acknowledge that the current description is insufficient for full reproducibility. In the revision we will expand Section 3 with: (a) the precise definition of low-confidence regions (pixel-wise depth variance from a 3-model ensemble), (b) the retrieval pipeline (semantic embedding via CLIP followed by cosine-similarity ranking in the RGB-D database), and (c) the fusion mechanics (cross-attention is masked to point pairs whose matching score exceeds a validation-tuned threshold). We will also include a parameter-matched baseline that uses the same dual-stream architecture without retrieval to isolate the contribution of geometric transfer.
  Revision: yes
- Referee: [Experiments] No statistical significance testing, ablation on the retrieval database construction, or breakdown of error by class frequency is reported. All three are needed to substantiate that the improvements are specific to underrepresented classes rather than uniform across the test sets.
  Authors: We agree these analyses are necessary. In the revised version we will add: (i) statistical significance testing (Wilcoxon signed-rank test on per-image relative absolute errors for the underrepresented-class subset), (ii) an ablation varying retrieval-database size and construction strategy (random sampling vs. semantic selection), and (iii) a per-class-frequency error breakdown (binning classes by training-set frequency and reporting errors separately). These will be presented as new tables and figures in the Experiments section.
  Revision: yes
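The uncertainty and retrieval steps described in the responses can be sketched as follows, assuming pixel-wise variance over a stack of ensemble depth predictions and cosine-similarity ranking over precomputed embeddings. The embedding model is left out entirely (random vectors stand in for it), and the variance threshold, function names, and array shapes are assumptions for illustration rather than the authors' implementation.

```python
import numpy as np

def low_confidence_mask(ensemble_depths, var_threshold):
    """Flag pixels whose depth variance across ensemble members exceeds a threshold.

    ensemble_depths: (M, H, W) depth maps from M models
    """
    variance = np.var(ensemble_depths, axis=0)    # (H, W) pixel-wise variance
    return variance > var_threshold

def retrieve_top_k(query_emb, database_embs, k):
    """Return indices of the k database entries most cosine-similar to the query."""
    q = query_emb / np.linalg.norm(query_emb)
    db = database_embs / np.linalg.norm(database_embs, axis=1, keepdims=True)
    similarity = db @ q                            # (N,) cosine similarities
    return np.argsort(-similarity)[:k]

rng = np.random.default_rng(0)

# Three hypothetical ensemble predictions that disagree at one pixel.
depths = np.full((3, 4, 4), 2.0)
depths[:, 0, 0] = [1.0, 2.0, 3.0]
mask = low_confidence_mask(depths, var_threshold=0.1)

# Random stand-ins for semantic embeddings of database samples.
database = rng.normal(size=(10, 8))
query = database[7] + 0.01 * rng.normal(size=8)    # near-duplicate of entry 7
neighbors = retrieve_top_k(query, database, k=3)
```

For a large database, the exhaustive matrix product would normally be replaced by an approximate nearest-neighbour index; the ranking logic stays the same.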
Circularity Check
No circularity: empirical engineering contribution evaluated on held-out benchmarks
full rationale
The paper introduces RAD as a retrieval-augmented network architecture for monocular depth estimation, relying on uncertainty-aware retrieval, dual-stream processing, and matched cross-attention. All reported gains (e.g., relative absolute error reductions on NYU Depth v2, KITTI, Cityscapes) are measured on standard held-out test sets and do not reduce to any fitted parameter, self-defined quantity, or self-citation chain within the paper. No first-principles derivation or uniqueness theorem is invoked; the work is presented as an empirical framework whose validity rests on external benchmark comparisons rather than internal equivalence by construction.
Supplementary Material: training steps
1. Start with pre-trained DepthAnything v2.
2. Fine-tune the DepthAnything v2 network on the training dataset to produce metric depth rather than relative depth. All weights are trained, with a significantly lower learning rate applied to the encoder, as recommended by the original authors.
3. Construct the context stream encoder by duplicating the fine-tuned encoder from Step 2. Modify the projection operation in the context ViT encoder to accept the depth channel (Sec. 3.1.3 in the main paper).
4. Freeze the decoder and fine-tune the dual-stream encoder. Optimization is performed on the training dataset using the complete retrieval-augmented pipeline (Sec. 3.1.1 in the main paper). In Step 4 we optimized the positional encoding in both streams so that the network learns to differentiate between the two types of inputs. As objective, we used the s...
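The step that modifies the context encoder's projection to accept a depth channel can be illustrated as follows. One common approach, shown here purely as an assumption since this page does not say how the authors did it, is to extend the patch-embedding weights from three to four input channels and zero-initialize the new channel, so the duplicated encoder initially behaves exactly as before. Shapes and names below are hypothetical.

```python
import numpy as np

def add_depth_channel(rgb_weight):
    """Extend patch-embedding weights (out, 3, p, p) to (out, 4, p, p).

    The extra input channel is zero-initialized (an assumed choice), so the
    projection output is unchanged until the depth weights are trained.
    """
    out_dim, in_ch, p, _ = rgb_weight.shape
    assert in_ch == 3, "expected RGB weights"
    depth_weight = np.zeros((out_dim, 1, p, p), dtype=rgb_weight.dtype)
    return np.concatenate([rgb_weight, depth_weight], axis=1)

def patch_embed(image, weight):
    """Project non-overlapping p x p patches of a (C, H, W) image to tokens.

    H and W must be divisible by p.
    """
    out_dim, c, p, _ = weight.shape
    _, h, w = image.shape
    patches = image.reshape(c, h // p, p, w // p, p).transpose(1, 3, 0, 2, 4)
    patches = patches.reshape(-1, c * p * p)           # (num_patches, C*p*p)
    return patches @ weight.reshape(out_dim, -1).T     # (num_patches, out_dim)

rng = np.random.default_rng(0)
w_rgb = rng.normal(size=(16, 3, 2, 2))
w_rgbd = add_depth_channel(w_rgb)

rgb = rng.normal(size=(3, 4, 4))
depth = rng.normal(size=(1, 4, 4))
rgbd = np.concatenate([rgb, depth], axis=0)

tokens_rgb = patch_embed(rgb, w_rgb)
tokens_rgbd = patch_embed(rgbd, w_rgbd)   # matches tokens_rgb at initialization
```

Zero initialization keeps the fine-tuned RGB behaviour intact at the start of the dual-stream training stage; other initializations are possible and may be what the paper uses.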