UniDAC: Universal Metric Depth Estimation for Any Camera
Pith reviewed 2026-05-14 22:25 UTC · model grok-4.3
The pith
UniDAC decouples metric depth estimation into relative depth prediction and spatially varying scale estimation to generalize across any camera with a single model.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
UniDAC achieves universal robustness in metric depth estimation by decoupling the task into relative depth prediction followed by spatially varying scale estimation. A Depth-Guided Scale Estimation module uses the relative depth map to guide high-resolution scale upsampling, and RoPE-φ, a latitude-aware positional embedding, respects the spatial warping of equirectangular projections. Together, these allow one model to outperform prior approaches on all tested datasets.
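At inference time, the decoupling in the core claim reduces to a per-pixel product: a scale-ambiguous relative depth map multiplied by a metric scale map upsampled to full resolution. A minimal numpy sketch of that composition, with illustrative function names and nearest-neighbour upsampling standing in for the paper's learned module:

```python
import numpy as np

def compose_metric_depth(relative_depth, coarse_scale, out_hw):
    """Compose metric depth as (upsampled per-pixel scale) * (relative depth).

    relative_depth: (H, W) scale-ambiguous depth prediction.
    coarse_scale:   (h, w) low-resolution metric scale map (metres per unit
                    of relative depth), upsampled here by nearest neighbour.
    The names and the nearest-neighbour upsampling are assumptions for
    illustration, not the paper's actual operators.
    """
    H, W = out_hw
    h, w = coarse_scale.shape
    # Map each output pixel back to its coarse cell (nearest neighbour).
    rows = np.arange(H) * h // H
    cols = np.arange(W) * w // W
    scale_full = coarse_scale[np.ix_(rows, cols)]
    # Spatially varying scale times relative depth gives metric depth.
    return scale_full * relative_depth
```

The key property the decoupling relies on is that the relative branch can be domain-robust precisely because it never has to commit to absolute units; all metric information lives in the (cheaper to adapt) scale map.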
What carries the argument
The Depth-Guided Scale Estimation module, which upsamples a coarse scale map to high resolution by using the relative depth map as guidance to capture local scale variations, together with the RoPE-φ distortion-aware positional embedding.
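The actual module is learned; as a concrete stand-in, a classical joint-bilateral-style filter illustrates how a high-resolution relative depth map can guide the upsampling of a coarse scale map so that scale edges follow depth edges. Everything below except the interface idea (coarse scale in, depth-guided high-resolution scale out) is an assumption:

```python
import numpy as np

def guided_scale_upsample(coarse_scale, rel_depth, sigma_d=0.1):
    """Joint-bilateral-style upsampling of a coarse scale map, using the
    high-resolution relative depth map as guidance.

    A hand-written classical sketch of depth-guided upsampling, not the
    paper's learned Depth-Guided Scale Estimation module.
    """
    H, W = rel_depth.shape
    h, w = coarse_scale.shape
    # Centre of each coarse cell in high-resolution pixel coordinates.
    ys = np.minimum(((np.arange(h) + 0.5) * H / h).astype(int), H - 1)
    xs = np.minimum(((np.arange(w) + 0.5) * W / w).astype(int), W - 1)
    # Relative depth sampled at those centres acts as the guidance signal.
    guide_lr = rel_depth[ys][:, xs]
    out = np.empty_like(rel_depth, dtype=float)
    for i in range(H):
        for j in range(W):
            # Range weight: coarse cells whose guidance depth matches this
            # pixel's relative depth contribute more to its scale.
            wgt = np.exp(-((guide_lr - rel_depth[i, j]) ** 2) / (2 * sigma_d ** 2))
            out[i, j] = (wgt * coarse_scale).sum() / wgt.sum()
    return out
```

With a flat guidance map this degenerates to averaging the coarse scales; across a depth discontinuity, each side of the edge inherits the scale of the coarse cells whose guidance depth matches it, which is the "local scale variations" behaviour the module is described as capturing.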
If this is right
- A single trained model can be deployed directly in environments that mix standard, fisheye, and 360-degree cameras.
- No separate training or fine-tuning is required when new camera types are introduced.
- Real-time applications gain from the lightweight scale estimation module while retaining metric accuracy.
- Distortion handling improves for equirectangular projections without explicit camera calibration at inference.
- Cross-domain robustness extends to downstream tasks that rely on metric depth such as 3D reconstruction and navigation.
Where Pith is reading between the lines
- The same relative-to-scale decoupling could be tested on other scale-ambiguous tasks such as surface normal estimation or optical flow across camera types.
- Combining the scale module with temporal consistency constraints might improve video depth stability when the camera switches between lens types.
- Evaluating the framework on catadioptric or non-central projection cameras would reveal whether the latitude-aware embedding generalizes beyond ERP.
- If relative depth accuracy is the bottleneck, pre-training the relative branch on synthetic multi-camera data could further lift performance.
Load-bearing premise
Relative depth predictions remain sufficiently accurate across camera domains to reliably guide the scale estimation without any domain-specific fine-tuning.
What would settle it
Measure metric depth error on a held-out wide-FoV camera after training only on narrow-FoV data; if the error stays high even when the relative depth map is accurate, the decoupling claim fails.
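This test can be made concrete with standard depth metrics. The sketch below (hypothetical helper names) computes absolute relative error and the δ<1.25 accuracy, plus an oracle single-scale fit to the relative prediction: if the oracle-scaled relative depth scores well on the held-out wide-FoV camera while the model's own metric output does not, the failure is localized in the scale-estimation stage rather than the relative branch.

```python
import numpy as np

def depth_metrics(pred, gt, mask=None):
    """Standard metric-depth errors: absolute relative error (AbsRel) and
    delta<1.25 accuracy, computed over valid ground-truth pixels."""
    if mask is None:
        mask = gt > 0
    p, g = pred[mask], gt[mask]
    abs_rel = np.mean(np.abs(p - g) / g)
    delta1 = np.mean(np.maximum(p / g, g / p) < 1.25)
    return abs_rel, delta1

def oracle_scaled_metrics(rel, gt, mask=None):
    """Fit a single least-squares scale to the relative prediction before
    scoring. Good oracle numbers alongside bad metric-output numbers would
    mean the relative branch transferred but the scale stage did not,
    which is exactly the probe described in the text."""
    if mask is None:
        mask = gt > 0
    r, g = rel[mask], gt[mask]
    s = (r * g).sum() / (r * r).sum()  # argmin_s ||s*r - g||^2
    return depth_metrics(s * rel, gt, mask)
```

A per-pixel (rather than single-scalar) oracle fit would isolate the spatially varying part of the claim in the same way.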
read the original abstract
Monocular metric depth estimation (MMDE) is a core challenge in computer vision, playing a pivotal role in real-world applications that demand accurate spatial understanding. Although prior works have shown promising zero-shot performance in MMDE, they often struggle with generalization across diverse camera types, such as fisheye and $360^\circ$ cameras. Recent advances have addressed this through unified camera representations or canonical representation spaces, but they require either including large-FoV camera data during training or separately trained models for different domains. We propose UniDAC, an MMDE framework that presents universal robustness in all domains and generalizes across diverse cameras using a single model. We achieve this by decoupling metric depth estimation into relative depth prediction and spatially varying scale estimation, enabling robust performance across different domains. We propose a lightweight Depth-Guided Scale Estimation module that upsamples a coarse scale map to high resolution using the relative depth map as guidance to account for local scale variations. Furthermore, we introduce RoPE-$\phi$, a distortion-aware positional embedding that respects the spatial warping in Equi-Rectangular Projections (ERP) via latitude-aware weighting. UniDAC achieves state of the art (SoTA) in cross-camera generalization by consistently outperforming prior methods across all datasets.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes UniDAC, a single-model framework for monocular metric depth estimation that decouples the task into relative depth prediction followed by a lightweight Depth-Guided Scale Estimation module to recover spatially varying scales. It introduces RoPE-φ, a latitude-aware positional embedding for handling distortions in equirectangular projections (ERP), and claims universal robustness and state-of-the-art cross-camera generalization across pinhole, fisheye, and 360° cameras without requiring large-FoV training data or per-domain fine-tuning.
Significance. If the central claims hold, this would be a meaningful contribution by enabling practical metric depth estimation on arbitrary camera geometries with one model, which could simplify deployment in robotics, AR/VR, and surveillance. The decoupling strategy and distortion-aware embedding address a real gap in prior unified-camera approaches, but the significance is tempered by the unverified assumption that relative-depth accuracy transfers reliably to unseen distorted geometries to guide scale estimation.
major comments (3)
- [Abstract] Abstract: The claim that UniDAC 'achieves state of the art (SoTA) in cross-camera generalization by consistently outperforming prior methods across all datasets' is presented without any quantitative metrics, tables, error bars, ablation results, or dataset splits, which directly undermines verification of the generalization performance that is the paper's central contribution.
- [§3] §3 (Method, Depth-Guided Scale Estimation module): The module relies on the relative depth map to upsample and guide a coarse scale map, but the manuscript provides no analysis or experiments quantifying how spatially correlated errors in relative depth (typical for pinhole-trained estimators on fisheye/ERP) propagate into the final metric depth; this is load-bearing for the 'universal robustness' claim.
- [§3.3] §3.3 (RoPE-φ): The distortion-aware positional embedding is defined only for ERP latitude weighting; no equivalent intrinsic modeling is described for fisheye cameras, so it is unclear how the relative-depth stage can reliably guide scale estimation on fisheye geometries without inheriting distortion-induced errors.
minor comments (2)
- [Abstract] The abstract introduces MMDE without spelling it out on first use; expand to 'monocular metric depth estimation (MMDE)' for clarity.
- [§3.3] Notation for RoPE-φ should be typeset consistently (e.g., RoPE-ϕ vs. RoPE-φ) across equations and text.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address each major comment point-by-point below. Where revisions are needed to strengthen the manuscript, we indicate them explicitly and will incorporate the changes in the next version.
read point-by-point responses
-
Referee: [Abstract] Abstract: The claim that UniDAC 'achieves state of the art (SoTA) in cross-camera generalization by consistently outperforming prior methods across all datasets' is presented without any quantitative metrics, tables, error bars, ablation results, or dataset splits, which directly undermines verification of the generalization performance that is the paper's central contribution.
Authors: We agree that the abstract would benefit from greater specificity to support its claims at first reading. The detailed quantitative results—including per-dataset metrics, error bars from repeated runs, ablation studies, and explicit train/test splits—are reported in Section 4 and the supplementary material. In the revised manuscript we will update the abstract to include concise references to the key cross-camera improvements (e.g., average relative error reductions) while preserving its brevity. revision: yes
-
Referee: [§3] §3 (Method, Depth-Guided Scale Estimation module): The module relies on the relative depth map to upsample and guide a coarse scale map, but the manuscript provides no analysis or experiments quantifying how spatially correlated errors in relative depth (typical for pinhole-trained estimators on fisheye/ERP) propagate into the final metric depth; this is load-bearing for the 'universal robustness' claim.
Authors: This observation is correct: the current text does not contain a dedicated propagation analysis. We will add a new subsection (or appendix) that quantifies the effect of spatially correlated relative-depth errors on the final metric output. The added experiments will inject controlled noise patterns matching those observed on distorted geometries and measure the resulting metric-depth degradation, thereby directly supporting the robustness claim. revision: yes
-
Referee: [§3.3] §3.3 (RoPE-φ): The distortion-aware positional embedding is defined only for ERP latitude weighting; no equivalent intrinsic modeling is described for fisheye cameras, so it is unclear how the relative-depth stage can reliably guide scale estimation on fisheye geometries without inheriting distortion-induced errors.
Authors: We thank the referee for highlighting this gap in exposition. RoPE-φ is specialized for ERP latitude warping; for fisheye inputs the relative-depth backbone employs standard rotary embeddings together with the mixed-camera training regime, which allows the network to learn fisheye distortion patterns implicitly. The subsequent Depth-Guided Scale Estimation module then operates on the resulting relative map irrespective of source geometry. In the revision we will expand §3.3 with an explicit paragraph describing the fisheye pathway and why the decoupling limits error propagation. revision: yes
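The text specifies RoPE-φ only as a latitude-aware weighting on top of rotary position embeddings. One plausible reading, sketched here purely as an assumption, is to weight the horizontal rotary angle by cos φ, since at latitude φ one pixel of horizontal ERP distance spans cos φ of true angle on the sphere:

```python
import numpy as np

def rope_phi(x, col, lat, base=10000.0):
    """Latitude-weighted rotary embedding for one token at ERP column `col`
    and latitude `lat` (radians).

    Standard RoPE rotates channel pairs by theta_k * position. Here the
    column position is weighted by cos(lat), so tokens near the poles
    (where ERP stretches content horizontally) accumulate less rotary
    phase per pixel. The cos-weighting is an assumed reading of RoPE-phi;
    the excerpt only states that the embedding is latitude-aware.
    """
    d = x.shape[-1]
    assert d % 2 == 0
    k = np.arange(d // 2)
    theta = base ** (-2 * k / d)       # per-pair rotation frequencies
    ang = theta * col * np.cos(lat)    # latitude-weighted position
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = np.empty_like(x, dtype=float)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out
```

At the equator (lat = 0) this reduces to standard RoPE; at the poles the rotation vanishes, matching the intuition that all columns of a polar ERP row depict nearly the same point. Like any rotary embedding, it preserves feature norms.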
Circularity Check
No circularity: architectural decoupling and new modules are independent of target predictions
full rationale
The paper's core claim rests on an explicit decoupling of metric depth into a relative-depth stage followed by a lightweight Depth-Guided Scale Estimation module that uses the relative depth map only as guidance for upsampling a coarse scale map. RoPE-φ is introduced as a new latitude-aware positional embedding for ERP. Neither step defines any quantity in terms of the final metric output, nor does any equation or module reduce by construction to a fitted parameter taken from the evaluation data. No self-citations are invoked as uniqueness theorems or load-bearing premises in the provided text. The generalization performance is therefore an empirical outcome of the proposed architecture rather than a tautology.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: relative depth prediction can be learned independently of absolute scale and remains domain-robust.
invented entities (2)
-
Depth-Guided Scale Estimation module
no independent evidence
-
RoPE-φ
no independent evidence
Reference graph
Works this paper leans on
-
[1]
Hao Ai, Zidong Cao, Yan-Pei Cao, Ying Shan, and Lin Wang. Hrdfuse: Monocular 360° depth estimation by collab- oratively learning holistic-with-regional depth distributions. InIEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2023, Vancouver, BC, Canada, June 17- 24, 2023, 2023. 3
work page 2023
-
[2]
Pano3d: A holistic benchmark and a solid baseline for 360deg depth estimation
Georgios Albanis, Nikolaos Zioulis, Petros Drakoulis, Vasileios Gkitsas, Vladimiros Sterzentsenko, Federico Al- varez, Dimitrios Zarpalas, and Petros Daras. Pano3d: A holistic benchmark and a solid baseline for 360deg depth estimation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3727– 3737, 2021. 6, 8, 1, 3, 5
work page 2021
-
[3]
Adabins: Depth estimation using adaptive bins
Shariq Farooq Bhat, Ibraheem Alhashim, and Peter Wonka. Adabins: Depth estimation using adaptive bins. InProceed- ings of the IEEE/CVF conference on computer vision and pattern recognition, pages 4009–4018, 2021. 1
work page 2021
-
[4]
nuscenes: A multi- modal dataset for autonomous driving
Holger Caesar, Varun Bankiti, Alex H Lang, Sourabh V ora, Venice Erin Liong, Qiang Xu, Anush Krishnan, Yu Pan, Gi- ancarlo Baldan, and Oscar Beijbom. nuscenes: A multi- modal dataset for autonomous driving. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 11621–11631, 2020. 2
work page 2020
-
[5]
Matterport3D: Learning from RGB-D Data in Indoor Environments
Angel Chang, Angela Dai, Thomas Funkhouser, Maciej Halber, Matthias Niessner, Manolis Savva, Shuran Song, Andy Zeng, and Yinda Zhang. Matterport3d: Learning from rgb-d data in indoor environments.arXiv preprint arXiv:1709.06158, 2017. 6, 1
work page internal anchor Pith review Pith/arXiv arXiv 2017
-
[6]
Xingshuai Dong, Matthew A Garratt, Sreenatha G Ana- vatti, and Hussein A Abbass. Towards real-time monocular depth estimation for robotics: A survey.IEEE Transactions on Intelligent Transportation Systems, 23(10):16940–16961,
-
[7]
An image is worth 16x16 words: Transformers for image recognition at scale
Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Syl- vain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. InInternational Conference on Learning Representa- tions (ICLR). OpenR...
work page 2021
-
[8]
Depthlab: Real-time 3d in- teraction with depth maps for mobile augmented reality
Ruofei Du, Eric Turner, Maksym Dzitsiuk, Luca Prasso, Ivo Duarte, Jason Dourgarian, Joao Afonso, Jose Pascoal, Josh Gladstone, Nuno Cruces, et al. Depthlab: Real-time 3d in- teraction with depth maps for mobile augmented reality. In Proceedings of the 33rd Annual ACM Symposium on User Interface Software and Technology, pages 829–843, 2020. 1
work page 2020
-
[9]
Omnidata: A scalable pipeline for making multi- task mid-level vision datasets from 3d scans
Ainaz Eftekhar, Alexander Sax, Jitendra Malik, and Amir Zamir. Omnidata: A scalable pipeline for making multi- task mid-level vision datasets from 3d scans. InProceedings of the IEEE/CVF International Conference on Computer Vi- sion, pages 10786–10796, 2021. 6
work page 2021
-
[10]
David Eigen, Christian Puhrsch, and Rob Fergus. Depth map prediction from a single image using a multi-scale deep net- work.Advances in neural information processing systems, 27, 2014. 1
work page 2014
-
[11]
Hao Feng, Wendi Wang, Jiajun Deng, Wengang Zhou, Li Li, and Houqiang Li. Simfir: A simple framework for fisheye image rectification with self-supervised representation learn- ing. InIEEE/CVF International Conference on Computer Vision, ICCV 2023, Paris, France, October 1-6, 2023, 2023. 3
work page 2023
-
[12]
Geowiz- ard: Unleashing the diffusion priors for 3d geometry esti- mation from a single image
Xiao Fu, Wei Yin, Mu Hu, Kaixuan Wang, Yuexin Ma, Ping Tan, Shaojie Shen, Dahua Lin, and Xiaoxiao Long. Geowiz- ard: Unleashing the diffusion priors for 3d geometry esti- mation from a single image. InEuropean Conference on Computer Vision, pages 241–258. Springer, 2024. 1
work page 2024
-
[13]
Extending foun- dational monocular depth estimators to fisheye cameras with calibration tokens
Suchisrit Gangopadhyay, Jung-Hee Kim, Xien Chen, Patrick Rim, Hyoungseob Park, and Alex Wong. Extending foun- dational monocular depth estimators to fisheye cameras with calibration tokens. InProceedings of the IEEE/CVF Inter- national Conference on Computer Vision, pages 5198–5209,
-
[14]
Are we ready for autonomous driving? the kitti vision benchmark suite
Andreas Geiger, Philip Lenz, and Raquel Urtasun. Are we ready for autonomous driving? the kitti vision benchmark suite. In2012 IEEE conference on computer vision and pat- tern recognition, pages 3354–3361. IEEE, 2012. 2
work page 2012
-
[15]
A unifying theory for central panoramic systems and practical implications
Christopher Geyer and Kostas Daniilidis. A unifying theory for central panoramic systems and practical implications. In European conference on computer vision, pages 445–461. Springer, 2000. 1, 2
work page 2000
-
[16]
A2d2: Audi autonomous driving dataset
Jakob Geyer, Yohannes Kassahun, Mentar Mahmudi, Xavier Ricou, Rupesh Durgesh, Andrew S Chung, Lorenz Hauswald, Viet Hoang Pham, Maximilian M ¨uhlegg, Sebas- tian Dorn, et al. A2d2: Audi autonomous driving dataset. arXiv preprint arXiv:2004.06320, 2020. 6, 1
-
[17]
3d packing for self-supervised monocular depth estimation
Vitor Guizilini, Rares Ambrus, Sudeep Pillai, Allan Raven- tos, and Adrien Gaidon. 3d packing for self-supervised monocular depth estimation. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 2485–2494, 2020. 6, 1
work page 2020
-
[18]
Towards zero-shot scale-aware monocu- lar depth estimation
Vitor Guizilini, Igor Vasiljevic, Dian Chen, Rares , Ambrus,, and Adrien Gaidon. Towards zero-shot scale-aware monocu- lar depth estimation. InProceedings of the IEEE/CVF Inter- national Conference on Computer Vision, pages 9233–9243,
-
[19]
Depth any camera: Zero-shot metric depth estimation from any camera
Yuliang Guo, Sparsh Garg, S Mahdi H Miangoleh, Xinyu Huang, and Liu Ren. Depth any camera: Zero-shot metric depth estimation from any camera. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 26996–27006, 2025. 2, 3, 6, 7, 8, 1, 4, 5
work page 2025
-
[20]
Deep residual learning for image recognition
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. InProceed- ings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016. 7
work page 2016
-
[21]
Rotary position embedding for vision transformer
Byeongho Heo, Song Park, Dongyoon Han, and Sangdoo Yun. Rotary position embedding for vision transformer. In European Conference on Computer Vision, pages 289–305. Springer, 2024. 3, 2
work page 2024
-
[22]
One thousand and one hours: Self-driving motion prediction dataset
John Houston, Guido Zuidhof, Luca Bergamini, Yawei Ye, Long Chen, Ashesh Jain, Sammy Omari, Vladimir Iglovikov, and Peter Ondruska. One thousand and one hours: Self-driving motion prediction dataset. InConference on Robot Learning, pages 409–418. PMLR, 2021. 6, 1
work page 2021
-
[23]
Mu Hu, Wei Yin, Chi Zhang, Zhipeng Cai, Xiaoxiao Long, Hao Chen, Kaixuan Wang, Gang Yu, Chunhua Shen, and Shaojie Shen. Metric3d v2: A versatile monocular geomet- ric foundation model for zero-shot metric depth and surface normal estimation.IEEE Transactions on Pattern Analysis and Machine Intelligence, 46(12):10579–10596, 2024. 1, 2, 3, 6, 7
work page 2024
-
[24]
Unifuse: Unidirectional fusion for 360° panorama depth estimation.IEEE Robotics Autom
Hualie Jiang, Zhe Sheng, Siyu Zhu, Zilong Dong, and Rui Huang. Unifuse: Unidirectional fusion for 360° panorama depth estimation.IEEE Robotics Autom. Lett., 2021. 3
work page 2021
-
[25]
Juho Kannala and Sami S Brandt. A generic camera model and calibration method for conventional, wide-angle, and fish-eye lenses.IEEE transactions on pattern analysis and machine intelligence, 28(8):1335–1340, 2006. 1
work page 2006
-
[26]
Repurpos- ing diffusion-based image generators for monocular depth estimation
Bingxin Ke, Anton Obukhov, Shengyu Huang, Nando Met- zger, Rodrigo Caye Daudt, and Konrad Schindler. Repurpos- ing diffusion-based image generators for monocular depth estimation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024. 1, 2
work page 2024
-
[27]
Bingxin Ke, Kevin Qu, Tianfu Wang, Nando Metzger, Shengyu Huang, Bo Li, Anton Obukhov, and Konrad Schindler. Marigold: Affordable adaptation of diffusion- based image generators for image analysis.arXiv preprint arXiv:2505.09358, 2025. 1
-
[28]
Evaluation of cnn-based single-image depth estimation methods
Tobias Koch, Lukas Liebel, Friedrich Fraundorfer, and Marco Korner. Evaluation of cnn-based single-image depth estimation methods. InProceedings of the European Con- ference on Computer Vision (ECCV) Workshops, pages 0–0,
-
[29]
Jin Han Lee, Myung-Kyu Han, Dong Wook Ko, and Il Hong Suh. From big to small: Multi-scale local planar guidance for monocular depth estimation.arXiv preprint arXiv:1907.10326, 2019. 1
-
[30]
Omnifusion: 360 monocular depth estima- tion via geometry-aware fusion
Yuyan Li, Yuliang Guo, Zhixin Yan, Xinyu Huang, Ye Duan, and Liu Ren. Omnifusion: 360 monocular depth estima- tion via geometry-aware fusion. InIEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2022, New Orleans, LA, USA, June 18-24, 2022, 2022. 3
work page 2022
-
[31]
Yiyi Liao, Jun Xie, and Andreas Geiger. Kitti-360: A novel dataset and benchmarks for urban scene understanding in 2d and 3d.IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(3):3292–3310, 2022. 6, 8, 1, 2, 3
work page 2022
-
[32]
Swin transformer: Hierarchical vision transformer using shifted windows
Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF international conference on computer vision, pages 10012–10022, 2021. 7
work page 2021
-
[33]
SGDR: Stochastic Gradient Descent with Warm Restarts
Ilya Loshchilov and Frank Hutter. Sgdr: Stochas- tic gradient descent with warm restarts.arXiv preprint arXiv:1608.03983, 2016. 6
work page internal anchor Pith review Pith/arXiv arXiv 2016
-
[34]
aimotive dataset: A multimodal dataset for robust autonomous driving with long-range perception
Tamas Matuszka, Ivan Barton, ´Ad´am Butykai, P ´eter Hajas, D´avid Kiss, Domonkos Kov´acs, S´andor Kuns´agi-M´at´e, P´eter Lengyel, G ´abor N´emeth, Levente Pet ˝o, Dezs ˝o Ribli, D ´avid Szeghy, Szabolcs Vajna, and Balint Viktor Varga. aimotive dataset: A multimodal dataset for robust autonomous driving with long-range perception. InInternational Confere...
work page 2023
-
[35]
Single view point om- nidirectional camera calibration from planar grids
Christopher Mei and Patrick Rives. Single view point om- nidirectional camera calibration from planar grids. InPro- ceedings 2007 IEEE International Conference on Robotics and Automation, pages 3945–3950. IEEE, 2007. 1, 2
work page 2007
-
[36]
John Nickolls, Ian Buck, Michael Garland, and Kevin Skadron. Scalable parallel programming with cuda: Is cuda the parallel programming model that application developers have been waiting for?Queue, 6(2):40–53, 2008. 6
work page 2008
-
[37]
Dennis Park, Rares Ambrus, Vitor Guizilini, Jie Li, and Adrien Gaidon. Is pseudo-lidar needed for monocular 3d object detection? InProceedings of the IEEE/CVF Inter- national Conference on Computer Vision, pages 3142–3152,
-
[38]
Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. Pytorch: An im- perative style, high-performance deep learning library.Ad- vances in neural information processing systems, 32, 2019. 6
work page 2019
-
[39]
idisc: In- ternal discretization for monocular depth estimation
Luigi Piccinelli, Christos Sakaridis, and Fisher Yu. idisc: In- ternal discretization for monocular depth estimation. InPro- ceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 21477–21487, 2023. 1, 2
work page 2023
-
[40]
Unidepth: Universal monocular metric depth estimation
Luigi Piccinelli, Yung-Hsu Yang, Christos Sakaridis, Mattia Segu, Siyuan Li, Luc Van Gool, and Fisher Yu. Unidepth: Universal monocular metric depth estimation. InProceed- ings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10106–10116, 2024. 1, 2, 3, 6, 7
work page 2024
-
[41]
Unik3d: Universal camera monocular 3d estimation
Luigi Piccinelli, Christos Sakaridis, Mattia Segu, Yung- Hsu Yang, Siyuan Li, Wim Abbeloos, and Luc Van Gool. Unik3d: Universal camera monocular 3d estimation. InPro- ceedings of the Computer Vision and Pattern Recognition Conference, pages 1028–1039, 2025. 2, 3, 6, 7, 8, 1, 4, 5
work page 2025
-
[42]
Unidepthv2: Universal monocular metric depth estimation made simpler, 2025
Luigi Piccinelli, Christos Sakaridis, Yung-Hsu Yang, Mat- tia Segu, Siyuan Li, Wim Abbeloos, and Luc Van Gool. Unidepthv2: Universal monocular metric depth estimation made simpler, 2025. 1, 2, 3
work page 2025
-
[43]
Habitat-Matterport 3D Dataset (HM3D): 1000 Large-scale 3D Environments for Embodied AI
Santhosh K Ramakrishnan, Aaron Gokaslan, Erik Wijmans, Oleksandr Maksymets, Alex Clegg, John Turner, Eric Un- dersander, Wojciech Galuba, Andrew Westbury, Angel X Chang, et al. Habitat-matterport 3d dataset (hm3d): 1000 large-scale 3d environments for embodied ai.arXiv preprint arXiv:2109.08238, 2021. 6, 1
work page internal anchor Pith review Pith/arXiv arXiv 2021
-
[44]
Ren ´e Ranftl, Katrin Lasinger, David Hafner, Konrad Schindler, and Vladlen Koltun. Towards robust monocular depth estimation: Mixing datasets for zero-shot cross-dataset transfer.IEEE transactions on pattern analysis and machine intelligence, 44(3):1623–1637, 2020. 1, 2
work page 2020
-
[45]
Vi- sion transformers for dense prediction
Ren ´e Ranftl, Alexey Bochkovskiy, and Vladlen Koltun. Vi- sion transformers for dense prediction. InProceedings of the IEEE/CVF international conference on computer vision, pages 12179–12188, 2021. 6
work page 2021
-
[46]
360monodepth: High-resolution 360 monocular depth esti- mation
Manuel Rey, Mingze Yuan Area, and Christian Richardt. 360monodepth: High-resolution 360 monocular depth esti- mation. in 2022 ieee. InCVF Conference on Computer Vi- sion and Pattern Recognition (CVPR), 2022. 3
work page 2022
-
[47]
Mike Roberts, Jason Ramapuram, Anurag Ranjan, Atulit Kumar, Miguel Angel Bautista, Nathan Paczan, Russ Webb, and Joshua M. Susskind. Hypersim: A photorealistic syn- thetic dataset for holistic indoor scene understanding. In International Conference on Computer Vision (ICCV) 2021,
work page 2021
-
[48]
High-resolution image synthesis with latent diffusion models
Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Bj ¨orn Ommer. High-resolution image synthesis with latent diffusion models. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10684–10695, 2022. 2
work page 2022
-
[49]
Single image based depth estimation for robotic applications
Anupa Sabnis and Leena Vachhani. Single image based depth estimation for robotic applications. In2011 IEEE Re- cent Advances in Intelligent Computational Systems, pages 102–106. IEEE, 2011. 1
work page 2011
-
[50]
Irawati Nurmala Sari, Weiwei Du, et al. Depth map esti- mation of single-view image using smartphone camera for a 3-dimension image generation in augmented reality. In2023 Sixth International Symposium on Computer, Consumer and Control (IS3C), pages 167–170. IEEE, 2023. 1
work page 2023
-
[51]
Panoformer: Panorama transformer for indoor 360$ˆ{\circ}$ depth estimation
Zhijie Shen, Chunyu Lin, Kang Liao, Lang Nie, Zishuo Zheng, and Yao Zhao. Panoformer: Panorama transformer for indoor 360$ˆ{\circ}$ depth estimation. InComputer Vision - ECCV 2022 - 17th European Conference, Tel Aviv, Israel, October 23-27, 2022, Proceedings, Part I, 2022. 3
work page 2022
-
[52]
Indoor segmentation and support inference from rgbd images
Nathan Silberman, Derek Hoiem, Pushmeet Kohli, and Rob Fergus. Indoor segmentation and support inference from rgbd images. InComputer Vision–ECCV 2012: 12th Eu- ropean Conference on Computer Vision, Florence, Italy, Oc- tober 7-13, 2012, Proceedings, Part V 12, pages 746–760. Springer, 2012. 2
work page 2012
-
[53]
Oriane Sim ´eoni, Huy V V o, Maximilian Seitzer, Federico Baldassarre, Maxime Oquab, Cijo Jose, Vasil Khalidov, Marc Szafraniec, Seungeun Yi, Micha ¨el Ramamonjisoa, et al. Dinov3.arXiv preprint arXiv:2508.10104, 2025. 6
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[54]
Roformer: Enhanced transformer with rotary position embedding.Neurocomputing, 568:127063,
Jianlin Su, Murtadha Ahmed, Yu Lu, Shengfeng Pan, Wen Bo, and Yunfeng Liu. Roformer: Enhanced transformer with rotary position embedding.Neurocomputing, 568:127063,
-
[55]
Learning spherical con- volution for fast features from 360° imagery
Yu-Chuan Su and Kristen Grauman. Learning spherical con- volution for fast features from 360° imagery. InAdvances in Neural Information Processing Systems 30: Annual Confer- ence on Neural Information Processing Systems 2017, De- cember 4-9, 2017, Long Beach, CA, USA, 2017. 3
work page 2017
-
[56]
Anycalib: On- manifold learning for model-agnostic single-view camera calibration
Javier Tirado-Gar ´ın and Javier Civera. Anycalib: On- manifold learning for model-agnostic single-view camera calibration. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 8044–8055, 2025. 1
work page 2025
-
[57]
MoGe-2: Accurate Monocular Geometry with Metric Scale and Sharp Details
Ruicheng Wang, Sicheng Xu, Yue Dong, Yu Deng, Jianfeng Xiang, Zelong Lv, Guangzhong Sun, Xin Tong, and Jiaolong Yang. Moge-2: Accurate monocular geometry with metric scale and sharp details.arXiv preprint arXiv:2507.02546,
work page internal anchor Pith review Pith/arXiv arXiv
-
[58]
Yan Wang, Wei-Lun Chao, Divyansh Garg, Bharath Hari- haran, Mark Campbell, and Kilian Q Weinberger. Pseudo- lidar from visual depth estimation: Bridging the gap in 3d object detection for autonomous driving. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 8445–8453, 2019. 1
work page 2019
-
[59]
Argoverse 2: Next Generation Datasets for Self-Driving Perception and Forecasting
Benjamin Wilson, William Qi, Tanmay Agarwal, John Lambert, Jagjeet Singh, Siddhesh Khandelwal, Bowen Pan, Ratnesh Kumar, Andrew Hartnett, Jhony Kaesemodel Pontes, et al. Argoverse 2: Next generation datasets for self-driving perception and forecasting.arXiv preprint arXiv:2301.00493, 2023. 6, 1
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[60]
Efficient de- formable convnets: Rethinking dynamic and sparse operator for vision applications
Yuwen Xiong, Zhiqi Li, Yuntao Chen, Feng Wang, Xizhou Zhu, Jiapeng Luo, Wenhai Wang, Tong Lu, Hongsheng Li, Yu Qiao, Lewei Lu, Jie Zhou, and Jifeng Dai. Efficient de- formable convnets: Rethinking dynamic and sparse operator for vision applications. 2024. 3
work page 2024
-
[61]
Depth anything: Unleashing the power of large-scale unlabeled data
Lihe Yang, Bingyi Kang, Zilong Huang, Xiaogang Xu, Jiashi Feng, and Hengshuang Zhao. Depth anything: Unleashing the power of large-scale unlabeled data. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10371–10381, 2024. 1, 2
work page 2024
-
[62]
Depth any- thing v2.Advances in Neural Information Processing Sys- tems, 37:21875–21911, 2024
Lihe Yang, Bingyi Kang, Zilong Huang, Zhen Zhao, Xiao- gang Xu, Jiashi Feng, and Hengshuang Zhao. Depth any- thing v2.Advances in Neural Information Processing Sys- tems, 37:21875–21911, 2024. 1, 2
work page 2024
-
[63]
Scannet++: A high-fidelity dataset of 3d in- door scenes
Chandan Yeshwanth, Yueh-Cheng Liu, Matthias Nießner, and Angela Dai. Scannet++: A high-fidelity dataset of 3d in- door scenes. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 12–22, 2023. 6, 8, 1, 3, 4
work page 2023
-
[64]
Learning to recover 3d scene shape from a single image
Wei Yin, Jianming Zhang, Oliver Wang, Simon Niklaus, Long Mai, Simon Chen, and Chunhua Shen. Learning to recover 3d scene shape from a single image. InProceed- ings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 204–213, 2021. 1, 2
work page 2021
-
[65]
Wei Yin, Chi Zhang, Hao Chen, Zhipeng Cai, Gang Yu, Kaixuan Wang, Xiaozhi Chen, and Chunhua Shen. Metric3D: Towards zero-shot metric 3D prediction from a single image. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 9043–9053, 2023.
-
[66]
Weihao Yuan, Xiaodong Gu, Zuozhuo Dai, Siyu Zhu, and Ping Tan. Neural window fully-connected CRFs for monocular depth estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3916–3925, 2022.
-
[67]
Ilwi Yun, Chanyong Shin, Hyunku Lee, Hyuk-Jae Lee, and Chae-Eun Rhee. EGformer: Equirectangular geometry-biased transformer for 360 depth estimation. In IEEE/CVF International Conference on Computer Vision, ICCV 2023, Paris, France, October 1-6, 2023.
-
[68]
Amir R. Zamir, Alexander Sax, William Shen, Leonidas J. Guibas, Jitendra Malik, and Silvio Savarese. Taskonomy: Disentangling task transfer learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3712–3722, 2018.
-
[69]
Shengjie Zhu, Girish Chandar Ganesan, Abhinav Kumar, and Xiaoming Liu. RePLAy: Remove projective lidar depthmap artifacts via exploiting epipolar geometry. In European Conference on Computer Vision, pages 393–411. Springer, 2024.
-
[70]
Xizhou Zhu, Han Hu, Stephen Lin, and Jifeng Dai. Deformable ConvNets v2: More deformable, better results.
UniDAC: Universal Metric Depth Estimation for Any Camera
Supplementary Material
7. Data

7.1. Training Data

Tab. 5 provides an overview of the training datasets. In addition to the training datasets utilized in DAC [19], we add Argoverse2 and A2D2 to balance the indoor and outdoor distribution in the training set. We observe that out of seven cameras in Argoverse2, the front camera's aspect ratio is different than the rest of the six c...
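The rebalancing step above can be sketched as weighted per-draw dataset selection; the function, dataset names, sizes, and weights below are illustrative assumptions, not the paper's actual sampling scheme.

```python
import random

def balanced_sampler(datasets, weights, n, seed=0):
    """Draw n training samples, picking a source dataset per draw by weight.

    datasets: dict mapping dataset name -> list of samples.
    weights:  per-dataset sampling weights, in dict iteration order.
    """
    rng = random.Random(seed)
    names = list(datasets)
    out = []
    for name in rng.choices(names, weights=weights, k=n):
        items = datasets[name]
        out.append((name, items[rng.randrange(len(items))]))
    return out

# Hypothetical mix: equal weights let small outdoor sets (Argoverse2, A2D2)
# contribute as often as a much larger indoor pool such as HM3D.
pool = {
    "HM3D": list(range(1000)),       # indoor
    "Argoverse2": list(range(300)),  # outdoor
    "A2D2": list(range(200)),        # outdoor
}
batch = balanced_sampler(pool, weights=[1.0, 1.0, 1.0], n=6)
```

Equal weights here mean each draw is equally likely to come from any source regardless of raw dataset size, which is one simple way to offset an indoor-heavy pool.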
Comparison with UniK3D

As mentioned in Sec. 5.2, the comparison with UniK3D [41] is not fair to UniDAC, since [41] is trained on large-FoV images. However, we note that the comparison is also unfair towards [41], since UniDAC requires ground-truth camera parameters while [41] doesn't. For a fairer comparison, we employ AnyCalib [56], an off-the-shelf cam...
[Table caption, partial] ... and UniDAC using predicted and ground-truth intrinsics. '+A2D2' denotes adding A2D2 [16] in the training data as detailed in Sec. 7.1.

We observe that even under this fairer comparison, we still outperform [41] on ScanNet++ [63]. We attribute the decrease in the performance ...

Table 8. Zero-shot evaluation on perspective datasets. We evaluate all unified m...
Evaluation on Perspective Datasets

We compare UniDAC against our baselines on four perspective datasets: KITTI [14], NYU-v2 [52], iBims-1 [28], and nuScenes [4]. While [14, 28, 52] provide artifact-free depthmaps in their official datasets, we utilize [69] to estimate artifact-free depthmaps for [4]. We observe from Tab. 8 that UniDAC outperforms UniK3D...
Ablation on Encoder Weights

Tab. 9 evaluates the effect of initializing encoders E with different pre-trained weights on the model performance. We train DACU and UniDAC using DINOv2 and DINOv3 encoders on the HM3D and DDAD datasets. While DAC's proposed framework is compatible with any depth estimation model, they use iDisc [39] for its simplicity and effec...
Ablation on Shift Estimation

As mentioned in Sec. 4.2, we estimate a scale map S instead of a 1-D scalar s to adjust for irregularities. However, we still estimate the shift t as a 1-D scalar. Tab. 10 provides an ablation on estimating a shift scalar versus a shift map while keeping scale estimation in the form of a scale map. Formally, we modify the architecture...
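The scalar-versus-map distinction above can be sketched numerically: a per-pixel scale map S and a shift t (scalar by default, per-pixel in the ablated variant) are applied to the relative depth map. The function name and all values below are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def apply_scale_and_shift(rel_depth, scale_map, shift):
    """Recover metric depth from relative depth.

    scale_map has the same shape as rel_depth (spatially varying scale);
    shift may be a scalar (default configuration) or a per-pixel map of
    the same shape (ablation variant). NumPy broadcasting handles both.
    """
    return scale_map * rel_depth + shift

# Toy 2x2 relative depth map with spatially varying scale.
rel = np.array([[0.2, 0.4],
                [0.6, 0.8]])
scale = np.array([[10.0, 10.0],
                  [5.0, 5.0]])    # assumed meters per relative-depth unit
t_scalar = 0.5                     # scalar shift
t_map = np.full_like(rel, 0.5)     # per-pixel shift, here constant

metric_scalar = apply_scale_and_shift(rel, scale, t_scalar)
metric_map = apply_scale_and_shift(rel, scale, t_map)
```

When the shift map is spatially constant, the two variants coincide; the ablation asks whether letting t vary per pixel, like S already does, buys additional accuracy.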
Additional Qualitative Results

We provide additional qualitative results on ScanNet++ [63], Pano3D-GV2 [2], and KITTI-360 [31] for visual comparison in Fig. 7, Fig. 8 and Fig. 9 respectively.

Figure 7. Qualitative results on ScanNet++ [63] (columns: RGB & GT, DACU [19], UniK3D [41], UniDAC). Every pair of consecutive rows corresponds to a single sample. Odd rows disp...