pith · machine review for the scientific record

arxiv: 2603.27105 · v2 · submitted 2026-03-28 · 💻 cs.CV

UniDAC: Universal Metric Depth Estimation for Any Camera

Pith reviewed 2026-05-14 22:25 UTC · model grok-4.3

classification 💻 cs.CV
keywords monocular metric depth estimation · cross-camera generalization · fisheye camera · 360-degree camera · scale estimation · relative depth · distortion-aware embedding · equirectangular projection

The pith

UniDAC decouples metric depth estimation into relative depth prediction and spatially varying scale estimation to generalize across any camera with a single model.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces UniDAC as a monocular metric depth estimation framework designed to work robustly on diverse cameras, including standard, fisheye, and 360-degree types, without requiring large-FoV training data or per-domain models. It separates the problem into first predicting relative depth and then estimating a spatially varying scale map, which is upsampled with guidance from the relative depth to handle local variations. A lightweight Depth-Guided Scale Estimation module performs the upsampling step, while RoPE-φ provides distortion-aware positional embeddings that account for latitude-based warping in equirectangular projections. With this structure, a single model consistently outperforms prior methods on cross-camera generalization benchmarks. If the decoupling holds, it removes the need for camera-specific retraining in applications that encounter mixed sensor types.
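
To make the decoupling concrete, here is a minimal sketch of how metric depth could be composed from the two predicted quantities. The affine form, with a per-pixel scale map and a per-image scalar shift, follows the supplement's ablation (scale as a map, shift as a 1-D scalar); the exact composition the authors use may differ.

```python
import torch

def compose_metric_depth(d_rel: torch.Tensor,
                         scale: torch.Tensor,
                         shift: torch.Tensor) -> torch.Tensor:
    """Sketch: recover metric depth from the decoupled predictions.

    d_rel: (B, 1, H, W) relative depth from the first stage
    scale: (B, 1, H, W) spatially varying scale map from the second stage
    shift: (B, 1, 1, 1) per-image shift, kept scalar per the supplement
    """
    # Pixel-wise affine composition: D = S * D_rel + t (assumed form).
    return scale * d_rel + shift
```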

Core claim

UniDAC achieves universal robustness in metric depth estimation across domains by decoupling the task into relative depth prediction followed by spatially varying scale estimation. A Depth-Guided Scale Estimation module uses the relative depth map to guide high-resolution scale upsampling, and RoPE-φ serves as a latitude-aware positional embedding that respects ERP spatial warping, allowing one model to outperform prior approaches on all tested datasets.

What carries the argument

The Depth-Guided Scale Estimation module, which upsamples a coarse scale map to high resolution by using the relative depth map as guidance to capture local scale variations, together with the RoPE-φ distortion-aware positional embedding.
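
Figure 4 describes the upsampling as non-parametric: per-pixel weights come from comparing the relative depth with its downsampled counterpart, and each high-resolution scale value is a weighted combination of nine coarse neighbours. Below is a minimal sketch consistent with that description; the affinity kernel (softmax over negative absolute depth differences) is an assumption, not the authors' exact formulation.

```python
import torch
import torch.nn.functional as F

def depth_guided_upsample(s_coarse: torch.Tensor,
                          d_rel: torch.Tensor,
                          r: int) -> torch.Tensor:
    """Non-parametric depth-guided upsampling in the spirit of Figure 4.

    s_coarse: (B, 1, H/r, W/r) coarse scale map S_r
    d_rel:    (B, 1, H, W) relative depth guide (H, W divisible by r)
    Returns a (B, 1, H, W) upsampled scale map S.
    """
    B, _, H, W = d_rel.shape
    d_coarse = F.avg_pool2d(d_rel, kernel_size=r)  # D_r, the downsampled guide

    def neighbourhoods(x: torch.Tensor) -> torch.Tensor:
        # 3x3 coarse-grid neighbourhoods, replicated to full resolution.
        patches = F.unfold(x, kernel_size=3, padding=1)   # (B, 9, (H/r)*(W/r))
        patches = patches.view(B, 9, H // r, W // r)
        return F.interpolate(patches, scale_factor=r, mode="nearest")

    d_nbrs = neighbourhoods(d_coarse)                     # (B, 9, H, W)
    s_nbrs = neighbourhoods(s_coarse)                     # (B, 9, H, W)
    # Affinity between each fine-grid depth and its nine coarse neighbours
    # plays the role of the weights W in R^(H x W x 9); the softmax makes
    # the combination convex.
    w = torch.softmax(-(d_rel - d_nbrs).abs(), dim=1)
    return (w * s_nbrs).sum(dim=1, keepdim=True)
```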

If this is right

  • A single trained model can be deployed directly in environments that mix standard, fisheye, and 360-degree cameras.
  • No separate training or fine-tuning is required when new camera types are introduced.
  • Real-time applications gain from the lightweight scale estimation module while retaining metric accuracy.
  • Distortion handling improves for equirectangular projections without explicit camera calibration at inference.
  • Cross-domain robustness extends to downstream tasks that rely on metric depth such as 3D reconstruction and navigation.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same relative-to-scale decoupling could be tested on other scale-ambiguous tasks such as surface normal estimation or optical flow across camera types.
  • Combining the scale module with temporal consistency constraints might improve video depth stability when the camera switches between lens types.
  • Evaluating the framework on catadioptric or non-central projection cameras would reveal whether the latitude-aware embedding generalizes beyond ERP.
  • If relative depth accuracy is the bottleneck, pre-training the relative branch on synthetic multi-camera data could further lift performance.

Load-bearing premise

Relative depth predictions remain sufficiently accurate across camera domains to reliably guide the scale estimation without any domain-specific fine-tuning.

What would settle it

Measure metric depth error on a held-out wide-FoV camera after training only on narrow-FoV data; if the error stays high even when the relative depth map is accurate, the decoupling claim fails.
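
A hedged sketch of that check, assuming standard Abs.Rel evaluation and using the median scaling from Figure 2 as the oracle notion of "the relative depth map is accurate":

```python
import numpy as np

def abs_rel(pred: np.ndarray, gt: np.ndarray, mask: np.ndarray) -> float:
    """Absolute relative error over valid (masked) pixels."""
    return float(np.mean(np.abs(pred[mask] - gt[mask]) / gt[mask]))

def decoupling_probe(d_metric, d_rel, d_gt, mask):
    """Compare the model's metric error with an oracle median-scaled
    relative prediction. Low oracle error alongside persistently high
    metric error would localize the failure to the scale stage, i.e.
    evidence against the decoupling claim."""
    s = np.median(d_gt[mask]) / np.median(d_rel[mask])  # oracle global scale
    return abs_rel(d_metric, d_gt, mask), abs_rel(s * d_rel, d_gt, mask)
```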

Figures

Figures reproduced from arXiv: 2603.27105 by Girish Chandar Ganesan, Liu Ren, Xiaoming Liu, Yuliang Guo.

Figure 1. We propose UniDAC, a universal, domain-agnostic metric depth estimation framework that generalizes to any camera. Unlike prior methods that either rely on large-FoV data during training or require separate models for indoor and outdoor domains, UniDAC is trained solely on perspective images yet generalizes effectively to large-FoV inputs, leveraging a universal model to robustly handle both indoor and o…
Figure 2. We show the Abs.Rel error between the predicted relative depth and the ground truth by performing (a) no scaling, (b) median …
Figure 3. Overview of proposed method. UniDAC decouples metric depth estimation into relative depth and scale estimation. Relative depth relies on local scene information, while scene scale is domain-specific and depends on global scene information. Therefore, given an ERP image I, we split the features from the encoder into local features F_l and global features F_g. We predict the relative depth D_rel using the local feat…
Figure 5. Motivation for RoPE-φ. We show the difference between (a) the pixel distance in ERP and (b) the corresponding geodesic distance on the curvature of the sphere. Although |p11 − p12| = |p21 − p22| in the ERP, we see that G(p11, p12) < G(p21, p22) on the sphere. Geodesic distance respects the actual separation in 3D space. Thus, we modify 2D-RoPE to reflect the geodesic distance to get RoPE-φ. The final me…
Figure 4. Depth-Guided Upsampling. We leverage the predicted relative depth D_rel as a guide to upsample the predicted low-resolution scale map S_r ∈ R^((H/r)×(W/r)) to get S ∈ R^(H×W). We compare D_rel and its downsampled version D_r to get the local information in the form of weights W ∈ R^(H×W×9). We compare the spatial mapping between S and S_r and combine it with W to obtain S. The Depth-Guided Upsampling is non-paramet…
Figure 6. Qualitative Results. Every pair of consecutive rows corresponds to a single sample. Odd rows display the input RGB image and the Abs.Rel error between predicted and GT depth maps. Even rows display the GT depth map and predicted depth maps. UniDAC reduces error in the distorted region, as seen around the table in ScanNet++ compared to DACU and on the walls in Pano3D-GV2 compared to UniK3D.
Figure 7. Qualitative Results on ScanNet++ [63]. Every pair of consecutive rows corresponds to a single sample. Odd rows display the input RGB image and the Abs.Rel error between predicted and GT depth maps. Even rows display the GT depth map and predicted depth maps.
Figure 8. Qualitative Results on Pano3D-GV2 [2]. Every pair of consecutive rows corresponds to a single sample. Odd rows display the input RGB image and the Abs.Rel error between predicted and GT depth maps. Even rows display the GT depth map and predicted depth maps.
Figure 9. Qualitative Results on KITTI-360 [31]. Every pair of consecutive rows corresponds to a single sample. Odd rows display the input RGB image and the Abs.Rel error between predicted and GT depth maps. Even rows display the GT depth map and predicted depth maps.
Original abstract

Monocular metric depth estimation (MMDE) is a core challenge in computer vision, playing a pivotal role in real-world applications that demand accurate spatial understanding. Although prior works have shown promising zero-shot performance in MMDE, they often struggle with generalization across diverse camera types, such as fisheye and $360^\circ$ cameras. Recent advances have addressed this through unified camera representations or canonical representation spaces, but they require either including large-FoV camera data during training or separately trained models for different domains. We propose UniDAC, an MMDE framework that presents universal robustness in all domains and generalizes across diverse cameras using a single model. We achieve this by decoupling metric depth estimation into relative depth prediction and spatially varying scale estimation, enabling robust performance across different domains. We propose a lightweight Depth-Guided Scale Estimation module that upsamples a coarse scale map to high resolution using the relative depth map as guidance to account for local scale variations. Furthermore, we introduce RoPE-$\phi$, a distortion-aware positional embedding that respects the spatial warping in Equi-Rectangular Projections (ERP) via latitude-aware weighting. UniDAC achieves state of the art (SoTA) in cross-camera generalization by consistently outperforming prior methods across all datasets.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper proposes UniDAC, a single-model framework for monocular metric depth estimation that decouples the task into relative depth prediction followed by a lightweight Depth-Guided Scale Estimation module to recover spatially varying scales. It introduces RoPE-φ, a latitude-aware positional embedding for handling distortions in equirectangular projections (ERP), and claims universal robustness and state-of-the-art cross-camera generalization across pinhole, fisheye, and 360° cameras without requiring large-FoV training data or per-domain fine-tuning.

Significance. If the central claims hold, this would be a meaningful contribution by enabling practical metric depth estimation on arbitrary camera geometries with one model, which could simplify deployment in robotics, AR/VR, and surveillance. The decoupling strategy and distortion-aware embedding address a real gap in prior unified-camera approaches, but the significance is tempered by the unverified assumption that relative-depth accuracy transfers reliably to unseen distorted geometries to guide scale estimation.

major comments (3)
  1. [Abstract] The claim that UniDAC 'achieves state of the art (SoTA) in cross-camera generalization by consistently outperforming prior methods across all datasets' is presented without any quantitative metrics, tables, error bars, ablation results, or dataset splits, which directly undermines verification of the generalization performance that is the paper's central contribution.
  2. [§3] (Method, Depth-Guided Scale Estimation module): The module relies on the relative depth map to upsample and guide a coarse scale map, but the manuscript provides no analysis or experiments quantifying how spatially correlated errors in relative depth (typical for pinhole-trained estimators on fisheye/ERP) propagate into the final metric depth; this is load-bearing for the 'universal robustness' claim.
  3. [§3.3] (RoPE-φ): The distortion-aware positional embedding is defined only for ERP latitude weighting; no equivalent intrinsic modeling is described for fisheye cameras, so it is unclear how the relative-depth stage can reliably guide scale estimation on fisheye geometries without inheriting distortion-induced errors.
minor comments (2)
  1. [Abstract] The abstract introduces MMDE without spelling it out on first use; expand to 'monocular metric depth estimation (MMDE)' for clarity.
  2. [§3.3] Notation for RoPE-φ should be typeset consistently (e.g., RoPE-ϕ vs. RoPE-φ) across equations and text.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment point-by-point below. Where revisions are needed to strengthen the manuscript, we indicate them explicitly and will incorporate the changes in the next version.

read point-by-point responses
  1. Referee: [Abstract] The claim that UniDAC 'achieves state of the art (SoTA) in cross-camera generalization by consistently outperforming prior methods across all datasets' is presented without any quantitative metrics, tables, error bars, ablation results, or dataset splits, which directly undermines verification of the generalization performance that is the paper's central contribution.

    Authors: We agree that the abstract would benefit from greater specificity to support its claims at first reading. The detailed quantitative results—including per-dataset metrics, error bars from repeated runs, ablation studies, and explicit train/test splits—are reported in Section 4 and the supplementary material. In the revised manuscript we will update the abstract to include concise references to the key cross-camera improvements (e.g., average relative error reductions) while preserving its brevity. revision: yes

  2. Referee: [§3] (Method, Depth-Guided Scale Estimation module): The module relies on the relative depth map to upsample and guide a coarse scale map, but the manuscript provides no analysis or experiments quantifying how spatially correlated errors in relative depth (typical for pinhole-trained estimators on fisheye/ERP) propagate into the final metric depth; this is load-bearing for the 'universal robustness' claim.

    Authors: This observation is correct: the current text does not contain a dedicated propagation analysis. We will add a new subsection (or appendix) that quantifies the effect of spatially correlated relative-depth errors on the final metric output. The added experiments will inject controlled noise patterns matching those observed on distorted geometries and measure the resulting metric-depth degradation, thereby directly supporting the robustness claim (a minimal sketch of such a probe follows this list). revision: yes

  3. Referee: [§3.3] (RoPE-φ): The distortion-aware positional embedding is defined only for ERP latitude weighting; no equivalent intrinsic modeling is described for fisheye cameras, so it is unclear how the relative-depth stage can reliably guide scale estimation on fisheye geometries without inheriting distortion-induced errors.

    Authors: We thank the referee for highlighting this gap in exposition. RoPE-φ is specialized for ERP latitude warping; for fisheye inputs the relative-depth backbone employs standard rotary embeddings together with the mixed-camera training regime, which allows the network to learn fisheye distortion patterns implicitly. The subsequent Depth-Guided Scale Estimation module then operates on the resulting relative map irrespective of source geometry. In the revision we will expand §3.3 with an explicit paragraph describing the fisheye pathway and why the decoupling limits error propagation. revision: yes
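
As flagged in response 2, a propagation probe could look like the following: perturb the relative depth with spatially correlated multiplicative noise, re-run only the scale stage, and track the metric-depth degradation. The low-pass-filtered Gaussian noise is an assumed stand-in for "noise patterns matching those observed on distorted geometries".

```python
import torch
import torch.nn.functional as F

def inject_correlated_noise(d_rel: torch.Tensor,
                            sigma: float = 0.05,
                            kernel: int = 31) -> torch.Tensor:
    """Perturb a (B, 1, H, W) relative depth map with spatially correlated
    multiplicative noise of relative magnitude sigma (kernel must be odd)."""
    noise = torch.randn_like(d_rel)
    # Box-blur white noise so errors are correlated across neighbourhoods,
    # mimicking the smooth error fields of pinhole-trained estimators on ERP.
    weight = torch.ones(1, 1, kernel, kernel, device=d_rel.device) / kernel**2
    noise = F.conv2d(noise, weight, padding=kernel // 2)
    noise = noise / noise.std().clamp_min(1e-8)  # re-normalize after blurring
    return d_rel * (1.0 + sigma * noise)
```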

Circularity Check

0 steps flagged

No circularity: architectural decoupling and new modules are independent of target predictions

full rationale

The paper's core claim rests on an explicit decoupling of metric depth into a relative-depth stage followed by a lightweight Depth-Guided Scale Estimation module that uses the relative depth map only as guidance for upsampling a coarse scale map. RoPE-φ is introduced as a new latitude-aware positional embedding for ERP. Neither step defines any quantity in terms of the final metric output, nor does any equation or module reduce by construction to a fitted parameter taken from the evaluation data. No self-citations are invoked as uniqueness theorems or load-bearing premises in the provided text. The generalization performance is therefore an empirical outcome of the proposed architecture rather than a tautology.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 2 invented entities

The abstract introduces two new modules without listing explicit free parameters; it relies on standard computer-vision assumptions about depth cues and camera models.

axioms (1)
  • domain assumption — Relative depth prediction can be learned independently of absolute scale and remains domain-robust.
    Central to the decoupling strategy described in the abstract.
invented entities (2)
  • Depth-Guided Scale Estimation module — no independent evidence
    purpose: upsample a coarse scale map to high resolution using relative depth guidance
    New lightweight module proposed to handle local scale variations.
  • RoPE-φ — no independent evidence
    purpose: distortion-aware positional embedding for equirectangular projections
    Latitude-aware weighting to respect spatial warping in 360° images.
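
To make the RoPE-φ entry concrete: per Figure 5, equal pixel offsets in ERP should correspond to comparable geodesic offsets on the sphere, so one way to build a latitude-aware rotary embedding is to weight the longitudinal coordinate by cos(latitude) before forming the usual 2D-RoPE rotation angles. The weighting and frequency schedule below are illustrative assumptions, not the paper's definition.

```python
import torch

def rope_phi_angles(h: int, w: int, dim: int, base: float = 100.0) -> torch.Tensor:
    """Rotation angles for a latitude-aware 2D rotary embedding over an
    (h, w) ERP token grid with dim head channels (dim divisible by 4).
    Half of the rotary channel pairs encode the cos(latitude)-weighted
    longitude, the other half the latitude, as in axial 2D-RoPE."""
    phi = torch.linspace(-torch.pi / 2, torch.pi / 2, h)   # latitude per row
    theta = torch.linspace(-torch.pi, torch.pi, w)         # longitude per column
    # Horizontal coordinate shrinks toward the poles so that equal pixel
    # offsets map to (approximately) equal geodesic offsets.
    x = torch.cos(phi)[:, None] * theta[None, :]           # (h, w)
    y = phi[:, None].expand(h, w)                          # (h, w)
    freqs = base ** (-torch.arange(dim // 4).float() / (dim // 4))  # (dim/4,)
    ang_x = x[..., None] * freqs                           # (h, w, dim/4)
    ang_y = y[..., None] * freqs                           # (h, w, dim/4)
    return torch.cat([ang_x, ang_y], dim=-1)               # (h, w, dim/2)
```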

pith-pipeline@v0.9.0 · 5527 in / 1231 out tokens · 49166 ms · 2026-05-14T22:25:22.547707+00:00 · methodology

Reference graph

Works this paper leans on

78 extracted references · 78 canonical work pages · 6 internal anchors

  1. [1] Hao Ai, Zidong Cao, Yan-Pei Cao, Ying Shan, and Lin Wang. HRDFuse: Monocular 360° depth estimation by collaboratively learning holistic-with-regional depth distributions. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023.

  2. [2] Georgios Albanis, Nikolaos Zioulis, Petros Drakoulis, Vasileios Gkitsas, Vladimiros Sterzentsenko, Federico Alvarez, Dimitrios Zarpalas, and Petros Daras. Pano3D: A holistic benchmark and a solid baseline for 360deg depth estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3727–3737, 2021.

  3. [3] Shariq Farooq Bhat, Ibraheem Alhashim, and Peter Wonka. AdaBins: Depth estimation using adaptive bins. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4009–4018, 2021.

  4. [4] Holger Caesar, Varun Bankiti, Alex H. Lang, Sourabh Vora, Venice Erin Liong, Qiang Xu, Anush Krishnan, Yu Pan, Giancarlo Baldan, and Oscar Beijbom. nuScenes: A multimodal dataset for autonomous driving. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11621–11631, 2020.

  5. [5] Angel Chang, Angela Dai, Thomas Funkhouser, Maciej Halber, Matthias Niessner, Manolis Savva, Shuran Song, Andy Zeng, and Yinda Zhang. Matterport3D: Learning from RGB-D data in indoor environments. arXiv preprint arXiv:1709.06158, 2017.

  6. [6] Xingshuai Dong, Matthew A. Garratt, Sreenatha G. Anavatti, and Hussein A. Abbass. Towards real-time monocular depth estimation for robotics: A survey. IEEE Transactions on Intelligent Transportation Systems, 23(10):16940–16961.

  7. [7] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. In International Conference on Learning Representations (ICLR), 2021.

  8. [8] Ruofei Du, Eric Turner, Maksym Dzitsiuk, Luca Prasso, Ivo Duarte, Jason Dourgarian, Joao Afonso, Jose Pascoal, Josh Gladstone, Nuno Cruces, et al. DepthLab: Real-time 3D interaction with depth maps for mobile augmented reality. In Proceedings of the 33rd Annual ACM Symposium on User Interface Software and Technology, pages 829–843, 2020.

  9. [9] Ainaz Eftekhar, Alexander Sax, Jitendra Malik, and Amir Zamir. Omnidata: A scalable pipeline for making multi-task mid-level vision datasets from 3D scans. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 10786–10796, 2021.

  10. [10] David Eigen, Christian Puhrsch, and Rob Fergus. Depth map prediction from a single image using a multi-scale deep network. Advances in Neural Information Processing Systems, 27, 2014.

  11. [11] Hao Feng, Wendi Wang, Jiajun Deng, Wengang Zhou, Li Li, and Houqiang Li. SimFIR: A simple framework for fisheye image rectification with self-supervised representation learning. In IEEE/CVF International Conference on Computer Vision (ICCV), 2023.

  12. [12] Xiao Fu, Wei Yin, Mu Hu, Kaixuan Wang, Yuexin Ma, Ping Tan, Shaojie Shen, Dahua Lin, and Xiaoxiao Long. GeoWizard: Unleashing the diffusion priors for 3D geometry estimation from a single image. In European Conference on Computer Vision, pages 241–258. Springer, 2024.

  13. [13] Suchisrit Gangopadhyay, Jung-Hee Kim, Xien Chen, Patrick Rim, Hyoungseob Park, and Alex Wong. Extending foundational monocular depth estimators to fisheye cameras with calibration tokens. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 5198–5209.

  14. [14] Andreas Geiger, Philip Lenz, and Raquel Urtasun. Are we ready for autonomous driving? The KITTI vision benchmark suite. In 2012 IEEE Conference on Computer Vision and Pattern Recognition, pages 3354–3361. IEEE, 2012.

  15. [15] Christopher Geyer and Kostas Daniilidis. A unifying theory for central panoramic systems and practical implications. In European Conference on Computer Vision, pages 445–461. Springer, 2000.

  16. [16] Jakob Geyer, Yohannes Kassahun, Mentar Mahmudi, Xavier Ricou, Rupesh Durgesh, Andrew S. Chung, Lorenz Hauswald, Viet Hoang Pham, Maximilian Mühlegg, Sebastian Dorn, et al. A2D2: Audi autonomous driving dataset. arXiv preprint arXiv:2004.06320, 2020.

  17. [17] Vitor Guizilini, Rares Ambrus, Sudeep Pillai, Allan Raventos, and Adrien Gaidon. 3D packing for self-supervised monocular depth estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2485–2494, 2020.

  18. [18] Vitor Guizilini, Igor Vasiljevic, Dian Chen, Rares Ambrus, and Adrien Gaidon. Towards zero-shot scale-aware monocular depth estimation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 9233–9243, 2023.

  19. [19] Yuliang Guo, Sparsh Garg, S. Mahdi H. Miangoleh, Xinyu Huang, and Liu Ren. Depth Any Camera: Zero-shot metric depth estimation from any camera. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 26996–27006, 2025.

  20. [20] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.

  21. [21] Byeongho Heo, Song Park, Dongyoon Han, and Sangdoo Yun. Rotary position embedding for vision transformer. In European Conference on Computer Vision, pages 289–305. Springer, 2024.

  22. [22] John Houston, Guido Zuidhof, Luca Bergamini, Yawei Ye, Long Chen, Ashesh Jain, Sammy Omari, Vladimir Iglovikov, and Peter Ondruska. One thousand and one hours: Self-driving motion prediction dataset. In Conference on Robot Learning, pages 409–418. PMLR, 2021.

  23. [23] Mu Hu, Wei Yin, Chi Zhang, Zhipeng Cai, Xiaoxiao Long, Hao Chen, Kaixuan Wang, Gang Yu, Chunhua Shen, and Shaojie Shen. Metric3D v2: A versatile monocular geometric foundation model for zero-shot metric depth and surface normal estimation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 46(12):10579–10596, 2024.

  24. [24] Hualie Jiang, Zhe Sheng, Siyu Zhu, Zilong Dong, and Rui Huang. UniFuse: Unidirectional fusion for 360° panorama depth estimation. IEEE Robotics and Automation Letters, 2021.

  25. [25] Juho Kannala and Sami S. Brandt. A generic camera model and calibration method for conventional, wide-angle, and fish-eye lenses. IEEE Transactions on Pattern Analysis and Machine Intelligence, 28(8):1335–1340, 2006.

  26. [26] Bingxin Ke, Anton Obukhov, Shengyu Huang, Nando Metzger, Rodrigo Caye Daudt, and Konrad Schindler. Repurposing diffusion-based image generators for monocular depth estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024.

  27. [27] Bingxin Ke, Kevin Qu, Tianfu Wang, Nando Metzger, Shengyu Huang, Bo Li, Anton Obukhov, and Konrad Schindler. Marigold: Affordable adaptation of diffusion-based image generators for image analysis. arXiv preprint arXiv:2505.09358, 2025.

  28. [28] Tobias Koch, Lukas Liebel, Friedrich Fraundorfer, and Marco Korner. Evaluation of CNN-based single-image depth estimation methods. In Proceedings of the European Conference on Computer Vision (ECCV) Workshops.

  29. [29] Jin Han Lee, Myung-Kyu Han, Dong Wook Ko, and Il Hong Suh. From big to small: Multi-scale local planar guidance for monocular depth estimation. arXiv preprint arXiv:1907.10326, 2019.

  30. [30] Yuyan Li, Yuliang Guo, Zhixin Yan, Xinyu Huang, Ye Duan, and Liu Ren. OmniFusion: 360 monocular depth estimation via geometry-aware fusion. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022.

  31. [31] Yiyi Liao, Jun Xie, and Andreas Geiger. KITTI-360: A novel dataset and benchmarks for urban scene understanding in 2D and 3D. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(3):3292–3310, 2022.

  32. [32] Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin Transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 10012–10022, 2021.

  33. [33] Ilya Loshchilov and Frank Hutter. SGDR: Stochastic gradient descent with warm restarts. arXiv preprint arXiv:1608.03983, 2016.

  34. [34] Tamas Matuszka, Ivan Barton, Ádám Butykai, Péter Hajas, Dávid Kiss, Domonkos Kovács, Sándor Kunsági-Máté, Péter Lengyel, Gábor Németh, Levente Pető, Dezső Ribli, Dávid Szeghy, Szabolcs Vajna, and Balint Viktor Varga. aiMotive dataset: A multimodal dataset for robust autonomous driving with long-range perception. In International Conference…

  35. [35] Christopher Mei and Patrick Rives. Single view point omnidirectional camera calibration from planar grids. In Proceedings 2007 IEEE International Conference on Robotics and Automation, pages 3945–3950. IEEE, 2007.

  36. [36] John Nickolls, Ian Buck, Michael Garland, and Kevin Skadron. Scalable parallel programming with CUDA: Is CUDA the parallel programming model that application developers have been waiting for? Queue, 6(2):40–53, 2008.

  37. [37] Dennis Park, Rares Ambrus, Vitor Guizilini, Jie Li, and Adrien Gaidon. Is pseudo-lidar needed for monocular 3D object detection? In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 3142–3152, 2021.

  38. [38] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. PyTorch: An imperative style, high-performance deep learning library. Advances in Neural Information Processing Systems, 32, 2019.

  39. [39] Luigi Piccinelli, Christos Sakaridis, and Fisher Yu. iDisc: Internal discretization for monocular depth estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 21477–21487, 2023.

  40. [40] Luigi Piccinelli, Yung-Hsu Yang, Christos Sakaridis, Mattia Segu, Siyuan Li, Luc Van Gool, and Fisher Yu. UniDepth: Universal monocular metric depth estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10106–10116, 2024.

  41. [41] Luigi Piccinelli, Christos Sakaridis, Mattia Segu, Yung-Hsu Yang, Siyuan Li, Wim Abbeloos, and Luc Van Gool. UniK3D: Universal camera monocular 3D estimation. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 1028–1039, 2025.

  42. [42] Luigi Piccinelli, Christos Sakaridis, Yung-Hsu Yang, Mattia Segu, Siyuan Li, Wim Abbeloos, and Luc Van Gool. UniDepthV2: Universal monocular metric depth estimation made simpler, 2025.

  43. [43] Santhosh K. Ramakrishnan, Aaron Gokaslan, Erik Wijmans, Oleksandr Maksymets, Alex Clegg, John Turner, Eric Undersander, Wojciech Galuba, Andrew Westbury, Angel X. Chang, et al. Habitat-Matterport 3D dataset (HM3D): 1000 large-scale 3D environments for embodied AI. arXiv preprint arXiv:2109.08238, 2021.

  44. [44] René Ranftl, Katrin Lasinger, David Hafner, Konrad Schindler, and Vladlen Koltun. Towards robust monocular depth estimation: Mixing datasets for zero-shot cross-dataset transfer. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(3):1623–1637, 2020.

  45. [45] René Ranftl, Alexey Bochkovskiy, and Vladlen Koltun. Vision transformers for dense prediction. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 12179–12188, 2021.

  46. [46] Manuel Rey-Area, Mingze Yuan, and Christian Richardt. 360MonoDepth: High-resolution 360 monocular depth estimation. In 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022.

  47. [47] Mike Roberts, Jason Ramapuram, Anurag Ranjan, Atulit Kumar, Miguel Angel Bautista, Nathan Paczan, Russ Webb, and Joshua M. Susskind. Hypersim: A photorealistic synthetic dataset for holistic indoor scene understanding. In International Conference on Computer Vision (ICCV), 2021.

  48. [48] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10684–10695, 2022.

  49. [49] Anupa Sabnis and Leena Vachhani. Single image based depth estimation for robotic applications. In 2011 IEEE Recent Advances in Intelligent Computational Systems, pages 102–106. IEEE, 2011.

  50. [50] Irawati Nurmala Sari, Weiwei Du, et al. Depth map estimation of single-view image using smartphone camera for a 3-dimension image generation in augmented reality. In 2023 Sixth International Symposium on Computer, Consumer and Control (IS3C), pages 167–170. IEEE, 2023.

  51. [51] Zhijie Shen, Chunyu Lin, Kang Liao, Lang Nie, Zishuo Zheng, and Yao Zhao. PanoFormer: Panorama transformer for indoor 360° depth estimation. In Computer Vision – ECCV 2022 – 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part I, 2022.

  52. [52] Nathan Silberman, Derek Hoiem, Pushmeet Kohli, and Rob Fergus. Indoor segmentation and support inference from RGBD images. In Computer Vision – ECCV 2012: 12th European Conference on Computer Vision, Florence, Italy, October 7–13, 2012, Proceedings, Part V, pages 746–760. Springer, 2012.

  53. [53] Oriane Siméoni, Huy V. Vo, Maximilian Seitzer, Federico Baldassarre, Maxime Oquab, Cijo Jose, Vasil Khalidov, Marc Szafraniec, Seungeun Yi, Michaël Ramamonjisoa, et al. DINOv3. arXiv preprint arXiv:2508.10104, 2025.

  54. [54] Jianlin Su, Murtadha Ahmed, Yu Lu, Shengfeng Pan, Wen Bo, and Yunfeng Liu. RoFormer: Enhanced transformer with rotary position embedding. Neurocomputing, 568:127063, 2024.

  55. [55] Yu-Chuan Su and Kristen Grauman. Learning spherical convolution for fast features from 360° imagery. In Advances in Neural Information Processing Systems 30 (NeurIPS 2017), Long Beach, CA, USA, 2017.

  56. [56] Javier Tirado-Garín and Javier Civera. AnyCalib: On-manifold learning for model-agnostic single-view camera calibration. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 8044–8055, 2025.

  57. [57] Ruicheng Wang, Sicheng Xu, Yue Dong, Yu Deng, Jianfeng Xiang, Zelong Lv, Guangzhong Sun, Xin Tong, and Jiaolong Yang. MoGe-2: Accurate monocular geometry with metric scale and sharp details. arXiv preprint arXiv:2507.02546, 2025.

  58. [58] Yan Wang, Wei-Lun Chao, Divyansh Garg, Bharath Hariharan, Mark Campbell, and Kilian Q. Weinberger. Pseudo-LiDAR from visual depth estimation: Bridging the gap in 3D object detection for autonomous driving. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8445–8453, 2019.

  59. [59] Benjamin Wilson, William Qi, Tanmay Agarwal, John Lambert, Jagjeet Singh, Siddhesh Khandelwal, Bowen Pan, Ratnesh Kumar, Andrew Hartnett, Jhony Kaesemodel Pontes, et al. Argoverse 2: Next generation datasets for self-driving perception and forecasting. arXiv preprint arXiv:2301.00493, 2023.

  60. [60] Yuwen Xiong, Zhiqi Li, Yuntao Chen, Feng Wang, Xizhou Zhu, Jiapeng Luo, Wenhai Wang, Tong Lu, Hongsheng Li, Yu Qiao, Lewei Lu, Jie Zhou, and Jifeng Dai. Efficient deformable ConvNets: Rethinking dynamic and sparse operator for vision applications. 2024.

  61. [61] Lihe Yang, Bingyi Kang, Zilong Huang, Xiaogang Xu, Jiashi Feng, and Hengshuang Zhao. Depth Anything: Unleashing the power of large-scale unlabeled data. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10371–10381, 2024.

  62. [62] Lihe Yang, Bingyi Kang, Zilong Huang, Zhen Zhao, Xiaogang Xu, Jiashi Feng, and Hengshuang Zhao. Depth Anything V2. Advances in Neural Information Processing Systems, 37:21875–21911, 2024.

  63. [63] Chandan Yeshwanth, Yueh-Cheng Liu, Matthias Nießner, and Angela Dai. ScanNet++: A high-fidelity dataset of 3D indoor scenes. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 12–22, 2023.

  64. [64] Wei Yin, Jianming Zhang, Oliver Wang, Simon Niklaus, Long Mai, Simon Chen, and Chunhua Shen. Learning to recover 3D scene shape from a single image. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 204–213, 2021.

  65. [65] Wei Yin, Chi Zhang, Hao Chen, Zhipeng Cai, Gang Yu, Kaixuan Wang, Xiaozhi Chen, and Chunhua Shen. Metric3D: Towards zero-shot metric 3D prediction from a single image. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 9043–9053, 2023.

  66. [66] Weihao Yuan, Xiaodong Gu, Zuozhuo Dai, Siyu Zhu, and Ping Tan. Neural window fully-connected CRFs for monocular depth estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3916–3925, 2022.

  67. [67] Ilwi Yun, Chanyong Shin, Hyunku Lee, Hyuk-Jae Lee, and Chae-Eun Rhee. EGformer: Equirectangular geometry-biased transformer for 360 depth estimation. In IEEE/CVF International Conference on Computer Vision (ICCV), 2023.

  68. [68] Amir R. Zamir, Alexander Sax, William Shen, Leonidas J. Guibas, Jitendra Malik, and Silvio Savarese. Taskonomy: Disentangling task transfer learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3712–3722, 2018.

  69. [69] Shengjie Zhu, Girish Chandar Ganesan, Abhinav Kumar, and Xiaoming Liu. RePLAy: Remove projective lidar depthmap artifacts via exploiting epipolar geometry. In European Conference on Computer Vision, pages 393–411. Springer, 2024.

  70. [70] Xizhou Zhu, Han Hu, Stephen Lin, and Jifeng Dai. Deformable ConvNets v2: More deformable, better results. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019.

  71. [71] UniDAC: Universal Metric Depth Estimation for Any Camera — Supplementary Material (supplementary header).

  72. [72] Training Data (supplementary §7.1 excerpt): "Tab. 5 provides an overview of the training datasets. In addition to the training datasets utilized in DAC [19], we add Argoverse2 and A2D2 to balance the indoor and outdoor distribution in the training set. We observe that out of seven cameras in Argoverse2, the front camera's aspect ratio is different than the rest of the six c…"

  73. [73] Comparison with UniK3D (supplementary excerpt): "As mentioned in Sec. 5.2, the comparison with UniK3D [41] is not fair to UniDAC, since [41] is trained on large-FoV images. However, we note that the comparison is also unfair towards [41] since UniDAC requires ground-truth camera parameters while [41] doesn't. For a fairer comparison, we employ AnyCalib [56], an off-the-shelf cam…"

  74. [74] Comparison with UniK3D, continued (supplementary excerpt): "…and UniDAC using predicted and ground-truth intrinsics. '+A2D2' denotes adding A2D2 [16] in the training data as detailed in Sec. 7.1. We observe that even under this fairer comparison, we still outperform [41] on ScanNet++ [63]. We attribute the decrease in the performance… Table 8. Zero-shot evaluation on perspective datasets. We evaluate all unified m…"

  75. [75] Evaluation on Perspective Datasets (supplementary excerpt): "We compare UniDAC against our baselines on four perspective datasets, KITTI [14], NYU-v2 [52], IBims-1 [28], and nuScenes [4]. While [14, 28, 52] provide artifact-free depthmaps in their official dataset, we utilize [69] to estimate artifact-free depthmaps for [4]. We observe from Tab. 8 that UniDAC outperforms UniK3D…"

  76. [76] Ablation on Encoder Weights (supplementary excerpt): "Tab. 9 evaluates the effect of initializing encoders E with different pre-trained weights on the model performance. We train DACU and UniDAC using DINOv2 and DINOv3 encoders on HM3D and DDAD datasets. While DAC's proposed framework is compatible with any depth estimation model, they use iDisc [39] for its simplicity and effec…"

  77. [77] Ablation on Shift Estimation (supplementary excerpt): "As mentioned in Sec. 4.2, we estimate a scale map S instead of a 1-D scalar s to adjust for irregularities. However, we still estimated shift t as a 1-D scalar. Tab. 10 provides an ablation on estimating a shift scalar and a shift map while keeping scale estimation in the form of a scale map. Formally, we modify the architecture…"

  78. [78] Additional Qualitative Results (supplementary excerpt): "We provide additional qualitative results on ScanNet++ [63], Pano3D-GV2 [2], and KITTI-360 [31] for visual comparison in Fig. 7, Fig. 8 and Fig. 9 respectively."