pith · machine review for the scientific record

arxiv: 2603.27105 · v2 · submitted 2026-03-28 · 💻 cs.CV

UniDAC: Universal Metric Depth Estimation for Any Camera

Pith reviewed 2026-05-14 22:25 UTC · model grok-4.3

classification 💻 cs.CV
keywords monocular metric depth estimation · cross-camera generalization · fisheye camera · 360-degree camera · scale estimation · relative depth · distortion-aware embedding · equirectangular projection

The pith

UniDAC decouples metric depth estimation into relative depth prediction and spatially varying scale estimation to generalize across any camera with a single model.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces UniDAC as a monocular metric depth estimation framework designed to work robustly on diverse cameras, including standard, fisheye, and 360-degree types, without requiring large-FoV training data or per-domain models. It separates the problem into first predicting relative depth and then estimating a spatially varying scale map, which is upsampled with guidance from the relative depth to handle local variations. A lightweight Depth-Guided Scale Estimation module performs the upsampling step, while RoPE-φ provides distortion-aware positional embeddings that account for latitude-based warping in equirectangular projections. With this structure, a single model consistently outperforms prior methods on cross-camera generalization benchmarks. If the decoupling holds, it removes the need for camera-specific retraining in applications that encounter mixed sensor types.
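
To make the decoupling concrete, here is a minimal sketch of how metric depth could be composed from the two predicted quantities. The affine form, with a per-pixel scale map and a per-image scalar shift, follows the supplement's ablation (scale as a map, shift as a 1-D scalar); the exact composition the authors use may differ.

```python
import torch

def compose_metric_depth(d_rel: torch.Tensor,
                         scale: torch.Tensor,
                         shift: torch.Tensor) -> torch.Tensor:
    """Sketch: recover metric depth from the decoupled predictions.

    d_rel: (B, 1, H, W) relative depth from the first stage
    scale: (B, 1, H, W) spatially varying scale map from the second stage
    shift: (B, 1, 1, 1) per-image shift, kept scalar per the supplement
    """
    # Pixel-wise affine composition: D = S * D_rel + t (assumed form).
    return scale * d_rel + shift
```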

Core claim

UniDAC achieves universal robustness in metric depth estimation across domains by decoupling the task into relative depth prediction followed by spatially varying scale estimation. A Depth-Guided Scale Estimation module uses the relative depth map to guide high-resolution scale upsampling, and RoPE-φ serves as a latitude-aware positional embedding that respects ERP spatial warping, allowing one model to outperform prior approaches on all tested datasets.

What carries the argument

The Depth-Guided Scale Estimation module, which upsamples a coarse scale map to high resolution by using the relative depth map as guidance to capture local scale variations, together with the RoPE-φ distortion-aware positional embedding.
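
Figure 4 describes the upsampling as non-parametric: per-pixel weights come from comparing the relative depth with its downsampled counterpart, and each high-resolution scale value is a weighted combination of nine coarse neighbours. Below is a minimal sketch consistent with that description; the affinity kernel (softmax over negative absolute depth differences) is an assumption, not the authors' exact formulation.

```python
import torch
import torch.nn.functional as F

def depth_guided_upsample(s_coarse: torch.Tensor,
                          d_rel: torch.Tensor,
                          r: int) -> torch.Tensor:
    """Non-parametric depth-guided upsampling in the spirit of Figure 4.

    s_coarse: (B, 1, H/r, W/r) coarse scale map S_r
    d_rel:    (B, 1, H, W) relative depth guide (H, W divisible by r)
    Returns a (B, 1, H, W) upsampled scale map S.
    """
    B, _, H, W = d_rel.shape
    d_coarse = F.avg_pool2d(d_rel, kernel_size=r)  # D_r, the downsampled guide

    def neighbourhoods(x: torch.Tensor) -> torch.Tensor:
        # 3x3 coarse-grid neighbourhoods, replicated to full resolution.
        patches = F.unfold(x, kernel_size=3, padding=1)   # (B, 9, (H/r)*(W/r))
        patches = patches.view(B, 9, H // r, W // r)
        return F.interpolate(patches, scale_factor=r, mode="nearest")

    d_nbrs = neighbourhoods(d_coarse)                     # (B, 9, H, W)
    s_nbrs = neighbourhoods(s_coarse)                     # (B, 9, H, W)
    # Affinity between each fine-grid depth and its nine coarse neighbours
    # plays the role of the weights W in R^(H x W x 9); the softmax makes
    # the combination convex.
    w = torch.softmax(-(d_rel - d_nbrs).abs(), dim=1)
    return (w * s_nbrs).sum(dim=1, keepdim=True)
```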

If this is right

  • A single trained model can be deployed directly in environments that mix standard, fisheye, and 360-degree cameras.
  • No separate training or fine-tuning is required when new camera types are introduced.
  • Real-time applications gain from the lightweight scale estimation module while retaining metric accuracy.
  • Distortion handling improves for equirectangular projections without explicit camera calibration at inference.
  • Cross-domain robustness extends to downstream tasks that rely on metric depth such as 3D reconstruction and navigation.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same relative-to-scale decoupling could be tested on other scale-ambiguous tasks such as surface normal estimation or optical flow across camera types.
  • Combining the scale module with temporal consistency constraints might improve video depth stability when the camera switches between lens types.
  • Evaluating the framework on catadioptric or non-central projection cameras would reveal whether the latitude-aware embedding generalizes beyond ERP.
  • If relative depth accuracy is the bottleneck, pre-training the relative branch on synthetic multi-camera data could further lift performance.

Load-bearing premise

Relative depth predictions remain sufficiently accurate across camera domains to reliably guide the scale estimation without any domain-specific fine-tuning.

What would settle it

Measure metric depth error on a held-out wide-FoV camera after training only on narrow-FoV data; if the error stays high even when the relative depth map is accurate, the decoupling claim fails.
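
A hedged sketch of that check, assuming standard Abs.Rel evaluation and using the median scaling from Figure 2 as the oracle notion of "the relative depth map is accurate":

```python
import numpy as np

def abs_rel(pred: np.ndarray, gt: np.ndarray, mask: np.ndarray) -> float:
    """Absolute relative error over valid (masked) pixels."""
    return float(np.mean(np.abs(pred[mask] - gt[mask]) / gt[mask]))

def decoupling_probe(d_metric, d_rel, d_gt, mask):
    """Compare the model's metric error with an oracle median-scaled
    relative prediction. Low oracle error alongside persistently high
    metric error would localize the failure to the scale stage, i.e.
    evidence against the decoupling claim."""
    s = np.median(d_gt[mask]) / np.median(d_rel[mask])  # oracle global scale
    return abs_rel(d_metric, d_gt, mask), abs_rel(s * d_rel, d_gt, mask)
```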

Figures

Figures reproduced from arXiv: 2603.27105 by Girish Chandar Ganesan, Liu Ren, Xiaoming Liu, Yuliang Guo.

Figure 1. We propose UniDAC, a universal, domain-agnostic metric depth estimation framework that generalizes to any camera. Unlike prior methods that either rely on large-FoV data during training or require separate models for indoor and outdoor domains, UniDAC is trained solely on perspective images yet generalizes effectively to large-FoV inputs, leveraging a universal model to robustly handle both indoor and o…
Figure 2. We show the Abs.Rel error between the predicted relative depth and the ground truth by performing (a) no scaling, (b) median …
Figure 3. Overview of proposed method. UniDAC decouples metric depth estimation into relative depth and scale estimation. Relative depth relies on local scene information, while scene scale is domain-specific and depends on global scene information. Therefore, given an ERP image I, we split the features from the encoder into local features F_l and global features F_g. We predict the relative depth D_rel using the local feat…
Figure 5. Motivation for RoPE-φ. We show the difference between (a) the pixel distance in ERP and (b) the corresponding geodesic distance on the curvature of the sphere. Although |p11 − p12| = |p21 − p22| in the ERP, we see that G(p11, p12) < G(p21, p22) on the sphere. Geodesic distance respects the actual separation in 3D space. Thus, we modify 2D-RoPE to reflect the geodesic distance to get RoPE-φ. The final me…
Figure 4. Depth-Guided Upsampling. We leverage the predicted relative depth D_rel as a guide to upsample the predicted low-resolution scale map S_r ∈ R^((H/r)×(W/r)) to get S ∈ R^(H×W). We compare D_rel and its downsampled version D_r to get the local information in the form of weights W ∈ R^(H×W×9). We compare the spatial mapping between S and S_r and combine it with W to obtain S. The Depth-Guided Upsampling is non-paramet…
Figure 6. Qualitative Results. Every pair of consecutive rows corresponds to a single sample. Odd rows display the input RGB image and the Abs.Rel error between predicted and GT depth maps. Even rows display the GT depth map and predicted depth maps. UniDAC reduces error in the distorted region, as seen around the table in ScanNet++ compared to DACU and on the walls in Pano3D-GV2 compared to UniK3D.
Figure 7. Qualitative Results on ScanNet++ [63]. Every pair of consecutive rows corresponds to a single sample. Odd rows display the input RGB image and the Abs.Rel error between predicted and GT depth maps. Even rows display the GT depth map and predicted depth maps.
Figure 8. Qualitative Results on Pano3D-GV2 [2]. Every pair of consecutive rows corresponds to a single sample. Odd rows display the input RGB image and the Abs.Rel error between predicted and GT depth maps. Even rows display the GT depth map and predicted depth maps.
Figure 9. Qualitative Results on KITTI-360 [31]. Every pair of consecutive rows corresponds to a single sample. Odd rows display the input RGB image and the Abs.Rel error between predicted and GT depth maps. Even rows display the GT depth map and predicted depth maps.
Original abstract

Monocular metric depth estimation (MMDE) is a core challenge in computer vision, playing a pivotal role in real-world applications that demand accurate spatial understanding. Although prior works have shown promising zero-shot performance in MMDE, they often struggle with generalization across diverse camera types, such as fisheye and $360^\circ$ cameras. Recent advances have addressed this through unified camera representations or canonical representation spaces, but they require either including large-FoV camera data during training or separately trained models for different domains. We propose UniDAC, an MMDE framework that presents universal robustness in all domains and generalizes across diverse cameras using a single model. We achieve this by decoupling metric depth estimation into relative depth prediction and spatially varying scale estimation, enabling robust performance across different domains. We propose a lightweight Depth-Guided Scale Estimation module that upsamples a coarse scale map to high resolution using the relative depth map as guidance to account for local scale variations. Furthermore, we introduce RoPE-$\phi$, a distortion-aware positional embedding that respects the spatial warping in Equi-Rectangular Projections (ERP) via latitude-aware weighting. UniDAC achieves state of the art (SoTA) in cross-camera generalization by consistently outperforming prior methods across all datasets.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper proposes UniDAC, a single-model framework for monocular metric depth estimation that decouples the task into relative depth prediction followed by a lightweight Depth-Guided Scale Estimation module to recover spatially varying scales. It introduces RoPE-φ, a latitude-aware positional embedding for handling distortions in equirectangular projections (ERP), and claims universal robustness and state-of-the-art cross-camera generalization across pinhole, fisheye, and 360° cameras without requiring large-FoV training data or per-domain fine-tuning.

Significance. If the central claims hold, this would be a meaningful contribution by enabling practical metric depth estimation on arbitrary camera geometries with one model, which could simplify deployment in robotics, AR/VR, and surveillance. The decoupling strategy and distortion-aware embedding address a real gap in prior unified-camera approaches, but the significance is tempered by the unverified assumption that relative-depth accuracy transfers reliably to unseen distorted geometries to guide scale estimation.

major comments (3)
  1. [Abstract] The claim that UniDAC 'achieves state of the art (SoTA) in cross-camera generalization by consistently outperforming prior methods across all datasets' is presented without any quantitative metrics, tables, error bars, ablation results, or dataset splits, which directly undermines verification of the generalization performance that is the paper's central contribution.
  2. [§3] (Method, Depth-Guided Scale Estimation module): The module relies on the relative depth map to upsample and guide a coarse scale map, but the manuscript provides no analysis or experiments quantifying how spatially correlated errors in relative depth (typical for pinhole-trained estimators on fisheye/ERP) propagate into the final metric depth; this is load-bearing for the 'universal robustness' claim.
  3. [§3.3] (RoPE-φ): The distortion-aware positional embedding is defined only for ERP latitude weighting; no equivalent intrinsic modeling is described for fisheye cameras, so it is unclear how the relative-depth stage can reliably guide scale estimation on fisheye geometries without inheriting distortion-induced errors.
minor comments (2)
  1. [Abstract] The abstract introduces MMDE without spelling it out on first use; expand to 'monocular metric depth estimation (MMDE)' for clarity.
  2. [§3.3] Notation for RoPE-φ should be typeset consistently (e.g., RoPE-ϕ vs. RoPE-φ) across equations and text.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment point-by-point below. Where revisions are needed to strengthen the manuscript, we indicate them explicitly and will incorporate the changes in the next version.

read point-by-point responses
  1. Referee: [Abstract] The claim that UniDAC 'achieves state of the art (SoTA) in cross-camera generalization by consistently outperforming prior methods across all datasets' is presented without any quantitative metrics, tables, error bars, ablation results, or dataset splits, which directly undermines verification of the generalization performance that is the paper's central contribution.

    Authors: We agree that the abstract would benefit from greater specificity to support its claims at first reading. The detailed quantitative results—including per-dataset metrics, error bars from repeated runs, ablation studies, and explicit train/test splits—are reported in Section 4 and the supplementary material. In the revised manuscript we will update the abstract to include concise references to the key cross-camera improvements (e.g., average relative error reductions) while preserving its brevity. revision: yes

  2. Referee: [§3] (Method, Depth-Guided Scale Estimation module): The module relies on the relative depth map to upsample and guide a coarse scale map, but the manuscript provides no analysis or experiments quantifying how spatially correlated errors in relative depth (typical for pinhole-trained estimators on fisheye/ERP) propagate into the final metric depth; this is load-bearing for the 'universal robustness' claim.

    Authors: This observation is correct: the current text does not contain a dedicated propagation analysis. We will add a new subsection (or appendix) that quantifies the effect of spatially correlated relative-depth errors on the final metric output. The added experiments will inject controlled noise patterns matching those observed on distorted geometries and measure the resulting metric-depth degradation, thereby directly supporting the robustness claim (a minimal sketch of such a probe follows this list). revision: yes

  3. Referee: [§3.3] (RoPE-φ): The distortion-aware positional embedding is defined only for ERP latitude weighting; no equivalent intrinsic modeling is described for fisheye cameras, so it is unclear how the relative-depth stage can reliably guide scale estimation on fisheye geometries without inheriting distortion-induced errors.

    Authors: We thank the referee for highlighting this gap in exposition. RoPE-φ is specialized for ERP latitude warping; for fisheye inputs the relative-depth backbone employs standard rotary embeddings together with the mixed-camera training regime, which allows the network to learn fisheye distortion patterns implicitly. The subsequent Depth-Guided Scale Estimation module then operates on the resulting relative map irrespective of source geometry. In the revision we will expand §3.3 with an explicit paragraph describing the fisheye pathway and why the decoupling limits error propagation. revision: yes
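
As flagged in response 2, a propagation probe could look like the following: perturb the relative depth with spatially correlated multiplicative noise, re-run only the scale stage, and track the metric-depth degradation. The low-pass-filtered Gaussian noise is an assumed stand-in for "noise patterns matching those observed on distorted geometries".

```python
import torch
import torch.nn.functional as F

def inject_correlated_noise(d_rel: torch.Tensor,
                            sigma: float = 0.05,
                            kernel: int = 31) -> torch.Tensor:
    """Perturb a (B, 1, H, W) relative depth map with spatially correlated
    multiplicative noise of relative magnitude sigma (kernel must be odd)."""
    noise = torch.randn_like(d_rel)
    # Box-blur white noise so errors are correlated across neighbourhoods,
    # mimicking the smooth error fields of pinhole-trained estimators on ERP.
    weight = torch.ones(1, 1, kernel, kernel, device=d_rel.device) / kernel**2
    noise = F.conv2d(noise, weight, padding=kernel // 2)
    noise = noise / noise.std().clamp_min(1e-8)  # re-normalize after blurring
    return d_rel * (1.0 + sigma * noise)
```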

Circularity Check

0 steps flagged

No circularity: architectural decoupling and new modules are independent of target predictions

full rationale

The paper's core claim rests on an explicit decoupling of metric depth into a relative-depth stage followed by a lightweight Depth-Guided Scale Estimation module that uses the relative depth map only as guidance for upsampling a coarse scale map. RoPE-φ is introduced as a new latitude-aware positional embedding for ERP. Neither step defines any quantity in terms of the final metric output, nor does any equation or module reduce by construction to a fitted parameter taken from the evaluation data. No self-citations are invoked as uniqueness theorems or load-bearing premises in the provided text. The generalization performance is therefore an empirical outcome of the proposed architecture rather than a tautology.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 2 invented entities

The abstract introduces two new modules without listing explicit free parameters; it relies on standard computer-vision assumptions about depth cues and camera models.

axioms (1)
  • domain assumption — Relative depth prediction can be learned independently of absolute scale and remains domain-robust.
    Central to the decoupling strategy described in the abstract.
invented entities (2)
  • Depth-Guided Scale Estimation module — no independent evidence
    purpose: upsample a coarse scale map to high resolution using relative depth guidance
    New lightweight module proposed to handle local scale variations.
  • RoPE-φ — no independent evidence
    purpose: distortion-aware positional embedding for equirectangular projections
    Latitude-aware weighting to respect spatial warping in 360° images.
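
To make the RoPE-φ entry concrete: per Figure 5, equal pixel offsets in ERP should correspond to comparable geodesic offsets on the sphere, so one way to build a latitude-aware rotary embedding is to weight the longitudinal coordinate by cos(latitude) before forming the usual 2D-RoPE rotation angles. The weighting and frequency schedule below are illustrative assumptions, not the paper's definition.

```python
import torch

def rope_phi_angles(h: int, w: int, dim: int, base: float = 100.0) -> torch.Tensor:
    """Rotation angles for a latitude-aware 2D rotary embedding over an
    (h, w) ERP token grid with dim head channels (dim divisible by 4).
    Half of the rotary channel pairs encode the cos(latitude)-weighted
    longitude, the other half the latitude, as in axial 2D-RoPE."""
    phi = torch.linspace(-torch.pi / 2, torch.pi / 2, h)   # latitude per row
    theta = torch.linspace(-torch.pi, torch.pi, w)         # longitude per column
    # Horizontal coordinate shrinks toward the poles so that equal pixel
    # offsets map to (approximately) equal geodesic offsets.
    x = torch.cos(phi)[:, None] * theta[None, :]           # (h, w)
    y = phi[:, None].expand(h, w)                          # (h, w)
    freqs = base ** (-torch.arange(dim // 4).float() / (dim // 4))  # (dim/4,)
    ang_x = x[..., None] * freqs                           # (h, w, dim/4)
    ang_y = y[..., None] * freqs                           # (h, w, dim/4)
    return torch.cat([ang_x, ang_y], dim=-1)               # (h, w, dim/2)
```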

pith-pipeline@v0.9.0 · 5527 in / 1231 out tokens · 49166 ms · 2026-05-14T22:25:22.547707+00:00 · methodology

Reference graph

Works this paper leans on

78 extracted references · 78 canonical work pages · 6 internal anchors

  1. [1] Hao Ai, Zidong Cao, Yan-Pei Cao, Ying Shan, and Lin Wang. HRDFuse: Monocular 360° depth estimation by collaboratively learning holistic-with-regional depth distributions. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023.

  2. [2] Georgios Albanis, Nikolaos Zioulis, Petros Drakoulis, Vasileios Gkitsas, Vladimiros Sterzentsenko, Federico Alvarez, Dimitrios Zarpalas, and Petros Daras. Pano3D: A holistic benchmark and a solid baseline for 360deg depth estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3727–3737, 2021.

  3. [3] Shariq Farooq Bhat, Ibraheem Alhashim, and Peter Wonka. AdaBins: Depth estimation using adaptive bins. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4009–4018, 2021.

  4. [4] Holger Caesar, Varun Bankiti, Alex H. Lang, Sourabh Vora, Venice Erin Liong, Qiang Xu, Anush Krishnan, Yu Pan, Giancarlo Baldan, and Oscar Beijbom. nuScenes: A multimodal dataset for autonomous driving. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11621–11631, 2020.

  5. [5] Angel Chang, Angela Dai, Thomas Funkhouser, Maciej Halber, Matthias Niessner, Manolis Savva, Shuran Song, Andy Zeng, and Yinda Zhang. Matterport3D: Learning from RGB-D data in indoor environments. arXiv preprint arXiv:1709.06158, 2017.

  6. [6] Xingshuai Dong, Matthew A. Garratt, Sreenatha G. Anavatti, and Hussein A. Abbass. Towards real-time monocular depth estimation for robotics: A survey. IEEE Transactions on Intelligent Transportation Systems, 23(10):16940–16961.

  7. [7] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. In International Conference on Learning Representations (ICLR), 2021.

  8. [8] Ruofei Du, Eric Turner, Maksym Dzitsiuk, Luca Prasso, Ivo Duarte, Jason Dourgarian, Joao Afonso, Jose Pascoal, Josh Gladstone, Nuno Cruces, et al. DepthLab: Real-time 3D interaction with depth maps for mobile augmented reality. In Proceedings of the 33rd Annual ACM Symposium on User Interface Software and Technology, pages 829–843, 2020.

  9. [9] Ainaz Eftekhar, Alexander Sax, Jitendra Malik, and Amir Zamir. Omnidata: A scalable pipeline for making multi-task mid-level vision datasets from 3D scans. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 10786–10796, 2021.

  10. [10] David Eigen, Christian Puhrsch, and Rob Fergus. Depth map prediction from a single image using a multi-scale deep network. Advances in Neural Information Processing Systems, 27, 2014.

  11. [11] Hao Feng, Wendi Wang, Jiajun Deng, Wengang Zhou, Li Li, and Houqiang Li. SimFIR: A simple framework for fisheye image rectification with self-supervised representation learning. In IEEE/CVF International Conference on Computer Vision (ICCV), 2023.

  12. [12] Xiao Fu, Wei Yin, Mu Hu, Kaixuan Wang, Yuexin Ma, Ping Tan, Shaojie Shen, Dahua Lin, and Xiaoxiao Long. GeoWizard: Unleashing the diffusion priors for 3D geometry estimation from a single image. In European Conference on Computer Vision, pages 241–258. Springer, 2024.

  13. [13] Suchisrit Gangopadhyay, Jung-Hee Kim, Xien Chen, Patrick Rim, Hyoungseob Park, and Alex Wong. Extending foundational monocular depth estimators to fisheye cameras with calibration tokens. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 5198–5209.

  14. [14] Andreas Geiger, Philip Lenz, and Raquel Urtasun. Are we ready for autonomous driving? The KITTI vision benchmark suite. In 2012 IEEE Conference on Computer Vision and Pattern Recognition, pages 3354–3361. IEEE, 2012.

  15. [15] Christopher Geyer and Kostas Daniilidis. A unifying theory for central panoramic systems and practical implications. In European Conference on Computer Vision, pages 445–461. Springer, 2000.

  16. [16] Jakob Geyer, Yohannes Kassahun, Mentar Mahmudi, Xavier Ricou, Rupesh Durgesh, Andrew S. Chung, Lorenz Hauswald, Viet Hoang Pham, Maximilian Mühlegg, Sebastian Dorn, et al. A2D2: Audi autonomous driving dataset. arXiv preprint arXiv:2004.06320, 2020.

  17. [17] Vitor Guizilini, Rares Ambrus, Sudeep Pillai, Allan Raventos, and Adrien Gaidon. 3D packing for self-supervised monocular depth estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2485–2494, 2020.

  18. [18] Vitor Guizilini, Igor Vasiljevic, Dian Chen, Rares Ambrus, and Adrien Gaidon. Towards zero-shot scale-aware monocular depth estimation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 9233–9243, 2023.

  19. [19] Yuliang Guo, Sparsh Garg, S. Mahdi H. Miangoleh, Xinyu Huang, and Liu Ren. Depth Any Camera: Zero-shot metric depth estimation from any camera. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 26996–27006, 2025.

  20. [20] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.

  21. [21] Byeongho Heo, Song Park, Dongyoon Han, and Sangdoo Yun. Rotary position embedding for vision transformer. In European Conference on Computer Vision, pages 289–305. Springer, 2024.

  22. [22] John Houston, Guido Zuidhof, Luca Bergamini, Yawei Ye, Long Chen, Ashesh Jain, Sammy Omari, Vladimir Iglovikov, and Peter Ondruska. One thousand and one hours: Self-driving motion prediction dataset. In Conference on Robot Learning, pages 409–418. PMLR, 2021.

  23. [23] Mu Hu, Wei Yin, Chi Zhang, Zhipeng Cai, Xiaoxiao Long, Hao Chen, Kaixuan Wang, Gang Yu, Chunhua Shen, and Shaojie Shen. Metric3D v2: A versatile monocular geometric foundation model for zero-shot metric depth and surface normal estimation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 46(12):10579–10596, 2024.

  24. [24] Hualie Jiang, Zhe Sheng, Siyu Zhu, Zilong Dong, and Rui Huang. UniFuse: Unidirectional fusion for 360° panorama depth estimation. IEEE Robotics and Automation Letters, 2021.

  25. [25] Juho Kannala and Sami S. Brandt. A generic camera model and calibration method for conventional, wide-angle, and fish-eye lenses. IEEE Transactions on Pattern Analysis and Machine Intelligence, 28(8):1335–1340, 2006.

  26. [26] Bingxin Ke, Anton Obukhov, Shengyu Huang, Nando Metzger, Rodrigo Caye Daudt, and Konrad Schindler. Repurposing diffusion-based image generators for monocular depth estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024.

  27. [27] Bingxin Ke, Kevin Qu, Tianfu Wang, Nando Metzger, Shengyu Huang, Bo Li, Anton Obukhov, and Konrad Schindler. Marigold: Affordable adaptation of diffusion-based image generators for image analysis. arXiv preprint arXiv:2505.09358, 2025.

  28. [28] Tobias Koch, Lukas Liebel, Friedrich Fraundorfer, and Marco Korner. Evaluation of CNN-based single-image depth estimation methods. In Proceedings of the European Conference on Computer Vision (ECCV) Workshops.

  29. [29] Jin Han Lee, Myung-Kyu Han, Dong Wook Ko, and Il Hong Suh. From big to small: Multi-scale local planar guidance for monocular depth estimation. arXiv preprint arXiv:1907.10326, 2019.

  30. [30] Yuyan Li, Yuliang Guo, Zhixin Yan, Xinyu Huang, Ye Duan, and Liu Ren. OmniFusion: 360 monocular depth estimation via geometry-aware fusion. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022.

  31. [31] Yiyi Liao, Jun Xie, and Andreas Geiger. KITTI-360: A novel dataset and benchmarks for urban scene understanding in 2D and 3D. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(3):3292–3310, 2022.

  32. [32] Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin Transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 10012–10022, 2021.

  33. [33] Ilya Loshchilov and Frank Hutter. SGDR: Stochastic gradient descent with warm restarts. arXiv preprint arXiv:1608.03983, 2016.

  34. [34] Tamas Matuszka, Ivan Barton, Ádám Butykai, Péter Hajas, Dávid Kiss, Domonkos Kovács, Sándor Kunsági-Máté, Péter Lengyel, Gábor Németh, Levente Pető, Dezső Ribli, Dávid Szeghy, Szabolcs Vajna, and Balint Viktor Varga. aiMotive dataset: A multimodal dataset for robust autonomous driving with long-range perception. In International Conference…

  35. [35] Christopher Mei and Patrick Rives. Single view point omnidirectional camera calibration from planar grids. In Proceedings 2007 IEEE International Conference on Robotics and Automation, pages 3945–3950. IEEE, 2007.

  36. [36] John Nickolls, Ian Buck, Michael Garland, and Kevin Skadron. Scalable parallel programming with CUDA: Is CUDA the parallel programming model that application developers have been waiting for? Queue, 6(2):40–53, 2008.

  37. [37] Dennis Park, Rares Ambrus, Vitor Guizilini, Jie Li, and Adrien Gaidon. Is pseudo-lidar needed for monocular 3D object detection? In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 3142–3152, 2021.

  38. [38] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. PyTorch: An imperative style, high-performance deep learning library. Advances in Neural Information Processing Systems, 32, 2019.

  39. [39] Luigi Piccinelli, Christos Sakaridis, and Fisher Yu. iDisc: Internal discretization for monocular depth estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 21477–21487, 2023.

  40. [40] Luigi Piccinelli, Yung-Hsu Yang, Christos Sakaridis, Mattia Segu, Siyuan Li, Luc Van Gool, and Fisher Yu. UniDepth: Universal monocular metric depth estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10106–10116, 2024.

  41. [41] Luigi Piccinelli, Christos Sakaridis, Mattia Segu, Yung-Hsu Yang, Siyuan Li, Wim Abbeloos, and Luc Van Gool. UniK3D: Universal camera monocular 3D estimation. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 1028–1039, 2025.

  42. [42] Luigi Piccinelli, Christos Sakaridis, Yung-Hsu Yang, Mattia Segu, Siyuan Li, Wim Abbeloos, and Luc Van Gool. UniDepthV2: Universal monocular metric depth estimation made simpler, 2025.

  43. [43] Santhosh K. Ramakrishnan, Aaron Gokaslan, Erik Wijmans, Oleksandr Maksymets, Alex Clegg, John Turner, Eric Undersander, Wojciech Galuba, Andrew Westbury, Angel X. Chang, et al. Habitat-Matterport 3D dataset (HM3D): 1000 large-scale 3D environments for embodied AI. arXiv preprint arXiv:2109.08238, 2021.

  44. [44] René Ranftl, Katrin Lasinger, David Hafner, Konrad Schindler, and Vladlen Koltun. Towards robust monocular depth estimation: Mixing datasets for zero-shot cross-dataset transfer. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(3):1623–1637, 2020.

  45. [45] René Ranftl, Alexey Bochkovskiy, and Vladlen Koltun. Vision transformers for dense prediction. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 12179–12188, 2021.

  46. [46] Manuel Rey-Area, Mingze Yuan, and Christian Richardt. 360MonoDepth: High-resolution 360 monocular depth estimation. In 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022.

  47. [47] Mike Roberts, Jason Ramapuram, Anurag Ranjan, Atulit Kumar, Miguel Angel Bautista, Nathan Paczan, Russ Webb, and Joshua M. Susskind. Hypersim: A photorealistic synthetic dataset for holistic indoor scene understanding. In International Conference on Computer Vision (ICCV), 2021.

  48. [48] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10684–10695, 2022.

  49. [49] Anupa Sabnis and Leena Vachhani. Single image based depth estimation for robotic applications. In 2011 IEEE Recent Advances in Intelligent Computational Systems, pages 102–106. IEEE, 2011.

  50. [50] Irawati Nurmala Sari, Weiwei Du, et al. Depth map estimation of single-view image using smartphone camera for a 3-dimension image generation in augmented reality. In 2023 Sixth International Symposium on Computer, Consumer and Control (IS3C), pages 167–170. IEEE, 2023.

  51. [51] Zhijie Shen, Chunyu Lin, Kang Liao, Lang Nie, Zishuo Zheng, and Yao Zhao. PanoFormer: Panorama transformer for indoor 360° depth estimation. In Computer Vision – ECCV 2022 – 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part I, 2022.

  52. [52] Nathan Silberman, Derek Hoiem, Pushmeet Kohli, and Rob Fergus. Indoor segmentation and support inference from RGBD images. In Computer Vision – ECCV 2012: 12th European Conference on Computer Vision, Florence, Italy, October 7–13, 2012, Proceedings, Part V, pages 746–760. Springer, 2012.

  53. [53] Oriane Siméoni, Huy V. Vo, Maximilian Seitzer, Federico Baldassarre, Maxime Oquab, Cijo Jose, Vasil Khalidov, Marc Szafraniec, Seungeun Yi, Michaël Ramamonjisoa, et al. DINOv3. arXiv preprint arXiv:2508.10104, 2025.

  54. [54] Jianlin Su, Murtadha Ahmed, Yu Lu, Shengfeng Pan, Wen Bo, and Yunfeng Liu. RoFormer: Enhanced transformer with rotary position embedding. Neurocomputing, 568:127063, 2024.

  55. [55] Yu-Chuan Su and Kristen Grauman. Learning spherical convolution for fast features from 360° imagery. In Advances in Neural Information Processing Systems 30 (NeurIPS 2017), Long Beach, CA, USA, 2017.

  56. [56] Javier Tirado-Garín and Javier Civera. AnyCalib: On-manifold learning for model-agnostic single-view camera calibration. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 8044–8055, 2025.

  57. [57] Ruicheng Wang, Sicheng Xu, Yue Dong, Yu Deng, Jianfeng Xiang, Zelong Lv, Guangzhong Sun, Xin Tong, and Jiaolong Yang. MoGe-2: Accurate monocular geometry with metric scale and sharp details. arXiv preprint arXiv:2507.02546, 2025.

  58. [58] Yan Wang, Wei-Lun Chao, Divyansh Garg, Bharath Hariharan, Mark Campbell, and Kilian Q. Weinberger. Pseudo-LiDAR from visual depth estimation: Bridging the gap in 3D object detection for autonomous driving. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8445–8453, 2019.

  59. [59] Benjamin Wilson, William Qi, Tanmay Agarwal, John Lambert, Jagjeet Singh, Siddhesh Khandelwal, Bowen Pan, Ratnesh Kumar, Andrew Hartnett, Jhony Kaesemodel Pontes, et al. Argoverse 2: Next generation datasets for self-driving perception and forecasting. arXiv preprint arXiv:2301.00493, 2023.

  60. [60] Yuwen Xiong, Zhiqi Li, Yuntao Chen, Feng Wang, Xizhou Zhu, Jiapeng Luo, Wenhai Wang, Tong Lu, Hongsheng Li, Yu Qiao, Lewei Lu, Jie Zhou, and Jifeng Dai. Efficient deformable ConvNets: Rethinking dynamic and sparse operator for vision applications. 2024.

  61. [61] Lihe Yang, Bingyi Kang, Zilong Huang, Xiaogang Xu, Jiashi Feng, and Hengshuang Zhao. Depth Anything: Unleashing the power of large-scale unlabeled data. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10371–10381, 2024.

  62. [62] Lihe Yang, Bingyi Kang, Zilong Huang, Zhen Zhao, Xiaogang Xu, Jiashi Feng, and Hengshuang Zhao. Depth Anything V2. Advances in Neural Information Processing Systems, 37:21875–21911, 2024.

  63. [63] Chandan Yeshwanth, Yueh-Cheng Liu, Matthias Nießner, and Angela Dai. ScanNet++: A high-fidelity dataset of 3D indoor scenes. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 12–22, 2023.

  64. [64] Wei Yin, Jianming Zhang, Oliver Wang, Simon Niklaus, Long Mai, Simon Chen, and Chunhua Shen. Learning to recover 3D scene shape from a single image. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 204–213, 2021.

  65. [65] Wei Yin, Chi Zhang, Hao Chen, Zhipeng Cai, Gang Yu, Kaixuan Wang, Xiaozhi Chen, and Chunhua Shen. Metric3D: Towards zero-shot metric 3D prediction from a single image. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 9043–9053, 2023.

  66. [66] Weihao Yuan, Xiaodong Gu, Zuozhuo Dai, Siyu Zhu, and Ping Tan. Neural window fully-connected CRFs for monocular depth estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3916–3925, 2022.

  67. [67] Ilwi Yun, Chanyong Shin, Hyunku Lee, Hyuk-Jae Lee, and Chae-Eun Rhee. EGformer: Equirectangular geometry-biased transformer for 360 depth estimation. In IEEE/CVF International Conference on Computer Vision (ICCV), 2023.

  68. [68] Amir R. Zamir, Alexander Sax, William Shen, Leonidas J. Guibas, Jitendra Malik, and Silvio Savarese. Taskonomy: Disentangling task transfer learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3712–3722, 2018.

  69. [69] Shengjie Zhu, Girish Chandar Ganesan, Abhinav Kumar, and Xiaoming Liu. RePLAy: Remove projective lidar depthmap artifacts via exploiting epipolar geometry. In European Conference on Computer Vision, pages 393–411. Springer, 2024.

  70. [70] Xizhou Zhu, Han Hu, Stephen Lin, and Jifeng Dai. Deformable ConvNets v2: More deformable, better results. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019.

  71. [71] UniDAC: Universal Metric Depth Estimation for Any Camera — Supplementary Material (supplementary header).

  72. [72] Training Data (supplementary §7.1 excerpt): "Tab. 5 provides an overview of the training datasets. In addition to the training datasets utilized in DAC [19], we add Argoverse2 and A2D2 to balance the indoor and outdoor distribution in the training set. We observe that out of seven cameras in Argoverse2, the front camera's aspect ratio is different than the rest of the six c…"

  73. [73] Comparison with UniK3D (supplementary excerpt): "As mentioned in Sec. 5.2, the comparison with UniK3D [41] is not fair to UniDAC, since [41] is trained on large-FoV images. However, we note that the comparison is also unfair towards [41] since UniDAC requires ground-truth camera parameters while [41] doesn't. For a fairer comparison, we employ AnyCalib [56], an off-the-shelf cam…"

  74. [74] Comparison with UniK3D, continued (supplementary excerpt): "…and UniDAC using predicted and ground-truth intrinsics. '+A2D2' denotes adding A2D2 [16] in the training data as detailed in Sec. 7.1. We observe that even under this fairer comparison, we still outperform [41] on ScanNet++ [63]. We attribute the decrease in the performance… Table 8. Zero-shot evaluation on perspective datasets. We evaluate all unified m…"

  75. [75] Evaluation on Perspective Datasets (supplementary excerpt): "We compare UniDAC against our baselines on four perspective datasets, KITTI [14], NYU-v2 [52], IBims-1 [28], and nuScenes [4]. While [14, 28, 52] provide artifact-free depthmaps in their official dataset, we utilize [69] to estimate artifact-free depthmaps for [4]. We observe from Tab. 8 that UniDAC outperforms UniK3D…"

  76. [76] Ablation on Encoder Weights (supplementary excerpt): "Tab. 9 evaluates the effect of initializing encoders E with different pre-trained weights on the model performance. We train DACU and UniDAC using DINOv2 and DINOv3 encoders on HM3D and DDAD datasets. While DAC's proposed framework is compatible with any depth estimation model, they use iDisc [39] for its simplicity and effec…"

  77. [77] Ablation on Shift Estimation (supplementary excerpt): "As mentioned in Sec. 4.2, we estimate a scale map S instead of a 1-D scalar s to adjust for irregularities. However, we still estimated shift t as a 1-D scalar. Tab. 10 provides an ablation on estimating a shift scalar and a shift map while keeping scale estimation in the form of a scale map. Formally, we modify the architecture…"

  78. [78] Additional Qualitative Results (supplementary excerpt): "We provide additional qualitative results on ScanNet++ [63], Pano3D-GV2 [2], and KITTI-360 [31] for visual comparison in Fig. 7, Fig. 8 and Fig. 9 respectively."