
arxiv: 2606.12368 · v2 · pith:ZKY5TPZVnew · submitted 2026-06-10 · 💻 cs.CV

DepthMaster: Unified Monocular Depth Estimation for Perspective and Panoramic Images

Pith reviewed 2026-06-27 09:45 UTC · model grok-4.3

classification 💻 cs.CV
keywords monocular depth estimation · panoramic images · perspective images · unified framework · zero-shot generalization · patch decomposition · consistency loss

The pith

Decomposing panoramas into overlapping perspective patches with a consistency loss unifies metric depth estimation for both camera types.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Monocular depth estimation has long required separate models for narrow field-of-view perspective images and full 360° panoramas, because the two camera types differ geometrically and panoramic training data with metric labels is scarce. DepthMaster instead splits each panoramic image into overlapping perspective patches, then applies a Correspondence Consistency Loss together with virtual projection cameras to stitch the patches without custom operators or boundary fixes. This converts every input into a standard perspective view, so the same Transformer backbone can draw on large existing perspective datasets that carry metric ground truth. According to the authors, the result is a single model, trained on a mixed dataset containing only one panoramic set, that delivers state-of-the-art zero-shot performance on thirteen benchmarks and beats both general-purpose and specialist networks in each domain. A reader would care because the method reduces the need to maintain two separate pipelines or to collect new panoramic metric labels.
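The decomposition step described above can be illustrated with a minimal NumPy sketch. It is not the authors' code: the function name `perspective_patch`, the 90° field of view, the 60° yaw spacing, and the nearest-neighbour sampling are all assumptions chosen for illustration. The sketch casts pinhole rays from a virtual camera, rotates them by yaw and pitch, and looks up the matching equirectangular pixel.

```python
import numpy as np

def perspective_patch(pano, yaw_deg, pitch_deg, fov_deg=90.0, size=64):
    """Sample a square pinhole view from an equirectangular panorama.

    Hypothetical helper: nearest-neighbour lookup, camera axes x right,
    y down, z forward; positive pitch looks up, positive yaw looks right.
    """
    H, W = pano.shape[:2]
    f = 0.5 * size / np.tan(np.radians(fov_deg) / 2)  # focal length in pixels
    u, v = np.meshgrid(np.arange(size) + 0.5, np.arange(size) + 0.5)
    rays = np.stack([(u - size / 2) / f, (v - size / 2) / f, np.ones_like(u)], axis=-1)
    rays /= np.linalg.norm(rays, axis=-1, keepdims=True)

    yaw, pitch = np.radians(yaw_deg), np.radians(pitch_deg)
    Rx = np.array([[1, 0, 0],
                   [0, np.cos(pitch), -np.sin(pitch)],
                   [0, np.sin(pitch), np.cos(pitch)]])
    Ry = np.array([[np.cos(yaw), 0, np.sin(yaw)],
                   [0, 1, 0],
                   [-np.sin(yaw), 0, np.cos(yaw)]])
    d = rays @ (Ry @ Rx).T  # rotate every ray into the panorama frame

    lon = np.arctan2(d[..., 0], d[..., 2])            # longitude in [-pi, pi]
    lat = np.arcsin(np.clip(d[..., 1], -1.0, 1.0))    # latitude, positive = down
    col = ((lon / (2 * np.pi) + 0.5) * W).astype(int) % W
    row = np.clip(((lat / np.pi + 0.5) * H).astype(int), 0, H - 1)
    return pano[row, col]

# Toy panorama whose pixel value equals its column index (a longitude code).
pano = np.tile(np.arange(256), (128, 1))
patch_a = perspective_patch(pano, yaw_deg=0, pitch_deg=0)
patch_b = perspective_patch(pano, yaw_deg=60, pitch_deg=0)
shared_columns = set(patch_a[32].tolist()) & set(patch_b[32].tolist())
```

With a 90° field of view and views spaced 60° apart, neighbouring patches share roughly 30° of longitude, which is the kind of overlap a consistency term could act on.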

Core claim

The paper claims that panoramic depth estimation can be reformulated as perspective patch processing, where the Correspondence Consistency Loss and virtual projection cameras ensure seamless stitching and geometric consistency, enabling a standard Transformer backbone to produce accurate metric depth for both perspective and panoramic images while using only one panoramic dataset in training.

What carries the argument

Correspondence Consistency Loss together with virtual projection cameras that serve as geometric priors for stitching overlapping perspective patches taken from panoramas.
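The source text does not give the CCL formula, so the following is only a generic reprojection-consistency sketch under stated assumptions, not the paper's loss. It assumes two square pinhole patches that share a camera centre and differ by a known rotation, with each patch predicting z-depth. The names `rot_y`, `pixel_rays`, and `overlap_consistency` are hypothetical.

```python
import numpy as np

def rot_y(deg):
    """Rotation about the vertical (y) axis, camera-to-world."""
    a = np.radians(deg)
    return np.array([[np.cos(a), 0, np.sin(a)],
                     [0, 1, 0],
                     [-np.sin(a), 0, np.cos(a)]])

def pixel_rays(size, f):
    """Unnormalised pinhole rays with z = 1 through each pixel centre."""
    u, v = np.meshgrid(np.arange(size) + 0.5, np.arange(size) + 0.5)
    return np.stack([(u - size / 2) / f, (v - size / 2) / f, np.ones_like(u)], axis=-1)

def overlap_consistency(depth_a, depth_b, R_a, R_b, f):
    """Mean absolute z-depth disagreement over pixels of A that B also sees."""
    size = depth_a.shape[0]
    pts_a = pixel_rays(size, f) * depth_a[..., None]  # 3D points in frame A
    pts_b = pts_a @ R_a.T @ R_b                       # same points in frame B
    z_b = pts_b[..., 2]
    safe_z = np.where(z_b > 1e-6, z_b, 1.0)
    u_b = f * pts_b[..., 0] / safe_z + size / 2
    v_b = f * pts_b[..., 1] / safe_z + size / 2
    valid = (z_b > 1e-6) & (u_b >= 0) & (u_b < size) & (v_b >= 0) & (v_b < size)
    if not valid.any():
        return 0.0
    cols = np.clip(u_b.astype(int), 0, size - 1)
    rows = np.clip(v_b.astype(int), 0, size - 1)
    pred_b = depth_b[rows, cols]                      # nearest-neighbour lookup in B
    return float(np.abs(z_b - pred_b)[valid].mean())

# Toy scene: every surface point lies 5 m from the shared camera centre.
f, size = 32.0, 64
sphere_z = 5.0 / np.linalg.norm(pixel_rays(size, f), axis=-1)
R_a, R_b = rot_y(0), rot_y(60)
consistent = overlap_consistency(sphere_z, sphere_z, R_a, R_b, f)
inconsistent = overlap_consistency(sphere_z, 1.2 * sphere_z, R_a, R_b, f)
```

In this toy setup the consistent pair yields a small residual that comes only from nearest-neighbour sampling, while scaling one patch by 20% produces a much larger value. A trainable version would need differentiable (for example bilinear) sampling, which this sketch omits.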

If this is right

  • Trained on mixed data containing only one panoramic set, the model reaches state-of-the-art zero-shot results on thirteen diverse datasets.
  • It outperforms both universal depth methods and leading specialist models in the perspective and panoramic domains.
  • All inputs are reduced to a canonical perspective representation that removes the original geometric discrepancy.
  • Abundant perspective metric data become directly usable for panoramic estimation without new annotations.
  • The backbone stays compatible with standard Transformer designs and requires no specialized operators.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar patch decomposition plus consistency losses could be tested on other wide-field tasks such as semantic segmentation or optical flow on 360° video.
  • The reduced need for panoramic labels might speed up adaptation of depth models in robotics and virtual-reality pipelines that rely on 360° cameras.
  • Running the same model on video sequences would test whether temporal consistency appears automatically from the spatial consistency term.

Load-bearing premise

The assumption that patch decomposition plus the consistency loss and virtual cameras will remove all boundary artifacts and geometric mismatches so that metric depth remains accurate after stitching.

What would settle it

Visible seams or large metric errors at patch boundaries on standard 360° benchmarks with ground-truth depth, or results below leading specialist panoramic networks on those benchmarks, would count against the claim.
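One simple way to probe for seams is sketched below. The metric is an assumption made for illustration, not one defined by the paper: `seam_jump_ratio` (a hypothetical name) compares the mean horizontal depth jump at known patch-boundary columns of a stitched equirectangular depth map with the mean jump everywhere else.

```python
import numpy as np

def seam_jump_ratio(depth, seam_cols):
    """Ratio of mean |horizontal depth jump| at seam columns to that elsewhere.

    depth: (H, W) equirectangular depth map; seam_cols: column indices where
    two patches meet. The jump at column c is |depth[:, c] - depth[:, c - 1]|,
    wrapping around the 360° border.
    """
    jumps = np.abs(depth - np.roll(depth, 1, axis=1)).mean(axis=0)
    mask = np.zeros(depth.shape[1], dtype=bool)
    mask[list(seam_cols)] = True
    return float(jumps[mask].mean() / (jumps[~mask].mean() + 1e-12))

# Toy data: a smooth depth map, and a copy with one patch offset by 0.5 m.
W = 256
smooth = np.tile(3.0 + np.sin(2 * np.pi * np.arange(W) / W), (32, 1))
seamed = smooth.copy()
seamed[:, 64:128] += 0.5
seams = [0, 64, 128, 192]
ratio_smooth = seam_jump_ratio(smooth, seams)
ratio_seamed = seam_jump_ratio(seamed, seams)
```

A ratio near or below 1 suggests the boundaries are no rougher than the rest of the map, while a ratio far above 1 points to stitching artifacts. Real scenes contain genuine depth edges, so in practice the comparison is more informative when made against ground-truth jumps at the same columns.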

Original abstract

While monocular depth estimation has achieved significant progress, achieving generalized metric depth estimation for both narrow field-of-view (FoV) perspectives and $360^\circ$ panoramas remains an unsolved challenge. Existing methods are often tailored to specific camera types and struggle to produce accurate metric depth that generalizes across diverse settings. This limitation stems from two key challenges: the inherent geometric discrepancy between perspective and panoramic cameras, and the scarcity of panoramic training data with metric annotations. In this work, we introduce DepthMaster, a unified metric depth estimation framework. Rather than employing specialized networks to learn spherical distortions, we reformulate the problem by decomposing panoramic images into overlapping perspective patches. Crucially, distinct from prior projection-based methods that rely on ad-hoc architectural modifications to handle boundaries, we introduce a novel Correspondence Consistency Loss (CCL) and inject virtual projection cameras as geometric priors, allowing us to seamlessly stitch the patches while avoiding specialized operators and keeping the backbone largely compatible with standard Transformer designs. This strategy also resolves the geometric differences by unifying all inputs into a canonical perspective representation, and effectively circumvents data scarcity by directly unlocking powerful metric priors from vast perspective datasets. Trained on a mixed dataset that contains only one panorama dataset, DepthMaster achieves state-of-the-art zero-shot performance on 13 diverse datasets, outperforming not only universal methods but also leading specialist models in both perspective and panoramic domains.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The manuscript introduces DepthMaster, a unified monocular metric depth estimation framework for both perspective and 360° panoramic images. It decomposes panoramas into overlapping perspective patches, introduces a Correspondence Consistency Loss (CCL) together with virtual projection cameras as geometric priors to enable seamless stitching without specialized operators or architectural changes to the Transformer backbone, and reports state-of-the-art zero-shot performance on 13 diverse datasets after training on a mixed dataset containing only a single panoramic dataset.

Significance. If the empirical claims are substantiated, the work would be significant for the field because it offers a practical route to leverage abundant perspective datasets for panoramic depth estimation, potentially reducing the need for domain-specific networks and large-scale panoramic metric annotations while maintaining compatibility with standard vision transformers.

major comments (3)
  1. [Abstract] Abstract: The central claim of SOTA zero-shot performance on 13 datasets is asserted without any quantitative tables, error bars, ablation results, or implementation details on CCL formulation or virtual-camera injection, rendering the performance claims impossible to evaluate from the supplied text.
  2. [Method] Method section: The Correspondence Consistency Loss (CCL) and virtual projection cameras are presented as the mechanisms that resolve geometric discrepancies and eliminate boundary artifacts, yet no explicit equations, loss formulation, or pseudocode are supplied to show how correspondence is enforced across patch overlaps at the precision required for metric (rather than relative) depth.
  3. [Experiments] Experiments section: No ablation isolating CCL or virtual cameras, and no boundary-specific continuity metrics (e.g., seam discontinuity error or overlap-region RMSE), are reported; this directly undermines the load-bearing assumption that the proposed components produce seamless metric depth without residual artifacts that would otherwise degrade panoramic accuracy.
minor comments (1)
  1. [Abstract] Abstract: The phrase "keeping the backbone largely compatible with standard Transformer designs" is vague; clarify the exact degree of modification in the method section.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback, which identifies opportunities to improve the clarity and completeness of our presentation. We address each major comment point-by-point below and will incorporate revisions to strengthen the manuscript.

  1. Referee: [Abstract] Abstract: The central claim of SOTA zero-shot performance on 13 datasets is asserted without any quantitative tables, error bars, ablation results, or implementation details on CCL formulation or virtual-camera injection, rendering the performance claims impossible to evaluate from the supplied text.

    Authors: We agree that the abstract, being a concise summary, does not include quantitative details or implementation specifics. The full manuscript reports the zero-shot results with comparisons in Table 1 (including standard deviations across runs) and provides implementation details in Sections 3 and 4. To make the central claims more immediately evaluable, we will revise the abstract to include key quantitative highlights, such as the average relative error reduction on the 13 datasets. This change will be incorporated in the revised version.
    revision: yes

  2. Referee: [Method] Method section: The Correspondence Consistency Loss (CCL) and virtual projection cameras are presented as the mechanisms that resolve geometric discrepancies and eliminate boundary artifacts, yet no explicit equations, loss formulation, or pseudocode are supplied to show how correspondence is enforced across patch overlaps at the precision required for metric (rather than relative) depth.

    Authors: The manuscript presents the CCL formulation in Equation (4) and describes the virtual camera projection in Section 3.2, including how correspondences are established via the projection matrices to enforce metric consistency. However, we acknowledge that additional explicit pseudocode and a step-by-step derivation of the metric-depth enforcement would improve clarity. We will add pseudocode for the CCL computation and expand the derivation in the revised method section to explicitly show the overlap correspondence mechanism.
    revision: yes

  3. Referee: [Experiments] Experiments section: No ablation isolating CCL or virtual cameras, and no boundary-specific continuity metrics (e.g., seam discontinuity error or overlap-region RMSE), are reported; this directly undermines the load-bearing assumption that the proposed components produce seamless metric depth without residual artifacts that would otherwise degrade panoramic accuracy.

    Authors: We agree that dedicated ablations and boundary-specific metrics would provide stronger evidence for the contribution of CCL and virtual cameras. In the revised manuscript, we will add an ablation study in Section 4.3 that isolates the impact of each component on both perspective and panoramic accuracy. We will also introduce boundary continuity metrics, specifically seam discontinuity error and overlap-region RMSE, to quantify the stitching quality before and after applying the proposed components.
    revision: yes

Circularity Check

0 steps flagged

No circularity; derivation relies on external data and new loss without self-reduction


The paper's core approach—decomposing panoramas into perspective patches, introducing CCL and virtual projection cameras, then training on a mixed dataset with only one panorama source to claim zero-shot SOTA—does not reduce any reported performance metric or geometric unification to a fitted parameter, self-citation chain, or definitional tautology. No equations are presented that equate outputs to inputs by construction, and the abstract contains no load-bearing self-citations or ansatzes smuggled from prior author work. The method is presented as leveraging independent perspective priors and a novel consistency loss, making the derivation self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 2 invented entities

The central claim rests on the unstated premise that standard perspective depth networks plus the new loss and virtual cameras suffice to handle spherical geometry without residual distortion or metric bias; no free parameters are named in the abstract, but the virtual cameras and CCL are introduced entities whose independent validation is not provided.

axioms (1)
  • domain assumption: Perspective projection models remain valid when applied to patches extracted from spherical panoramas
    Implicit in the decomposition strategy described in the abstract.
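One practical detail behind this assumption is the depth convention. A pinhole model usually predicts z-depth (distance along the optical axis), whereas panoramic depth is often stored as Euclidean distance from the camera centre. The sketch below shows the standard conversion between the two. The helper names are hypothetical, and it assumes a square patch with the principal point at the centre; whether DepthMaster handles the conversion this way is not stated in the source.

```python
import numpy as np

def _radial_factor(size, f):
    """sqrt(1 + x^2 + y^2) for normalised pixel coordinates (x, y)."""
    u, v = np.meshgrid(np.arange(size) + 0.5, np.arange(size) + 0.5)
    x, y = (u - size / 2) / f, (v - size / 2) / f
    return np.sqrt(1.0 + x**2 + y**2)

def z_to_range(z, f):
    """Convert pinhole z-depth to Euclidean distance from the camera centre."""
    return z * _radial_factor(z.shape[0], f)

def range_to_z(r, f):
    """Inverse conversion: Euclidean distance back to z-depth."""
    return r / _radial_factor(r.shape[0], f)

# Toy example: a flat wall 2 m in front of a 90° field-of-view, 64-pixel patch.
f = 32.0
wall_z = np.full((64, 64), 2.0)
wall_range = z_to_range(wall_z, f)
round_trip = range_to_z(wall_range, f)
```

For the flat wall, the range grows from about 2 m at the centre to roughly 3.4 m at the corners, which is why stitched panoramic depth cannot simply reuse per-patch z-depth values without a conversion of this kind.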
invented entities (2)
  • Correspondence Consistency Loss (CCL) · no independent evidence
    purpose: Enforce depth consistency across overlapping patches to avoid boundary artifacts
    Newly introduced to replace ad-hoc architectural modifications
  • virtual projection cameras · no independent evidence
    purpose: Provide geometric priors for stitching patches
    Injected as priors to unify inputs into canonical perspective representation

pith-pipeline@v0.9.1-grok · 5790 in / 1420 out tokens · 22450 ms · 2026-06-27T09:45:48.024890+00:00 · methodology


Reference graph

Works this paper leans on

90 extracted references · 3 canonical work pages

  1. [1]

    In: ACM SIGGRAPH 2024 Conference Pa- pers

    Binbin Huang, Zehao Yu, Anpei Chen, Andreas Geiger, and Shenghua Gao. 2d gaussian splatting for geometrically accurate radiance fields. InSIGGRAPH 2024 Conference Papers. Association for Computing Machinery, 2024. doi: 10.1145/3641519.3657428

  2. [2]

    Wonder3d: Single image to 3d using cross-domain diffusion

    Xiaoxiao Long, Yuan-Chen Guo, Cheng Lin, Yuan Liu, Zhiyang Dou, Lingjie Liu, Yuexin Ma, Song-Hai Zhang, Marc Habermann, Christian Theobalt, et al. Wonder3d: Single image to 3d using cross-domain diffusion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9970–9980, 2024

  3. [3]

    Shape of motion: 4d reconstruction from a single video.arXiv preprint arXiv:2407.13764, 2024

    Qianqian Wang, Vickie Ye, Hang Gao, Jake Austin, Zhengqi Li, and Angjoo Kanazawa. Shape of motion: 4d reconstruction from a single video.arXiv preprint arXiv:2407.13764, 2024. Visual Computing Lab·The Hong Kong Polytechnic University 9 / 25

  4. [4]

    Mosca: Dynamic gaussian fusion from casual videos via 4d motion scaffolds.arXiv preprint arXiv:2405.17421, 2024

    Jiahui Lei, Yijia Weng, Adam Harley, Leonidas Guibas, and Kostas Daniilidis. Mosca: Dynamic gaussian fusion from casual videos via 4d motion scaffolds.arXiv preprint arXiv:2405.17421, 2024

  5. [5]

    Spatialtracker: Tracking any 2d pixels in 3d space

    Yuxi Xiao, Qianqian Wang, Shangzhan Zhang, Nan Xue, Sida Peng, Yujun Shen, and Xiaowei Zhou. Spatialtracker: Tracking any 2d pixels in 3d space. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024

  6. [6]

    Asurveyofautonomousdriving: Common practices and emerging technologies.IEEE access, 8:58443–58469, 2020

    EkimYurtsever,JacobLambert,AlexanderCarballo,andKazuyaTakeda. Asurveyofautonomousdriving: Common practices and emerging technologies.IEEE access, 8:58443–58469, 2020

  7. [7]

    Planning-oriented autonomous driving

    YihanHu, JiazhiYang, LiChen, KeyuLi, ChonghaoSima, XizhouZhu, SiqiChai, SenyaoDu, TianweiLin, Wenhai Wang, et al. Planning-oriented autonomous driving. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 17853–17862, 2023

  8. [8]

    Open vocabulary 3d scene understanding via geometry guided self-distillation

    Pengfei Wang, Yuxi Wang, Shuai Li, Zhaoxiang Zhang, Zhen Lei, and Lei Zhang. Open vocabulary 3d scene understanding via geometry guided self-distillation. InEuropean Conference on Computer Vision, pages 442–460. Springer, 2024

  9. [9]

    Fast multi-view consistent 3d editing with video priors

    Liyi Chen, Ruihuang Li, Guowen Zhang, Pengfei Wang, and Lei Zhang. Fast multi-view consistent 3d editing with video priors. InProceedings of the AAAI Conference on Artificial Intelligence, volume 40, pages 2948–2956, 2026

  10. [10]

    One2scene: Geometric consistent explorable 3d scene generation from a single image.arXiv preprint arXiv:2602.19766, 2026

    Pengfei Wang, Liyi Chen, Zhiyuan Ma, Yanjun Guo, Guowen Zhang, and Lei Zhang. One2scene: Geometric consistent explorable 3d scene generation from a single image.arXiv preprint arXiv:2602.19766, 2026

  11. [11]

    Omni-3dedit: Generalized versatile 3d editing in one-pass

    Liyi Chen, Pengfei Wang, Guowen Zhang, Zhiyuan Ma, and Lei Zhang. Omni-3dedit: Generalized versatile 3d editing in one-pass. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12640–12650, 2026

  12. [12]

    Metric3d: Towards zero-shot metric 3d prediction from a single image

    WeiYin,ChiZhang,HaoChen,ZhipengCai,GangYu,KaixuanWang,XiaozhiChen,andChunhuaShen. Metric3d: Towards zero-shot metric 3d prediction from a single image. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 9043–9053, 2023

  13. [13]

    Metric3d v2: A versatile monocular geometric foundation model for zero-shot metric depth and surface normal estimation.arXiv preprint arXiv:2404.15506, 2024

    Mu Hu, Wei Yin, Chi Zhang, Zhipeng Cai, Xiaoxiao Long, Hao Chen, Kaixuan Wang, Gang Yu, Chunhua Shen, and Shaojie Shen. Metric3d v2: A versatile monocular geometric foundation model for zero-shot metric depth and surface normal estimation.arXiv preprint arXiv:2404.15506, 2024

  14. [15]

    Depth anything v2.arXiv preprint arXiv:2406.09414, 2024

    Lihe Yang, Bingyi Kang, Zilong Huang, Zhen Zhao, Xiaogang Xu, Jiashi Feng, and Hengshuang Zhao. Depth anything v2.arXiv preprint arXiv:2406.09414, 2024

  15. [16]

    René Ranftl, Katrin Lasinger, David Hafner, Konrad Schindler, and Vladlen Koltun. Towards robust monocular depth estimation: Mixing datasets for zero-shot cross-dataset transfer.IEEE transactions on pattern analysis and machine intelligence, 44(3):1623–1637, 2020

  16. [17]

    Moge: Unlocking accurate monocular geometry estimation for open-domain images with optimal training supervision

    RuichengWang,SichengXu,CassieDai,JianfengXiang,YuDeng,XinTong,andJiaolongYang. Moge: Unlocking accurate monocular geometry estimation for open-domain images with optimal training supervision. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 5261–5271, 2025

  17. [18]

    Learning to recover 3d scene shape from a single image.CoRR, abs/2012.09365, 2020

    Wei Yin, Jianming Zhang, Oliver Wang, Simon Niklaus, Long Mai, Simon Chen, and Chunhua Shen. Learning to recover 3d scene shape from a single image.CoRR, abs/2012.09365, 2020. URL https://arxiv.org/abs/2012.09365

  18. [19]

    Unidepth: Universal monocular metric depth estimation

    Luigi Piccinelli, Yung-Hsu Yang, Christos Sakaridis, Mattia Segu, Siyuan Li, Luc Van Gool, and Fisher Yu. Unidepth: Universal monocular metric depth estimation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10106–10116, 2024

  19. [21]

    Moge-2: Accurate monocular geometry with metric scale and sharp details.arXiv preprint arXiv:2507.02546, 2025

    Ruicheng Wang, Sicheng Xu, Yue Dong, Yu Deng, Jianfeng Xiang, Zelong Lv, Guangzhong Sun, Xin Tong, and Jiaolong Yang. Moge-2: Accurate monocular geometry with metric scale and sharp details.arXiv preprint arXiv:2507.02546, 2025. Visual Computing Lab·The Hong Kong Polytechnic University 10 / 25

  20. [22]

    Depth any panoramas: A foundation model for panoramic depth estimation.arXiv, 2025

    Xin Lin, Meixi Song, Dizhe Zhang, Wenxuan Lu, Haodong Li, Bo Du, Ming-Hsuan Yang, Truong Nguyen, and Lu Qi. Depth any panoramas: A foundation model for panoramic depth estimation.arXiv, 2025

  21. [23]

    Unik3d: Universalcameramonocular3destimation

    Luigi Piccinelli, Christos Sakaridis, Mattia Segu, Yung-Hsu Yang, Siyuan Li, Wim Abbeloos, and Luc Van Gool. Unik3d: Universalcameramonocular3destimation. InProceedingsoftheComputerVisionandPatternRecognition Conference, pages 1028–1039, 2025

  22. [24]

    Depth any camera: Zero-shot metric depth estimation from any camera

    Yuliang Guo, Sparsh Garg, S Mahdi H Miangoleh, Xinyu Huang, and Liu Ren. Depth any camera: Zero-shot metric depth estimation from any camera. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 26996–27006, 2025

  23. [25]

    Ning-Hsu Albert Wang and Yu-Lun Liu. Depth anywhere: Enhancing 360 monocular depth estimation via perspective distillation and unlabeled data augmentation.Advances in Neural Information Processing Systems, 37: 127739–127764, 2024

  24. [26]

    Bifuse: Monocular360depthestimation via bi-projection fusion

    Fu-EnWang,Yu-HsuanYeh,MinSun,Wei-ChenChiu,andYi-HsuanTsai. Bifuse: Monocular360depthestimation via bi-projection fusion. InCVPR, pages 459–468. Computer Vision Foundation / IEEE, 2020

  25. [27]

    FoVA-Depth: Field-of-view agnostic depth estimation for cross-dataset generalization

    Daniel Lichy, Hang Su, Abhishek Badki, Jan Kautz, and Orazio Gallo. FoVA-Depth: Field-of-view agnostic depth estimation for cross-dataset generalization. InInternational Conference on 3D Vision (3DV), 2024

  26. [28]

    Hush: Holistic panoramic 3d scene understanding using spherical harmonics

    Jongsung Lee, Harin Park, Byeong-Uk Lee, and Kyungdon Joo. Hush: Holistic panoramic 3d scene understanding using spherical harmonics. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 16599–16608, 2025

  27. [29]

    Depth map prediction from a single image using a multi-scale deep network.Advances in neural information processing systems, 27, 2014

    David Eigen, Christian Puhrsch, and Rob Fergus. Depth map prediction from a single image using a multi-scale deep network.Advances in neural information processing systems, 27, 2014

  28. [30]

    Deep ordinal regression network for monocular depth estimation

    Huan Fu, Mingming Gong, Chaohui Wang, Kayhan Batmanghelich, and Dacheng Tao. Deep ordinal regression network for monocular depth estimation. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 2002–2011, 2018

  29. [31]

    Neural window fully-connected crfs for monoculardepthestimation

    Weihao Yuan, Xiaodong Gu, Zuozhuo Dai, Siyu Zhu, and Ping Tan. Neural window fully-connected crfs for monoculardepthestimation. InProceedingsoftheIEEE/CVFconferenceoncomputervisionandpatternrecognition, pages 3916–3925, 2022

  30. [32]

    Vision transformers for dense prediction

    René Ranftl, Alexey Bochkovskiy, and Vladlen Koltun. Vision transformers for dense prediction. InProceedings of the IEEE/CVF international conference on computer vision, pages 12179–12188, 2021

  31. [33]

    Virtual normal: Enforcing geometric constraints for accurate and robust depth prediction.IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(10):7282–7295, 2021

    Wei Yin, Yifan Liu, and Chunhua Shen. Virtual normal: Enforcing geometric constraints for accurate and robust depth prediction.IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(10):7282–7295, 2021

  32. [34]

    Depth anything: Unleashing the power of large-scale unlabeled data

    Lihe Yang, Bingyi Kang, Zilong Huang, Xiaogang Xu, Jiashi Feng, and Hengshuang Zhao. Depth anything: Unleashing the power of large-scale unlabeled data. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10371–10381, 2024

  33. [35]

    Diffusion models trained with large data are transferable visual models.arXiv preprint arXiv:2403.06090, 2024

    Guangkai Xu, Yongtao Ge, Mingyu Liu, Chengxiang Fan, Kangyang Xie, Zhiyue Zhao, Hao Chen, and Chunhua Shen. Diffusion models trained with large data are transferable visual models.arXiv preprint arXiv:2403.06090, 2024

  34. [36]

    Repurposing diffusion-based image generators for monocular depth estimation

    Bingxin Ke, Anton Obukhov, Shengyu Huang, Nando Metzger, Rodrigo Caye Daudt, and Konrad Schindler. Repurposing diffusion-based image generators for monocular depth estimation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9492–9502, 2024

  35. [37]

    Geowizard: Unleashing the diffusion priors for 3d geometry estimation from a single image.arXiv preprint arXiv:2403.12013, 2024

    Xiao Fu, Wei Yin, Mu Hu, Kaixuan Wang, Yuexin Ma, Ping Tan, Shaojie Shen, Dahua Lin, and Xiaoxiao Long. Geowizard: Unleashing the diffusion priors for 3d geometry estimation from a single image.arXiv preprint arXiv:2403.12013, 2024

  36. [38]

    Hohonet: 360 indoor holistic understanding with latent horizontal features

    Cheng Sun, Min Sun, and Hwann-Tzong Chen. Hohonet: 360 indoor holistic understanding with latent horizontal features. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2573–2582, 2021

  37. [39]

    Elite360d: Towards efficient 360 depth estimation via semantic-and distance-aware bi-projection fusion

    Hao Ai and Lin Wang. Elite360d: Towards efficient 360 depth estimation via semantic-and distance-aware bi-projection fusion. InCVPR, 2024

  38. [40]

    Spherefusion: Efficient panorama depth estimation via gated fusion

    Qingsong Yan, Qiang Wang, Kaiyong Zhao, Jie Chen, Bo Li, Xiaowen Chu, and Fei Deng. Spherefusion: Efficient panorama depth estimation via gated fusion. InInternational Conference on 3D Vision 2025, 2025. Visual Computing Lab·The Hong Kong Polytechnic University 11 / 25

  39. [41]

    Learning spherical convolution for fast features from 360°imagery

    Yu-Chuan Su and Kristen Grauman. Learning spherical convolution for fast features from 360°imagery. In Isabelle Guyon, Ulrike von Luxburg, Samy Bengio, Hanna M. Wallach, Rob Fergus, S. V. N. Vishwanathan, and Roman Garnett, editors,Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, Decemb...

  40. [42]

    Efficient deformable convnets: Rethinking dynamic and sparse operator for vision applications

    Yuwen Xiong, Zhiqi Li, Yuntao Chen, Feng Wang, Xizhou Zhu, Jiapeng Luo, Wenhai Wang, Tong Lu, Hongsheng Li, Yu Qiao, Lewei Lu, Jie Zhou, and Jifeng Dai. Efficient deformable convnets: Rethinking dynamic and sparse operator for vision applications. 2024

  41. [43]

    Omnifusion: 360 monocular depth estimation via geometry-aware fusion

    Yuyan Li, Yuliang Guo, Zhixin Yan, Xinyu Huang, Ye Duan, and Liu Ren. Omnifusion: 360 monocular depth estimation via geometry-aware fusion. InIEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2022, New Orleans, LA, USA, June 18-24, 2022, 2022

  42. [44]

    Panoformer: Panorama transformer for indoor 360$ˆ{\circ }$ depth estimation

    Zhijie Shen, Chunyu Lin, Kang Liao, Lang Nie, Zishuo Zheng, and Yao Zhao. Panoformer: Panorama transformer for indoor 360$ˆ{\circ }$ depth estimation. In Shai Avidan, Gabriel J. Brostow, Moustapha Cissé, Giovanni Maria Farinella, and Tal Hassner, editors,Computer Vision - ECCV 2022 - 17th European Conference, Tel Aviv, Israel, October 23-27, 2022, Proceed...

  43. [46]

    Dinov2: Learning robust visual features without supervision.arXiv preprint arXiv:2304.07193, 2023

    Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. Dinov2: Learning robust visual features without supervision.arXiv preprint arXiv:2304.07193, 2023

  44. [47]

    Depth anything 3: Recovering the visual space from any views.arXiv preprint arXiv:2511.10647, 2025

    Haotong Lin, Sili Chen, Junhao Liew, Donny Y Chen, Zhenyu Li, Guang Shi, Jiashi Feng, and Bingyi Kang. Depth anything 3: Recovering the visual space from any views.arXiv preprint arXiv:2511.10647, 2025

  45. [48]

    Segformer: Simple and efficient design for semantic segmentation with transformers.Advances in neural information processing systems, 34:12077–12090, 2021

    Enze Xie, Wenhai Wang, Zhiding Yu, Anima Anandkumar, Jose M Alvarez, and Ping Luo. Segformer: Simple and efficient design for semantic segmentation with transformers.Advances in neural information processing systems, 34:12077–12090, 2021

  46. [49]

    Structured3d: A large photo-realistic dataset for structured 3d modeling

    Jia Zheng, Junfei Zhang, Jing Li, Rui Tang, Shenghua Gao, and Zihan Zhou. Structured3d: A large photo-realistic dataset for structured 3d modeling. InProceedings of The European Conference on Computer Vision (ECCV), 2020

  47. [50]

    Da ˆ{2}: Depth anything in any direction.arXiv preprint arXiv:2509.26618, 2025

    Haodong Li, Wangguangdong Zheng, Jing He, Yuhao Liu, Xin Lin, Xin Yang, Ying-Cong Chen, and Chunchao Guo. Da ˆ{2}: Depth anything in any direction.arXiv preprint arXiv:2509.26618, 2025

  48. [51]

    Joint2d-3d-semanticdataforindoorsceneunderstanding

    IroArmeni,SashaSax,AmirRZamir,andSilvioSavarese. Joint2d-3d-semanticdataforindoorsceneunderstanding. arXiv preprint arXiv:1702.01105, 2017

  49. [52]

    Matterport3d: Learningfromrgb-ddatainindoorenvironments

    Angel Chang, Angela Dai, Thomas Funkhouser, Maciej Halber, Matthias Niebner, Manolis Savva, Shuran Song, AndyZeng,andYindaZhang. Matterport3d: Learningfromrgb-ddatainindoorenvironments. In2017International Conference on 3D Vision (3DV), pages 667–676. IEEE Computer Society, 2017

  50. [53]

    Self-supervised learning of depth and camera motion from 360 videos

    Fu-En Wang, Hou-Ning Hu, Hsien-Tzu Cheng, Juan-Ting Lin, Shang-Ta Yang, Meng-Li Shih, Hung-Kuo Chu, and Min Sun. Self-supervised learning of depth and camera motion from 360 videos. InAsian Conference on Computer Vision, pages 53–68. Springer, 2018

  51. [54]

    Depth anything in 360◦: Towards scale invariance in the wild.arXiv preprint arXiv:2512.22819, 2025

    Hualie Jiang, Ziyang Song, Zhiqiang Lou, Rui Xu, and Minglang Tan. Depth anything in 360◦: Towards scale invariance in the wild.arXiv preprint arXiv:2512.22819, 2025

  52. [55]

    Indoor segmentation and support inference from rgbd images

    Pushmeet Kohli Nathan Silberman, Derek Hoiem and Rob Fergus. Indoor segmentation and support inference from rgbd images. InECCV, 2012

  53. [56]

    Sparsity invariant cnns

    Jonas Uhrig, Nick Schneider, Lukas Schneider, Uwe Franke, Thomas Brox, and Andreas Geiger. Sparsity invariant cnns. InInternational Conference on 3D Vision (3DV), 2017

  54. [57]

    BAD SLAM: Bundle adjusted direct RGB-D SLAM

    Thomas Schöps, Torsten Sattler, and Marc Pollefeys. BAD SLAM: Bundle adjusted direct RGB-D SLAM. In Conference on Computer Vision and Pattern Recognition (CVPR), 2019

  55. [58]

    Computer Vision and Image Understanding (CVIU)191, 102877 (2020).https: //doi.org/10.1016/j.cviu.2019.102877

    Tobias Koch, Lukas Liebel, Marco Körner, and Friedrich Fraundorfer. Comparison of monocular depth estimation methods using geometrically relevant metrics on the ibims-1 dataset.Computer Vision and Image Understanding (CVIU), 191:102877, 2020. doi: 10.1016/j.cviu.2019.102877. Visual Computing Lab·The Hong Kong Polytechnic University 12 / 25

  56. [59]

    Evaluation of cnn-based single-image depth estimation methods

    Tobias Koch, Lukas Liebel, Friedrich Fraundorfer, and Marco Körner. Evaluation of cnn-based single-image depth estimation methods. In Laura Leal-Taixé and Stefan Roth, editors,Proceedings of the European Conference on Computer Vision Workshops (ECCV-WS), pages 331–348. Springer International Publishing, 2019. doi: 10.1007/978-3-030-11015-4_25. URL http://...

  57. [60]

    Google scanned objects: A high-quality dataset of 3d scanned household items

    Laura Downs, Anthony Francis, Nate Koenig, Brandon Kinman, Ryan Hickman, Krista Reymann, Thomas B. McHugh, and Vincent Vanhoucke. Google scanned objects: A high-quality dataset of 3d scanned household items,

  58. [61]

    URL https://arxiv.org/abs/2204.11918

  59. [62]

    D. J. Butler, J. Wulff, G. B. Stanley, and M. J. Black. A naturalistic open source movie for optical flow evaluation. In A. Fitzgibbon et al. (Eds.), editor, European Conf. on Computer Vision (ECCV), Part IV, LNCS 7577, pages 611–625. Springer-Verlag, October 2012

  60. [63]

    3d packing for self-supervised monocular depth estimation

    Vitor Guizilini, Rares Ambrus, Sudeep Pillai, Allan Raventos, and Adrien Gaidon. 3d packing for self-supervised monocular depth estimation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2020

  61. [64]

    DIODE: A Dense Indoor and Outdoor DEpth Dataset

    Igor Vasiljevic, Nick Kolkin, Shanyi Zhang, Ruotian Luo, Haochen Wang, Falcon Z. Dai, Andrea F. Daniele, Mohammadreza Mostajabi, Steven Basart, Matthew R. Walter, and Gregory Shakhnarovich. DIODE: A Dense Indoor and Outdoor DEpth Dataset. CoRR, abs/1908.00463, 2019. URL http://arxiv.org/abs/1908.00463

  62. [65]

    Spring: A high-resolution high-detail dataset and benchmark for scene flow, optical flow and stereo

    Lukas Mehl, Jenny Schmalfuss, Azin Jahedi, Yaroslava Nalivayko, and Andrés Bruhn. Spring: A high-resolution high-detail dataset and benchmark for scene flow, optical flow and stereo. In Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023

  63. [66]

    On the importance of accurate geometry data for dense 3d vision tasks

    HyunJun Jung, Patrick Ruhkamp, Guangyao Zhai, Nikolas Brasch, Yitong Li, Yannick Verdie, Jifei Song, Yiren Zhou, Anil Armagan, Slobodan Ilic, et al. On the importance of accurate geometry data for dense 3d vision tasks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 780–791, 2023

  64. [67]

    Omnidepth: Dense depth estimation for indoors spherical panoramas

    Nikolaos Zioulis, Antonis Karakottas, Dimitrios Zarpalas, and Petros Daras. Omnidepth: Dense depth estimation for indoors spherical panoramas. In Proceedings of the European Conference on Computer Vision (ECCV), pages 448–465, 2018

  65. [68]

    Bifuse: Monocular 360 depth estimation via bi-projection fusion

    Fu-En Wang, Yu-Hsuan Yeh, Min Sun, Wei-Chen Chiu, and Yi-Hsuan Tsai. Bifuse: Monocular 360 depth estimation via bi-projection fusion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 462–471, 2020

  66. [69]

    Bifuse++: Self-supervised and efficient bi-projection fusion for 360 depth estimation

    Fu-En Wang, Yu-Hsuan Yeh, Yi-Hsuan Tsai, Wei-Chen Chiu, and Min Sun. Bifuse++: Self-supervised and efficient bi-projection fusion for 360 depth estimation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(5):5448–5460, 2022

  67. [70]

    Unifuse: Unidirectional fusion for 360 panorama depth estimation

    Hualie Jiang, Zhe Sheng, Siyu Zhu, Zilong Dong, and Rui Huang. Unifuse: Unidirectional fusion for 360 panorama depth estimation. IEEE Robotics and Automation Letters, 6(2):1519–1526, 2021

  68. [71]

    Panoformer: panorama transformer for indoor 360° depth estimation

    Zhijie Shen, Chunyu Lin, Kang Liao, Lang Nie, Zishuo Zheng, and Yao Zhao. Panoformer: panorama transformer for indoor 360° depth estimation. In European Conference on Computer Vision, pages 195–211. Springer, 2022

  69. [72]

    Hrdfuse: Monocular 360° depth estimation by collaboratively learning holistic-with-regional depth distributions

    Hao Ai, Zidong Cao, Yan-Pei Cao, Ying Shan, and Lin Wang. Hrdfuse: Monocular 360° depth estimation by collaboratively learning holistic-with-regional depth distributions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13273–13282, 2023

  70. [73]

    Panda: Towards panoramic depth anything with unlabeled panoramas and mobius spatial augmentation

    Zidong Cao, Jinjing Zhu, Weiming Zhang, Hao Ai, Haotian Bai, Hengshuang Zhao, and Lin Wang. Panda: Towards panoramic depth anything with unlabeled panoramas and mobius spatial augmentation. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 982–992, 2025

  71. [74]

    Zoedepth: Zero-shot transfer by combining relative and metric depth

    Shariq Farooq Bhat, Reiner Birkl, Diana Wofk, Peter Wonka, and Matthias Müller. Zoedepth: Zero-shot transfer by combining relative and metric depth. arXiv preprint arXiv:2302.12288, 2023

  72. [75]

    Grounding image matching in 3d with mast3r

    Vincent Leroy, Yohann Cabon, and Jerome Revaud. Grounding image matching in 3d with mast3r, 2024

  73. [76]

    Depth pro: Sharp monocular metric depth in less than a second

    Aleksei Bochkovskii, Amaël Delaunoy, Hugo Germain, Marcel Santos, Yichao Zhou, Stephan R Richter, and Vladlen Koltun. Depth pro: Sharp monocular metric depth in less than a second. arXiv preprint arXiv:2410.02073, 2024

  74. [77]

    Unidepthv2: Universal monocular metric depth estimation made simpler

    Luigi Piccinelli, Christos Sakaridis, Yung-Hsu Yang, Mattia Segu, Siyuan Li, Wim Abbeloos, and Luc Van Gool. Unidepthv2: Universal monocular metric depth estimation made simpler. arXiv preprint arXiv:2502.20110, 2025.

  75. [78]

    Hypersim: A photorealistic synthetic dataset for holistic indoor scene understanding

    Mike Roberts, Jason Ramapuram, Anurag Ranjan, Atulit Kumar, Miguel Angel Bautista, Nathan Paczan, Russ Webb, and Joshua M. Susskind. Hypersim: A photorealistic synthetic dataset for holistic indoor scene understanding. In International Conference on Computer Vision (ICCV) 2021, 2021

  76. [79]

    Jakob Geyer, Yohannes Kassahun, Mentar Mahmudi, Xavier Ricou, Rupesh Durgesh, Andrew S. Chung, Lorenz Hauswald, Viet Hoang Pham, Maximilian Mühlegg, Sebastian Dorn, Tiffany Fernandez, Martin Jänicke, Sudesh Mirashi, Chiragkumar Savani, Martin Sturm, Oleksandr Vorobiov, Martin Oelker, Sebastian Garreis, and Peter Schuberth. A2D2: Audi Autonomous Driving Da...

  77. [80]

    Deepmvs: Learning multi-view stereopsis

    Po-Han Huang, Kevin Matzen, Johannes Kopf, Narendra Ahuja, and Jia-Bin Huang. Deepmvs: Learning multi-view stereopsis. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018

  78. [81]

    Flow-motion and depth network for monocular stereo and beyond

    Kaixuan Wang and Shaojie Shen. Flow-motion and depth network for monocular stereo and beyond. CoRR, abs/1909.05452, 2019. URL http://arxiv.org/abs/1909.05452

  79. [82]

    Jose L. Gómez, Manuel Silva, Antonio Seoane, Agnès Borrás, Mario Noriega, Germán Ros, Jose A. Iglesias-Guitian, and Antonio M. López. All for one, and one for all: Urbansyn dataset, the third musketeer of synthetic driving scenes, 2023

  80. [83]

    Tartanair: A dataset to push the limits of visual slam

    Wenshan Wang, Delong Zhu, Xiangwei Wang, Yaoyu Hu, Yuheng Qiu, Chen Wang, Yafei Hu, Ashish Kapoor, and Sebastian Scherer. Tartanair: A dataset to push the limits of visual slam. 2020
