arxiv: 2605.26368 · v2 · pith:UYIA2VGHnew · submitted 2026-05-25 · 💻 cs.CV · cs.AI

Unified Panoramic Geometry Estimation via Multi-View Foundation Models

Pith reviewed 2026-06-29 22:06 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords panoramic geometry · depth estimation · surface normals · foundation models · omnidirectional images · zero-shot performance · 3D reconstruction

The pith

PaGeR adapts pre-trained 3D foundation models to predict scale-invariant depth, metric depth, surface normals and sky masks from both perspective images and panoramas in one forward pass.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents PaGeR as a way to extend existing 3D reconstruction models from ordinary photos to full 360-degree panoramas without major redesign. It starts from a transformer already trained on perspective images, applies only small architectural tweaks, and trains on a mixture of both image types. The result is a single model that handles either input and outputs multiple geometry quantities at once while keeping the original 3D knowledge intact. This matters because it opens the door to recovering complete scene structure from a single panoramic shot, something that current perspective-only models cannot do directly.

Core claim

PaGeR lifts powerful 3D foundation models designed for perspective imagery to the panorama domain. The strategy keeps architectural changes to a minimum and mixes perspective and panoramic images during training so the model retains its rich 3D prior while also learning to estimate geometrically consistent 360-degree scenes from single panoramas. The unified model predicts scale-invariant depth, metric depth, surface normals, and sky masks from both perspective and omnidirectional images in a single forward pass.
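The mixed-domain training idea can be illustrated with a small batch sampler. This is only a sketch in plain Python: the function name `mixed_batches`, the `pano_fraction` knob, and the file names are hypothetical, and the paper summary does not state what mixing ratio or sampling scheme the authors actually use.

```python
import random


def mixed_batches(perspective, panoramic, batch_size, pano_fraction, steps, seed=0):
    """Yield batches that mix samples from the two image domains.

    pano_fraction is a hypothetical knob: the share of each batch drawn
    from the panoramic pool (the source does not state a ratio).
    """
    rng = random.Random(seed)
    n_pano = round(batch_size * pano_fraction)
    n_persp = batch_size - n_pano
    for _ in range(steps):
        # Tag every sample with its domain so a model could pick the right projection.
        batch = [("panoramic", rng.choice(panoramic)) for _ in range(n_pano)]
        batch += [("perspective", rng.choice(perspective)) for _ in range(n_persp)]
        rng.shuffle(batch)
        yield batch


# Placeholder file names, for illustration only.
batches = list(
    mixed_batches(["p0.jpg", "p1.jpg"], ["e0.jpg"], batch_size=4, pano_fraction=0.5, steps=3)
)
```

Keeping both domains in every batch is one plausible way to keep gradients from drifting toward a single projection type; alternating whole batches per domain would be another.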

What carries the argument

The minimally modified pre-trained transformer trained on mixed perspective and panoramic data that produces unified geometry outputs for both image domains.

If this is right

  • Achieves state-of-the-art performance on indoor and outdoor environments.
  • Delivers strong zero-shot generalization across a wide range of scenes.
  • Produces consistent full-scene geometry from a single panoramic input.
  • Supports both perspective and omnidirectional images without separate models or passes.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same minimal-adaptation pattern could be tested on other geometry tasks such as optical flow or semantic segmentation.
  • If the mixed-training approach generalizes, it may reduce the need for fully separate panoramic training datasets in future work.
  • Applications that require quick 360-degree reconstruction, such as virtual walkthroughs, would benefit directly from a single-pass model.

Load-bearing premise

Mixing perspective and panoramic images during training together with only minimal architectural changes is enough to preserve the original 3D prior and avoid domain-specific inconsistencies in the 360-degree outputs.

What would settle it

A test showing that performance on perspective images drops or that panoramic outputs contain geometric inconsistencies after the mixed training would indicate the central claim does not hold.
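A retention check of this kind could look roughly as follows. The sketch uses the standard absolute relative error (AbsRel) on flattened depth values; the function `prior_retained`, its 5% tolerance, and the toy numbers are assumptions made for illustration, not values or procedures taken from the paper.

```python
def abs_rel(pred, gt):
    """Mean absolute relative depth error over pixels with positive ground truth."""
    pairs = [(p, g) for p, g in zip(pred, gt) if g > 0]
    if not pairs:
        raise ValueError("no valid ground-truth pixels")
    return sum(abs(p - g) / g for p, g in pairs) / len(pairs)


def prior_retained(err_before, err_after, tolerance=0.05):
    """True if the error grew by at most `tolerance` (relative).

    The 5% default is a hypothetical threshold, not one from the paper.
    """
    return err_after <= err_before * (1.0 + tolerance)


# Toy perspective-image depths in meters; a ground truth of 0.0 marks an invalid pixel.
gt = [2.0, 4.0, 0.0, 5.0]
base_pred = [2.2, 3.6, 1.0, 5.0]   # stand-in for the base model's prediction
mixed_pred = [2.1, 3.8, 1.0, 5.0]  # stand-in for the prediction after mixed training

err_before = abs_rel(base_pred, gt)
err_after = abs_rel(mixed_pred, gt)
print(prior_retained(err_before, err_after))
```

The same comparison would need to be run on a held-out perspective benchmark, with the base model and the mixed-trained model evaluated under identical alignment and masking.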

Figures

Figures reproduced from arXiv: 2605.26368 by Denis Rozumny, Dominik Narnhofer, Isidora Slavkovic, Konrad Schindler, Nando Metzger, Nikolai Kalischek, Vukasin Bozic.

Figure 1: Different geometric modalities predicted by PaGeR. Given only a single monocular panoramic image, our framework simultaneously reconstructs highly detailed scale-invariant depth, absolute metric depth, surface normals, and sky segmentation masks across both indoor and outdoor environments. Spatial measurements are presented in meters.

Figure 2: PaGeR Architecture. An input RGB panorama is processed by a shared geometry transformer backbone to predict a sky mask, scale-invariant (SI) depth, surface normals, and coarse metric depth. In the metric branch, the final absolute depth is obtained by aligning the SI depth with coarse metric predictions and masking the sky. In the normal branch, predicted orientations are masked by the sky segmentation to …

Figure 3: Qualitative comparison of panoramic depth estimation. Visual results from PaGeR, DAP [27], and DA2 [23] (the strongest metric and scale-invariant baselines) alongside the RGB input and ground-truth depth on Matterport3D360, Stanford2D3DS, and ZüriPano. Our framework recovers sharper boundaries and more accurate global structures than competitors. Additional examples are in the appendix. Best viewed zoomed …

Figure 4: Qualitative comparison of panoramic surface normals estimation. Results from PaGeR and MTL (best available baseline method), shown alongside the RGB input and ground-truth depth on panoramas from the Structured3D dataset. (Best viewed zoomed in.)

Figure 5: Examples of RGB, depth, and surface normal panoramas from our PanoInfinigen dataset.

Figure 6: Samples of our ZüriPano dataset.

Figure 7: Qualitative comparison of panoramic depth estimation. Visual results from PaGeR, compared to the subset of the best evaluated baselines, shown alongside the RGB input and ground-truth depth on Matterport3D360, Stanford2D3DS, and ZüriPano panoramas. (Best viewed zoomed in.)

Figure 8: Comparison to vanilla DA3. Panel labels: Input, DA3, Ours.

Figure 9: Qualitative point-cloud comparison. Indoor scenes (top) and an outdoor scene (bottom) are rendered as point clouds alongside the corresponding panoramic input images for competitors and our method. For the indoor examples, we show our point cloud reconstruction with zoomed-in novel-view rendering comparison to the main competitors, highlighted by red boxes

Figure 10: Examples of measured distances in our predicted point cloud. The measures are given in …

Figure 11: Qualitative comparison of panoramic surface normals estimation. Visual results from PaGeR and MTL (best available baseline method), shown alongside the RGB input and ground-truth depth on panoramas from the Structured3D dataset. (Best viewed zoomed in.)
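The Figure 2 caption says the final metric depth comes from aligning the scale-invariant (SI) depth with the coarse metric prediction and then masking the sky. The captions do not say how that alignment is computed, so the sketch below assumes one common choice: a single least-squares scale factor fitted over non-sky pixels. The names `fit_scale` and `align_and_mask`, and the use of infinity for sky pixels, are illustrative assumptions. Depth maps are flattened to plain lists to keep the example dependency-free.

```python
import math


def fit_scale(si_depth, coarse_metric, sky):
    """Least-squares scale s minimizing sum((s * d - m)^2) over non-sky pixels."""
    num = sum(d * m for d, m, k in zip(si_depth, coarse_metric, sky) if not k)
    den = sum(d * d for d, k in zip(si_depth, sky) if not k)
    if den == 0:
        raise ValueError("no usable non-sky pixels to fit a scale")
    return num / den


def align_and_mask(si_depth, coarse_metric, sky):
    """Scale the SI depth to meters and mark sky pixels as infinitely far (assumed convention)."""
    s = fit_scale(si_depth, coarse_metric, sky)
    return [math.inf if k else s * d for d, k in zip(si_depth, sky)]


# Toy values: the last pixel is sky, so it is excluded from the fit.
si = [0.1, 0.2, 0.4, 0.9]
coarse = [1.0, 2.0, 4.0, 50.0]
sky = [False, False, False, True]

metric = align_and_mask(si, coarse, sky)
```

In this setup the fine detail comes from the SI branch while the coarse branch only fixes the global scale, which is one way to read the split into two depth outputs.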
Abstract

Geometry estimation from perspective images has greatly advanced, maturing to the point where off-the-shelf foundation models are able to reconstruct 3D scene structure not only from multi-view imagery, but even from a single view. A natural extension is 3D reconstruction from panoramas, with the exciting prospect of recovering a full 360-degree scene from a single panoramic image. In this work, we introduce PaGeR (Panoramic Geometry Reconstruction), a framework to lift powerful 3D foundation models designed for perspective imagery to the panorama domain. Our strategy is to start from a pre-trained transformer for 3D reconstruction and turn it into a unified high-performance model that predicts scale-invariant depth, metric depth, surface normals, and sky masks from both perspective and omnidirectional images, in a single forward pass. By keeping architectural changes to a minimum and mixing perspective and panoramic images during training, PaGeR retains the rich 3D prior of the underlying foundation model while learning to also estimate geometrically consistent 360-degree scenes from single panoramas. We extensively test our method in both indoor and outdoor environments and find that it delivers state-of-the-art performance and excellent zero-shot performance across a wide range of scenes. Code, data and models are available at https://github.com/prs-eth/PaGeR.
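Recovering a 360-degree scene from a panoramic depth map amounts to pushing each pixel out along its viewing ray. The sketch below shows the standard equirectangular mapping in plain Python. The axis convention (y up, z forward), the assumption that depth is radial distance along the ray, and the function names are choices made for this illustration; the paper's own convention may differ.

```python
import math


def equirect_ray(u, v, width, height):
    """Unit viewing ray for the center of pixel (u, v) in an equirectangular image.

    Assumed convention: longitude runs from -pi to pi left to right,
    latitude from +pi/2 to -pi/2 top to bottom, y is up, z is forward.
    """
    lon = ((u + 0.5) / width - 0.5) * 2.0 * math.pi
    lat = (0.5 - (v + 0.5) / height) * math.pi
    return (
        math.cos(lat) * math.sin(lon),
        math.sin(lat),
        math.cos(lat) * math.cos(lon),
    )


def depth_to_points(depth, sky=None):
    """Turn rows of radial depths (meters) into 3D points, skipping sky pixels."""
    height, width = len(depth), len(depth[0])
    points = []
    for v, row in enumerate(depth):
        for u, d in enumerate(row):
            if sky is not None and sky[v][u]:
                continue
            x, y, z = equirect_ray(u, v, width, height)
            points.append((d * x, d * y, d * z))
    return points


# Toy 4x2 panorama: every surface is 2 m away, one pixel is sky.
depth = [[2.0] * 4 for _ in range(2)]
sky = [[True, False, False, False], [False] * 4]
points = depth_to_points(depth, sky)
```

With metric depth, Euclidean distances between the resulting points are in meters, which is what makes measurements inside the predicted point cloud possible.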

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces PaGeR, a framework that adapts pre-trained transformer-based 3D foundation models for perspective images to the panoramic domain. By making minimal architectural changes and training on a mix of perspective and panoramic images, it enables a single model to predict scale-invariant depth, metric depth, surface normals, and sky masks from both perspective and omnidirectional images in one forward pass. The authors claim state-of-the-art performance and strong zero-shot generalization across indoor and outdoor scenes.

Significance. If the experimental results hold, this work would be significant for providing a unified approach to 3D geometry estimation that bridges perspective and 360-degree imagery using foundation models. The release of code, data, and models enhances reproducibility and potential impact in computer vision applications involving panoramic scenes.

major comments (2)
  1. [Training procedure and ablations (likely §3-4)] The central claim that minimal architectural changes plus mixed training on perspective and panoramic data is sufficient to retain the base model's 3D prior (without domain interference from equirectangular distortions or wrap-around topology) is load-bearing but not yet substantiated by the provided abstract. Specific ablations comparing perspective-only performance before and after adding panoramic data are required to address the risk of compromised attention patterns or feature statistics.
  2. [Abstract and Experiments] The abstract asserts SOTA and excellent zero-shot results across scenes, but supplies no quantitative metrics, dataset details, or ablation evidence. Full experimental sections with tables reporting metrics on standard benchmarks (e.g., perspective depth/normal accuracy pre/post-mixing, panoramic consistency measures) would be required to assess whether the data support the claim.
minor comments (1)
  1. [Abstract] The abstract could benefit from including one or two key quantitative results (e.g., relative improvement on a panoramic benchmark) to ground the SOTA claim.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful and constructive feedback. The comments highlight important aspects of substantiating our central claims regarding the mixed-training strategy and the need for clearer quantitative support in the abstract. We address each point below and indicate the revisions we will make.

Point-by-point responses
  1. Referee: [Training procedure and ablations (likely §3-4)] The central claim that minimal architectural changes plus mixed training on perspective and panoramic data is sufficient to retain the base model's 3D prior (without domain interference from equirectangular distortions or wrap-around topology) is load-bearing but not yet substantiated by the provided abstract. Specific ablations comparing perspective-only performance before and after adding panoramic data are required to address the risk of compromised attention patterns or feature statistics.

    Authors: We agree that explicit evidence of retained perspective performance after mixed training is essential to support the claim. While Section 4 already includes ablations on mixed vs. panoramic-only training and reports perspective-task metrics, it does not contain the exact before/after comparison on a held-out perspective benchmark. We will add this ablation (training the model on perspective data only, then continuing with mixed data, and evaluating both on standard perspective depth/normal benchmarks) to the revised manuscript. revision: yes

  2. Referee: [Abstract and Experiments] The abstract asserts SOTA and excellent zero-shot results across scenes, but supplies no quantitative metrics, dataset details, or ablation evidence. Full experimental sections with tables reporting metrics on standard benchmarks (e.g., perspective depth/normal accuracy pre/post-mixing, panoramic consistency measures) would be required to assess whether the data support the claim.

    Authors: The full manuscript already contains extensive experimental sections (Sections 4–5) with tables reporting quantitative metrics on standard perspective benchmarks (e.g., NYU, KITTI), panoramic datasets, zero-shot generalization, and consistency measures. However, the abstract is deliberately concise and omits specific numbers. We will revise the abstract to include a small number of key quantitative highlights (e.g., relative improvements on panoramic depth and zero-shot metrics) while keeping it within length limits, and ensure all requested table types are clearly referenced. revision: partial

Circularity Check

0 steps flagged

No circularity in derivation; empirical adaptation of external pre-trained model

Full rationale

The paper presents an empirical adaptation strategy: starting from an external pre-trained transformer, applying minimal architectural changes, and training on mixed perspective/panoramic data. No equations, derivations, or self-defined quantities are shown that reduce to fitted inputs by construction. Claims rest on experimental validation against external benchmarks rather than self-citation chains or ansatzes imported from the authors' prior work. This matches the expected non-finding for papers whose central contribution is data mixing and fine-tuning without load-bearing self-referential math.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review provides no explicit free parameters, axioms, or invented entities beyond the high-level framework description; the pre-trained transformer is treated as an external input from prior literature.

pith-pipeline@v0.9.1-grok · 5791 in / 1202 out tokens · 35232 ms · 2026-06-29T22:06:01.644096+00:00 · methodology

Reference graph

Works this paper leans on

58 extracted references · 10 canonical work pages · 4 internal anchors

  1. [1]

    Elite360D: Towards efficient 360 depth estimation via semantic- and distance-aware bi-projection fusion

    Hao Ai and Lin Wang. Elite360D: Towards efficient 360 depth estimation via semantic- and distance-aware bi-projection fusion. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024

  2. [2]

    ARKitscenes: A diverse real-world dataset for 3d indoor scene understanding using mobile RGB-D data

    Gilad Baruch, Zhuoyuan Chen, Afshin Dehghan, Yuri Feigin, Peter Fu, Thomas Gebauer, Daniel Kurz, Tal Dimry, Brandon Joffe, Arik Schwartz, and Elad Shulman. ARKitscenes: A diverse real-world dataset for 3d indoor scene understanding using mobile RGB-D data. In Advances in Neural Information Processing Systems (NeurIPS), Datasets and Benchmarks Track, 2021

  3. [3]

    ZoeDepth: Zero-shot Transfer by Combining Relative and Metric Depth

    Shariq Farooq Bhat, Reiner Birkl, Diana Wofk, Peter Wonka, and Matthias Müller. ZoeDepth: Zero-shot transfer by combining relative and metric depth. preprint arXiv:2302.12288, 2023

  4. [4]

    PanDA: Towards panoramic depth anything with unlabeled panoramas and mobius spatial augmentation

    Zidong Cao, Jinjing Zhu, Weiming Zhang, Hao Ai, Haotian Bai, Hengshuang Zhao, and Lin Wang. PanDA: Towards panoramic depth anything with unlabeled panoramas and mobius spatial augmentation. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025

  5. [5]

    Depth map prediction from a single image using a multi-scale deep network

    David Eigen, Christian Puhrsch, and Rob Fergus. Depth map prediction from a single image using a multi-scale deep network. In Advances in Neural Information Processing Systems (NeurIPS), 2014

  6. [6]

    Geowizard: Unleashing the diffusion priors for 3d geometry estimation from a single image

    Xiao Fu, Wei Yin, Mu Hu, Kaixuan Wang, Yuexin Ma, Ping Tan, Shaojie Shen, Dahua Lin, and Xiaoxiao Long. Geowizard: Unleashing the diffusion priors for 3d geometry estimation from a single image. In European Conference on Computer Vision (ECCV), 2024

  7. [7]

    Fine-tuning image-conditional diffusion models is easier than you think

    Gonzalo Martin Garcia, Karim Abou Zeid, Christian Schmidt, Daan De Geus, Alexander Hermans, and Bastian Leibe. Fine-tuning image-conditional diffusion models is easier than you think. In IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 2025

  8. [8]

    Clément Godard, Oisin Mac Aodha, and Gabriel J. Brostow. Unsupervised monocular depth estimation with left-right consistency. In CVPR, 2017

  9. [9]

    Depth any camera: Zero-shot metric depth estimation from any camera

    Yuliang Guo, Sparsh Garg, S Mahdi H Miangoleh, Xinyu Huang, and Liu Ren. Depth any camera: Zero-shot metric depth estimation from any camera. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025

  10. [10]

    PanoVGGT: Feed-forward 3d reconstruction from panoramic imagery

    Yijing Guo, Mengjun Chao, Luo Wang, Tianyang Zhao, Haizhao Dai, Yingliang Zhang, Jingyi Yu, and Yujiao Shi. PanoVGGT: Feed-forward 3d reconstruction from panoramic imagery. preprint arXiv:2603.17571, 2026

  11. [11]

    Mu Hu, Wei Yin, Chi Zhang, Zhipeng Cai, Xiaoxiao Long, Hao Chen, Kaixuan Wang, Gang Yu, Chunhua Shen, and Shaojie Shen. Metric3d v2: A versatile monocular geometric foundation model for zero-shot metric depth and surface normal estimation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2024

  12. [12]

    PanoNormal: Monocular indoor 360° surface normal estimation

    Kun Huang, Fanglue Zhang, and Neil A Dodgson. PanoNormal: Monocular indoor 360° surface normal estimation. preprint arXiv:2405.18745, 2024

  13. [13]

    Multi-task geometric estimation of depth and surface normal from monocular 360° images

    Kun Huang, Fanglue Zhang, Fangfang Zhang, Yu-Kun Lai, Paul L Rosin, and Neil A Dodgson. Multi-task geometric estimation of depth and surface normal from monocular 360° images. preprint arXiv:2411.01749, 2024

  14. [14]

    DreamCube: 3d panorama generation via multi-plane synchronization

    Yukun Huang, Yanning Zhou, Jianan Wang, Kaiyi Huang, and Xihui Liu. DreamCube: 3d panorama generation via multi-plane synchronization. preprint arXiv:2506.17206, 2025

  15. [15]

    Unifuse: Unidirectional fusion for 360° panorama depth estimation

    Hualie Jiang, Zhe Sheng, Siyu Zhu, Zilong Dong, and Rui Huang. Unifuse: Unidirectional fusion for 360° panorama depth estimation. IEEE Robotics and Automation Letters, 2021

  16. [16]

    RPG360: Robust 360 depth estimation with perspective foundation models and graph optimization

    Dongki Jung, Jaehoon Choi, Yonghan Lee, and Dinesh Manocha. RPG360: Robust 360 depth estimation with perspective foundation models and graph optimization. In Advances in Neural Information Processing Systems (NeurIPS), 2026

  17. [17]

    CubeDiff: Repurposing diffusion-based image models for panorama generation

    Nikolai Kalischek, Michael Oechsle, Fabian Manhardt, Philipp Henzler, Konrad Schindler, and Federico Tombari. CubeDiff: Repurposing diffusion-based image models for panorama generation. In International Conference on Learning Representations (ICLR), 2025

  18. [18]

    360° surface regression with a hyper-sphere loss

    Antonis Karakottas, Nikolaos Zioulis, Stamatis Samaras, Dimitrios Ataloglou, Vasileios Gkitsas, Dimitrios Zarpalas, and Petros Daras. 360° surface regression with a hyper-sphere loss. preprint arXiv:1909.07043, 2019

  19. [19]

    Repurposing diffusion-based image generators for monocular depth estimation

    Bingxin Ke, Anton Obukhov, Shengyu Huang, Nando Metzger, Rodrigo Caye Daudt, and Konrad Schindler. Repurposing diffusion-based image generators for monocular depth estimation. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024

  20. [20]

    HUSH: Holistic panoramic 3d scene understanding using spherical harmonics

    Jongsung Lee, Harin Park, Byeong-Uk Lee, and Kyungdon Joo. HUSH: Holistic panoramic 3d scene understanding using spherical harmonics. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025

  21. [21]

    Leica RTC360 3D Reality Capture Solution System Specification

    Leica Geosystems. Leica RTC360 3D Reality Capture Solution System Specification. Hexagon AB, 2026. Accessed: 2026-05-02

  22. [22]

    Grounding image matching in 3d with MASt3R

    Vincent Leroy, Yohann Cabon, and Jérôme Revaud. Grounding image matching in 3d with MASt3R. In European Conference on Computer Vision (ECCV), 2024

  23. [23]

    DA²: Depth anything in any direction

    Haodong Li, Wangguangdong Zheng, Jing He, Yuhao Liu, Xin Lin, Xin Yang, Ying-Cong Chen, and Chunchao Guo. DA²: Depth anything in any direction. In International Conference on Learning Representations (ICLR), 2026

  24. [24]

    OmniFusion: 360 monocular depth estimation via geometry-aware fusion

    Yuyan Li, Yuliang Guo, Zhixin Yan, Xinyu Huang, Ye Duan, and Liu Ren. OmniFusion: 360 monocular depth estimation via geometry-aware fusion. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022

  25. [25]

    Depth Anything 3: Recovering the visual space from any views

    Haotong Lin, Sili Chen, Junhao Liew, Donny Y. Chen, Zhenyu Li, Guang Shi, Jiashi Feng, and Bingyi Kang. Depth Anything 3: Recovering the visual space from any views. In International Conference on Learning Representations (ICLR), 2026

  26. [26]

    Focal loss for dense object detection

    Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollar. Focal loss for dense object detection. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2017

  27. [27]

    Depth Any Panoramas: A foundation model for panoramic depth estimation

    Xin Lin, Meixi Song, Dizhe Zhang, Wenxuan Lu, Haodong Li, Bo Du, Ming-Hsuan Yang, Truong Nguyen, and Lu Qi. Depth Any Panoramas: A foundation model for panoramic depth estimation. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2026

  28. [28]

    Decoupled Weight Decay Regularization

    Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. preprint arXiv:1711.05101, 2019

  29. [29]

    DINOv2: Learning robust visual features without supervision

    Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, Mahmoud Assran, Nicolas Ballas, Wojciech Galuba, Russell Howes, Po-Yao Huang, Shang-Wen Li, Ishan Misra, Michael Rabbat, Vasu Sharma, Gabriel Synnaeve, Hu Xu, Hervé Jegou, Julien Mairal, Patrick La...

  30. [30]

    iCity, professional procedural city generation add-on for blender

    Parametra. iCity, professional procedural city generation add-on for blender. https://parametra.net/,

  31. [31]

    Accessed: 2026-05-02

  32. [32]

    UniK3D: Universal camera monocular 3d estimation

    Luigi Piccinelli, Christos Sakaridis, Mattia Segu, Yung-Hsu Yang, Siyuan Li, Wim Abbeloos, and Luc Van Gool. UniK3D: Universal camera monocular 3d estimation. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025

  33. [33]

    UniDepthV2: Universal Monocular Metric Depth Estimation Made Simpler

    Luigi Piccinelli, Christos Sakaridis, Yung-Hsu Yang, Mattia Segu, Siyuan Li, Wim Abbeloos, and Luc Van Gool. UniDepthV2: Universal monocular metric depth estimation made simpler. preprint arXiv:2502.20110, 2025

  34. [34]

    Infinite photorealistic worlds using procedural generation

    Alexander Raistrick, Lahav Lipson, Zeyu Ma, Lingjie Mei, Mingzhe Wang, Yiming Zuo, Karhan Kayan, Hongyu Wen, Beining Han, Yihan Wang, Alejandro Newell, Hei Law, Ankit Goyal, Kaiyu Yang, and Jia Deng. Infinite photorealistic worlds using procedural generation. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023

  35. [35]

    Infinigen indoors: Photorealistic indoor scenes using procedural generation

    Alexander Raistrick, Lingjie Mei, Karhan Kayan, David Yan, Yiming Zuo, Beining Han, Hongyu Wen, Meenal Parakh, Stamatis Alexandropoulos, Lahav Lipson, Zeyu Ma, and Jia Deng. Infinigen indoors: Photorealistic indoor scenes using procedural generation. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024

  36. [36]

    René Ranftl, Katrin Lasinger, David Hafner, Konrad Schindler, and Vladlen Koltun. Towards robust monocular depth estimation: Mixing datasets for zero-shot cross-dataset transfer. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(3):1623–1637, 2020

  37. [37]

    Vision transformers for dense prediction

    René Ranftl, Alexey Bochkovskiy, and Vladlen Koltun. Vision transformers for dense prediction. In IEEE/CVF International Conference on Computer Vision (ICCV), 2021

  38. [38]

    Matterport3D 360° RGBD dataset

    Manuel Rey-Area, Mingze Yuan, and Christian Richardt. Matterport3D 360° RGBD dataset. https://researchdata.bath.ac.uk/1126/, 2022

  39. [39]

    PanoFormer: panorama transformer for indoor 360° depth estimation

    Zhijie Shen, Chunyu Lin, Kang Liao, Lang Nie, Zishuo Zheng, and Yao Zhao. PanoFormer: panorama transformer for indoor 360° depth estimation. In European Conference on Computer Vision (ECCV), 2022

  40. [40]

    Very deep convolutional networks for large-scale image recognition

    Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. In International Conference on Learning Representations (ICLR), 2015

  41. [41]

    Stanford 2D-3D-Semantics dataset (2D-3D-S)

    Stanford Doerr School of Sustainability Data Repository. Stanford 2D-3D-Semantics dataset (2D-3D-S). https://sdss.redivis.com/datasets/f304-a3vhsvcaf?v=1.0, 2024

  42. [42]

    Generalised Dice Overlap as a Deep Learning Loss Function for Highly Unbalanced Segmentations

    Carole H. Sudre, Wenqi Li, Tom Vercauteren, Sebastien Ourselin, and M. Jorge Cardoso. Generalised Dice Overlap as a Deep Learning Loss Function for Highly Unbalanced Segmentations, pages 240–248. Springer International Publishing, 2017

  43. [43]

    Bifuse++: Self-supervised and efficient bi-projection fusion for 360 depth estimation

    Fu-En Wang, Yu-Hsuan Yeh, Yi-Hsuan Tsai, Wei-Chen Chiu, and Min Sun. Bifuse++: Self-supervised and efficient bi-projection fusion for 360 depth estimation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(5):5448–5460, 2022

  44. [44]

    VGGT: Visual geometry grounded transformer

    Jianyuan Wang, Minghao Chen, Nikita Karaev, Andrea Vedaldi, Christian Rupprecht, and David Novotny. VGGT: Visual geometry grounded transformer. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025

  45. [45]

    Depth anywhere: Enhancing 360 monocular depth estimation via perspective distillation and unlabeled data augmentation

    Ning-Hsu Albert Wang and Yu-Lun Liu. Depth anywhere: Enhancing 360 monocular depth estimation via perspective distillation and unlabeled data augmentation. In Advances in Neural Information Processing Systems (NeurIPS), 2024

  46. [46]

    MoGe: Unlocking accurate monocular geometry estimation for open-domain images with optimal training supervision

    Ruicheng Wang, Sicheng Xu, Cassie Dai, Jianfeng Xiang, Yu Deng, Xin Tong, and Jiaolong Yang. MoGe: Unlocking accurate monocular geometry estimation for open-domain images with optimal training supervision. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025

  47. [47]

    DUSt3R: Geometric 3d vision made easy

    Shuzhe Wang, Vincent Leroy, Yohann Cabon, Boris Chidlovskii, and Jerome Revaud. DUSt3R: Geometric 3d vision made easy. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024

  48. [48]

    Yifan Wang, Jianjun Zhou, Haoyi Zhu, Wenzheng Chang, Yang Zhou, Zizun Li, Junyi Chen, Jiangmiao Pang, Chunhua Shen, and Tong He. π³: Permutation-equivariant visual geometry learning, 2026

  49. [49]

    FS-Depth: Focal-and-scale depth estimation from a single image in unseen indoor scene

    Chengrui Wei, Meng Yang, Lei He, and Nanning Zheng. FS-Depth: Focal-and-scale depth estimation from a single image in unseen indoor scene. IEEE Transactions on Circuits and Systems for Video Technology, 34(11), 2024

  50. [50]

    Metric-solver: Sliding anchored metric depth estimation from a single image, 2025

    Tao Wen, Jiepeng Wang, Yabo Chen, Shugong Xu, Chi Zhang, and Xuelong Li. Metric-solver: Sliding anchored metric depth estimation from a single image, 2025

  51. [51]

    Depth anything: Unleashing the power of large-scale unlabeled data

    Lihe Yang, Bingyi Kang, Zilong Huang, Xiaogang Xu, Jiashi Feng, and Hengshuang Zhao. Depth anything: Unleashing the power of large-scale unlabeled data. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024

  52. [52]

    ScanNet++: A high-fidelity dataset of 3d indoor scenes

    Chandan Yeshwanth, Yueh-Cheng Liu, Matthias Nießner, and Angela Dai. ScanNet++: A high-fidelity dataset of 3d indoor scenes. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023

  53. [53]

    Metric3d: Towards zero-shot metric 3d prediction from a single image

    Wei Yin, Chi Zhang, Hao Chen, Zhipeng Cai, Gang Yu, Kaixuan Wang, Xiaozhi Chen, and Chunhua Shen. Metric3d: Towards zero-shot metric 3d prediction from a single image. In IEEE/CVF International Conference on Computer Vision (ICCV), 2023

  54. [54]

    VGGT-360: Geometry-Consistent Zero-Shot Panoramic Depth Estimation

    Jiayi Yuan, Haobo Jiang, De Wen Soh, and Na Zhao. VGGT-360: Geometry-consistent zero-shot panoramic depth estimation. preprint arXiv:2603.18943, 2026

  55. [55]

    EGformer: Equirectangular geometry-biased transformer for 360 depth estimation

    Ilwi Yun, Chanyong Shin, Hyunku Lee, Hyuk-Jae Lee, and Chae Eun Rhee. EGformer: Equirectangular geometry-biased transformer for 360 depth estimation. In IEEE/CVF International Conference on Computer Vision (ICCV), 2023

  56. [56]

    MonoViT: Self-supervised monocular depth estimation with a vision transformer

    Chaoqiang Zhao, Youmin Zhang, Matteo Poggi, Fabio Tosi, Xianda Guo, Zheng Zhu, Guan Huang, Yang Tang, and Stefano Mattoccia. MonoViT: Self-supervised monocular depth estimation with a vision transformer. In 2022 International Conference on 3D Vision (3DV), 2022

  57. [57]

    Structured3D: A large photo-realistic dataset for structured 3d modeling

    Jia Zheng, Junfei Zhang, Jing Li, Rui Tang, Shenghua Gao, and Zihan Zhou. Structured3D: A large photo-realistic dataset for structured 3d modeling. In European Conference on Computer Vision (ECCV), 2020

  58. [58]

    ScaleDepth: Decomposing metric depth estimation into semantic-aware scale prediction and adaptive relative depth estimation

    Ruijie Zhu, Chuxin Wang, Ziyang Song, Li Liu, Jianfeng He, Jiacheng Deng, Tianzhu Zhang, and Yongdong Zhang. ScaleDepth: Decomposing metric depth estimation into semantic-aware scale prediction and adaptive relative depth estimation. IEEE Transactions on Circuits and Systems for Video Technology, page 1–1, 2026