pith. machine review for the scientific record.

arxiv: 2603.18943 · v2 · submitted 2026-03-19 · 💻 cs.CV

Recognition: no theorem link

VGGT-360: Geometry-Consistent Zero-Shot Panoramic Depth Estimation

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 08:34 UTC · model grok-4.3

classification 💻 cs.CV
keywords panoramic depth estimation · zero-shot learning · training-free method · 3D reconstruction · geometry consistency · reprojection · foundation models · computer vision

The pith

VGGT-360 recasts panoramic depth estimation as reprojection from a consistent multi-view 3D reconstruction, leveraging VGGT foundation models without any training.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces VGGT-360, a training-free framework that reformulates zero-shot panoramic depth estimation as reprojection from multi-view 3D models. It uses the intrinsic 3D consistency of VGGT-like models to convert fragmented per-view depth estimates into a single coherent panorama. Three plug-and-play modules handle the conversion: adaptive slicing based on uncertainty, structure-aware attention during reconstruction, and correlation-weighted correction of the 3D points. A sympathetic reader would care because panoramic depth supports downstream tasks such as VR navigation, robotics mapping, and 3D scene understanding without requiring task-specific training data or fine-tuning.

Core claim

VGGT-360 reformulates the task as panoramic reprojection over multi-view reconstructed 3D models by leveraging the intrinsic 3D consistency of VGGT-like foundation models, thereby unifying fragmented per-view reasoning into a coherent panoramic understanding through an uncertainty-guided adaptive projection module, a structure-saliency enhanced attention module, and a correlation-weighted 3D model correction module.

What carries the argument

The panorama-to-3D-to-depth reprojection pipeline that feeds uncertainty-selected perspective views into VGGT for 3D reconstruction and then projects the resulting point cloud back to panoramic depth.
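As a concrete sketch of the geometry this pipeline rests on, the snippet below illustrates the two ERP conversions in plain NumPy: mapping equirectangular pixels to unit ray directions, and z-buffering a reconstructed point cloud back into a panoramic depth map. The conventions and function names are our own assumptions for illustration, not code from the paper, and none of the three proposed modules are reproduced.

    import numpy as np

    def erp_ray_directions(height, width):
        # Unit ray direction for every equirectangular (ERP) pixel.
        # Assumed convention: columns span longitude [-pi, pi),
        # rows span latitude from +pi/2 (top) to -pi/2 (bottom).
        lon = (np.arange(width) + 0.5) / width * 2 * np.pi - np.pi
        lat = np.pi / 2 - (np.arange(height) + 0.5) / height * np.pi
        lon, lat = np.meshgrid(lon, lat)             # both (H, W)
        x = np.cos(lat) * np.sin(lon)
        y = np.sin(lat)
        z = np.cos(lat) * np.cos(lon)
        return np.stack([x, y, z], axis=-1)          # (H, W, 3)

    def reproject_points_to_erp(points, height, width):
        # Z-buffer a reconstructed point cloud (N, 3), expressed in the
        # panorama's camera frame, into an (H, W) radial depth map.
        r = np.linalg.norm(points, axis=1)
        lon = np.arctan2(points[:, 0], points[:, 2])
        lat = np.arcsin(np.clip(points[:, 1] / np.maximum(r, 1e-8), -1.0, 1.0))
        u = (((lon + np.pi) / (2 * np.pi)) * width).astype(int) % width
        v = np.clip(((np.pi / 2 - lat) / np.pi * height).astype(int), 0, height - 1)
        depth = np.full((height, width), np.inf)
        order = np.argsort(-r)                       # far first, so near points overwrite
        depth[v[order], u[order]] = r[order]
        return depth

Everything method-specific (which views get sliced, how attention is steered, how overlapping points are reweighted) happens between these two conversions.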

If this is right

  • The approach yields geometry-consistent depth maps that outperform both trained and training-free state-of-the-art methods across resolutions and indoor/outdoor datasets.
  • The three modules can be added to any VGGT-like model to convert it into a panoramic depth estimator without retraining.
  • Reprojection from corrected 3D models removes the domain gap between perspective-trained foundation models and panoramic inputs.
  • The resulting depth supports direct use in downstream panoramic applications such as view synthesis and navigation.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same reprojection strategy could be tested on other foundation models that exhibit 3D consistency, potentially creating a family of training-free panoramic estimators.
  • If future VGGT-style models improve their cross-view coherence, the plug-and-play modules would immediately deliver higher-quality panoramic depth without modification.
  • The uncertainty-guided slicing suggests a general principle for allocating compute to difficult geometry regions in any multi-view 3D task.

Load-bearing premise

That VGGT-like foundation models already contain enough intrinsic 3D consistency to produce coherent panoramic depth when their per-view outputs are fused through reprojection.

What would settle it

A dataset where depth maps from overlapping panoramic regions remain visibly inconsistent after applying the three modules, or where VGGT-360 fails to exceed existing training-free baselines on indoor and outdoor scenes at multiple resolutions.
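Operationally, that check is cheap once per-view depths are reprojected onto a shared ERP grid. A minimal sketch, assuming a (V, H, W) stack with NaN where a view does not cover a pixel (our representation, not the paper's); the two returned numbers match the cross-view depth variance and overlap alignment error the referee report below asks for:

    import numpy as np

    def overlap_consistency(depth_stack):
        # depth_stack: (V, H, W) per-view depths on the ERP grid,
        # NaN where a view does not observe the pixel.
        valid = ~np.isnan(depth_stack)
        overlap = valid.sum(axis=0) >= 2             # pixels seen by >= 2 views
        d = depth_stack[:, overlap]                  # (V, P) multi-view pixels only
        spread = np.nanstd(d, axis=0)                # cross-view depth spread
        rng = np.nanmax(d, axis=0) - np.nanmin(d, axis=0)  # worst-case disagreement
        return float(spread.mean()), float(rng.mean())

If the modules work as claimed, both numbers should drop after each stage; if they stay flat while benchmark scores improve, the consistency story is doing less work than advertised.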

Figures

Figures reproduced from arXiv: 2603.18943 by De Wen Soh, Haobo Jiang, Jiayi Yuan, Na Zhao.

Figure 1
Figure 1: Comparison between the conventional training-free panoramic depth estimation framework and our VGGT-360. Unlike view-independent inference methods (e.g., 360MD [29]), VGGT-360 reconstructs a globally coherent 3D representation via VGGT-like 3D foundation models and reprojects it to the panorama, unifying fragmented per-view predictions into consistent, cross-view correlated depth with superior performance… view at source ↗
Figure 2
Figure 2: Framework Overview of VGGT-360. Given a panoramic image, we first perform uncertainty-guided adaptive projection to produce geometry-informative views for VGGT. With structure-saliency enhanced attention, VGGT reconstructs a structure-faithful 3D model, which is then refined by correlation-weighted 3D model correction and reprojected into a globally consistent panoramic depth map… view at source ↗
Figure 4
Figure 4: Comparison of results before and after applying our structure-saliency enhanced attention mechanism. Guided by our well-designed structure-aware confidence map, our VGGT-360 effectively removes artifacts and preserves geometric structures in weakly structured regions, which are easily affected by illumination cues and noise… view at source ↗
Figure 6
Figure 6: Comparison of reconstructed 3D models without (a) and with (b) our correlation-weighted 3D model correction. Our correction module significantly enhances surface continuity and removes artifacts in overlapping regions… view at source ↗
Figure 7
Figure 7: Qualitative comparisons with the state-of-the-art supervised method Depth-Anywhere (DA)… view at source ↗
Figure 8
Figure 8: Qualitative comparison with SOTA methods: supervised… view at source ↗
Figure 9
Figure 9: Ablation studies on Stanford2D3D [3]: effectiveness of the three proposed modules across various VGGT-like baselines… view at source ↗
Figure 10
Figure 10: Uniform projection with NB=6 views offers limited performance, while simply increasing the view count brings marginal gains at a high computational cost. In contrast, our adaptive strategy (i.e., NB=8 base views with top-K=2 neighbor augmentation) achieves a better accuracy–efficiency trade-off, showing that dynamically focusing on uncertain regions outperforms fixed sampling… view at source ↗
read the original abstract

This paper presents VGGT-360, a novel training-free framework for zero-shot, geometry-consistent panoramic depth estimation. Unlike prior view-independent training-free approaches, VGGT-360 reformulates the task as panoramic reprojection over multi-view reconstructed 3D models by leveraging the intrinsic 3D consistency of VGGT-like foundation models, thereby unifying fragmented per-view reasoning into a coherent panoramic understanding. To achieve robust and accurate estimation, VGGT-360 integrates three plug-and-play modules that form a unified panorama-to-3D-to-depth framework: (i) Uncertainty-guided adaptive projection slices panoramas into perspective views to bridge the domain gap between panoramic inputs and VGGT's perspective prior. It estimates gradient-based uncertainty to allocate denser views to geometry-poor regions, yielding geometry-informative inputs for VGGT. (ii) Structure-saliency enhanced attention strengthens VGGT's robustness during 3D reconstruction by injecting structure-aware confidence into its attention layers, guiding focus toward geometrically reliable regions and enhancing cross-view coherence. (iii) Correlation-weighted 3D model correction refines the reconstructed 3D model by reweighting overlapping points using attention-inferred correlation scores, providing a consistent geometric basis for accurate panoramic reprojection. Extensive experiments show that VGGT-360 outperforms both trained and training-free state-of-the-art methods across multiple resolutions and diverse indoor and outdoor datasets.
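Module (i) is the easiest to make concrete. The sketch below is a plausible stand-in, not the paper's code: the abstract says only "gradient-based uncertainty", so gradient magnitude on the grayscale panorama is an assumed proxy, and the sector-based allocation is a simplification of the actual view slicing. The defaults echo the NB=8 base views with top-K=2 neighbor augmentation reported in Figure 10.

    import numpy as np

    def gradient_uncertainty(gray):
        # Proxy uncertainty map: normalized image-gradient magnitude.
        gy, gx = np.gradient(gray.astype(np.float32))
        mag = np.hypot(gx, gy)
        return mag / (mag.max() + 1e-8)

    def allocate_views(uncertainty, n_base=8, top_k=2):
        # Split the ERP width into n_base longitude sectors and grant the
        # top_k most uncertain sectors extra neighbor views.
        sectors = np.array_split(uncertainty, n_base, axis=1)
        scores = np.array([s.mean() for s in sectors])
        augmented = np.argsort(-scores)[:top_k]
        return scores, sorted(augmented.tolist())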

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces VGGT-360, a training-free zero-shot framework for panoramic depth estimation. It reformulates the task as reprojection from multi-view 3D reconstructions using VGGT foundation models. Three modules are proposed: uncertainty-guided adaptive projection to handle domain gap, structure-saliency enhanced attention for robustness, and correlation-weighted 3D model correction for consistency. Experiments claim superior performance over trained and training-free SOTA on diverse datasets.

Significance. If the results hold, this work demonstrates effective transfer of perspective 3D consistency to panoramic domains via geometric operations, potentially enabling broader use of foundation models in 360-degree vision tasks without additional training. The plug-and-play nature is a strength, but verification of the consistency claims is key to its impact.

major comments (2)
  1. [§3.1] Uncertainty-guided adaptive projection: The method claims to bridge the panoramic-to-perspective domain gap via gradient-based uncertainty allocation, but provides no quantitative verification (e.g., cross-view depth variance or overlap alignment error) that residual distortion in geometry-poor regions is actually reduced rather than masked by the downstream modules.
  2. [§4] Experiments: The central SOTA outperformance claim on indoor/outdoor datasets requires explicit ablations isolating each module's contribution to 3D consistency (e.g., point-cloud alignment metrics before/after correction); without these, the transfer of VGGT's intrinsic consistency across the spherical gap remains untested.
minor comments (1)
  1. [Abstract] The repeated use of 'VGGT-like' is imprecise; clarify whether the framework is tied specifically to the VGGT model or generalizes to other foundation models with similar 3D priors.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback. We have carefully considered the major comments and provide point-by-point responses below. We believe these revisions will significantly improve the clarity and rigor of our work.

read point-by-point responses
  1. Referee: [§3.1] Uncertainty-guided adaptive projection: The method claims to bridge the panoramic-to-perspective domain gap via gradient-based uncertainty allocation, but provides no quantitative verification (e.g., cross-view depth variance or overlap alignment error) that residual distortion in geometry-poor regions is actually reduced rather than masked by the downstream modules.

    Authors: We agree that quantitative verification is important to substantiate the effectiveness of the uncertainty-guided adaptive projection module. In the revised version, we will include additional experiments reporting cross-view depth variance and overlap alignment error metrics. These will demonstrate that the module reduces residual distortion in geometry-poor regions, rather than relying solely on downstream corrections.

    revision: yes

  2. Referee: [§4] Experiments: The central SOTA outperformance claim on indoor/outdoor datasets requires explicit ablations isolating each module's contribution to 3D consistency (e.g., point-cloud alignment metrics before/after correction); without these, the transfer of VGGT's intrinsic consistency across the spherical gap remains untested.

    Authors: We concur that isolating the contribution of each module to 3D consistency is crucial for validating the core claim. We will add comprehensive ablations in the revised manuscript, including point-cloud alignment metrics (such as Chamfer distance or registration errors) computed before and after each module, particularly the correlation-weighted 3D model correction. This will explicitly show how VGGT's consistency is transferred across the panoramic domain.

    revision: yes
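Of the alignment metrics named above, Chamfer distance is the standard choice. A minimal sketch using SciPy's KD-tree; whether the authors would use this exact symmetric variant is an assumption:

    import numpy as np
    from scipy.spatial import cKDTree

    def chamfer_distance(p, q):
        # Symmetric Chamfer distance between point clouds p (N, 3) and q (M, 3):
        # mean nearest-neighbor distance in both directions. Computed on the
        # reconstructed model before and after correction, a drop indicates
        # improved surface consistency.
        d_pq, _ = cKDTree(q).query(p)
        d_qp, _ = cKDTree(p).query(q)
        return float(d_pq.mean() + d_qp.mean())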

Circularity Check

0 steps flagged

No circularity: pipeline applies external VGGT model via standard geometric modules

full rationale

The paper's chain reformulates panoramic depth as reprojection over 3D models reconstructed by an external VGGT-like foundation model, then applies three plug-and-play modules (uncertainty-guided slicing, structure-saliency attention, correlation-weighted correction). No equations, fitted parameters, or self-citations are shown that reduce the final depth map to a quantity defined by the method itself. Claims rest on empirical outperformance rather than tautological re-derivation of inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the assumption that VGGT foundation models already encode transferable 3D consistency; no free parameters or new entities are introduced in the abstract description.

axioms (1)
  • domain assumption: VGGT-like foundation models possess intrinsic 3D consistency that survives the domain shift from perspective to panoramic inputs when guided by the three modules.
    Invoked in the reformulation of panoramic depth as multi-view 3D reprojection.

pith-pipeline@v0.9.0 · 5551 in / 1241 out tokens · 30586 ms · 2026-05-15T08:34:42.567413+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

52 extracted references · 52 canonical work pages · 5 internal anchors

  1. [1]

    Elite360D: Towards efficient 360 depth estimation via semantic- and distance-aware bi-projection fusion

    Hao Ai and Lin Wang. Elite360D: Towards efficient 360 depth estimation via semantic- and distance-aware bi-projection fusion. In CVPR, pages 9926–9935, 2024.

  2. [2]

    HRDFuse: Monocular 360deg depth estimation by collaboratively learning holistic-with-regional depth distributions

    Hao Ai, Zidong Cao, Yan-Pei Cao, Ying Shan, and Lin Wang. HRDFuse: Monocular 360deg depth estimation by collaboratively learning holistic-with-regional depth distributions. In CVPR, pages 13273–13282, 2023.

  3. [3]

    Joint 2D-3D-Semantic Data for Indoor Scene Understanding

    Iro Armeni, Sasha Sax, Amir R. Zamir, and Silvio Savarese. Joint 2D-3D-semantic data for indoor scene understanding. arXiv preprint arXiv:1702.01105, 2017.

  4. [4]

    MatryODShka: Real-time 6DoF video view synthesis using multi-sphere images

    Benjamin Attal, Selena Ling, Aaron Gokaslan, Christian Richardt, and James Tompkin. MatryODShka: Real-time 6DoF video view synthesis using multi-sphere images. In ECCV, pages 441–459. Springer, 2020.

  5. [5]

    OmniPhotos: Casual 360 VR photography

    Tobias Bertel, Mingze Yuan, Reuben Lindroos, and Christian Richardt. OmniPhotos: Casual 360 VR photography. ACM Transactions on Graphics (TOG), 39(6):1–12, 2020.

  6. [6]

    ZoeDepth: Zero-shot Transfer by Combining Relative and Metric Depth

    Shariq Farooq Bhat, Reiner Birkl, Diana Wofk, Peter Wonka, and Matthias Müller. ZoeDepth: Zero-shot transfer by combining relative and metric depth. arXiv preprint arXiv:2302.12288, 2023.

  7. [7]

    CRF360D: Monocular 360 depth estimation via spherical fully-connected CRFs

    Zidong Cao and Lin Wang. CRF360D: Monocular 360 depth estimation via spherical fully-connected CRFs. arXiv preprint arXiv:2405.11564, 2024.

  8. [8]

    PanDA: Towards panoramic depth anything with unlabeled panoramas and Möbius spatial augmentation

    Zidong Cao, Jinjing Zhu, Weiming Zhang, Hao Ai, Haotian Bai, Hengshuang Zhao, and Lin Wang. PanDA: Towards panoramic depth anything with unlabeled panoramas and Möbius spatial augmentation. In CVPR, pages 982–992.

  9. [9]

    Matterport3D: Learning from RGB-D data in indoor environments

    Angel Chang, Angela Dai, Thomas Funkhouser, Maciej Halber, Matthias Niessner, Manolis Savva, Shuran Song, Andy Zeng, and Yinda Zhang. Matterport3D: Learning from RGB-D data in indoor environments. 3DV, 2017.

  10. [10]

    OmniStereo: Real-time omnidirectional depth estimation with multiview fisheye cameras

    Jiaxi Deng, Yushen Wang, Haitao Meng, Zuoxun Hou, Yi Chang, and Gang Chen. OmniStereo: Real-time omnidirectional depth estimation with multiview fisheye cameras. In CVPR, pages 1003–1012, 2025.

  11. [11]

    Convolutions on spherical images

    Marc Eder and Jan-Michael Frahm. Convolutions on spherical images. In CVPR, pages 1–5, 2019.

  12. [12]

    Omnidata: A scalable pipeline for making multi-task mid-level vision datasets from 3D scans

    Ainaz Eftekhar, Alexander Sax, Jitendra Malik, and Amir Zamir. Omnidata: A scalable pipeline for making multi-task mid-level vision datasets from 3D scans. In ICCV, pages 10786–10796, 2021.

  13. [13]

    VGGT-DP: Generalizable robot control via vision foundation models

    Shijia Ge, Yinxin Zhang, Shuzhao Xie, Weixiang Zhang, Mingcai Zhou, and Zhi Wang. VGGT-DP: Generalizable robot control via vision foundation models. arXiv preprint arXiv:2509.18778, 2025.

  14. [14]

    Depth Any Camera: Zero-shot metric depth estimation from any camera

    Yuliang Guo, Sparsh Garg, S. Mahdi H. Miangoleh, Xinyu Huang, and Liu Ren. Depth Any Camera: Zero-shot metric depth estimation from any camera. In CVPR, pages 26996–27006, 2025.

  15. [15]

    Deep residual learning for image recognition

    Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, pages 770–778, 2016.

  16. [16]

    Metric3D v2: A versatile monocular geometric foundation model for zero-shot metric depth and surface normal estimation

    Mu Hu, Wei Yin, Chi Zhang, Zhipeng Cai, Xiaoxiao Long, Hao Chen, Kaixuan Wang, Gang Yu, Chunhua Shen, and Shaojie Shen. Metric3D v2: A versatile monocular geometric foundation model for zero-shot metric depth and surface normal estimation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2024.

  17. [17]

    6-DOF VR videos with a single 360-camera

    Jingwei Huang, Zhili Chen, Duygu Ceylan, and Hailin Jin. 6-DOF VR videos with a single 360-camera. In 2017 IEEE Virtual Reality (VR), pages 37–44. IEEE, 2017.

  18. [18]

    Sampling network guided cross-entropy method for unsupervised point cloud registration

    Haobo Jiang, Yaqi Shen, Jin Xie, Jun Li, Jianjun Qian, and Jian Yang. Sampling network guided cross-entropy method for unsupervised point cloud registration. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021.

  19. [19]

    UniFuse: Unidirectional fusion for 360 panorama depth estimation

    Hualie Jiang, Zhe Sheng, Siyu Zhu, Zilong Dong, and Rui Huang. UniFuse: Unidirectional fusion for 360 panorama depth estimation. IEEE Robotics and Automation Letters, 6(2):1519–1526, 2021.

  20. [20]

    Robust outlier rejection for 3D registration with variational Bayes

    Haobo Jiang, Zheng Dang, Zhen Wei, Jin Xie, Jian Yang, and Mathieu Salzmann. Robust outlier rejection for 3D registration with variational Bayes. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023.

  21. [21]

    SE(3) diffusion model-based point cloud registration for robust 6D object pose estimation

    Haobo Jiang, Mathieu Salzmann, Zheng Dang, Jin Xie, and Jian Yang. SE(3) diffusion model-based point cloud registration for robust 6D object pose estimation. Advances in Neural Information Processing Systems, 2023.

  22. [22]

    RPG360: Robust 360 depth estimation with perspective foundation models and graph optimization

    Dongki Jung, Jaehoon Choi, Yonghan Lee, and Dinesh Manocha. RPG360: Robust 360 depth estimation with perspective foundation models and graph optimization. arXiv preprint arXiv:2509.23991, 2025.

  23. [23]

    OmniFusion: 360 monocular depth estimation via geometry-aware fusion

    Yuyan Li, Yuliang Guo, Zhixin Yan, Xinyu Huang, Ye Duan, and Liu Ren. OmniFusion: 360 monocular depth estimation via geometry-aware fusion. In CVPR, pages 2801–2810, 2022.

  24. [24]

    VGGT-X: When VGGT meets dense novel view synthesis

    Yang Liu, Chuanchen Luo, Zimo Tang, Junran Peng, and Zhaoxiang Zhang. VGGT-X: When VGGT meets dense novel view synthesis. arXiv preprint arXiv:2509.25191, 2025.

  25. [25]

    MS360: A multi-scale feature fusion framework for 360 monocular depth estimation

    Payal Mohadikar, Chuanmao Fan, and Ye Duan. MS360: A multi-scale feature fusion framework for 360 monocular depth estimation. In Proceedings of the 50th Graphics Interface Conference, pages 1–11, 2024.

  26. [26]

    High-resolution depth estimation for 360deg panoramas through perspective and panoramic depth images registration

    Chi-Han Peng and Jiayao Zhang. High-resolution depth estimation for 360deg panoramas through perspective and panoramic depth images registration. In WACV, pages 3116–3125, 2023.

  27. [27]

    Habitat-Matterport 3D Dataset (HM3D): 1000 Large-scale 3D Environments for Embodied AI

    Santhosh K. Ramakrishnan, Aaron Gokaslan, Erik Wijmans, Oleksandr Maksymets, Alex Clegg, John Turner, Eric Undersander, Wojciech Galuba, Andrew Westbury, Angel X. Chang, et al. Habitat-Matterport 3D Dataset (HM3D): 1000 large-scale 3D environments for embodied AI. arXiv preprint arXiv:2109.08238, 2021.

  28. [28]

    Towards robust monocular depth estimation: Mixing datasets for zero-shot cross-dataset transfer

    René Ranftl, Katrin Lasinger, David Hafner, Konrad Schindler, and Vladlen Koltun. Towards robust monocular depth estimation: Mixing datasets for zero-shot cross-dataset transfer. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(3):1623–1637, 2020.

  29. [29]

    360MonoDepth: High-resolution 360deg monocular depth estimation

    Manuel Rey-Area, Mingze Yuan, and Christian Richardt. 360MonoDepth: High-resolution 360deg monocular depth estimation. In CVPR, pages 3762–3772, 2022.

  30. [30]

    Hypersim: A photorealistic synthetic dataset for holistic indoor scene understanding

    Mike Roberts, Jason Ramapuram, Anurag Ranjan, Atulit Kumar, Miguel Angel Bautista, Nathan Paczan, Russ Webb, and Joshua M. Susskind. Hypersim: A photorealistic synthetic dataset for holistic indoor scene understanding. In ICCV, pages 10912–10922, 2021.

  31. [31]

    FastVGGT: Training-free acceleration of visual geometry transformer

    You Shen, Zhipeng Zhang, Yansong Qu, and Liujuan Cao. FastVGGT: Training-free acceleration of visual geometry transformer. arXiv preprint arXiv:2509.02560, 2025.

  32. [32]

    The Replica Dataset: A Digital Replica of Indoor Spaces

    Julian Straub, Thomas Whelan, Lingni Ma, Yufan Chen, Erik Wijmans, Simon Green, Jakob J. Engel, Raul Mur-Artal, Carl Ren, Shobhit Verma, et al. The Replica dataset: A digital replica of indoor spaces. arXiv preprint arXiv:1906.05797, 2019.

  33. [33]

    HoHoNet: 360 indoor holistic understanding with latent horizontal features

    Cheng Sun, Min Sun, and Hwann-Tzong Chen. HoHoNet: 360 indoor holistic understanding with latent horizontal features. In CVPR, pages 2573–2582, 2021.

  34. [34]

    EfficientNet: Rethinking model scaling for convolutional neural networks

    Mingxing Tan and Quoc Le. EfficientNet: Rethinking model scaling for convolutional neural networks. In ICML, pages 6105–6114, 2019.

  35. [35]

    Distortion-aware convolutional filters for dense prediction in panoramic images

    Keisuke Tateno, Nassir Navab, and Federico Tombari. Distortion-aware convolutional filters for dense prediction in panoramic images. In ECCV, pages 707–722, 2018.

  36. [36]

    Improving robotic manipulation with efficient geometry-aware vision encoder

    An Dinh Vuong, Minh Nhat Vu, and Ian Reid. Improving robotic manipulation with efficient geometry-aware vision encoder. arXiv preprint arXiv:2509.15880, 2025.

  37. [37]

    Self-supervised learning of depth and camera motion from 360 videos

    Fu-En Wang, Hou-Ning Hu, Hsien-Tzu Cheng, Juan-Ting Lin, Shang-Ta Yang, Meng-Li Shih, Hung-Kuo Chu, and Min Sun. Self-supervised learning of depth and camera motion from 360 videos. In ACCV, pages 53–68. Springer, 2018.

  38. [38]

    BiFuse: Monocular 360 depth estimation via bi-projection fusion

    Fu-En Wang, Yu-Hsuan Yeh, Min Sun, Wei-Chen Chiu, and Yi-Hsuan Tsai. BiFuse: Monocular 360 depth estimation via bi-projection fusion. In CVPR, pages 462–471, 2020.

  39. [39]

    BiFuse++: Self-supervised and efficient bi-projection fusion for 360 depth estimation

    Fu-En Wang, Yu-Hsuan Yeh, Yi-Hsuan Tsai, Wei-Chen Chiu, and Min Sun. BiFuse++: Self-supervised and efficient bi-projection fusion for 360 depth estimation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(5):5448–5460, 2022.

  40. [40]

    VGGT: Visual geometry grounded transformer

    Jianyuan Wang, Minghao Chen, Nikita Karaev, Andrea Vedaldi, Christian Rupprecht, and David Novotny. VGGT: Visual geometry grounded transformer. In CVPR, pages 5294–5306, 2025.

  41. [41]

    Depth Anywhere: Enhancing 360 monocular depth estimation via perspective distillation and unlabeled data augmentation

    Ning-Hsu Albert Wang and Yu-Lun Liu. Depth Anywhere: Enhancing 360 monocular depth estimation via perspective distillation and unlabeled data augmentation. NeurIPS, 37:127739–127764, 2024.

  42. [42]

    π3: Scalable permutation-equivariant visual geometry learning

    Yifan Wang, Jianjun Zhou, Haoyi Zhu, Wenzheng Chang, Yang Zhou, Zizun Li, Junyi Chen, Jiangmiao Pang, Chunhua Shen, and Tong He. π3: Scalable permutation-equivariant visual geometry learning. arXiv preprint arXiv:2507.13347, 2025.

  43. [43]

    Distortion and uncertainty aware loss for panoramic depth completion

    Zhiqiang Yan, Xiang Li, Kun Wang, Shuo Chen, Jun Li, and Jian Yang. Distortion and uncertainty aware loss for panoramic depth completion. In ICML, pages 39099–39109. PMLR, 2023.

  44. [44]

    Depth Anything: Unleashing the power of large-scale unlabeled data

    Lihe Yang, Bingyi Kang, Zilong Huang, Xiaogang Xu, Jiashi Feng, and Hengshuang Zhao. Depth Anything: Unleashing the power of large-scale unlabeled data. In CVPR, pages 10371–10381, 2024.

  45. [45]

    Metric3D: Towards zero-shot metric 3D prediction from a single image

    Wei Yin, Chi Zhang, Hao Chen, Zhipeng Cai, Gang Yu, Kaixuan Wang, Xiaozhi Chen, and Chunhua Shen. Metric3D: Towards zero-shot metric 3D prediction from a single image. In ICCV, pages 9043–9053, 2023.

  46. [46]

    EGformer: Equirectangular geometry-biased transformer for 360 depth estimation

    Ilwi Yun, Chanyong Shin, Hyunku Lee, Hyuk-Jae Lee, and Chae Eun Rhee. EGformer: Equirectangular geometry-biased transformer for 360 depth estimation. In ICCV, pages 6101–6112, 2023.

  47. [47]

    Taskonomy: Disentangling task transfer learning

    Amir R. Zamir, Alexander Sax, William Shen, Leonidas J. Guibas, Jitendra Malik, and Silvio Savarese. Taskonomy: Disentangling task transfer learning. In CVPR, pages 3712–3722, 2018.

  48. [48]

    SGFormer: Spherical geometry transformer for 360 depth estimation

    Junsong Zhang, Zisong Chen, Chunyu Lin, Zhijie Shen, Lang Nie, Kang Liao, and Yao Zhao. SGFormer: Spherical geometry transformer for 360 depth estimation. IEEE Transactions on Circuits and Systems for Video Technology, 2025.

  49. [49]

    Panoramic visual SLAM technology for spherical images

    Yi Zhang and Fei Huang. Panoramic visual SLAM technology for spherical images. Sensors, 21(3):705, 2021.

  50. [50]

    Structured3D: A large photo-realistic dataset for structured 3D modeling

    Jia Zheng, Junfei Zhang, Jing Li, Rui Tang, Shenghua Gao, and Zihan Zhou. Structured3D: A large photo-realistic dataset for structured 3D modeling. In ECCV, pages 519–535. Springer, 2020.

  52. [52]

    ACDNet: Adaptively combined dilated convolution for monocular panorama depth estimation

    Chuanqing Zhuang, Zhengda Lu, Yiqun Wang, Jun Xiao, and Ying Wang. ACDNet: Adaptively combined dilated convolution for monocular panorama depth estimation. In AAAI, pages 3653–3661, 2022.