VGGT-360: Geometry-Consistent Zero-Shot Panoramic Depth Estimation
Pith reviewed 2026-05-15 08:34 UTC · model grok-4.3
The pith
VGGT-360 recasts panoramic depth estimation as a geometry-consistent 3D reprojection problem, leveraging VGGT foundation models without any training.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
VGGT-360 reformulates the task as panoramic reprojection over multi-view reconstructed 3D models, exploiting the intrinsic 3D consistency of VGGT-like foundation models to unify fragmented per-view reasoning into a coherent panoramic understanding. Three modules carry this: uncertainty-guided adaptive projection, structure-saliency enhanced attention, and correlation-weighted 3D model correction.
What carries the argument
The panorama-to-3D-to-depth reprojection pipeline that feeds uncertainty-selected perspective views into VGGT for 3D reconstruction and then projects the resulting point cloud back to panoramic depth.
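The final step of that pipeline — projecting a reconstructed point cloud back onto the equirectangular grid — can be sketched in a few lines. This is an illustrative sketch only, assuming a y-down, z-forward camera convention; the paper's actual conventions and hole-filling strategy are not specified here.

```python
import numpy as np

def reproject_to_equirect(points, height=256, width=512):
    """Project a camera-centred 3D point cloud (x right, y down, z forward)
    onto an equirectangular depth map, keeping the nearest point per pixel."""
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    r = np.linalg.norm(points, axis=1)                    # radial depth
    lon = np.arctan2(x, z)                                # longitude in [-pi, pi)
    lat = np.arcsin(np.clip(y / np.maximum(r, 1e-9), -1.0, 1.0))
    u = ((lon / (2 * np.pi) + 0.5) * width).astype(int) % width
    v = ((lat / np.pi + 0.5) * height).astype(int).clip(0, height - 1)
    depth = np.full((height, width), np.inf)
    np.minimum.at(depth, (v, u), r)                       # z-buffer scatter
    depth[np.isinf(depth)] = 0.0                          # unobserved pixels
    return depth
```

The `np.minimum.at` scatter is what makes overlapping views behave like a z-buffer: when several points land on one pixel, the nearest survives.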
If this is right
- The approach yields geometry-consistent depth maps that outperform both trained and training-free state-of-the-art methods across resolutions and indoor/outdoor datasets.
- The three modules can be added to any VGGT-like model to convert it into a panoramic depth estimator without retraining.
- Reprojection from corrected 3D models removes the domain gap between perspective-trained foundation models and panoramic inputs.
- The resulting depth supports direct use in downstream panoramic applications such as view synthesis and navigation.
Where Pith is reading between the lines
- The same reprojection strategy could be tested on other foundation models that exhibit 3D consistency, potentially creating a family of training-free panoramic estimators.
- If future VGGT-style models improve their cross-view coherence, the plug-and-play modules would immediately deliver higher-quality panoramic depth without modification.
- The uncertainty-guided slicing suggests a general principle for allocating compute to difficult geometry regions in any multi-view 3D task.
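To make that last point concrete, a gradient-based uncertainty proxy can drive view allocation directly. The sketch below is hypothetical — the paper's actual uncertainty measure and sampling scheme are not given here — and simply places extra perspective-view centres at the highest-uncertainty panorama columns on top of a uniform base grid.

```python
import numpy as np

def allocate_views(pano_gray, base_views=8, extra_views=4):
    """Return sorted view-centre longitudes (radians), denser where a
    per-column image-gradient score is high. Hypothetical sketch."""
    gy, gx = np.gradient(pano_gray.astype(float))
    uncertainty = np.hypot(gx, gy).mean(axis=0)           # one score per column
    base = np.linspace(-np.pi, np.pi, base_views, endpoint=False)
    width = pano_gray.shape[1]
    top = np.argsort(uncertainty)[-extra_views:]          # hardest columns
    extra = (top / width - 0.5) * 2 * np.pi               # column -> longitude
    return np.sort(np.concatenate([base, extra]))
```

The same pattern — score a cheap proxy, then spend extra samples where it peaks — is what generalizes to other multi-view 3D tasks.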
Load-bearing premise
That VGGT-like foundation models already contain enough intrinsic 3D consistency to produce coherent panoramic depth when their per-view outputs are fused through reprojection.
What would settle it
A dataset where depth maps from overlapping panoramic regions remain visibly inconsistent after applying the three modules, or where VGGT-360 fails to exceed existing training-free baselines on indoor and outdoor scenes at multiple resolutions.
Original abstract
This paper presents VGGT-360, a novel training-free framework for zero-shot, geometry-consistent panoramic depth estimation. Unlike prior view-independent training-free approaches, VGGT-360 reformulates the task as panoramic reprojection over multi-view reconstructed 3D models by leveraging the intrinsic 3D consistency of VGGT-like foundation models, thereby unifying fragmented per-view reasoning into a coherent panoramic understanding. To achieve robust and accurate estimation, VGGT-360 integrates three plug-and-play modules that form a unified panorama-to-3D-to-depth framework: (i) Uncertainty-guided adaptive projection slices panoramas into perspective views to bridge the domain gap between panoramic inputs and VGGT's perspective prior. It estimates gradient-based uncertainty to allocate denser views to geometry-poor regions, yielding geometry-informative inputs for VGGT. (ii) Structure-saliency enhanced attention strengthens VGGT's robustness during 3D reconstruction by injecting structure-aware confidence into its attention layers, guiding focus toward geometrically reliable regions and enhancing cross-view coherence. (iii) Correlation-weighted 3D model correction refines the reconstructed 3D model by reweighting overlapping points using attention-inferred correlation scores, providing a consistent geometric basis for accurate panoramic reprojection. Extensive experiments show that VGGT-360 outperforms both trained and training-free state-of-the-art methods across multiple resolutions and diverse indoor and outdoor datasets.
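Module (iii) in the abstract reweights overlapping points by correlation scores before reprojection. A minimal stand-in for that fusion step, assuming per-view depth maps already aligned to the panorama grid with per-pixel score logits (the paper infers its scores from attention, which is not reproduced here):

```python
import numpy as np

def fuse_depth_maps(depth_stack, score_stack):
    """Correlation-weighted fusion of overlapping depth estimates.

    depth_stack, score_stack: arrays of shape (V, H, W), one slice per view,
    with depth 0 marking pixels a view does not observe. Illustrative only.
    """
    valid = depth_stack > 0
    w = np.exp(score_stack) * valid                       # zero weight if unseen
    fused = (w * depth_stack).sum(axis=0) / np.maximum(w.sum(axis=0), 1e-9)
    return fused
```

With equal scores this reduces to a plain mean over the views that see a pixel; raising one view's score pulls the fused depth toward its estimate.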
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces VGGT-360, a training-free zero-shot framework for panoramic depth estimation. It reformulates the task as reprojection from multi-view 3D reconstructions using VGGT foundation models. Three modules are proposed: uncertainty-guided adaptive projection to handle domain gap, structure-saliency enhanced attention for robustness, and correlation-weighted 3D model correction for consistency. Experiments claim superior performance over trained and training-free SOTA on diverse datasets.
Significance. If the results hold, this work demonstrates effective transfer of perspective 3D consistency to panoramic domains via geometric operations, potentially enabling broader use of foundation models in 360-degree vision tasks without additional training. The plug-and-play nature is a strength, but verification of the consistency claims is key to its impact.
major comments (2)
- §3.1 (Uncertainty-guided adaptive projection): The method claims to bridge the panoramic-to-perspective domain gap via gradient-based uncertainty allocation, but provides no quantitative verification (e.g., cross-view depth variance or overlap alignment error) that residual distortion in geometry-poor regions is actually reduced rather than masked by the downstream modules.
- §4 (Experiments): The central SOTA outperformance claim on indoor/outdoor datasets requires explicit ablations isolating each module's contribution to 3D consistency (e.g., point-cloud alignment metrics before/after correction); without these, the transfer of VGGT's intrinsic consistency across the spherical gap remains untested.
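The cross-view depth variance asked for above has a simple candidate form. This is a hypothetical metric sketch, not the referee's exact formulation: the variance of the log depth ratio between two views over their overlap, which is zero exactly when the views agree up to a single global scale.

```python
import numpy as np

def overlap_depth_variance(depth_a, depth_b, overlap_mask):
    """Variance of the log depth ratio over the overlap of two views.

    Scale-invariant: zero when the maps agree up to one global factor on
    the overlap. Hypothetical consistency metric, not the paper's.
    """
    ratio = depth_a[overlap_mask] / np.maximum(depth_b[overlap_mask], 1e-9)
    return float(np.var(np.log(np.maximum(ratio, 1e-9))))
```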
minor comments (1)
- Abstract: The repeated use of 'VGGT-like' is imprecise; clarify whether the framework is tied specifically to the VGGT model or generalizes to other foundation models with similar 3D priors.
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive feedback. We have carefully considered the major comments and provide point-by-point responses below. We believe these revisions will significantly improve the clarity and rigor of our work.
Point-by-point responses
- Referee: §3.1 (Uncertainty-guided adaptive projection): The method claims to bridge the panoramic-to-perspective domain gap via gradient-based uncertainty allocation, but provides no quantitative verification (e.g., cross-view depth variance or overlap alignment error) that residual distortion in geometry-poor regions is actually reduced rather than masked by the downstream modules.
  Authors: We agree that quantitative verification is important to substantiate the effectiveness of the uncertainty-guided adaptive projection module. In the revised version, we will include additional experiments reporting cross-view depth variance and overlap alignment error metrics. These will demonstrate that the module reduces residual distortion in geometry-poor regions, rather than relying solely on downstream corrections. revision: yes
- Referee: §4 (Experiments): The central SOTA outperformance claim on indoor/outdoor datasets requires explicit ablations isolating each module's contribution to 3D consistency (e.g., point-cloud alignment metrics before/after correction); without these, the transfer of VGGT's intrinsic consistency across the spherical gap remains untested.
  Authors: We concur that isolating the contribution of each module to 3D consistency is crucial for validating the core claim. We will add comprehensive ablations in the revised manuscript, including point-cloud alignment metrics (such as Chamfer distance or registration errors) computed before and after each module, particularly the correlation-weighted 3D model correction. This will explicitly show how VGGT's consistency is transferred across the panoramic domain. revision: yes
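The point-cloud alignment metric named in that response can be computed brute-force for small clouds. A minimal sketch of symmetric Chamfer distance on (N, 3) arrays — the paper may use a different variant or a registration error instead:

```python
import numpy as np

def chamfer_distance(a, b):
    """Symmetric Chamfer distance between point sets a (N, 3) and b (M, 3):
    mean squared nearest-neighbour distance in each direction, summed.
    O(N*M) brute force; fine for ablation-sized clouds."""
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=-1)  # pairwise sq. dists
    return float(d2.min(axis=1).mean() + d2.min(axis=0).mean())
```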
Circularity Check
No circularity: pipeline applies external VGGT model via standard geometric modules
full rationale
The paper's chain reformulates panoramic depth as reprojection over 3D models reconstructed by an external VGGT-like foundation model, then applies three plug-and-play modules (uncertainty-guided slicing, structure-saliency attention, correlation-weighted correction). No equations, fitted parameters, or self-citations are shown that reduce the final depth map to a quantity defined by the method itself. Claims rest on empirical outperformance rather than tautological re-derivation of inputs.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: VGGT-like foundation models possess intrinsic 3D consistency that survives domain shift from perspective to panoramic inputs when guided by the three modules.