VGGT-360: Geometry-Consistent Zero-Shot Panoramic Depth Estimation
Pith reviewed 2026-05-15 08:34 UTC · model grok-4.3
The pith
VGGT-360 recasts panoramic depth estimation as a geometry-consistent 3D reprojection problem, leveraging VGGT foundation models without any training.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
VGGT-360 reformulates the task as panoramic reprojection over multi-view reconstructed 3D models, exploiting the intrinsic 3D consistency of VGGT-like foundation models to unify fragmented per-view reasoning into a coherent panoramic understanding. Three modules carry this: uncertainty-guided adaptive projection, structure-saliency enhanced attention, and correlation-weighted 3D model correction.
What carries the argument
The panorama-to-3D-to-depth reprojection pipeline that feeds uncertainty-selected perspective views into VGGT for 3D reconstruction and then projects the resulting point cloud back to panoramic depth.
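The final step of that pipeline — projecting a reconstructed point cloud back onto the equirectangular grid — can be sketched in a few lines. This is an illustrative sketch only, assuming a y-down, z-forward camera convention; the paper's actual conventions and hole-filling strategy are not specified here.

```python
import numpy as np

def reproject_to_equirect(points, height=256, width=512):
    """Project a camera-centred 3D point cloud (x right, y down, z forward)
    onto an equirectangular depth map, keeping the nearest point per pixel."""
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    r = np.linalg.norm(points, axis=1)                    # radial depth
    lon = np.arctan2(x, z)                                # longitude in [-pi, pi)
    lat = np.arcsin(np.clip(y / np.maximum(r, 1e-9), -1.0, 1.0))
    u = ((lon / (2 * np.pi) + 0.5) * width).astype(int) % width
    v = ((lat / np.pi + 0.5) * height).astype(int).clip(0, height - 1)
    depth = np.full((height, width), np.inf)
    np.minimum.at(depth, (v, u), r)                       # z-buffer scatter
    depth[np.isinf(depth)] = 0.0                          # unobserved pixels
    return depth
```

The `np.minimum.at` scatter is what makes overlapping views behave like a z-buffer: when several points land on one pixel, the nearest survives.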
If this is right
- The approach yields geometry-consistent depth maps that outperform both trained and training-free state-of-the-art methods across resolutions and indoor/outdoor datasets.
- The three modules can be added to any VGGT-like model to convert it into a panoramic depth estimator without retraining.
- Reprojection from corrected 3D models removes the domain gap between perspective-trained foundation models and panoramic inputs.
- The resulting depth supports direct use in downstream panoramic applications such as view synthesis and navigation.
Where Pith is reading between the lines
- The same reprojection strategy could be tested on other foundation models that exhibit 3D consistency, potentially creating a family of training-free panoramic estimators.
- If future VGGT-style models improve their cross-view coherence, the plug-and-play modules would immediately deliver higher-quality panoramic depth without modification.
- The uncertainty-guided slicing suggests a general principle for allocating compute to difficult geometry regions in any multi-view 3D task.
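To make that last point concrete, a gradient-based uncertainty proxy can drive view allocation directly. The sketch below is hypothetical — the paper's actual uncertainty measure and sampling scheme are not given here — and simply places extra perspective-view centres at the highest-uncertainty panorama columns on top of a uniform base grid.

```python
import numpy as np

def allocate_views(pano_gray, base_views=8, extra_views=4):
    """Return sorted view-centre longitudes (radians), denser where a
    per-column image-gradient score is high. Hypothetical sketch."""
    gy, gx = np.gradient(pano_gray.astype(float))
    uncertainty = np.hypot(gx, gy).mean(axis=0)           # one score per column
    base = np.linspace(-np.pi, np.pi, base_views, endpoint=False)
    width = pano_gray.shape[1]
    top = np.argsort(uncertainty)[-extra_views:]          # hardest columns
    extra = (top / width - 0.5) * 2 * np.pi               # column -> longitude
    return np.sort(np.concatenate([base, extra]))
```

The same pattern — score a cheap proxy, then spend extra samples where it peaks — is what generalizes to other multi-view 3D tasks.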
Load-bearing premise
That VGGT-like foundation models already contain enough intrinsic 3D consistency to produce coherent panoramic depth when their per-view outputs are fused through reprojection.
What would settle it
A dataset where depth maps from overlapping panoramic regions remain visibly inconsistent after applying the three modules, or where VGGT-360 fails to exceed existing training-free baselines on indoor and outdoor scenes at multiple resolutions.
Original abstract
This paper presents VGGT-360, a novel training-free framework for zero-shot, geometry-consistent panoramic depth estimation. Unlike prior view-independent training-free approaches, VGGT-360 reformulates the task as panoramic reprojection over multi-view reconstructed 3D models by leveraging the intrinsic 3D consistency of VGGT-like foundation models, thereby unifying fragmented per-view reasoning into a coherent panoramic understanding. To achieve robust and accurate estimation, VGGT-360 integrates three plug-and-play modules that form a unified panorama-to-3D-to-depth framework: (i) Uncertainty-guided adaptive projection slices panoramas into perspective views to bridge the domain gap between panoramic inputs and VGGT's perspective prior. It estimates gradient-based uncertainty to allocate denser views to geometry-poor regions, yielding geometry-informative inputs for VGGT. (ii) Structure-saliency enhanced attention strengthens VGGT's robustness during 3D reconstruction by injecting structure-aware confidence into its attention layers, guiding focus toward geometrically reliable regions and enhancing cross-view coherence. (iii) Correlation-weighted 3D model correction refines the reconstructed 3D model by reweighting overlapping points using attention-inferred correlation scores, providing a consistent geometric basis for accurate panoramic reprojection. Extensive experiments show that VGGT-360 outperforms both trained and training-free state-of-the-art methods across multiple resolutions and diverse indoor and outdoor datasets.
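Module (iii) in the abstract reweights overlapping points by correlation scores before reprojection. A minimal stand-in for that fusion step, assuming per-view depth maps already aligned to the panorama grid with per-pixel score logits (the paper infers its scores from attention, which is not reproduced here):

```python
import numpy as np

def fuse_depth_maps(depth_stack, score_stack):
    """Correlation-weighted fusion of overlapping depth estimates.

    depth_stack, score_stack: arrays of shape (V, H, W), one slice per view,
    with depth 0 marking pixels a view does not observe. Illustrative only.
    """
    valid = depth_stack > 0
    w = np.exp(score_stack) * valid                       # zero weight if unseen
    fused = (w * depth_stack).sum(axis=0) / np.maximum(w.sum(axis=0), 1e-9)
    return fused
```

With equal scores this reduces to a plain mean over the views that see a pixel; raising one view's score pulls the fused depth toward its estimate.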
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces VGGT-360, a training-free zero-shot framework for panoramic depth estimation. It reformulates the task as reprojection from multi-view 3D reconstructions using VGGT foundation models. Three modules are proposed: uncertainty-guided adaptive projection to handle domain gap, structure-saliency enhanced attention for robustness, and correlation-weighted 3D model correction for consistency. Experiments claim superior performance over trained and training-free SOTA on diverse datasets.
Significance. If the results hold, this work demonstrates effective transfer of perspective 3D consistency to panoramic domains via geometric operations, potentially enabling broader use of foundation models in 360-degree vision tasks without additional training. The plug-and-play nature is a strength, but verification of the consistency claims is key to its impact.
major comments (2)
- §3.1 (Uncertainty-guided adaptive projection): The method claims to bridge the panoramic-to-perspective domain gap via gradient-based uncertainty allocation, but provides no quantitative verification (e.g., cross-view depth variance or overlap alignment error) that residual distortion in geometry-poor regions is actually reduced rather than masked by the downstream modules.
- §4 (Experiments): The central SOTA outperformance claim on indoor/outdoor datasets requires explicit ablations isolating each module's contribution to 3D consistency (e.g., point-cloud alignment metrics before/after correction); without these, the transfer of VGGT's intrinsic consistency across the spherical gap remains untested.
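The cross-view depth variance asked for above has a simple candidate form. This is a hypothetical metric sketch, not the referee's exact formulation: the variance of the log depth ratio between two views over their overlap, which is zero exactly when the views agree up to a single global scale.

```python
import numpy as np

def overlap_depth_variance(depth_a, depth_b, overlap_mask):
    """Variance of the log depth ratio over the overlap of two views.

    Scale-invariant: zero when the maps agree up to one global factor on
    the overlap. Hypothetical consistency metric, not the paper's.
    """
    ratio = depth_a[overlap_mask] / np.maximum(depth_b[overlap_mask], 1e-9)
    return float(np.var(np.log(np.maximum(ratio, 1e-9))))
```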
minor comments (1)
- Abstract: The repeated use of 'VGGT-like' is imprecise; clarify whether the framework is tied specifically to the VGGT model or generalizes to other foundation models with similar 3D priors.
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive feedback. We have carefully considered the major comments and provide point-by-point responses below. We believe these revisions will significantly improve the clarity and rigor of our work.
Point-by-point responses
- Referee: §3.1 (Uncertainty-guided adaptive projection): The method claims to bridge the panoramic-to-perspective domain gap via gradient-based uncertainty allocation, but provides no quantitative verification (e.g., cross-view depth variance or overlap alignment error) that residual distortion in geometry-poor regions is actually reduced rather than masked by the downstream modules.
  Authors: We agree that quantitative verification is important to substantiate the effectiveness of the uncertainty-guided adaptive projection module. In the revised version, we will include additional experiments reporting cross-view depth variance and overlap alignment error metrics. These will demonstrate that the module reduces residual distortion in geometry-poor regions, rather than relying solely on downstream corrections. revision: yes
- Referee: §4 (Experiments): The central SOTA outperformance claim on indoor/outdoor datasets requires explicit ablations isolating each module's contribution to 3D consistency (e.g., point-cloud alignment metrics before/after correction); without these, the transfer of VGGT's intrinsic consistency across the spherical gap remains untested.
  Authors: We concur that isolating the contribution of each module to 3D consistency is crucial for validating the core claim. We will add comprehensive ablations in the revised manuscript, including point-cloud alignment metrics (such as Chamfer distance or registration errors) computed before and after each module, particularly the correlation-weighted 3D model correction. This will explicitly show how VGGT's consistency is transferred across the panoramic domain. revision: yes
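The point-cloud alignment metric named in that response can be computed brute-force for small clouds. A minimal sketch of symmetric Chamfer distance on (N, 3) arrays — the paper may use a different variant or a registration error instead:

```python
import numpy as np

def chamfer_distance(a, b):
    """Symmetric Chamfer distance between point sets a (N, 3) and b (M, 3):
    mean squared nearest-neighbour distance in each direction, summed.
    O(N*M) brute force; fine for ablation-sized clouds."""
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=-1)  # pairwise sq. dists
    return float(d2.min(axis=1).mean() + d2.min(axis=0).mean())
```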
Circularity Check
No circularity: pipeline applies external VGGT model via standard geometric modules
full rationale
The paper's chain reformulates panoramic depth as reprojection over 3D models reconstructed by an external VGGT-like foundation model, then applies three plug-and-play modules (uncertainty-guided slicing, structure-saliency attention, correlation-weighted correction). No equations, fitted parameters, or self-citations are shown that reduce the final depth map to a quantity defined by the method itself. Claims rest on empirical outperformance rather than tautological re-derivation of inputs.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: VGGT-like foundation models possess intrinsic 3D consistency that survives domain shift from perspective to panoramic inputs when guided by the three modules.