Sphere-Depth: A Benchmark for Depth Estimation Methods with Varying Spherical Camera Orientations
Pith reviewed 2026-05-08 08:19 UTC · model grok-4.3
The pith
Depth estimation from spherical images degrades when camera pose deviates from the canonical orientation
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The Sphere-Depth benchmark demonstrates that monocular depth estimation models for spherical images, even those explicitly designed to handle equirectangular projections, exhibit substantial performance degradation when the camera pose deviates from the canonical orientation. Degradation is quantified with a depth-calibration error protocol that applies supervised learned scaling factors, enabling fair comparison across relative-depth predictors.
What carries the argument
Sphere-Depth benchmark, which applies simulated pose perturbations to equirectangular images and uses supervised scaling-factor calibration to measure depth errors
Load-bearing premise
The simulated camera pose perturbations accurately represent unintentional real-world variations on robotic platforms, and the supervised learned scaling factors provide a fair, unbiased way to compare relative-depth models.
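To make this premise concrete, the sketch below shows one plausible way to apply such a perturbation: sample roll/pitch/yaw angles and resample an equirectangular image through the rotated viewing sphere. This is an illustrative reconstruction, not the benchmark's released code; the ±10° range, the Euler-angle convention, and the nearest-neighbour resampling are all assumptions.

```python
# Minimal sketch (not the authors' code) of applying a pose perturbation to an
# equirectangular image: rotate the viewing sphere, then resample the panorama.
import numpy as np
from scipy.spatial.transform import Rotation

def perturb_equirectangular(img: np.ndarray, roll: float, pitch: float, yaw: float) -> np.ndarray:
    """Resample an HxWxC equirectangular image under a camera rotation (radians)."""
    h, w = img.shape[:2]
    # Pixel grid -> spherical angles (longitude in [-pi, pi), latitude in [-pi/2, pi/2]).
    lon = (np.arange(w) + 0.5) / w * 2 * np.pi - np.pi
    lat = np.pi / 2 - (np.arange(h) + 0.5) / h * np.pi
    lon, lat = np.meshgrid(lon, lat)
    # Angles -> unit ray directions on the sphere.
    xyz = np.stack([np.cos(lat) * np.cos(lon),
                    np.cos(lat) * np.sin(lon),
                    np.sin(lat)], axis=-1)
    # Inverse mapping: rotate output rays back into the source frame
    # (row-vector product v @ R applies R^{-1} to each ray).
    rot = Rotation.from_euler("xyz", [roll, pitch, yaw])
    src = xyz @ rot.as_matrix()
    src_lon = np.arctan2(src[..., 1], src[..., 0])
    src_lat = np.arcsin(np.clip(src[..., 2], -1.0, 1.0))
    # Angles -> source pixel coordinates (nearest neighbour for brevity).
    u = ((src_lon + np.pi) / (2 * np.pi) * w).astype(int) % w
    v = ((np.pi / 2 - src_lat) / np.pi * h).astype(int).clip(0, h - 1)
    return img[v, u]

rng = np.random.default_rng(0)
image = rng.random((256, 512, 3))  # stand-in for a real panorama
# Illustrative +/-10 degree range; the paper's actual ranges are not stated here.
roll, pitch, yaw = rng.uniform(-np.deg2rad(10), np.deg2rad(10), size=3)
perturbed = perturb_equirectangular(image, roll, pitch, yaw)
```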
What would settle it
Real-world tests on robotic platforms with measured unintentional pose variations that show no significant increase in depth error for spherical-aware models.
Figures
Original abstract
Reliable depth estimation from spherical images is crucial for 360° vision in robotic navigation and immersive scene understanding. However, the onboard spherical camera can experience unintentional pose variations in real-world robotic platforms that, along with the geometric distortions inherent in equirectangular projections, significantly impact the effectiveness of depth estimation. To study this issue, a novel public benchmark, called Sphere-Depth, is introduced to systematically evaluate the robustness of monocular depth estimation models from equirectangular images in a reproducible way. Camera pose perturbations are simulated and used to assess the performance of a popular perspective-based model, Depth Anything, and of spherical-aware models such as Depth Anywhere, ACDNet, Bifuse++, and SliceNet. Furthermore, to ensure meaningful evaluation across models, a depth calibration-based error protocol is proposed to convert predicted relative depth values into metric depth values using supervised learned scaling factors for each model. Experiments show that even models explicitly designed to process spherical images exhibit substantial performance degradation when variations in the camera pose are observed with respect to the canonical pose. The full benchmark, evaluation protocol, and dataset splits are made publicly available at: https://github.com/sgazzeh/Sphere_depth
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Sphere-Depth, a public benchmark for evaluating monocular depth estimation from equirectangular images under simulated camera pose perturbations. It assesses a perspective model (Depth Anything) alongside spherical-aware models (Depth Anywhere, ACDNet, Bifuse++, SliceNet), proposes a calibration protocol that converts relative depth predictions to metric depths via per-model supervised scaling factors, and reports that even spherical-designed models suffer substantial performance degradation when camera orientation deviates from the canonical pose. The benchmark, protocol, and splits are released publicly.
Significance. If the simulated perturbations prove representative of real robotic variations and the calibration protocol yields unbiased comparisons, the work would usefully highlight a robustness gap in spherical depth estimation for navigation and immersive applications. The public release of the full benchmark, evaluation protocol, and dataset splits is a clear strength that supports reproducibility and follow-on research.
major comments (2)
- The central claim that spherical-aware models degrade under pose variation rests on the fidelity of the simulated perturbations. The manuscript provides no direct empirical validation (e.g., comparison of simulated roll/pitch/yaw distributions against IMU recordings from actual robotic platforms on uneven terrain), so it is unclear whether the observed degradation reflects a general property or an artifact of the simulation protocol.
- The depth calibration protocol applies supervised learned scaling factors per model to convert relative to metric depth. Because these factors are fitted with ground-truth supervision on the same data used for evaluation, they risk introducing model-specific bias that could inflate or deflate relative performance; the paper does not report an ablation or cross-validation of this choice.
minor comments (1)
- The abstract states that the benchmark is 'reproducible' yet does not specify the exact ranges, sampling strategy, or correlation structure of the simulated pose perturbations; these details belong in the main text or supplementary material for independent reproduction.
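For concreteness, here is the kind of self-contained perturbation specification the comment asks for: explicit ranges, a sampling strategy, and a seed. Every number and field name below is a hypothetical placeholder, not a parameter taken from the paper.

```python
# Hypothetical perturbation spec of the kind the referee requests;
# all ranges and the sampling strategy are illustrative assumptions.
from dataclasses import dataclass
import numpy as np

@dataclass(frozen=True)
class PerturbationSpec:
    roll_range_deg: tuple = (-10.0, 10.0)   # illustrative
    pitch_range_deg: tuple = (-10.0, 10.0)  # illustrative
    yaw_range_deg: tuple = (-180.0, 180.0)  # illustrative
    sampling: str = "uniform, independent per axis"  # no cross-axis correlation assumed
    seed: int = 0

    def sample(self, n: int) -> np.ndarray:
        """Draw n (roll, pitch, yaw) triples in radians, reproducibly."""
        rng = np.random.default_rng(self.seed)
        ranges = [self.roll_range_deg, self.pitch_range_deg, self.yaw_range_deg]
        degs = np.stack([rng.uniform(lo, hi, size=n) for lo, hi in ranges], axis=1)
        return np.deg2rad(degs)

spec = PerturbationSpec()
poses = spec.sample(5)  # 5 x 3 array of (roll, pitch, yaw) in radians
```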
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive feedback on our paper. We address each of the major comments point by point below, indicating where revisions will be made to improve the manuscript.
Point-by-point responses
-
Referee: The central claim that spherical-aware models degrade under pose variation rests on the fidelity of the simulated perturbations. The manuscript provides no direct empirical validation (e.g., comparison of simulated roll/pitch/yaw distributions against IMU recordings from actual robotic platforms on uneven terrain), so it is unclear whether the observed degradation reflects a general property or an artifact of the simulation protocol.
Authors: We agree that direct empirical validation against real IMU recordings would provide stronger support for the representativeness of our simulations. Our benchmark is designed for controlled, reproducible evaluation of pose-induced degradation using parameterized perturbations, which enables systematic analysis across models and magnitudes. The roll, pitch, and yaw ranges were chosen to reflect plausible robotic variations based on standard practices in the literature. In the revised manuscript, we will expand the simulation protocol section with additional justification, parameter details, and references to typical pose variation statistics from robotic navigation studies. This constitutes a partial revision that clarifies the scope of the claims without altering the benchmark's simulation-based nature.
revision: partial
-
Referee: The depth calibration protocol applies supervised learned scaling factors per model to convert relative to metric depth. Because these factors are fitted with ground-truth supervision on the same data used for evaluation, they risk introducing model-specific bias that could inflate or deflate relative performance; the paper does not report an ablation or cross-validation of this choice.
Authors: We appreciate the referee noting this potential limitation in the calibration protocol. The per-model scaling factors enable fair metric comparisons of relative depth outputs without requiring model retraining, following common practice in depth estimation benchmarks. To address the risk of bias from fitting on evaluation data, we will incorporate an ablation study in the revision that uses cross-validation (learning scaling factors on training folds and evaluating on held-out splits) and report the resulting performance metrics and rankings. This will demonstrate the robustness of our findings.
revision: yes
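The cross-validated calibration the authors propose can be sketched in a few lines. The code below is a minimal illustration under assumed conventions, not the paper's released protocol: it fits one least-squares scaling factor per model on a calibration fold and reports AbsRel on a held-out fold, so the factor is never fitted on the images it is scored on. The array shapes and synthetic data are stand-ins.

```python
# Minimal sketch (assumptions, not the paper's protocol) of supervised
# scaling-factor calibration with a held-out evaluation split.
import numpy as np

def fit_scale(pred_rel: np.ndarray, gt_metric: np.ndarray) -> float:
    """Least-squares scale s minimizing ||s * pred - gt||^2 over valid pixels."""
    mask = (gt_metric > 0) & (pred_rel > 0)
    p, g = pred_rel[mask], gt_metric[mask]
    return float((p * g).sum() / (p * p).sum())

def abs_rel(pred_metric: np.ndarray, gt_metric: np.ndarray) -> float:
    """Mean absolute relative depth error over valid pixels."""
    mask = gt_metric > 0
    return float(np.mean(np.abs(pred_metric[mask] - gt_metric[mask]) / gt_metric[mask]))

rng = np.random.default_rng(0)
gt = rng.uniform(0.5, 10.0, size=(100, 64, 128))    # stand-in metric depth maps
pred = 0.37 * gt + rng.normal(0, 0.1, gt.shape)     # stand-in relative predictions
pred = np.clip(pred, 1e-3, None)

# Learn the scaling factor on a calibration fold, score on held-out images,
# so the supervised factor is decoupled from the evaluation data.
calib, test = slice(0, 50), slice(50, 100)
s = fit_scale(pred[calib], gt[calib])
print(f"scale={s:.3f}  AbsRel(held-out)={abs_rel(s * pred[test], gt[test]):.4f}")
```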
Circularity Check
No circularity: purely empirical benchmark paper
Full rationale
The paper introduces Sphere-Depth as a public benchmark to evaluate existing monocular depth models (Depth Anything, Depth Anywhere, ACDNet, Bifuse++, SliceNet) on equirectangular images under simulated pose perturbations. It proposes a calibration protocol using supervised scaling factors for metric conversion and reports experimental degradation results. No mathematical derivations, equations, predictions, or self-citations are present that would reduce any claim to its own fitted inputs or to prior work by the same authors by construction. The central findings rest on direct empirical comparisons against external benchmarks and on the public data release.
Axiom & Free-Parameter Ledger
free parameters (1)
- scaling factors: learned per model
axioms (1)
- domain assumption: Simulated pose perturbations represent real-world unintentional camera variations on robotic platforms
Reference graph
Works this paper leans on
- [1] Ai, H., Cao, Z., Cao, Y.P., Shan, Y., Wang, L.: Hrdfuse: Monocular 360° depth estimation by collaboratively learning holistic-with-regional depth distributions. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 13273–13282 (2023)
- [2] Ai, H., Wang, L.: Elite360d: Towards efficient 360 depth estimation via semantic- and distance-aware bi-projection fusion. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 9926–9935 (2024)
- [3] Albanis, G., Zioulis, N., Drakoulis, P., Gkitsas, V., Sterzentsenko, V., Alvarez, F., Zarpalas, D., Daras, P.: Pano3d: A holistic benchmark and a solid baseline for 360° depth estimation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 3727–3737 (2021)
- [4] Athwale, A., Afrasiyabi, A., Lagüe, J., Shili, I., Ahmad, O., Lalonde, J.F.: Darswin: Distortion aware radial swin transformer. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 5929–5938 (2023)
- [5] Cheng, H.T., Chao, C.H., Dong, J.D., Wen, H.K., Liu, T.L., Sun, M.: Cube padding for weakly-supervised saliency prediction in 360 videos. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 1420–1429 (2018)
- [6] Cho, E., Kim, H., Kim, P., Lee, H.: Obstacle avoidance of a UAV using fast monocular depth estimation for a wide stereo camera. IEEE Transactions on Industrial Electronics (2024)
- [7] Cohen, T.S., Geiger, M., Köhler, J., Welling, M.: Spherical CNNs. arXiv preprint arXiv:1801.10130 (2018)
- [8] Eder, M., Shvets, M., Lim, J., Frahm, J.M.: Tangent images for mitigating spherical distortion. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 12426–12434 (2020)
- [9] Eigen, D., Puhrsch, C., Fergus, R.: Depth map prediction from a single image using a multi-scale deep network. Advances in Neural Information Processing Systems 27 (2014)
- [10] Jaisawal, P.K., Papakonstantinou, S., Gollnick, V.: Airfisheye dataset: A multi-model fisheye dataset for UAV applications. In: 2024 IEEE International Conference on Robotics and Automation (ICRA). pp. 11818–11824. IEEE (2024)
- [11] Jiang, H., Sheng, Z., Zhu, S., Dong, Z., Huang, R.: Unifuse: Unidirectional fusion for 360 panorama depth estimation. IEEE Robotics and Automation Letters 6(2), 1519–1526 (2021)
- [12] Laina, I., Rupprecht, C., Belagiannis, V., Tombari, F., Navab, N.: Deeper depth prediction with fully convolutional residual networks. In: 2016 Fourth International Conference on 3D Vision (3DV). pp. 239–248. IEEE (2016)
- [13] Li, Y., Guo, Y., Yan, Z., Huang, X., Duan, Y., Ren, L.: Omnifusion: 360 monocular depth estimation via geometry-aware fusion. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 2801–2810 (2022)
- [14] Mohadikar, P., Duan, Y.: Omnidiffusion: Reformulating 360 monocular depth estimation using semantic and surface normal conditioned diffusion. In: 2025 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV). pp. 8068–8078. IEEE (2025)
- [15] Park, C., Kim, H., Jang, J., Paik, J.: Odd-m3d: Object-wise dense depth estimation for monocular 3D object detection. IEEE Transactions on Consumer Electronics (2024)
- [16] Patni, S., Agarwal, A., Arora, C.: Ecodepth: Effective conditioning of diffusion models for monocular depth estimation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 28285–28295 (2024)
- [17] Pintore, G., Agus, M., Almansa, E., Schneider, J., Gobbetti, E.: Slicenet: Deep dense depth estimation from a single indoor panorama using a slice-based representation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 11536–11545 (2021)
- [18] Su, Y.C., Jayaraman, D., Grauman, K.: Pano2vid: Automatic cinematography for watching 360 videos. In: Asian Conference on Computer Vision. pp. 154–171. Springer (2016)
- [19] Wang, F.E., Yeh, Y.H., Tsai, Y.H., Chiu, W.C., Sun, M.: Bifuse++: Self-supervised and efficient bi-projection fusion for 360 depth estimation. IEEE Transactions on Pattern Analysis and Machine Intelligence 45(5), 5448–5460 (2022)
- [20] Wang, H., Zhang, X., Chen, Z., Jun, L., Liu, H.: Pddepth: Pose decoupled monocular depth estimation for roadside perception system. IEEE Transactions on Circuits and Systems for Video Technology (2025)
- [21] Wang, N.H.A., Liu, Y.L.: Depth anywhere: Enhancing 360 monocular depth estimation via perspective distillation and unlabeled data augmentation. Advances in Neural Information Processing Systems 37, 127739–127764 (2024)
- [22] Yang, L., Kang, B., Huang, Z., Zhao, Z., Xu, X., Feng, J., Zhao, H.: Depth anything v2. Advances in Neural Information Processing Systems 37, 21875–21911 (2024)
- [23] Yoon, Y., Chung, I., Wang, L., Yoon, K.J.: Spheresr: 360° image super-resolution with arbitrary projection via continuous spherical image representation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 5677–5686 (2022)
- [24] Zheng, J., Lin, C., Sun, J., Zhao, Z., Li, Q., Shen, C.: Physical 3D adversarial attacks against monocular depth estimation in autonomous driving. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 24452–24461 (2024)
- [25] Zhuang, C., Lu, Z., Wang, Y., Xiao, J., Wang, Y.: Acdnet: Adaptively combined dilated convolution for monocular panorama depth estimation. In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 36, pp. 3653–3661 (2022)
- [26] Zioulis, N., Karakottas, A., Zarpalas, D., Daras, P.: Omnidepth: Dense depth estimation for indoors spherical panoramas. In: Proceedings of the European Conference on Computer Vision (ECCV). pp. 448–465 (2018)