pith. machine review for the scientific record.

arxiv: 2604.23432 · v1 · submitted 2026-04-25 · 💻 cs.CV · cs.AI


Sphere-Depth: A Benchmark for Depth Estimation Methods with Varying Spherical Camera Orientations


Pith reviewed 2026-05-08 08:19 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords depth estimation · spherical images · equirectangular projection · camera pose · benchmark · 360 vision · robotic navigation

The pith

Depth estimation from spherical images degrades when camera pose deviates from the canonical orientation

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Sphere-Depth, a public benchmark that simulates camera pose perturbations on equirectangular images to test the robustness of monocular depth estimation models. It evaluates a perspective-based model and several spherical-aware models using a proposed calibration protocol that converts relative depth predictions to metric depths via supervised learned scaling factors per model. Experiments reveal substantial performance drops for all tested models, including those built for spherical inputs, when orientation varies from the standard pose.
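The summary above leaves the calibration step abstract. As a hedged sketch (the function names and the closed-form least-squares choice are assumptions for illustration, not the paper's stated procedure), a per-model scaling factor can be fit against metric ground truth and then used to score the calibrated predictions:

```python
import numpy as np

def fit_scale(d_rel: np.ndarray, d_gt: np.ndarray) -> float:
    """Least-squares scalar s minimizing ||s * d_rel - d_gt||^2.

    Closed form: s = <d_rel, d_gt> / <d_rel, d_rel>.
    """
    d_rel, d_gt = d_rel.ravel(), d_gt.ravel()
    return float(d_rel @ d_gt / (d_rel @ d_rel))

def abs_rel(d_rel: np.ndarray, d_gt: np.ndarray, scale: float) -> float:
    """AbsRel error on the calibrated (now metric) predictions."""
    return float(np.mean(np.abs(scale * d_rel - d_gt) / d_gt))

# toy check: if ground truth is exactly twice the relative prediction,
# the fitted scale recovers the factor and the calibrated error vanishes
rng = np.random.default_rng(0)
d_rel = rng.uniform(0.5, 1.0, size=(4, 8))
d_gt = 2.0 * d_rel
s = fit_scale(d_rel, d_gt)  # -> 2.0
```

One such factor per model lets scale-ambiguous relative-depth predictors be compared on metric error; AbsRel here stands in for whichever metrics the benchmark actually reports.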

Core claim

The Sphere-Depth benchmark demonstrates that monocular depth estimation models for spherical images, even those explicitly designed to handle equirectangular projections, exhibit substantial performance degradation when the camera pose deviates from the canonical orientation. Degradation is quantified by a depth-calibration error protocol that applies per-model supervised learned scaling factors, enabling fair comparison across relative-depth predictors.

What carries the argument

Sphere-Depth benchmark, which applies simulated pose perturbations to equirectangular images and uses supervised scaling-factor calibration to measure depth errors
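The benchmark's exact perturbation protocol is specified in the linked repository; a minimal sketch of the core operation, resampling an equirectangular image under a 3D camera rotation (nearest-neighbor sampling and the axis conventions are simplifying assumptions), might look like:

```python
import numpy as np

def rot_y(yaw: float) -> np.ndarray:
    """Rotation about the vertical (gravity) axis."""
    c, s = np.cos(yaw), np.sin(yaw)
    return np.array([[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]])

def rotate_erp(img: np.ndarray, R: np.ndarray) -> np.ndarray:
    """Resample an equirectangular image under camera rotation R.

    Nearest-neighbor for brevity; a real benchmark would interpolate.
    """
    h, w = img.shape[:2]
    jj, ii = np.meshgrid(np.arange(w), np.arange(h))
    lon = (jj + 0.5) / w * 2.0 * np.pi - np.pi    # longitude in (-pi, pi)
    lat = np.pi / 2.0 - (ii + 0.5) / h * np.pi    # latitude in (-pi/2, pi/2)
    # direction of each output pixel on the unit sphere (x right, y up, z fwd)
    v = np.stack([np.cos(lat) * np.sin(lon),
                  np.sin(lat),
                  np.cos(lat) * np.cos(lon)], axis=-1)
    v = v @ R.T                                   # rotate rays into the source frame
    lat_s = np.arcsin(np.clip(v[..., 1], -1.0, 1.0))
    lon_s = np.arctan2(v[..., 0], v[..., 2])
    j_s = np.floor((lon_s + np.pi) / (2.0 * np.pi) * w).astype(int) % w
    i_s = np.clip(np.floor((np.pi / 2.0 - lat_s) / np.pi * h).astype(int), 0, h - 1)
    return img[i_s, j_s]
```

With the identity rotation the image is reproduced exactly, and `rot_y(np.pi)` shifts it by half its width; small roll and pitch angles produce the curved distortions whose effect the benchmark measures.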

Load-bearing premise

The simulated camera pose perturbations accurately represent unintentional real-world variations on robotic platforms, and the supervised learned scaling factors provide a fair, unbiased way to compare relative-depth models.

What would settle it

Real-world tests on robotic platforms with measured unintentional pose variations that show no significant increase in depth error for spherical-aware models.

Figures

Figures reproduced from arXiv: 2604.23432 by Giuseppe Mazzola, Liliana Lo Presti, Marco La Cascia, Soulayma Gazzeh.

Figure 1: (a) planar image obtained by cubic projection; (b) gravity-aligned image …
Figure 2: Equirectangular depth map of the scene obtained by reprojecting six …
Figure 3: Error heatmaps (on the left) and error response curves (on the right) for …
Figure 4: Error heatmaps (on the left) and error response curves (on the right) for …
Figure 5: Qualitative results for ACDNet under conditions of no camera pose vari…
Figure 6: LOWESS curves describing error trend across equirectangular models.
original abstract

Reliable depth estimation from spherical images is crucial for 360° vision in robotic navigation and immersive scene understanding. However, the onboard spherical camera can experience unintentional pose variations in real-world robotic platforms that, along with the geometric distortions inherent in equirectangular projections, significantly impact the effectiveness of depth estimation. To study this issue, a novel public benchmark, called Sphere-Depth, is introduced to systematically evaluate the robustness of monocular depth estimation models from equirectangular images in a reproducible way. Camera pose perturbations are simulated and used to assess the performance of a popular perspective-based model, Depth Anything, and of spherical-aware models such as Depth Anywhere, ACDNet, Bifuse++, and SliceNet. Furthermore, to ensure meaningful evaluation across models, a depth calibration-based error protocol is proposed to convert predicted relative depth values into metric depth values using supervised learned scaling factors for each model. Experiments show that even models explicitly designed to process spherical images exhibit substantial performance degradation when variations in the camera pose are observed with respect to the canonical pose. The full benchmark, evaluation protocol, and dataset splits are made publicly available at: https://github.com/sgazzeh/Sphere_depth

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it. The pith above is the substance; this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces Sphere-Depth, a public benchmark for evaluating monocular depth estimation from equirectangular images under simulated camera pose perturbations. It assesses a perspective model (Depth Anything) alongside spherical-aware models (Depth Anywhere, ACDNet, Bifuse++, SliceNet), proposes a calibration protocol that converts relative depth predictions to metric depths via per-model supervised scaling factors, and reports that even spherical-designed models suffer substantial performance degradation when camera orientation deviates from the canonical pose. The benchmark, protocol, and splits are released publicly.

Significance. If the simulated perturbations prove representative of real robotic variations and the calibration protocol yields unbiased comparisons, the work would usefully highlight a robustness gap in spherical depth estimation for navigation and immersive applications. The public release of the full benchmark, evaluation protocol, and dataset splits is a clear strength that supports reproducibility and follow-on research.

major comments (2)
  1. The central claim that spherical-aware models degrade under pose variation rests on the fidelity of the simulated perturbations. The manuscript provides no direct empirical validation (e.g., comparison of simulated roll/pitch/yaw distributions against IMU recordings from actual robotic platforms on uneven terrain), so it is unclear whether the observed degradation reflects a general property or an artifact of the simulation protocol.
  2. The depth calibration protocol applies supervised learned scaling factors per model to convert relative to metric depth. Because these factors are fitted with ground-truth supervision on the same data used for evaluation, they risk introducing model-specific bias that could inflate or deflate relative performance; the paper does not report an ablation or cross-validation of this choice.
minor comments (1)
  1. The abstract states that the benchmark is 'reproducible' yet does not specify the exact ranges, sampling strategy, or correlation structure of the simulated pose perturbations; these details belong in the main text or supplementary material for independent reproduction.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback on our paper. We address each of the major comments point by point below, indicating where revisions will be made to improve the manuscript.

point-by-point responses
  1. Referee: The central claim that spherical-aware models degrade under pose variation rests on the fidelity of the simulated perturbations. The manuscript provides no direct empirical validation (e.g., comparison of simulated roll/pitch/yaw distributions against IMU recordings from actual robotic platforms on uneven terrain), so it is unclear whether the observed degradation reflects a general property or an artifact of the simulation protocol.

    Authors: We agree that direct empirical validation against real IMU recordings would provide stronger support for the representativeness of our simulations. Our benchmark is designed for controlled, reproducible evaluation of pose-induced degradation using parameterized perturbations, which enables systematic analysis across models and magnitudes. The roll, pitch, and yaw ranges were chosen to reflect plausible robotic variations based on standard practices in the literature. In the revised manuscript, we will expand the simulation protocol section with additional justification, parameter details, and references to typical pose variation statistics from robotic navigation studies. This constitutes a partial revision that clarifies the scope of the claims without altering the benchmark's simulation-based nature. revision: partial

  2. Referee: The depth calibration protocol applies supervised learned scaling factors per model to convert relative to metric depth. Because these factors are fitted with ground-truth supervision on the same data used for evaluation, they risk introducing model-specific bias that could inflate or deflate relative performance; the paper does not report an ablation or cross-validation of this choice.

    Authors: We appreciate the referee noting this potential limitation in the calibration protocol. The per-model scaling factors enable fair metric comparisons of relative depth outputs without requiring model retraining, following common practice in depth estimation benchmarks. To address the risk of bias from fitting on evaluation data, we will incorporate an ablation study in the revision that uses cross-validation (learning scaling factors on training folds and evaluating on held-out splits) and report the resulting performance metrics and rankings. This will demonstrate the robustness of our findings. revision: yes
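The cross-validation the rebuttal promises is straightforward to sketch. Assuming a least-squares per-model scale and AbsRel as the metric (both illustrative choices, not the paper's confirmed protocol), fitting on training folds and scoring held-out images would look like:

```python
import numpy as np

def fit_scale(d_rel: np.ndarray, d_gt: np.ndarray) -> float:
    """Least-squares scalar aligning relative predictions with metric depth."""
    d_rel, d_gt = d_rel.ravel(), d_gt.ravel()
    return float(d_rel @ d_gt / (d_rel @ d_rel))

def cv_abs_rel(preds: list, gts: list, k: int = 5, seed: int = 0) -> float:
    """Fit the scaling factor on k-1 folds of images, score AbsRel on the rest."""
    idx = np.random.default_rng(seed).permutation(len(preds))
    folds = np.array_split(idx, k)
    errs = []
    for f in range(k):
        train = np.concatenate([folds[g] for g in range(k) if g != f])
        s = fit_scale(np.concatenate([preds[t].ravel() for t in train]),
                      np.concatenate([gts[t].ravel() for t in train]))
        for t in folds[f]:
            errs.append(np.mean(np.abs(s * preds[t] - gts[t]) / gts[t]))
    return float(np.mean(errs))
```

A large gap between this held-out error and the fit-on-everything error would indicate exactly the bias the referee worries about; a small gap would support the current protocol.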

Circularity Check

0 steps flagged

No circularity: purely empirical benchmark paper

full rationale

The paper introduces Sphere-Depth as a public benchmark to evaluate existing monocular depth models (Depth Anything, Depth Anywhere, ACDNet, Bifuse++, SliceNet) on equirectangular images under simulated pose perturbations. It proposes a calibration protocol using supervised scaling factors for metric conversion and reports experimental degradation results. No mathematical derivations, equations, predictions, or self-citations are present that reduce any claim to fitted inputs or prior author work by construction. The central findings rest on direct empirical comparisons and data release, remaining self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim depends on the realism of simulated perturbations and the appropriateness of the per-model scaling calibration; these are not independently validated in the abstract.

free parameters (1)
  • scaling factors = learned per model
    Supervised learned scaling factors per model to convert relative depth predictions to metric depth for error computation.
axioms (1)
  • domain assumption Simulated pose perturbations represent real-world unintentional camera variations on robotic platforms
    Used to generate test cases for robustness evaluation.

pith-pipeline@v0.9.0 · 5522 in / 1152 out tokens · 82079 ms · 2026-05-08T08:19:01.815046+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

26 extracted references · 1 canonical work page

  1. [1]

    In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

    Ai, H., Cao, Z., Cao, Y.P., Shan, Y., Wang, L.: Hrdfuse: Monocular 360deg depth estimation by collaboratively learning holistic-with-regional depth distributions. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 13273–13282 (2023)

  2. [2]

In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

    Ai, H., Wang, L.: Elite360d: Towards efficient 360 depth estimation via semantic- and distance-aware bi-projection fusion. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 9926–9935 (2024)

  3. [3]

    In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

    Albanis, G., Zioulis, N., Drakoulis, P., Gkitsas, V., Sterzentsenko, V., Alvarez, F., Zarpalas, D., Daras, P.: Pano3d: A holistic benchmark and a solid baseline for 360deg depth estimation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 3727–3737 (2021)

  4. [4]

    In: Proceedings of the IEEE/CVF International Conference on Computer Vision

Athwale, A., Afrasiyabi, A., Lagüe, J., Shili, I., Ahmad, O., Lalonde, J.F.: Darswin: distortion aware radial swin transformer. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 5929–5938 (2023)

  5. [5]

Cheng, H.T., Chao, C.H., Dong, J.D., Wen, H.K., Liu, T.L., Sun, M.: Cube padding for weakly-supervised saliency prediction in 360 videos. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 1420–1429 (2018)

  6. [6]

    IEEE Transactions on Industrial Electronics (2024)

Cho, E., Kim, H., Kim, P., Lee, H.: Obstacle avoidance of a uav using fast monocular depth estimation for a wide stereo camera. IEEE Transactions on Industrial Electronics (2024)

  7. [7]

    Spherical CNNs

    Cohen, T.S., Geiger, M., Köhler, J., Welling, M.: Spherical cnns. arXiv preprint arXiv:1801.10130 (2018)

  8. [8]

    In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

Eder, M., Shvets, M., Lim, J., Frahm, J.M.: Tangent images for mitigating spherical distortion. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 12426–12434 (2020)

  9. [9]

Advances in neural information processing systems 27 (2014)

    Eigen, D., Puhrsch, C., Fergus, R.: Depth map prediction from a single image using a multi-scale deep network. Advances in neural information processing systems 27 (2014)

  10. [10]

    In: 2024 IEEE International Conference on Robotics and Automation (ICRA)

Jaisawal, P.K., Papakonstantinou, S., Gollnick, V.: Airfisheye dataset: A multi-model fisheye dataset for uav applications. In: 2024 IEEE International Conference on Robotics and Automation (ICRA). pp. 11818–11824. IEEE (2024)

  11. [11]

IEEE Robotics and Automation Letters 6(2), 1519–1526 (2021)

    Jiang, H., Sheng, Z., Zhu, S., Dong, Z., Huang, R.: Unifuse: Unidirectional fusion for 360 panorama depth estimation. IEEE Robotics and Automation Letters 6(2), 1519–1526 (2021)

  12. [12]

    In: 2016 Fourth international conference on 3D vision (3DV)

    Laina, I., Rupprecht, C., Belagiannis, V., Tombari, F., Navab, N.: Deeper depth prediction with fully convolutional residual networks. In: 2016 Fourth international conference on 3D vision (3DV). pp. 239–248. IEEE (2016)

  13. [13]

    In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

    Li, Y., Guo, Y., Yan, Z., Huang, X., Duan, Y., Ren, L.: Omnifusion: 360 monocular depth estimation via geometry-aware fusion. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 2801–2810 (2022)

  14. [14]

    In: 2025 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)

    Mohadikar, P., Duan, Y.: Omnidiffusion: Reformulating 360 monocular depth estimation using semantic and surface normal conditioned diffusion. In: 2025 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV). pp. 8068–8078. IEEE (2025)

  15. [15]

    IEEE Transactions on Consumer Electronics (2024)

    Park, C., Kim, H., Jang, J., Paik, J.: Odd-m3d: Object-wise dense depth estimation for monocular 3d object detection. IEEE Transactions on Consumer Electronics (2024)

  16. [16]

In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

    Patni, S., Agarwal, A., Arora, C.: Ecodepth: Effective conditioning of diffusion models for monocular depth estimation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 28285–28295 (2024)

  17. [17]

    In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

Pintore, G., Agus, M., Almansa, E., Schneider, J., Gobbetti, E.: Slicenet: deep dense depth estimation from a single indoor panorama using a slice-based representation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 11536–11545 (2021)

  18. [18]

    In: Asian Conference on Computer Vision

    Su, Y.C., Jayaraman, D., Grauman, K.: Pano2vid: Automatic cinematography for watching 360 videos. In: Asian Conference on Computer Vision. pp. 154–171. Springer (2016)

  19. [19]

IEEE transactions on pattern analysis and machine intelligence 45(5), 5448–5460 (2022)

    Wang, F.E., Yeh, Y.H., Tsai, Y.H., Chiu, W.C., Sun, M.: Bifuse++: Self-supervised and efficient bi-projection fusion for 360 depth estimation. IEEE transactions on pattern analysis and machine intelligence 45(5), 5448–5460 (2022)

  20. [20]

Wang, H., Zhang, X., Chen, Z., Jun, L., Liu, H.: Pddepth: Pose decoupled monocular depth estimation for roadside perception system. IEEE Transactions on Circuits and Systems for Video Technology (2025)

  21. [21]

Advances in Neural Information Processing Systems 37, 127739–127764 (2024)

    Wang, N.H.A., Liu, Y.L.: Depth anywhere: Enhancing 360 monocular depth estimation via perspective distillation and unlabeled data augmentation. Advances in Neural Information Processing Systems 37, 127739–127764 (2024)

  22. [22]

Advances in Neural Information Processing Systems 37, 21875–21911 (2024)

    Yang, L., Kang, B., Huang, Z., Zhao, Z., Xu, X., Feng, J., Zhao, H.: Depth anything v2. Advances in Neural Information Processing Systems 37, 21875–21911 (2024)

  23. [23]

In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

    Yoon, Y., Chung, I., Wang, L., Yoon, K.J.: Spheresr: 360deg image super-resolution with arbitrary projection via continuous spherical image representation. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 5677–5686 (2022)

  24. [24]

    In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

Zheng, J., Lin, C., Sun, J., Zhao, Z., Li, Q., Shen, C.: Physical 3d adversarial attacks against monocular depth estimation in autonomous driving. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 24452–24461 (2024)

  25. [25]

    In: Proceedings of the AAAI conference on artificial intelligence

    Zhuang, C., Lu, Z., Wang, Y., Xiao, J., Wang, Y.: Acdnet: Adaptively combined dilated convolution for monocular panorama depth estimation. In: Proceedings of the AAAI conference on artificial intelligence. vol. 36, pp. 3653–3661 (2022)

  26. [26]

In: Proceedings of the European Conference on Computer Vision (ECCV)

    Zioulis, N., Karakottas, A., Zarpalas, D., Daras, P.: Omnidepth: Dense depth estimation for indoors spherical panoramas. In: Proceedings of the European Conference on Computer Vision (ECCV). pp. 448–465 (2018)