arxiv: 2605.14963 · v2 · pith:FWQVODMSnew · submitted 2026-05-14 · 💻 cs.CV

H-OmniStereo: Zero-Shot Omnidirectional Stereo Matching with Heading-Aligned Normal Priors

Pith reviewed 2026-05-20 20:45 UTC · model grok-4.3

classification 💻 cs.CV
keywords stereo · matching · priors · datasets · equirectangular · h-omnistereo · monocular · omnidirectional

The pith

Heading-aligned normal priors enable accurate zero-shot omnidirectional stereo matching

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper introduces H-OmniStereo to perform stereo matching on top-bottom equirectangular omnidirectional images without dedicated training on real omnidirectional stereo data. It creates a large synthetic dataset of over 2.8 million pairs and develops a monocular normal estimator that operates in a heading-aligned coordinate system to generate geometric priors robust to spherical distortions. These priors help establish reliable correspondences by providing cross-view consistent information, allowing the use of perspective stereo methods on distorted images. The result is a single model that outperforms existing methods on unseen datasets and works on real consumer cameras for full-surround perception.

Core claim

By training in a heading-aligned coordinate system, the equirectangular monocular normal estimator produces distortion-robust and cross-view-consistent geometric priors that boost the performance of stereo matching on omnidirectional images, enabling zero-shot generalization after training solely on synthetic data.

What carries the argument

The heading-aligned monocular normal estimator that supplies geometric priors for stereo correspondence in equirectangular projections.
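
As a rough illustration of what a heading-aligned normal representation could look like, the sketch below rotates each camera-frame normal about the vertical axis by the negative of its pixel's longitude. This is one plausible reading of the term, not the paper's confirmed definition: the z-up axis convention and the names `rotate_about_vertical` and `to_heading_aligned` are assumptions made for illustration. Under this reading, yawing the camera leaves the encoded normal of a given surface point unchanged.

```python
import math


def rotate_about_vertical(vec, angle):
    """Rotate a 3-vector about the z (up) axis by `angle` radians."""
    x, y, z = vec
    c, s = math.cos(angle), math.sin(angle)
    return (c * x - s * y, s * x + c * y, z)


def to_heading_aligned(normal_cam, longitude):
    """Express a camera-frame normal in a local frame whose forward axis
    points along the pixel's longitude (assumed convention, not the paper's)."""
    return rotate_about_vertical(normal_cam, -longitude)


# A wall normal observed at longitude theta in the original camera frame.
normal = (-1.0, 0.0, 0.0)
theta, yaw = 0.3, 1.1
before = to_heading_aligned(normal, theta)

# Yawing the camera by `yaw` rotates the normal by -yaw in the camera frame
# and moves the same surface point to longitude theta - yaw.
after = to_heading_aligned(rotate_about_vertical(normal, -yaw), theta - yaw)

print(before)
print(after)
```

Here `before` and `after` agree up to floating-point error, which is the kind of invariance that would make such a prior insensitive to horizontal shifts and crops of the panorama.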

Load-bearing premise

The monocular normal estimator trained in heading-aligned coordinates will yield priors that are robust to distortions and consistent across views, reliably aiding stereo matching on real data without adaptation.

What would settle it

Observing that the predicted normals from left and right views of the same scene are inconsistent or that adding the normal priors does not increase matching accuracy over a baseline stereo matcher on real omnidirectional test images.

Figures

Figures reproduced from arXiv: 2605.14963 by Chenxing Jiang, Chuan Fang, Peize Liu, Ping Tan, Pusen Gao, Shaojie Shen, Yang Xu, Zhe Tong.

Figure 1: Left: Spherical disparity under top-bottom camera pairs. Right: Dense 3D reconstruction and trajectory estimation (red line) achieved by integrating our proposed single-pair omnidirectional stereo matching model into a visual odometry pipeline. By processing sequential top-bottom equirectangular stereo pairs, our method enables complete environment recovery. The ceiling of the point cloud is cropped for be…
Figure 2: The sphere illustrates the coordinate frames defining pixel normals.
Figure 3: The pipeline of H-OmniStereo.
… panoramic stereo pairs with corresponding ground-truth depth and surface normal maps. For each sample, we randomize the rig’s baseline (0.05 m to 0.5 m) and pose (roll and pitch: −45° to 45°), utilizing a capsule-shaped collision proxy to ensure collision-free placement among surrounding objects. As illustrated in …
Figure 4: Samples from our synthetic omnidirectional stereo dataset, featuring fixed layouts from GRUtopia [34] and HM3D [35], as well as randomized layouts …
Figure 5: Qualitative disparity comparison against our two most competitive baselines, 360SD-Net [2] and MODE [9], using their official checkpoints. For …
Figure 6: Qualitative point cloud comparison on real-world images [2]. Points beyond 10 m are removed, causing missing regions in some of the baseline results.
Figure 7: Qualitative results of point clouds derived from our predicted disparity.
Figure 8: Qualitative surface normal comparisons with PanoNormal [32] on the …
Figure 9: Qualitative results of our omnidirectional stereo odometry: estimated …
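
The rig randomization described in the text captured with Figure 3 can be sketched as follows. The ranges come from that text; uniform sampling, the fixed seed, and the function name `sample_rig` are assumptions, and the capsule-shaped collision check is omitted.

```python
import random


def sample_rig(rng):
    """Draw one stereo-rig configuration (uniform sampling is an assumption)."""
    return {
        "baseline_m": rng.uniform(0.05, 0.5),   # baseline range from the text
        "roll_deg": rng.uniform(-45.0, 45.0),   # roll range from the text
        "pitch_deg": rng.uniform(-45.0, 45.0),  # pitch range from the text
    }


rng = random.Random(0)  # fixed seed so the draw is repeatable
rig = sample_rig(rng)
print(rig)
```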
Original abstract

Stereo matching on top-bottom equirectangular images provides an effective framework for full-surround perception, as vertically aligned epipolar lines enable the use of advanced perspective stereo architectures that are largely driven by large-scale datasets and monocular priors. However, the performance of such adaptations is severely limited by the scarcity of omnidirectional stereo datasets and the degradation of perspective monocular priors under spherical distortions. To address these challenges, we propose H-OmniStereo, a zero-shot omnidirectional stereo matching framework. First, we construct a high-quality synthetic dataset comprising over 2.8 million top-bottom equirectangular stereo pairs to scale up training. Second, we introduce an equirectangular monocular normal estimator, specifically operating in a heading-aligned coordinate system. Beyond providing distortion-robust and cross-view-consistent geometric priors for establishing reliable correspondences in stereo matching, this design boosts training efficiency and accommodates train-test FoV mismatches. Extensive experiments show that our approach achieves higher accuracy than existing methods on out-of-domain datasets and successfully generalizes to real-world consumer camera setups using a single model. The model and dataset will be released at https://github.com/JIANG-CX/H-OmniStereo.
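
To make the top-bottom geometry concrete, here is a minimal sketch of how an angular disparity along a vertical epipolar line maps to range via the law of sines. It assumes the bottom camera is the reference, polar angles are measured from the zenith, and disparity is the top-view polar angle minus the bottom-view polar angle. The paper's own sign and reference conventions may differ, and the function names are hypothetical.

```python
import math


def row_to_polar(v, height):
    """Polar angle (0 at zenith, pi at nadir) at the center of image row v."""
    return (v + 0.5) / height * math.pi


def range_from_disparity(polar_bottom, disparity, baseline):
    """Range from the bottom camera to the point, by the law of sines.

    The triangle is (bottom camera, top camera, point); the angle at the
    point equals the angular disparity, and the side opposite it is the
    baseline.
    """
    polar_top = polar_bottom + disparity
    return baseline * math.sin(polar_top) / math.sin(disparity)


# Point 1 m away at the bottom camera's height, with a 1 m vertical baseline:
# the bottom view sees it on the horizon, the top view sees it 45 degrees lower.
r = range_from_disparity(math.pi / 2, math.pi / 4, 1.0)
print(r)
```

The range grows without bound as the disparity approaches zero, so a practical pipeline would clamp or mask very small disparities.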

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The manuscript presents H-OmniStereo, a zero-shot omnidirectional stereo matching framework for top-bottom equirectangular images. It constructs a synthetic dataset of over 2.8 million pairs and introduces an equirectangular monocular normal estimator in a heading-aligned coordinate system to supply distortion-robust, cross-view-consistent geometric priors for stereo correspondence. The work claims superior accuracy over existing methods on out-of-domain datasets and successful generalization to real-world consumer camera setups with a single model, with the model and dataset to be released.

Significance. If the central claims are substantiated, the framework could advance omnidirectional perception by allowing perspective stereo networks to operate effectively on equirectangular data without domain-specific real-world training. The scale of the synthetic dataset and the heading-aligned normal prior design address key bottlenecks in dataset scarcity and distortion handling.

major comments (1)
  1. Abstract and Method: The claim that the heading-aligned monocular normal estimator yields cross-view-consistent priors on real top-bottom pairs without adaptation is load-bearing for the zero-shot generalization result. No quantitative consistency metric (such as mean angular error between predicted normals on overlapping 3D points after coordinate transformation) or ablation isolating the prior's contribution from the 2.8 M synthetic pairs is reported, leaving open the possibility that performance gains derive primarily from dataset scale rather than the proposed geometric design.
minor comments (1)
  1. Abstract: The statement that 'extensive experiments show higher accuracy' would be strengthened by including at least one representative quantitative metric (e.g., EPE or D1) even in the abstract.
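
For reference, the two metrics named above can be computed as in this sketch. It assumes disparities in pixels and the KITTI-style D1 rule (an outlier has error above 3 px and above 5% of the ground truth); the paper may define its thresholds or units differently for equirectangular images.

```python
def epe(pred, gt):
    """End-point error: mean absolute disparity error."""
    errors = [abs(p - g) for p, g in zip(pred, gt)]
    return sum(errors) / len(errors)


def d1(pred, gt, abs_thresh=3.0, rel_thresh=0.05):
    """Fraction of outliers under the assumed KITTI-style rule."""
    bad = sum(
        1
        for p, g in zip(pred, gt)
        if abs(p - g) > abs_thresh and abs(p - g) > rel_thresh * abs(g)
    )
    return bad / len(gt)


pred = [10.0, 20.0, 30.0, 50.0]
gt = [10.0, 21.0, 35.0, 40.0]
print(epe(pred, gt), d1(pred, gt))
```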

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address the major comment below and will strengthen the paper accordingly.

Point-by-point responses
  1. Referee: Abstract and Method: The claim that the heading-aligned monocular normal estimator yields cross-view-consistent priors on real top-bottom pairs without adaptation is load-bearing for the zero-shot generalization result. No quantitative consistency metric (such as mean angular error between predicted normals on overlapping 3D points after coordinate transformation) or ablation isolating the prior's contribution from the 2.8 M synthetic pairs is reported, leaving open the possibility that performance gains derive primarily from dataset scale rather than the proposed geometric design.

    Authors: We agree that an explicit quantitative consistency metric and an ablation isolating the heading-aligned normal prior would strengthen the evidence for its role in zero-shot generalization. In the revised manuscript we will add both: (1) a consistency evaluation reporting mean angular error between normals predicted on corresponding 3D points from the two views after coordinate transformation, computed on held-out synthetic pairs and on real top-bottom captures; and (2) an ablation that trains the stereo matcher on the identical 2.8 M synthetic dataset with and without the normal prior. These additions will directly address whether the reported gains derive from the geometric design rather than dataset scale alone. revision: yes
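
A minimal sketch of the proposed consistency metric, assuming corresponding normals have already been paired across the two views and that a known rotation `R_a_from_b` maps view-b vectors into view-a coordinates. The function names and the pairing step are assumptions; the rebuttal does not specify an implementation.

```python
import math


def apply_rotation(R, v):
    """Multiply a 3x3 rotation matrix (nested lists) by a 3-vector."""
    return tuple(sum(R[i][j] * v[j] for j in range(3)) for i in range(3))


def angular_error_deg(n1, n2):
    """Angle in degrees between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(n1, n2))
    norm = math.sqrt(sum(a * a for a in n1)) * math.sqrt(sum(b * b for b in n2))
    cos_angle = max(-1.0, min(1.0, dot / norm))  # guard against rounding
    return math.degrees(math.acos(cos_angle))


def mean_angular_error_deg(normals_a, normals_b, R_a_from_b):
    """Mean angle between view-a normals and view-b normals rotated into view a."""
    errors = [
        angular_error_deg(na, apply_rotation(R_a_from_b, nb))
        for na, nb in zip(normals_a, normals_b)
    ]
    return sum(errors) / len(errors)


identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
normals_a = [(0.0, 0.0, 1.0), (1.0, 0.0, 0.0)]
normals_b = [(0.0, 0.0, 1.0), (0.0, 1.0, 0.0)]
mean_err = mean_angular_error_deg(normals_a, normals_b, identity)
print(mean_err)
```

In this toy case the first pair agrees exactly and the second is perpendicular, so the mean error is 45 degrees up to rounding.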

Circularity Check

0 steps flagged

No significant circularity detected

Full rationale

The paper constructs a synthetic dataset of over 2.8 million top-bottom equirectangular pairs and trains an equirectangular monocular normal estimator in a heading-aligned coordinate system to supply geometric priors for stereo matching. No equations or steps in the abstract reduce any claimed prediction or result to a fitted parameter or self-citation by construction; the central claims rest on externally generated data and standard monocular estimation techniques applied to out-of-domain and real-world setups. The derivation chain remains self-contained against external benchmarks with no load-bearing self-referential reductions.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim depends on the representativeness of the synthetic dataset and the utility of the heading-aligned normal prior for real-world generalization.

axioms (1)
  • domain assumption: Synthetic top-bottom equirectangular stereo pairs can be generated at sufficient scale and realism to support zero-shot transfer to real consumer camera data.
    Invoked when claiming generalization without real-data fine-tuning.

pith-pipeline@v0.9.0 · 5761 in / 1219 out tokens · 40979 ms · 2026-05-20T20:45:26.269094+00:00 · methodology

Reference graph

Works this paper leans on

42 extracted references · 42 canonical work pages · 3 internal anchors

  1. [1]

    Helvipad: A real-world dataset for omnidirectional stereo depth estimation,

    M. Zayene, J. Endres, A. Havolli, C. Corbière, S. Cherkaoui, A. Kontouli, and A. Alahi, “Helvipad: A real-world dataset for omnidirectional stereo depth estimation,” in Proceedings of the Computer Vision and Pattern Recognition Conference, 2025, pp. 26 975–26 984

  2. [2]

    360SD-Net: 360 stereo depth estimation with learnable cost volume,

    N.-H. Wang, B. Solarte, Y.-H. Tsai, W.-C. Chiu, and M. Sun, “360SD-Net: 360 stereo depth estimation with learnable cost volume,” in 2020 IEEE International Conference on Robotics and Automation. IEEE, 2020, pp. 582–588

  3. [3]

    Boosting omnidirectional stereo matching with a pre-trained depth foundation model,

    J. Endres, O. Hahn, C. Corbière, S. Schaub-Meyer, S. Roth, and A. Alahi, “Boosting omnidirectional stereo matching with a pre-trained depth foundation model,” in 2025 IEEE/RSJ International Conference on Intelligent Robots and Systems. IEEE, 2025, pp. 15 111–15 118

  4. [4]

    Panorama: The rise of omnidirectional vision in the embodied ai era,

    X. Zheng, C. Liao, Z. Weng, K. Lei, Z. Dongfang, H. He, Y. Lyu, L. Jiang, L. Qi, L. Chen, D. P. Paudel, K. Yang, L. Zhang, L. V. Gool, and X. Hu, “Panorama: The rise of omnidirectional vision in the embodied ai era,” Sept. 2025. [Online]. Available: http://arxiv.org/abs/2509.12989

  5. [5]

    Foundationstereo: Zero-shot stereo matching,

    B. Wen, M. Trepte, J. Aribido, J. Kautz, O. Gallo, and S. Birchfield, “Foundationstereo: Zero-shot stereo matching,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2025, pp. 5249–5260

  6. [6]

    Monster: Marry monodepth to stereo unleashes power,

    J. Cheng, L. Liu, G. Xu, X. Wang, Z. Zhang, Y. Deng, J. Zang, Y. Chen, Z. Cai, and X. Yang, “Monster: Marry monodepth to stereo unleashes power,” in Proceedings of the Computer Vision and Pattern Recognition Conference, 2025, pp. 6273–6282

  7. [7]

    Fast-FoundationStereo: Real-time zero-shot stereo matching,

    B. Wen, S. Dewan, and S. Birchfield, “Fast-FoundationStereo: Real-time zero-shot stereo matching,” Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2026

  8. [8]

    Depth anything v2,

    L. Yang, B. Kang, Z. Huang, Z. Zhao, X. Xu, J. Feng, and H. Zhao, “Depth anything v2,” Advances in Neural Information Processing Systems, vol. 37, pp. 21 875–21 911, 2024

  9. [9]

    Mode: Multi-view omnidirectional depth estimation with 360° cameras

    M. Li, X. Jin, X. Hu, J. Dai, and S. Du, “Mode: Multi-view omnidirectional depth estimation with 360° cameras.”

  10. [10]

    Panda: Towards panoramic depth anything with unlabeled panoramas and mobius spatial augmentation,

    Z. Cao, J. Zhu, W. Zhang, H. Ai, H. Bai, H. Zhao, and L. Wang, “Panda: Towards panoramic depth anything with unlabeled panoramas and mobius spatial augmentation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025, pp. 982–992

  11. [11]

    DA²: Depth anything in any direction,

    H. Li, W. Zheng, J. He, Y. Liu, and X. Lin, “DA²: Depth anything in any direction,” Sept. 2025. [Online]. Available: http://arxiv.org/abs/2509.26618

  12. [12]

    Depth any panoramas: A foundation model for panoramic depth estimation,

    X. Lin, M. Song, D. Zhang, W. Lu, H. Li, B. Du, M.-H. Yang, T. Nguyen, and L. Qi, “Depth any panoramas: A foundation model for panoramic depth estimation,” arXiv preprint arXiv:2512.16913, 2025

  13. [13]

    Depth any camera: Zero-shot metric depth estimation from any camera,

    Y. Guo, S. Garg, S. M. H. Miangoleh, X. Huang, and L. Ren, “Depth any camera: Zero-shot metric depth estimation from any camera,” in Proceedings of the Computer Vision and Pattern Recognition Conference, 2025, pp. 26 996–27 006

  14. [14]

    Spherical view synthesis for self-supervised 360° depth estimation,

    N. Zioulis, A. Karakottas, D. Zarpalas, F. Alvarez, and P. Daras, “Spherical view synthesis for self-supervised 360° depth estimation,” in International Conference on 3D Vision, September 2019

  15. [15]

    Structured3d: A large photo-realistic dataset for structured 3d modeling,

    J. Zheng, J. Zhang, J. Li, R. Tang, S. Gao, and Z. Zhou, “Structured3d: A large photo-realistic dataset for structured 3d modeling,” in European Conference on Computer Vision. Springer, 2020, pp. 519–535

  16. [16]

    Tartanair: A dataset to push the limits of visual slam,

    W. Wang, D. Zhu, X. Wang, Y. Hu, Y. Qiu, C. Wang, Y. Hu, A. Kapoor, and S. Scherer, “Tartanair: A dataset to push the limits of visual slam,” in 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems. IEEE, 2020, pp. 4909–4916

  17. [17]

    Spatialgen: Layout-guided 3d indoor scene generation,

    C. Fang, H. Li, Y. Liang, J. Zheng, Y. Mao, Y. Liu, R. Tang, Z. Zhou, and P. Tan, “Spatialgen: Layout-guided 3d indoor scene generation,” arXiv preprint arXiv:2509.14981, vol. 3, 2025

  18. [18]

    360 surface regression with a hyper-sphere loss,

    A. Karakottas, N. Zioulis, S. Samaras, D. Ataloglou, V. Gkitsas, D. Zarpalas, and P. Daras, “360 surface regression with a hyper-sphere loss,” in 2019 International Conference on 3D Vision. IEEE, 2019, pp. 258–268

  19. [19]

    End-to-end learning of geometry and context for deep stereo regression,

    A. Kendall, H. Martirosyan, S. Dasgupta, P. Henry, R. Kennedy, A. Bachrach, and A. Bry, “End-to-end learning of geometry and context for deep stereo regression,” in Proceedings of the IEEE international conference on computer vision, 2017, pp. 66–75

  20. [20]

    Pyramid stereo matching network,

    J.-R. Chang and Y.-S. Chen, “Pyramid stereo matching network,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 5410–5418

  21. [21]

    Raft-stereo: Multilevel recurrent field transforms for stereo matching,

    L. Lipson, Z. Teed, and J. Deng, “Raft-stereo: Multilevel recurrent field transforms for stereo matching,” in 2021 International conference on 3D vision. IEEE, 2021, pp. 218–227

  22. [22]

    Iterative geometry encoding volume for stereo matching,

    G. Xu, X. Wang, X. Ding, and X. Yang, “Iterative geometry encoding volume for stereo matching,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2023, pp. 21 919–21 928

  23. [23]

    S2m2: Scalable stereo matching model for reliable depth estimation,

    J. Min, Y. Jeon, J. Kim, and M. Choi, “S2m2: Scalable stereo matching model for reliable depth estimation,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2025, pp. 26 729–26 739

  24. [24]

    Sweepnet: Wide-baseline omnidirectional depth estimation,

    C. Won, J. Ryu, and J. Lim, “Sweepnet: Wide-baseline omnidirectional depth estimation,” in 2019 International Conference on Robotics and Automation. IEEE, 2019, pp. 6073–6079

  25. [25]

    Omnimvs: End-to-end learning for omnidirectional stereo matching,

    ——, “Omnimvs: End-to-end learning for omnidirectional stereo matching,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 8987–8996

  26. [26]

    S-omnimvs: Incorporating sphere geometry into omnidirectional stereo matching,

    Z. Chen, C. Lin, L. Nie, Z. Shen, K. Liao, Y. Cao, and Y. Zhao, “S-omnimvs: Incorporating sphere geometry into omnidirectional stereo matching,” in Proceedings of the 31st ACM International Conference on Multimedia, 2023, pp. 1495–1503

  27. [27]

    Romnistereo: Recurrent omnidirectional stereo matching,

    H. Jiang, R. Xu, M. Tan, and W. Jiang, “Romnistereo: Recurrent omnidirectional stereo matching,” IEEE Robotics and Automation Letters, vol. 9, no. 3, pp. 2511–2518, 2024

  28. [28]

    Depth anywhere: Enhancing 360 monocular depth estimation via perspective distillation and unlabeled data augmentation,

    N.-H. Wang and Y.-L. Liu, “Depth anywhere: Enhancing 360 monocular depth estimation via perspective distillation and unlabeled data augmentation,” Advances in Neural Information Processing Systems, vol. 37, pp. 127 739–127 764, 2024

  29. [29]

    Unik3d: Universal camera monocular 3d estimation,

    L. Piccinelli, C. Sakaridis, M. Segu, Y.-H. Yang, S. Li, W. Abbeloos, and L. Van Gool, “Unik3d: Universal camera monocular 3d estimation,” in Proceedings of the Computer Vision and Pattern Recognition Conference, 2025, pp. 1028–1039

  30. [30]

    Unifuse: Unidirectional fusion for 360 panorama depth estimation,

    H. Jiang, Z. Sheng, S. Zhu, Z. Dong, and R. Huang, “Unifuse: Unidirectional fusion for 360 panorama depth estimation,” IEEE Robotics and Automation Letters, vol. 6, no. 2, pp. 1519–1526, 2021

  31. [31]

    Omnifusion: 360 monocular depth estimation via geometry-aware fusion,

    Y. Li, Y. Guo, Z. Yan, X. Huang, Y. Duan, and L. Ren, “Omnifusion: 360 monocular depth estimation via geometry-aware fusion,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 2801–2810

  32. [32]

    Panonormal: Monocular indoor 360° surface normal estimation,

    K. Huang, F. Zhang, and N. Dodgson, “Panonormal: Monocular indoor 360° surface normal estimation,” May 2024. [Online]. Available: http://arxiv.org/abs/2405.18745

  33. [33]

    Panoformer: panorama transformer for indoor 360° depth estimation,

    Z. Shen, C. Lin, K. Liao, L. Nie, Z. Zheng, and Y. Zhao, “Panoformer: panorama transformer for indoor 360° depth estimation,” in European Conference on Computer Vision. Springer, 2022, pp. 195–211

  34. [34]

    Grutopia: Dream general robots in a city at scale,

    H. Wang, J. Chen, W. Huang, Q. Ben, T. Wang, B. Mi, T. Huang, S. Zhao, Y. Chen, S. Yang, et al., “Grutopia: Dream general robots in a city at scale,” arXiv preprint arXiv:2407.10943, 2024

  35. [35]

    Habitat-Matterport 3D Dataset (HM3D): 1000 Large-scale 3D Environments for Embodied AI

    S. K. Ramakrishnan, A. Gokaslan, E. Wijmans, O. Maksymets, A. Clegg, J. Turner, E. Undersander, W. Galuba, A. Westbury, A. X. Chang, et al., “Habitat-matterport 3d dataset (hm3d): 1000 large-scale 3d environments for embodied ai,” arXiv preprint arXiv:2109.08238, 2021

  36. [36]

    Isaac Sim

    NVIDIA, “Isaac Sim.” [Online]. Available: https://github.com/isaac-sim/IsaacSim

  37. [37]

    Texverse: A universe of 3d objects with high-resolution textures,

    Y. Zhang, L. Zhang, R. Ma, and N. Cao, “Texverse: A universe of 3d objects with high-resolution textures,” arXiv preprint arXiv:2508.10868, 2025

  38. [38]

    DINOv2: Learning Robust Visual Features without Supervision

    M. Oquab, T. Darcet, T. Moutakanni, H. Vo, M. Szafraniec, V. Khalidov, P. Fernandez, D. Haziza, F. Massa, A. El-Nouby, et al., “Dinov2: Learning robust visual features without supervision,” arXiv preprint arXiv:2304.07193, 2023

  39. [39]

    Decoupled Weight Decay Regularization

    I. Loshchilov and F. Hutter, “Decoupled weight decay regularization,” arXiv preprint arXiv:1711.05101, 2017

  40. [40]

    Geometry-informed distance candidate selection for adaptive lightweight omnidirectional stereo vision with fisheye images,

    C. Pulling, J. H. Tan, Y. Hu, and S. Scherer, “Geometry-informed distance candidate selection for adaptive lightweight omnidirectional stereo vision with fisheye images,” in 2024 IEEE International Conference on Robotics and Automation. IEEE, 2024, pp. 12 255–12 261

  41. [41]

    Mac-vo: Metrics-aware covariance for learning-based stereo visual odometry,

    Y. Qiu, Y. Chen, Z. Zhang, W. Wang, and S. Scherer, “Mac-vo: Metrics-aware covariance for learning-based stereo visual odometry,” in 2025 IEEE International Conference on Robotics and Automation. IEEE, 2025, pp. 3803–3814

  42. [42]

    evo: Python package for the evaluation of odometry and slam

    M. Grupp, “evo: Python package for the evaluation of odometry and slam.” https://github.com/MichaelGrupp/evo, 2017