H-OmniStereo: Zero-Shot Omnidirectional Stereo Matching with Heading-Aligned Normal Priors

Chenxing Jiang; Chuan Fang; Peize Liu; Ping Tan; Pusen Gao; Shaojie Shen; Yang Xu; Zhe Tong

arxiv: 2605.14963 · v2 · pith:FWQVODMSnew · submitted 2026-05-14 · 💻 cs.CV

Chenxing Jiang, Zhe Tong, Pusen Gao, Peize Liu, Yang Xu, Chuan Fang, Ping Tan, Shaojie Shen

Pith reviewed 2026-05-20 20:45 UTC · model grok-4.3

classification 💻 cs.CV

keywords: stereo · matching · priors · datasets · equirectangular · h-omnistereo · monocular · omnidirectional

The pith

Heading-aligned normal priors enable accurate zero-shot omnidirectional stereo matching

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper introduces H-OmniStereo to perform stereo matching on top-bottom equirectangular omnidirectional images without dedicated training on real omnidirectional stereo data. It creates a large synthetic dataset of over 2.8 million pairs and develops a monocular normal estimator that operates in a heading-aligned coordinate system to generate geometric priors robust to spherical distortions. These priors help establish reliable correspondences by providing cross-view consistent information, allowing the use of perspective stereo methods on distorted images. The result is a single model that outperforms existing methods on unseen datasets and works on real consumer cameras for full-surround perception.

Core claim

By training in a heading-aligned coordinate system, the equirectangular monocular normal estimator produces distortion-robust and cross-view-consistent geometric priors that boost the performance of stereo matching on omnidirectional images, enabling zero-shot generalization after training solely on synthetic data.

What carries the argument

The heading-aligned monocular normal estimator that supplies geometric priors for stereo correspondence in equirectangular projections.
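The page does not spell out how the heading-aligned frame is built, so the following is only one plausible reading, stated as an assumption: each pixel's normal is expressed in a frame yawed about the vertical axis to face that pixel's azimuth. The helper names and the column-to-azimuth convention are hypothetical. Under this reading, yawing the camera shifts the panorama horizontally but leaves the per-surface normal values unchanged, which is the kind of invariance the claim relies on.

```python
import math

def rotate_z(v, angle):
    """Rotate a 3-vector about the vertical (z) axis by `angle` radians."""
    c, s = math.cos(angle), math.sin(angle)
    x, y, z = v
    return (c * x - s * y, s * x + c * y, z)

def pixel_azimuth(u, width):
    """Azimuth of the centre of column u (assumed equirectangular convention)."""
    return 2.0 * math.pi * (u + 0.5) / width - math.pi

def to_heading_aligned(normal_cam, azimuth):
    """Assumed definition: undo the pixel's azimuth so the frame faces the pixel."""
    return rotate_z(normal_cam, -azimuth)

# A wall patch seen in column 600 of a 1024-wide panorama, facing the camera.
phi = pixel_azimuth(600, 1024)
normal_cam = (-math.cos(phi), -math.sin(phi), 0.0)
aligned = to_heading_aligned(normal_cam, phi)

# Yaw the scene by alpha: the patch moves to azimuth phi + alpha and its
# camera-frame normal rotates, but the heading-aligned normal stays the same.
alpha = math.radians(75.0)
aligned_after_yaw = to_heading_aligned(rotate_z(normal_cam, alpha), phi + alpha)
```

In this sketch both `aligned` and `aligned_after_yaw` point straight back along the viewing heading, regardless of where the patch lands horizontally in the image.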

Load-bearing premise

The monocular normal estimator trained in heading-aligned coordinates will yield priors that are robust to distortions and consistent across views, reliably aiding stereo matching on real data without adaptation.

What would settle it

Observing that the predicted normals from left and right views of the same scene are inconsistent or that adding the normal priors does not increase matching accuracy over a baseline stereo matcher on real omnidirectional test images.

Figures

Figures reproduced from arXiv: 2605.14963 by Chenxing Jiang, Chuan Fang, Peize Liu, Ping Tan, Pusen Gao, Shaojie Shen, Yang Xu, Zhe Tong.

**Figure 1.** Left: Spherical disparity under top-bottom camera pairs. Right: Dense 3D reconstruction and trajectory estimation (red line) achieved by integrating our proposed single-pair omnidirectional stereo matching model into a visual odometry pipeline. By processing sequential top-bottom equirectangular stereo pairs, our method enables complete environment recovery. The ceiling of the point cloud is cropped for be…

**Figure 2.** The sphere illustrates the coordinate frames defining pixel normals.

**Figure 3.** The pipeline of H-OmniStereo.

… panoramic stereo pairs with corresponding ground-truth depth and surface normal maps. For each sample, we randomize the rig’s baseline (0.05 m to 0.5 m) and pose (roll and pitch: −45° to 45°), utilizing a capsule-shaped collision proxy to ensure collision-free placement among surrounding objects. As illustrated in …

**Figure 4.** Samples from our synthetic omnidirectional stereo dataset, featuring fixed layouts from GRUtopia [34] and HM3D [35], as well as randomized layouts

**Figure 5.** Qualitative disparity comparison against our two most competitive baselines, 360SD-Net [2] and MODE [9], using their official checkpoints. For

**Figure 6.** Qualitative point cloud comparison on real-world images [2]. Points beyond 10 m are removed, causing missing regions in some of the baseline results.

**Figure 7.** Qualitative results of point clouds derived from our predicted disparity.

**Figure 8.** Qualitative surface normal comparisons with PanoNormal [32] on the

**Figure 9.** Qualitative results of our omnidirectional stereo odometry: estimated

Original abstract

Stereo matching on top-bottom equirectangular images provides an effective framework for full-surround perception, as vertically aligned epipolar lines enable the use of advanced perspective stereo architectures that are largely driven by large-scale datasets and monocular priors. However, the performance of such adaptations is severely limited by the scarcity of omnidirectional stereo datasets and the degradation of perspective monocular priors under spherical distortions. To address these challenges, we propose H-OmniStereo, a zero-shot omnidirectional stereo matching framework. First, we construct a high-quality synthetic dataset comprising over 2.8 million top-bottom equirectangular stereo pairs to scale up training. Second, we introduce an equirectangular monocular normal estimator that operates in a heading-aligned coordinate system. Beyond providing distortion-robust and cross-view-consistent geometric priors for establishing reliable correspondences in stereo matching, this design boosts training efficiency and accommodates train-test FoV mismatches. Extensive experiments show that our approach achieves higher accuracy than existing methods on out-of-domain datasets and successfully generalizes to real-world consumer camera setups using a single model. The model and dataset will be released at https://github.com/JIANG-CX/H-OmniStereo.
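The vertically aligned epipolar geometry mentioned in the abstract can be illustrated with a small triangulation sketch. It assumes an ideal top-bottom rig whose baseline lies along the vertical axis, polar angles measured from the upward axis, and a row-to-angle convention that is hypothetical here; the paper's own parameterization may differ. With those assumptions, the angular disparity is the difference of the two polar angles and the range follows from the law of sines.

```python
import math

def row_to_polar(v, height):
    """Polar angle of the centre of row v, measured from the upward axis (assumed convention)."""
    return math.pi * (v + 0.5) / height

def range_from_top(theta_top, theta_bottom, baseline):
    """Distance from the top camera to the point, by the law of sines.

    The triangle (bottom camera, top camera, point) has angle theta_bottom at
    the bottom camera and angle (theta_top - theta_bottom) at the point.
    """
    disparity = theta_top - theta_bottom
    if disparity <= 0.0:
        raise ValueError("angular disparity must be positive for a finite range")
    return baseline * math.sin(theta_bottom) / math.sin(disparity)

# Synthetic check: bottom camera at the origin, top camera 0.2 m above it,
# and a point 2 m away horizontally at a height of 1 m.
baseline = 0.2
px, pz = 2.0, 1.0
theta_bottom = math.atan2(px, pz)
theta_top = math.atan2(px, pz - baseline)
estimated = range_from_top(theta_top, theta_bottom, baseline)
expected = math.hypot(px, pz - baseline)
```

Points on the baseline axis itself (directly above or below the rig) yield zero disparity, so their range cannot be recovered this way.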

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Large synthetic equirectangular dataset plus heading-aligned normals gives a practical zero-shot stereo pipeline, but the priors' cross-view consistency on real data is not yet shown.

The paper's main move is releasing a synthetic dataset of 2.8 million top-bottom equirectangular stereo pairs and training a monocular normal estimator inside a heading-aligned coordinate frame. This setup feeds distortion-robust priors into a stereo matcher that re-uses perspective architectures on vertically aligned epipolar lines. The scale of the synthetic data and the coordinate choice are the concrete additions that were not already in the cited prior work on omnidirectional stereo.

Referee Report

1 major / 1 minor

Summary. The manuscript presents H-OmniStereo, a zero-shot omnidirectional stereo matching framework for top-bottom equirectangular images. It constructs a synthetic dataset of over 2.8 million pairs and introduces an equirectangular monocular normal estimator in a heading-aligned coordinate system to supply distortion-robust, cross-view-consistent geometric priors for stereo correspondence. The work claims superior accuracy over existing methods on out-of-domain datasets and successful generalization to real-world consumer camera setups with a single model, with the model and dataset to be released.

Significance. If the central claims are substantiated, the framework could advance omnidirectional perception by allowing perspective stereo networks to operate effectively on equirectangular data without domain-specific real-world training. The scale of the synthetic dataset and the heading-aligned normal prior design address key bottlenecks in dataset scarcity and distortion handling.

major comments (1)

Abstract and Method: The claim that the heading-aligned monocular normal estimator yields cross-view-consistent priors on real top-bottom pairs without adaptation is load-bearing for the zero-shot generalization result. No quantitative consistency metric (such as mean angular error between predicted normals on overlapping 3D points after coordinate transformation) or ablation isolating the prior's contribution from the 2.8 M synthetic pairs is reported, leaving open the possibility that performance gains derive primarily from dataset scale rather than the proposed geometric design.
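The consistency metric requested in this comment could be computed along the following lines. This is a minimal sketch, not the authors' evaluation code: it assumes the normals are already sampled at corresponding 3D points in both views and that the relative rotation between the views is known. All function names are hypothetical. Only the rotation matters, because translating a camera does not change surface normals.

```python
import math

def mat_vec(rot, v):
    """Apply a 3x3 rotation (nested lists or tuples) to a 3-vector."""
    return tuple(sum(rot[i][j] * v[j] for j in range(3)) for i in range(3))

def angular_error_deg(a, b):
    """Angle in degrees between two non-zero 3-vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    cos_angle = max(-1.0, min(1.0, dot / norm))  # clamp against rounding
    return math.degrees(math.acos(cos_angle))

def mean_angular_error_deg(normals_top, normals_bottom, rot_top_from_bottom):
    """Mean angle between top-view normals and bottom-view normals rotated into the top frame."""
    errors = [
        angular_error_deg(n_top, mat_vec(rot_top_from_bottom, n_bottom))
        for n_top, n_bottom in zip(normals_top, normals_bottom)
    ]
    return sum(errors) / len(errors)

identity = ((1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0))
normals_top = [(0.0, 0.0, 1.0), (1.0, 0.0, 0.0)]
normals_bottom = [(0.0, 0.0, 1.0), (0.0, 1.0, 0.0)]
score = mean_angular_error_deg(normals_top, normals_bottom, identity)
```

In this toy case the first pair agrees and the second is orthogonal, so the mean error is 45 degrees; a cross-view-consistent estimator would drive this value toward zero on real pairs.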

minor comments (1)

Abstract: The statement that 'extensive experiments show higher accuracy' would be strengthened by including at least one representative quantitative metric (e.g., EPE or D1) even in the abstract.
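For readers unfamiliar with the two metrics named here, the sketch below shows how they are commonly computed. The D1 thresholds follow the widely used KITTI convention (error above 3 px and above 5% of the ground truth); the thresholds, the disparity units, and the use of `None` for invalid ground truth are assumptions, since the page does not state what the paper uses.

```python
def valid_pairs(pred, gt):
    """Keep only positions that have ground truth (None marks an invalid pixel)."""
    return [(p, g) for p, g in zip(pred, gt) if g is not None]

def epe(pred, gt):
    """End-point error: mean absolute disparity error over valid pixels."""
    pairs = valid_pairs(pred, gt)
    return sum(abs(p - g) for p, g in pairs) / len(pairs)

def d1(pred, gt, abs_thresh=3.0, rel_thresh=0.05):
    """Percentage of valid pixels whose error exceeds both thresholds."""
    pairs = valid_pairs(pred, gt)
    outliers = sum(
        1
        for p, g in pairs
        if abs(p - g) > abs_thresh and abs(p - g) > rel_thresh * abs(g)
    )
    return 100.0 * outliers / len(pairs)

pred = [10.0, 20.0, 30.0, 50.0, 7.0]
gt = [10.0, 21.0, 36.0, 50.0, None]  # last pixel has no ground truth
epe_value = epe(pred, gt)
d1_value = d1(pred, gt)
```

Here the four valid pixels have errors of 0, 1, 6 and 0, giving an EPE of 1.75 and a D1 of 25%, since only the third pixel exceeds both thresholds.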

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address the major comment below and will strengthen the paper accordingly.

Referee: Abstract and Method: The claim that the heading-aligned monocular normal estimator yields cross-view-consistent priors on real top-bottom pairs without adaptation is load-bearing for the zero-shot generalization result. No quantitative consistency metric (such as mean angular error between predicted normals on overlapping 3D points after coordinate transformation) or ablation isolating the prior's contribution from the 2.8 M synthetic pairs is reported, leaving open the possibility that performance gains derive primarily from dataset scale rather than the proposed geometric design.

Authors: We agree that an explicit quantitative consistency metric and an ablation isolating the heading-aligned normal prior would strengthen the evidence for its role in zero-shot generalization. In the revised manuscript we will add both: (1) a consistency evaluation reporting mean angular error between normals predicted on corresponding 3D points from the two views after coordinate transformation, computed on held-out synthetic pairs and on real top-bottom captures; and (2) an ablation that trains the stereo matcher on the identical 2.8 M synthetic dataset with and without the normal prior. These additions will directly address whether the reported gains derive from the geometric design rather than dataset scale alone.

Revision: yes

Circularity Check

0 steps flagged

No significant circularity detected

The paper constructs a synthetic dataset of over 2.8 million top-bottom equirectangular pairs and trains an equirectangular monocular normal estimator in a heading-aligned coordinate system to supply geometric priors for stereo matching. No equations or steps in the abstract reduce any claimed prediction or result to a fitted parameter or self-citation by construction; the central claims rest on externally generated data and standard monocular estimation techniques applied to out-of-domain and real-world setups. The derivation chain remains self-contained against external benchmarks with no load-bearing self-referential reductions.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim depends on the representativeness of the synthetic dataset and the utility of the heading-aligned normal prior for real-world generalization.

axioms (1)

Domain assumption: Synthetic top-bottom equirectangular stereo pairs can be generated at sufficient scale and realism to support zero-shot transfer to real consumer camera data. Invoked when claiming generalization without real-data fine-tuning.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

Paper passage: "Attentive Hybrid Cost Filtering... iterative refinement with ConvGRU"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

42 extracted references · 42 canonical work pages · 3 internal anchors

[1] M. Zayene, J. Endres, A. Havolli, C. Corbière, S. Cherkaoui, A. Kontouli, and A. Alahi, “Helvipad: A real-world dataset for omnidirectional stereo depth estimation,” in Proceedings of the Computer Vision and Pattern Recognition Conference, 2025, pp. 26 975–26 984.

[2] N.-H. Wang, B. Solarte, Y.-H. Tsai, W.-C. Chiu, and M. Sun, “360sd-net: 360 stereo depth estimation with learnable cost volume,” in 2020 IEEE International Conference on Robotics and Automation. IEEE, 2020, pp. 582–588.

[3] J. Endres, O. Hahn, C. Corbière, S. Schaub-Meyer, S. Roth, and A. Alahi, “Boosting omnidirectional stereo matching with a pre-trained depth foundation model,” in 2025 IEEE/RSJ International Conference on Intelligent Robots and Systems. IEEE, 2025, pp. 15 111–15 118.

[4] X. Zheng, C. Liao, Z. Weng, K. Lei, Z. Dongfang, H. He, Y. Lyu, L. Jiang, L. Qi, L. Chen, D. P. Paudel, K. Yang, L. Zhang, L. V. Gool, and X. Hu, “Panorama: The rise of omnidirectional vision in the embodied ai era,” Sept. 2025. [Online]. Available: http://arxiv.org/abs/2509.12989

[5] B. Wen, M. Trepte, J. Aribido, J. Kautz, O. Gallo, and S. Birchfield, “Foundationstereo: Zero-shot stereo matching,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2025, pp. 5249–5260.

[6] J. Cheng, L. Liu, G. Xu, X. Wang, Z. Zhang, Y. Deng, J. Zang, Y. Chen, Z. Cai, and X. Yang, “Monster: Marry monodepth to stereo unleashes power,” in Proceedings of the Computer Vision and Pattern Recognition Conference, 2025, pp. 6273–6282.

[7] B. Wen, S. Dewan, and S. Birchfield, “Fast-FoundationStereo: Real-time zero-shot stereo matching,” Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2026.

[8] L. Yang, B. Kang, Z. Huang, Z. Zhao, X. Xu, J. Feng, and H. Zhao, “Depth anything v2,” Advances in Neural Information Processing Systems, vol. 37, pp. 21 875–21 911, 2024.

[9] M. Li, X. Jin, X. Hu, J. Dai, and S. Du, “Mode: Multi-view omnidirectional depth estimation with 360° cameras.”

[10] Z. Cao, J. Zhu, W. Zhang, H. Ai, H. Bai, H. Zhao, and L. Wang, “Panda: Towards panoramic depth anything with unlabeled panoramas and mobius spatial augmentation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025, pp. 982–992.

[11] H. Li, W. Zheng, J. He, Y. Liu, and X. Lin, “Da$^2$: Depth anything in any direction,” Sept. 2025. [Online]. Available: http://arxiv.org/abs/2509.26618

[12] X. Lin, M. Song, D. Zhang, W. Lu, H. Li, B. Du, M.-H. Yang, T. Nguyen, and L. Qi, “Depth any panoramas: A foundation model for panoramic depth estimation,” arXiv preprint arXiv:2512.16913, 2025.

[13] Y. Guo, S. Garg, S. M. H. Miangoleh, X. Huang, and L. Ren, “Depth any camera: Zero-shot metric depth estimation from any camera,” in Proceedings of the Computer Vision and Pattern Recognition Conference, 2025, pp. 26 996–27 006.

[14] N. Zioulis, A. Karakottas, D. Zarpalas, F. Alvarez, and P. Daras, “Spherical view synthesis for self-supervised 360° depth estimation,” in International Conference on 3D Vision, September 2019.

[15] J. Zheng, J. Zhang, J. Li, R. Tang, S. Gao, and Z. Zhou, “Structured3d: A large photo-realistic dataset for structured 3d modeling,” in European Conference on Computer Vision. Springer, 2020, pp. 519–535.

[16] W. Wang, D. Zhu, X. Wang, Y. Hu, Y. Qiu, C. Wang, Y. Hu, A. Kapoor, and S. Scherer, “Tartanair: A dataset to push the limits of visual slam,” in 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems. IEEE, 2020, pp. 4909–4916.

[17] C. Fang, H. Li, Y. Liang, J. Zheng, Y. Mao, Y. Liu, R. Tang, Z. Zhou, and P. Tan, “Spatialgen: Layout-guided 3d indoor scene generation,” arXiv preprint arXiv:2509.14981, vol. 3, 2025.

[18] A. Karakottas, N. Zioulis, S. Samaras, D. Ataloglou, V. Gkitsas, D. Zarpalas, and P. Daras, “360 surface regression with a hyper-sphere loss,” in 2019 International Conference on 3D Vision. IEEE, 2019, pp. 258–268.

[19] A. Kendall, H. Martirosyan, S. Dasgupta, P. Henry, R. Kennedy, A. Bachrach, and A. Bry, “End-to-end learning of geometry and context for deep stereo regression,” in Proceedings of the IEEE international conference on computer vision, 2017, pp. 66–75.

[20] J.-R. Chang and Y.-S. Chen, “Pyramid stereo matching network,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 5410–5418.

[21] L. Lipson, Z. Teed, and J. Deng, “Raft-stereo: Multilevel recurrent field transforms for stereo matching,” in 2021 International conference on 3D vision. IEEE, 2021, pp. 218–227.

[22] G. Xu, X. Wang, X. Ding, and X. Yang, “Iterative geometry encoding volume for stereo matching,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2023, pp. 21 919–21 928.

[23] J. Min, Y. Jeon, J. Kim, and M. Choi, “S2m2: Scalable stereo matching model for reliable depth estimation,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2025, pp. 26 729–26 739.

[24] C. Won, J. Ryu, and J. Lim, “Sweepnet: Wide-baseline omnidirectional depth estimation,” in 2019 International Conference on Robotics and Automation. IEEE, 2019, pp. 6073–6079.

[25] C. Won, J. Ryu, and J. Lim, “Omnimvs: End-to-end learning for omnidirectional stereo matching,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 8987–8996.

[26] Z. Chen, C. Lin, L. Nie, Z. Shen, K. Liao, Y. Cao, and Y. Zhao, “S-omnimvs: Incorporating sphere geometry into omnidirectional stereo matching,” in Proceedings of the 31st ACM International Conference on Multimedia, 2023, pp. 1495–1503.

[27] H. Jiang, R. Xu, M. Tan, and W. Jiang, “Romnistereo: Recurrent omnidirectional stereo matching,” IEEE Robotics and Automation Letters, vol. 9, no. 3, pp. 2511–2518, 2024.

[28] N.-H. Wang and Y.-L. Liu, “Depth anywhere: Enhancing 360 monocular depth estimation via perspective distillation and unlabeled data augmentation,” Advances in Neural Information Processing Systems, vol. 37, pp. 127 739–127 764, 2024.

[29] L. Piccinelli, C. Sakaridis, M. Segu, Y.-H. Yang, S. Li, W. Abbeloos, and L. Van Gool, “Unik3d: Universal camera monocular 3d estimation,” in Proceedings of the Computer Vision and Pattern Recognition Conference, 2025, pp. 1028–1039.

[30] H. Jiang, Z. Sheng, S. Zhu, Z. Dong, and R. Huang, “Unifuse: Unidirectional fusion for 360 panorama depth estimation,” IEEE Robotics and Automation Letters, vol. 6, no. 2, pp. 1519–1526, 2021.

[31] Y. Li, Y. Guo, Z. Yan, X. Huang, Y. Duan, and L. Ren, “Omnifusion: 360 monocular depth estimation via geometry-aware fusion,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 2801–2810.

[32] K. Huang, F. Zhang, and N. Dodgson, “Panonormal: Monocular indoor 360° surface normal estimation,” May 2024. [Online]. Available: http://arxiv.org/abs/2405.18745

[33] Z. Shen, C. Lin, K. Liao, L. Nie, Z. Zheng, and Y. Zhao, “Panoformer: panorama transformer for indoor 360° depth estimation,” in European Conference on Computer Vision. Springer, 2022, pp. 195–211.

[34] H. Wang, J. Chen, W. Huang, Q. Ben, T. Wang, B. Mi, T. Huang, S. Zhao, Y. Chen, S. Yang, et al., “Grutopia: Dream general robots in a city at scale,” arXiv preprint arXiv:2407.10943, 2024.

[35] S. K. Ramakrishnan, A. Gokaslan, E. Wijmans, O. Maksymets, A. Clegg, J. Turner, E. Undersander, W. Galuba, A. Westbury, A. X. Chang, et al., “Habitat-matterport 3d dataset (hm3d): 1000 large-scale 3d environments for embodied ai,” arXiv preprint arXiv:2109.08238, 2021.

[36] NVIDIA, “Isaac Sim.” [Online]. Available: https://github.com/isaac-sim/IsaacSim

[37] Y. Zhang, L. Zhang, R. Ma, and N. Cao, “Texverse: A universe of 3d objects with high-resolution textures,” arXiv preprint arXiv:2508.10868, 2025.

[38] M. Oquab, T. Darcet, T. Moutakanni, H. Vo, M. Szafraniec, V. Khalidov, P. Fernandez, D. Haziza, F. Massa, A. El-Nouby, et al., “Dinov2: Learning robust visual features without supervision,” arXiv preprint arXiv:2304.07193, 2023.

[39] I. Loshchilov and F. Hutter, “Decoupled weight decay regularization,” arXiv preprint arXiv:1711.05101, 2017.

[40] C. Pulling, J. H. Tan, Y. Hu, and S. Scherer, “Geometry-informed distance candidate selection for adaptive lightweight omnidirectional stereo vision with fisheye images,” in 2024 IEEE International Conference on Robotics and Automation. IEEE, 2024, pp. 12 255–12 261.

[41] Y. Qiu, Y. Chen, Z. Zhang, W. Wang, and S. Scherer, “Mac-vo: Metrics-aware covariance for learning-based stereo visual odometry,” in 2025 IEEE International Conference on Robotics and Automation. IEEE, 2025, pp. 3803–3814.

[42] M. Grupp, “evo: Python package for the evaluation of odometry and slam.” https://github.com/MichaelGrupp/evo, 2017.

[1] [1]

Helvipad: A real-world dataset for omnidirectional stereo depth estimation,

M. Zayene, J. Endres, A. Havolli, C. Corbi `ere, S. Cherkaoui, A. Kon- touli, and A. Alahi, “Helvipad: A real-world dataset for omnidirectional stereo depth estimation,” inProceedings of the Computer Vision and Pattern Recognition Conference, 2025, pp. 26 975–26 984

work page 2025

[2] [2]

360sd- net: 360 stereo depth estimation with learnable cost volume,

N.-H. Wang, B. Solarte, Y .-H. Tsai, W.-C. Chiu, and M. Sun, “360sd- net: 360 stereo depth estimation with learnable cost volume,” in2020 IEEE International Conference on Robotics and Automation. IEEE, 2020, pp. 582–588

work page 2020

[3] [3]

Boosting omnidirectional stereo matching with a pre-trained depth foundation model,

J. Endres, O. Hahn, C. Corbi `ere, S. Schaub-Meyer, S. Roth, and A. Alahi, “Boosting omnidirectional stereo matching with a pre-trained depth foundation model,” in2025 IEEE/RSJ International Conference on Intelligent Robots and Systems. IEEE, 2025, pp. 15 111–15 118

work page 2025

[4] [4]

Panorama: The rise of omnidirectional vision in the embodied ai era,

X. Zheng, C. Liao, Z. Weng, K. Lei, Z. Dongfang, H. He, Y . Lyu, L. Jiang, L. Qi, L. Chen, D. P. Paudel, K. Yang, L. Zhang, L. V . Gool, and X. Hu, “Panorama: The rise of omnidirectional vision in the embodied ai era,” Sept. 2025. [Online]. Available: http://arxiv.org/abs/2509.12989

work page arXiv 2025

[5] [5]

Foundationstereo: Zero-shot stereo matching,

B. Wen, M. Trepte, J. Aribido, J. Kautz, O. Gallo, and S. Birchfield, “Foundationstereo: Zero-shot stereo matching,” inProceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2025, pp. 5249–5260

work page 2025

[6] [6]

Monster: Marry monodepth to stereo unleashes power,

J. Cheng, L. Liu, G. Xu, X. Wang, Z. Zhang, Y . Deng, J. Zang, Y . Chen, Z. Cai, and X. Yang, “Monster: Marry monodepth to stereo unleashes power,” inProceedings of the Computer Vision and Pattern Recognition Conference, 2025, pp. 6273–6282

work page 2025

[7] [7]

Fast-FoundationStereo: Real-time zero-shot stereo matching,

B. Wen, S. Dewan, and S. Birchfield, “Fast-FoundationStereo: Real-time zero-shot stereo matching,”Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2026

work page 2026

[8] [8]

Depth anything v2,

L. Yang, B. Kang, Z. Huang, Z. Zhao, X. Xu, J. Feng, and H. Zhao, “Depth anything v2,”Advances in Neural Information Processing Sys- tems, vol. 37, pp. 21 875–21 911, 2024

work page 2024

[9] [9]

Mode: Multi-view omnidirec- tional depth estimation with 360◦cameras

M. Li, X. Jin, X. Hu, J. Dai, and S. Du, “Mode: Multi-view omnidirec- tional depth estimation with 360◦cameras.”

work page

[10] [10]

Panda: Towards panoramic depth anything with unlabeled panoramas and mobius spatial augmentation,

Z. Cao, J. Zhu, W. Zhang, H. Ai, H. Bai, H. Zhao, and L. Wang, “Panda: Towards panoramic depth anything with unlabeled panoramas and mobius spatial augmentation,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025, pp. 982– 992

work page 2025

[11] [11]

Da 2: Depth anything in any direction,

H. Li, W. Zheng, J. He, Y . Liu, and X. Lin, “Da$ˆ2$: Depth anything in any direction,” Sept. 2025. [Online]. Available: http: //arxiv.org/abs/2509.26618

work page arXiv 2025

[12] [12]

Depth any panoramas: A foundation model for panoramic depth estimation,

X. Lin, M. Song, D. Zhang, W. Lu, H. Li, B. Du, M.-H. Yang, T. Nguyen, and L. Qi, “Depth any panoramas: A foundation model for panoramic depth estimation,”arXiv preprint arXiv:2512.16913, 2025

work page arXiv 2025

[13] [13]

Depth any camera: Zero-shot metric depth estimation from any camera,

Y . Guo, S. Garg, S. M. H. Miangoleh, X. Huang, and L. Ren, “Depth any camera: Zero-shot metric depth estimation from any camera,” inPro- ceedings of the Computer Vision and Pattern Recognition Conference, 2025, pp. 26 996–27 006

work page 2025

[14] [14]

Spherical view synthesis for self-supervised360 o depth estimation,

N. Zioulis, A. Karakottas, D. Zarpalas, F. Alvarez, and P. Daras, “Spherical view synthesis for self-supervised360 o depth estimation,” inInternational Conference on 3D Vision, September 2019

work page 2019

[15] [15]

Structured3d: A large photo-realistic dataset for structured 3d modeling,

J. Zheng, J. Zhang, J. Li, R. Tang, S. Gao, and Z. Zhou, “Structured3d: A large photo-realistic dataset for structured 3d modeling,” inEuropean Conference on Computer Vision. Springer, 2020, pp. 519–535

work page 2020

[16] [16]

Tartanair: A dataset to push the limits of visual slam,

W. Wang, D. Zhu, X. Wang, Y . Hu, Y . Qiu, C. Wang, Y . Hu, A. Kapoor, and S. Scherer, “Tartanair: A dataset to push the limits of visual slam,” in2020 IEEE/RSJ International Conference on Intelligent Robots and Systems. IEEE, 2020, pp. 4909–4916

work page 2020

[17] [17]

Spatialgen: Layout-guided 3d indoor scene generation.arXiv preprint arXiv:2509.14981, 3, 2025

C. Fang, H. Li, Y . Liang, J. Zheng, Y . Mao, Y . Liu, R. Tang, Z. Zhou, and P. Tan, “Spatialgen: Layout-guided 3d indoor scene generation,” arXiv preprint arXiv:2509.14981, vol. 3, 2025

work page arXiv 2025

[18] [18]

360 surface regression with a hyper-sphere loss,

A. Karakottas, N. Zioulis, S. Samaras, D. Ataloglou, V . Gkitsas, D. Zarpalas, and P. Daras, “360 surface regression with a hyper-sphere loss,” in2019 International Conference on 3D Vision. IEEE, 2019, pp. 258–268

work page 2019

[19] [19]

End-to-end learning of geometry and context for deep stereo regression,

A. Kendall, H. Martirosyan, S. Dasgupta, P. Henry, R. Kennedy, A. Bachrach, and A. Bry, “End-to-end learning of geometry and context for deep stereo regression,” inProceedings of the IEEE international conference on computer vision, 2017, pp. 66–75

work page 2017

[20] [20]

Pyramid stereo matching network,

J.-R. Chang and Y .-S. Chen, “Pyramid stereo matching network,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 5410–5418

work page 2018

[21] L. Lipson, Z. Teed, and J. Deng, “Raft-stereo: Multilevel recurrent field transforms for stereo matching,” in 2021 International conference on 3D vision. IEEE, 2021, pp. 218–227

[22] G. Xu, X. Wang, X. Ding, and X. Yang, “Iterative geometry encoding volume for stereo matching,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2023, pp. 21 919–21 928

[23] J. Min, Y. Jeon, J. Kim, and M. Choi, “S2m2: Scalable stereo matching model for reliable depth estimation,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2025, pp. 26 729–26 739

[24] C. Won, J. Ryu, and J. Lim, “Sweepnet: Wide-baseline omnidirectional depth estimation,” in 2019 International Conference on Robotics and Automation. IEEE, 2019, pp. 6073–6079

[25] C. Won, J. Ryu, and J. Lim, “Omnimvs: End-to-end learning for omnidirectional stereo matching,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 8987–8996

[26] Z. Chen, C. Lin, L. Nie, Z. Shen, K. Liao, Y. Cao, and Y. Zhao, “S-omnimvs: Incorporating sphere geometry into omnidirectional stereo matching,” in Proceedings of the 31st ACM International Conference on Multimedia, 2023, pp. 1495–1503

[27] H. Jiang, R. Xu, M. Tan, and W. Jiang, “Romnistereo: Recurrent omnidirectional stereo matching,” IEEE Robotics and Automation Letters, vol. 9, no. 3, pp. 2511–2518, 2024

[28] N.-H. Wang and Y.-L. Liu, “Depth anywhere: Enhancing 360 monocular depth estimation via perspective distillation and unlabeled data augmentation,” Advances in Neural Information Processing Systems, vol. 37, pp. 127 739–127 764, 2024

[29] L. Piccinelli, C. Sakaridis, M. Segu, Y.-H. Yang, S. Li, W. Abbeloos, and L. Van Gool, “Unik3d: Universal camera monocular 3d estimation,” in Proceedings of the Computer Vision and Pattern Recognition Conference, 2025, pp. 1028–1039

[30] H. Jiang, Z. Sheng, S. Zhu, Z. Dong, and R. Huang, “Unifuse: Unidirectional fusion for 360 panorama depth estimation,” IEEE Robotics and Automation Letters, vol. 6, no. 2, pp. 1519–1526, 2021

[31] Y. Li, Y. Guo, Z. Yan, X. Huang, Y. Duan, and L. Ren, “Omnifusion: 360 monocular depth estimation via geometry-aware fusion,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 2801–2810

[32] K. Huang, F. Zhang, and N. Dodgson, “Panonormal: Monocular indoor 360° surface normal estimation,” May 2024. [Online]. Available: http://arxiv.org/abs/2405.18745

[33] Z. Shen, C. Lin, K. Liao, L. Nie, Z. Zheng, and Y. Zhao, “Panoformer: panorama transformer for indoor 360° depth estimation,” in European Conference on Computer Vision. Springer, 2022, pp. 195–211

[34] H. Wang, J. Chen, W. Huang, Q. Ben, T. Wang, B. Mi, T. Huang, S. Zhao, Y. Chen, S. Yang, et al., “Grutopia: Dream general robots in a city at scale,” arXiv preprint arXiv:2407.10943, 2024

[35] S. K. Ramakrishnan, A. Gokaslan, E. Wijmans, O. Maksymets, A. Clegg, J. Turner, E. Undersander, W. Galuba, A. Westbury, A. X. Chang, et al., “Habitat-matterport 3d dataset (hm3d): 1000 large-scale 3d environments for embodied ai,” arXiv preprint arXiv:2109.08238, 2021

[36] NVIDIA, “Isaac Sim.” [Online]. Available: https://github.com/isaac-sim/IsaacSim

[37] Y. Zhang, L. Zhang, R. Ma, and N. Cao, “Texverse: A universe of 3d objects with high-resolution textures,” arXiv preprint arXiv:2508.10868, 2025

[38] M. Oquab, T. Darcet, T. Moutakanni, H. Vo, M. Szafraniec, V. Khalidov, P. Fernandez, D. Haziza, F. Massa, A. El-Nouby, et al., “Dinov2: Learning robust visual features without supervision,” arXiv preprint arXiv:2304.07193, 2023

[39] I. Loshchilov and F. Hutter, “Decoupled weight decay regularization,” arXiv preprint arXiv:1711.05101, 2017

[40] C. Pulling, J. H. Tan, Y. Hu, and S. Scherer, “Geometry-informed distance candidate selection for adaptive lightweight omnidirectional stereo vision with fisheye images,” in 2024 IEEE International Conference on Robotics and Automation. IEEE, 2024, pp. 12 255–12 261

[41] Y. Qiu, Y. Chen, Z. Zhang, W. Wang, and S. Scherer, “Mac-vo: Metrics-aware covariance for learning-based stereo visual odometry,” in 2025 IEEE International Conference on Robotics and Automation. IEEE, 2025, pp. 3803–3814

[42] M. Grupp, “evo: Python package for the evaluation of odometry and slam.” https://github.com/MichaelGrupp/evo, 2017