
arXiv: 2511.16428 · v3 · submitted 2025-11-20 · 💻 cs.CV

CylinderDepth: Cylindrical Spatial Attention for Multi-View Consistent Self-Supervised Surround Depth Estimation

Pith reviewed 2026-05-17 20:37 UTC · model grok-4.3

classification 💻 cs.CV
keywords self-supervised depth estimation · multi-view consistency · surround depth · cylindrical attention · autonomous driving · DDAD · nuScenes · multi-camera rigs

The pith

Mapping image features onto a shared cylinder and weighting them by cylindrical distance produces consistent depth estimates across overlapping camera views.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to fix inconsistent depth values in the overlap zones of surround-camera rigs when depth is learned without direct supervision. It projects each camera's features onto one common cylinder so that nearby points on the cylinder correspond to actual neighbors in 3D space. A simple distance-based weighting then blends information only from those nearby cylinder locations instead of using full learned attention. The blended features are decoded into per-view depth maps. Readers care because reliable 360-degree depth from cheap cameras is needed for navigation and 3D scene understanding.

Core claim

By mapping the feature positions from each camera image onto a shared cylinder, neighborhood relationships are established between different views. An explicit spatial attention mechanism then aggregates features across images using non-learned weights based on their distances on the cylinder. These modulated features are decoded to produce a depth map per view, yielding improved cross-view depth consistency and higher overall depth accuracy on the DDAD and nuScenes datasets.

What carries the argument

Cylindrical spatial attention, which projects features from each view onto a common cylinder and aggregates them with explicit non-learned distance weighting.
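
A minimal sketch of the aggregation step, assuming every feature token from every view already carries a cylindrical position (azimuth `theta`, height `h`). The Gaussian kernel, the cut-off `max_dist`, the unit radius, and the function names are illustrative assumptions; the summary above only states that the weights are non-learned and depend on distances on the cylinder.

```python
import numpy as np

def cylindrical_distance(theta, h, radius=1.0):
    """Pairwise distance on the cylinder surface between N tokens."""
    dtheta = np.abs(theta[:, None] - theta[None, :])
    # The azimuth wraps around, so take the shorter way round the cylinder.
    dtheta = np.minimum(dtheta, 2.0 * np.pi - dtheta)
    dh = h[:, None] - h[None, :]
    return np.sqrt((radius * dtheta) ** 2 + dh ** 2)

def aggregate(features, theta, h, sigma=0.1, max_dist=0.3, radius=1.0):
    """Blend features with fixed (non-learned) distance-based weights.

    features: (N, C) tokens pooled from all views.
    theta, h: (N,) cylindrical positions of those tokens.
    sigma, max_dist: assumed kernel width and neighbourhood radius.
    """
    d = cylindrical_distance(theta, h, radius)
    w = np.exp(-0.5 * (d / sigma) ** 2)   # assumed Gaussian kernel
    w[d > max_dist] = 0.0                 # restrict to a small neighbourhood
    w /= w.sum(axis=1, keepdims=True)     # each row sums to 1 (self-weight is 1 before normalising)
    return w @ features

# Two tokens on either side of the +/-pi seam are neighbours; the third is far away.
theta = np.array([3.1, -3.1, 0.0])
h = np.zeros(3)
features = np.array([[1.0], [3.0], [10.0]])
out = aggregate(features, theta, h)
```

In this toy input the first two tokens exchange information across the seam while the third keeps its own value, which is the behaviour the cross-view neighbourhood is meant to provide.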

If this is right

  • Depth maps from adjacent cameras show lower inconsistency in their overlapping regions.
  • Overall depth accuracy rises on standard surround-view benchmarks relative to prior self-supervised methods.
  • The limited receptive field at image borders is effectively extended by borrowing information from neighboring views.
  • Correspondence problems are eased because attention is restricted to small cylinder neighborhoods rather than the full image set.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same cylinder projection could be tried for other surround tasks such as semantic segmentation or optical flow.
  • If the non-learned weighting works reliably, similar explicit geometry cues might replace heavier learned attention blocks in other multi-camera networks.
  • Experiments on rigs whose calibration drifts over time would test how sensitive the consistency gains are to exact geometry.

Load-bearing premise

That mapping each image's features onto a shared cylinder correctly identifies true 3D neighborhood relationships without introducing distortions from imperfect calibration or camera geometry.

What would settle it

Measure the variance of predicted depths for the same 3D scene points when they appear in two overlapping camera views; a drop in this variance after the cylindrical attention would support the claim.
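
One way such a check could be sketched: back-project the depth map of view A, move the points into view B with the known extrinsics, and compare the resulting depth with what view B predicts at the same pixel. This sketch reports a mean relative discrepancy rather than a variance, uses nearest-pixel sampling, and ignores occlusion; the function name, the pinhole intrinsics `K_a` and `K_b`, and the transform `T_a_to_b` are assumptions made for illustration.

```python
import numpy as np

def cross_view_inconsistency(depth_a, depth_b, K_a, K_b, T_a_to_b):
    """Mean relative depth disagreement over pixels of A that land inside B.

    depth_a, depth_b: (H, W) depth maps along the optical axis.
    K_a, K_b: (3, 3) pinhole intrinsics (assumed, no lens distortion).
    T_a_to_b: (4, 4) rigid transform from camera A to camera B coordinates.
    """
    H, W = depth_a.shape
    v, u = np.mgrid[0:H, 0:W]
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3).astype(float)

    # Back-project pixels of A to 3D points in A's camera frame.
    rays = pix @ np.linalg.inv(K_a).T
    pts_a = rays * depth_a.reshape(-1, 1)

    # Express the points in B's camera frame.
    pts_b = pts_a @ T_a_to_b[:3, :3].T + T_a_to_b[:3, 3]
    z_b = pts_b[:, 2]
    front = z_b > 1e-6  # keep points in front of camera B

    # Project into B and round to the nearest pixel.
    proj = pts_b[front] @ K_b.T
    u_b = np.rint(proj[:, 0] / proj[:, 2]).astype(int)
    v_b = np.rint(proj[:, 1] / proj[:, 2]).astype(int)
    Hb, Wb = depth_b.shape
    inside = (u_b >= 0) & (u_b < Wb) & (v_b >= 0) & (v_b < Hb)
    if not inside.any():
        return float("nan")  # the two views do not overlap

    z_expected = z_b[front][inside]
    z_predicted = depth_b[v_b[inside], u_b[inside]]
    return float(np.mean(np.abs(z_predicted - z_expected) / z_expected))
```

Running this before and after enabling the cylindrical attention on the same overlap regions would give a number to compare; a lower value indicates that the two views agree more closely on the depth of shared scene points.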

Figures

Figures reproduced from arXiv: 2511.16428 by Christian Grannemann, Max Mehltretter, Samer Abualhanud.

Figure 1: Comparison of multi-view consistency between our method and CVCDepth [4]. The star and circle denote 3D reconstructions of the same 3D object point from two different images. While prior work struggles to achieve consistency in the reconstruction across images, our method overcomes this limitation.
Figure 2: Overview of the proposed network. The depth network takes the target images $I_t$ as input. The lowest-scale features $F_{S,I_t}$ from all target images are projected onto a cylinder, where attention is applied based on cylindrical distances. The pose network takes the source $I_{t',1}$ and target front $I_{t,1}$ images as input to predict the temporal pose.
Figure 3: Visualization of the cylindrical projection of a pixel $p$ from the 3D position map $P_{S,I_t,i}$, resulting in the cylindrical position map $O_{S,I_t,i}$ for all pixels in $P_{S,I_t,i}$.

Accompanying text: We then parameterize $p'$ in cylindrical coordinates by its azimuth $\theta_{p'}$ and height $h_{p'}$:

$$\theta_{p'} = \operatorname{atan2}(y' - c_y,\; x' - c_x) \in (-\pi, \pi], \qquad (4)$$

$$h_{p'} = z' - c_z. \qquad (5)$$

For each feature map $F_{S,I_t,i}$, we obtain an associated position map $O_{S,I_t,i} \in \mathbb{R}^{H_S \times W_S \times 2}$ th…
Figure 4: Panoramic visualization of the cylindrical projection of RGB inputs. Note that in our method, only pixel positions are projected, not RGB values. This figure is provided solely for illustration, to show how objects captured from different views are mapped to nearby locations in cylindrical coordinates.
Figure 5: Attention maps for a query token (indicated by the arrow in the back-left image), as overlays on the respective RGB images, showing that this token attends to itself, nearby regions, and to the corresponding region in the spatially adjacent image. High attention is shown in red, low attention in yellow to blue.

Dataset  Method  Abs Rel  Sq Rel [m]  RMSE [m]  δ < 1.25
DDAD     FSM     0.201    -           -         -
DDAD     FSM*    0.228    4.409       13.4…
Figure 6: Comparison of depth maps predicted by our method and by state-of-the-art methods on DDAD. Our results show better preserved …
Figure 7: Exemplary 3D reconstructions, comparing our method to the state-of-the-art on DDAD. While our method maps overlapping …
Abstract

Self-supervised surround-view depth estimation enables dense, low-cost 3D perception with a 360° field of view from multiple minimally overlapping images. Yet, most existing methods suffer from depth estimates that are inconsistent across overlapping images. To address this limitation, we propose a novel geometry-guided method for calibrated, time-synchronized multi-camera rigs that predicts dense metric depth. Our approach targets two main sources of inconsistency: the limited receptive field in border regions of single-image depth estimation, and the difficulty of correspondence matching. We mitigate these two issues by extending the receptive field across views and restricting cross-view attention to a small neighborhood. To this end, we establish the neighborhood relationships between images by mapping the image-specific feature positions onto a shared cylinder. Based on the cylindrical positions, we apply an explicit spatial attention mechanism, with non-learned weighting, that aggregates features across images according to their distances on the cylinder. The modulated features are then decoded into a depth map for each view. Evaluated on the DDAD and nuScenes datasets, our method improves both cross-view depth consistency and overall depth accuracy compared with state-of-the-art approaches. Code is available at https://abualhanud.github.io/CylinderDepthPage.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes CylinderDepth, a geometry-guided self-supervised method for surround depth estimation from calibrated multi-camera rigs. It maps per-image features onto a shared cylinder to define cross-view neighborhoods and applies explicit non-learned distance-based spatial attention to aggregate features, targeting limited receptive fields and correspondence issues. The modulated features are decoded to per-view depth maps. The central claim is that this yields improved cross-view depth consistency and overall accuracy on DDAD and nuScenes relative to prior state-of-the-art methods.

Significance. If the quantitative gains hold under realistic conditions, the approach offers a parameter-light way to enforce multi-view consistency via explicit geometry rather than learned attention, which could benefit 360° perception pipelines in autonomous driving. The explicit weighting and code release are strengths for interpretability and reproducibility.

major comments (2)
  1. [§3] §3 (Method), cylinder mapping paragraph: the claim that mapping image-specific feature positions onto a shared cylinder 'correctly establishes neighborhood relationships' is load-bearing for the consistency gains, yet the manuscript provides no sensitivity analysis or ablation on extrinsic calibration errors or rig rigidity violations. Small pose inaccuracies (common in real rigs) would distort cylinder distances and cause the fixed weighting to aggregate mismatched features, directly undermining the non-learned attention's effectiveness.
  2. [Experiments] Experiments section and associated tables: while improvements on DDAD and nuScenes are asserted, the manuscript must include explicit quantitative metrics for cross-view consistency (e.g., disparity or depth variance across overlaps) alongside standard depth errors, plus an ablation isolating the cylinder attention component; without these, the central claim that the method outperforms SOTA on both consistency and accuracy cannot be fully verified.
minor comments (2)
  1. [Abstract] Abstract: the statement 'improves both cross-view depth consistency and overall depth accuracy' would be clearer if it referenced specific table numbers or reported delta values rather than remaining purely qualitative.
  2. [§3] Notation in §3: define the exact cylinder coordinate transform (e.g., the mapping from pixel (u,v) and depth to cylindrical (θ, z)) with an equation to allow readers to reproduce the neighborhood computation.
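
For reference, the text accompanying Figure 3 gives the parameterization as an azimuth and a height relative to a cylinder centre (its Eqs. 4 and 5). A minimal sketch of that step is shown here, assuming the input points already lie on the cylinder and that the cylinder axis is parallel to the z axis, as Eq. 5 suggests. The function name and argument layout are illustrative and not taken from the authors' code.

```python
import numpy as np

def cylindrical_coords(points_on_cylinder, center):
    """Azimuth and height of points p' = (x', y', z') relative to the centre (c_x, c_y, c_z).

    points_on_cylinder: (..., 3) array of projected points p'.
    center: (3,) cylinder centre.
    Returns (theta, h) with theta in radians, following
    theta = atan2(y' - c_y, x' - c_x) and h = z' - c_z.
    """
    p = np.asarray(points_on_cylinder, dtype=float)
    c = np.asarray(center, dtype=float)
    theta = np.arctan2(p[..., 1] - c[1], p[..., 0] - c[0])
    h = p[..., 2] - c[2]
    return theta, h

# Example: three points around a centre at height 1.
pts = np.array([[1.0, 0.0, 2.0], [0.0, 1.0, 3.0], [-1.0, 0.0, 0.0]])
theta, h = cylindrical_coords(pts, center=[0.0, 0.0, 1.0])
```

Stacking `theta` and `h` per pixel yields a two-channel position map of the kind the Figure 3 text describes for each feature map.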

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback. We respond to each major comment below and indicate planned revisions to address the concerns raised.

  1. Referee: [§3] §3 (Method), cylinder mapping paragraph: the claim that mapping image-specific feature positions onto a shared cylinder 'correctly establishes neighborhood relationships' is load-bearing for the consistency gains, yet the manuscript provides no sensitivity analysis or ablation on extrinsic calibration errors or rig rigidity violations. Small pose inaccuracies (common in real rigs) would distort cylinder distances and cause the fixed weighting to aggregate mismatched features, directly undermining the non-learned attention's effectiveness.

    Authors: We agree that the effectiveness of the non-learned cylindrical attention depends on accurate extrinsics and that sensitivity to calibration errors merits explicit examination. The current manuscript follows the standard assumption of calibrated rigs used by prior surround-view methods on DDAD and nuScenes. In the revision we will add a sensitivity study that perturbs the provided extrinsics with small Gaussian noise (e.g., 0.5–2° rotation and 1–5 cm translation) and reports the resulting degradation in both depth accuracy and cross-view consistency metrics. revision: yes

  2. Referee: [Experiments] Experiments section and associated tables: while improvements on DDAD and nuScenes are asserted, the manuscript must include explicit quantitative metrics for cross-view consistency (e.g., disparity or depth variance across overlaps) alongside standard depth errors, plus an ablation isolating the cylinder attention component; without these, the central claim that the method outperforms SOTA on both consistency and accuracy cannot be fully verified.

    Authors: We acknowledge that the manuscript currently supports the consistency claim primarily through qualitative visualizations and overall depth metrics rather than dedicated quantitative consistency measures. We will add, in the revised experiments section, (i) explicit cross-view consistency metrics such as mean depth variance and disparity variance computed over overlapping image regions and (ii) an ablation that removes the cylindrical spatial attention while keeping all other components fixed, reporting both accuracy and consistency numbers for all variants. revision: yes
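
The sensitivity study proposed in the first response could be set up roughly as follows. This is a sketch under stated assumptions: extrinsics are 4x4 floating-point rigid transforms, rotation noise is drawn as an axis-angle vector with independent Gaussian components, and the helper names are hypothetical. The noise levels quoted in the rebuttal (0.5 to 2° rotation, 1 to 5 cm translation) would be passed in as `rot_sigma_deg` and `trans_sigma_m`.

```python
import numpy as np

def rotation_from_axis_angle(rvec):
    """Rodrigues' formula: rotation matrix for an axis-angle vector in radians."""
    angle = np.linalg.norm(rvec)
    if angle < 1e-12:
        return np.eye(3)
    k = rvec / angle
    K = np.array([[0.0, -k[2], k[1]],
                  [k[2], 0.0, -k[0]],
                  [-k[1], k[0], 0.0]])
    return np.eye(3) + np.sin(angle) * K + (1.0 - np.cos(angle)) * (K @ K)

def perturb_extrinsics(T, rot_sigma_deg, trans_sigma_m, rng):
    """Return a copy of the 4x4 float extrinsic matrix T with Gaussian pose noise added."""
    rvec = np.deg2rad(rng.normal(0.0, rot_sigma_deg, size=3))
    tvec = rng.normal(0.0, trans_sigma_m, size=3)
    noisy = T.copy()
    noisy[:3, :3] = rotation_from_axis_angle(rvec) @ T[:3, :3]
    noisy[:3, 3] = T[:3, 3] + tvec
    return noisy

# Example: 1 degree rotation noise and 2 cm translation noise on an identity pose.
rng = np.random.default_rng(0)
T_noisy = perturb_extrinsics(np.eye(4), rot_sigma_deg=1.0, trans_sigma_m=0.02, rng=rng)
```

Feeding the perturbed extrinsics into the cylinder projection, while evaluating against the unperturbed ground truth, would show how quickly accuracy and consistency degrade as calibration error grows.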

Circularity Check

0 steps flagged

No circularity: geometry-guided cylinder mapping and explicit attention are independent of target consistency metrics


The paper describes a self-contained method that maps per-image features onto a shared cylinder to define neighborhoods and then applies explicit non-learned distance-based attention to aggregate features before decoding depth. No equations, derivations, or load-bearing steps reduce the claimed cross-view consistency gains to a fitted parameter, self-definition, or self-citation chain. The central premise relies on geometric coordinate transforms and fixed weighting rules that are stated directly rather than derived from the evaluation results or prior author work in a circular manner. This matches the reader's assessment that the approach is presented as geometry-guided without self-referential reductions.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The approach rests on the geometric validity of the cylinder mapping for calibrated rigs and the assumption that restricting attention to small neighborhoods on the cylinder suffices to resolve border and correspondence issues.

axioms (1)
  • domain assumption: Calibrated, time-synchronized multi-camera rigs allow accurate mapping of image positions to a shared cylindrical surface.
    Stated in the abstract as the setting for the method.

pith-pipeline@v0.9.0 · 5527 in / 1120 out tokens · 44048 ms · 2026-05-17T20:37:14.515868+00:00 · methodology



Reference graph

Works this paper leans on

50 extracted references · 50 canonical work pages · 3 internal anchors

  1. [1] Ashutosh Agarwal and Chetan Arora. Attention attention everywhere: Monocular depth prediction with skip attention. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 5861–5870, 2023.

  2. [2] Holger Caesar, Varun Bankiti, Alex H Lang, Sourabh Vora, Venice Erin Liong, Qiang Xu, Anush Krishnan, Yu Pan, Giancarlo Baldan, and Oscar Beijbom. nuscenes: A multimodal dataset for autonomous driving. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 11621–11631, 2020.

  3. [3] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pages 248–255. IEEE, 2009.

  4. [4] Laiyan Ding, Hualie Jiang, Jie Li, Yongquan Chen, and Rui Huang. Towards cross-view-consistent self-supervised surround depth estimation. In 2024 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 10043–10050. IEEE, 2024.

  5. [5] David Eigen, Christian Puhrsch, and Rob Fergus. Depth map prediction from a single image using a multi-scale deep network. Advances in neural information processing systems, 27, 2014.

  6. [6] Xin Fei, Wenzhao Zheng, Yueqi Duan, Wei Zhan, Masayoshi Tomizuka, Kurt Keutzer, and Jiwen Lu. Driv3r: Learning dense 4d reconstruction for autonomous driving. arXiv preprint arXiv:2412.06777, 2024.

  7. [7] Huan Fu, Mingming Gong, Chaohui Wang, Kayhan Batmanghelich, and Dacheng Tao. Deep ordinal regression network for monocular depth estimation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2002–2011, 2018.

  8. [8] Ravi Garg, Vijay Kumar Bg, Gustavo Carneiro, and Ian Reid. Unsupervised cnn for single view depth estimation: Geometry to the rescue. In European conference on computer vision, pages 740–756. Springer, 2016.

  9. [9] Clément Godard, Oisin Mac Aodha, and Gabriel J Brostow. Unsupervised monocular depth estimation with left-right consistency. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 270–279,

  10. [10] Clément Godard, Oisin Mac Aodha, Michael Firman, and Gabriel J Brostow. Digging into self-supervised monocular depth estimation. In Proceedings of the IEEE/CVF international conference on computer vision, pages 3828–3838,

  11. [11] Xiaodong Gu, Zhiwen Fan, Siyu Zhu, Zuozhuo Dai, Feitong Tan, and Ping Tan. Cascade cost volume for high-resolution multi-view stereo and stereo matching. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 2495–2504, 2020.

  12. [12] Vitor Guizilini, Rares Ambrus, Sudeep Pillai, Allan Raventos, and Adrien Gaidon. 3d packing for self-supervised monocular depth estimation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 2485–2494, 2020.

  13. [13] Vitor Guizilini, Rares Ambrus, Dian Chen, Sergey Zakharov, and Adrien Gaidon. Multi-frame self-supervised depth with transformers. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 160–170,

  14. [14] Vitor Guizilini, Igor Vasiljevic, Rares Ambrus, Greg Shakhnarovich, and Adrien Gaidon. Full surround monodepth from multiple cameras. IEEE Robotics and Automation Letters, 7(2):5397–5404, 2022.

  15. [15] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.

  16. [16] Sunghoon Im, Hae-Gon Jeon, Stephen Lin, and In So Kweon. Dpsnet: End-to-end deep plane sweep stereo. arXiv preprint arXiv:1905.00538, 2019.

  17. [17] Adrian Johnston and Gustavo Carneiro. Self-supervised monocular trained depth estimation using self-attention and discrete disparity volume. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 4756–4765, 2020.

  18. [18] Tejas Khot, Shubham Agrawal, Shubham Tulsiani, Christoph Mertz, Simon Lucey, and Martial Hebert. Learning unsupervised multi-view stereopsis via robust photometric consistency. arXiv preprint arXiv:1905.02706,

  19. [19] Jung-Hee Kim, Junhwa Hur, Tien Phuoc Nguyen, and Seong-Gyun Jeong. Self-supervised surround-view depth estimation with volumetric feature fusion. Advances in Neural Information Processing Systems, 35:4032–4045, 2022.

  20. [20] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980,

  21. [21] Sihaeng Lee, Janghyeon Lee, Byungju Kim, Eojindl Yi, and Junmo Kim. Patch-wise attention network for monocular depth estimation. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 1873–1881, 2021.

  22. [22] Vincent Leroy, Yohann Cabon, and Jérôme Revaud. Grounding image matching in 3d with mast3r. In European Conference on Computer Vision, pages 71–91. Springer, 2024.

  23. [23] Ruihang Li, Shanding Ye, Zhe Yin, Tao Li, ZeHua Zhang, KaiKai Xiao, and Zhijie Pan. M2depth: A novel self-supervised multi-camera depth estimation with multi-level supervision. In 2024 IEEE International Conference on Multimedia and Expo (ICME), pages 1–6. IEEE, 2024.

  24. [24] Zhenyu Li, Zehui Chen, Xianming Liu, and Junjun Jiang. Depthformer: Exploiting long-range correlation and local information for accurate monocular depth estimation. Machine Intelligence Research, 20(6):837–854, 2023.

  25. [25] Fayao Liu, Chunhua Shen, Guosheng Lin, and Ian Reid. Learning depth from single monocular images using deep convolutional neural fields. IEEE transactions on pattern analysis and machine intelligence, 38(10):2024–2039, 2015.

  26. [26] Jinfeng Liu, Lingtong Kong, Bo Li, Zerong Wang, Hong Gu, and Jinwei Chen. Mono-vifi: A unified learning framework for self-supervised single and multi-frame monocular depth estimation. In European Conference on Computer Vision, pages 90–107. Springer, 2024.

  27. [27] Keyang Luo, Tao Guan, Lili Ju, Yuesong Wang, Zhuo Chen, and Yawei Luo. Attention-aware multi-view stereo. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 1590–1599, 2020.

  28. [28] Reza Mahjourian, Martin Wicke, and Anelia Angelova. Unsupervised learning of depth and ego-motion from monocular video using 3d geometric constraints. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 5667–5675, 2018.

  29. [29] René Ranftl, Alexey Bochkovskiy, and Vladlen Koltun. Vision transformers for dense prediction. In Proceedings of the IEEE/CVF international conference on computer vision, pages 12179–12188, 2021.

  30. [30] Patrick Ruhkamp, Daoyi Gao, Hanzhi Chen, Nassir Navab, and Beniamin Busam. Attention meets geometry: Geometry guided spatial-temporal attention for consistent self-supervised monocular depth estimation. In 2021 International Conference on 3D Vision (3DV), pages 837–847. IEEE, 2021.

  31. [31] Aron Schmied, Tobias Fischer, Martin Danelljan, Marc Pollefeys, and Fisher Yu. R3d3: Dense 3d reconstruction of dynamic scenes from multiple cameras. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 3216–3226, 2023.

  32. [32] Yunxiao Shi, Hong Cai, Amin Ansari, and Fatih Porikli. Ega-depth: Efficient guided attention for self-supervised multi-camera depth estimation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 119–129, 2023.

  33. [33] Igor Vasiljevic, Vitor Guizilini, Rares Ambrus, Sudeep Pillai, Wolfram Burgard, Greg Shakhnarovich, and Adrien Gaidon. Neural ray surfaces for self-supervised learning of depth and ego-motion. In 2020 International Conference on 3D Vision (3DV), pages 1–11. IEEE, 2020.

  34. [34] Fu-En Wang, Hou-Ning Hu, Hsien-Tzu Cheng, Juan-Ting Lin, Shang-Ta Yang, Meng-Li Shih, Hung-Kuo Chu, and Min Sun. Self-supervised learning of depth and camera motion from 360 videos. In Asian Conference on Computer Vision, pages 53–68. Springer, 2018.

  35. [35] Jianyuan Wang, Minghao Chen, Nikita Karaev, Andrea Vedaldi, Christian Rupprecht, and David Novotny. Vggt: Visual geometry grounded transformer. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 5294–5306, 2025.

  36. [36] Shuzhe Wang, Vincent Leroy, Yohann Cabon, Boris Chidlovskii, and Jerome Revaud. Dust3r: Geometric 3d vision made easy. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 20697–20709, 2024.

  37. [37] Xiaofeng Wang, Zheng Zhu, Guan Huang, Fangbo Qin, Yun Ye, Yijia He, Xu Chi, and Xingang Wang. Mvster: Epipolar transformer for efficient multi-view stereo. In European conference on computer vision, pages 573–591. Springer, 2022.

  38. [38] Zhou Wang, Alan C Bovik, Hamid R Sheikh, and Eero P Simoncelli. Image quality assessment: from error visibility to structural similarity. IEEE transactions on image processing, 13(4):600–612, 2004.

  39. [39] Jamie Watson, Michael Firman, Gabriel J Brostow, and Daniyar Turmukhambetov. Self-supervised monocular depth hints. In Proceedings of the IEEE/CVF international conference on computer vision, pages 2162–2171, 2019.

  40. [40] Jamie Watson, Oisin Mac Aodha, Victor Prisacariu, Gabriel Brostow, and Michael Firman. The temporal opportunist: Self-supervised multi-frame monocular depth. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 1164–1174, 2021.

  41. [41] Yi Wei, Linqing Zhao, Wenzhao Zheng, Zheng Zhu, Yongming Rao, Guan Huang, Jiwen Lu, and Jie Zhou. Surrounddepth: Entangling surrounding views for self-supervised multi-camera depth estimation. In Conference on robot learning, pages 539–549. PMLR, 2023.

  42. [42] Felix Wimbauer, Nan Yang, Christian Rupprecht, and Daniel Cremers. Behind the scenes: Density fields for single view reconstruction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9076–9086, 2023.

  43. [43] Jialei Xu, Xianming Liu, Yuanchao Bai, Junjun Jiang, and Xiangyang Ji. Self-supervised multi-camera collaborative depth prediction with latent diffusion models. IEEE Transactions on Intelligent Transportation Systems, 2025.

  44. [44] Yuchen Yang, Xinyi Wang, Dong Li, Lu Tian, Ashish Sirasao, and Xun Yang. Towards scale-aware full surround monodepth with transformers. arXiv preprint arXiv:2407.10406, 2024.

  45. [45] Yao Yao, Zixin Luo, Shiwei Li, Tian Fang, and Long Quan. Mvsnet: Depth inference for unstructured multi-view stereo. In Proceedings of the European conference on computer vision (ECCV), pages 767–783, 2018.

  46. [46] Yao Yao, Zixin Luo, Shiwei Li, Tianwei Shen, Tian Fang, and Long Quan. Recurrent mvsnet for high-resolution multi-view stereo depth inference. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 5525–5534, 2019.

  47. [47] Zhichao Yin and Jianping Shi. Geonet: Unsupervised learning of dense depth, optical flow and camera pose. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1983–1992, 2018.

  48. [48] Ilwi Yun, Hyuk-Jae Lee, and Chae Eun Rhee. Improving 360 monocular depth estimation via non-local dense prediction transformer and joint supervised and self-supervised learning. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 3224–3233, 2022.

  49. [49] Tinghui Zhou, Matthew Brown, Noah Snavely, and David G Lowe. Unsupervised learning of depth and ego-motion from video. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1851–1858, 2017.

  50. [50] Yingshuang Zou, Yikang Ding, Xi Qiu, Haoqian Wang, and Haotian Zhang. M2depth: Self-supervised two-frame multi-camera metric depth estimation. In European Conference on Computer Vision, pages 269–285. Springer, 2024.