arxiv: 2511.16428 · v3 · submitted 2025-11-20 · 💻 cs.CV

CylinderDepth: Cylindrical Spatial Attention for Multi-View Consistent Self-Supervised Surround Depth Estimation

Samer Abualhanud, Christian Grannemann, Max Mehltretter

Pith reviewed 2026-05-17 20:37 UTC · model grok-4.3

keywords: self-supervised depth estimation · multi-view consistency · surround depth · cylindrical attention · autonomous driving · DDAD · nuScenes · multi-camera rigs

The pith

Mapping image features onto a shared cylinder and weighting them by cylindrical distance produces consistent depth estimates across overlapping camera views.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to fix inconsistent depth values in the overlap zones of surround-camera rigs when depth is learned without direct supervision. It projects each camera's features onto one common cylinder so that nearby points on the cylinder correspond to actual neighbors in 3D space. A simple distance-based weighting then blends information only from those nearby cylinder locations instead of using full learned attention. The blended features are decoded into per-view depth maps. Readers care because reliable 360-degree depth from cheap cameras is needed for navigation and 3D scene understanding.

Core claim

By mapping the feature positions from each camera image onto a shared cylinder, neighborhood relationships are established between different views. An explicit spatial attention mechanism then aggregates features across images using non-learned weights based on their distances on the cylinder. These modulated features are decoded to produce a depth map per view, yielding improved cross-view depth consistency and higher overall depth accuracy on the DDAD and nuScenes datasets.
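The cylindrical parameterization can be sketched as follows, assuming the form quoted in the Figure 3 excerpt on this page: azimuth from `atan2` of the offset to the cylinder centre, and height as the z offset. The function name `cylindrical_coords`, the tuple layout, and the sample points are hypothetical, and the step that first projects a reconstructed 3D point onto the cylinder is not reproduced here.

```python
import math


def cylindrical_coords(p_proj, center):
    """Azimuth and height of a point that already lies on the cylinder.

    p_proj, center: (x, y, z) tuples in a shared rig frame (assumed layout).
    """
    x, y, z = p_proj
    cx, cy, cz = center
    theta = math.atan2(y - cy, x - cx)  # azimuth in radians, within [-pi, pi]
    h = z - cz                          # height relative to the cylinder centre
    return theta, h


# Two hypothetical points on a unit cylinder, e.g. the same object seen by
# two neighbouring cameras and reconstructed with slightly different depth.
center = (0.0, 0.0, 0.0)
a = (math.cos(math.radians(30)), math.sin(math.radians(30)), 0.20)
b = (math.cos(math.radians(32)), math.sin(math.radians(32)), 0.25)

theta_a, h_a = cylindrical_coords(a, center)
theta_b, h_b = cylindrical_coords(b, center)
print(math.degrees(theta_b - theta_a), h_b - h_a)
```

Points that are close in 3D end up with close `(theta, h)` pairs regardless of which camera produced them, which is the property the neighborhood definition relies on.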

What carries the argument

Cylindrical spatial attention, which projects features from each view onto a common cylinder and aggregates them with explicit non-learned distance weighting.
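A minimal sketch of non-learned distance weighting, based on the paper passages quoted further down this page, which mention geodesic distances on a unit-radius cylinder and a truncated 2D Gaussian kernel. The isotropic kernel, the values of `sigma` and `cutoff`, the normalization step, and all function names are assumptions made for illustration and are not taken from the paper.

```python
import math


def wrapped_delta(theta_a, theta_b):
    """Smallest signed azimuth difference, wrapped into [-pi, pi)."""
    return (theta_a - theta_b + math.pi) % (2.0 * math.pi) - math.pi


def geodesic_distance(p, q, radius=1.0):
    """Distance on the cylinder surface between (theta, h) positions."""
    return math.hypot(radius * wrapped_delta(p[0], q[0]), p[1] - q[1])


def attention_weights(query, keys, sigma=0.1, cutoff=0.3):
    """Truncated Gaussian weights, normalized to sum to one (assumed)."""
    raw = []
    for key in keys:
        d = geodesic_distance(query, key)
        raw.append(math.exp(-0.5 * (d / sigma) ** 2) if d <= cutoff else 0.0)
    total = sum(raw)
    return [w / total for w in raw] if total > 0.0 else raw


def aggregate(weights, features):
    """Weighted sum of feature vectors given as lists of floats."""
    dim = len(features[0])
    return [sum(w * f[k] for w, f in zip(weights, features)) for k in range(dim)]


# Hypothetical query near the +pi/-pi seam, with one neighbour on the
# other side of the seam and one position far away on the cylinder.
query = (3.10, 0.0)
keys = [(3.10, 0.0), (-3.12, 0.02), (0.0, 0.0)]
features = [[1.0, 0.0], [0.0, 1.0], [5.0, 5.0]]

weights = attention_weights(query, keys)
fused = aggregate(weights, features)
print(weights, fused)
```

Wrapping the azimuth difference matters for a full 360° rig: without it, positions on either side of the seam would be treated as far apart even though they are adjacent on the cylinder.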

If this is right

Depth maps from adjacent cameras show lower inconsistency in their overlapping regions.
Overall depth accuracy rises on standard surround-view benchmarks relative to prior self-supervised methods.
The limited receptive field at image borders is effectively extended by borrowing information from neighboring views.
Correspondence problems are eased because attention is restricted to small cylinder neighborhoods rather than the full image set.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same cylinder projection could be tried for other surround tasks such as semantic segmentation or optical flow.
If the non-learned weighting works reliably, similar explicit geometry cues might replace heavier learned attention blocks in other multi-camera networks.
Experiments on rigs whose calibration drifts over time would test how sensitive the consistency gains are to exact geometry.

Load-bearing premise

That mapping each image's features onto a shared cylinder correctly identifies true 3D neighborhood relationships without introducing distortions from imperfect calibration or camera geometry.

What would settle it

Measure the variance of predicted depths for the same 3D scene points when they appear in two overlapping camera views; a drop in this variance after the cylindrical attention would support the claim.
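One way to run such a check is sketched below, under the assumption that matched scene points have already been reconstructed from both views and expressed in a common rig frame. The function `cross_view_discrepancy` and the sample coordinates are hypothetical; this is not the consistency metric used in the paper.

```python
import math
import statistics


def cross_view_discrepancy(points_a, points_b):
    """Mean and population variance of distances between paired reconstructions.

    points_a[i] and points_b[i] are (x, y, z) reconstructions of the same
    scene point from two overlapping views, in a shared frame (assumed).
    """
    if len(points_a) != len(points_b) or not points_a:
        raise ValueError("need two non-empty, equally long point lists")
    dists = [math.dist(p, q) for p, q in zip(points_a, points_b)]
    return statistics.fmean(dists), statistics.pvariance(dists)


# Hypothetical reconstructions of two scene points from a front and a
# front-left camera, in metres.
points_a = [(10.0, 2.0, 0.5), (8.0, -1.0, 0.2)]
points_b = [(10.3, 2.0, 0.5), (8.0, -1.4, 0.2)]

mean_gap, var_gap = cross_view_discrepancy(points_a, points_b)
print(mean_gap, var_gap)
```

Comparing these two numbers with and without the cylindrical attention, on the same overlap regions, would give the before/after evidence described above.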

Figures

Figures reproduced from arXiv: 2511.16428 by Christian Grannemann, Max Mehltretter, Samer Abualhanud.

**Figure 1.** Comparison of multi-view consistency between our method and CVCDepth [4]. The star and circle denote 3D reconstructions of the same 3D object point from two different images. While prior work struggles to achieve consistency in the reconstruction across images, our method overcomes this limitation.

**Figure 2.** Overview of the proposed network. The depth network takes the target images $I_t$ as input. The lowest-scale features $F_{S,I_t}$ from all target images are projected onto a cylinder, where attention is applied based on cylindrical distances. The pose network takes the source $I_{t',1}$ and target front $I_{t,1}$ images as input to predict the temporal pose.

**Figure 3.** Visualization of the cylindrical projection of a pixel $p$ from the 3D position map $P_{S,I_t,i}$, resulting in the cylindrical position map $O_{S,I_t,i}$ for all pixels in $P_{S,I_t,i}$. From the accompanying text: We then parameterize $p'$ in cylindrical coordinates by its azimuth $\theta_{p'}$ and height $h_{p'}$: $\theta_{p'} = \operatorname{atan2}(y' - c_y,\ x' - c_x) \in (-\pi, \pi]$ (4), $h_{p'} = z' - c_z$ (5). For each feature map $F_{S,I_t,i}$, we obtain an associated position map $O_{S,I_t,i} \in \mathbb{R}^{H_S \times W_S \times 2}$ th…

**Figure 4.** Panoramic visualization of the cylindrical projection of RGB inputs. Note that in our method, only pixel positions are projected, not RGB values. This figure is provided solely for illustration, to show how objects captured from different views are mapped to nearby locations in cylindrical coordinates. (a) Back image (b) Back-left image

**Figure 5.** Attention maps for a query token (indicated by the arrow in the back-left image), as overlays on the respective RGB images, showing that this token attends to itself, nearby regions, and to the corresponding region in the spatially adjacent image. High attention is shown in red, low attention in yellow to blue.

**Figure 6.** Comparison of depth maps predicted by our method and by state-of-the-art methods on DDAD. Our results show better preserved…

**Figure 7.** Exemplary 3D reconstructions, comparing our method to the state-of-the-art on DDAD. While our method maps overlapping…

Original abstract

Self-supervised surround-view depth estimation enables dense, low-cost 3D perception with a 360° field of view from multiple minimally overlapping images. Yet, most existing methods suffer from depth estimates that are inconsistent across overlapping images. To address this limitation, we propose a novel geometry-guided method for calibrated, time-synchronized multi-camera rigs that predicts dense metric depth. Our approach targets two main sources of inconsistency: the limited receptive field in border regions of single-image depth estimation, and the difficulty of correspondence matching. We mitigate these two issues by extending the receptive field across views and restricting cross-view attention to a small neighborhood. To this end, we establish the neighborhood relationships between images by mapping the image-specific feature positions onto a shared cylinder. Based on the cylindrical positions, we apply an explicit spatial attention mechanism, with non-learned weighting, that aggregates features across images according to their distances on the cylinder. The modulated features are then decoded into a depth map for each view. Evaluated on the DDAD and nuScenes datasets, our method improves both cross-view depth consistency and overall depth accuracy compared with state-of-the-art approaches. Code is available at https://abualhanud.github.io/CylinderDepthPage.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Cylinder mapping with fixed distance attention gives a clean geometry-driven fix for cross-view inconsistency in surround depth, but the gains rest on untested calibration assumptions.

The new element is the explicit mapping of per-view features onto a shared cylinder, followed by non-learned distance-weighted attention that only mixes nearby points. This replaces the usual learned cross-view attention or simple concatenation and directly uses rig geometry to define neighborhoods. It targets the border and matching problems that produce inconsistent depths across overlapping cameras in self-supervised setups.

The choice to keep the weighting fixed rather than learned is sensible because it reduces parameters and ties the aggregation to actual 3D proximity on the cylinder surface. That is a practical step for calibrated, time-synchronized rigs. The evaluation on DDAD and nuScenes is the right place to test it, and the abstract reports gains in both consistency and accuracy over prior methods. Those are the parts that work.

The soft spot is exactly the one the stress-test note flags. The cylinder distances only correspond to real adjacency if the extrinsics are accurate and the rig stays rigid. Real deployments have small calibration drift and minor pose errors; once those appear, the fixed weights start blending mismatched features and the consistency benefit can shrink or reverse. The paper gives no ablation with added noise to the poses or any sensitivity numbers, so the central claim is harder to trust outside perfect lab conditions.

This is a targeted method for teams already running multi-camera depth on vehicles or robots. Readers who need a drop-in consistency module for nuScenes-style rigs will find the cylinder construction and the attention rule worth examining. It deserves peer review because the geometry idea is reproducible and the datasets are standard, even though the calibration robustness gap will need addressing in revision.

Referee Report

2 major / 2 minor

Summary. The paper proposes CylinderDepth, a geometry-guided self-supervised method for surround depth estimation from calibrated multi-camera rigs. It maps per-image features onto a shared cylinder to define cross-view neighborhoods and applies explicit non-learned distance-based spatial attention to aggregate features, targeting limited receptive fields and correspondence issues. The modulated features are decoded to per-view depth maps. The central claim is that this yields improved cross-view depth consistency and overall accuracy on DDAD and nuScenes relative to prior state-of-the-art methods.

Significance. If the quantitative gains hold under realistic conditions, the approach offers a parameter-light way to enforce multi-view consistency via explicit geometry rather than learned attention, which could benefit 360° perception pipelines in autonomous driving. The explicit weighting and code release are strengths for interpretability and reproducibility.

major comments (2)

[§3] §3 (Method), cylinder mapping paragraph: the claim that mapping image-specific feature positions onto a shared cylinder 'correctly establishes neighborhood relationships' is load-bearing for the consistency gains, yet the manuscript provides no sensitivity analysis or ablation on extrinsic calibration errors or rig rigidity violations. Small pose inaccuracies (common in real rigs) would distort cylinder distances and cause the fixed weighting to aggregate mismatched features, directly undermining the non-learned attention's effectiveness.
[Experiments] Experiments section and associated tables: while improvements on DDAD and nuScenes are asserted, the manuscript must include explicit quantitative metrics for cross-view consistency (e.g., disparity or depth variance across overlaps) alongside standard depth errors, plus an ablation isolating the cylinder attention component; without these, the central claim that the method outperforms SOTA on both consistency and accuracy cannot be fully verified.

minor comments (2)

[Abstract] Abstract: the statement 'improves both cross-view depth consistency and overall depth accuracy' would be clearer if it referenced specific table numbers or reported delta values rather than remaining purely qualitative.
[§3] Notation in §3: define the exact cylinder coordinate transform (e.g., the mapping from pixel (u,v) and depth to cylindrical (θ, z)) with an equation to allow readers to reproduce the neighborhood computation.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback. We respond to each major comment below and indicate planned revisions to address the concerns raised.

Referee: [§3] §3 (Method), cylinder mapping paragraph: the claim that mapping image-specific feature positions onto a shared cylinder 'correctly establishes neighborhood relationships' is load-bearing for the consistency gains, yet the manuscript provides no sensitivity analysis or ablation on extrinsic calibration errors or rig rigidity violations. Small pose inaccuracies (common in real rigs) would distort cylinder distances and cause the fixed weighting to aggregate mismatched features, directly undermining the non-learned attention's effectiveness.

Authors: We agree that the effectiveness of the non-learned cylindrical attention depends on accurate extrinsics and that sensitivity to calibration errors merits explicit examination. The current manuscript follows the standard assumption of calibrated rigs used by prior surround-view methods on DDAD and nuScenes. In the revision we will add a sensitivity study that perturbs the provided extrinsics with small Gaussian noise (e.g., 0.5–2° rotation and 1–5 cm translation) and reports the resulting degradation in both depth accuracy and cross-view consistency metrics. revision: yes
Referee: [Experiments] Experiments section and associated tables: while improvements on DDAD and nuScenes are asserted, the manuscript must include explicit quantitative metrics for cross-view consistency (e.g., disparity or depth variance across overlaps) alongside standard depth errors, plus an ablation isolating the cylinder attention component; without these, the central claim that the method outperforms SOTA on both consistency and accuracy cannot be fully verified.

Authors: We acknowledge that the manuscript currently supports the consistency claim primarily through qualitative visualizations and overall depth metrics rather than dedicated quantitative consistency measures. We will add, in the revised experiments section, (i) explicit cross-view consistency metrics such as mean depth variance and disparity variance computed over overlapping image regions and (ii) an ablation that removes the cylindrical spatial attention while keeping all other components fixed, reporting both accuracy and consistency numbers for all variants. revision: yes
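The sensitivity study proposed in the first response could be prototyped along these lines. This is a simplified sketch under loud assumptions: each extrinsic is reduced to a yaw angle plus a translation, noise is drawn independently per trial, and the reported quantity is only the azimuth shift of a single point rather than any depth or consistency metric. All function names, the sample pose, and the noise pairings are hypothetical.

```python
import math
import random


def perturb_extrinsics(yaw_rad, translation, rot_sigma_deg, trans_sigma_m, rng):
    """Add zero-mean Gaussian noise to a yaw angle and a translation."""
    noisy_yaw = yaw_rad + math.radians(rng.gauss(0.0, rot_sigma_deg))
    noisy_t = tuple(t + rng.gauss(0.0, trans_sigma_m) for t in translation)
    return noisy_yaw, noisy_t


def camera_to_rig(point_cam, yaw_rad, translation):
    """Rotate about the rig's vertical (z) axis, then translate."""
    x, y, z = point_cam
    c, s = math.cos(yaw_rad), math.sin(yaw_rad)
    tx, ty, tz = translation
    return (c * x - s * y + tx, s * x + c * y + ty, z + tz)


def azimuth(point_rig):
    """Azimuth around a cylinder axis through the rig origin (assumed)."""
    return math.atan2(point_rig[1], point_rig[0])


def mean_azimuth_shift_deg(point_cam, yaw_rad, translation,
                           rot_sigma_deg, trans_sigma_m, trials=1000, seed=0):
    """Average absolute azimuth change caused by perturbed extrinsics."""
    rng = random.Random(seed)
    reference = azimuth(camera_to_rig(point_cam, yaw_rad, translation))
    total = 0.0
    for _ in range(trials):
        noisy_yaw, noisy_t = perturb_extrinsics(
            yaw_rad, translation, rot_sigma_deg, trans_sigma_m, rng)
        shifted = azimuth(camera_to_rig(point_cam, noisy_yaw, noisy_t))
        delta = (shifted - reference + math.pi) % (2.0 * math.pi) - math.pi
        total += abs(delta)
    return math.degrees(total / trials)


# Hypothetical camera pose and a point 10 m along the camera's x axis.
point = (10.0, 0.0, 0.0)
yaw = math.radians(45.0)
t = (1.0, 0.5, 1.5)

for rot_deg, trans_m in [(0.5, 0.01), (1.0, 0.03), (2.0, 0.05)]:
    shift = mean_azimuth_shift_deg(point, yaw, t, rot_deg, trans_m)
    print(f"rot sigma {rot_deg} deg, trans sigma {trans_m} m: {shift:.3f} deg")
```

Relating the resulting azimuth shifts to the width of the attention neighborhood would indicate at which noise level the fixed weights begin to mix features from mismatched locations.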

Circularity Check

0 steps flagged

No circularity: geometry-guided cylinder mapping and explicit attention are independent of target consistency metrics

The paper describes a self-contained method that maps per-image features onto a shared cylinder to define neighborhoods and then applies explicit non-learned distance-based attention to aggregate features before decoding depth. No equations, derivations, or load-bearing steps reduce the claimed cross-view consistency gains to a fitted parameter, self-definition, or self-citation chain. The central premise relies on geometric coordinate transforms and fixed weighting rules that are stated directly rather than derived from the evaluation results or prior author work in a circular manner. This matches the reader's assessment that the approach is presented as geometry-guided without self-referential reductions.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The approach rests on the geometric validity of the cylinder mapping for calibrated rigs and the assumption that restricting attention to small neighborhoods on the cylinder suffices to resolve border and correspondence issues.

axioms (1)

domain assumption Calibrated, time-synchronized multi-camera rigs allow accurate mapping of image positions to a shared cylindrical surface.
Stated in the abstract as the setting for the method.

pith-pipeline@v0.9.0 · 5527 in / 1120 out tokens · 44048 ms · 2026-05-17T20:37:14.515868+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/AlexanderDuality.lean · alexander_duality_circle_linking · unclear
Paper passage: "we first reconstruct the scene in 3D space, using the preliminary depth map... The resulting 3D points are then projected onto a unit-radius cylinder... attention weights based on the geodesic distance between the pixels on the cylinder"

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: "explicit, non-learned spatial attention that weights pixel interactions based on the geodesic distances... truncated 2D Gaussian kernel"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

50 extracted references · 50 canonical work pages · 3 internal anchors

[1] Ashutosh Agarwal and Chetan Arora. Attention attention everywhere: Monocular depth prediction with skip attention. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 5861–5870, 2023.
[2] Holger Caesar, Varun Bankiti, Alex H Lang, Sourabh Vora, Venice Erin Liong, Qiang Xu, Anush Krishnan, Yu Pan, Giancarlo Baldan, and Oscar Beijbom. nuscenes: A multimodal dataset for autonomous driving. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 11621–11631, 2020.
[3] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pages 248–255. IEEE, 2009.
[4] Laiyan Ding, Hualie Jiang, Jie Li, Yongquan Chen, and Rui Huang. Towards cross-view-consistent self-supervised surround depth estimation. In 2024 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 10043–10050. IEEE, 2024.
[5] David Eigen, Christian Puhrsch, and Rob Fergus. Depth map prediction from a single image using a multi-scale deep network. Advances in neural information processing systems, 27, 2014.
[6] Xin Fei, Wenzhao Zheng, Yueqi Duan, Wei Zhan, Masayoshi Tomizuka, Kurt Keutzer, and Jiwen Lu. Driv3r: Learning dense 4d reconstruction for autonomous driving. arXiv preprint arXiv:2412.06777, 2024.
[7] Huan Fu, Mingming Gong, Chaohui Wang, Kayhan Batmanghelich, and Dacheng Tao. Deep ordinal regression network for monocular depth estimation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2002–2011, 2018.
[8] Ravi Garg, Vijay Kumar Bg, Gustavo Carneiro, and Ian Reid. Unsupervised cnn for single view depth estimation: Geometry to the rescue. In European conference on computer vision, pages 740–756. Springer, 2016.
[9] Clément Godard, Oisin Mac Aodha, and Gabriel J Brostow. Unsupervised monocular depth estimation with left-right consistency. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 270–279,
[10] Clément Godard, Oisin Mac Aodha, Michael Firman, and Gabriel J Brostow. Digging into self-supervised monocular depth estimation. In Proceedings of the IEEE/CVF international conference on computer vision, pages 3828–3838,
[11] Xiaodong Gu, Zhiwen Fan, Siyu Zhu, Zuozhuo Dai, Feitong Tan, and Ping Tan. Cascade cost volume for high-resolution multi-view stereo and stereo matching. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 2495–2504, 2020.
[12] Vitor Guizilini, Rares Ambrus, Sudeep Pillai, Allan Raventos, and Adrien Gaidon. 3d packing for self-supervised monocular depth estimation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 2485–2494, 2020.
[13] Vitor Guizilini, Rares Ambrus, Dian Chen, Sergey Zakharov, and Adrien Gaidon. Multi-frame self-supervised depth with transformers. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 160–170,
[14] Vitor Guizilini, Igor Vasiljevic, Rares Ambrus, Greg Shakhnarovich, and Adrien Gaidon. Full surround monodepth from multiple cameras. IEEE Robotics and Automation Letters, 7(2):5397–5404, 2022.
[15] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
[16] Sunghoon Im, Hae-Gon Jeon, Stephen Lin, and In So Kweon. Dpsnet: End-to-end deep plane sweep stereo. arXiv preprint arXiv:1905.00538, 2019.
[17] Adrian Johnston and Gustavo Carneiro. Self-supervised monocular trained depth estimation using self-attention and discrete disparity volume. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 4756–4765, 2020.
[18] Tejas Khot, Shubham Agrawal, Shubham Tulsiani, Christoph Mertz, Simon Lucey, and Martial Hebert. Learning unsupervised multi-view stereopsis via robust photometric consistency. arXiv preprint arXiv:1905.02706,
[19] Jung-Hee Kim, Junhwa Hur, Tien Phuoc Nguyen, and Seong-Gyun Jeong. Self-supervised surround-view depth estimation with volumetric feature fusion. Advances in Neural Information Processing Systems, 35:4032–4045, 2022.
[20] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980,
[21] Sihaeng Lee, Janghyeon Lee, Byungju Kim, Eojindl Yi, and Junmo Kim. Patch-wise attention network for monocular depth estimation. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 1873–1881, 2021.
[22] Vincent Leroy, Yohann Cabon, and Jérôme Revaud. Grounding image matching in 3d with mast3r. In European Conference on Computer Vision, pages 71–91. Springer, 2024.
[23] Ruihang Li, Shanding Ye, Zhe Yin, Tao Li, ZeHua Zhang, KaiKai Xiao, and Zhijie Pan. M2depth: A novel self-supervised multi-camera depth estimation with multi-level supervision. In 2024 IEEE International Conference on Multimedia and Expo (ICME), pages 1–6. IEEE, 2024.
[24] Zhenyu Li, Zehui Chen, Xianming Liu, and Junjun Jiang. Depthformer: Exploiting long-range correlation and local information for accurate monocular depth estimation. Machine Intelligence Research, 20(6):837–854, 2023.
[25] Fayao Liu, Chunhua Shen, Guosheng Lin, and Ian Reid. Learning depth from single monocular images using deep convolutional neural fields. IEEE transactions on pattern analysis and machine intelligence, 38(10):2024–2039, 2015.
[26] Jinfeng Liu, Lingtong Kong, Bo Li, Zerong Wang, Hong Gu, and Jinwei Chen. Mono-vifi: A unified learning framework for self-supervised single and multi-frame monocular depth estimation. In European Conference on Computer Vision, pages 90–107. Springer, 2024.
[27] Keyang Luo, Tao Guan, Lili Ju, Yuesong Wang, Zhuo Chen, and Yawei Luo. Attention-aware multi-view stereo. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 1590–1599, 2020.
[28] Reza Mahjourian, Martin Wicke, and Anelia Angelova. Unsupervised learning of depth and ego-motion from monocular video using 3d geometric constraints. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 5667–5675, 2018.
[29] René Ranftl, Alexey Bochkovskiy, and Vladlen Koltun. Vision transformers for dense prediction. In Proceedings of the IEEE/CVF international conference on computer vision, pages 12179–12188, 2021.
[30] Patrick Ruhkamp, Daoyi Gao, Hanzhi Chen, Nassir Navab, and Beniamin Busam. Attention meets geometry: Geometry guided spatial-temporal attention for consistent self-supervised monocular depth estimation. In 2021 International Conference on 3D Vision (3DV), pages 837–847. IEEE, 2021.
[31] Aron Schmied, Tobias Fischer, Martin Danelljan, Marc Pollefeys, and Fisher Yu. R3d3: Dense 3d reconstruction of dynamic scenes from multiple cameras. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 3216–3226, 2023.
[32] Yunxiao Shi, Hong Cai, Amin Ansari, and Fatih Porikli. Ega-depth: Efficient guided attention for self-supervised multi-camera depth estimation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 119–129, 2023.
[33] Igor Vasiljevic, Vitor Guizilini, Rares Ambrus, Sudeep Pillai, Wolfram Burgard, Greg Shakhnarovich, and Adrien Gaidon. Neural ray surfaces for self-supervised learning of depth and ego-motion. In 2020 International Conference on 3D Vision (3DV), pages 1–11. IEEE, 2020.
[34] Fu-En Wang, Hou-Ning Hu, Hsien-Tzu Cheng, Juan-Ting Lin, Shang-Ta Yang, Meng-Li Shih, Hung-Kuo Chu, and Min Sun. Self-supervised learning of depth and camera motion from 360 videos. In Asian Conference on Computer Vision, pages 53–68. Springer, 2018.
[35] Jianyuan Wang, Minghao Chen, Nikita Karaev, Andrea Vedaldi, Christian Rupprecht, and David Novotny. Vggt: Visual geometry grounded transformer. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 5294–5306, 2025.
[36] Shuzhe Wang, Vincent Leroy, Yohann Cabon, Boris Chidlovskii, and Jerome Revaud. Dust3r: Geometric 3d vision made easy. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 20697–20709, 2024.
[37] Xiaofeng Wang, Zheng Zhu, Guan Huang, Fangbo Qin, Yun Ye, Yijia He, Xu Chi, and Xingang Wang. Mvster: Epipolar transformer for efficient multi-view stereo. In European conference on computer vision, pages 573–591. Springer, 2022.
[38] Zhou Wang, Alan C Bovik, Hamid R Sheikh, and Eero P Simoncelli. Image quality assessment: from error visibility to structural similarity. IEEE transactions on image processing, 13(4):600–612, 2004.
[39] Jamie Watson, Michael Firman, Gabriel J Brostow, and Daniyar Turmukhambetov. Self-supervised monocular depth hints. In Proceedings of the IEEE/CVF international conference on computer vision, pages 2162–2171, 2019.
[40] Jamie Watson, Oisin Mac Aodha, Victor Prisacariu, Gabriel Brostow, and Michael Firman. The temporal opportunist: Self-supervised multi-frame monocular depth. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 1164–1174, 2021.
[41] Yi Wei, Linqing Zhao, Wenzhao Zheng, Zheng Zhu, Yongming Rao, Guan Huang, Jiwen Lu, and Jie Zhou. Surrounddepth: Entangling surrounding views for self-supervised multi-camera depth estimation. In Conference on robot learning, pages 539–549. PMLR, 2023.
[42] Felix Wimbauer, Nan Yang, Christian Rupprecht, and Daniel Cremers. Behind the scenes: Density fields for single view reconstruction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9076–9086, 2023.
[43] Jialei Xu, Xianming Liu, Yuanchao Bai, Junjun Jiang, and Xiangyang Ji. Self-supervised multi-camera collaborative depth prediction with latent diffusion models. IEEE Transactions on Intelligent Transportation Systems, 2025.
[44] Yuchen Yang, Xinyi Wang, Dong Li, Lu Tian, Ashish Sirasao, and Xun Yang. Towards scale-aware full surround monodepth with transformers. arXiv preprint arXiv:2407.10406, 2024.
[45] Yao Yao, Zixin Luo, Shiwei Li, Tian Fang, and Long Quan. Mvsnet: Depth inference for unstructured multi-view stereo. In Proceedings of the European conference on computer vision (ECCV), pages 767–783, 2018.
[46] Yao Yao, Zixin Luo, Shiwei Li, Tianwei Shen, Tian Fang, and Long Quan. Recurrent mvsnet for high-resolution multi-view stereo depth inference. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 5525–5534, 2019.
[47] Zhichao Yin and Jianping Shi. Geonet: Unsupervised learning of dense depth, optical flow and camera pose. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1983–1992, 2018.
[48] Ilwi Yun, Hyuk-Jae Lee, and Chae Eun Rhee. Improving 360 monocular depth estimation via non-local dense prediction transformer and joint supervised and self-supervised learning. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 3224–3233, 2022.
[49] Tinghui Zhou, Matthew Brown, Noah Snavely, and David G Lowe. Unsupervised learning of depth and ego-motion from video. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1851–1858, 2017.
[50] Yingshuang Zou, Yikang Ding, Xi Qiu, Haoqian Wang, and Haotian Zhang. M2depth: Self-supervised two-frame multi-camera metric depth estimation. In European Conference on Computer Vision, pages 269–285. Springer, 2024.