CylinderDepth: Cylindrical Spatial Attention for Multi-View Consistent Self-Supervised Surround Depth Estimation
Pith reviewed 2026-05-17 20:37 UTC · model grok-4.3
The pith
Mapping image features onto a shared cylinder and weighting them by cylindrical distance produces consistent depth estimates across overlapping camera views.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By mapping the feature positions from each camera image onto a shared cylinder, neighborhood relationships are established between different views. An explicit spatial attention mechanism then aggregates features across images using non-learned weights based on their distances on the cylinder. These modulated features are decoded to produce a depth map per view, yielding improved cross-view depth consistency and higher overall depth accuracy on the DDAD and nuScenes datasets.
What carries the argument
Cylindrical spatial attention, which projects features from each view onto a common cylinder and aggregates them with explicit non-learned distance weighting.
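The mechanism described above can be sketched in a few lines of standard-library Python. This is a minimal illustration, not the authors' implementation: it assumes the cylinder axis coincides with the rig's vertical (z) axis, that 3D points are already expressed in a common rig frame, and that the non-learned weights come from a truncated Gaussian over geodesic distance. The function names and the `sigma` and `cutoff` values are hypothetical.

```python
import math


def to_cylinder(point, radius=1.0):
    """Map a 3D point (x, y, z) in the rig frame to cylinder coordinates.

    Assumption: the cylinder axis is the rig z axis and the point is
    projected along the ray from the origin. Returns (theta, h).
    """
    x, y, z = point
    rho = math.hypot(x, y)
    if rho == 0.0:
        raise ValueError("point lies on the cylinder axis")
    return math.atan2(y, x), radius * z / rho


def geodesic_distance(a, b, radius=1.0):
    """Shortest distance on the cylinder surface between (theta, h) pairs."""
    # Wrap the azimuth difference into [-pi, pi) so the seam is handled.
    dtheta = (a[0] - b[0] + math.pi) % (2.0 * math.pi) - math.pi
    return math.hypot(radius * dtheta, a[1] - b[1])


def aggregate(query, positions, features, sigma=0.1, cutoff=0.3):
    """Gaussian-weighted mean of the features within `cutoff` of `query`.

    The weights are fixed functions of distance (nothing is learned);
    `sigma` and `cutoff` are illustrative values, not from the paper.
    """
    total = 0.0
    out = [0.0] * len(features[0])
    for pos, feat in zip(positions, features):
        d = geodesic_distance(query, pos)
        if d > cutoff:
            continue  # truncated kernel: distant features are ignored
        w = math.exp(-0.5 * (d / sigma) ** 2)
        total += w
        for k, value in enumerate(feat):
            out[k] += w * value
    return [v / total for v in out] if total > 0.0 else out
```

Because `positions` can mix features from any camera, a query near an image border can pick up neighbors that come from the adjacent view, which is the receptive-field extension the summary refers to.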
If this is right
- Depth maps from adjacent cameras show lower inconsistency in their overlapping regions.
- Overall depth accuracy rises on standard surround-view benchmarks relative to prior self-supervised methods.
- The limited receptive field at image borders is effectively extended by borrowing information from neighboring views.
- Correspondence problems are eased because attention is restricted to small cylinder neighborhoods rather than the full image set.
Where Pith is reading between the lines
- The same cylinder projection could be tried for other surround tasks such as semantic segmentation or optical flow.
- If the non-learned weighting works reliably, similar explicit geometry cues might replace heavier learned attention blocks in other multi-camera networks.
- Experiments on rigs whose calibration drifts over time would test how sensitive the consistency gains are to exact geometry.
Load-bearing premise
That mapping each image's features onto a shared cylinder correctly identifies true 3D neighborhood relationships without introducing distortions from imperfect calibration or camera geometry.
What would settle it
Measure the variance of predicted depths for the same 3D scene points when they appear in two overlapping camera views; a drop in this variance when cylindrical attention is enabled, compared with the same network without it, would support the claim.
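A minimal sketch of such a measurement follows. It assumes the correspondences between views have already been established and that both predictions for a point have been converted to a common quantity (for example, range from the rig origin), since raw per-camera depths of the same point differ by construction. The function name and input layout are hypothetical.

```python
from statistics import fmean, pvariance


def cross_view_depth_variance(pairs):
    """Mean per-point variance of depths predicted in two overlapping views.

    `pairs` is a sequence of (depth_in_view_a, depth_in_view_b) values for
    the same 3D points, expressed in a common frame (an assumption of
    this sketch).
    """
    if not pairs:
        raise ValueError("no overlapping points supplied")
    return fmean([pvariance(pair) for pair in pairs])


# Two points: one predicted identically, one differing by 2 m across views.
score = cross_view_depth_variance([(10.0, 10.0), (20.0, 22.0)])
```

Comparing this score for a model with and without the attention module, on the same set of points, gives the before/after contrast the test calls for.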
Original abstract
Self-supervised surround-view depth estimation enables dense, low-cost 3D perception with a 360° field of view from multiple minimally overlapping images. Yet, most existing methods suffer from depth estimates that are inconsistent across overlapping images. To address this limitation, we propose a novel geometry-guided method for calibrated, time-synchronized multi-camera rigs that predicts dense metric depth. Our approach targets two main sources of inconsistency: the limited receptive field in border regions of single-image depth estimation, and the difficulty of correspondence matching. We mitigate these two issues by extending the receptive field across views and restricting cross-view attention to a small neighborhood. To this end, we establish the neighborhood relationships between images by mapping the image-specific feature positions onto a shared cylinder. Based on the cylindrical positions, we apply an explicit spatial attention mechanism, with non-learned weighting, that aggregates features across images according to their distances on the cylinder. The modulated features are then decoded into a depth map for each view. Evaluated on the DDAD and nuScenes datasets, our method improves both cross-view depth consistency and overall depth accuracy compared with state-of-the-art approaches. Code is available at https://abualhanud.github.io/CylinderDepthPage.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes CylinderDepth, a geometry-guided self-supervised method for surround depth estimation from calibrated multi-camera rigs. It maps per-image features onto a shared cylinder to define cross-view neighborhoods and applies explicit non-learned distance-based spatial attention to aggregate features, targeting limited receptive fields and correspondence issues. The modulated features are decoded to per-view depth maps. The central claim is that this yields improved cross-view depth consistency and overall accuracy on DDAD and nuScenes relative to prior state-of-the-art methods.
Significance. If the quantitative gains hold under realistic conditions, the approach offers a parameter-light way to enforce multi-view consistency via explicit geometry rather than learned attention, which could benefit 360° perception pipelines in autonomous driving. The explicit weighting and code release are strengths for interpretability and reproducibility.
major comments (2)
- [§3] §3 (Method), cylinder mapping paragraph: the claim that mapping image-specific feature positions onto a shared cylinder 'correctly establishes neighborhood relationships' is load-bearing for the consistency gains, yet the manuscript provides no sensitivity analysis or ablation on extrinsic calibration errors or rig rigidity violations. Small pose inaccuracies (common in real rigs) would distort cylinder distances and cause the fixed weighting to aggregate mismatched features, directly undermining the non-learned attention's effectiveness.
- [Experiments] Experiments section and associated tables: while improvements on DDAD and nuScenes are asserted, the manuscript must include explicit quantitative metrics for cross-view consistency (e.g., disparity or depth variance across overlaps) alongside standard depth errors, plus an ablation isolating the cylinder attention component; without these, the central claim that the method outperforms SOTA on both consistency and accuracy cannot be fully verified.
minor comments (2)
- [Abstract] Abstract: the statement 'improves both cross-view depth consistency and overall depth accuracy' would be clearer if it referenced specific table numbers or reported delta values rather than remaining purely qualitative.
- [§3] Notation in §3: define the exact cylinder coordinate transform (e.g., the mapping from pixel (u,v) and depth to cylindrical (θ, z)) with an equation to allow readers to reproduce the neighborhood computation.
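One form the requested transform could take is written out below. It is a sketch under stated assumptions rather than the paper's own equation: the symbols $K_i$ (intrinsics), $R_i$ and $\mathbf{t}_i$ (camera-to-rig extrinsics), and $\hat{d}$ (preliminary depth) are introduced here for illustration, the cylinder is taken to have unit radius with its axis along the rig's vertical axis, and the height coordinate is written $h$ instead of $z$ to avoid a clash with the 3D component $X_z$.

```latex
\begin{aligned}
\mathbf{X} &= R_i \, \hat{d}(u,v) \, K_i^{-1}
  \begin{pmatrix} u \\ v \\ 1 \end{pmatrix} + \mathbf{t}_i, \\
\theta &= \operatorname{atan2}(X_y, X_x), \qquad
h = \frac{X_z}{\sqrt{X_x^2 + X_y^2}}, \\
d_{\mathrm{geo}} &= \sqrt{(\Delta\theta)^2 + (\Delta h)^2}, \qquad
\Delta\theta \in [-\pi, \pi).
\end{aligned}
```

The last line is the geodesic distance between two positions on the unrolled unit cylinder, with the azimuth difference wrapped across the seam.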
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive feedback. We respond to each major comment below and indicate planned revisions to address the concerns raised.
- Referee: [§3] §3 (Method), cylinder mapping paragraph: the claim that mapping image-specific feature positions onto a shared cylinder 'correctly establishes neighborhood relationships' is load-bearing for the consistency gains, yet the manuscript provides no sensitivity analysis or ablation on extrinsic calibration errors or rig rigidity violations. Small pose inaccuracies (common in real rigs) would distort cylinder distances and cause the fixed weighting to aggregate mismatched features, directly undermining the non-learned attention's effectiveness.
  Authors: We agree that the effectiveness of the non-learned cylindrical attention depends on accurate extrinsics and that sensitivity to calibration errors merits explicit examination. The current manuscript follows the standard assumption of calibrated rigs used by prior surround-view methods on DDAD and nuScenes. In the revision we will add a sensitivity study that perturbs the provided extrinsics with small Gaussian noise (e.g., 0.5–2° rotation and 1–5 cm translation) and reports the resulting degradation in both depth accuracy and cross-view consistency metrics.
  Revision: yes
- Referee: [Experiments] Experiments section and associated tables: while improvements on DDAD and nuScenes are asserted, the manuscript must include explicit quantitative metrics for cross-view consistency (e.g., disparity or depth variance across overlaps) alongside standard depth errors, plus an ablation isolating the cylinder attention component; without these, the central claim that the method outperforms SOTA on both consistency and accuracy cannot be fully verified.
  Authors: We acknowledge that the manuscript currently supports the consistency claim primarily through qualitative visualizations and overall depth metrics rather than dedicated quantitative consistency measures. We will add, in the revised experiments section, (i) explicit cross-view consistency metrics such as mean depth variance and disparity variance computed over overlapping image regions and (ii) an ablation that removes the cylindrical spatial attention while keeping all other components fixed, reporting both accuracy and consistency numbers for all variants.
  Revision: yes
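The extrinsics perturbation proposed in the first response could be sketched as follows. This is an illustrative assumption about how such a study might be set up, not the authors' protocol: the rotation noise is applied as a single small rotation about a random axis, the translation noise is added independently per axis, and the default standard deviations (1° and 2 cm) are simply values inside the ranges quoted in the rebuttal. All function names are hypothetical.

```python
import math
import random


def rodrigues(axis, angle):
    """Rotation matrix for `angle` radians about the unit vector `axis`."""
    x, y, z = axis
    c, s = math.cos(angle), math.sin(angle)
    k = 1.0 - c
    return [
        [c + x * x * k, x * y * k - z * s, x * z * k + y * s],
        [y * x * k + z * s, c + y * y * k, y * z * k - x * s],
        [z * x * k - y * s, z * y * k + x * s, c + z * z * k],
    ]


def matmul3(a, b):
    """Product of two 3x3 matrices stored as nested lists."""
    return [[sum(a[i][m] * b[m][j] for m in range(3)) for j in range(3)]
            for i in range(3)]


def perturb_extrinsics(R, t, rot_sigma_deg=1.0, trans_sigma_m=0.02, rng=None):
    """Return a noisy copy of a camera pose given as rotation R and translation t.

    Rotation noise: Gaussian angle about a uniformly random axis.
    Translation noise: independent Gaussian offset per axis, in metres.
    """
    if rng is None:
        rng = random.Random()
    axis = [rng.gauss(0.0, 1.0) for _ in range(3)]
    norm = math.sqrt(sum(a * a for a in axis))
    axis = [a / norm for a in axis]
    angle = math.radians(rng.gauss(0.0, rot_sigma_deg))
    R_noisy = matmul3(rodrigues(axis, angle), R)
    t_noisy = [ti + rng.gauss(0.0, trans_sigma_m) for ti in t]
    return R_noisy, t_noisy
```

Passing a seeded `random.Random` instance as `rng` keeps the perturbations reproducible, so accuracy and consistency can be compared across noise levels on identical draws.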
Circularity Check
No circularity: geometry-guided cylinder mapping and explicit attention are independent of target consistency metrics
The paper describes a self-contained method that maps per-image features onto a shared cylinder to define neighborhoods and then applies explicit non-learned distance-based attention to aggregate features before decoding depth. No equations, derivations, or load-bearing steps reduce the claimed cross-view consistency gains to a fitted parameter, self-definition, or self-citation chain. The central premise relies on geometric coordinate transforms and fixed weighting rules that are stated directly rather than derived from the evaluation results or prior author work in a circular manner. This matches the reader's assessment that the approach is presented as geometry-guided without self-referential reductions.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Calibrated, time-synchronized multi-camera rigs allow accurate mapping of image positions to a shared cylindrical surface.
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/AlexanderDuality.lean, theorem alexander_duality_circle_linking (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Passage: "we first reconstruct the scene in 3D space, using the preliminary depth map... The resulting 3D points are then projected onto a unit-radius cylinder... attention weights based on the geodesic distance between the pixels on the cylinder"
- IndisputableMonolith/Cost/FunctionalEquation.lean, theorem washburn_uniqueness_aczel (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Passage: "explicit, non-learned spatial attention that weights pixel interactions based on the geodesic distances... truncated 2D Gaussian kernel"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.