arxiv: 2604.07665 · v1 · submitted 2026-04-09 · 💻 cs.CV

Adaptive Depth-converted-Scale Convolution for Self-supervised Monocular Depth Estimation

Pith reviewed 2026-05-10 17:17 UTC · model grok-4.3

classification 💻 cs.CV
keywords self-supervised monocular depth estimation · depth-converted-scale convolution · scale adaptation · size-depth ambiguity · KITTI benchmark · convolutional neural networks

The pith

Depth-converted-Scale Convolution adapts filter scales using depth priors to resolve size-depth ambiguity in monocular videos.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes Depth-converted-Scale Convolution (DcSConv) to address changing object sizes caused by depth variations in self-supervised monocular depth estimation from video. Previous approaches do not handle this scale change explicitly, which leaves the scene structure ambiguous. DcSConv uses the prior relationship between object depth and object scale to select appropriate convolution receptive field sizes. The paper argues that filter scale matters at least as much as local shape deformation for this task. The module can be inserted into existing CNN baselines and produces measurable accuracy gains on standard benchmarks.

Core claim

By converting depth information to adjust the scale of convolution filters, DcSConv extracts features matched to actual object sizes at their distances, reducing size and depth ambiguity that arises from continuous size changes in monocular video sequences. A companion Depth-converted-Scale aware Fusion module combines these adapted features with those from conventional convolutions.
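One way to picture the fusion step is a per-position gate that blends the two feature streams. The sketch below is an illustrative assumption, not the paper's DcS-F definition: the sigmoid gate, the name `gated_fuse`, and the gate logits are all hypothetical, and in a real network the logits would be predicted by a learned layer.

```python
import math


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def gated_fuse(dcs_feats, conv_feats, gate_logits):
    """Blend two equally long feature lists with a per-position gate in (0, 1).

    Hypothetical form: fused = w * dcs + (1 - w) * conv, with w = sigmoid(logit).
    """
    if not (len(dcs_feats) == len(conv_feats) == len(gate_logits)):
        raise ValueError("all inputs must have the same length")
    fused = []
    for f_dcs, f_conv, logit in zip(dcs_feats, conv_feats, gate_logits):
        w = sigmoid(logit)
        fused.append(w * f_dcs + (1.0 - w) * f_conv)
    return fused


# A logit of 0 weights both streams equally; a large logit favors the DcSConv stream.
fused = gated_fuse([2.0, 4.0], [0.0, 0.0], [0.0, 50.0])
```

Because the gate is computed per position, no hand-tuned global weight is needed, which matches the "adaptive" fusion the summary describes.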

What carries the argument

Depth-converted-Scale Convolution (DcSConv), a plug-in module that adapts the scale of the convolution filter according to the prior relationship between object depth and object scale rather than deforming the filter shape locally.
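To make the idea of a depth-driven filter scale concrete, here is a minimal 1D sketch. It assumes a pinhole-style prior in which apparent size is proportional to 1/depth, so nearer positions get wider tap spacing. The names `depth_to_dilation`, `sample_linear`, and `depth_scaled_conv1d`, the reference depth, and the clamping range are all hypothetical; the paper's actual mapping and 2D implementation are not given on this page.

```python
def depth_to_dilation(depth, ref_depth=10.0, min_d=0.5, max_d=4.0):
    """Map depth to tap spacing, assuming spacing is proportional to 1/depth."""
    if depth <= 0:
        raise ValueError("depth must be positive")
    return max(min_d, min(max_d, ref_depth / depth))


def sample_linear(signal, pos):
    """Linearly interpolate signal at a fractional index, clamping at the borders."""
    pos = max(0.0, min(float(len(signal) - 1), pos))
    lo = int(pos)
    hi = min(lo + 1, len(signal) - 1)
    frac = pos - lo
    return (1.0 - frac) * signal[lo] + frac * signal[hi]


def depth_scaled_conv1d(signal, depth, kernel):
    """Apply an odd-length kernel whose tap spacing at each position follows depth."""
    if len(signal) != len(depth):
        raise ValueError("signal and depth must have the same length")
    if len(kernel) % 2 != 1:
        raise ValueError("kernel length must be odd")
    half = len(kernel) // 2
    out = []
    for i in range(len(signal)):
        spacing = depth_to_dilation(depth[i])
        acc = 0.0
        for k, w in enumerate(kernel):
            acc += w * sample_linear(signal, i + (k - half) * spacing)
        out.append(acc)
    return out


# A ramp signal with a difference kernel: the response grows with the tap spacing.
signal = [float(i) for i in range(9)]
depth = [10.0] * 9
depth[4] = 5.0  # a nearer point, so spacing doubles there
out = depth_scaled_conv1d(signal, depth, [-1.0, 0.0, 1.0])
# out[3] is 2.0 (spacing 1) and out[4] is 4.0 (spacing 2)
```

Note that the kernel weights stay fixed and only the sampling spacing changes, which is the distinction the summary draws between scale adaptation and per-tap shape deformation.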

If this is right

  • Existing CNN-based monocular depth estimators gain accuracy when DcSConv is inserted as a plug-and-play replacement for standard convolution blocks.
  • Adaptive fusion via DcS-F allows the network to combine scale-matched features with conventional ones without manual weighting.
  • Error metrics such as SqRel improve by up to 11.6 percent on the KITTI benchmark across multiple baseline architectures.
  • The emphasis on scale over local deformation suggests that receptive-field sizing is a primary driver of performance in depth-from-video settings.
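For readers unfamiliar with the metric, the sketch below computes SqRel as it is commonly defined in depth evaluation (mean of squared error divided by ground-truth depth) and the relative reduction between two scores. The function names and all numbers are toy placeholders chosen for illustration, not values from the paper.

```python
def sq_rel(pred, gt):
    """Squared relative error: mean of (pred - gt)^2 / gt over valid pixels."""
    if len(pred) != len(gt) or not gt:
        raise ValueError("pred and gt must be non-empty and equally long")
    if any(g <= 0 for g in gt):
        raise ValueError("ground-truth depths must be positive")
    return sum((p - g) ** 2 / g for p, g in zip(pred, gt)) / len(gt)


def relative_reduction(baseline, improved):
    """Fractional reduction of an error metric relative to the baseline."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (baseline - improved) / baseline


# Toy depths in meters (placeholders only).
score = sq_rel([11.0, 18.0], [10.0, 20.0])   # (1/10 + 4/20) / 2 = 0.15
drop = relative_reduction(0.5, 0.25)          # 0.5, i.e. a 50 percent reduction
```

A figure such as "up to 11.6 percent" is a relative reduction of this kind, so its absolute size depends on how large the baseline error already is.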

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same depth-to-scale conversion principle could be tested in related tasks such as video object detection where perspective scaling also occurs.
  • Removing the need for an initial depth estimate to drive the scale conversion would make the module fully self-contained.
  • Direct comparisons against deformable convolution variants on the same KITTI splits would isolate the contribution of scale adaptation versus shape deformation.

Load-bearing premise

That the prior relationship between object depth and object scale can be effectively incorporated into the convolution to extract features from appropriate scales and resolve size and depth ambiguity in monocular videos.
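A natural candidate for this prior is the pinhole projection relation, written out below as a sketch. The symbols are introduced here for illustration and are not taken from the paper: $S$ is an object's physical extent, $Z$ its depth, $f$ the focal length in pixels, $s$ the projected extent in the image, and $r(Z)$ a receptive-field size chosen to cover a fixed physical extent. Whether the paper uses exactly this form is not stated on this page.

```latex
s = \frac{f\,S}{Z},
\qquad
\frac{s_1}{s_2} = \frac{Z_2}{Z_1}
\quad \text{(same object, two frames)},
\qquad
r(Z) \propto \frac{1}{Z}
```

The relation assumes a roughly fronto-parallel object and a fixed focal length. It also shows why the premise is load-bearing: $s$ alone cannot separate $S$ from $Z$, so the prior helps only to the extent that a usable depth estimate is available.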

What would settle it

A controlled experiment on scenes containing objects of inconsistent physical sizes placed at identical depths, checking whether the reported error reductions over standard CNN baselines disappear when the depth-to-scale conversion is removed.

Figures

Figures reproduced from arXiv: 2604.07665 by Huasong Zhou, Huibin Bai, Hui Yuan, Shuai Li, Tian Xie, Wei Hua, Xingyu Gao, Xun Cai, Yanbo Gao.

Figure 1: Illustration of the object size and depth change at successive frames.
Figure 2: The framework of the proposed adaptive Depth-converted-Scale Convolution (DcSConv) enhanced monocular depth estimation. The progressively …
Figure 3: Illustration of the relationship between the object scale in an image …
Figure 4: Illustration of the Depth-converted Multiple Scale convolution Fusion (DMSF).
Figure 5: Illustration of Depth-converted-Scale Convolution (DcSConv) in comparison with the standard convolution among the successive frames. …
Figure 6: Illustration of the Depth-converted-Scale aware Fusion (DcS-F).
Figure 7: Architecture of the Depth-converted-Scale aware feature Decoding (DcS-D) module. Different from the conventional skip-connection based feature …
Figure 8: Sample qualitative results from the NYU V2 indoor benchmark dataset.
Figure 9: Qualitative comparison of different methods on the KITTI Eigen split. Our model produces better quality depth maps with clearer object edges.
Original abstract

Self-supervised monocular depth estimation (MDE) has received increasing interest in the last few years. The objects in the scene, including the object size and the relationships among different objects, are the main cues for extracting the scene structure. However, previous works lack explicit handling of the change in an object's size caused by the change in its depth. Especially in a monocular video, the size of the same object changes continuously, resulting in size and depth ambiguity. To address this problem, we propose a Depth-converted-Scale Convolution (DcSConv) enhanced monocular depth estimation framework, which incorporates the prior relationship between object depth and object scale to extract features from appropriate scales of the convolution receptive field. The proposed DcSConv focuses on the adaptive scale of the convolution filter instead of the local deformation of its shape. It establishes that the scale of the convolution filter matters no less (or even more in the evaluated task) than its local deformation. Moreover, a Depth-converted-Scale aware Fusion (DcS-F) is developed to adaptively fuse the DcSConv features and the conventional convolution features. Our DcSConv enhanced monocular depth estimation framework can be applied on top of existing CNN-based methods as a plug-and-play module to enhance the conventional convolution block. Extensive experiments with different baselines have been conducted on the KITTI benchmark, and our method achieves the best results with an improvement of up to 11.6% in terms of SqRel reduction. An ablation study also validates the effectiveness of each proposed module.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes Depth-converted-Scale Convolution (DcSConv) and Depth-converted-Scale aware Fusion (DcS-F) as a plug-and-play enhancement to CNN-based self-supervised monocular depth estimation. DcSConv adapts the scale of convolution filters using the prior relationship between object depth and scale to address size-depth ambiguity in monocular video, claiming that filter scale is at least as important as local shape deformation. The method is evaluated on the KITTI benchmark, reporting up to 11.6% SqRel reduction over baselines with supporting ablations.

Significance. If the self-supervised training loop is stably implemented, the work would be significant for showing how geometric priors can be directly embedded into convolution receptive fields rather than post-processed. It offers a new angle on scale adaptation versus deformable convolutions and demonstrates empirical gains across multiple baselines on a standard benchmark.

major comments (2)
  1. [Abstract and §3] Abstract and method description: DcSConv converts predicted depth to convolution scales inside the feature extractor, but the network is trained self-supervised with depth as the learned output. No description is given of mechanisms (stop-gradient, detached depth head, or staged training) to prevent circular dependency or unstable gradients, which directly affects whether the claimed incorporation of the depth-scale prior is valid.
  2. [Experiments] Experiments section: The reported 11.6% SqRel improvement and ablation results lack error bars, precise baseline re-implementation details, data augmentation pipelines, and full numerical tables. Without these, the quantitative support for the central claim that DcSConv outperforms prior scale-handling approaches cannot be fully assessed.
minor comments (2)
  1. [Abstract and Experiments] The strong statement that scale 'matters no less (or even more) than local deformation' would be strengthened by a direct head-to-head comparison against deformable convolution baselines in the main results table.
  2. [§3] Notation for the depth-to-scale mapping function and the fusion weights in DcS-F should be introduced with explicit equations early in the method section for reproducibility.
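Major comment 1 lists stop-gradient as one way to keep the depth-to-scale path from feeding gradients back into the depth prediction. The toy scalar autodiff below illustrates what that does; it is a self-contained sketch, not the authors' code, and the `Value` class and `grad_wrt_depth` function are hypothetical. In a real framework the same role is played by an operation such as PyTorch's `Tensor.detach()` or JAX's `jax.lax.stop_gradient`.

```python
class Value:
    """Minimal scalar reverse-mode autodiff node (illustrative only)."""

    def __init__(self, data, parents=()):
        self.data = float(data)
        self.grad = 0.0
        self._parents = parents
        self._backward = lambda: None

    def __add__(self, other):
        out = Value(self.data + other.data, (self, other))

        def _backward():
            self.grad += out.grad
            other.grad += out.grad

        out._backward = _backward
        return out

    def __mul__(self, other):
        out = Value(self.data * other.data, (self, other))

        def _backward():
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad

        out._backward = _backward
        return out

    def detach(self):
        """Return a node with the same value but no link to the graph."""
        return Value(self.data)

    def backward(self):
        order, seen = [], set()

        def visit(node):
            if id(node) not in seen:
                seen.add(id(node))
                for parent in node._parents:
                    visit(parent)
                order.append(node)

        visit(self)
        self.grad = 1.0
        for node in reversed(order):
            node._backward()


def grad_wrt_depth(detach_scale):
    """Gradient of a toy loss with respect to depth, with or without stop-gradient."""
    depth = Value(2.0)      # stands in for the predicted depth
    feature = Value(3.0)    # stands in for an input feature
    scale = depth.detach() if detach_scale else depth
    loss = feature * scale + depth   # scale path plus a direct depth path
    loss.backward()
    return depth.grad


with_loop = grad_wrt_depth(detach_scale=False)   # 4.0: both paths contribute
without_loop = grad_wrt_depth(detach_scale=True)  # 1.0: only the direct path
```

With the scale path detached, depth still receives a gradient through its direct contribution to the loss, but the scale used inside the convolution is treated as a constant during the backward pass.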

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments and the positive assessment of the work's potential significance. We address each major comment below and will revise the manuscript to incorporate clarifications and additional details.

Point-by-point responses
  1. Referee: [Abstract and §3] Abstract and method description: DcSConv converts predicted depth to convolution scales inside the feature extractor, but the network is trained self-supervised with depth as the learned output. No description is given of mechanisms (stop-gradient, detached depth head, or staged training) to prevent circular dependency or unstable gradients, which directly affects whether the claimed incorporation of the depth-scale prior is valid.

    Authors: We appreciate this observation on the training dynamics. In our implementation, the predicted depth map is detached from the computation graph (via stop-gradient) when deriving the per-pixel scale factors for DcSConv kernels; this breaks the direct circular dependency while still allowing the depth-scale prior to guide feature extraction. The depth head itself is trained end-to-end via the self-supervised photometric loss. We will add an explicit description, including pseudocode, in the revised §3 to clarify this mechanism and confirm training stability. revision: yes

  2. Referee: [Experiments] Experiments section: The reported 11.6% SqRel improvement and ablation results lack error bars, precise baseline re-implementation details, data augmentation pipelines, and full numerical tables. Without these, the quantitative support for the central claim that DcSConv outperforms prior scale-handling approaches cannot be fully assessed.

    Authors: We agree that fuller experimental documentation is needed. The revised manuscript will include: (i) error bars from three independent runs with different seeds, (ii) precise baseline re-implementation details (official codebases, identical hyperparameters and training schedules), (iii) the complete data-augmentation pipeline, and (iv) exhaustive numerical tables for all metrics and ablations. The reported 11.6% SqRel reduction is the relative improvement versus the strongest re-implemented baseline on the KITTI Eigen split. revision: yes
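The error bars promised in response 2 amount to reporting a mean and a spread over repeated runs. The sketch below shows one conventional way to do that with Python's standard library, using the sample standard deviation. The function names are hypothetical and the numbers are placeholders, not results from the paper.

```python
import statistics


def summarize_runs(values):
    """Return (mean, sample standard deviation) over repeated runs."""
    if len(values) < 2:
        raise ValueError("need at least two runs to estimate a spread")
    return statistics.mean(values), statistics.stdev(values)


def format_with_error_bar(values, digits=3):
    """Format a metric as 'mean ± std' for a results table."""
    mean, std = summarize_runs(values)
    return f"{mean:.{digits}f} ± {std:.{digits}f}"


# Placeholder scores from three hypothetical seeds.
mean, std = summarize_runs([1.0, 2.0, 3.0])
cell = format_with_error_bar([1.0, 2.0, 3.0])   # "2.000 ± 1.000"
```

With only three seeds the standard deviation is itself a rough estimate, so a reported gain is more convincing when it clearly exceeds the spread of both the baseline and the proposed method.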

Circularity Check

0 steps flagged

No significant circularity detected in derivation chain

Full rationale

The paper proposes DcSConv as a plug-and-play architectural module that adapts convolution scales using a depth-to-scale prior relationship, combined with a fusion step (DcS-F). No load-bearing step reduces a claimed prediction or result to its own fitted inputs or self-citations by construction; the abstract and method description present the scale adaptation as an explicit incorporation of geometric prior rather than a tautological re-use of the network's depth output. Experimental validation on KITTI benchmarks with reported improvements over baselines confirms the chain contains independent content from the new module design.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The paper relies on the domain assumption of a depth-scale relationship and introduces two new modules, DcSConv and DcS-F; the abstract specifies no additional free parameters.

axioms (1)
  • domain assumption: There exists a prior relationship between object depth and object scale in images.
    Invoked to justify adapting convolution scale based on depth.
