Adaptive Depth-converted-Scale Convolution for Self-supervised Monocular Depth Estimation
Pith reviewed 2026-05-10 17:17 UTC · model grok-4.3
The pith
Depth-converted-Scale Convolution adapts filter scales using depth priors to resolve size-depth ambiguity in monocular videos.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By converting depth information to adjust the scale of convolution filters, DcSConv extracts features matched to actual object sizes at their distances, reducing size and depth ambiguity that arises from continuous size changes in monocular video sequences. A companion Depth-converted-Scale aware Fusion module combines these adapted features with those from conventional convolutions.
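The fusion step can be pictured as a per-position blend of the two feature streams. The sketch below is only an illustration under stated assumptions: the source does not say how the fusion weights are computed, so the sigmoid gate, the function name `fuse`, and the `gate_logits` input are all hypothetical, and flat Python lists stand in for feature maps.

```python
import math

def sigmoid(x):
    """Logistic function mapping a real-valued logit to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def fuse(scale_feat, conv_feat, gate_logits):
    """Convex combination of scale-adapted and conventional features.

    Hypothetical stand-in for DcS-F: each position gets a weight w in (0, 1)
    and the output is w * scale_feat + (1 - w) * conv_feat.
    """
    fused = []
    for fs, fc, g in zip(scale_feat, conv_feat, gate_logits):
        w = sigmoid(g)
        fused.append(w * fs + (1.0 - w) * fc)
    return fused

# A zero logit weights both streams equally.
balanced = fuse([2.0], [4.0], [0.0])
# A large positive logit selects the scale-adapted stream almost entirely.
scale_dominant = fuse([2.0], [4.0], [50.0])
```

In a trained network the gate logits would presumably be predicted from the features themselves, which is what would make the weighting adaptive rather than manual.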
What carries the argument
Depth-converted-Scale Convolution (DcSConv), a plug-in module that adapts the scale of the convolution filter according to the prior relationship between object depth and object scale rather than deforming the filter shape locally.
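A minimal sketch of scale adaptation, not the authors' implementation. It assumes the tap spacing of a filter at each pixel is inversely proportional to the depth at that pixel, so that nearby (and therefore larger-looking) content is sampled with a wider filter. The mapping `depth_to_scale`, its constant `k`, the clamping range, and the use of a 1D signal with linear interpolation are all assumptions made for illustration.

```python
import math

def depth_to_scale(depth, k=4.0, min_scale=0.5, max_scale=4.0):
    """Hypothetical depth-to-scale mapping: scale = k / depth, clamped."""
    return max(min_scale, min(max_scale, k / depth))

def sample_linear(signal, pos):
    """Linearly interpolate `signal` at fractional index `pos` (edges clamped)."""
    pos = max(0.0, min(len(signal) - 1.0, pos))
    lo = int(math.floor(pos))
    hi = min(lo + 1, len(signal) - 1)
    frac = pos - lo
    return (1.0 - frac) * signal[lo] + frac * signal[hi]

def scale_adaptive_conv1d(signal, depth, weights):
    """Apply a filter whose tap spacing at pixel i is set by depth[i].

    The filter shape (the weights) is fixed; only its scale changes.
    """
    half = len(weights) // 2
    out = []
    for i in range(len(signal)):
        s = depth_to_scale(depth[i])
        acc = 0.0
        for j, w in enumerate(weights):
            acc += w * sample_linear(signal, i + (j - half) * s)
        out.append(acc)
    return out

signal = [float(i) for i in range(10)]   # a linear ramp
depth = [1.0] * 5 + [8.0] * 5            # near half, then far half
weights = [1 / 3, 1 / 3, 1 / 3]          # 3-tap averaging filter
features = scale_adaptive_conv1d(signal, depth, weights)
```

With these inputs the near pixels are sampled with a spacing of 4 and the far pixels with a spacing of 0.5, while the three weights stay the same. This is the contrast with deformable convolution, which moves individual taps rather than rescaling the whole filter.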
If this is right
- Existing CNN-based monocular depth estimators gain accuracy when DcSConv is inserted as a plug-and-play replacement for standard convolution blocks.
- Adaptive fusion via DcS-F allows the network to combine scale-matched features with conventional ones without manual weighting.
- Error metrics such as SqRel improve by up to 11.6 percent on the KITTI benchmark across multiple baseline architectures.
- The emphasis on scale over local deformation suggests that receptive-field sizing is a primary driver of performance in depth-from-video settings.
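For readers unfamiliar with the metric named above, SqRel is the squared relative error commonly reported on KITTI: the mean of (gt - pred)^2 / gt over pixels with valid ground truth. The sketch below assumes flat lists of depths and treats non-positive ground truth as invalid; the function name and the sample values are illustrative and not taken from the paper.

```python
def sq_rel(pred, gt):
    """Squared relative error: mean of (gt - pred)^2 / gt over valid pixels."""
    valid = [(p, g) for p, g in zip(pred, gt) if g > 0]
    if not valid:
        raise ValueError("no valid ground-truth pixels")
    return sum((g - p) ** 2 / g for p, g in valid) / len(valid)

# Toy example: the third pixel has no ground truth and is ignored.
pred = [3.0, 2.0, 5.0]
gt = [2.0, 4.0, 0.0]
error = sq_rel(pred, gt)  # (1/2 + 4/4) / 2
```

Because the squared error is divided by the true depth only once, SqRel penalizes large errors heavily, which is one reason it tends to move more than AbsRel when a method fixes gross scale mistakes.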
Where Pith is reading between the lines
- The same depth-to-scale conversion principle could be tested in related tasks such as video object detection where perspective scaling also occurs.
- Removing the need for an initial depth estimate to drive the scale conversion would make the module fully self-contained.
- Direct comparisons against deformable convolution variants on the same KITTI splits would isolate the contribution of scale adaptation versus shape deformation.
Load-bearing premise
That the prior relationship between object depth and object scale can be effectively incorporated into the convolution to extract features from appropriate scales and resolve size and depth ambiguity in monocular videos.
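One way to make this premise concrete is the standard pinhole camera relation. The symbols here are assumptions introduced for illustration and do not come from the paper: f is the focal length, H the physical height of an object, d its depth, and h its height in the image.

```latex
h = \frac{f\,H}{d}
\qquad\Longrightarrow\qquad
\frac{f\,(kH)}{k\,d} = \frac{f\,H}{d} = h
\quad \text{for any } k > 0
```

The left-hand relation is the depth-to-scale prior: image size falls off inversely with depth. The right-hand identity is the size-depth ambiguity: an object k times larger and k times farther away produces the same image height, so a single image size cannot fix depth without further cues.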
What would settle it
A controlled experiment on scenes containing objects of inconsistent physical sizes placed at identical depths, checking whether the reported error reductions over standard CNN baselines disappear when the depth-to-scale conversion is removed.
Original abstract
Self-supervised monocular depth estimation (MDE) has received increasing interest in the last few years. The objects in the scene, including their sizes and the relationships among them, are the main cues for extracting the scene structure. However, previous works lack explicit handling of how an object's size changes with its depth. In a monocular video especially, the size of the same object changes continuously, resulting in size and depth ambiguity. To address this problem, we propose a Depth-converted-Scale Convolution (DcSConv) enhanced monocular depth estimation framework, which incorporates the prior relationship between object depth and object scale to extract features from appropriate scales of the convolution receptive field. The proposed DcSConv focuses on the adaptive scale of the convolution filter instead of the local deformation of its shape. It establishes that the scale of the convolution filter matters no less (or even more in the evaluated task) than its local deformation. Moreover, a Depth-converted-Scale aware Fusion (DcS-F) is developed to adaptively fuse the DcSConv features and the conventional convolution features. Our DcSConv enhanced monocular depth estimation framework can be applied on top of existing CNN-based methods as a plug-and-play module to enhance the conventional convolution block. Extensive experiments with different baselines have been conducted on the KITTI benchmark, and our method achieves the best results, with an improvement of up to 11.6% in terms of SqRel reduction. An ablation study also validates the effectiveness of each proposed module.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes Depth-converted-Scale Convolution (DcSConv) and Depth-converted-Scale aware Fusion (DcS-F) as a plug-and-play enhancement to CNN-based self-supervised monocular depth estimation. DcSConv adapts the scale of convolution filters using the prior relationship between object depth and scale to address size-depth ambiguity in monocular video, claiming that filter scale is at least as important as local shape deformation. The method is evaluated on the KITTI benchmark, reporting up to 11.6% SqRel reduction over baselines with supporting ablations.
Significance. If the self-supervised training loop is stably implemented, the work would be significant for showing how geometric priors can be directly embedded into convolution receptive fields rather than post-processed. It offers a new angle on scale adaptation versus deformable convolutions and demonstrates empirical gains across multiple baselines on a standard benchmark.
major comments (2)
- [Abstract and §3] Abstract and method description: DcSConv converts predicted depth to convolution scales inside the feature extractor, but the network is trained self-supervised with depth as the learned output. No description is given of mechanisms (stop-gradient, detached depth head, or staged training) to prevent circular dependency or unstable gradients, which directly affects whether the claimed incorporation of the depth-scale prior is valid.
- [Experiments] Experiments section: The reported 11.6% SqRel improvement and ablation results lack error bars, precise baseline re-implementation details, data augmentation pipelines, and full numerical tables. Without these, the quantitative support for the central claim that DcSConv outperforms prior scale-handling approaches cannot be fully assessed.
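The error bars requested in the second major comment could be reported as a mean and sample standard deviation over independent runs, alongside the relative reduction against the baseline. The sketch below uses Python's `statistics` module; the function names are illustrative, and the run values are placeholders chosen for the example, not results from the paper.

```python
import statistics

def summarize(runs):
    """Mean and sample standard deviation over independent training runs."""
    return statistics.mean(runs), statistics.stdev(runs)

def relative_reduction(baseline, improved):
    """Fractional error reduction relative to the baseline (0.116 means 11.6%)."""
    return (baseline - improved) / baseline

# Placeholder SqRel values for three seeds. NOT taken from the paper.
baseline_runs = [0.90, 0.91, 0.89]
method_runs = [0.80, 0.81, 0.79]

base_mean, base_std = summarize(baseline_runs)
meth_mean, meth_std = summarize(method_runs)
reduction = relative_reduction(base_mean, meth_mean)
```

Reporting the spread next to the reduction lets a reader judge whether a gain of a few percent exceeds seed-to-seed variation.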
minor comments (2)
- [Abstract and Experiments] The strong statement that scale 'matters no less (or even more) than local deformation' would be strengthened by a direct head-to-head comparison against deformable convolution baselines in the main results table.
- [§3] Notation for the depth-to-scale mapping function and the fusion weights in DcS-F should be introduced with explicit equations early in the method section for reproducibility.
Simulated Author's Rebuttal
We thank the referee for the constructive comments and the positive assessment of the work's potential significance. We address each major comment below and will revise the manuscript to incorporate clarifications and additional details.
-
Referee: [Abstract and §3] Abstract and method description: DcSConv converts predicted depth to convolution scales inside the feature extractor, but the network is trained self-supervised with depth as the learned output. No description is given of mechanisms (stop-gradient, detached depth head, or staged training) to prevent circular dependency or unstable gradients, which directly affects whether the claimed incorporation of the depth-scale prior is valid.
Authors: We appreciate this observation on the training dynamics. In our implementation, the predicted depth map is detached from the computation graph (via stop-gradient) when deriving the per-pixel scale factors for DcSConv kernels; this breaks the direct circular dependency while still allowing the depth-scale prior to guide feature extraction. The depth head itself is trained end-to-end via the self-supervised photometric loss. We will add an explicit description, including pseudocode, in the revised §3 to clarify this mechanism and confirm training stability. revision: yes
-
Referee: [Experiments] Experiments section: The reported 11.6% SqRel improvement and ablation results lack error bars, precise baseline re-implementation details, data augmentation pipelines, and full numerical tables. Without these, the quantitative support for the central claim that DcSConv outperforms prior scale-handling approaches cannot be fully assessed.
Authors: We agree that fuller experimental documentation is needed. The revised manuscript will include: (i) error bars from three independent runs with different seeds, (ii) precise baseline re-implementation details (official codebases, identical hyperparameters and training schedules), (iii) the complete data-augmentation pipeline, and (iv) exhaustive numerical tables for all metrics and ablations. The reported 11.6% SqRel reduction is the relative improvement versus the strongest re-implemented baseline on the KITTI Eigen split. revision: yes
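The stop-gradient mechanism described in the first response can be illustrated with a toy scalar autodiff. This is a sketch under assumptions, not the authors' training code: the `Value` class, the single shared weight `w`, and the two-step "depth then feature" computation are hypothetical simplifications. In a framework such as PyTorch the same role is played by `Tensor.detach()`.

```python
class Value:
    """Tiny scalar reverse-mode autodiff node (illustrative only)."""

    def __init__(self, data, parents=(), grad_fns=()):
        self.data = data
        self.grad = 0.0
        self._parents = parents
        self._grad_fns = grad_fns  # maps upstream grad to each parent's grad

    def __mul__(self, other):
        return Value(self.data * other.data, (self, other),
                     (lambda g: g * other.data, lambda g: g * self.data))

    def detach(self):
        """Stop-gradient: same value, but cut off from the graph."""
        return Value(self.data)

    def backward(self, grad=1.0):
        """Accumulate gradients by walking every path back to the leaves."""
        self.grad += grad
        for parent, fn in zip(self._parents, self._grad_fns):
            parent.backward(fn(grad))

def grad_wrt_w(detach_scale):
    """Gradient of the feature w.r.t. a shared weight, with or without detach."""
    w = Value(2.0)            # shared network weight (hypothetical)
    x = Value(3.0)            # input
    depth = w * x             # stand-in for the predicted depth
    scale = depth.detach() if detach_scale else depth
    feature = scale * w       # stand-in for the scale-adapted feature
    feature.backward()
    return w.grad

coupled = grad_wrt_w(detach_scale=False)   # gradient flows through the scale
detached = grad_wrt_w(detach_scale=True)   # scale is treated as a constant
```

Without the detach, the feature is w^2 * x and the weight receives gradient through both the feature and the scale. With the detach, the scale acts as a constant and only the direct path contributes, which is the sense in which the circular dependency is broken.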
Circularity Check
No significant circularity detected in derivation chain
The paper proposes DcSConv as a plug-and-play architectural module that adapts convolution scales using a depth-to-scale prior relationship, combined with a fusion step (DcS-F). No load-bearing step reduces a claimed prediction or result to its own fitted inputs or self-citations by construction; the abstract and method description present the scale adaptation as an explicit incorporation of geometric prior rather than a tautological re-use of the network's depth output. Experimental validation on KITTI benchmarks with reported improvements over baselines confirms the chain contains independent content from the new module design.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: there exists a prior relationship between object depth and object scale in images.