MultiDepth: Single-Image Depth Estimation via Multi-Task Regression and Classification
Pith reviewed 2026-05-24 16:14 UTC · model grok-4.3
The pith
Multi-task learning with regression and depth classification improves single-image depth estimation accuracy.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
End-to-end multi-task learning using both regression for continuous depth and classification for depth intervals considerably improves training and yields more accurate depth estimates from single images compared to regression alone.
What carries the argument
A shared CNN backbone with separate regression and classification heads, where the classification of depth intervals serves as an auxiliary task during training.
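The arrangement described above can be illustrated with a toy sketch. This is not the authors' network: the single-number "pixel", the weight layout, and the names `shared_features`, `regression_head`, `classification_head`, and `forward` are all hypothetical stand-ins, chosen only to show that both heads read the same shared features and that skipping the auxiliary head leaves the regression output untouched.

```python
def shared_features(pixel, w_shared):
    """Stand-in for the shared backbone: one ReLU unit per weight."""
    return [max(0.0, w * pixel) for w in w_shared]

def regression_head(feats, w_reg):
    """Main branch: a single continuous depth value."""
    return sum(f * w for f, w in zip(feats, w_reg))

def classification_head(feats, w_cls):
    """Auxiliary branch: one logit per depth interval."""
    return [sum(f * w for f, w in zip(feats, row)) for row in w_cls]

def forward(pixel, params, with_aux=True):
    """Run the shared features once, then the heads that are enabled."""
    feats = shared_features(pixel, params["shared"])
    depth = regression_head(feats, params["reg"])
    # The auxiliary head is only needed for the training loss.
    logits = classification_head(feats, params["cls"]) if with_aux else None
    return depth, logits

# Hypothetical weights, for illustration only.
params = {
    "shared": [0.5, -1.0],
    "reg": [2.0, 3.0],
    "cls": [[1.0, 0.0], [0.0, 1.0]],
}
train_depth, train_logits = forward(4.0, params, with_aux=True)
test_depth, test_logits = forward(4.0, params, with_aux=False)
```

Because the regression head never reads the classification logits, disabling the auxiliary branch changes only the amount of computation, not the predicted depth.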
If this is right
- Training converges faster and more stably for depth regression networks.
- Depth predictions achieve higher accuracy on the KITTI benchmark.
- The auxiliary task can be removed at inference without affecting the regression output.
- Applications in autonomous driving and advanced driver assistance systems benefit from the more accurate depth estimates.
Where Pith is reading between the lines
- Similar multi-task strategies might benefit other regression problems in computer vision that face convergence issues.
- The classification task could encourage the network to learn more robust scene structure features.
- Defining optimal depth intervals for the classification task may require dataset-specific tuning.
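The last point, about choosing depth intervals, can be made concrete with a small sketch. The page does not say how the paper defines its intervals, so both spacing schemes below (uniform and log-spaced) and the helper names `bin_edges` and `depth_to_bin` are assumptions for illustration; the depth range used in the example is likewise hypothetical.

```python
import math

def bin_edges(d_min, d_max, n_bins, log_spaced=True):
    """Return n_bins + 1 interval boundaries between d_min and d_max."""
    if log_spaced:
        lo, hi = math.log(d_min), math.log(d_max)
        return [math.exp(lo + (hi - lo) * i / n_bins) for i in range(n_bins + 1)]
    return [d_min + (d_max - d_min) * i / n_bins for i in range(n_bins + 1)]

def depth_to_bin(depth, edges):
    """Index of the interval containing depth; out-of-range values are clamped."""
    n_bins = len(edges) - 1
    for i in range(n_bins):
        if depth < edges[i + 1]:
            return i
    return n_bins - 1

# Hypothetical range: 8 uniform bins over 0 to 80 metres.
uniform = bin_edges(0.0, 80.0, 8, log_spaced=False)
label = depth_to_bin(25.0, uniform)
```

Log spacing gives nearby depths finer intervals than distant ones, which is one reason the choice of scheme and bin count may need tuning per dataset.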
Load-bearing premise
The auxiliary classification task supplies helpful training signals to the shared features without interfering negatively with the regression task or needing heavy tuning of task weights.
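The premise about task weights is easier to see with the combined objective written out. The following is a minimal sketch, not the paper's loss: the L1 regression term, the plain cross-entropy term, the single scalar `aux_weight`, and the convention that a ground-truth depth of 0 marks a pixel without a measurement are all assumptions made for illustration.

```python
import math

def l1_loss(pred, target):
    """Mean absolute error over pixels with a valid (non-zero) ground truth."""
    pairs = [(p, t) for p, t in zip(pred, target) if t > 0]
    return sum(abs(p - t) for p, t in pairs) / len(pairs)

def cross_entropy(logits, label):
    """Cross-entropy of one pixel's interval logits against its true interval."""
    m = max(logits)  # subtract the maximum for numerical stability
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[label]

def joint_loss(depth_pred, depth_true, bin_logits, bin_true, aux_weight=1.0):
    """Regression loss plus weighted auxiliary classification loss."""
    reg = l1_loss(depth_pred, depth_true)
    valid = [(lg, b) for lg, b, t in zip(bin_logits, bin_true, depth_true) if t > 0]
    cls = sum(cross_entropy(lg, b) for lg, b in valid) / len(valid)
    return reg + aux_weight * cls

# Two pixels; the second has no ground truth and is ignored by both terms.
loss = joint_loss([10.0, 20.0], [12.0, 0.0], [[0.0, 0.0], [1.0, 2.0]], [0, 1])
```

Setting `aux_weight` to 0 recovers the regression-only objective, which is the baseline the premise has to be compared against.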
What would settle it
Training the same architecture with only the regression task and observing equal or better accuracy on KITTI would show that the multi-task approach does not improve results.
Original abstract
We introduce MultiDepth, a novel training strategy and convolutional neural network (CNN) architecture that allows approaching single-image depth estimation (SIDE) as a multi-task problem. SIDE is an important part of road scene understanding. It, thus, plays a vital role in advanced driver assistance systems and autonomous vehicles. Best results for the SIDE task so far have been achieved using deep CNNs. However, optimization of regression problems, such as estimating depth, is still a challenging task. For the related tasks of image classification and semantic segmentation, numerous CNN-based methods with robust training behavior have been proposed. Hence, in order to overcome the notorious instability and slow convergence of depth value regression during training, MultiDepth makes use of depth interval classification as an auxiliary task. The auxiliary task can be disabled at test-time to predict continuous depth values using the main regression branch more efficiently. We applied MultiDepth to road scenes and present results on the KITTI depth prediction dataset. In experiments, we were able to show that end-to-end multi-task learning with both, regression and classification, is able to considerably improve training and yield more accurate results.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces MultiDepth, a CNN architecture and training strategy for single-image depth estimation (SIDE) on road scenes. It frames SIDE as a multi-task problem with a primary regression branch for continuous depth values and an auxiliary classification branch over depth intervals; the auxiliary task is used only during training to stabilize optimization and is disabled at test time. The approach is evaluated on the KITTI depth prediction dataset, with the abstract asserting that the combined objective improves training behavior and final accuracy.
Significance. If the claimed empirical gains hold under proper controls, the multi-task formulation could provide a lightweight way to regularize depth regression without architectural changes at inference. The idea of leveraging classification robustness to aid regression is plausible and has precedents in other vision tasks, but the manuscript supplies no quantitative support for the improvement, limiting any assessment of practical impact or novelty relative to existing multi-task depth methods.
major comments (2)
- [Abstract] The central empirical claim that 'end-to-end multi-task learning with both, regression and classification, is able to considerably improve training and yield more accurate results' is unsupported: the manuscript contains no tables, figures, numerical metrics (e.g., RMSE, δ<1.25), baseline comparisons, ablation results, or training curves to substantiate the assertion.
- The weakest assumption identified in the reader's report—that the auxiliary depth-interval classification task supplies useful gradient signal without negative transfer or extensive task-weight tuning—is never tested or quantified; no loss-weighting schedule, gradient-norm analysis, or ablation removing the auxiliary head appears in the text.
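The two metrics named in the first comment have standard definitions in the depth-estimation literature, sketched here for reference. The function names and the convention that a ground-truth value of 0 marks a pixel without a measurement are assumptions of this sketch, and predictions are assumed to be strictly positive.

```python
import math

def rmse(pred, target):
    """Root-mean-square error over pixels with valid (non-zero) ground truth."""
    pairs = [(p, t) for p, t in zip(pred, target) if t > 0]
    return math.sqrt(sum((p - t) ** 2 for p, t in pairs) / len(pairs))

def delta_accuracy(pred, target, threshold=1.25):
    """Fraction of valid pixels with max(pred/gt, gt/pred) below the threshold."""
    pairs = [(p, t) for p, t in zip(pred, target) if t > 0]
    hits = sum(1 for p, t in pairs if max(p / t, t / p) < threshold)
    return hits / len(pairs)

# Three pixels; the third has no ground truth and is skipped.
pred = [10.0, 20.0, 5.0]
target = [10.0, 10.0, 0.0]
error = rmse(pred, target)
accuracy = delta_accuracy(pred, target)
```

Lower RMSE and higher δ<1.25 accuracy are better; reporting both for the multi-task model and a regression-only baseline is the kind of evidence the comment asks for.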
Simulated Author's Rebuttal
We thank the referee for their detailed review and constructive feedback. We agree that the current manuscript version lacks sufficient quantitative evidence and ablations to support the central claims, and we will revise accordingly to address these gaps.
Point-by-point responses
- Referee: [Abstract] The central empirical claim that 'end-to-end multi-task learning with both, regression and classification, is able to considerably improve training and yield more accurate results' is unsupported: the manuscript contains no tables, figures, numerical metrics (e.g., RMSE, δ<1.25), baseline comparisons, ablation results, or training curves to substantiate the assertion.
  Authors: We acknowledge this point. The abstract asserts empirical improvements on KITTI without accompanying metrics or visuals in the current text. In the revised manuscript, we will add a results section with tables reporting RMSE and δ<1.25, baseline comparisons against standard regression-only models, ablation studies, and training curves showing convergence differences.
  Revision: yes
- Referee: The weakest assumption identified in the reader's report—that the auxiliary depth-interval classification task supplies useful gradient signal without negative transfer or extensive task-weight tuning—is never tested or quantified; no loss-weighting schedule, gradient-norm analysis, or ablation removing the auxiliary head appears in the text.
  Authors: We agree that the contribution of the auxiliary task requires explicit validation. The revised manuscript will include an ablation study with the auxiliary head removed, a description of the loss-weighting schedule used (e.g., equal weighting or tuned ratios), and a discussion of any observed negative transfer or gradient behavior to quantify the auxiliary task's benefit.
  Revision: yes
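The loss-weighting options mentioned in the second response could be written down as follows. This is purely illustrative: neither the review nor the abstract states which schedule the paper uses, so the function name `aux_weight`, the three modes, and the linear decay are hypothetical.

```python
def aux_weight(step, total_steps, mode="equal", ratio=0.5):
    """Weight applied to the auxiliary classification loss at a training step."""
    if mode == "equal":
        return 1.0  # both tasks contribute equally throughout
    if mode == "fixed":
        return ratio  # a tuned constant ratio
    if mode == "decay":
        # Linearly fade the auxiliary task out over training.
        return max(0.0, 1.0 - step / total_steps)
    raise ValueError(f"unknown mode: {mode}")

halfway = aux_weight(50, 100, mode="decay")
```

An ablation with the auxiliary head removed corresponds to a weight of 0 for every step, which makes it a natural reference point for any of these schedules.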
Circularity Check
No significant circularity
The paper presents an empirical CNN architecture and training strategy for single-image depth estimation, with the central claim resting on experimental results on the KITTI dataset rather than any derivation or prediction. No equations, fitted parameters renamed as predictions, self-citations as load-bearing uniqueness theorems, or ansatzes are present in the abstract or described approach. The multi-task regression+classification benefit is stated as an observed outcome from end-to-end training, not a self-referential construction.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean: washburn_uniqueness_aczel (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: "end-to-end multi-task learning with both, regression and classification, is able to considerably improve training"
- IndisputableMonolith/Foundation/RealityFromDistinction.lean: reality_from_one_distinction (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: "auxiliary depth interval classification task"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.