SDNet: Semantically Guided Depth Estimation Network
Authors: M. Ochs, A. Kretz, and R. Mester
Pith reviewed 2026-05-24 16:49 UTC · model grok-4.3
The pith
A single CNN jointly trained on depth and semantics outperforms separate networks on both tasks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
A single CNN trained simultaneously for semantic segmentation and depth estimation via ordinal classification learns richer features than independent models, delivers higher accuracy on both outputs, and requires less computation.
What carries the argument
Joint training of one CNN for both semantic segmentation and ordinal depth classification, where semantic supervision guides the shared features used for depth.
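As a rough illustration of the mechanism described above, the sketch below folds a per-pixel semantic loss and a per-pixel depth-bin loss into one training objective. The function names, the `depth_weight` parameter, and the use of plain softmax cross-entropy for both heads are assumptions made for illustration; the summary does not state SDNet's actual loss.

```python
import math

def softmax_cross_entropy(logits, target):
    """Cross-entropy for one pixel: -log(softmax(logits)[target])."""
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return log_z - logits[target]

def joint_loss(sem_logits, sem_labels, depth_logits, depth_bins, depth_weight=1.0):
    """Mean semantic loss plus a weighted mean depth-bin loss over all pixels.

    depth_weight is a hypothetical balancing factor, not a value from the paper.
    """
    n = len(sem_labels)
    sem = sum(softmax_cross_entropy(l, t) for l, t in zip(sem_logits, sem_labels)) / n
    dep = sum(softmax_cross_entropy(l, t) for l, t in zip(depth_logits, depth_bins)) / n
    return sem + depth_weight * dep
```

In a real network both terms would back-propagate into the same encoder, which is where the semantic supervision could influence the features used for depth.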
If this is right
- Joint training reduces total computational cost compared with running separate depth and semantic networks.
- The network extracts more semantically meaningful features when both tasks are learned together.
- Ordinal classification for depth improves estimation quality over standard regression.
- The combined model reaches state-of-the-art performance on semantic segmentation and monocular depth on two datasets.
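The ordinal-classification idea in the list above can be sketched as follows. This is a generic ordinal encoding over log-spaced depth bins, not SDNet's confirmed formulation: the function names, the log spacing, the bin count, and the 0.5 decoding threshold are all assumptions for illustration.

```python
import math

def bin_edges(d_min, d_max, num_bins):
    """Log-spaced bin edges, so nearby depths get finer bins (assumed choice)."""
    step = (math.log(d_max) - math.log(d_min)) / num_bins
    return [math.exp(math.log(d_min) + i * step) for i in range(num_bins + 1)]

def depth_to_bin(depth, edges):
    """Index of the bin containing depth, clamped to the valid range."""
    for i in range(len(edges) - 2):
        if depth < edges[i + 1]:
            return i
    return len(edges) - 2

def ordinal_targets(bin_index, num_bins):
    """K-1 binary targets: target k is 1 when the true bin is greater than k."""
    return [1 if bin_index > k else 0 for k in range(num_bins - 1)]

def decode_bin(probabilities, threshold=0.5):
    """Predicted bin = number of 'greater than k' outputs above the threshold."""
    return sum(1 for p in probabilities if p > threshold)

def bin_to_depth(bin_index, edges):
    """Representative depth of a bin: geometric mean of its two edges."""
    return math.sqrt(edges[bin_index] * edges[bin_index + 1])
```

Unlike plain regression, this encoding penalizes a prediction that is off by many bins on more binary outputs than one that is off by a single bin, which is the usual motivation for treating depth as an ordered classification problem.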
Where Pith is reading between the lines
- Perception stacks for autonomous vehicles could run a single model instead of two, lowering latency and power draw.
- The same joint-training idea may transfer to other paired vision tasks such as instance segmentation or optical flow.
- If richer features generalize, the network could serve as a stronger backbone for additional downstream tasks without extra supervision.
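The latency argument in the first point rests on simple arithmetic: a shared encoder is evaluated once instead of twice. A minimal sketch, assuming cost is additive over the encoder and the heads and that both separate networks would use an encoder of the same size; all function names are hypothetical and no figures here come from the paper.

```python
def separate_cost(encoder, sem_head, depth_head):
    """Two independent networks: each one pays for its own encoder."""
    return (encoder + sem_head) + (encoder + depth_head)

def shared_cost(encoder, sem_head, depth_head):
    """One joint network: the encoder is evaluated once for both heads."""
    return encoder + sem_head + depth_head

def relative_saving(encoder, sem_head, depth_head):
    """Fraction of the separate-network cost avoided by sharing the encoder."""
    sep = separate_cost(encoder, sem_head, depth_head)
    return (sep - shared_cost(encoder, sem_head, depth_head)) / sep
```

With a made-up encoder cost of 80 units and heads of 10 units each, `relative_saving(80, 10, 10)` is 80/180, roughly 0.44. The saving shrinks as the task-specific heads grow relative to the encoder.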
Load-bearing premise
The measured gains in accuracy and feature quality result from training the two tasks together rather than from differences in network size, training schedule, or other design choices.
What would settle it
Train an identical architecture and schedule once jointly and once as two separate networks, then compare both final task metrics and a direct measure of feature semantic content such as linear probe accuracy on held-out labels.
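The linear-probe measurement mentioned above could look like the following sketch: a logistic-regression classifier is fitted on frozen feature vectors and scored on held-out labels. The function names, the binary-label simplification, and the hyperparameters are assumptions for illustration; a real probe would run on features extracted from the jointly and separately trained encoders.

```python
import math

def _sigmoid(z):
    z = max(-30.0, min(30.0, z))  # clamp to avoid overflow in exp
    return 1.0 / (1.0 + math.exp(-z))

def train_linear_probe(features, labels, epochs=200, lr=0.5):
    """Fit a binary logistic-regression probe with batch gradient descent."""
    dim, n = len(features[0]), len(features)
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in zip(features, labels):
            err = _sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) - y
            for j in range(dim):
                grad_w[j] += err * x[j]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

def probe_accuracy(w, b, features, labels):
    """Fraction of held-out samples the frozen-feature probe classifies correctly."""
    correct = 0
    for x, y in zip(features, labels):
        z = sum(wi * xi for wi, xi in zip(w, x)) + b
        correct += int((z > 0) == (y == 1))
    return correct / len(labels)
```

The comparison of interest would be `probe_accuracy` on features from the joint encoder versus the separate depth encoder, with the probe settings held fixed so that any gap reflects the features rather than the probe.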
Original abstract
Autonomous vehicles and robots require a full scene understanding of the environment to interact with it. Such a perception typically incorporates pixel-wise knowledge of the depths and semantic labels for each image from a video sensor. Recent learning-based methods estimate both types of information independently using two separate CNNs. In this paper, we propose a model that is able to predict both outputs simultaneously, which leads to improved results and even reduced computational costs compared to independent estimation of depth and semantics. We also empirically prove that the CNN is capable of learning more meaningful and semantically richer features. Furthermore, our SDNet estimates the depth based on ordinal classification. On the basis of these two enhancements, our proposed method achieves state-of-the-art results in semantic segmentation and depth estimation from single monocular input images on two challenging datasets.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces SDNet, a CNN that jointly predicts semantic segmentation labels and monocular depth (via ordinal classification) from a single image. It claims that simultaneous training produces state-of-the-art results on two datasets, reduces computational cost relative to separate networks, and yields semantically richer features in the shared backbone.
Significance. If the empirical attribution to joint training holds after proper controls, the result would strengthen the case for multi-task architectures in scene understanding for robotics. The ordinal-depth formulation and the reduced-cost claim are concrete strengths that could be cited by follow-up work.
major comments (2)
- [§4] §4 (Experiments) and associated tables: no ablation is described that compares the joint SDNet against independent depth and semantics networks trained with identical backbone capacity, identical epoch count, and identical data augmentation; without these matched baselines the reported gains cannot be attributed to semantic guidance rather than extra parameters or longer optimization.
- [§4] §4 and abstract: the claim of 'richer and more meaningful features' is asserted without quantitative support such as feature-visualization metrics, transfer-learning probes, or t-SNE analysis that would distinguish the joint-training effect from architecture size.
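The matched-baseline requirement in the first major comment can be made mechanical. The sketch below lists every setting that differs between a joint run and an independent run, other than the factor under test. The configuration keys (`backbone`, `epochs`, `augmentation`, `joint_training`) are hypothetical names, not taken from the manuscript.

```python
def mismatched_fields(joint_cfg, separate_cfg, allowed=("joint_training",)):
    """Return the sorted config keys that differ, ignoring the allowed ones."""
    keys = set(joint_cfg) | set(separate_cfg)
    return sorted(
        k for k in keys
        if k not in allowed and joint_cfg.get(k) != separate_cfg.get(k)
    )

# Hypothetical configurations for illustration only.
joint_run = {"backbone": "resnet50", "epochs": 50,
             "augmentation": "flip+crop", "joint_training": True}
separate_run = {"backbone": "resnet50", "epochs": 80,
                "augmentation": "flip+crop", "joint_training": False}

print(mismatched_fields(joint_run, separate_run))  # prints ['epochs']
```

An empty result would indicate that the two runs differ only in whether the tasks are trained together, which is the condition the referee asks for before gains are attributed to semantic guidance.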
minor comments (2)
- [§3] Notation for the ordinal depth bins and the cross-entropy loss weighting between the two tasks should be defined once in §3 before being used in the experimental tables.
- [Figure 2] Figure 2 (network diagram) would benefit from explicit annotation of the shared backbone versus task-specific heads to clarify parameter sharing.
Simulated Author's Rebuttal
We appreciate the referee's detailed feedback on our manuscript. We address the major comments below and will incorporate revisions to strengthen the empirical claims.
Point-by-point responses
- Referee: [§4] §4 (Experiments) and associated tables: no ablation is described that compares the joint SDNet against independent depth and semantics networks trained with identical backbone capacity, identical epoch count, and identical data augmentation; without these matched baselines the reported gains cannot be attributed to semantic guidance rather than extra parameters or longer optimization.
  Authors: We agree with the referee that a controlled comparison with matched independent networks is necessary to attribute the improvements specifically to the joint training and semantic guidance. The current manuscript compares against separate networks but does not ensure identical training conditions in all aspects. In the revised version, we will add such ablations with identical backbone, epoch count, and data augmentation to better isolate the effect of joint training.
  Revision: yes
- Referee: [§4] §4 and abstract: the claim of 'richer and more meaningful features' is asserted without quantitative support such as feature-visualization metrics, transfer-learning probes, or t-SNE analysis that would distinguish the joint-training effect from architecture size.
  Authors: We acknowledge that the assertion of semantically richer features in the shared backbone would be more convincing with additional quantitative evidence. While the manuscript demonstrates improved performance, it does not include the suggested analyses. We will include t-SNE visualizations of features from joint vs. independent training and possibly transfer learning probes in the revised manuscript to provide quantitative support for this claim.
  Revision: yes
Circularity Check
No circularity in derivation chain
The paper presents an empirical CNN architecture for joint monocular depth and semantic segmentation, with claims resting on experimental SOTA results and an assertion of richer learned features. No equations, first-principles derivations, fitted parameters renamed as predictions, or self-citation chains appear in the provided text. The ordinal depth formulation and joint-training benefit are presented as design choices validated by data, not as quantities that reduce to their own inputs by construction. The derivation chain is therefore self-contained as an engineering contribution.