pith. sign in

arxiv: 1907.10659 · v1 · pith:GOWB65BBnew · submitted 2019-07-24 · 💻 cs.CV

SDNet: Semantically Guided Depth Estimation Network

Pith reviewed 2026-05-24 16:49 UTC · model grok-4.3

classification 💻 cs.CV
keywords depth estimationsemantic segmentationmonocular visionjoint trainingCNNordinal classificationscene understanding
0
0 comments X

The pith

A single CNN jointly trained on depth and semantics outperforms separate networks on both tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a network that predicts pixel-wise depth and semantic labels at the same time from a single monocular image. Joint training produces better accuracy on each task and lower overall computation than running two independent CNNs. The authors further claim the shared model learns more meaningful, semantically richer internal features. Depth prediction is handled through ordinal classification instead of direct regression. These changes together yield state-of-the-art numbers on two challenging datasets.

Core claim

A single CNN trained simultaneously for semantic segmentation and depth estimation via ordinal classification learns richer features than independent models, delivers higher accuracy on both outputs, and requires less computation.

What carries the argument

Joint training of one CNN for both semantic segmentation and ordinal depth classification, where semantic supervision guides the shared features used for depth.

If this is right

  • Joint training reduces total computational cost compared with running separate depth and semantic networks.
  • The network extracts more semantically meaningful features when both tasks are learned together.
  • Ordinal classification for depth improves estimation quality over standard regression.
  • The combined model reaches state-of-the-art performance on semantic segmentation and monocular depth on two datasets.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Perception stacks for autonomous vehicles could run a single model instead of two, lowering latency and power draw.
  • The same joint-training idea may transfer to other paired vision tasks such as instance segmentation or optical flow.
  • If richer features generalize, the network could serve as a stronger backbone for additional downstream tasks without extra supervision.

Load-bearing premise

The measured gains in accuracy and feature quality result from training the two tasks together rather than from differences in network size, training schedule, or other design choices.

What would settle it

Train an identical architecture and schedule once jointly and once as two separate networks, then compare both final task metrics and a direct measure of feature semantic content such as linear probe accuracy on held-out labels.

Figures

Figures reproduced from arXiv: 1907.10659 by Adrian Kretz, Matthias Ochs, Rudolf Mester.

Figure 1
Figure 1. Figure 1: Our proposed SDNet estimates pixel-wise depth and semantic labels from [PITH_FULL_IMAGE:figures/full_fig_p001_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Encoder-decoder architecture of SDNet with ASPP module. The decoder [PITH_FULL_IMAGE:figures/full_fig_p003_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: The left plot shows the exponential and linear discretization of the depths [PITH_FULL_IMAGE:figures/full_fig_p004_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Qualitative comparison with other state-of-the-art methods on KITTI [PITH_FULL_IMAGE:figures/full_fig_p009_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: D1 error of the SDNet depth estimate from the ground truth on three [PITH_FULL_IMAGE:figures/full_fig_p010_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Qualitative results of semantic segmentation and depth estimation using [PITH_FULL_IMAGE:figures/full_fig_p011_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: Qualitative results of the semantic segmentation and the depth estimation [PITH_FULL_IMAGE:figures/full_fig_p012_7.png] view at source ↗
read the original abstract

Autonomous vehicles and robots require a full scene understanding of the environment to interact with it. Such a perception typically incorporates pixel-wise knowledge of the depths and semantic labels for each image from a video sensor. Recent learning-based methods estimate both types of information independently using two separate CNNs. In this paper, we propose a model that is able to predict both outputs simultaneously, which leads to improved results and even reduced computational costs compared to independent estimation of depth and semantics. We also empirically prove that the CNN is capable of learning more meaningful and semantically richer features. Furthermore, our SDNet estimates the depth based on ordinal classification. On the basis of these two enhancements, our proposed method achieves state-of-the-art results in semantic segmentation and depth estimation from single monocular input images on two challenging datasets.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces SDNet, a CNN that jointly predicts semantic segmentation labels and monocular depth (via ordinal classification) from a single image. It claims that simultaneous training produces SOTA results on two datasets, reduces computational cost relative to separate networks, and yields semantically richer features in the shared backbone.

Significance. If the empirical attribution to joint training holds after proper controls, the result would strengthen the case for multi-task architectures in scene understanding for robotics. The ordinal-depth formulation and the reduced-cost claim are concrete strengths that could be cited by follow-up work.

major comments (2)
  1. [§4] §4 (Experiments) and associated tables: no ablation is described that compares the joint SDNet against independent depth and semantics networks trained with identical backbone capacity, identical epoch count, and identical data augmentation; without these matched baselines the reported gains cannot be attributed to semantic guidance rather than extra parameters or longer optimization.
  2. [§4] §4 and abstract: the claim of 'richer and more meaningful features' is asserted without quantitative support such as feature-visualization metrics, transfer-learning probes, or t-SNE analysis that would distinguish the joint-training effect from architecture size.
minor comments (2)
  1. [§3] Notation for the ordinal depth bins and the cross-entropy loss weighting between the two tasks should be defined once in §3 before being used in the experimental tables.
  2. [Figure 2] Figure 2 (network diagram) would benefit from explicit annotation of the shared backbone versus task-specific heads to clarify parameter sharing.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We appreciate the referee's detailed feedback on our manuscript. We address the major comments below and will incorporate revisions to strengthen the empirical claims.

read point-by-point responses
  1. Referee: [§4] §4 (Experiments) and associated tables: no ablation is described that compares the joint SDNet against independent depth and semantics networks trained with identical backbone capacity, identical epoch count, and identical data augmentation; without these matched baselines the reported gains cannot be attributed to semantic guidance rather than extra parameters or longer optimization.

    Authors: We agree with the referee that a controlled comparison with matched independent networks is necessary to attribute the improvements specifically to the joint training and semantic guidance. The current manuscript compares against separate networks but does not ensure identical training conditions in all aspects. In the revised version, we will add such ablations with identical backbone, epoch count, and data augmentation to better isolate the effect of joint training. revision: yes

  2. Referee: [§4] §4 and abstract: the claim of 'richer and more meaningful features' is asserted without quantitative support such as feature-visualization metrics, transfer-learning probes, or t-SNE analysis that would distinguish the joint-training effect from architecture size.

    Authors: We acknowledge that the assertion of semantically richer features in the shared backbone would be more convincing with additional quantitative evidence. While the manuscript demonstrates improved performance, it does not include the suggested analyses. We will include t-SNE visualizations of features from joint vs. independent training and possibly transfer learning probes in the revised manuscript to provide quantitative support for this claim. revision: yes

Circularity Check

0 steps flagged

No circularity in derivation chain

full rationale

The paper presents an empirical CNN architecture for joint monocular depth and semantic segmentation, with claims resting on experimental SOTA results and an assertion of richer learned features. No equations, first-principles derivations, fitted parameters renamed as predictions, or self-citation chains appear in the provided text. The ordinal depth formulation and joint-training benefit are presented as design choices validated by data, not as quantities that reduce to their own inputs by construction. The derivation chain is therefore self-contained as an engineering contribution.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Review performed on abstract only; no explicit free parameters, axioms, or invented entities are stated. The work implicitly relies on standard supervised CNN training assumptions (availability of labeled depth and semantic data, convergence of joint optimization) that are not enumerated.

pith-pipeline@v0.9.0 · 5658 in / 1055 out tokens · 22087 ms · 2026-05-24T16:49:00.807220+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

33 extracted references · 33 canonical work pages · 1 internal anchor

  1. [1]

    International Journal of Computer Vision (IJCV) 126(9), 961–972 (2018)

    Alhaija, H., Mustikovela, S., Mescheder, L., Geiger, A., Rother, C.: Augmented reality meets computer vision: Efficient data generation for urban driving scenes. International Journal of Computer Vision (IJCV) 126(9), 961–972 (2018)

  2. [2]

    In: European Conference on Computer Vision (ECCV)

    Chen, L.C., Zhu, Y., Papandreou, G., Schroff, F., Adam, H.: Encoder-decoder with atrous separable convolution for semantic image segmentation. In: European Conference on Computer Vision (ECCV). pp. 833–851 (2018)

  3. [3]

    In: International Joint Conference on Neural Networks

    Cheng, J., Wang, Z., Pollastri, G.: A neural network approach to ordinal regression. In: International Joint Conference on Neural Networks. pp. 1279–1284 (2008)

  4. [4]

    In: Conference on Computer Vision and Pattern Recogni- tion (CVPR)

    Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: Conference on Computer Vision and Pattern Recogni- tion (CVPR). pp. 3213–3223 (2016)

  5. [5]

    In: Advances in Neural Information Processing Systems (NIPS)

    Eigen, D., Puhrsch, C., Fergus, R.: Depth map prediction from a single image using a multi-scale deep network. In: Advances in Neural Information Processing Systems (NIPS). pp. 2366–2374 (2014)

  6. [6]

    In: Conference on Computer Vision and Pattern Recognition (CVPR)

    Fu, H., Gong, M., Wang, C., Batmanghelich, K., Tao, D.: Deep ordinal regression network for monocular depth estimation. In: Conference on Computer Vision and Pattern Recognition (CVPR). pp. 2002–2011 (2018)

  7. [7]

    In: European Conference on Computer Vision (ECCV)

    Garg, R., Kumar BG, V., Carneiro, G., Reid, I.: Unsupervised CNN for single view depth estimation: Geometry to the rescue. In: European Conference on Computer Vision (ECCV). pp. 740–756 (2016)

  8. [8]

    In: Conference on Computer Vision and Pattern Recognition (CVPR)

    Gidaris, S., Komodakis, N.: Detect, replace, refine: Deep structured prediction for pixel wise labeling. In: Conference on Computer Vision and Pattern Recognition (CVPR). pp. 7187–7196 (2017)

  9. [9]

    In: Conference on Computer Vision and Pattern Recognition (CVPR)

    Godard, C., Mac Aodha, O., Brostow, G.J.: Unsupervised monocular depth esti- mation with left-right consistency. In: Conference on Computer Vision and Pattern Recognition (CVPR). pp. 6602–6611 (2017)

  10. [10]

    In: European Conference on Computer Vision (ECCV)

    Guo, X., Li, H., Yi, S., Ren, J., Wang, X.: Learning monocular depth by distill- ing cross-domain stereo networks. In: European Conference on Computer Vision (ECCV). pp. 484–500 (2018)

  11. [11]

    IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI) 30(2), 328–341 (2008)

    Hirschmller, H.: Stereo processing by semiglobal matching and mutual information. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI) 30(2), 328–341 (2008)

  12. [12]

    In: Conference on Computer Vision and Pattern Recognition (CVPR)

    Knbelreiter, P., Reinbacher, C., Shekhovtsov, A., Pock, T.: End-to-end training of hybrid CNN-CRF models for stereo. In: Conference on Computer Vision and Pattern Recognition (CVPR). pp. 1456–1465 (2017)

  13. [13]

    In: Winter Conference on Applications of Computer Vision (WACV) (2019)

    Kong, S., Fowlkes, C.: Pixel-wise attentional gating for parsimonious pixel labeling. In: Winter Conference on Applications of Computer Vision (WACV) (2019)

  14. [14]

    In: Conference on Computer Vision and Pattern Recognition (CVPR)

    Kuznietsov, Y., Stckler, J., Leibe, B.: Semi-supervised deep learning for monocular depth map prediction. In: Conference on Computer Vision and Pattern Recognition (CVPR). pp. 2215–2223 (2017)

  15. [15]

    In: Conference on Computer Vision and Pattern Recognition (CVPR)

    Ladick, L., Shi, J., Pollefeys, M.: Pulling things out of perspective. In: Conference on Computer Vision and Pattern Recognition (CVPR). pp. 89–96 (2014)

  16. [16]

    Pattern Recognition 83, 328 – 339 (2018)

    Li, B., Dai, Y., He, M.: Monocular depth estimation with hierarchical fusion of dilated cnns and soft-weighted-sum inference. Pattern Recognition 83, 328 – 339 (2018)

  17. [17]

    Deep attention-based classification network for robust depth prediction

    Li, R., Xian, K., Shen, C., Cao, Z., Lu, H., Hang, L.: Deep attention-based clas- sification network for robust depth prediction. CoRR arXiv, 1807.03959 [cs.CV] (2018) 14 M. Ochs, A. Kretz, and R. Mester

  18. [18]

    IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI) (2018)

    Lin, T., Goyal, P., Girshick, R., He, K., Dollar, P.: Focal loss for dense object detec- tion. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI) (2018)

  19. [19]

    IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI) 38(10), 2024–2039 (2016)

    Liu, F., Shen, C., Lin, G., Reid, I.: Learning depth from single monocular images using deep convolutional neural fields. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI) 38(10), 2024–2039 (2016)

  20. [20]

    In: Conference on Computer Vision and Pattern Recognition (CVPR)

    Liu, M., Salzmann, M., He, X.: Discrete-continuous depth estimation from a single image. In: Conference on Computer Vision and Pattern Recognition (CVPR). pp. 716–723 (2014)

  21. [21]

    In: Conference on Computer Vision and Pattern Recognition (CVPR)

    Mayer, N., Ilg, E., Husser, P., Fischer, P., Cremers, D., Dosovitskiy, A., Brox, T.: A large dataset to train convolutional networks for disparity, optical flow, and scene flow estimation. In: Conference on Computer Vision and Pattern Recognition (CVPR). pp. 4040–4048 (2016)

  22. [22]

    In: Conference on Computer Vision and Pattern Recognition (CVPR)

    Niu, Z., Zhou, M., Wang, L., Gao, X., Hua, G.: Ordinal regression with multiple output cnn for age estimation. In: Conference on Computer Vision and Pattern Recognition (CVPR). pp. 4920–4928 (2016)

  23. [23]

    Ren, J., Yang, C., Yan, Q.: Cascade residual learning: A two-stage convolutional neural network for stereo matching

    Pang, J., Sun, W., SJ. Ren, J., Yang, C., Yan, Q.: Cascade residual learning: A two-stage convolutional neural network for stereo matching. In: International Conference on Computer Vision (ICCV) - Workshop. pp. 887–895 (2017)

  24. [24]

    In: Advances in Neural Information Processing Systems (NIPS)

    Saxena, A., Chung, S.H., Ng, A.Y.: Learning depth from single monocular images. In: Advances in Neural Information Processing Systems (NIPS). pp. 1161 – 1168 (2005)

  25. [25]

    IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI) 31(5), 824–840 (2009)

    Saxena, A., Sun, M., Ng, A.Y.: Make3D: Learning 3D scene structure from a single still image. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI) 31(5), 824–840 (2009)

  26. [26]

    In: Intelligent Vehicles Symposium (IV)

    Schneider, L., Cordts, M., Rehfeld, T., Pfeiffer, D., Enzweiler, M., Franke, U., Polle- feys, M., Roth, S.: Semantic stixels: Depth is not enough. In: Intelligent Vehicles Symposium (IV). pp. 110–117 (2016)

  27. [27]

    In: International Conference on 3D Vision (3DV) (2017)

    Uhrig, J., Schneider, N., Schneider, L., Franke, U., Brox, T., Geiger, A.: Sparsity invariant CNNs. In: International Conference on 3D Vision (3DV) (2017)

  28. [28]

    Journal of Machine Learning Research (JMLR) 17, 2287–2318 (2016)

    ˇZbontar, J., LeCun, Y.: Stereo matching by training a convolutional neural network to compare image patches. Journal of Machine Learning Research (JMLR) 17, 2287–2318 (2016)

  29. [29]

    In: European Conference on Computer Vi- sion (ECCV)

    Wu, Y., He, K.: Group normalization. In: European Conference on Computer Vi- sion (ECCV). pp. 3–19 (2018)

  30. [30]

    In: European Conference on Com- puter Vision (ECCV)

    Xie, J., Girshick, R., Farhadi, A.: Deep3D: fully automatic 2D-to-3D video conver- sion with deep convolutional neural networks. In: European Conference on Com- puter Vision (ECCV). pp. 842–857 (2016)

  31. [31]

    In: European Conference on Computer Vision (ECCV)

    Yang, G., Zhao, H., Shi, J., Deng, Z., Jia, J.: Segstereo: Exploiting semantic in- formation for disparity estimation. In: European Conference on Computer Vision (ECCV). pp. 660–676 (2018)

  32. [32]

    In: European Conference on Computer Vision (ECCV)

    Yang, N., Wang, R., St¨ uckler, J., Cremers, D.: Deep virtual stereo odometry: Lever- aging deep depth prediction for monocular direct sparse odometry. In: European Conference on Computer Vision (ECCV). pp. 835–852 (2018)

  33. [33]

    Pattern Recognition 83, 430 – 442 (2018)

    Zhang, Z., Xu, C., Yang, J., Tai, Y., Chen, L.: Deep hierarchical guidance and regularization learning for end-to-end depth estimation. Pattern Recognition 83, 430 – 442 (2018)