SDNet: Semantically Guided Depth Estimation Network

Adrian Kretz; Matthias Ochs; Rudolf Mester

arxiv: 1907.10659 · v1 · pith:GOWB65BBnew · submitted 2019-07-24 · 💻 cs.CV

SDNet: Semantically Guided Depth Estimation Network

Matthias Ochs , Adrian Kretz , Rudolf Mester This is my paper

Pith reviewed 2026-05-24 16:49 UTC · model grok-4.3

classification 💻 cs.CV

keywords depth estimationsemantic segmentationmonocular visionjoint trainingCNNordinal classificationscene understanding

0 comments

The pith

A single CNN jointly trained on depth and semantics outperforms separate networks on both tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a network that predicts pixel-wise depth and semantic labels at the same time from a single monocular image. Joint training produces better accuracy on each task and lower overall computation than running two independent CNNs. The authors further claim the shared model learns more meaningful, semantically richer internal features. Depth prediction is handled through ordinal classification instead of direct regression. These changes together yield state-of-the-art numbers on two challenging datasets.

Core claim

A single CNN trained simultaneously for semantic segmentation and depth estimation via ordinal classification learns richer features than independent models, delivers higher accuracy on both outputs, and requires less computation.

What carries the argument

Joint training of one CNN for both semantic segmentation and ordinal depth classification, where semantic supervision guides the shared features used for depth.

If this is right

Joint training reduces total computational cost compared with running separate depth and semantic networks.
The network extracts more semantically meaningful features when both tasks are learned together.
Ordinal classification for depth improves estimation quality over standard regression.
The combined model reaches state-of-the-art performance on semantic segmentation and monocular depth on two datasets.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Perception stacks for autonomous vehicles could run a single model instead of two, lowering latency and power draw.
The same joint-training idea may transfer to other paired vision tasks such as instance segmentation or optical flow.
If richer features generalize, the network could serve as a stronger backbone for additional downstream tasks without extra supervision.

Load-bearing premise

The measured gains in accuracy and feature quality result from training the two tasks together rather than from differences in network size, training schedule, or other design choices.

What would settle it

Train an identical architecture and schedule once jointly and once as two separate networks, then compare both final task metrics and a direct measure of feature semantic content such as linear probe accuracy on held-out labels.

Figures

Figures reproduced from arXiv: 1907.10659 by Adrian Kretz, Matthias Ochs, Rudolf Mester.

**Figure 2.** Figure 2: Encoder-decoder architecture of SDNet with ASPP module. The decoder [PITH_FULL_IMAGE:figures/full_fig_p003_2.png] view at source ↗

**Figure 3.** Figure 3: The left plot shows the exponential and linear discretization of the depths [PITH_FULL_IMAGE:figures/full_fig_p004_3.png] view at source ↗

**Figure 4.** Figure 4: Qualitative comparison with other state-of-the-art methods on KITTI [PITH_FULL_IMAGE:figures/full_fig_p009_4.png] view at source ↗

**Figure 5.** Figure 5: D1 error of the SDNet depth estimate from the ground truth on three [PITH_FULL_IMAGE:figures/full_fig_p010_5.png] view at source ↗

**Figure 6.** Figure 6: Qualitative results of semantic segmentation and depth estimation using [PITH_FULL_IMAGE:figures/full_fig_p011_6.png] view at source ↗

**Figure 7.** Figure 7: Qualitative results of the semantic segmentation and the depth estimation [PITH_FULL_IMAGE:figures/full_fig_p012_7.png] view at source ↗

read the original abstract

Autonomous vehicles and robots require a full scene understanding of the environment to interact with it. Such a perception typically incorporates pixel-wise knowledge of the depths and semantic labels for each image from a video sensor. Recent learning-based methods estimate both types of information independently using two separate CNNs. In this paper, we propose a model that is able to predict both outputs simultaneously, which leads to improved results and even reduced computational costs compared to independent estimation of depth and semantics. We also empirically prove that the CNN is capable of learning more meaningful and semantically richer features. Furthermore, our SDNet estimates the depth based on ordinal classification. On the basis of these two enhancements, our proposed method achieves state-of-the-art results in semantic segmentation and depth estimation from single monocular input images on two challenging datasets.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Joint depth-semantics training is a reasonable extension but the attribution to semantic guidance still needs matched baselines to hold up.

read the letter

The paper puts forward SDNet, a shared-backbone network that does depth via ordinal classification and semantics at the same time. It claims this produces SOTA numbers on two datasets plus richer features and lower compute than running two separate models. The ordinal formulation for depth is a straightforward and sensible choice that avoids some regression pitfalls. The architecture itself is described clearly enough that someone could implement the joint setup without much trouble. That is the main concrete contribution. The soft spot is the causal claim that joint training is what drives the better features and results. The abstract gives no numbers, no error bars, and no sign of ablations that hold model capacity, training schedule, and data augmentation fixed between the joint model and independent baselines. Without those controls the gains could come from extra parameters in the shared trunk or from longer optimization rather than from semantic guidance. If the full paper contains those matched comparisons then the central result strengthens; if not, the main empirical assertion stays under-supported. The rest of the work follows standard multi-task patterns in the literature without obvious citation gaps. This is the kind of paper that matters to people building real-time scene understanding stacks for vehicles or robots who want one network instead of two. A reader who needs the architecture details and is willing to check the tables themselves can get practical value. It is solid enough on its own terms to go to referees so they can verify the experimental controls and the actual quantitative claims.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces SDNet, a CNN that jointly predicts semantic segmentation labels and monocular depth (via ordinal classification) from a single image. It claims that simultaneous training produces SOTA results on two datasets, reduces computational cost relative to separate networks, and yields semantically richer features in the shared backbone.

Significance. If the empirical attribution to joint training holds after proper controls, the result would strengthen the case for multi-task architectures in scene understanding for robotics. The ordinal-depth formulation and the reduced-cost claim are concrete strengths that could be cited by follow-up work.

major comments (2)

[§4] §4 (Experiments) and associated tables: no ablation is described that compares the joint SDNet against independent depth and semantics networks trained with identical backbone capacity, identical epoch count, and identical data augmentation; without these matched baselines the reported gains cannot be attributed to semantic guidance rather than extra parameters or longer optimization.
[§4] §4 and abstract: the claim of 'richer and more meaningful features' is asserted without quantitative support such as feature-visualization metrics, transfer-learning probes, or t-SNE analysis that would distinguish the joint-training effect from architecture size.

minor comments (2)

[§3] Notation for the ordinal depth bins and the cross-entropy loss weighting between the two tasks should be defined once in §3 before being used in the experimental tables.
[Figure 2] Figure 2 (network diagram) would benefit from explicit annotation of the shared backbone versus task-specific heads to clarify parameter sharing.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We appreciate the referee's detailed feedback on our manuscript. We address the major comments below and will incorporate revisions to strengthen the empirical claims.

read point-by-point responses

Referee: [§4] §4 (Experiments) and associated tables: no ablation is described that compares the joint SDNet against independent depth and semantics networks trained with identical backbone capacity, identical epoch count, and identical data augmentation; without these matched baselines the reported gains cannot be attributed to semantic guidance rather than extra parameters or longer optimization.

Authors: We agree with the referee that a controlled comparison with matched independent networks is necessary to attribute the improvements specifically to the joint training and semantic guidance. The current manuscript compares against separate networks but does not ensure identical training conditions in all aspects. In the revised version, we will add such ablations with identical backbone, epoch count, and data augmentation to better isolate the effect of joint training. revision: yes
Referee: [§4] §4 and abstract: the claim of 'richer and more meaningful features' is asserted without quantitative support such as feature-visualization metrics, transfer-learning probes, or t-SNE analysis that would distinguish the joint-training effect from architecture size.

Authors: We acknowledge that the assertion of semantically richer features in the shared backbone would be more convincing with additional quantitative evidence. While the manuscript demonstrates improved performance, it does not include the suggested analyses. We will include t-SNE visualizations of features from joint vs. independent training and possibly transfer learning probes in the revised manuscript to provide quantitative support for this claim. revision: yes

Circularity Check

0 steps flagged

No circularity in derivation chain

full rationale

The paper presents an empirical CNN architecture for joint monocular depth and semantic segmentation, with claims resting on experimental SOTA results and an assertion of richer learned features. No equations, first-principles derivations, fitted parameters renamed as predictions, or self-citation chains appear in the provided text. The ordinal depth formulation and joint-training benefit are presented as design choices validated by data, not as quantities that reduce to their own inputs by construction. The derivation chain is therefore self-contained as an engineering contribution.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Review performed on abstract only; no explicit free parameters, axioms, or invented entities are stated. The work implicitly relies on standard supervised CNN training assumptions (availability of labeled depth and semantic data, convergence of joint optimization) that are not enumerated.

pith-pipeline@v0.9.0 · 5658 in / 1055 out tokens · 22087 ms · 2026-05-24T16:49:00.807220+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

33 extracted references · 33 canonical work pages · 1 internal anchor

[1]

International Journal of Computer Vision (IJCV) 126(9), 961–972 (2018)

Alhaija, H., Mustikovela, S., Mescheder, L., Geiger, A., Rother, C.: Augmented reality meets computer vision: Eﬃcient data generation for urban driving scenes. International Journal of Computer Vision (IJCV) 126(9), 961–972 (2018)

work page 2018
[2]

In: European Conference on Computer Vision (ECCV)

Chen, L.C., Zhu, Y., Papandreou, G., Schroﬀ, F., Adam, H.: Encoder-decoder with atrous separable convolution for semantic image segmentation. In: European Conference on Computer Vision (ECCV). pp. 833–851 (2018)

work page 2018
[3]

In: International Joint Conference on Neural Networks

Cheng, J., Wang, Z., Pollastri, G.: A neural network approach to ordinal regression. In: International Joint Conference on Neural Networks. pp. 1279–1284 (2008)

work page 2008
[4]

In: Conference on Computer Vision and Pattern Recogni- tion (CVPR)

Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: Conference on Computer Vision and Pattern Recogni- tion (CVPR). pp. 3213–3223 (2016)

work page 2016
[5]

In: Advances in Neural Information Processing Systems (NIPS)

Eigen, D., Puhrsch, C., Fergus, R.: Depth map prediction from a single image using a multi-scale deep network. In: Advances in Neural Information Processing Systems (NIPS). pp. 2366–2374 (2014)

work page 2014
[6]

In: Conference on Computer Vision and Pattern Recognition (CVPR)

Fu, H., Gong, M., Wang, C., Batmanghelich, K., Tao, D.: Deep ordinal regression network for monocular depth estimation. In: Conference on Computer Vision and Pattern Recognition (CVPR). pp. 2002–2011 (2018)

work page 2002
[7]

In: European Conference on Computer Vision (ECCV)

Garg, R., Kumar BG, V., Carneiro, G., Reid, I.: Unsupervised CNN for single view depth estimation: Geometry to the rescue. In: European Conference on Computer Vision (ECCV). pp. 740–756 (2016)

work page 2016
[8]

In: Conference on Computer Vision and Pattern Recognition (CVPR)

Gidaris, S., Komodakis, N.: Detect, replace, reﬁne: Deep structured prediction for pixel wise labeling. In: Conference on Computer Vision and Pattern Recognition (CVPR). pp. 7187–7196 (2017)

work page 2017
[9]

In: Conference on Computer Vision and Pattern Recognition (CVPR)

Godard, C., Mac Aodha, O., Brostow, G.J.: Unsupervised monocular depth esti- mation with left-right consistency. In: Conference on Computer Vision and Pattern Recognition (CVPR). pp. 6602–6611 (2017)

work page 2017
[10]

In: European Conference on Computer Vision (ECCV)

Guo, X., Li, H., Yi, S., Ren, J., Wang, X.: Learning monocular depth by distill- ing cross-domain stereo networks. In: European Conference on Computer Vision (ECCV). pp. 484–500 (2018)

work page 2018
[11]

IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI) 30(2), 328–341 (2008)

Hirschmller, H.: Stereo processing by semiglobal matching and mutual information. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI) 30(2), 328–341 (2008)

work page 2008
[12]

In: Conference on Computer Vision and Pattern Recognition (CVPR)

Knbelreiter, P., Reinbacher, C., Shekhovtsov, A., Pock, T.: End-to-end training of hybrid CNN-CRF models for stereo. In: Conference on Computer Vision and Pattern Recognition (CVPR). pp. 1456–1465 (2017)

work page 2017
[13]

In: Winter Conference on Applications of Computer Vision (WACV) (2019)

Kong, S., Fowlkes, C.: Pixel-wise attentional gating for parsimonious pixel labeling. In: Winter Conference on Applications of Computer Vision (WACV) (2019)

work page 2019
[14]

In: Conference on Computer Vision and Pattern Recognition (CVPR)

Kuznietsov, Y., Stckler, J., Leibe, B.: Semi-supervised deep learning for monocular depth map prediction. In: Conference on Computer Vision and Pattern Recognition (CVPR). pp. 2215–2223 (2017)

work page 2017
[15]

In: Conference on Computer Vision and Pattern Recognition (CVPR)

Ladick, L., Shi, J., Pollefeys, M.: Pulling things out of perspective. In: Conference on Computer Vision and Pattern Recognition (CVPR). pp. 89–96 (2014)

work page 2014
[16]

Pattern Recognition 83, 328 – 339 (2018)

Li, B., Dai, Y., He, M.: Monocular depth estimation with hierarchical fusion of dilated cnns and soft-weighted-sum inference. Pattern Recognition 83, 328 – 339 (2018)

work page 2018
[17]

Deep attention-based classification network for robust depth prediction

Li, R., Xian, K., Shen, C., Cao, Z., Lu, H., Hang, L.: Deep attention-based clas- siﬁcation network for robust depth prediction. CoRR arXiv, 1807.03959 [cs.CV] (2018) 14 M. Ochs, A. Kretz, and R. Mester

work page internal anchor Pith review Pith/arXiv arXiv 2018
[18]

IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI) (2018)

Lin, T., Goyal, P., Girshick, R., He, K., Dollar, P.: Focal loss for dense object detec- tion. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI) (2018)

work page 2018
[19]

IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI) 38(10), 2024–2039 (2016)

Liu, F., Shen, C., Lin, G., Reid, I.: Learning depth from single monocular images using deep convolutional neural ﬁelds. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI) 38(10), 2024–2039 (2016)

work page 2024
[20]

In: Conference on Computer Vision and Pattern Recognition (CVPR)

Liu, M., Salzmann, M., He, X.: Discrete-continuous depth estimation from a single image. In: Conference on Computer Vision and Pattern Recognition (CVPR). pp. 716–723 (2014)

work page 2014
[21]

In: Conference on Computer Vision and Pattern Recognition (CVPR)

Mayer, N., Ilg, E., Husser, P., Fischer, P., Cremers, D., Dosovitskiy, A., Brox, T.: A large dataset to train convolutional networks for disparity, optical ﬂow, and scene ﬂow estimation. In: Conference on Computer Vision and Pattern Recognition (CVPR). pp. 4040–4048 (2016)

work page 2016
[22]

In: Conference on Computer Vision and Pattern Recognition (CVPR)

Niu, Z., Zhou, M., Wang, L., Gao, X., Hua, G.: Ordinal regression with multiple output cnn for age estimation. In: Conference on Computer Vision and Pattern Recognition (CVPR). pp. 4920–4928 (2016)

work page 2016
[23]

Ren, J., Yang, C., Yan, Q.: Cascade residual learning: A two-stage convolutional neural network for stereo matching

Pang, J., Sun, W., SJ. Ren, J., Yang, C., Yan, Q.: Cascade residual learning: A two-stage convolutional neural network for stereo matching. In: International Conference on Computer Vision (ICCV) - Workshop. pp. 887–895 (2017)

work page 2017
[24]

In: Advances in Neural Information Processing Systems (NIPS)

Saxena, A., Chung, S.H., Ng, A.Y.: Learning depth from single monocular images. In: Advances in Neural Information Processing Systems (NIPS). pp. 1161 – 1168 (2005)

work page 2005
[25]

IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI) 31(5), 824–840 (2009)

Saxena, A., Sun, M., Ng, A.Y.: Make3D: Learning 3D scene structure from a single still image. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI) 31(5), 824–840 (2009)

work page 2009
[26]

In: Intelligent Vehicles Symposium (IV)

Schneider, L., Cordts, M., Rehfeld, T., Pfeiﬀer, D., Enzweiler, M., Franke, U., Polle- feys, M., Roth, S.: Semantic stixels: Depth is not enough. In: Intelligent Vehicles Symposium (IV). pp. 110–117 (2016)

work page 2016
[27]

In: International Conference on 3D Vision (3DV) (2017)

Uhrig, J., Schneider, N., Schneider, L., Franke, U., Brox, T., Geiger, A.: Sparsity invariant CNNs. In: International Conference on 3D Vision (3DV) (2017)

work page 2017
[28]

Journal of Machine Learning Research (JMLR) 17, 2287–2318 (2016)

ˇZbontar, J., LeCun, Y.: Stereo matching by training a convolutional neural network to compare image patches. Journal of Machine Learning Research (JMLR) 17, 2287–2318 (2016)

work page 2016
[29]

In: European Conference on Computer Vi- sion (ECCV)

Wu, Y., He, K.: Group normalization. In: European Conference on Computer Vi- sion (ECCV). pp. 3–19 (2018)

work page 2018
[30]

In: European Conference on Com- puter Vision (ECCV)

Xie, J., Girshick, R., Farhadi, A.: Deep3D: fully automatic 2D-to-3D video conver- sion with deep convolutional neural networks. In: European Conference on Com- puter Vision (ECCV). pp. 842–857 (2016)

work page 2016
[31]

In: European Conference on Computer Vision (ECCV)

Yang, G., Zhao, H., Shi, J., Deng, Z., Jia, J.: Segstereo: Exploiting semantic in- formation for disparity estimation. In: European Conference on Computer Vision (ECCV). pp. 660–676 (2018)

work page 2018
[32]

In: European Conference on Computer Vision (ECCV)

Yang, N., Wang, R., St¨ uckler, J., Cremers, D.: Deep virtual stereo odometry: Lever- aging deep depth prediction for monocular direct sparse odometry. In: European Conference on Computer Vision (ECCV). pp. 835–852 (2018)

work page 2018
[33]

Pattern Recognition 83, 430 – 442 (2018)

Zhang, Z., Xu, C., Yang, J., Tai, Y., Chen, L.: Deep hierarchical guidance and regularization learning for end-to-end depth estimation. Pattern Recognition 83, 430 – 442 (2018)

work page 2018

[1] [1]

International Journal of Computer Vision (IJCV) 126(9), 961–972 (2018)

Alhaija, H., Mustikovela, S., Mescheder, L., Geiger, A., Rother, C.: Augmented reality meets computer vision: Eﬃcient data generation for urban driving scenes. International Journal of Computer Vision (IJCV) 126(9), 961–972 (2018)

work page 2018

[2] [2]

In: European Conference on Computer Vision (ECCV)

Chen, L.C., Zhu, Y., Papandreou, G., Schroﬀ, F., Adam, H.: Encoder-decoder with atrous separable convolution for semantic image segmentation. In: European Conference on Computer Vision (ECCV). pp. 833–851 (2018)

work page 2018

[3] [3]

In: International Joint Conference on Neural Networks

Cheng, J., Wang, Z., Pollastri, G.: A neural network approach to ordinal regression. In: International Joint Conference on Neural Networks. pp. 1279–1284 (2008)

work page 2008

[4] [4]

In: Conference on Computer Vision and Pattern Recogni- tion (CVPR)

Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: Conference on Computer Vision and Pattern Recogni- tion (CVPR). pp. 3213–3223 (2016)

work page 2016

[5] [5]

In: Advances in Neural Information Processing Systems (NIPS)

Eigen, D., Puhrsch, C., Fergus, R.: Depth map prediction from a single image using a multi-scale deep network. In: Advances in Neural Information Processing Systems (NIPS). pp. 2366–2374 (2014)

work page 2014

[6] [6]

In: Conference on Computer Vision and Pattern Recognition (CVPR)

Fu, H., Gong, M., Wang, C., Batmanghelich, K., Tao, D.: Deep ordinal regression network for monocular depth estimation. In: Conference on Computer Vision and Pattern Recognition (CVPR). pp. 2002–2011 (2018)

work page 2002

[7] [7]

In: European Conference on Computer Vision (ECCV)

Garg, R., Kumar BG, V., Carneiro, G., Reid, I.: Unsupervised CNN for single view depth estimation: Geometry to the rescue. In: European Conference on Computer Vision (ECCV). pp. 740–756 (2016)

work page 2016

[8] [8]

In: Conference on Computer Vision and Pattern Recognition (CVPR)

Gidaris, S., Komodakis, N.: Detect, replace, reﬁne: Deep structured prediction for pixel wise labeling. In: Conference on Computer Vision and Pattern Recognition (CVPR). pp. 7187–7196 (2017)

work page 2017

[9] [9]

In: Conference on Computer Vision and Pattern Recognition (CVPR)

Godard, C., Mac Aodha, O., Brostow, G.J.: Unsupervised monocular depth esti- mation with left-right consistency. In: Conference on Computer Vision and Pattern Recognition (CVPR). pp. 6602–6611 (2017)

work page 2017

[10] [10]

In: European Conference on Computer Vision (ECCV)

Guo, X., Li, H., Yi, S., Ren, J., Wang, X.: Learning monocular depth by distill- ing cross-domain stereo networks. In: European Conference on Computer Vision (ECCV). pp. 484–500 (2018)

work page 2018

[11] [11]

IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI) 30(2), 328–341 (2008)

Hirschmller, H.: Stereo processing by semiglobal matching and mutual information. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI) 30(2), 328–341 (2008)

work page 2008

[12] [12]

In: Conference on Computer Vision and Pattern Recognition (CVPR)

Knbelreiter, P., Reinbacher, C., Shekhovtsov, A., Pock, T.: End-to-end training of hybrid CNN-CRF models for stereo. In: Conference on Computer Vision and Pattern Recognition (CVPR). pp. 1456–1465 (2017)

work page 2017

[13] [13]

In: Winter Conference on Applications of Computer Vision (WACV) (2019)

Kong, S., Fowlkes, C.: Pixel-wise attentional gating for parsimonious pixel labeling. In: Winter Conference on Applications of Computer Vision (WACV) (2019)

work page 2019

[14] [14]

In: Conference on Computer Vision and Pattern Recognition (CVPR)

Kuznietsov, Y., Stckler, J., Leibe, B.: Semi-supervised deep learning for monocular depth map prediction. In: Conference on Computer Vision and Pattern Recognition (CVPR). pp. 2215–2223 (2017)

work page 2017

[15] [15]

In: Conference on Computer Vision and Pattern Recognition (CVPR)

Ladick, L., Shi, J., Pollefeys, M.: Pulling things out of perspective. In: Conference on Computer Vision and Pattern Recognition (CVPR). pp. 89–96 (2014)

work page 2014

[16] [16]

Pattern Recognition 83, 328 – 339 (2018)

Li, B., Dai, Y., He, M.: Monocular depth estimation with hierarchical fusion of dilated cnns and soft-weighted-sum inference. Pattern Recognition 83, 328 – 339 (2018)

work page 2018

[17] [17]

Deep attention-based classification network for robust depth prediction

Li, R., Xian, K., Shen, C., Cao, Z., Lu, H., Hang, L.: Deep attention-based clas- siﬁcation network for robust depth prediction. CoRR arXiv, 1807.03959 [cs.CV] (2018) 14 M. Ochs, A. Kretz, and R. Mester

work page internal anchor Pith review Pith/arXiv arXiv 2018

[18] [18]

IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI) (2018)

Lin, T., Goyal, P., Girshick, R., He, K., Dollar, P.: Focal loss for dense object detec- tion. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI) (2018)

work page 2018

[19] [19]

IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI) 38(10), 2024–2039 (2016)

Liu, F., Shen, C., Lin, G., Reid, I.: Learning depth from single monocular images using deep convolutional neural ﬁelds. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI) 38(10), 2024–2039 (2016)

work page 2024

[20] [20]

In: Conference on Computer Vision and Pattern Recognition (CVPR)

Liu, M., Salzmann, M., He, X.: Discrete-continuous depth estimation from a single image. In: Conference on Computer Vision and Pattern Recognition (CVPR). pp. 716–723 (2014)

work page 2014

[21] [21]

In: Conference on Computer Vision and Pattern Recognition (CVPR)

Mayer, N., Ilg, E., Husser, P., Fischer, P., Cremers, D., Dosovitskiy, A., Brox, T.: A large dataset to train convolutional networks for disparity, optical ﬂow, and scene ﬂow estimation. In: Conference on Computer Vision and Pattern Recognition (CVPR). pp. 4040–4048 (2016)

work page 2016

[22] [22]

In: Conference on Computer Vision and Pattern Recognition (CVPR)

Niu, Z., Zhou, M., Wang, L., Gao, X., Hua, G.: Ordinal regression with multiple output cnn for age estimation. In: Conference on Computer Vision and Pattern Recognition (CVPR). pp. 4920–4928 (2016)

work page 2016

[23] [23]

Ren, J., Yang, C., Yan, Q.: Cascade residual learning: A two-stage convolutional neural network for stereo matching

Pang, J., Sun, W., SJ. Ren, J., Yang, C., Yan, Q.: Cascade residual learning: A two-stage convolutional neural network for stereo matching. In: International Conference on Computer Vision (ICCV) - Workshop. pp. 887–895 (2017)

work page 2017

[24] [24]

In: Advances in Neural Information Processing Systems (NIPS)

Saxena, A., Chung, S.H., Ng, A.Y.: Learning depth from single monocular images. In: Advances in Neural Information Processing Systems (NIPS). pp. 1161 – 1168 (2005)

work page 2005

[25] [25]

IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI) 31(5), 824–840 (2009)

Saxena, A., Sun, M., Ng, A.Y.: Make3D: Learning 3D scene structure from a single still image. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI) 31(5), 824–840 (2009)

work page 2009

[26] [26]

In: Intelligent Vehicles Symposium (IV)

Schneider, L., Cordts, M., Rehfeld, T., Pfeiﬀer, D., Enzweiler, M., Franke, U., Polle- feys, M., Roth, S.: Semantic stixels: Depth is not enough. In: Intelligent Vehicles Symposium (IV). pp. 110–117 (2016)

work page 2016

[27] [27]

In: International Conference on 3D Vision (3DV) (2017)

Uhrig, J., Schneider, N., Schneider, L., Franke, U., Brox, T., Geiger, A.: Sparsity invariant CNNs. In: International Conference on 3D Vision (3DV) (2017)

work page 2017

[28] [28]

Journal of Machine Learning Research (JMLR) 17, 2287–2318 (2016)

ˇZbontar, J., LeCun, Y.: Stereo matching by training a convolutional neural network to compare image patches. Journal of Machine Learning Research (JMLR) 17, 2287–2318 (2016)

work page 2016

[29] [29]

In: European Conference on Computer Vi- sion (ECCV)

Wu, Y., He, K.: Group normalization. In: European Conference on Computer Vi- sion (ECCV). pp. 3–19 (2018)

work page 2018

[30] [30]

In: European Conference on Com- puter Vision (ECCV)

Xie, J., Girshick, R., Farhadi, A.: Deep3D: fully automatic 2D-to-3D video conver- sion with deep convolutional neural networks. In: European Conference on Com- puter Vision (ECCV). pp. 842–857 (2016)

work page 2016

[31] [31]

In: European Conference on Computer Vision (ECCV)

Yang, G., Zhao, H., Shi, J., Deng, Z., Jia, J.: Segstereo: Exploiting semantic in- formation for disparity estimation. In: European Conference on Computer Vision (ECCV). pp. 660–676 (2018)

work page 2018

[32] [32]

In: European Conference on Computer Vision (ECCV)

Yang, N., Wang, R., St¨ uckler, J., Cremers, D.: Deep virtual stereo odometry: Lever- aging deep depth prediction for monocular direct sparse odometry. In: European Conference on Computer Vision (ECCV). pp. 835–852 (2018)

work page 2018

[33] [33]

Pattern Recognition 83, 430 – 442 (2018)

Zhang, Z., Xu, C., Yang, J., Tai, Y., Chen, L.: Deep hierarchical guidance and regularization learning for end-to-end depth estimation. Pattern Recognition 83, 430 – 442 (2018)

work page 2018