Structure-Aware Residual Pyramid Network for Monocular Depth Estimation

Xiaotian Chen; Xuejin Chen; Zheng-Jun Zha

arxiv: 1907.06023 · v1 · pith:QZH2T32Rnew · submitted 2019-07-13 · 💻 cs.CV

Pith reviewed 2026-05-24 22:05 UTC · model grok-4.3

classification 💻 cs.CV

keywords monocular depth estimation · residual pyramid decoder · multi-scale structures · NYU-Depth v2 · depth prediction · residual refinement · feature fusion · scene understanding

The pith

A residual pyramid decoder with adaptive feature fusion improves monocular depth estimation by capturing multi-scale scene structures.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes a network that estimates depth from single images by separating global scene layouts from local shape details across pyramid levels. Upper levels encode broad layouts while lower levels add object shapes through successive residual predictions that refine the previous estimate. An adaptive module fuses features from every scale to support structure inference at each level. This design targets the limitation that standard CNNs often overlook explicit multi-scale structure in complex scenes. The result is reported as higher accuracy on the NYU-Depth v2 benchmark for both visual quality and standard error metrics.

Core claim

The Structure-Aware Residual Pyramid Network expresses global scene structure in upper decoder levels to represent layouts and local structure in lower levels to present shape details; at each level Residual Refinement Modules predict residual maps that progressively add finer structures onto the coarser prediction from the level above; an Adaptive Dense Feature Fusion module adaptively combines effective features from all scales to infer structures at every scale.

What carries the argument

Residual Pyramid Decoder that represents global layouts at upper levels and local details at lower levels, refined by residual maps at each scale.
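
The coarse-to-fine idea can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: plain Python lists stand in for depth maps, each level is assumed to double the resolution, nearest-neighbour upsampling replaces whatever interpolation the paper uses, and the residuals are hand-written toy values rather than network outputs. The helper names `upsample2x`, `refine`, and `decode_pyramid` are hypothetical.

```python
def upsample2x(depth):
    """Nearest-neighbour 2x upsampling of a 2-D list (assumed stand-in for interpolation)."""
    out = []
    for row in depth:
        wide = [v for v in row for _ in range(2)]  # repeat each value horizontally
        out.append(wide)
        out.append(list(wide))                     # repeat each row vertically
    return out


def refine(coarse, residual):
    """One refinement step: upsample the coarser map, then add the residual map."""
    up = upsample2x(coarse)
    return [[u + r for u, r in zip(up_row, res_row)]
            for up_row, res_row in zip(up, residual)]


def decode_pyramid(coarsest, residuals):
    """Refine from the coarsest level down to the finest, keeping every level's output."""
    depth = coarsest
    outputs = [depth]
    for residual in residuals:
        depth = refine(depth, residual)
        outputs.append(depth)
    return outputs


# Toy values chosen only for illustration (not from the paper).
coarsest = [[2.0]]                      # 1x1 "global layout"
residuals = [
    [[0.5, -0.5], [0.0, 0.25]],         # 2x2 residual adds coarse shape
    [[0.0] * 4 for _ in range(4)],      # 4x4 zero residual leaves the map unchanged
]
pyramid = decode_pyramid(coarsest, residuals)
print(pyramid[1])  # [[2.5, 1.5], [2.0, 2.25]]
```

Because each level only adds a correction, a zero residual reproduces the upsampled coarser estimate, which is one way to read the claim that refinement does not overwrite coarser information.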

If this is right

Global layouts appear in upper pyramid levels while shape details are added in lower levels.
Residual predictions allow progressive refinement without overwriting coarser information.
Adaptive fusion across all scales supplies the features needed for structure inference at each level.
The full model reaches state-of-the-art quantitative and qualitative results on NYU-Depth v2.
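
To make "fusion across all scales" concrete, here is a rough sketch of one possible fusion scheme. It is an assumption for illustration only and should not be read as the paper's ADFF module: features from every scale are resized to the target resolution with nearest-neighbour sampling and combined with softmax weights, whereas the real module learns its fusion with convolutional layers. The names `resize_nearest`, `softmax`, `fuse`, and the `scores` argument are hypothetical.

```python
import math


def resize_nearest(feat, height, width):
    """Resize a 2-D list to (height, width) by nearest-neighbour sampling."""
    src_h, src_w = len(feat), len(feat[0])
    return [[feat[i * src_h // height][j * src_w // width] for j in range(width)]
            for i in range(height)]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse(features, scores, height, width):
    """Weighted sum of all scales after resizing them to one target resolution.

    `scores` are assumed per-scale importance values; in a trained network they
    would be produced by learned layers rather than passed in by hand.
    """
    weights = softmax(scores)
    resized = [resize_nearest(f, height, width) for f in features]
    return [[sum(w * r[i][j] for w, r in zip(weights, resized)) for j in range(width)]
            for i in range(height)]


# Toy single-channel features at two scales (illustrative values only).
features = [
    [[1.0]],                     # coarsest scale, 1x1
    [[0.0, 2.0], [4.0, 6.0]],    # finer scale, 2x2
]
fused = fuse(features, scores=[0.0, 0.0], height=2, width=2)
print(fused)  # equal scores give a plain average: [[0.5, 1.5], [2.5, 3.5]]
```

Calling `fuse` once per pyramid level with that level's resolution mirrors the idea that every level sees features from all scales.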

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same pyramid-plus-residual pattern could be tested on other dense prediction tasks such as surface normal estimation where multi-scale context also matters.
Evaluating the decoder on outdoor datasets would show whether the structure separation generalizes beyond indoor scenes.
Replacing the backbone with newer feature extractors while keeping the decoder and fusion module fixed would isolate how much the proposed components contribute.

Load-bearing premise

The combination of residual pyramid decoding, residual refinement modules, and adaptive dense feature fusion captures underlying multi-scale structures more effectively than prior CNN designs.

What would settle it

On the NYU-Depth v2 test set the proposed network produces higher root-mean-square error or lower delta-1 accuracy than the previous best published method under identical evaluation protocol.
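
For readers who want to run that check, the metrics named on this page can be computed as follows. This is a sketch using the standard definitions of RMSE, absolute relative error (Abs Rel), and threshold accuracy (delta-1, the fraction of pixels with max(pred/gt, gt/pred) < 1.25). It assumes flattened lists of strictly positive depths with invalid pixels already removed; the evaluation crop and masking used in the paper are not reproduced here. The function names are hypothetical.

```python
import math


def rmse(pred, gt):
    """Root-mean-square error between predicted and ground-truth depths."""
    return math.sqrt(sum((p - g) ** 2 for p, g in zip(pred, gt)) / len(gt))


def abs_rel(pred, gt):
    """Mean absolute error relative to the ground-truth depth."""
    return sum(abs(p - g) / g for p, g in zip(pred, gt)) / len(gt)


def delta_accuracy(pred, gt, threshold=1.25):
    """Fraction of pixels whose ratio max(pred/gt, gt/pred) is below the threshold."""
    hits = sum(1 for p, g in zip(pred, gt) if max(p / g, g / p) < threshold)
    return hits / len(gt)


# Toy depths in metres, for illustration only (not results from the paper).
gt = [1.0, 2.0, 4.0, 5.0]
pred = [1.0, 2.0, 4.0, 7.0]

print(rmse(pred, gt))            # 1.0
print(delta_accuracy(pred, gt))  # 0.75
print(round(abs_rel(pred, gt), 6))
```

Lower is better for RMSE and Abs Rel, higher is better for delta-1, so the settling condition above amounts to comparing these numbers under one shared protocol.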

Figures

Figures reproduced from arXiv: 1907.06023 by Xiaotian Chen, Xuejin Chen, Zheng-Jun Zha.

**Figure 2.** The network architecture. Our Structure-Aware Residual Pyramid Network consists of an encoder which extracts multi-scale visual …

**Figure 3.** A Residual Refinement Module (RRM) for the …

**Figure 4.** Comparison with [Jiao et al., 2018]. The depth maps pre…

**Figure 5.** Qualitative results on the NYUD2 dataset.

**Figure 6.** 3D projection from predicted depth maps. Our method …

Original abstract

Monocular depth estimation is an essential task for scene understanding. The underlying structure of objects and stuff in a complex scene is critical to recovering accurate and visually-pleasing depth maps. Global structure conveys scene layouts, while local structure reflects shape details. Recently developed approaches based on convolutional neural networks (CNNs) significantly improve the performance of depth estimation. However, few of them take into account multi-scale structures in complex scenes. In this paper, we propose a Structure-Aware Residual Pyramid Network (SARPN) to exploit multi-scale structures for accurate depth prediction. We propose a Residual Pyramid Decoder (RPD) which expresses global scene structure in upper levels to represent layouts, and local structure in lower levels to present shape details. At each level, we propose Residual Refinement Modules (RRM) that predict residual maps to progressively add finer structures on the coarser structure predicted at the upper level. In order to fully exploit multi-scale image features, an Adaptive Dense Feature Fusion (ADFF) module, which adaptively fuses effective features from all scales for inferring structures of each scale, is introduced. Experiment results on the challenging NYU-Depth v2 dataset demonstrate that our proposed approach achieves state-of-the-art performance in both qualitative and quantitative evaluation. The code is available at https://github.com/Xt-Chen/SARPN.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

SARPN is an incremental CNN tweak for monocular depth that reports SOTA on NYU v2 with public code.

The main thing to know is that this paper introduces SARPN, a CNN that adds a Residual Pyramid Decoder, Residual Refinement Modules, and Adaptive Dense Feature Fusion to better handle multi-scale scene structure, and it claims state-of-the-art numbers on the NYU-Depth v2 benchmark with code released on GitHub. The architecture extends standard pyramid and residual ideas by decoding from coarse global layouts down to fine local details and fusing features adaptively across scales. That combination is new enough to count as a distinct design, and the empirical results are presented with the usual tables and visuals. The code link is a real plus because it lets others check the numbers directly.

The central claim rests on those benchmark comparisons rather than any new theory or derivation, which is fine for this kind of work. The soft spot is that the novelty is mostly engineering: it recombines existing residual and pyramid blocks rather than deriving something from first principles or solving a previously intractable case. Gains on NYU v2 are likely small even if they top the current list, and without the full ablations it is hard to isolate how much each module actually moves the needle. Still, the stress-test note indicates the manuscript includes those comparisons and shows no internal contradictions or hidden dependencies.

This paper is for people already working on monocular depth or scene reconstruction who want to test a new decoder variant on standard data. It is solid enough on its own terms to deserve peer review rather than a desk reject.

Referee Report

0 major / 3 minor

Summary. The paper proposes the Structure-Aware Residual Pyramid Network (SARPN) for monocular depth estimation. It introduces a Residual Pyramid Decoder (RPD) to encode global scene layouts at upper levels and local shape details at lower levels, Residual Refinement Modules (RRM) that add progressive residual maps for finer structures, and an Adaptive Dense Feature Fusion (ADFF) module to fuse multi-scale image features. The central empirical claim is that this architecture achieves state-of-the-art performance on the NYU-Depth v2 benchmark in both quantitative metrics and qualitative results, with code released at https://github.com/Xt-Chen/SARPN.

Significance. If the reported gains are reproducible, the work advances monocular depth estimation by explicitly separating and refining multi-scale structural information, which is load-bearing for applications requiring accurate scene geometry. The explicit release of training code is a clear strength that enables direct verification of the NYU-Depth v2 tables and ablation studies.

minor comments (3)

[§4, Experiments] The quantitative tables should report the number of runs or standard deviations for the listed metrics (e.g., Abs Rel, RMSE) to allow readers to assess whether the claimed margins over prior methods are statistically stable.
[§3.3, Fig. 3] The description of the ADFF module would benefit from an explicit equation showing the adaptive weighting mechanism, rather than relying solely on the diagram.
[Related Work] A short paragraph contrasting SARPN with earlier pyramid networks (e.g., those using feature pyramids without residual refinement) would help readers locate the precise novelty.
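
The first comment asks for run-to-run variability. A minimal sketch of how such a summary could be produced is shown below; the three RMSE values are placeholders invented for the example and are not results from the paper, and `summarize_runs` is a hypothetical helper built on Python's `statistics` module.

```python
import statistics


def summarize_runs(values):
    """Return (mean, sample standard deviation) over independent training runs."""
    return statistics.mean(values), statistics.stdev(values)


# Placeholder RMSE values for three hypothetical runs; NOT results from the paper.
rmse_runs = [0.50, 0.52, 0.54]
mean, std = summarize_runs(rmse_runs)
print(f"RMSE {mean:.3f} ± {std:.3f}")
```

Reporting the margin over a prior method next to this spread lets a reader judge whether the gap exceeds ordinary seed-to-seed noise.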

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive assessment of our work on the Structure-Aware Residual Pyramid Network and the recommendation for minor revision. The report correctly summarizes the key components (RPD, RRM, ADFF) and notes the value of code release for reproducibility on NYU-Depth v2.

Circularity Check

0 steps flagged

No significant circularity

The paper is an empirical architecture proposal for monocular depth estimation. It defines SARPN via RPD (global-to-local pyramid), RRM (residual refinement), and ADFF (adaptive fusion), then reports SOTA numbers on NYU-Depth v2. No derivation chain, equations, or first-principles claims exist that reduce to self-definition, fitted inputs renamed as predictions, or load-bearing self-citations. The result is a standard CNN design validated by held-out test metrics and public code; the reader's weakest assumption is directly testable by the ablations shown.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 3 invented entities

The central claim rests on standard deep-learning assumptions plus three newly introduced modules whose effectiveness is asserted but not independently verified outside the paper.

free parameters (1)

Network weights
All convolutional parameters are fitted to the NYU-Depth v2 training split.

axioms (1)

Domain assumption: Convolutional neural networks can learn hierarchical multi-scale features from RGB images that correlate with scene depth.
Invoked implicitly when the network is expected to exploit global and local structures.

invented entities (3)

Residual Pyramid Decoder (no independent evidence)
purpose: Express global scene layouts at upper levels and local shape details at lower levels.
New module introduced by the paper.

Residual Refinement Modules (no independent evidence)
purpose: Predict residual maps to progressively refine coarser depth predictions.
New module introduced by the paper.

Adaptive Dense Feature Fusion (no independent evidence)
purpose: Adaptively fuse effective features from all scales for each level's structure inference.
New module introduced by the paper.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Efficient Test-Time Optimization for Depth Completion via Low-Rank Decoder Adaptation
cs.CV · 2026-03 · unverdicted · novelty 7.0

Low-rank decoder adaptation enables efficient test-time optimization for zero-shot depth completion by updating only the subspace containing depth-relevant information.

Reference graph

Works this paper leans on

22 extracted references · 22 canonical work pages · cited by 1 Pith paper

[1] [Dai et al., 2017] Angela Dai, Angel X Chang, Manolis Savva, Maciej Halber, Thomas Funkhouser, and Matthias Nießner. ScanNet: Richly-annotated 3D reconstructions of indoor scenes. In IEEE Conference on Computer Vision and Pattern Recognition, pages 5828–5839, 2017.

[2] [Deng et al., 2009] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database, 2009.

[3] [Eigen and Fergus, 2015] David Eigen and Rob Fergus. Predicting depth, surface normals and semantic labels with a common multi-scale convolutional architecture. In IEEE International Conference on Computer Vision, pages 2650–2658, 2015.

[4] [Eigen et al., 2014] David Eigen, Christian Puhrsch, and Rob Fergus. Depth map prediction from a single image using a multi-scale deep network. In Advances in Neural Information Processing Systems, pages 2366–2374, 2014.

[5] [Fu et al., 2018] Huan Fu, Mingming Gong, Chaohui Wang, Kayhan Batmanghelich, and Dacheng Tao. Deep ordinal regression network for monocular depth estimation. In IEEE Conference on Computer Vision and Pattern Recognition, pages 2002–2011, 2018.

[6] [Hao et al., 2018] Zhixiang Hao, Yu Li, Shaodi You, and Feng Lu. Detail preserving depth estimation from a single image using attention guided networks. In 3DV, pages 304–313. IEEE, 2018.

[7] [Hu et al., 2018] Jie Hu, Li Shen, and Gang Sun. Squeeze-and-excitation networks. In IEEE Conference on Computer Vision and Pattern Recognition, pages 7132–7141, 2018.

[8] [Hu et al., 2019] Junjie Hu, Mete Ozay, Yan Zhang, and Takayuki Okatani. Revisiting single image depth estimation: Toward higher resolution maps with accurate object boundaries. In IEEE Winter Conference on Applications of Computer Vision, 2019.

[9] [Jiao et al., 2018] Jianbo Jiao, Ying Cao, Yibing Song, and Rynson Lau. Look deeper into depth: Monocular depth estimation with semantic booster and attention-driven loss. In European Conference on Computer Vision, pages 53–69, 2018.

[10] [Ladicky et al., 2014] Lubor Ladicky, Jianbo Shi, and Marc Pollefeys. Pulling things out of perspective. In IEEE Conference on Computer Vision and Pattern Recognition, pages 89–96, 2014.

[11] [Laina et al., 2016] Iro Laina, Christian Rupprecht, Vasileios Belagiannis, Federico Tombari, and Nassir Navab. Deeper depth prediction with fully convolutional residual networks. In 3DV, pages 239–248. IEEE, 2016.

[12] [Li et al., 2015] Bo Li, Chunhua Shen, Yuchao Dai, Anton Van Den Hengel, and Mingyi He. Depth and surface normal estimation from monocular images using regression on deep features and hierarchical crfs. In IEEE Conference on Computer Vision and Pattern Recognition, pages 1119–1127, 2015.

[13] [Long et al., 2015] Jonathan Long, Evan Shelhamer, and Trevor Darrell. Fully convolutional networks for semantic segmentation. In IEEE Conference on Computer Vision and Pattern Recognition, pages 3431–3440, 2015.

[14] [Park et al., 2017] Seong-Jin Park, Ki-Sang Hong, and Seungyong Lee. RDFNet: RGB-D multi-level residual feature fusion for indoor semantic segmentation. In IEEE International Conference on Computer Vision, pages 4980–4989, 2017.

[15] [Paszke et al., 2017] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in pytorch, 2017.

[16] [Qi et al., 2018] Xiaojuan Qi, Renjie Liao, Zhengzhe Liu, Raquel Urtasun, and Jiaya Jia. Geonet: Geometric neural network for joint depth and surface normal estimation. In IEEE Conference on Computer Vision and Pattern Recognition, pages 283–291, 2018.

[17] [Qian et al., 2014] Chen Qian, Xiao Sun, Yichen Wei, Xiaoou Tang, and Jian Sun. Realtime and robust hand tracking from depth. In IEEE Conference on Computer Vision and Pattern Recognition, pages 1106–1113, 2014.

[18] [Silberman et al., 2012] Nathan Silberman, Derek Hoiem, Pushmeet Kohli, and Rob Fergus. Indoor segmentation and support inference from rgb-d images. In European Conference on Computer Vision, pages 746–760, 2012.

[19] [Song et al., 2015] Shuran Song, Samuel P Lichtenberg, and Jianxiong Xiao. Sun rgb-d: A rgb-d scene understanding benchmark suite. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 567–576, 2015.

[20] [Xu et al., 2017] Dan Xu, Elisa Ricci, Wanli Ouyang, Xiaogang Wang, and Nicu Sebe. Multi-scale continuous CRFs as sequential deep networks for monocular depth estimation. In IEEE Conference on Computer Vision and Pattern Recognition, pages 5354–5362, 2017.

[21] [Zhang et al., 2018] Zhenyu Zhang, Zhen Cui, Chunyan Xu, Zequn Jie, Xiang Li, and Jian Yang. Joint task-recursive learning for semantic segmentation and depth estimation. In European Conference on Computer Vision, 2018.

[22] [Zheng et al., 2018] Kecheng Zheng, Zheng-Jun Zha, Yang Cao, Xuejin Chen, and Feng Wu. LA-Net: Layout-aware dense network for monocular depth estimation. In ACM Multimedia Conference on Multimedia Conference, pages 1381–1388, 2018.
