SemAttNet: Towards Attention-based Semantic Aware Guided Depth Completion

Danish Nazir; Didier Stricker; Marcus Liwicki; Muhammad Zeshan Afzal

arxiv: 2204.13635 · v2 · submitted 2022-04-28 · 💻 cs.CV

Pith reviewed 2026-05-24 11:38 UTC · model grok-4.3

classification 💻 cs.CV

keywords depth completion · semantic segmentation · attention-based fusion · multi-modal guidance · KITTI benchmark · guided depth estimation · CSPN++ refinement

The pith

A three-branch network with semantic guidance and attention fusion completes sparse depth maps more accurately than color-only approaches.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes a three-branch backbone with color-guided, semantic-guided, and depth-guided branches that feed one another in sequence rather than running independently. The color branch produces a dense depth map informed by color cues such as object boundaries in the RGB image; the semantic branch takes that output plus a semantic segmentation map and the sparse depth to add scene-level understanding; the depth branch combines the sparse depth with the color and semantic depths. A semantic-aware multi-modal attention-based fusion block (SAMMAFB) fuses features between the branches, the three depth outputs are adaptively fused, and CSPN++ with atrous convolutions refines the result. The central claim is that semantic input supplies the scene understanding missing from RGB guidance alone, especially when illumination changes create shadows. If correct, the method would yield more reliable dense depth maps on benchmarks that include real-world lighting variation.

Core claim

The central claim is that routing sparse depth, RGB, and a semantic image through a three-branch backbone (color-guided, semantic-guided, depth-guided), fusing features between the branches with the semantic-aware multi-modal attention-based fusion block (SAMMAFB), adaptively fusing the resulting color depth, semantic depth, and guided depth, and then refining with CSPN++ and atrous convolutions produces a dense depth map that achieves state-of-the-art performance on the KITTI depth completion benchmark at the time of submission.

What carries the argument

The three-branch backbone (color-guided, semantic-guided, depth-guided), whose features are fused across branches by the semantic-aware multi-modal attention-based fusion block (SAMMAFB) and whose three depth outputs are adaptively fused.
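
The source text says only that the three depth maps are "adaptively fused" and does not give the rule. The following is a minimal sketch of one plausible reading, assuming each branch also emits a per-pixel confidence map (an assumption, not something stated above) and that fusion is a softmax-weighted sum of the three depths. `fuse_depths` and the `Grid` alias are hypothetical names used for illustration only.

```python
import math
from typing import List, Sequence

Grid = List[List[float]]  # H x W map stored as nested lists (illustrative type)


def fuse_depths(depths: Sequence[Grid], confidences: Sequence[Grid]) -> Grid:
    """Per-pixel softmax over confidences, then a weighted sum of depths.

    ASSUMPTION: one confidence map per branch; the source does not specify this.
    """
    height, width = len(depths[0]), len(depths[0][0])
    fused = [[0.0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            scores = [conf[y][x] for conf in confidences]
            peak = max(scores)  # subtract the max for numerical stability
            exps = [math.exp(s - peak) for s in scores]
            total = sum(exps)
            fused[y][x] = sum(
                (e / total) * depth[y][x] for e, depth in zip(exps, depths)
            )
    return fused
```

With equal confidences the result is the plain average of the three maps; as one branch's confidence grows at a pixel, the fused value at that pixel moves toward that branch's depth.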

If this is right

The color-guided branch supplies object-boundary cues that the semantic-guided branch then exploits to produce semantic depth.
The depth-guided branch integrates sparse depth with both color and semantic depths to form the final guided depth.
Adaptive fusion of the three depth outputs, with SAMMAFB fusing features between the branches, produces the backbone result before refinement.
CSPN++ with atrous convolutions applied to the fused map yields the final dense depth prediction.
The full pipeline outperforms earlier methods that rely on only one or two guidance modalities on the KITTI benchmark.
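
The order of dependencies in the list above can be sketched as a small wiring function. This is only a data-flow illustration under stated assumptions: every stage is passed in as a callable, and the toy stand-ins at the bottom (`fill_holes`, `mean_map`, the lambdas) are hypothetical placeholders, not the paper's networks, its fusion rule, or CSPN++.

```python
from typing import Callable, Dict, List

Grid = List[List[float]]  # H x W map stored as nested lists (illustrative type)


def run_semattnet_flow(
    sparse: Grid,
    rgb: Grid,
    semantic: Grid,
    color_branch: Callable[[Grid, Grid], Grid],
    semantic_branch: Callable[[Grid, Grid, Grid], Grid],
    depth_branch: Callable[[Grid, Grid, Grid], Grid],
    fuse: Callable[[Grid, Grid, Grid], Grid],
    refine: Callable[[Grid], Grid],
) -> Dict[str, Grid]:
    """Wire the stages in the order the abstract describes them."""
    color_depth = color_branch(sparse, rgb)
    # The semantic branch consumes the color branch's dense output.
    semantic_depth = semantic_branch(sparse, semantic, color_depth)
    # The depth branch consumes the sparse, color, and semantic depths.
    guided_depth = depth_branch(sparse, color_depth, semantic_depth)
    backbone_depth = fuse(color_depth, semantic_depth, guided_depth)
    return {
        "color_depth": color_depth,
        "semantic_depth": semantic_depth,
        "guided_depth": guided_depth,
        "backbone_depth": backbone_depth,
        "final_depth": refine(backbone_depth),
    }


# Toy stand-ins so the sketch runs; they are NOT the paper's networks.
def fill_holes(sparse: Grid, fill: float) -> Grid:
    """Keep measured depths (> 0) and put `fill` at the empty pixels."""
    return [[d if d > 0 else fill for d in row] for row in sparse]


def mean_map(*maps: Grid) -> Grid:
    """Per-pixel average of equally sized maps."""
    return [[sum(v) / len(v) for v in zip(*rows)] for rows in zip(*maps)]


outputs = run_semattnet_flow(
    sparse=[[5.0, 0.0]],
    rgb=[[0.2, 0.8]],
    semantic=[[1.0, 2.0]],
    color_branch=lambda s, rgb: fill_holes(s, 10.0),
    semantic_branch=lambda s, sem, color: fill_holes(s, color[0][1] + 2.0),
    depth_branch=lambda s, color, sem: mean_map(color, sem),
    fuse=mean_map,
    refine=lambda d: d,
)
```

Passing the stages as callables keeps the sketch neutral about how each branch is built; the only thing it commits to is which outputs feed which inputs.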

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The quality of the input semantic segmentation maps may become the practical performance limit if those maps contain errors in shadowed regions.
The same three-branch pattern with attention fusion could be tested on other sparse-to-dense tasks such as normal estimation or optical flow.
Evaluating the model on datasets that contain more extreme lighting variation than KITTI would test whether the semantic branch delivers the claimed robustness.
Ablating the attention fusion block while keeping the three branches would isolate whether the performance gain comes mainly from the added semantic input or from the fusion mechanism itself.

Load-bearing premise

The premise is that semantic segmentation supplies scene understanding that color images alone cannot provide, especially under sudden illumination changes, and that routing information through three separate guided branches with SAMMAFB attention fusion will measurably outperform prior single-guidance or two-branch methods.

What would settle it

Reproduce the model and measure its RMSE or MAE on the official KITTI depth completion validation set. If the errors are not lower than those of the prior published state of the art, the performance claim is falsified.
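
For readers who want to run that check, here is a minimal sketch of the four error measures named elsewhere on this page (RMSE, MAE, iRMSE, iMAE). It assumes depths are given in metres as flat sequences, that pixels with ground truth equal to 0 are unlabelled and skipped, and that results are reported in mm and 1/km to match the usual KITTI convention. The source gives no units, so those choices and the name `depth_errors` are assumptions.

```python
import math
from typing import Dict, Sequence


def depth_errors(pred_m: Sequence[float], gt_m: Sequence[float]) -> Dict[str, float]:
    """RMSE/MAE in mm and iRMSE/iMAE in 1/km over pixels with gt > 0."""
    # Only pixels that carry a ground-truth measurement are evaluated.
    pairs = [(p, g) for p, g in zip(pred_m, gt_m) if g > 0]
    if not pairs:
        raise ValueError("no valid ground-truth pixels")
    if any(p <= 0 for p, _ in pairs):
        raise ValueError("predictions must be positive to take inverse depth")
    n = len(pairs)
    err_mm = [(p - g) * 1000.0 for p, g in pairs]           # metres -> mm
    inv_err = [1000.0 / p - 1000.0 / g for p, g in pairs]   # inverse depth in 1/km
    return {
        "rmse_mm": math.sqrt(sum(e * e for e in err_mm) / n),
        "mae_mm": sum(abs(e) for e in err_mm) / n,
        "irmse_1_per_km": math.sqrt(sum(e * e for e in inv_err) / n),
        "imae_1_per_km": sum(abs(e) for e in inv_err) / n,
    }


scores = depth_errors(pred_m=[10.0, 20.0, 7.0], gt_m=[10.0, 22.0, 0.0])
```

In the usage line, the third pixel has no ground truth and is ignored, so the scores are computed over two pixels only. All four measures are errors, so lower is better.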

Figures

Figures reproduced from arXiv: 2204.13635 by Danish Nazir, Didier Stricker, Marcus Liwicki, Muhammad Zeshan Afzal.

**Figure 2.** Overview of the proposed SemAttNet. It consists of a novel three-branch backbone and a CSPN++ module with …

**Figure 3.** Architecture of SAMMAFB. The input to SAM…

**Figure 4.** Qualitative results on the KITTI depth completion test set. The results are obtained by online KITTI depth completion …

Original abstract

Depth completion involves recovering a dense depth map from a sparse map and an RGB image. Recent approaches focus on utilizing color images as guidance images to recover depth at invalid pixels. However, color images alone are not enough to provide the necessary semantic understanding of the scene. Consequently, the depth completion task suffers from sudden illumination changes in RGB images (e.g., shadows). In this paper, we propose a novel three-branch backbone comprising color-guided, semantic-guided, and depth-guided branches. Specifically, the color-guided branch takes a sparse depth map and RGB image as an input and generates color depth which includes color cues (e.g., object boundaries) of the scene. The predicted dense depth map of color-guided branch along-with semantic image and sparse depth map is passed as input to semantic-guided branch for estimating semantic depth. The depth-guided branch takes sparse, color, and semantic depths to generate the dense depth map. The color depth, semantic depth, and guided depth are adaptively fused to produce the output of our proposed three-branch backbone. In addition, we also propose to apply semantic-aware multi-modal attention-based fusion block (SAMMAFB) to fuse features between all three branches. We further use CSPN++ with Atrous convolutions to refine the dense depth map produced by our three-branch backbone. Extensive experiments show that our model achieves state-of-the-art performance in the KITTI depth completion benchmark at the time of submission.
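
The refinement stage named in the abstract belongs to the spatial propagation family, where each pixel's depth is repeatedly updated from its neighbours. The sketch below shows one heavily simplified propagation step with dilated (atrous) neighbour sampling. It assumes a single constant neighbour weight in place of learned per-pixel affinities, and it keeps measured sparse depths fixed. It omits everything CSPN++ learns, so it should be read as an illustration of the general idea, not as the refinement module used in the paper. `propagate_once` and its parameters are hypothetical names.

```python
from typing import List

Grid = List[List[float]]  # H x W map stored as nested lists (illustrative type)


def propagate_once(
    current: Grid,
    initial: Grid,
    sparse: Grid,
    weight: float = 0.125,
    dilation: int = 1,
) -> Grid:
    """One propagation step over the 8 neighbours sampled at `dilation`.

    ASSUMPTION: `weight` is a constant stand-in for learned affinities.
    """
    if not 0.0 <= weight <= 0.125:
        raise ValueError("weight must lie in [0, 1/8] so the weights sum to at most 1")
    height, width = len(current), len(current[0])
    offsets = [
        (dy * dilation, dx * dilation)
        for dy in (-1, 0, 1)
        for dx in (-1, 0, 1)
        if (dy, dx) != (0, 0)
    ]
    updated = [[0.0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            if sparse[y][x] > 0:  # keep measured depths untouched
                updated[y][x] = sparse[y][x]
                continue
            neighbour_sum, used = 0.0, 0.0
            for dy, dx in offsets:
                ny, nx = y + dy, x + dx
                if 0 <= ny < height and 0 <= nx < width:
                    neighbour_sum += weight * current[ny][nx]
                    used += weight
            # The remaining weight stays on the pixel's initial estimate.
            updated[y][x] = (1.0 - used) * initial[y][x] + neighbour_sum
    return updated
```

A larger `dilation` lets a pixel draw on neighbours farther away without adding more of them, which is the usual motivation for atrous sampling; neighbours that fall outside the image simply contribute nothing.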

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The three-branch network adds semantic guidance and SAMMAFB fusion to depth completion, but the abstract supplies no numbers or ablations to support the SOTA claim.

The paper presents SemAttNet, a three-branch backbone for depth completion that routes a sparse depth map through color-guided, semantic-guided, and depth-guided branches, fuses features between them with a semantic-aware multi-modal attention block (SAMMAFB), and refines the fused depth with CSPN++ and atrous convolutions. The stated goal is to supply scene semantics that color images miss under changing illumination such as shadows. That motivation is straightforward, and the staged branch design is a clear way to inject the extra modality. SAMMAFB is presented as the main technical addition on top of existing guided-completion pipelines, so the work is an incremental architecture change rather than a conceptual shift.

The central limitation is that the abstract asserts state-of-the-art results on KITTI without reporting any RMSE, MAE, or iRMSE figures, without ablations on the branches or the attention block, and without error bars or split details. Without those numbers it is impossible to judge whether the three-branch setup and the new fusion actually improve on prior color-plus-semantic or attention-based methods. Everything is still trained and tested on the same KITTI data, so the outcome remains a fitted empirical result.

The paper is aimed at the depth-completion community in robotics and autonomous systems. Readers already working on multi-modal fusion or KITTI baselines could extract the architectural choices for comparison, but only if the full experiments show clear, reproducible gains. The manuscript is coherent on its own terms and the design choices are traceable to prior guided-completion literature, so it is worth sending to referees who can examine the quantitative tables and ablation studies.

Referee Report

2 major / 1 minor

Summary. The paper proposes SemAttNet, a three-branch guided depth completion architecture consisting of color-guided, semantic-guided, and depth-guided branches whose features are fused via a semantic-aware multi-modal attention-based fusion block (SAMMAFB) and whose depth outputs are adaptively fused; the fused result is further refined by CSPN++ with atrous convolutions. The central claim is that this design achieves state-of-the-art performance on the KITTI depth completion benchmark by supplying semantic scene understanding that color guidance alone cannot provide under illumination changes.

Significance. If the reported performance gains are reproducible and the ablations confirm the contribution of the semantic branch and SAMMAFB, the work would provide concrete evidence that explicit semantic guidance improves robustness in guided depth completion, extending prior single- or dual-guidance methods with a principled multi-branch attention mechanism.

major comments (2)

[Abstract / Experiments] Abstract and Experiments section: the claim of state-of-the-art performance on KITTI is stated without any quantitative metrics (RMSE, MAE, etc.), baseline comparisons, ablation tables, or error bars, so the central empirical claim cannot be evaluated from the supplied information.
[Introduction / Method] The weakest assumption—that semantic segmentation supplies scene understanding unavailable from color images alone, especially under sudden illumination changes—is asserted but not tested with controlled experiments (e.g., performance on subsets with shadows or low-light); without such evidence the attribution of gains to the three-branch design remains unsubstantiated.

minor comments (1)

[Method] Notation for the three predicted depth maps (color depth, semantic depth, guided depth) is introduced in the abstract but never formalized with equations or a diagram in the provided text.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and outline the revisions we will make to improve clarity and substantiation of our claims.

Referee: [Abstract / Experiments] Abstract and Experiments section: the claim of state-of-the-art performance on KITTI is stated without any quantitative metrics (RMSE, MAE, etc.), baseline comparisons, ablation tables, or error bars, so the central empirical claim cannot be evaluated from the supplied information.

Authors: We acknowledge that the abstract does not quote specific numerical results. The Experiments section contains the full quantitative evaluation on the KITTI benchmark, including tables with RMSE, MAE, iRMSE and iMAE against prior methods, ablation tables isolating the semantic branch and SAMMAFB, and official leaderboard comparisons. We will revise the abstract to report the key metrics (e.g., our RMSE on the validation set) and add explicit cross-references to the tables. Error bars are not standard practice on the fixed KITTI test set; we will add a short note on reproducibility instead.

revision: yes

Referee: [Introduction / Method] The weakest assumption—that semantic segmentation supplies scene understanding unavailable from color images alone, especially under sudden illumination changes—is asserted but not tested with controlled experiments (e.g., performance on subsets with shadows or low-light); without such evidence the attribution of gains to the three-branch design remains unsubstantiated.

Authors: We agree that a controlled subset analysis would strengthen the attribution. The current results are reported on the full KITTI validation and test sets, which contain diverse illumination including shadows. Qualitative examples in the paper illustrate improved boundary recovery under shadows when the semantic branch is active. In revision we will add a short discussion subsection on illumination robustness and, where data permits, report performance on a manually annotated shadow/low-light subset or reference related analyses from the depth-completion literature.

revision: partial

Circularity Check

0 steps flagged

No significant circularity identified

The paper describes an empirical neural architecture (three-branch guided backbone with SAMMAFB fusion and CSPN++ refinement) trained on the KITTI training split and evaluated on its test split to claim SOTA performance. This is a standard fitted empirical outcome on an external benchmark, not a first-principles derivation or self-referential prediction. No equations, self-definitional steps, fitted-input-as-prediction reductions, or load-bearing self-citations appear in the provided text; the method is presented as a direct engineering extension of prior guided-completion work without reducing to its own inputs by construction.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim of state-of-the-art performance rests entirely on empirical training of a deep network on the KITTI dataset; no independent derivation or second benchmark beyond KITTI is supplied.

free parameters (1)

network weights
All convolutional and attention parameters are optimized on KITTI training data to produce the reported depth maps.

axioms (1)

[standard math] Gradient-based optimization of a multi-branch convolutional network can learn useful depth completion mappings.
Standard assumption underlying all supervised deep-learning training.

pith-pipeline@v0.9.0 · 5794 in / 1278 out tokens · 30235 ms · 2026-05-24T11:38:57.536585+00:00 · methodology

[1] [1]

Fast depth densiﬁcation for occlusion-aware augmented reality,

A. Holynski and J. Kopf, “Fast depth densiﬁcation for occlusion-aware augmented reality,” ACM Transactions on Graphics (ToG), vol. 37, no. 6, pp. 1–11, 2018

work page 2018

[2] [2]

A real-time interactive augmented reality depth estimation technique for surgical robotics,

M. Kalia, N. Navab, and T. Salcudean, “A real-time interactive augmented reality depth estimation technique for surgical robotics,” in 2019 Interna- tional Conference on Robotics and Automation (ICRA), pp. 8291–8297, 2019

work page 2019

[3] [3]

3d reconstruction with time-of-ﬂight depth camera and multiple mirrors,

T.-N. Nguyen, H.-H. Huynh, and J. Meunier, “3d reconstruction with time-of-ﬂight depth camera and multiple mirrors,” IEEE Access, vol. 6, pp. 38106–38114, 2018

work page 2018

[4] [4]

Non-local spatial propagation network for depth completion,

J. Park, K. Joo, Z. Hu, C.-K. Liu, and I. So Kweon, “Non-local spatial propagation network for depth completion,” in Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XIII 16, pp. 120–136, Springer, 2020

work page 2020

[5] [6]

Real- time dense mapping for self-driving vehicles using ﬁsheye cameras,

Z. Cui, L. Heng, Y . C. Yeo, A. Geiger, M. Pollefeys, and T. Sattler, “Real- time dense mapping for self-driving vehicles using ﬁsheye cameras,” in 2019 International Conference on Robotics and Automation (ICRA), pp. 6087–6093, 2019

work page 2019

[6] [7]

An indoor obstacle detection system using depth information and region growth,

H.-C. Huang, C.-T. Hsieh, and C.-H. Yeh, “An indoor obstacle detection system using depth information and region growth,” Sensors, vol. 15, no. 10, pp. 27116–27141, 2015

work page 2015

[7] [8]

Learning guided con- volutional network for depth completion,

J. Tang, F.-P. Tian, W. Feng, J. Li, and P. Tan, “Learning guided con- volutional network for depth completion,” IEEE Transactions on Image Processing, vol. 30, pp. 1116–1129, 2020

work page 2020

[8] [9]

Three-dimensional imaging in the studio and elsewhere,

G. J. Iddan and G. Yahav, “Three-dimensional imaging in the studio and elsewhere,” in Three-Dimensional Image Capture and Applications IV (B. D. Corner, J. H. Nurre, and R. P. Pargas, eds.), vol. 4298, pp. 48 – 55, International Society for Optics and Photonics, SPIE, 2001

work page 2001

[9] [10]

Sparsity invariant cnns,

J. Uhrig, N. Schneider, L. Schneider, U. Franke, T. Brox, and A. Geiger, “Sparsity invariant cnns,” 2017

work page 2017

[10] [11]

Are we ready for autonomous driving? the kitti vision benchmark suite,

A. Geiger, P. Lenz, and R. Urtasun, “Are we ready for autonomous driving? the kitti vision benchmark suite,” in 2012 IEEE Conference on Computer Vision and Pattern Recognition, pp. 3354–3361, 2012

work page 2012

[11] [12]

Deeplidar: Deep surface normal guided depth prediction for outdoor scene from sparse lidar data and single color image,

J. Qiu, Z. Cui, Y . Zhang, X. Zhang, S. Liu, B. Zeng, and M. Pollefeys, “Deeplidar: Deep surface normal guided depth prediction for outdoor scene from sparse lidar data and single color image,” in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2019

work page 2019

[12] [13]

Towards precise and efﬁcient image guided depth completion,

M. Hu, S. Wang, B. Li, S. Ning, L. Fan, and X. Gong, “Towards precise and efﬁcient image guided depth completion,” 2021

work page 2021

[13] [14]

Rignet: Repetitive image guided network for depth completion,

Z. Yan, K. Wang, X. Li, Z. Zhang, B. Xu, J. Li, and J. Yang, “Rignet: Repetitive image guided network for depth completion,” 2021

work page 2021

[14] [15]

Depth completion using plane- residual representation,

B.-U. Lee, K. Lee, and I. S. Kweon, “Depth completion using plane- residual representation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 13916–13925, 2021

work page 2021

[15] [16]

Multitask gans for semantic segmentation and depth completion with cycle consistency,

C. Zhang, Y . Tang, C. Zhao, Q. Sun, Z. Ye, and J. Kurths, “Multitask gans for semantic segmentation and depth completion with cycle consistency,” IEEE Transactions on Neural Networks and Learning Systems, pp. 1–12, 2021

work page 2021

[16] [17]

Adaptive context-aware multi-modal network for depth completion,

S. Zhao, M. Gong, H. Fu, and D. Tao, “Adaptive context-aware multi-modal network for depth completion,” IEEE Transactions on Image Processing, 2021

work page 2021

[17] [18]

Fcfr-net: Feature fusion based coarse-to-fine residual learning for depth completion,

L. Liu, X. Song, X. Lyu, J. Diao, M. Wang, Y. Liu, and L. Zhang, “Fcfr-net: Feature fusion based coarse-to-fine residual learning for depth completion,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2136–2144, 2021

work page 2021

[18] [19]

Cspn++: Learning context and resource aware convolutional spatial propagation networks for depth completion,

X. Cheng, P. Wang, C. Guan, and R. Yang, “Cspn++: Learning context and resource aware convolutional spatial propagation networks for depth completion,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, pp. 10615–10622, 2020

work page 2020

[19] [20]

Learning joint 2d-3d representations for depth completion,

Y. Chen, B. Yang, M. Liang, and R. Urtasun, “Learning joint 2d-3d representations for depth completion,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10023–10032, 2019

work page 2019

[20] [21]

Semantically guided depth upsampling,

N. Schneider, L. Schneider, P. Pinggera, U. Franke, M. Pollefeys, and C. Stiller, “Semantically guided depth upsampling,” in German conference on pattern recognition, pp. 37–48, Springer, 2016

work page 2016

[21] [22]

Towards scene understanding: Unsupervised monocular depth estimation with semantic-aware representation,

P.-Y. Chen, A. H. Liu, Y.-C. Liu, and Y.-C. F. Wang, “Towards scene understanding: Unsupervised monocular depth estimation with semantic-aware representation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2624–2632, 2019

work page 2019

[22] [23]

Sparse and noisy lidar completion with rgb guidance and uncertainty,

W. V. Gansbeke, D. Neven, B. D. Brabandere, and L. V. Gool, “Sparse and noisy lidar completion with rgb guidance and uncertainty,” 2019

work page 2019

[23] [24]

Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs,

L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille, “Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs,” 2017

work page 2017

[24] [25]

F. Ma, G. V. Cavalheiro, and S. Karaman, “Self-supervised sparse-to-dense: Self-supervised depth completion from lidar and monocular camera,” in 2019 International Conference on Robotics and Automation (ICRA), pp. 3288–3295, IEEE, 2019

work page 2019

[25] [26]

Cbam: Convolutional block attention module,

S. Woo, J. Park, J.-Y. Lee, and I. S. Kweon, “Cbam: Convolutional block attention module,” in Proceedings of the European conference on computer vision (ECCV), pp. 3–19, 2018

work page 2018

[26] [27]

Multi-modal attention-based fusion model for semantic segmentation of rgb-depth images,

F. Fooladgar and S. Kasaei, “Multi-modal attention-based fusion model for semantic segmentation of rgb-depth images,” arXiv preprint arXiv:1912.11691, 2019

work page arXiv 2019

[27] [28]

Deep convolutional compressed sensing for lidar depth completion,

N. Chodosh, C. Wang, and S. Lucey, “Deep convolutional compressed sensing for lidar depth completion,” 2018

work page 2018

[28] [29]

Depthnet: Real-time lidar point cloud depth completion for autonomous vehicles,

L. Bai, Y. Zhao, M. Elhousni, and X. Huang, “Depthnet: Real-time lidar point cloud depth completion for autonomous vehicles,” IEEE Access, vol. 8, pp. 227825–227833, 2020

work page 2020

[29] [30]

Uncertainty-aware cnns for depth completion: Uncertainty from beginning to end,

A. Eldesokey, M. Felsberg, K. Holmquist, and M. Persson, “Uncertainty-aware cnns for depth completion: Uncertainty from beginning to end,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 12014–12023, 2020

work page 2020

[30] [32]

Denselidar: A real-time pseudo dense depth guided depth completion network,

J. Gu, Z. Xiang, Y. Ye, and L. Wang, “Denselidar: A real-time pseudo dense depth guided depth completion network,” IEEE Robotics and Automation Letters, vol. 6, no. 2, pp. 1808–1815, 2021

work page 2021

[31] [33]

Learning depth with convolutional spatial propagation network,

X. Cheng, P. Wang, and R. Yang, “Learning depth with convolutional spatial propagation network,” IEEE transactions on pattern analysis and machine intelligence, 2019

work page 2019

[32] [34]

Learning Affinity via Spatial Propagation Networks

S. Liu, S. De Mello, J. Gu, G. Zhong, M.-H. Yang, and J. Kautz, “Learning affinity via spatial propagation networks,” arXiv preprint arXiv:1710.01020, 2017

work page arXiv 2017

[33] [35]

Autoencoder for words,

C.-Y. Liou, W.-C. Cheng, J.-W. Liou, and D.-R. Liou, “Autoencoder for words,” Neurocomputing, vol. 139, pp. 84–96, 2014

work page 2014

[34] [36]

Attention-based multimodal fusion for video description,

C. Hori, T. Hori, T.-Y. Lee, Z. Zhang, B. Harsham, J. R. Hershey, T. K. Marks, and K. Sumi, “Attention-based multimodal fusion for video description,” in Proceedings of the IEEE international conference on computer vision, pp. 4193–4202, 2017

work page 2017

[35] [37]

Attention-based multimodal fusion for estimating human emotion in real-world hri,

Y. Li, T. Zhao, and X. Shen, “Attention-based multimodal fusion for estimating human emotion in real-world hri,” in Companion of the 2020 ACM/IEEE International Conference on Human-Robot Interaction, pp. 340–342, 2020

work page 2020

[36] [38]

Attention-based multimodal contextual fusion for sentiment and emotion classification using bidirectional lstm,

M. G. Huddar, S. S. Sannakki, and V. S. Rajpurohit, “Attention-based multimodal contextual fusion for sentiment and emotion classification using bidirectional lstm,” Multimedia Tools and Applications, vol. 80, no. 9, pp. 13059–13076, 2021

work page 2021

[37] [39]

Deep residual learning for image recognition,

K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778, 2016

work page 2016

[38] [40]

From depth what can you see? depth completion via auxiliary image reconstruction,

K. Lu, N. Barnes, S. Anwar, and L. Zheng, “From depth what can you see? depth completion via auxiliary image reconstruction,” in 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 11303–11312, 2020

work page 2020

[39] [41]

Wider or deeper: Revisiting the resnet model for visual recognition,

Z. Wu, C. Shen, and A. van den Hengel, “Wider or deeper: Revisiting the resnet model for visual recognition,” 2016

work page 2016

[40] [42]

Augmented reality meets computer vision: Efficient data generation for urban driving scenes,

H. Alhaija, S. Mustikovela, L. Mescheder, A. Geiger, and C. Rother, “Augmented reality meets computer vision: Efficient data generation for urban driving scenes,” International Journal of Computer Vision (IJCV), 2018

work page 2018

[41] [43]

Multi-scale context aggregation by dilated convolutions,

F. Yu and V. Koltun, “Multi-scale context aggregation by dilated convolutions,” 2016

work page 2016

[42] [44]

Deformable spatial propagation networks for depth completion,

Z. Xu, H. Yin, and J. Yao, “Deformable spatial propagation networks for depth completion,” in 2020 IEEE International Conference on Image Processing (ICIP), pp. 913–917, IEEE, 2020

work page 2020

[43] [45]

Depth completion with twin surface extrapolation at occlusion boundaries,

S. Imran, X. Liu, and D. Morris, “Depth completion with twin surface extrapolation at occlusion boundaries,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2583–2592, 2021

work page 2021

[44] [46]

Pytorch: An imperative style, high-performance deep learning library,

A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, A. Desmaison, A. Kopf, E. Yang, Z. DeVito, M. Raison, A. Tejani, S. Chilamkurthy, B. Steiner, L. Fang, J. Bai, and S. Chintala, “Pytorch: An imperative style, high-performance deep learning library,” in Advances in Neural Information Processing S...

work page 2019

[45] [47]

Adam: A method for stochastic optimization,

D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” 2017

work page 2017

[46] [48]

Sparse-to-dense: Depth prediction from sparse depth samples and a single image,

F. Ma and S. Karaman, “Sparse-to-dense: Depth prediction from sparse depth samples and a single image,” in 2018 IEEE international conference on robotics and automation (ICRA), pp. 4796–4803, IEEE, 2018.

work page 2018