
arxiv: 2204.13635 · v2 · submitted 2022-04-28 · 💻 cs.CV

SemAttNet: Towards Attention-based Semantic Aware Guided Depth Completion

Pith reviewed 2026-05-24 11:38 UTC · model grok-4.3

classification 💻 cs.CV
keywords depth completion · semantic segmentation · attention-based fusion · multi-modal guidance · KITTI benchmark · guided depth estimation · CSPN++ refinement

The pith

A three-branch network with semantic guidance and attention fusion completes sparse depth maps more accurately than color-only approaches.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes a three-branch backbone that chains color-guided, semantic-guided, and depth-guided paths. The color branch takes the sparse depth map and the RGB image and produces a color depth that carries object-boundary cues; the semantic branch takes that output plus a semantic segmentation map and the sparse depth map to add scene-level understanding; the depth branch combines the sparse, color, and semantic depths into a guided depth. The three depth maps are adaptively fused, with the semantic-aware multi-modal attention-based fusion block (SAMMAFB) fusing features between the branches, before CSPN++ refinement. The central claim is that semantic input supplies the scene understanding missing from RGB guidance alone, especially when sudden illumination changes such as shadows appear in the image. If correct, the method would yield more accurate dense depth maps on benchmarks that include real-world lighting variation.

Core claim

The central claim is that routing sparse depth and RGB through a three-branch backbone (color-guided, semantic-guided, depth-guided), adaptively fusing the resulting color depth, semantic depth, and guided depth while the semantic-aware multi-modal attention-based fusion block (SAMMAFB) fuses features between the branches, then refining with CSPN++ and atrous convolutions, produces a dense depth map that achieves state-of-the-art performance on the KITTI depth completion benchmark at the time of submission.

What carries the argument

The three-branch backbone (color-guided, semantic-guided, depth-guided), whose color, semantic, and guided depth outputs are adaptively fused and whose features are fused across branches by the semantic-aware multi-modal attention-based fusion block (SAMMAFB).
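A minimal sketch of what adaptive fusion of the three depth maps could look like, assuming each branch also emits a per-pixel confidence map and that the fusion weights come from a softmax over those confidences. Both assumptions are illustrative: the abstract says only that the color, semantic, and guided depths are "adaptively fused", and this sketch is not the SAMMAFB feature-fusion block. The name fuse_depths and the toy values are hypothetical.

```python
import math


def fuse_depths(depths, confidences):
    """Fuse K depth maps pixel-wise with softmax-normalized confidences.

    depths, confidences: lists of K maps, each an H x W nested list.
    The confidence maps are an assumption of this sketch.
    """
    height, width = len(depths[0]), len(depths[0][0])
    fused = [[0.0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            logits = [conf[y][x] for conf in confidences]
            peak = max(logits)  # subtract the max for numerical stability
            weights = [math.exp(value - peak) for value in logits]
            total = sum(weights)
            fused[y][x] = sum(
                w * depth[y][x] for w, depth in zip(weights, depths)
            ) / total
    return fused


# Toy 1 x 2 maps (made-up values, in meters).
color_depth = [[10.0, 20.0]]
semantic_depth = [[12.0, 20.0]]
guided_depth = [[14.0, 26.0]]
equal = [[0.0, 0.0]]       # same confidence everywhere
confident = [[50.0, 50.0]]  # one branch strongly preferred

mean_fused = fuse_depths(
    [color_depth, semantic_depth, guided_depth], [equal, equal, equal]
)
guided_fused = fuse_depths(
    [color_depth, semantic_depth, guided_depth], [equal, equal, confident]
)
```

With equal confidences the result reduces to the plain per-pixel mean; a large confidence on one branch pulls the fused value toward that branch's prediction.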

If this is right

  • The color-guided branch supplies object-boundary cues that the semantic-guided branch then exploits to produce semantic depth.
  • The depth-guided branch integrates sparse depth with both color and semantic depths to form the final guided depth.
  • Adaptive fusion of the three depth outputs, with SAMMAFB fusing features between the branches, produces the backbone result before refinement.
  • CSPN++ with atrous convolutions applied to the fused map yields the final dense depth prediction.
  • The full pipeline outperforms earlier methods that rely on only one or two guidance modalities on the KITTI benchmark.
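The refinement stage in the list above builds on convolutional spatial propagation. The sketch below shows one simplified CSPN-style update on a small grid: each pixel becomes a weighted mix of its neighbors' current depths and its own initial depth, with neighbor weights normalized by the sum of their absolute values. It is an illustration under stated assumptions, not the CSPN++ module used in the paper: it omits CSPN++'s learned kernel-size and iteration selection, ignores out-of-bounds neighbors instead of zero-padding, and the dilation argument only gestures at the atrous variant. The name propagate_once and the toy inputs are hypothetical.

```python
# Offsets of the 8 neighbors in a 3x3 window (the center is handled separately).
OFFSETS = [
    (dy, dx) for dy in (-1, 0, 1) for dx in (-1, 0, 1) if (dy, dx) != (0, 0)
]


def propagate_once(current, initial, affinity, dilation=1):
    """Run one propagation step over an H x W depth map.

    current:  depth map at the current iteration
    initial:  depth map before any propagation (weights the center term)
    affinity: H x W grid of 8 raw affinities, ordered like OFFSETS
    dilation: spacing between the center and its neighbors
    """
    height, width = len(current), len(current[0])
    refined = [[0.0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            raw = affinity[y][x]
            norm = sum(abs(a) for a in raw) or 1.0  # avoid division by zero
            neighbor_sum, used = 0.0, 0.0
            for (dy, dx), a in zip(OFFSETS, raw):
                ny, nx = y + dy * dilation, x + dx * dilation
                if 0 <= ny < height and 0 <= nx < width:
                    weight = a / norm
                    neighbor_sum += weight * current[ny][nx]
                    used += weight
            # The center weight is whatever the in-bounds neighbors left over.
            refined[y][x] = (1.0 - used) * initial[y][x] + neighbor_sum
    return refined


uniform = [[[1.0] * 8 for _ in range(3)] for _ in range(3)]

# A constant map should be a fixed point of the update.
flat = [[5.0] * 3 for _ in range(3)]
flat_out = propagate_once(flat, flat, uniform)

# A single spike spreads to its neighbors.
spike = [[0.0, 0.0, 0.0], [0.0, 8.0, 0.0], [0.0, 0.0, 0.0]]
spike_out = propagate_once(spike, spike, uniform)
```

Because the weights at every pixel sum to one, a constant depth map passes through unchanged, which is a quick sanity check for any propagation kernel of this kind.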

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The quality of the input semantic segmentation maps may become the practical performance limit if those maps contain errors in shadowed regions.
  • The same three-branch pattern with attention fusion could be tested on other sparse-to-dense tasks such as normal estimation or optical flow.
  • Evaluating the model on datasets that contain more extreme lighting variation than KITTI would test whether the semantic branch delivers the claimed robustness.
  • Ablating the attention fusion block while keeping the three branches would isolate whether the performance gain comes mainly from the added semantic input or from the fusion mechanism itself.

Load-bearing premise

Semantic segmentation supplies scene understanding that color images alone cannot provide, especially under sudden illumination changes, and that routing information through three separate guided branches plus SAMMAFB attention fusion will measurably outperform prior single-guidance or two-branch methods.

What would settle it

Reproduce the model and measure its RMSE and MAE on the official KITTI depth completion validation set. If the numbers do not surpass the prior published state of the art, the performance claim is falsified.
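For reference, the two error measures named above can be computed as in the sketch below. It assumes depth maps stored as nested lists in the same unit, and that a ground-truth value of 0 marks a pixel with no measurement, so only pixels with positive ground truth are scored. The function name masked_errors and the toy values are hypothetical, and how per-image values are aggregated across a dataset is left out.

```python
import math


def masked_errors(pred, gt):
    """Return (rmse, mae) over pixels whose ground truth is valid (> 0)."""
    squared_sum, absolute_sum, count = 0.0, 0.0, 0
    for pred_row, gt_row in zip(pred, gt):
        for p, g in zip(pred_row, gt_row):
            if g > 0:  # skip pixels without ground truth
                diff = p - g
                squared_sum += diff * diff
                absolute_sum += abs(diff)
                count += 1
    if count == 0:
        raise ValueError("no valid ground-truth pixels")
    return math.sqrt(squared_sum / count), absolute_sum / count


# Toy 2 x 2 example in millimeters; the top-right pixel has no ground truth.
pred = [[1000.0, 2000.0], [3000.0, 4000.0]]
gt = [[1003.0, 0.0], [2996.0, 4000.0]]
rmse, mae = masked_errors(pred, gt)
```

In the toy example the three valid pixels have errors of -3, 4, and 0, so the RMSE is the square root of 25/3 and the MAE is 7/3.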

Figures

Figures reproduced from arXiv: 2204.13635 by Danish Nazir, Didier Stricker, Marcus Liwicki, Muhammad Zeshan Afzal.

Figure 1: Block diagram of SemAttNet. In the first stage, …
Figure 2: The overview of proposed SemAttNet. It consists of a novel three-branch backbone and a CSPN++ module with …
Figure 3: Architecture of SAMMAFB. The input to SAM …
Figure 4: Qualitative results on KITTI depth completion test set. The results are obtained by online KITTI depth completion …
Original abstract

Depth completion involves recovering a dense depth map from a sparse map and an RGB image. Recent approaches focus on utilizing color images as guidance images to recover depth at invalid pixels. However, color images alone are not enough to provide the necessary semantic understanding of the scene. Consequently, the depth completion task suffers from sudden illumination changes in RGB images (e.g., shadows). In this paper, we propose a novel three-branch backbone comprising color-guided, semantic-guided, and depth-guided branches. Specifically, the color-guided branch takes a sparse depth map and RGB image as an input and generates color depth which includes color cues (e.g., object boundaries) of the scene. The predicted dense depth map of color-guided branch along-with semantic image and sparse depth map is passed as input to semantic-guided branch for estimating semantic depth. The depth-guided branch takes sparse, color, and semantic depths to generate the dense depth map. The color depth, semantic depth, and guided depth are adaptively fused to produce the output of our proposed three-branch backbone. In addition, we also propose to apply semantic-aware multi-modal attention-based fusion block (SAMMAFB) to fuse features between all three branches. We further use CSPN++ with Atrous convolutions to refine the dense depth map produced by our three-branch backbone. Extensive experiments show that our model achieves state-of-the-art performance in the KITTI depth completion benchmark at the time of submission.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper proposes SemAttNet, a three-branch guided depth completion architecture consisting of color-guided, semantic-guided, and depth-guided branches whose outputs are fused via a semantic-aware multi-modal attention-based fusion block (SAMMAFB); the fused result is further refined by CSPN++ with atrous convolutions. The central claim is that this design achieves state-of-the-art performance on the KITTI depth completion benchmark by supplying semantic scene understanding that color guidance alone cannot provide under illumination changes.

Significance. If the reported performance gains are reproducible and the ablations confirm the contribution of the semantic branch and SAMMAFB, the work would provide concrete evidence that explicit semantic guidance improves robustness in guided depth completion, extending prior single- or dual-guidance methods with a principled multi-branch attention mechanism.

major comments (2)
  1. [Abstract / Experiments] Abstract and Experiments section: the claim of state-of-the-art performance on KITTI is stated without any quantitative metrics (RMSE, MAE, etc.), baseline comparisons, ablation tables, or error bars, so the central empirical claim cannot be evaluated from the supplied information.
  2. [Introduction / Method] The weakest assumption—that semantic segmentation supplies scene understanding unavailable from color images alone, especially under sudden illumination changes—is asserted but not tested with controlled experiments (e.g., performance on subsets with shadows or low-light); without such evidence the attribution of gains to the three-branch design remains unsubstantiated.
minor comments (1)
  1. [Method] Notation for the three predicted depth maps (color depth, semantic depth, guided depth) is introduced in the abstract but never formalized with equations or a diagram in the provided text.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and outline the revisions we will make to improve clarity and substantiation of our claims.

  1. Referee: [Abstract / Experiments] Abstract and Experiments section: the claim of state-of-the-art performance on KITTI is stated without any quantitative metrics (RMSE, MAE, etc.), baseline comparisons, ablation tables, or error bars, so the central empirical claim cannot be evaluated from the supplied information.

    Authors: We acknowledge that the abstract does not quote specific numerical results. The Experiments section contains the full quantitative evaluation on the KITTI benchmark, including tables with RMSE, MAE, iRMSE and iMAE against prior methods, ablation tables isolating the semantic branch and SAMMAFB, and official leaderboard comparisons. We will revise the abstract to report the key metrics (e.g., our RMSE on the validation set) and add explicit cross-references to the tables. Error bars are not standard practice on the fixed KITTI test set; we will add a short note on reproducibility instead. revision: yes

  2. Referee: [Introduction / Method] The weakest assumption—that semantic segmentation supplies scene understanding unavailable from color images alone, especially under sudden illumination changes—is asserted but not tested with controlled experiments (e.g., performance on subsets with shadows or low-light); without such evidence the attribution of gains to the three-branch design remains unsubstantiated.

    Authors: We agree that a controlled subset analysis would strengthen the attribution. The current results are reported on the full KITTI validation and test sets, which contain diverse illumination including shadows. Qualitative examples in the paper illustrate improved boundary recovery under shadows when the semantic branch is active. In revision we will add a short discussion subsection on illumination robustness and, where data permits, report performance on a manually annotated shadow/low-light subset or reference related analyses from the depth-completion literature. revision: partial
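The subset analysis discussed in the second response could be organized as in the sketch below: pool squared errors per illumination tag, then report one RMSE per tag so that shadowed and normally lit frames can be compared. The tags, the record layout, and every number here are made-up placeholders for illustration, not results from the paper.

```python
import math
from collections import defaultdict


def rmse_by_subset(records):
    """Pool errors per tag and return {tag: rmse}.

    records: iterable of (tag, squared_error_sum, valid_pixel_count),
    one entry per evaluated image.
    """
    squared = defaultdict(float)
    pixels = defaultdict(int)
    for tag, squared_error_sum, valid_pixel_count in records:
        squared[tag] += squared_error_sum
        pixels[tag] += valid_pixel_count
    return {tag: math.sqrt(squared[tag] / pixels[tag]) for tag in squared}


# Hypothetical per-image records: (tag, sum of squared errors, valid pixels).
records = [
    ("shadow", 50.0, 2),
    ("shadow", 150.0, 6),
    ("normal", 36.0, 4),
]
per_subset = rmse_by_subset(records)
```

Running the same grouping for a model with and without the semantic branch would show whether any gain concentrates in the shadow subset.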

Circularity Check

0 steps flagged

No significant circularity identified


The paper describes an empirical neural architecture (three-branch guided backbone with SAMMAFB fusion and CSPN++ refinement) trained on the KITTI training split and evaluated on its test split to claim SOTA performance. This is a standard fitted empirical outcome on an external benchmark, not a first-principles derivation or self-referential prediction. No equations, self-definitional steps, fitted-input-as-prediction reductions, or load-bearing self-citations appear in the provided text; the method is presented as a direct engineering extension of prior guided-completion work without reducing to its own inputs by construction.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 0 invented entities

The central claim of state-of-the-art performance rests entirely on empirical training of a deep network on the KITTI dataset; no independent derivation and no benchmark beyond KITTI is described in the provided text.

free parameters (1)
  • network weights
    All convolutional and attention parameters are optimized on KITTI training data to produce the reported depth maps.
axioms (1)
  • standard math Gradient-based optimization of a multi-branch convolutional network can learn useful depth completion mappings.
    Standard assumption underlying all supervised deep-learning training.


