EfficientPENet: Real-Time Depth Completion from Sparse LiDAR via Lightweight Multi-Modal Fusion
Pith reviewed 2026-05-10 05:17 UTC · model grok-4.3
The pith
EfficientPENet shows that a ConvNeXt-based two-branch network with late fusion and CSPN can complete sparse LiDAR depth maps in real time on edge hardware while preserving competitive accuracy.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
EfficientPENet is a two-branch depth completion network that replaces conventional ResNet encoders with ConvNeXt blocks. The RGB branch uses an ImageNet-pretrained ConvNeXt with LayerNorm and stochastic depth, while the depth branch employs sparsity-invariant convolutions. Late fusion merges the features of the two branches, followed by multi-scale decoding and CSPN refinement. Position-aware test-time augmentation corrects the coordinate tensors when the inputs are flipped horizontally. On KITTI, this yields an RMSE of 631.94 mm using 36.24 million parameters at 20.51 ms latency (48.76 FPS). The paper states this as 3.7 times fewer parameters and a 23 times speedup relative to BP-Net, at competitive accuracy.
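The quoted figures can be cross-checked with simple arithmetic. The sketch below recomputes the frame rate from the latency and derives the baseline values that the stated ratios would imply. Those implied baseline values are an inference from the ratios only; they are not measurements and are not quoted from the BP-Net paper.

```python
# Figures quoted in the core claim.
latency_ms = 20.51
params_m = 36.24
param_ratio = 3.7    # stated parameter reduction relative to BP-Net
speed_ratio = 23.0   # stated speedup relative to BP-Net

# Throughput follows directly from per-frame latency at batch size 1.
fps = 1000.0 / latency_ms

# Baseline values IMPLIED by the stated ratios (assumption: the ratios are
# exact; these are not measured and not taken from the BP-Net publication).
implied_baseline_params_m = params_m * param_ratio
implied_baseline_latency_ms = latency_ms * speed_ratio

print(f"{fps:.2f} FPS")
print(f"implied baseline: {implied_baseline_params_m:.1f} M parameters, "
      f"{implied_baseline_latency_ms:.1f} ms")
```

The recomputed frame rate agrees with the reported 48.76 FPS, so the latency and FPS figures are at least internally consistent.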
What carries the argument
The lightweight multi-modal fusion architecture combining ConvNeXt RGB encoding, sparsity-invariant depth encoding, late fusion, and CSPN propagation for efficient dense depth prediction from sparse inputs.
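To make the CSPN propagation stage concrete, here is a minimal pure-Python sketch of one 3x3 propagation step, following the commonly cited CSPN formulation: neighbour affinities are normalised by their absolute sum, the centre weight is one minus the sum of the neighbour weights and is applied to the initial depth, and pixels with a LiDAR measurement are reset to that measurement. The function name `cspn_step`, the dict-based affinity layout, and the border handling (out-of-image neighbours are skipped) are assumptions made for illustration, not the paper's implementation.

```python
def cspn_step(depth_t, depth_0, affinity, sparse, mask):
    """One 3x3 spatial propagation step on 2D lists (hypothetical helper).

    affinity[i][j] maps a neighbour offset (di, dj) to a raw affinity value.
    """
    h, w = len(depth_t), len(depth_t[0])
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            raw = affinity[i][j]
            # Normalise by the absolute sum so the update stays stable.
            norm = sum(abs(v) for v in raw.values()) or 1.0
            acc, used = 0.0, 0.0
            for (di, dj), v in raw.items():
                ni, nj = i + di, j + dj
                if 0 <= ni < h and 0 <= nj < w:  # assumption: skip outside pixels
                    k = v / norm
                    acc += k * depth_t[ni][nj]
                    used += k
            # Centre weight keeps the total weight at one; it uses the initial depth.
            value = (1.0 - used) * depth_0[i][j] + acc
            # Keep valid sparse measurements fixed.
            out[i][j] = sparse[i][j] if mask[i][j] else value
    return out
```

Because the weights at each pixel sum to one, a constant depth map is a fixed point of the update, which is a quick sanity check for any implementation.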
If this is right
- Real-time depth completion becomes feasible on NVIDIA Jetson-class hardware.
- Model size shrinks enough for deployment where memory is limited.
- Accuracy remains close enough to heavier models for practical robotic use.
- Position-aware test-time augmentation reduces error at inference without retraining, through coordinate correction.
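The coordinate correction in the last point can be illustrated with a toy example. The sketch below is one plausible reading of the scheme, since this summary does not spell out the exact correction: when the inputs are flipped horizontally, the coordinate tensor is flipped with them, so each pixel keeps the coordinate of its original location. The names `hflip`, `u_coords`, `toy_model`, and `flip_tta` are hypothetical, the RGB input is omitted for brevity, and `toy_model` is a stand-in for the network.

```python
def hflip(grid):
    """Mirror a 2D list left to right."""
    return [row[::-1] for row in grid]

def u_coords(h, w):
    """Normalised horizontal coordinate in [-1, 1] for every pixel."""
    return [[2.0 * j / (w - 1) - 1.0 for j in range(w)] for _ in range(h)]

def toy_model(depth, coords):
    """Stand-in network: a pointwise function of depth and position."""
    return [[d + 0.1 * u for d, u in zip(d_row, u_row)]
            for d_row, u_row in zip(depth, coords)]

def flip_tta(model, depth, coords, correct_coords=True):
    """Average the direct prediction with a flipped-and-restored prediction."""
    direct = model(depth, coords)
    # Assumed correction: flip the coordinates together with the image content.
    flipped_coords = hflip(coords) if correct_coords else coords
    restored = hflip(model(hflip(depth), flipped_coords))
    return [[0.5 * (a + b) for a, b in zip(ra, rb)]
            for ra, rb in zip(direct, restored)]

depth = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
coords = u_coords(2, 3)
corrected = flip_tta(toy_model, depth, coords, correct_coords=True)
naive = flip_tta(toy_model, depth, coords, correct_coords=False)
```

With the correction, the two passes of this position-dependent toy model agree exactly; without it, the flipped pass sees content paired with the wrong coordinates and the average drifts. Note that the flipped pass costs a second forward pass at inference.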
Where Pith is reading between the lines
- The approach may generalize to other sensor fusion tasks where one modality is sparse.
- Modern ConvNeXt backbones could replace older encoders in similar vision tasks without major redesign.
- Further speedups might come from model quantization given the already low latency.
Load-bearing premise
The claim rests on the premise that the combination of ConvNeXt, sparsity-invariant convolutions, late fusion, CSPN, and position-aware augmentation delivers the reported accuracy-speed tradeoff without overfitting to the specifics of the KITTI benchmark.
What would settle it
Running the model on a held-out outdoor dataset with different LiDAR sparsity patterns, and checking whether RMSE stays below 700 mm at similar frame rates, would confirm or refute that the accuracy-speed tradeoff generalizes.
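For such a check, the RMSE would need to be computed the same way on the new dataset. A minimal sketch follows, assuming depths are given in metres and that a ground-truth value of 0 marks a pixel without annotation; both conventions should be confirmed against the dataset in use. The function name `rmse_mm` is hypothetical.

```python
import math

def rmse_mm(pred_m, gt_m):
    """RMSE in millimetres over pixels that have ground truth (gt > 0)."""
    sq_sum, n = 0.0, 0
    for p_row, g_row in zip(pred_m, gt_m):
        for p, g in zip(p_row, g_row):
            if g > 0:  # assumption: 0 encodes "no ground truth"
                sq_sum += (p - g) ** 2
                n += 1
    if n == 0:
        raise ValueError("no valid ground-truth pixels")
    return 1000.0 * math.sqrt(sq_sum / n)

# Two valid pixels, each off by 0.5 m; the other two are ignored.
pred = [[10.0, 5.0], [3.0, 0.0]]
gt = [[10.5, 0.0], [3.5, 0.0]]
error_mm = rmse_mm(pred, gt)
```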
Original abstract
Depth completion from sparse LiDAR measurements and corresponding RGB images is a prerequisite for accurate 3D perception in robotic systems. Existing methods achieve high accuracy on standard benchmarks but rely on heavy backbone architectures that preclude real-time deployment on embedded hardware. We present EfficientPENet, a two-branch depth completion network that replaces the conventional ResNet encoder with a modernized ConvNeXt backbone, introduces sparsity-invariant convolutions for the depth stream, and refines predictions through a Convolutional Spatial Propagation Network (CSPN). The RGB branch leverages ImageNet-pretrained ConvNeXt blocks with Layer Normalization, 7x7 depthwise convolutions, and stochastic depth regularization. Features from both branches are merged via late fusion and decoded through a multi-scale deep supervision strategy. We further introduce a position-aware test-time augmentation scheme that corrects coordinate tensors during horizontal flipping, yielding consistent error reduction at inference. On the KITTI depth completion benchmark, EfficientPENet achieves an RMSE of 631.94 mm with 36.24M parameters and a latency of 20.51 ms, operating at 48.76 FPS. This represents a 3.7 times reduction in parameters and a 23 times speedup relative to BP-Net, while maintaining competitive accuracy. These results establish EfficientPENet as a practical solution for real-time depth completion on resource-constrained edge platforms such as the NVIDIA Jetson.
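The sparsity-invariant convolution used in the depth stream can be sketched in a few lines. The version below follows the general idea of the sparsity-invariant CNN formulation: only observed pixels contribute, the response is normalised by the number of observed pixels in the window, and the validity mask is propagated by a max over the window. It is a single-channel, pure-Python illustration; the name `sparse_conv3x3` and the treatment of image borders are assumptions, not the paper's code.

```python
def sparse_conv3x3(x, mask, weight, bias=0.0, eps=1e-8):
    """Single-channel 3x3 sparsity-invariant convolution on 2D lists.

    x: input values, mask: 1 where x is observed, weight: 3x3 kernel.
    Returns (output, new_mask).
    """
    h, w = len(x), len(x[0])
    out = [[0.0] * w for _ in range(h)]
    new_mask = [[0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            num, cnt = 0.0, 0
            for di in (-1, 0, 1):
                for dj in (-1, 0, 1):
                    ni, nj = i + di, j + dj
                    # Only observed, in-image pixels contribute.
                    if 0 <= ni < h and 0 <= nj < w and mask[ni][nj]:
                        num += weight[di + 1][dj + 1] * x[ni][nj]
                        cnt += 1
            # Normalise by the number of observations, not the kernel size.
            out[i][j] = num / (cnt + eps) + bias
            # Mask propagation: valid if any input in the window was valid.
            new_mask[i][j] = 1 if cnt > 0 else 0
    return out, new_mask

ones = [[1.0] * 3 for _ in range(3)]
x = [[0.0, 0.0, 0.0], [0.0, 4.0, 0.0], [0.0, 0.0, 0.0]]
m = [[0, 0, 0], [0, 1, 0], [0, 0, 0]]
y, m_out = sparse_conv3x3(x, m, ones)
```

With an all-ones kernel, each output is the mean of the observed values in its window, so the single 4.0 measurement spreads to its neighbours at full magnitude rather than being diluted by the surrounding zeros, as it would be in an ordinary convolution.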
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents EfficientPENet, a two-branch depth completion network for sparse LiDAR and RGB inputs. It modernizes the encoder with a ConvNeXt backbone, adds sparsity-invariant convolutions to the depth branch, performs late fusion, applies CSPN refinement, and introduces position-aware test-time augmentation. On the KITTI depth completion benchmark the method reports an RMSE of 631.94 mm, 36.24 M parameters, 20.51 ms latency and 48.76 FPS, claiming a 3.7× parameter reduction and 23× speedup relative to BP-Net while remaining competitive in accuracy.
Significance. If the efficiency numbers are obtained under identical hardware, resolution and optimization conditions as the cited baseline, the work would be significant for practical real-time 3D perception on embedded platforms such as the NVIDIA Jetson. The architectural choices (ConvNeXt + sparsity handling + CSPN) target a known deployment bottleneck; reproducible verification of the claimed speed/accuracy trade-off would strengthen the case for edge robotics applications.
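A like-for-like timing comparison of the kind described here could use a small harness such as the one below, run unchanged for both models on the same device, input resolution, and batch size. This is a generic sketch, not the paper's measurement protocol; `measure_latency_ms` is a hypothetical helper and the workload passed to it is a placeholder for a real forward pass.

```python
import statistics
import time

def measure_latency_ms(fn, warmup=10, runs=50):
    """Time a zero-argument callable and report median latency and FPS."""
    for _ in range(warmup):
        fn()  # warm-up iterations are excluded from the statistics
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        # Note: for GPU inference, fn must block until the device has
        # finished (framework-specific synchronisation), or timings are too low.
        samples.append((time.perf_counter() - start) * 1000.0)
    median = statistics.median(samples)
    return {
        "runs": len(samples),
        "median_ms": median,
        "fps": 1000.0 / median if median > 0 else float("inf"),
    }

# Placeholder workload standing in for a single-image forward pass.
result = measure_latency_ms(lambda: sum(range(1000)), warmup=2, runs=5)
```

Reporting the median rather than the mean reduces the influence of occasional scheduling spikes, which matters on shared or thermally throttled embedded boards.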
major comments (2)
- [Abstract] The central efficiency claim states a 3.7× parameter reduction and 23× speedup versus BP-Net, yet supplies no indication that BP-Net was re-implemented, compiled, and timed on the identical NVIDIA Jetson platform, input resolution, batch size and optimization settings used for EfficientPENet. Because these ratios are load-bearing for the “real-time on edge” positioning, the comparison must be substantiated with explicit re-measurement details.
- [Abstract / Experiments section] The manuscript reports concrete benchmark numbers (RMSE 631.94 mm, 36.24 M parameters, 48.76 FPS) but provides no training protocol, hyper-parameter schedule, validation split, error bars, or ablation studies. Without these elements it is impossible to determine whether the reported accuracy arises from the proposed combination of ConvNeXt, sparsity-invariant convolutions, late fusion, CSPN and position-aware TTA or from post-hoc benchmark tuning.
minor comments (1)
- [Abstract] The abstract would benefit from a brief statement of the exact KITTI depth-completion split and evaluation protocol used.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback, which highlights important issues of reproducibility and fair comparison. We address each major comment below and will revise the manuscript to incorporate the requested clarifications and details.
Point-by-point responses
- Referee: [Abstract] The central efficiency claim states a 3.7× parameter reduction and 23× speedup versus BP-Net, yet supplies no indication that BP-Net was re-implemented, compiled, and timed on the identical NVIDIA Jetson platform, input resolution, batch size and optimization settings used for EfficientPENet. Because these ratios are load-bearing for the “real-time on edge” positioning, the comparison must be substantiated with explicit re-measurement details.
  Authors: We agree that the efficiency claims require explicit substantiation. The parameter reduction (3.7×) is computed from the parameter count reported in the original BP-Net publication. The speedup (23×) is based on our measured latency for EfficientPENet (20.51 ms / 48.76 FPS on NVIDIA Jetson) compared against the latency figures published for BP-Net. In the revision we will add a dedicated implementation and evaluation subsection that specifies the exact hardware (Jetson model), software environment, input resolution, batch size (1 for inference), and measurement methodology. We will also state whether BP-Net was re-timed under identical conditions or clarify that the comparison uses published numbers, thereby removing any ambiguity.
  Revision: yes
- Referee: [Abstract / Experiments section] The manuscript reports concrete benchmark numbers (RMSE 631.94 mm, 36.24 M parameters, 48.76 FPS) but provides no training protocol, hyper-parameter schedule, validation split, error bars, or ablation studies. Without these elements it is impossible to determine whether the reported accuracy arises from the proposed combination of ConvNeXt, sparsity-invariant convolutions, late fusion, CSPN and position-aware TTA or from post-hoc benchmark tuning.
  Authors: We acknowledge that the current manuscript omits these essential experimental details. The revised version will contain an expanded Experiments section that reports: the full training protocol and hyper-parameter schedule (optimizer, learning-rate schedule, batch size, number of epochs, loss weighting); the precise KITTI validation split employed; error bars obtained from multiple independent training runs where feasible; and a set of ablation studies that isolate the contribution of each proposed element (ConvNeXt backbone, sparsity-invariant convolutions, late fusion, CSPN refinement, and position-aware TTA). These additions will allow readers to verify that the reported accuracy results from the architectural choices rather than post-hoc tuning.
  Revision: yes
Circularity Check
No circularity: empirical results on external KITTI benchmark with no self-referential derivations
The paper describes an architecture (ConvNeXt backbone, sparsity-invariant convolutions, late fusion, CSPN refinement, position-aware TTA) and reports measured performance (RMSE 631.94 mm, 36.24M params, 48.76 FPS) on the public KITTI depth completion benchmark. No equations, fitted parameters, or first-principles derivations are presented that reduce to the inputs by construction. Comparisons to BP-Net are external benchmarks; even if re-implementation details are incomplete, this does not create internal circularity. The derivation chain is self-contained as standard empirical ML reporting against an independent dataset.