Real-time Vision-based Depth Reconstruction with NVidia Jetson
Pith reviewed 2026-05-24 20:50 UTC · model grok-4.3
The pith
Enhanced fully convolutional networks achieve real-time single-image depth reconstruction on NVIDIA Jetson at over 16 frames per second.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
After systematic trials of FCNN architectures and several efficiency-oriented enhancements, the authors isolate a configuration that supplies the best performance-accuracy tradeoff and sustains frame rates exceeding 16 FPS for 320 by 240 input on NVIDIA Jetson platforms; the same networks, when integrated into monocular vSLAM, enable real-time mapping of previously unseen indoor environments on a Jetson TX2.
What carries the argument
Fully convolutional neural networks (FCNNs) modified with efficiency enhancements that map a single RGB image directly to a dense depth map while remaining lightweight enough for embedded GPUs.
If this is right
- Depth maps produced at real-time rates supply the metric scale required for accurate vSLAM map construction without stereo rigs or depth sensors.
- The optimized network can be embedded directly into existing vision pipelines running on Jetson-class hardware.
- Open-source ROS nodes allow immediate substitution of the depth estimator into other robotic systems.
- Indoor mapping demonstrations confirm that the accuracy is sufficient for unknown-environment navigation at the reported frame rate.
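The first point above rests on turning a metric depth map into 3D geometry. A minimal sketch of that step, assuming an ideal pinhole camera, is given here; the function name `backproject` and the intrinsics `fx`, `fy`, `cx`, `cy` are illustrative and are not taken from the paper or its repository.

```python
def backproject(depth, fx, fy, cx, cy):
    """Convert a dense depth map into a list of 3D points in the camera frame.

    depth: list of rows, each a list of depths in metres (row-major).
    fx, fy, cx, cy: hypothetical pinhole intrinsics in pixels.
    Pinhole model: X = (u - cx) * Z / fx, Y = (v - cy) * Z / fy.
    Pixels with non-positive depth are treated as invalid and skipped.
    """
    points = []
    for v, row in enumerate(depth):
        for u, z in enumerate(row):
            if z <= 0:
                continue
            points.append(((u - cx) * z / fx, (v - cy) * z / fy, z))
    return points
```

Because every coordinate is multiplied by the predicted depth, any scale error in the network output carries over directly into the reconstructed map.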
Where Pith is reading between the lines
- If the same accuracy-speed profile holds on newer Jetson generations, the approach could extend to outdoor or higher-resolution streams without hardware changes.
- Replacing the current indoor test with a standardized benchmark dataset would allow direct comparison against other single-image depth methods under identical timing constraints.
- Because the method requires only one camera, it lowers the sensor cost and calibration burden for small robots that must run vSLAM.
Load-bearing premise
That the tested architectural changes actually improve both speed and accuracy on the target hardware, and that indoor scenes are representative enough to support the claimed real-time vSLAM use.
What would settle it
Running the released models on the same Jetson TX2 with 320 by 240 input yields sustained frame rates below 16 FPS or produces depth maps whose error prevents successful loop closure in monocular vSLAM of a comparable indoor space.
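The frame-rate half of this test could be checked with a timing harness along the following lines. This is a sketch under stated assumptions, not the authors' benchmark: `sustained_fps` and `dummy_infer` are hypothetical names, and `dummy_infer` merely stands in for the released network. On a GPU, the inference call must block until the result is ready, otherwise the timer measures only the launch overhead.

```python
import time

TARGET_FPS = 16.0  # threshold quoted in the claim


def sustained_fps(infer, frame, n_warmup=10, n_timed=100, clock=time.perf_counter):
    """Return the mean frames per second of infer(frame) over n_timed calls."""
    for _ in range(n_warmup):  # discard start-up effects such as allocation
        infer(frame)
    start = clock()
    for _ in range(n_timed):
        infer(frame)
    elapsed = clock() - start
    return n_timed / elapsed


def dummy_infer(frame):
    """Hypothetical stand-in for the depth network: returns a constant depth map."""
    return [[1.0] * len(row) for row in frame]


if __name__ == "__main__":
    frame = [[0] * 320 for _ in range(240)]  # placeholder 320 x 240 image
    fps = sustained_fps(dummy_infer, frame)
    print(f"{fps:.1f} FPS, meets target: {fps >= TARGET_FPS}")
```

The warm-up loop is kept separate so that one-off initialization costs do not lower the sustained figure.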
Original abstract
Vision-based depth reconstruction is a challenging problem extensively studied in computer vision but still lacking universal solution. Reconstructing depth from single image is particularly valuable to mobile robotics as it can be embedded to the modern vision-based simultaneous localization and mapping (vSLAM) methods providing them with the metric information needed to construct accurate maps in real scale. Typically, depth reconstruction is done nowadays via fully-convolutional neural networks (FCNNs). In this work we experiment with several FCNN architectures and introduce a few enhancements aimed at increasing both the effectiveness and the efficiency of the inference. We experimentally determine the solution that provides the best performance/accuracy tradeoff and is able to run on NVidia Jetson with the framerates exceeding 16FPS for 320 x 240 input. We also evaluate the suggested models by conducting monocular vSLAM of unknown indoor environment on NVidia Jetson TX2 in real-time. Open-source implementation of the models and the inference node for Robot Operating System (ROS) are available at https://github.com/CnnDepth/tx2_fcnn_node.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper experiments with multiple FCNN architectures for single-image depth reconstruction, introduces enhancements to improve inference effectiveness and efficiency, identifies the model with the best performance/accuracy tradeoff, which exceeds 16 FPS on NVIDIA Jetson for 320x240 inputs, and demonstrates its use for real-time monocular vSLAM on Jetson TX2 in an unknown indoor environment. An open-source ROS implementation is provided.
Significance. If the experimental claims are supported by quantitative metrics, the work could provide practical guidance on deploying depth estimation networks on embedded hardware for robotics, addressing the performance-accuracy tradeoff for real-time vSLAM applications.
major comments (1)
- [Experiments / vSLAM evaluation] vSLAM evaluation (Experiments section): The central claim that the selected model enables effective real-time monocular vSLAM on Jetson TX2 is unsupported because no quantitative metrics (e.g., ATE, RPE, scale consistency, or trajectory error against ground truth) are reported; the evaluation appears limited to runtime (>16 FPS) and qualitative indoor demonstration, which does not establish that the depth output produces usable metric maps.
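For reference, the absolute trajectory error (ATE) named in this comment is usually reported as an RMSE over time-matched camera positions after aligning the estimated trajectory to ground truth. The sketch below is a deliberate simplification: it removes only the mean translational offset, whereas standard evaluations first fit a rigid or similarity transform. The function name `ate_rmse` is illustrative.

```python
import math


def ate_rmse(estimated, ground_truth):
    """RMSE of position error after removing the mean offset between trajectories.

    estimated, ground_truth: equal-length lists of (x, y, z) positions that are
    already matched by timestamp. Alignment here is translation only (an
    assumption made to keep the sketch short).
    """
    if not estimated or len(estimated) != len(ground_truth):
        raise ValueError("trajectories must be non-empty and of equal length")
    n = len(estimated)
    est_mean = [sum(p[i] for p in estimated) / n for i in range(3)]
    gt_mean = [sum(p[i] for p in ground_truth) / n for i in range(3)]
    squared = 0.0
    for e, g in zip(estimated, ground_truth):
        squared += sum(
            ((e[i] - est_mean[i]) - (g[i] - gt_mean[i])) ** 2 for i in range(3)
        )
    return math.sqrt(squared / n)
```

A trajectory that differs from ground truth only by a constant shift scores zero under this measure, while scale drift in the predicted depth would appear as a growing residual.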
minor comments (2)
- [Abstract] Abstract: States that models were evaluated for performance/accuracy tradeoff but reports no numerical results, baselines, or error metrics, making it difficult to assess the strength of the experimental claims.
- The manuscript would benefit from explicit comparison tables showing FPS, accuracy metrics (e.g., RMSE on standard datasets), and model size for all tested architectures and enhancements.
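The accuracy columns of such a table are commonly filled with RMSE, absolute relative error, and the fraction of pixels whose prediction is within a factor of 1.25 of ground truth. A minimal sketch of these three measures follows; the function name `depth_metrics` and the flat-list input format are assumptions made for illustration, and nothing here reproduces numbers from the paper.

```python
import math


def depth_metrics(pred, gt):
    """Compute RMSE, absolute relative error and delta < 1.25 accuracy.

    pred, gt: equal-length flat lists of depths in metres. Pixels where either
    value is non-positive are ignored as invalid.
    """
    if len(pred) != len(gt):
        raise ValueError("pred and gt must have the same length")
    pairs = [(p, g) for p, g in zip(pred, gt) if p > 0 and g > 0]
    if not pairs:
        raise ValueError("no valid pixels to evaluate")
    n = len(pairs)
    rmse = math.sqrt(sum((p - g) ** 2 for p, g in pairs) / n)
    abs_rel = sum(abs(p - g) / g for p, g in pairs) / n
    delta1 = sum(1 for p, g in pairs if max(p / g, g / p) < 1.25) / n
    return {"rmse": rmse, "abs_rel": abs_rel, "delta1": delta1}
```

Reporting these values next to FPS and model size for each architecture would make the tradeoff claim directly checkable.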
Simulated Author's Rebuttal
We thank the referee for the detailed review and constructive feedback on our manuscript. We address the single major comment below.
Point-by-point responses
- Referee: [Experiments / vSLAM evaluation] vSLAM evaluation (Experiments section): The central claim that the selected model enables effective real-time monocular vSLAM on Jetson TX2 is unsupported because no quantitative metrics (e.g., ATE, RPE, scale consistency, or trajectory error against ground truth) are reported; the evaluation appears limited to runtime (>16 FPS) and qualitative indoor demonstration, which does not establish that the depth output produces usable metric maps.
  Authors: We agree that the vSLAM evaluation presented in the Experiments section relies on runtime measurements and a qualitative demonstration in an unknown indoor environment, without reporting quantitative trajectory metrics such as ATE, RPE, or scale consistency against ground truth. This limitation means the manuscript does not quantitatively establish that the depth estimates yield usable metric maps for vSLAM. We will revise the manuscript to clarify the scope of the evaluation and, where possible, incorporate quantitative metrics or adjust the claims accordingly.
  Revision: yes
Circularity Check
No circularity: purely experimental comparison of architectures
The paper performs direct experimental benchmarking of several FCNN variants and minor enhancements for monocular depth estimation on Jetson hardware, reporting runtime and qualitative vSLAM results. No equations, derivations, fitted parameters renamed as predictions, or self-citation chains appear in the load-bearing claims. All reported outcomes rest on open-source code and fresh measurements rather than reducing to prior inputs by construction.