arxiv: 1907.07210 · v1 · pith:DO3JH3NXnew · submitted 2019-07-16 · 💻 cs.CV · cs.RO

Real-time Vision-based Depth Reconstruction with NVidia Jetson

Pith reviewed 2026-05-24 20:50 UTC · model grok-4.3

classification 💻 cs.CV cs.RO
keywords: monocular depth estimation, fully convolutional networks, real-time inference, NVIDIA Jetson, visual SLAM, embedded vision, depth reconstruction, single image depth

The pith

Enhanced fully convolutional networks achieve real-time single-image depth reconstruction on NVIDIA Jetson at over 16 frames per second.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper experiments with multiple fully convolutional neural network architectures for estimating depth from a single image and adds targeted modifications to raise both accuracy and speed. It identifies the variant that delivers the strongest accuracy-speed balance while running above 16 FPS on 320 by 240 inputs when executed on Jetson hardware. The same models are then inserted into a monocular visual SLAM pipeline and shown to produce usable metric maps of unknown indoor spaces on a Jetson TX2 board in real time. This line of work matters to mobile robotics because depth from one camera supplies the scale information that standard visual SLAM needs to build maps in real units without extra sensors.

Core claim

After systematic trials of FCNN architectures and several efficiency-oriented enhancements, the authors isolate a configuration that supplies the best performance-accuracy tradeoff and sustains frame rates exceeding 16 FPS for 320 by 240 input on NVIDIA Jetson platforms; the same networks, when integrated into monocular vSLAM, enable real-time mapping of previously unseen indoor environments on a Jetson TX2.

What carries the argument

Fully convolutional neural networks (FCNNs) modified with efficiency enhancements that map a single RGB image directly to a dense depth map while remaining lightweight enough for embedded GPUs.

If this is right

  • Depth maps produced at real-time rates supply the metric scale required for accurate vSLAM map construction without stereo rigs or depth sensors.
  • The optimized network can be embedded directly into existing vision pipelines running on Jetson-class hardware.
  • Open-source ROS nodes allow immediate substitution of the depth estimator into other robotic systems.
  • Indoor mapping demonstrations confirm that the accuracy is sufficient for unknown-environment navigation at the reported frame rate.
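
How a dense depth map supplies metric scale can be illustrated with the standard pinhole back-projection. This is a minimal sketch, not the paper's code: the function name `backproject` and the intrinsics `FX`, `FY`, `CX`, `CY` are hypothetical values chosen for illustration, and depth is assumed to be in metres.

```python
from typing import List, Tuple

Point = Tuple[float, float, float]


def backproject(depth: List[List[float]], fx: float, fy: float,
                cx: float, cy: float) -> List[Point]:
    """Turn a dense depth map (metres) into 3D points in the camera frame."""
    points: List[Point] = []
    for v, row in enumerate(depth):          # v: pixel row
        for u, z in enumerate(row):          # u: pixel column
            if z <= 0.0:                     # treat non-positive depth as invalid
                continue
            # Pinhole model: X = (u - cx) * Z / fx, Y = (v - cy) * Z / fy
            points.append(((u - cx) * z / fx, (v - cy) * z / fy, z))
    return points


# Hypothetical intrinsics for a 320 x 240 camera (illustrative values only).
FX, FY, CX, CY = 280.0, 280.0, 160.0, 120.0
tiny_depth = [[2.0, 0.0], [2.0, 2.0]]        # 2 x 2 toy depth map, one invalid pixel
cloud = backproject(tiny_depth, FX, FY, CX, CY)
```

Because every point inherits its Z value from the predicted depth, any scale error in the network output carries over directly into the map.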

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the same accuracy-speed profile holds on newer Jetson generations, the approach could extend to outdoor or higher-resolution streams without hardware changes.
  • Replacing the current indoor test with a standardized benchmark dataset would allow direct comparison against other single-image depth methods under identical timing constraints.
  • Because the method requires only one camera, it lowers the sensor cost and calibration burden for small robots that must run vSLAM.

Load-bearing premise

That the tested architectural changes actually improve both speed and accuracy on the target hardware, and that indoor scenes are representative enough for the claimed real-time vSLAM use.

What would settle it

Running the released models on the same Jetson TX2 with 320 by 240 input yields sustained frame rates below 16 FPS or produces depth maps whose error prevents successful loop closure in monocular vSLAM of a comparable indoor space.
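
The frame-rate half of this test can be sketched as a small timing harness. It is an assumption-laden illustration: `infer` stands in for one forward pass of a released model on a 320 by 240 frame (a hypothetical callable, not an API from the paper's repository), and the warm-up and frame counts are arbitrary.

```python
import time
from typing import Callable

TARGET_FPS = 16.0
FRAME_BUDGET_MS = 1000.0 / TARGET_FPS   # 62.5 ms available per frame at 16 FPS


def sustained_fps(infer: Callable[[], None], warmup: int = 5,
                  frames: int = 50) -> float:
    """Average frames per second over `frames` calls, after a warm-up phase."""
    for _ in range(warmup):              # warm-up runs are not timed
        infer()
    start = time.perf_counter()
    for _ in range(frames):
        infer()
    elapsed = max(time.perf_counter() - start, 1e-9)  # guard against zero
    return frames / elapsed


def meets_claim(fps: float, target: float = TARGET_FPS) -> bool:
    """True when the measured rate does not fall below the target."""
    return fps >= target


# Stand-in workload; a real check would call the model's inference here.
measured = sustained_fps(lambda: None)
```

Averaging over many frames after a warm-up matters on embedded GPUs, where the first few inferences typically include one-off initialization cost.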

Figures

Figures reproduced from arXiv: 1907.07210 by Andrey Bokovoy, Kirill Muravyev, Konstantin Yakovlev.

Figure 1: Monocular vSLAM based on FCNN depth reconstruction and …
Figure 2: Visualization of evaluated network architectures.
Figure 3: Faster up-convolution block architecture.
Figure 4: Visualization of introduced FCNN on NYU Dataset v2.
Original abstract

Vision-based depth reconstruction is a challenging problem extensively studied in computer vision but still lacking universal solution. Reconstructing depth from single image is particularly valuable to mobile robotics as it can be embedded to the modern vision-based simultaneous localization and mapping (vSLAM) methods providing them with the metric information needed to construct accurate maps in real scale. Typically, depth reconstruction is done nowadays via fully-convolutional neural networks (FCNNs). In this work we experiment with several FCNN architectures and introduce a few enhancements aimed at increasing both the effectiveness and the efficiency of the inference. We experimentally determine the solution that provides the best performance/accuracy tradeoff and is able to run on NVidia Jetson with the framerates exceeding 16FPS for 320 x 240 input. We also evaluate the suggested models by conducting monocular vSLAM of unknown indoor environment on NVidia Jetson TX2 in real-time. Open-source implementation of the models and the inference node for Robot Operating System (ROS) are available at https://github.com/CnnDepth/tx2_fcnn_node.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper experiments with multiple FCNN architectures for single-image depth reconstruction, introduces enhancements to improve inference effectiveness and efficiency, identifies the model with the best performance/accuracy tradeoff, which exceeds 16 FPS on NVIDIA Jetson for 320x240 inputs, and demonstrates its use for real-time monocular vSLAM on Jetson TX2 in an unknown indoor environment. An open-source ROS implementation is provided.

Significance. If the experimental claims are supported by quantitative metrics, the work could provide practical guidance on deploying depth estimation networks on embedded hardware for robotics, addressing the performance-accuracy tradeoff for real-time vSLAM applications.

major comments (1)
  1. [Experiments / vSLAM evaluation] vSLAM evaluation (Experiments section): The central claim that the selected model enables effective real-time monocular vSLAM on Jetson TX2 is unsupported because no quantitative metrics (e.g., ATE, RPE, scale consistency, or trajectory error against ground truth) are reported; the evaluation appears limited to runtime (>16 FPS) and qualitative indoor demonstration, which does not establish that the depth output produces usable metric maps.
minor comments (2)
  1. [Abstract] Abstract: States that models were evaluated for performance/accuracy tradeoff but reports no numerical results, baselines, or error metrics, making it difficult to assess the strength of the experimental claims.
  2. The manuscript would benefit from explicit comparison tables showing FPS, accuracy metrics (e.g., RMSE on standard datasets), and model size for all tested architectures and enhancements.
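
The accuracy columns of such a comparison table are usually filled with standard single-image depth metrics. The sketch below computes three common ones (RMSE, absolute relative error, and the fraction of pixels with ratio threshold δ < 1.25) on flattened depth lists. The function name `depth_metrics` is hypothetical, and the paper may report a different set of metrics.

```python
import math
from typing import Dict, Sequence


def depth_metrics(pred: Sequence[float], gt: Sequence[float]) -> Dict[str, float]:
    """RMSE, absolute relative error and delta < 1.25 accuracy over valid pixels."""
    # Keep only pixels where both prediction and ground truth are positive.
    pairs = [(p, g) for p, g in zip(pred, gt) if p > 0.0 and g > 0.0]
    if not pairs:
        raise ValueError("no valid pixels to evaluate")
    n = len(pairs)
    rmse = math.sqrt(sum((p - g) ** 2 for p, g in pairs) / n)
    abs_rel = sum(abs(p - g) / g for p, g in pairs) / n
    delta1 = sum(1 for p, g in pairs if max(p / g, g / p) < 1.25) / n
    return {"rmse": rmse, "abs_rel": abs_rel, "delta1": delta1}


# Toy example: two exact pixels and one pixel predicted at twice the true depth.
metrics = depth_metrics([1.0, 2.0, 4.0], [1.0, 2.0, 2.0])
```

Reporting these alongside FPS and model size for every tested architecture would make the claimed tradeoff directly checkable.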

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the detailed review and constructive feedback on our manuscript. We address the single major comment below.

point-by-point responses
  1. Referee: [Experiments / vSLAM evaluation] vSLAM evaluation (Experiments section): The central claim that the selected model enables effective real-time monocular vSLAM on Jetson TX2 is unsupported because no quantitative metrics (e.g., ATE, RPE, scale consistency, or trajectory error against ground truth) are reported; the evaluation appears limited to runtime (>16 FPS) and qualitative indoor demonstration, which does not establish that the depth output produces usable metric maps.

    Authors: We agree that the vSLAM evaluation presented in the Experiments section relies on runtime measurements and a qualitative demonstration in an unknown indoor environment, without reporting quantitative trajectory metrics such as ATE, RPE, or scale consistency against ground truth. This limitation means the manuscript does not quantitatively establish that the depth estimates yield usable metric maps for vSLAM. We will revise the manuscript to clarify the scope of the evaluation and, where possible, incorporate quantitative metrics or adjust the claims accordingly.

    revision: yes
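
The trajectory metrics named in this exchange can be sketched as follows. This is a deliberately simplified illustration, not a standard evaluation tool: the ATE here aligns the two trajectories by centroid only (a full evaluation would also solve for rotation, for example with a rigid or similarity alignment), and scale consistency is approximated by the ratio of path lengths. All function names are hypothetical, and poses are assumed to be time-synchronised 3D positions.

```python
import math
from typing import Sequence, Tuple

Vec = Tuple[float, float, float]


def _centroid(traj: Sequence[Vec]) -> Vec:
    n = len(traj)
    return (sum(p[0] for p in traj) / n,
            sum(p[1] for p in traj) / n,
            sum(p[2] for p in traj) / n)


def ate_rmse_translation_only(est: Sequence[Vec], gt: Sequence[Vec]) -> float:
    """RMSE of position error after centroid alignment (no rotation solved)."""
    if len(est) != len(gt) or not est:
        raise ValueError("trajectories must be non-empty and equally long")
    ce, cg = _centroid(est), _centroid(gt)
    sq = 0.0
    for e, g in zip(est, gt):
        sq += sum(((e[i] - ce[i]) - (g[i] - cg[i])) ** 2 for i in range(3))
    return math.sqrt(sq / len(est))


def path_length(traj: Sequence[Vec]) -> float:
    """Sum of distances between consecutive positions."""
    return sum(math.dist(a, b) for a, b in zip(traj, traj[1:]))


def scale_ratio(est: Sequence[Vec], gt: Sequence[Vec]) -> float:
    """Ground-truth path length divided by estimated path length (1.0 is ideal)."""
    est_len = path_length(est)
    if est_len == 0.0:
        raise ValueError("estimated trajectory has zero length")
    return path_length(gt) / est_len


# Toy trajectories: the estimate covers half the true distance.
gt_traj = [(1.0, 1.0, 0.0), (2.0, 1.0, 0.0), (3.0, 1.0, 0.0)]
est_traj = [(0.0, 0.0, 0.0), (0.5, 0.0, 0.0), (1.0, 0.0, 0.0)]
ate = ate_rmse_translation_only(est_traj, gt_traj)
scale = scale_ratio(est_traj, gt_traj)
```

A scale ratio far from 1.0 would indicate that the predicted depth does not deliver the metric scale the vSLAM claim depends on, even when the map looks plausible qualitatively.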

Circularity Check

0 steps flagged

No circularity: purely experimental comparison of architectures

The paper performs direct experimental benchmarking of several FCNN variants and minor enhancements for monocular depth estimation on Jetson hardware, reporting runtime and qualitative vSLAM results. No equations, derivations, fitted parameters renamed as predictions, or self-citation chains appear in the load-bearing claims. All reported outcomes rest on open-source code and fresh measurements rather than reducing to prior inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is an empirical engineering paper with no mathematical axioms, free parameters, or new entities postulated.

pith-pipeline@v0.9.0 · 5721 in / 966 out tokens · 18273 ms · 2026-05-24T20:50:44.339854+00:00 · methodology
