arXiv: 1907.07745 · v1 · pith:MRYWD3MTnew · submitted 2019-07-17 · 💻 cs.CV · eess.IV · eess.SP

Real-Time Highly Accurate Dense Depth on a Power Budget using an FPGA-CPU Hybrid SoC

Pith reviewed 2026-05-24 20:15 UTC · model grok-4.3

classification 💻 cs.CV · eess.IV · eess.SP
keywords stereo depth estimation · FPGA · real-time vision · embedded systems · KITTI dataset · SGM · ELAS · power efficiency

The pith

A hybrid FPGA-CPU chip computes dense stereo depth at over 50 frames per second with 8.7 percent error while drawing only 5 watts.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows how to obtain accurate dense depth from stereo images on embedded hardware that must stay under a strict power limit. It does so by splitting the work of two established stereo algorithms between the FPGA and CPU parts of a single chip so that each runs the parts it handles best. A reader would care because this combination reaches accuracy levels previously seen only on high-power systems while meeting real-time and power constraints typical of mobile robots or drones.

Core claim

The central claim is that a novel stereo method combining the best features of SGM and ELAS on an FPGA-CPU hybrid SoC produces highly accurate dense depth in real time, reaching an 8.7 percent error rate on the KITTI 2015 benchmark at over 50 FPS with a total power draw of only 5 W.
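The headline figures imply a per-frame time and energy budget, which the following minimal sketch works out. The constant and function names are illustrative assumptions, not taken from the paper; because the claim is "over 50 FPS", both results are upper bounds.

```python
# Budget arithmetic derived from the headline figures (over 50 FPS, 5 W).
# Names are hypothetical; only the two numbers come from the claim.
CLAIMED_FPS = 50.0      # lower bound on the frame rate
CLAIMED_POWER_W = 5.0   # reported total power draw in watts


def frame_budget_ms(fps: float) -> float:
    """Time available to process one frame, in milliseconds."""
    return 1000.0 / fps


def energy_per_frame_j(power_w: float, fps: float) -> float:
    """Energy spent per frame, in joules (watts divided by frames per second)."""
    return power_w / fps


print(frame_budget_ms(CLAIMED_FPS))                      # 20.0
print(energy_per_frame_j(CLAIMED_POWER_W, CLAIMED_FPS))  # 0.1
```

At the claimed operating point, the whole pipeline has at most 20 ms per stereo pair and spends at most 0.1 J per frame.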

What carries the argument

The SGM and ELAS processing steps are partitioned across the FPGA fabric and CPU cores of the hybrid SoC, so that steps with unpredictable memory access or a highly iterative structure run on the side that handles them efficiently.

If this is right

  • Real-time dense depth becomes available on platforms limited to a few watts of power.
  • Stereo pipelines no longer need to sacrifice accuracy to fit within FPGA resource or timing limits.
  • Embedded vision systems can now use depth maps whose quality approaches that of full desktop implementations.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same split of regular and irregular computation steps could be tried on other hybrid chips for different vision tasks.
  • Power budgets that once ruled out dense depth sensing may now support it, changing the design space for battery-powered robots.
  • Multiple such depth pipelines might run concurrently on one low-power SoC if the reported margins hold.

Load-bearing premise

The SGM and ELAS components can be split between FPGA and CPU without large drops in accuracy or violations of real-time timing on the chosen chip.

What would settle it

Running the described implementation on the target SoC and measuring an error rate above 8.7 percent on KITTI 2015, a frame rate below 50 FPS, or a power draw above 5 W would falsify the performance result.
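One reading of this test is that the result fails if any one of the three measurements misses its bound. The sketch below encodes that reading; the function name and the choice to treat the thresholds as inclusive are assumptions, not part of the paper.

```python
def violated_bounds(error_pct, fps, power_w,
                    max_error_pct=8.7, min_fps=50.0, max_power_w=5.0):
    """Return the names of the claimed bounds that a measurement violates.

    Thresholds default to the headline figures; an empty list means the
    measurement is consistent with the reported result.
    """
    violations = []
    if error_pct > max_error_pct:
        violations.append("error")
    if fps < min_fps:
        violations.append("fps")
    if power_w > max_power_w:
        violations.append("power")
    return violations


# Hypothetical measurements, for illustration only.
print(violated_bounds(8.7, 55.0, 5.0))  # []
print(violated_bounds(9.5, 45.0, 5.0))  # ['error', 'fps']
```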

Figures

Figures reproduced from arXiv: 1907.07745 by Alessio Tonioni, Luigi Di Stefano, Oscar Rahnama, Philip H. S. Torr, Simon Walker, Stuart Golodetz, Thomas Joy, Tommaso Cavallari.

Figure 1: Overview of our approach. First, we use Fast R3SGM (see §I-A1) to compute disparity images for the input stereo pair (in raster and reverse-raster order). We then flip the right result and perform a left-right consistency check to obtain an accurate but sparse disparity map for the left input image (see §I-A2). Next, as ELAS [11] does, we perform support checking (see §I-B1) to remove points whose disparit…
Figure 2: Comparing the implicit biases that exist in raster and …
Figure 3: An example showing the effects of performing L/R …
Figure 4: The results of performing a consolidating consistency …
Figure 5: The results of performing a redundancy check on the …
Figure 6: The plane priors produced by constructing a Delaunay …
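The left-right consistency check named in the Figure 1 caption can be illustrated with a generic sketch. This is the textbook form of the check for rectified stereo pairs with integer disparities, not the paper's implementation (which also involves flipping the right result); the `INVALID` marker, the function name, and the `max_diff` tolerance are assumptions made for illustration.

```python
INVALID = -1  # marker for rejected pixels (an assumption of this sketch)


def left_right_check(disp_left, disp_right, max_diff=1):
    """Keep a left disparity only if the right map agrees at the matched pixel.

    disp_left and disp_right are row-major lists of integer disparities.
    A left pixel at column x with disparity d is matched to column x - d
    in the right image; it survives only if the two disparities differ by
    at most max_diff.
    """
    height = len(disp_left)
    width = len(disp_left[0]) if height else 0
    result = [[INVALID] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            d = disp_left[y][x]
            if d == INVALID:
                continue
            xr = x - d  # column of the matching pixel in the right image
            if 0 <= xr < width:
                dr = disp_right[y][xr]
                if dr != INVALID and abs(d - dr) <= max_diff:
                    result[y][x] = d
    return result


left = [[0, 1, 1, 3]]
right = [[0, 1, 5, 0]]
print(left_right_check(left, right))  # [[0, 1, 1, -1]]
```

Pixels that fail the check are dropped rather than corrected, which is why the output is accurate but sparse.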
Original abstract

Obtaining highly accurate depth from stereo images in real time has many applications across computer vision and robotics, but in some contexts, upper bounds on power consumption constrain the feasible hardware to embedded platforms such as FPGAs. Whilst various stereo algorithms have been deployed on these platforms, usually cut down to better match the embedded architecture, certain key parts of the more advanced algorithms, e.g. those that rely on unpredictable access to memory or are highly iterative in nature, are difficult to deploy efficiently on FPGAs, and thus the depth quality that can be achieved is limited. In this paper, we leverage a FPGA-CPU chip to propose a novel, sophisticated, stereo approach that combines the best features of SGM and ELAS-based methods to compute highly accurate dense depth in real time. Our approach achieves an 8.7% error rate on the challenging KITTI 2015 dataset at over 50 FPS, with a power consumption of only 5W.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 0 minor

Summary. The paper proposes a hybrid FPGA-CPU stereo depth system that combines SGM and ELAS by mapping unpredictable-memory and iterative ELAS components to the CPU while running the remainder on the FPGA, claiming an 8.7% error rate on KITTI 2015 at >50 FPS with 5 W power draw.

Significance. If the hybrid partitioning demonstrably preserves the accuracy of the combined SGM+ELAS pipeline without introducing latency or synchronization artifacts that violate the real-time bound on the target SoC, the result would be a meaningful contribution to embedded vision, showing how advanced stereo algorithms can be deployed at low power without the usual accuracy trade-offs.

major comments (1)
  1. [Abstract] Abstract (and any corresponding method section): the headline 8.7% KITTI 2015 error is presented as evidence that the FPGA-CPU split succeeds, yet no ablation, timing profile, or disparity-map comparison is supplied to show that off-loading the iterative ELAS stages to the CPU preserves accuracy or meets the >50 FPS bound on the specific SoC; without this quantitative support the central claim cannot be evaluated.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback. We address the major comment below.

Point-by-point responses
  1. Referee: [Abstract] Abstract (and any corresponding method section): the headline 8.7% KITTI 2015 error is presented as evidence that the FPGA-CPU split succeeds, yet no ablation, timing profile, or disparity-map comparison is supplied to show that off-loading the iterative ELAS stages to the CPU preserves accuracy or meets the >50 FPS bound on the specific SoC; without this quantitative support the central claim cannot be evaluated.

    Authors: We agree that the abstract and method section would benefit from explicit quantitative support for the hybrid partitioning. The results section of the manuscript reports the end-to-end accuracy and frame rate achieved on the target SoC, but does not include a dedicated ablation isolating the effect of CPU off-loading. In the revised version we will add an ablation comparing the hybrid system to a pure-FPGA baseline, detailed per-stage timing profiles, and side-by-side disparity-map visualizations to confirm that accuracy and the >50 FPS bound are preserved. revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical hardware result with no derivation chain

full rationale

The paper reports an empirical implementation result (8.7% KITTI 2015 error at >50 FPS on a 5 W FPGA-CPU SoC) obtained by partitioning SGM and ELAS components. No equations, first-principles derivations, fitted parameters renamed as predictions, or load-bearing self-citations appear in the abstract or described approach. The central claim rests on measured hardware performance against an external benchmark (KITTI 2015), so it can be tested independently of the paper's own construction. This matches the default case of a self-contained empirical paper.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Review performed on abstract only; no free parameters, axioms, or invented entities are identifiable from the provided text.

pith-pipeline@v0.9.0 · 5731 in / 1037 out tokens · 18497 ms · 2026-05-24T20:15:38.250712+00:00 · methodology


Reference graph

Works this paper leans on

31 extracted references · 31 canonical work pages · 2 internal anchors

  1. [1]

    ElasticFusion: Real-Time Dense SLAM and Light Source Estimation

    T. Whelan, R. F. Salas-Moreno, B. Glocker, A. J. Davison, and S. Leutenegger, “ElasticFusion: Real-Time Dense SLAM and Light Source Estimation,” IJRR, vol. 35, no. 14, pp. 1697–1716, 2016

  2. [2]

    InfiniTAM v3: A Framework for Large-Scale 3D Reconstruction with Loop Closure

    V. A. Prisacariu, O. Kähler, S. Golodetz, M. Sapienza, T. Cavallari, P. H. S. Torr, and D. W. Murray, “InfiniTAM v3: A Framework for Large-Scale 3D Reconstruction with Loop Closure,” arXiv preprint arXiv:1708.00783v1, 2017

  3. [3]

    Collaborative Large-Scale Dense 3D Reconstruction with Online Inter-Agent Pose Optimisation

    S. Golodetz*, T. Cavallari*, N. A. Lord*, V. A. Prisacariu, D. W. Murray, and P. H. S. Torr, “Collaborative Large-Scale Dense 3D Reconstruction with Online Inter-Agent Pose Optimisation,” TVCG, vol. 24, no. 11, 2018

  4. [4]

    Scene Coordinate Regression Forests for Camera Relocalization in RGB-D Images

    J. Shotton, B. Glocker, C. Zach, S. Izadi, A. Criminisi, and A. Fitzgibbon, “Scene Coordinate Regression Forests for Camera Relocalization in RGB-D Images,” in CVPR, 2013, pp. 2930–2937

  5. [5]

    On-the-Fly Adaptation of Regression Forests for Online Camera Relocalisation

    T. Cavallari, S. Golodetz*, N. A. Lord*, J. Valentin, L. D. Stefano, and P. H. S. Torr, “On-the-Fly Adaptation of Regression Forests for Online Camera Relocalisation,” in CVPR, 2017, pp. 4457–4466

  6. [6]

    Real-Time RGB-D Camera Pose Estimation in Novel Scenes using a Relocalisation Cascade

    T. Cavallari*, S. Golodetz*, N. A. Lord*, J. Valentin*, V. A. Prisacariu, L. D. Stefano, and P. H. S. Torr, “Real-Time RGB-D Camera Pose Estimation in Novel Scenes using a Relocalisation Cascade,” arXiv preprint arXiv:1810.12163, 2018

  7. [7]

    A Depth-Based Head-Mounted Visual Display to Aid Navigation in Partially Sighted Individuals

    S. L. Hicks, I. Wilson, L. Muhammed, J. Worsfold, S. M. Downes, and C. Kennard, “A Depth-Based Head-Mounted Visual Display to Aid Navigation in Partially Sighted Individuals,” PLoS ONE, 2013

  8. [8]

    A Taxonomy and Evaluation of Dense Two-Frame Stereo Correspondence Algorithms

    D. Scharstein and R. Szeliski, “A Taxonomy and Evaluation of Dense Two-Frame Stereo Correspondence Algorithms,” International Journal of Computer Vision, vol. 47, no. 1-3, pp. 7–42, 2002

  9. [9]

    Review of Stereo Vision Algorithms and their Suitability for Resource-Limited Systems

    B. Tippetts, D. J. Lee, K. Lillywhite, and J. Archibald, “Review of Stereo Vision Algorithms and their Suitability for Resource-Limited Systems,” Journal of Real-Time Image Processing, vol. 11, no. 1, pp. 5–25, 2016

  10. [10]

    Stereo Processing by Semiglobal Matching and Mutual Information

    H. Hirschmuller, “Stereo Processing by Semiglobal Matching and Mutual Information,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 30, no. 2, pp. 328–341, 2008

  11. [11]

    Efficient Large-Scale Stereo Matching

    A. Geiger, M. Roser, and R. Urtasun, “Efficient Large-Scale Stereo Matching,” in Computer Vision–ACCV 2010. Springer, 2010, pp. 25–38

  12. [12]

    R3SGM: Real-time Raster-Respecting Semi-Global Matching for Power-Constrained Systems

    O. Rahnama, T. Cavallari*, S. Golodetz*, S. Walker, and P. H. S. Torr, “R3SGM: Real-time Raster-Respecting Semi-Global Matching for Power-Constrained Systems,” in FPT, 2018

  13. [13]

    Real-Time Semi-Global Matching on the CPU

    S. K. Gehrig and C. Rabe, “Real-Time Semi-Global Matching on the CPU,” in Computer Vision and Pattern Recognition Workshops (CVPRW), 2010 IEEE Computer Society Conference on. IEEE, 2010, pp. 85–92

  14. [14]

    Real-Time Semi-Global Matching Disparity Estimation on the GPU

    C. Banz, H. Blume, and P. Pirsch, “Real-Time Semi-Global Matching Disparity Estimation on the GPU,” in Computer Vision Workshops (ICCV Workshops), 2011 IEEE International Conference on. IEEE, 2011, pp. 514–521

  15. [15]

    Embedded real-time stereo estimation via Semi-Global Matching on the GPU

    D. Hernandez-Juarez, A. Chacón, A. Espinosa, D. Vázquez, J. C. Moure, and A. M. López, “Embedded real-time stereo estimation via Semi-Global Matching on the GPU,” Procedia Computer Science, vol. 80, 2016

  16. [16]

    Design of Real-Time FPGA-based Embedded System for Stereo Vision

    S. Perri, F. Frustaci, F. Spagnolo, and P. Corsonello, “Design of Real-Time FPGA-based Embedded System for Stereo Vision,” in Circuits and Systems (ISCAS), 2018 IEEE International Symposium on. IEEE, 2018, pp. 1–5

  17. [17]

    Real-time depth processing for embedded platforms

    O. Rahnama, A. Makarov, and P. Torr, “Real-time depth processing for embedded platforms,” in Real-Time Image and Video Processing 2017, vol. 10223. International Society for Optics and Photonics, 2017, p. 102230N

  18. [18]

    Real-time high-definition stereo matching on FPGA

    L. Zhang, K. Zhang, T. S. Chang, G. Lafruit, G. K. Kuzmanov, and D. Verkest, “Real-time high-definition stereo matching on FPGA,” in Proceedings of the 19th ACM/SIGDA International Symposium on Field Programmable Gate Arrays. ACM, 2011, pp. 55–64

  19. [19]

    Stereo vision architecture for heterogeneous systems-on-chip

    S. Perri, F. Frustaci, F. Spagnolo, and P. Corsonello, “Stereo vision architecture for heterogeneous systems-on-chip,” Journal of Real-Time Image Processing, pp. 1–23, 2018

  20. [20]

    High-quality real-time hardware stereo matching based on guided image filtering

    C. Ttofis and T. Theocharides, “High-quality real-time hardware stereo matching based on guided image filtering,” in Proceedings of the Conference on Design, Automation & Test in Europe. European Design and Automation Association, 2014, p. 356

  21. [21]

    FPGA based real-time on-road stereo vision system

    M. Dehnavi and M. Eshghi, “FPGA based real-time on-road stereo vision system,” Journal of Systems Architecture, vol. 81, pp. 32–43, 2017

  22. [22]

    A real-time global stereo-matching on FPGA

    D. Zha, X. Jin, and T. Xiang, “A real-time global stereo-matching on FPGA,” Microprocessors and Microsystems, vol. 47, pp. 419–428, 2016

  23. [23]

    A Real-Time Low-Power Stereo Vision Engine Using Semi-Global Matching

    S. K. Gehrig, F. Eberli, and T. Meyer, “A Real-Time Low-Power Stereo Vision Engine Using Semi-Global Matching,” in International Conference on Computer Vision Systems. Springer, 2009, pp. 134–143

  24. [24]

    Real-Time Stereo Vision System using Semi-Global Matching Disparity Estimation: Architecture and FPGA-Implementation

    C. Banz, S. Hesselbarth, H. Flatt, H. Blume, and P. Pirsch, “Real-Time Stereo Vision System using Semi-Global Matching Disparity Estimation: Architecture and FPGA-Implementation,” in Embedded Computer Systems (SAMOS), 2010 International Conference on. IEEE, 2010, pp. 93–101

  25. [25]

    A passive RGBD sensor for accurate and real-time depth sensing self-contained into an FPGA

    S. Mattoccia and M. Poggi, “A passive RGBD sensor for accurate and real-time depth sensing self-contained into an FPGA,” in Proceedings of the 9th International Conference on Distributed Smart Cameras. ACM, 2015, pp. 146–151

  26. [26]

    Real-time and Low Latency Embedded Computer Vision Hardware Based on a Combination of FPGA and Mobile CPU

    D. Honegger, H. Oleynikova, and M. Pollefeys, “Real-time and Low Latency Embedded Computer Vision Hardware Based on a Combination of FPGA and Mobile CPU,” in Intelligent Robots and Systems (IROS 2014), 2014 IEEE/RSJ International Conference on. IEEE, 2014, pp. 4930–4935

  27. [27]

    Real-Time High-Quality Stereo Vision System in FPGA

    W. Wang, J. Yan, N. Xu, Y. Wang, and F.-H. Hsu, “Real-Time High-Quality Stereo Vision System in FPGA,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 25, no. 10, pp. 1696–1708, 2015

  28. [28]

    Real-Time Dense Stereo Matching With ELAS on FPGA-Accelerated Embedded Devices

    O. Rahnama, D. Frost, O. Miksik, and P. H. Torr, “Real-Time Dense Stereo Matching With ELAS on FPGA-Accelerated Embedded Devices,” IEEE Robotics and Automation Letters, vol. 3, no. 3, pp. 2008–2015, 2018

  29. [29]

    Joint 3D Estimation of Vehicles and Scene Flow

    M. Menze, C. Heipke, and A. Geiger, “Joint 3D Estimation of Vehicles and Scene Flow,” in ISPRS Workshop on Image Sequence Analysis (ISA), 2015

  30. [30]

    Object Scene Flow

    M. Menze, C. Heipke, and A. Geiger, “Object Scene Flow,” ISPRS Journal of Photogrammetry and Remote Sensing (JPRS), 2018

  31. [31]

    End-to-end Learning of Cost-Volume Aggregation for Real-time Dense Stereo

    A. Kuzmin, D. Mikushin, and V. Lempitsky, “End-to-end Learning of Cost-Volume Aggregation for Real-time Dense Stereo,” in MLSP, 2017