Performance Comparison Between OpenCV Built in CPU and GPU Functions on Image Processing Operations

Batuhan Hangün; Önder Eyecioğlu

arxiv: 1906.08819 · v1 · pith:I6LE46L2new · submitted 2019-06-20 · 💻 cs.DC

Pith reviewed 2026-05-25 19:02 UTC · model grok-4.3

classification 💻 cs.DC

keywords image processing · OpenCV · GPU · CPU · CUDA · performance comparison · parallel processing

The pith

OpenCV's GPU functions using CUDA deliver greater speed than CPU functions for image processing operations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper compares the performance of common image processing operations, such as convolution, Fourier transforms, and matrix operations, using OpenCV's built-in CPU and GPU functions. It reports that the GPU functions run faster than the CPU functions and attributes the gap to parallel rather than serial processing. The comparison matters because larger images and more complex operations can exceed what a CPU handles comfortably. The GPU functions are built on NVIDIA's CUDA platform. A sympathetic reader would care about when GPU acceleration becomes necessary for practical tasks.

Core claim

GPUs provide greater speed compared to CPUs for image processing operations because of their parallel processing nature, as measured using OpenCV's built-in CPU and GPU functions that use CUDA.

What carries the argument

OpenCV's built-in CPU and GPU functions using CUDA for operations including matrix inversion, transpose, derivative, convolution, and Fourier Transform.

If this is right

For the tested operations, developers can expect faster execution times by switching to the GPU functions.
As image sizes increase, the performance gap favors GPU parallel computation over CPU serial processing.
CUDA implementation in OpenCV enables practical acceleration for specialized digital signal processing tasks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The results could guide selection of hardware for real-time image pipelines beyond the specific operations tested.
Similar comparisons in other libraries might reveal consistent patterns in CUDA-based acceleration.
Extending tests to varied GPU architectures would show whether the speed advantage holds across devices.

Load-bearing premise

The built-in OpenCV GPU functions using CUDA are correctly implemented, representative of general performance, and the tested operations and image sizes reflect real-world usage without hardware-specific biases.

What would settle it

Measuring the same OpenCV operations on hardware where the GPU version shows no consistent speedup over the CPU version would falsify the central claim.
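One way to make "no consistent speedup" concrete is to compute the CPU-to-GPU time ratio for each image size and check whether it stays above 1. The sketch below is only an illustration: the function names are hypothetical and the timing values are placeholders, not measurements from the paper.

```python
def speedups(timings):
    """Map each image size to t_cpu / t_gpu (values above 1 favour the GPU)."""
    return {size: t_cpu / t_gpu for size, (t_cpu, t_gpu) in timings.items()}


def gpu_consistently_faster(timings, min_speedup=1.0):
    """True only if the GPU beats the CPU at every measured size."""
    return all(s > min_speedup for s in speedups(timings).values())


# Placeholder numbers, NOT taken from the paper: pixels -> (cpu_ms, gpu_ms)
example = {
    112 * 112: (0.25, 0.5),
    1125 * 1125: (4.0, 2.0),
    2138 * 2138: (16.0, 4.0),
}
ratios = speedups(example)
consistent = gpu_consistently_faster(example)
```

With these placeholder values the smallest size has a ratio below 1, so `consistent` is false even though the larger sizes favour the GPU.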

Figures

Figures reproduced from arXiv: 1906.08819 by Batuhan Hangün, Önder Eyecioğlu.

**Figure 4.** Original image resized to 25% of its original size. [PITH_FULL_IMAGE:figures/full_fig_p003_4.png]

2.2.2 Thresholding. In some applications it may be desirable to binarize digital images. The operation that creates a binary image is called thresholding, and the value $k$, which determines which intensity values become 1 and which become 0, is called the threshold: $s = T(r) = \begin{cases} 0, & r < k \\ 1, & r \ge k \end{cases}$
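The thresholding rule quoted with Figure 4 can be sketched in a few lines. This is a minimal pure-Python illustration, not the paper's C++ code; the function name `threshold_image` and the list-of-rows image layout are assumptions made for the example.

```python
def threshold_image(image, k):
    """Apply s = T(r): 0 where r < k, 1 where r >= k.

    `image` is a list of rows of integer intensities (hypothetical layout).
    """
    return [[1 if r >= k else 0 for r in row] for row in image]


# Illustrative 2x3 grayscale patch with made-up intensity values.
patch = [[12, 128, 200],
         [127, 64, 255]]
binary = threshold_image(patch, k=128)
```

Note that a pixel exactly equal to the threshold maps to 1, matching the `r >= k` branch of the rule.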

**Figure 9.** Edge detection in an image. [PITH_FULL_IMAGE:figures/full_fig_p004_9.png]

**Figure 11.** OpenCV block diagram with supported operating systems [13].

2.3.2 Compute Unified Device Architecture (CUDA). CUDA is an API (application programming interface) and a parallel computing platform model that was created by Nvidia in 2007 [14]. Since then it has gained popularity among people who need high computing power with parallelism. This platform is a software layer which provides direct access to GP…

**Figure 15.** Test image lena.jpg. Its original size is 225x225. The image was processed at sizes 112x112, 338x338, 450x450, 562x562, 675x675, 788x788, 900x900, 1012x1012, 1125x1125, 1238x1238, 1350x1350, 1462x1462, 1575x1575, 1688x1688, 1800x1800, 1912x1912, 2025x2025, 2138x2138. [PITH_FULL_IMAGE:figures/full_fig_p006_15.png]
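The listed sizes appear consistent with scaling the 225-pixel side by factors from 0.5 to 9.5 in steps of 0.5 (skipping 1.0), with exact halves rounded to the nearest even integer. That reading is an inference from the numbers, not something the caption states. A short sketch that reproduces the list under this assumption:

```python
ORIGINAL_SIDE = 225  # side length of lena.jpg as stated in the caption


def scaled_sides(original=ORIGINAL_SIDE):
    """Side lengths for scale factors 0.5, 1.5, 2.0, ..., 9.5.

    The factor 1.0 is skipped because the caption does not list 225x225
    among the processed sizes. Python's round() sends exact halves to the
    nearest even integer, which matches the listed values.
    """
    factors = [k / 2 for k in range(1, 20) if k != 2]
    return [round(original * f) for f in factors]


sides = scaled_sides()
pixel_counts = [s * s for s in sides]  # total pixels per test image
```

`pixel_counts` corresponds to the NxM (total number of pixels) quantity that the timing figures use on one axis.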

**Figure 16.** Comparison of the built-in CPU and GPU functions that resize the image, in terms of NxM (total number of pixels) and T (time spent in milliseconds). Linear interpolation was used for resizing. [PITH_FULL_IMAGE:figures/full_fig_p006_16.png]
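For readers unfamiliar with the resize operation being timed, here is a minimal sketch of bilinear resizing, assuming that "Linear Interpolation" in the caption means bilinear interpolation on a 2-D image. The pixel-centre coordinate mapping and edge clamping are choices made for this example; the sketch is not OpenCV's implementation and will not necessarily match its output bit for bit.

```python
def resize_bilinear(img, new_h, new_w):
    """Resize a list-of-rows grayscale image with bilinear interpolation."""
    h, w = len(img), len(img[0])
    out = []
    for i in range(new_h):
        # Map the output pixel centre back into source coordinates.
        y = (i + 0.5) * h / new_h - 0.5
        y = min(max(y, 0.0), h - 1.0)      # clamp to the image border
        y0 = int(y)
        y1 = min(y0 + 1, h - 1)
        fy = y - y0
        row = []
        for j in range(new_w):
            x = (j + 0.5) * w / new_w - 0.5
            x = min(max(x, 0.0), w - 1.0)
            x0 = int(x)
            x1 = min(x0 + 1, w - 1)
            fx = x - x0
            # Blend horizontally on the two neighbouring rows, then vertically.
            top = img[y0][x0] * (1 - fx) + img[y0][x1] * fx
            bottom = img[y1][x0] * (1 - fx) + img[y1][x1] * fx
            row.append(top * (1 - fy) + bottom * fy)
        out.append(row)
    return out


stretched = resize_bilinear([[0, 10]], 1, 4)
```

Every output pixel is computed independently of the others, which is why this kind of operation maps naturally onto parallel hardware.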

**Figure 17.** Comparison of the built-in CPU and GPU functions that threshold the image. Otsu's method was used for thresholding. [PITH_FULL_IMAGE:figures/full_fig_p007_17.png]
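Otsu's method [9] picks the threshold that maximizes the between-class variance of the two pixel groups. A minimal sketch follows, assuming 8-bit intensities and the convention that values below `k` form class 0 and values at or above `k` form class 1. Library implementations may use a different boundary convention, so the returned value can differ by one from theirs.

```python
def otsu_threshold(pixels, levels=256):
    """Return the threshold k that maximizes between-class variance.

    Class 0 holds values < k, class 1 holds values >= k. On ties, the
    smallest such k is returned.
    """
    hist = [0] * levels
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    total_sum = sum(i * c for i, c in enumerate(hist))

    best_k, best_var = 0, -1.0
    w0 = 0      # number of pixels with value < k
    sum0 = 0    # sum of intensities with value < k
    for k in range(1, levels):
        w0 += hist[k - 1]
        sum0 += (k - 1) * hist[k - 1]
        w1 = total - w0
        if w0 == 0 or w1 == 0:
            continue  # one class is empty, variance undefined
        mu0 = sum0 / w0
        mu1 = (total_sum - sum0) / w1
        var_between = w0 * w1 * (mu0 - mu1) ** 2
        if var_between > best_var:
            best_var, best_k = var_between, k
    return best_k


# Made-up bimodal sample: half dark pixels, half bright pixels.
k = otsu_threshold([10] * 50 + [200] * 50)
```

The histogram pass is a reduction over all pixels, while the search over `k` is a short loop over intensity levels, so the cost is dominated by the image size.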

Original abstract

Image Processing is a specialized area of Digital Signal Processing which contains various mathematical and algebraic operations such as matrix inversion, transpose of matrix, derivative, convolution, Fourier Transform etc. Operations like those require higher computational capabilities than daily usage purposes of computers. At that point, with increased image sizes and more complex operations, CPUs may be unsatisfactory since they use Serial Processing by default. GPUs are the solution that come up with greater speed compared to CPUs because of their Parallel Processing/Computation nature. A parallel computing platform and programming model named CUDA was created by NVIDIA and implemented by the graphics processing units (GPUs) which were produced by them. In this paper, computing performance of some commonly used Image Processing operations will be compared on OpenCV's built in CPU and GPU functions that use CUDA.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

This is a routine OpenCV CPU-vs-GPU timing comparison that skips verifying the outputs match.

The main point is that this paper runs a standard benchmark of a handful of OpenCV image-processing functions on CPU and their CUDA counterparts on GPU, then reports that the GPU versions are faster. That result is unsurprising and adds no new method or insight. The work simply times existing library calls on whatever hardware the authors used and states the obvious about parallel processing. It does one thing reasonably: it gives concrete wall-clock numbers for specific operations such as convolution and matrix transforms on real images, which can be handy for someone already committed to OpenCV who needs a quick reference for those exact calls. The soft spot is the one flagged in the stress-test note. The paper reports only execution times and does not describe any check that the CPU and GPU results are functionally the same. Without a pixel-wise comparison, norm, or even a visual check, it is possible the two paths are computing slightly different things due to floating-point ordering, boundary handling, or implementation details. That makes the claimed speed-up hard to interpret as a fair comparison of the same operation. The abstract gives no methodology details on image sizes, number of repetitions, or error bars, and the stress-test indicates the full text does not supply the missing equivalence test either. This paper is for practitioners who want a narrow data point on OpenCV CUDA performance for basic ops; it is not aimed at researchers looking for new techniques or reproducible findings. I would not bring it to a reading group and would not cite it. It does not deserve referee time.
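The missing repetition counts and error bars mentioned in the letter could be reported with a small timing harness like the one below. This is a generic sketch, not the authors' procedure: `time_call`, `summarize`, and the stand-in `workload` are hypothetical names, and the repeat and warm-up counts are arbitrary defaults.

```python
import statistics
import time


def time_call(fn, repeats=10, warmup=2):
    """Return per-call wall-clock samples in milliseconds."""
    for _ in range(warmup):
        fn()  # warm-up runs are executed but not recorded
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000.0)
    return samples


def summarize(samples):
    """Median plus sample standard deviation, usable as an error bar."""
    spread = statistics.stdev(samples) if len(samples) > 1 else 0.0
    return {
        "median_ms": statistics.median(samples),
        "stdev_ms": spread,
        "runs": len(samples),
    }


def workload():
    # Stand-in for the operation under test (hypothetical).
    sum(i * i for i in range(10_000))


report = summarize(time_call(workload, repeats=5))
```

For GPU code the timed callable would also need to wait for the device to finish before the timer stops, since CUDA kernel launches return asynchronously. Warm-up runs matter there as well, because the first call typically pays one-time initialization costs.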

Referee Report

1 major / 0 minor

Summary. The manuscript compares the performance of common image processing operations (such as those involving matrix operations, convolution, and transforms) using OpenCV's built-in CPU functions versus its GPU functions that rely on CUDA. It argues that GPUs provide greater speed due to parallel processing and reports execution times to support this for the tested operations and image sizes.

Significance. If the measurements are valid, the work could supply concrete timing data to help practitioners decide when to invoke cv::cuda:: routines in OpenCV for image-processing pipelines. The empirical nature of the study means its value rests entirely on the reproducibility and correctness of the reported timings.

major comments (1)

The central claim requires that the CPU and GPU implementations compute functionally equivalent results; otherwise the reported speed-ups compare unlike operations. The manuscript provides no description of any output-equivalence verification (pixel-wise difference, L2 norm, or visual inspection) between the cv:: and cv::cuda:: results for any of the tested operations. This verification is load-bearing for the performance comparison and is absent from the reported methodology.
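The equivalence check the referee asks for can be as simple as a maximum absolute difference and an L2 norm over the two outputs. The sketch below works on plain lists of rows; the function names, the tolerance, and the sample values are assumptions for illustration, and in practice the inputs would be the CPU and GPU results of the same operation.

```python
import math


def compare_outputs(a, b):
    """Return (max absolute difference, L2 norm of the difference)."""
    if len(a) != len(b) or any(len(ra) != len(rb) for ra, rb in zip(a, b)):
        raise ValueError("outputs have different shapes")
    diffs = [x - y for ra, rb in zip(a, b) for x, y in zip(ra, rb)]
    max_abs = max((abs(d) for d in diffs), default=0.0)
    l2 = math.sqrt(sum(d * d for d in diffs))
    return max_abs, l2


def equivalent(a, b, tol=1e-5):
    """True if every element differs by at most `tol` (arbitrary default)."""
    max_abs, _ = compare_outputs(a, b)
    return max_abs <= tol


# Made-up outputs that differ in a single element.
cpu_out = [[0.0, 1.0], [2.0, 3.0]]
gpu_out = [[0.0, 1.0], [2.0, 3.5]]
max_abs, l2 = compare_outputs(cpu_out, gpu_out)
```

Reporting both numbers is useful because the maximum catches a single badly wrong pixel, while the L2 norm reflects small differences spread across the whole image.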

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for highlighting the need to verify functional equivalence between CPU and GPU results. We address the single major comment below.

Referee: The central claim requires that the CPU and GPU implementations compute functionally equivalent results; otherwise the reported speed-ups compare unlike operations. The manuscript provides no description of any output-equivalence verification (pixel-wise difference, L2 norm, or visual inspection) between the cv:: and cv::cuda:: results for any of the tested operations. This verification is load-bearing for the performance comparison and is absent from the reported methodology.

Authors: We agree that the absence of documented equivalence verification weakens the central claim. The original manuscript does not describe any pixel-wise, norm-based, or visual checks between cv:: and cv::cuda:: outputs. In the revised version we will add a dedicated methodology subsection that reports the verification procedure and results (maximum absolute difference and L2 norm) for every operation and image size, confirming that the compared functions produce numerically equivalent results within floating-point tolerance.

Revision: yes

Circularity Check

0 steps flagged

No circularity: empirical benchmark with no derivations or self-referential claims

The paper reports direct wall-clock timing measurements of OpenCV CPU (cv::) versus GPU (cv::cuda::) implementations for standard image-processing kernels on fixed test images. No equations, fitted parameters, uniqueness theorems, or ansatzes appear. The central claim (GPU faster due to parallelism) is an observed outcome of the benchmark, not a quantity derived from or fitted to itself. No self-citations are load-bearing; the work is self-contained against external timing data.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The paper relies on standard assumptions about hardware performance differences but introduces no new parameters, axioms, or entities.

pith-pipeline@v0.9.0 · 5665 in / 878 out tokens · 34314 ms · 2026-05-25T19:02:04.896666+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear

Paper passage: "Measurements shown that, GPU functions provide a performance improvement because they run in parallel but effects of GPU appear especially when image size increases."

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

Paper passage: "All codes written in C++ using OpenCV's built-in CPU and GPU functions."

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

16 extracted references · 16 canonical work pages

[2] Alan V. Oppenheim, Alan S. Willsky, S. Hamid Nawab, Signals and Systems (2nd Edition), Pearson, 1996, Pages 1-4

[3] Steven W. Smith, The Scientist and Engineer's Guide to Digital Signal Processing, 1997, Pages 1-2

[4] Lin-Ching Chang, Esam El-Araby, Vinh Q. Dang, Lam H. Dao, GPU acceleration of nonlinear diffusion tensor estimation using CUDA and MPI, Neurocomputing, Volume 135, 5 July 2014, Pages 328–338

[5] Diptarup Saha, Karan Darji, Narendra Patel, Darshak Thakore, Implementation of Image Enhancement Algorithms and Recursive Ray Tracing using CUDA, 7th International Conference on Communication, Computing and Virtualization 2016

[6] Wu Xin, Zhang Jian-qi, Huang Xi, Liu De-lian, Separable convolution template (SCT) background prediction accelerated by CUDA for infrared small target detection, Infrared Physics & Technology, 2013

[7] In Kyu Park, Nitin Singhal, Man Hee Lee, Sungdae Cho and Chris W. Kim, Design and Performance Evaluation of Image Processing Algorithms on GPUs, IEEE Transactions on Parallel and Distributed Systems, Vol. 22, No. 1, January 2011 (the extracted entry is interrupted by the paper's running header: International Journal of Engineering Science and Application, Hangun and Eyecioglu, Vol. 1, No. 2, 2017, p. 41)

[8] Rafael C. Gonzalez, Richard E. Woods, Digital Image Processing (3rd Edition), Pearson, 2009, Pages 1-5

[9] Nobuyuki Otsu, A Threshold Selection Method from Gray-Level Histograms, IEEE Transactions on Systems, Man, and Cybernetics (Volume: 9, Issue: 1, Jan. 1979)

[10] John Canny, A Computational Approach to Edge Detection, IEEE Transactions on Pattern Analysis and Machine Intelligence (Volume: PAMI-8, Issue: 6, Nov. 1986)

[12] Gary Bradski, Adrian Kaehler, Learning OpenCV 3 (1st Edition), O'Reilly, 2016, Pages 1-3

[13] Gary Bradski, Adrian Kaehler, Learning OpenCV 3 (1st Edition), O'Reilly, 2016, Pages 9-10

[14] Nvidia CUDA Home Page

[15] Richard Forster, Agnes Fülöp, Yang-Mills Lattice on CUDA, The Journal of "Sapientia" Hungarian University of Transylvania (Volume 5, Issue: 2, Dec. 2013)

[16] Shuai Che, Michael Boyer, Jiayuan Meng, David Tarjan, Jeremy W. Sheaffer, Kevin Skadron, A performance study of general-purpose applications on graphics processors using CUDA, Journal of Parallel and Distributed Computing, 2008

[17] Antonella Galizia, Daniele D'Agostino, Andrea Clematis, An MPI–CUDA library for image processing on HPC architectures, Journal of Computational and Applied Mathematics, 2014

[18] Zhiyi Yang, Yating Zhu, Yong Pu, Parallel Image Processing Based on CUDA, International Conference on Computer Science and Software Engineering
