
arxiv: 1906.10811 · v1 · pith:3FSF75W5new · submitted 2019-06-26 · 💻 cs.MS · cs.PL

Investigating the OPS intermediate representation to target GPUs in the Devito DSL

Pith reviewed 2026-05-25 15:21 UTC · model grok-4.3

classification 💻 cs.MS cs.PL
keywords Devito DSL · OPS intermediate representation · GPU code generation · finite difference methods · seismic inversion · structured meshes · backend integration

The pith

Devito adds an OPS backend to target GPUs and deliver considerable speedups over its core backend for finite-difference PDE solvers.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Devito is a domain-specific language that generates finite-difference code for partial differential equations, focused on seismic inversion. The paper examines the addition of support for the OPS API, which produces optimized code for structured meshes across platforms including GPUs. This is done by implementing an OPS backend inside Devito. The integration produces considerable performance gains relative to Devito's existing backend. A sympathetic reader would care because the change lets existing Devito models run efficiently on GPUs without rewriting the high-level specification.
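To make concrete the kind of kernel a finite-difference code generator ultimately has to emit, here is a minimal hand-written sketch of an explicit 1D diffusion update. It is not Devito's API and not its generated code; the function names, the choice of equation, and all parameters are illustrative assumptions.

```python
def diffusion_step(u, alpha, dt, dx):
    """One explicit step of u_t = alpha * u_xx; boundary values stay fixed."""
    coeff = alpha * dt / (dx * dx)
    new = list(u)  # copy, so the two boundary points are carried over unchanged
    for i in range(1, len(u) - 1):
        # Second-order central difference: the 3-point stencil.
        new[i] = u[i] + coeff * (u[i - 1] - 2.0 * u[i] + u[i + 1])
    return new


def run(u0, steps, alpha=1.0, dt=0.1, dx=1.0):
    """Apply `steps` time steps starting from the initial field u0."""
    u = list(u0)
    for _ in range(steps):
        u = diffusion_step(u, alpha, dt, dx)
    return u
```

A DSL of this kind lets the user state the equation symbolically and leaves loops like the one above, and their platform-specific optimization, to the backend.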

Core claim

By providing an implementation of an OPS backend in Devito, the authors obtain considerable speedups compared to the core Devito backend for applications running on structured meshes targeting various platforms including GPUs.

What carries the argument

The OPS backend implementation in Devito, which maps Devito's finite-difference code generation to the OPS API for platform-specific optimized output.

If this is right

  • Devito-generated code for seismic problems can now target GPUs through the OPS layer.
  • The same high-level Devito models produce optimized output for multiple hardware platforms via OPS.
  • Performance gains appear for finite-difference methods on structured meshes without altering the original problem specification.
  • The integration demonstrates a pathway for other code-generation DSLs to reach GPUs by adopting an intermediate API such as OPS.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar backend additions could allow Devito to target additional accelerators if corresponding OPS-like APIs exist for those devices.
  • The observed speedups may vary with problem size, mesh resolution, or specific finite-difference stencil, suggesting targeted benchmarking would be needed for new applications.
  • If the OPS integration adds no extra user-facing complexity, it could encourage adoption of Devito in production GPU workflows where hand-written kernels are currently used.

Load-bearing premise

An OPS backend can be added to Devito while preserving correctness and delivering net performance gains on target GPU hardware without prohibitive compilation or runtime overheads.

What would settle it

A side-by-side run of a representative seismic inversion problem on GPU hardware showing either incorrect results or no runtime improvement with the OPS backend versus the core Devito backend would falsify the claim.
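A side-by-side check of this kind can be sketched as a small harness that runs two backends on the same problem, compares their outputs within a tolerance, and reports the runtime ratio. The two "backends" here are hypothetical stand-ins (plain Python functions), not the Devito core or OPS backends, and the tolerance is an assumed value.

```python
import time


def compare_backends(reference, candidate, problem, rel_tol=1e-6):
    """Run both callables on `problem`; report agreement and speedup."""
    t0 = time.perf_counter()
    ref_out = reference(problem)
    t_ref = time.perf_counter() - t0

    t0 = time.perf_counter()
    cand_out = candidate(problem)
    t_cand = time.perf_counter() - t0

    # Largest pointwise difference, scaled by the largest reference value.
    scale = max(max((abs(x) for x in ref_out), default=0.0), 1e-30)
    max_err = max((abs(a - b) for a, b in zip(ref_out, cand_out)), default=0.0)
    max_rel_err = max_err / scale
    return {
        "correct": len(ref_out) == len(cand_out) and max_rel_err <= rel_tol,
        "max_rel_err": max_rel_err,
        "speedup": t_ref / t_cand if t_cand > 0 else float("inf"),
    }


def backend_division(u):
    """Stand-in reference: 3-point average using a division per point."""
    inner = [(u[i - 1] + u[i] + u[i + 1]) / 3.0 for i in range(1, len(u) - 1)]
    return [u[0]] + inner + [u[-1]]


def backend_reciprocal(u):
    """Stand-in candidate: same average using a precomputed reciprocal."""
    third = 1.0 / 3.0
    inner = [(u[i - 1] + u[i] + u[i + 1]) * third for i in range(1, len(u) - 1)]
    return [u[0]] + inner + [u[-1]]
```

A single timed run per backend is only a first indication; a claim of speedup would normally rest on repeated runs on the target hardware.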

Figures

Figures reproduced from arXiv: 1906.10811 by Vincenzo Pandolfo.

Figure 2.1: The input data (left) and the result (right) of the execution of the …
Figure 2.2: The Devito pipeline [9]
Adjacent text (Section 2.1.4, IET nodes): Most of the focus of this report is on manipulation of IETs and therefore of IET nodes. We will now give a brief overview of some IET nodes that will be used in this report.
  • Expression: encapsulates a ClusterizedEq, a SymPy equation with associated iteration and data space. It is rendered in code generation as an assignment.
  • Call: represents a function call.
  • …
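The idea of tree nodes that render themselves as C code can be illustrated with a toy model. The classes below are hypothetical and far simpler than Devito's actual IET nodes: `Expression` and `Call` borrow only the names and the one-line descriptions quoted above, and `Loop` is an invented container added so that the example produces a complete loop nest.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Expression:
    """Toy node: rendered as a C assignment statement."""
    lhs: str
    rhs: str

    def render(self, indent=0):
        return " " * indent + f"{self.lhs} = {self.rhs};"


@dataclass
class Call:
    """Toy node: rendered as a C function call statement."""
    name: str
    args: List[str] = field(default_factory=list)

    def render(self, indent=0):
        return " " * indent + f"{self.name}({', '.join(self.args)});"


@dataclass
class Loop:
    """Hypothetical container node: a counted for-loop over child nodes."""
    var: str
    start: int
    stop: int
    body: list

    def render(self, indent=0):
        pad = " " * indent
        head = (f"{pad}for (int {self.var} = {self.start}; "
                f"{self.var} < {self.stop}; {self.var}++) {{")
        inner = [node.render(indent + 2) for node in self.body]
        return "\n".join([head] + inner + [pad + "}"])
```

Rendering `Loop("i", 1, 9, [Expression("u[i]", "0.5f*(v[i-1] + v[i+1])")])` yields a three-line C loop; a backend would rewrite such a tree before rendering it.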
Figure 3.1: Code generation pipeline for the OPS backend
Figure 4.1: CUDA compilation pipeline in Devito
Adjacent text: … in 2020 [21]) and it is organized as a monolithic script for standalone execution, making it cumbersome to use within Devito. Upgrading the translator to Python 3 would be a good first step in simplifying its integration within Devito, but a better solution would be a complete rework of the OPS translator as a proper Python module. It would need to expose methods to tr…
Figure 5.1: OPS backend, CUDA, advanced DSE. Axes: operational intensity (FLOPs/Byte) against performance (GFLOPs/s). Legend: nbody, ideal peak. Panel (a), 2500² grid: SO=4, 62%, 0.25s; SO=8, 45%, 0.35s; SO=12, 34%, 0.46s; SO=16, 28%, 0.56s. Second panel: SO=4, 63%, 3.96s; SO=8, 46%, 5.48s; SO=12, 3…
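The axes of this figure match those of a roofline plot [25], in which attainable performance is bounded by the smaller of peak compute and memory bandwidth times operational intensity. The sketch below shows that arithmetic. The machine parameters are placeholder values chosen for easy numbers, not the specifications of the hardware used in the paper, and whether the percentages in the figure are computed in exactly this way is an assumption.

```python
def attainable_gflops(operational_intensity, peak_gflops, bandwidth_gbs):
    """Roofline bound: limited by memory traffic or by peak compute."""
    return min(peak_gflops, bandwidth_gbs * operational_intensity)


def percent_of_attainable(measured_gflops, operational_intensity,
                          peak_gflops, bandwidth_gbs):
    """Measured performance as a percentage of the roofline bound."""
    bound = attainable_gflops(operational_intensity, peak_gflops, bandwidth_gbs)
    return 100.0 * measured_gflops / bound


# Hypothetical machine parameters (placeholders, not the paper's hardware).
PEAK_GFLOPS = 8000.0
BANDWIDTH_GBS = 300.0

# At 2 FLOPs/Byte the kernel is memory bound: 300 GB/s * 2 = 600 GFLOPs/s.
memory_bound = attainable_gflops(2.0, PEAK_GFLOPS, BANDWIDTH_GBS)
# At 64 FLOPs/Byte the bound is the compute peak itself.
compute_bound = attainable_gflops(64.0, PEAK_GFLOPS, BANDWIDTH_GBS)
```

Under these placeholder numbers, a kernel measured at 300 GFLOPs/s with an operational intensity of 2 FLOPs/Byte would sit at 50% of its bound.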
Figure 5.2: OPS backend, CUDA, aggressive DSE
Adjacent text: …sive mode: among other optimizations, it substitutes common divisions with multiplications as shown in listing 5.1. However, for more complex kernels the DSE in aggressive mode would apply further transformations that would not be necessarily beneficial to performance on a GPU. This is not done by default in Devito as it is not necessary on CPUs, but on GPUs it makes a …
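Listing 5.1 is not reproduced on this page. As a rough illustration of what substituting a common division with a multiplication means, the hand-written sketch below hoists the reciprocal of the squared grid spacing out of the loop. The function names and the stencil are assumptions made for the example, not the transformation as Devito implements it.

```python
def laplacian_with_division(u, dt, h):
    """Interior update terms, dividing by h*h at every grid point."""
    return [(u[i - 1] - 2.0 * u[i] + u[i + 1]) / (h * h) * dt
            for i in range(1, len(u) - 1)]


def laplacian_with_reciprocal(u, dt, h):
    """Same terms, with the division replaced by one precomputed reciprocal."""
    inv_h2 = 1.0 / (h * h)  # computed once, outside the loop
    return [(u[i - 1] - 2.0 * u[i] + u[i + 1]) * inv_h2 * dt
            for i in range(1, len(u) - 1)]
```

The two variants can differ in the last bits of the result because floating-point division and multiplication by a rounded reciprocal are not identical operations, so comparisons between them should use a tolerance.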
Figure 5.3: 10000² grid, core backend, OpenMP 32
Figure 5.4: 10000² grid, OPS backend, OpenMP
Adjacent text (Section 5.1.7, Summary): While the observed performance of the OPS backend on CPU is less than promising, very good results were obtained with CUDA, with good percentages of peak performance and from 4 to 7 times faster execution times compared to the core backend on the hardware used.
Adjacent text (Section 5.2, Software evaluation): One of the considerations one has to make when adopting a new library is …
Figure 5.5: GitHub insights for the OPS repository (accessed 15th Jun 2019)
Original abstract

The Devito DSL is a code generation tool for the solution of partial differential equations using the finite difference method specifically aimed at seismic inversion problems. In this work we investigate the integration of OPS, an API to generate highly optimized code for applications running on structured meshes targeting various platforms, within Devito as a means of bringing it to the GPU realm by providing an implementation of an OPS backend in Devito, obtaining considerable speedups compared to the core Devito backend.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 0 minor

Summary. The manuscript describes the integration of the OPS intermediate representation into the Devito DSL for finite-difference PDE solvers aimed at seismic inversion. It presents an implementation of an OPS backend within Devito and reports obtaining considerable speedups compared to the core Devito backend on GPU hardware.

Significance. If the performance claims hold with proper validation, the work would show a practical route for extending Devito to GPUs via an existing structured-mesh code-generation API, which could benefit performance-critical geophysics applications without requiring a full rewrite of the symbolic layer.

major comments (1)
  1. [Abstract] The claim that 'considerable speed ups' were obtained supplies no measurement protocol, problem sizes, hardware details, or error bars, so the central performance claim cannot be evaluated from the given text.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their review. The single major comment concerns the level of detail in the abstract regarding performance claims. We address it below and will revise the manuscript accordingly.

  1. Referee: [Abstract] The claim that 'considerable speed ups' were obtained supplies no measurement protocol, problem sizes, hardware details, or error bars, so the central performance claim cannot be evaluated from the given text.

    Authors: We agree that the abstract does not supply enough context on its own for a reader to evaluate the performance claims. The body of the manuscript contains the full experimental protocol, problem sizes, hardware specifications, and results (including variability measures), but the abstract should be improved to include representative details of these elements. We will revise the abstract to incorporate key information on the measurement protocol, problem sizes, hardware, and error bars/variability. revision: yes
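The variability measures discussed in this exchange can be gathered with a simple repeated-timing helper. The sketch below uses only the Python standard library; the function name, the number of repeats, and the warm-up count are assumptions, and it is not the benchmarking tool used in the paper.

```python
import statistics
import time


def time_runs(fn, repeats=5, warmup=1):
    """Time `fn` several times and summarize the samples in seconds."""
    for _ in range(warmup):
        fn()  # discard warm-up runs (caches, lazy initialization)
    samples = []
    for _ in range(repeats):
        t0 = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - t0)
    return {
        "n": len(samples),
        "mean": statistics.mean(samples),
        # Sample standard deviation needs at least two measurements.
        "stdev": statistics.stdev(samples) if len(samples) > 1 else 0.0,
        "min": min(samples),
    }
```

Reporting the mean together with the standard deviation (or the minimum) alongside problem size and hardware gives a reader what is needed to judge a speedup figure.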

Circularity Check

0 steps flagged

No significant circularity; implementation report with empirical results


The manuscript is an engineering report describing the addition of an OPS backend to Devito for GPU targeting, with the central claim being the observed speedups from that integration. No equations, derivations, fitted parameters, or predictions appear in the provided text. The work contains no self-citation chains, ansatzes, or uniqueness theorems that could reduce to inputs by construction. The result is therefore self-contained as a description of implementation outcomes rather than a mathematical argument.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No mathematical model, free parameters, or new physical entities are introduced; the work is a software integration report.



Reference graph

Works this paper leans on

27 extracted references · 27 canonical work pages · 1 internal anchor

  1. [1]

    Firedrake: Automating the finite element method by composing abstractions,

    F. Rathgeber, D. A. Ham, L. Mitchell, M. Lange, F. Luporini, A. T. T. McRae, G.-T. Bercea, G. R. Markall, and P. H. J. Kelly, “Firedrake: Automating the finite element method by composing abstractions,” ACM Trans. Math. Softw., vol. 43, no. 3, pp. 24:1–24:27, 2016

  2. [2]

    The fenics project version 1.5,

    M. S. Alnæs, J. Blechta, J. Hake, A. Johansson, B. Kehlet, A. Logg, C. Richardson, J. Ring, M. E. Rognes, and G. N. Wells, “The fenics project version 1.5,” Archive of Numerical Software, vol. 3, no. 100, 2015

  3. [3]

    Unified form language: A domain-specific language for weak formulations of partial differential equations,

    M. S. Alnæs, A. Logg, K. B. Ølgaard, M. E. Rognes, and G. N. Wells, “Unified form language: A domain-specific language for weak formulations of partial differential equations,” ACM Transactions on Mathematical Software, vol. 40, no. 2, 2014

  4. [4]

    Devito: an embedded domain-specific language for finite differences and geophysical exploration,

    M. Louboutin, M. Lange, F. Luporini, N. Kukreja, P. A. Witte, F. J. Herrmann, P. Velesko, and G. J. Gorman, “Devito: an embedded domain-specific language for finite differences and geophysical exploration,” CoRR, vol. abs/1808.01995, Aug 2018

  5. [5]

    Yask—yet another stencil kernel: A framework for hpc stencil code-generation and tuning,

    C. Yount, J. Tobin, A. Breuer, and A. Duran, “Yask—yet another stencil kernel: A framework for hpc stencil code-generation and tuning,” 2016 Sixth International Workshop on Domain-Specific Languages and High-Level Frameworks for High Performance Computing (WOLFHPC), pp. 30–39, 2016

  6. [6]

    The ops domain specific abstraction for multi-block structured grid computations,

    I. Z. Reguly, G. R. Mudalige, M. B. Giles, D. Curran, and S. McIntosh-Smith, “The ops domain specific abstraction for multi-block structured grid computations,” in Proceedings of the Fourth International Workshop on Domain-Specific Languages and High-Level Frameworks for High Performance Computing, WOLFHPC ’14, (Piscataway, NJ, USA), pp. 58–67, IEEE Pr...

  7. [7]

    Sympy: symbolic computing in python,

    A. Meurer, C. P. Smith, M. Paprocki, O. Čertík, S. B. Kirpichev, M. Rocklin, A. Kumar, S. Ivanov, J. K. Moore, S. Singh, T. Rathnayake, S. Vig, B. E. Granger, R. P. Muller, F. Bonazzi, H. Gupta, S. Vats, F. Johansson, F. Pedregosa, M. J. Curry, A. R. Terrel, Š. Roučka, A. Saboo, I. Fernando, S. Kulal, R. Cimrman, and A. Scopatz, “Sympy: symbolic computi...

  8. [8]

    Devito cfd tutorial series

    “Devito cfd tutorial series.” https://nbviewer.jupyter.org/github/opesci/devito/blob/master/examples/cfd/01_convection.ipynb. Accessed: 24th Jan 2019

  9. [9]

    Architecture and performance of devito, a system for automated stencil computation,

    F. Luporini, M. Lange, M. Louboutin, N. Kukreja, J. Hückelheim, C. Yount, P. A. Witte, P. H. J. Kelly, G. J. Gorman, and F. J. Herrmann, “Architecture and performance of devito, a system for automated stencil computation,” CoRR, vol. abs/1807.03032, 2018

  10. [10]

    Cgen - c/c++ source generation from an ast

    “Cgen - c/c++ source generation from an ast.” https://github.com/inducer/cgen. Accessed: 25th Jan 2019

  11. [11]

    Vector folding: Improving stencil performance via multi-dimensional simd-vector representation,

    C. Yount, “Vector folding: Improving stencil performance via multi-dimensional simd-vector representation,” in 2015 IEEE 17th International Conference on High Performance Computing and Communications, 2015 IEEE 7th International Symposium on Cyberspace Safety and Security, and 2015 IEEE 12th International Conference on Embedded Software and Systems, pp...

  12. [12]

    Multi-level spatial and temporal tiling for efficient hpc stencil computation on many-core processors with large shared caches,

    C. Yount, A. Duran, and J. Tobin, “Multi-level spatial and temporal tiling for efficient hpc stencil computation on many-core processors with large shared caches,” Future Generation Computer Systems, vol. 92, pp. 903–919, 2019

  13. [13]

    Loo.py: transformation-based code generation for GPUs and CPUs

    A. Klöckner, “Loo.py: transformation-based code generation for GPUs and CPUs,” CoRR, vol. abs/1405.7470, 2014

  14. [14]

    isl: An integer set library for the polyhedral model,

    S. Verdoolaege, “isl: An integer set library for the polyhedral model,” in Mathematical Software – ICMS 2010 (K. Fukuda, J. v. d. Hoeven, M. Joswig, and N. Takayama, eds.), (Berlin, Heidelberg), pp. 299–302, Springer Berlin Heidelberg, 2010

  15. [15]

    Mint: Realizing cuda performance in 3d stencil methods with annotated c,

    D. Unat, X. Cai, and S. B. Baden, “Mint: Realizing cuda performance in 3d stencil methods with annotated c,” pp. 214–224, 01 2011

  16. [16]

    High performance stencil code generation with lift,

    B. Hagedorn, L. Stoltzfus, M. Steuwer, S. Gorlatch, and C. Dubach, “High performance stencil code generation with lift,” in Proceedings of the 2018 International Symposium on Code Generation and Optimization, CGO 2018, (New York, NY, USA), pp. 100–112, ACM, 2018

  17. [17]

    Lift: A functional data-parallel ir for high-performance gpu code generation,

    M. Steuwer, T. Remmelg, and C. Dubach, “Lift: A functional data-parallel ir for high-performance gpu code generation,” in 2017 IEEE/ACM International Symposium on Code Generation and Optimization (CGO), pp. 74–85, Feb 2017

  18. [18]

    Ops expressions translation #760

    V. Mickus and V. Pandolfo, “Ops expressions translation #760.” https://github.com/opesci/devito/pull/760. Accessed: 2nd Jun 2019

  19. [19]

    codepy

    A. Kloeckner, “codepy.” Accessed: 7th June 2019

  20. [20]

    C-types foreign function interface (numpy.ctypeslib)

    “C-types foreign function interface (numpy.ctypeslib).” https://docs.scipy.org/doc/numpy/reference/routines.ctypeslib.html. Accessed: 10th June 2019

  21. [21]

    Pep 373 python 2.7 release schedule

    “Pep 373 python 2.7 release schedule.” https://legacy.python.org/dev/peps/pep-0373/. Accessed: 7th June 2019

  22. [22]

    Geforce gtx 1080 | specifications

    NVIDIA, “Geforce gtx 1080 | specifications.” https://www.geforce.co.uk/hardware/desktop-gpus/geforce-gtx-1080/specifications. Accessed: 6th June 2019

  23. [23]

    Azure linux vm sizes - hpc | microsoft docs

    “Azure linux vm sizes - hpc | microsoft docs.” https://docs.microsoft.com/en-us/azure/virtual-machines/linux/sizes-hpc. Accessed: 13th June 2019

  24. [24]

    opescibench

    “opescibench.” https://github.com/opesci/opescibench. Accessed: 6th June 2019

  25. [25]

    Roofline: An insightful visual performance model for floating-point programs and multicore architectures,

    S. Williams, A. Waterman, and D. Patterson, “Roofline: An insightful visual performance model for floating-point programs and multicore architectures,” tech. rep., Lawrence Berkeley National Lab.(LBNL), Berkeley, CA (United States), 2009

  26. [26]

    Performance of various computers using standard linear equations software,

    J. J. Dongarra, “Performance of various computers using standard linear equations software,” SIGARCH Comput. Archit. News, vol. 20, pp. 22–44, June 1992

  27. [27]

    Fast n-body simulation with cuda,

    L. Nyland, M. Harris, and J. Prins, “Fast n-body simulation with cuda,” GPU Gems, vol. 3, pp. 677–695, 01 2009