Investigating the OPS intermediate representation to target GPUs in the Devito DSL

Vincenzo Pandolfo

arxiv: 1906.10811 · v1 · pith:3FSF75W5new · submitted 2019-06-26 · 💻 cs.MS · cs.PL

Pith reviewed 2026-05-25 15:21 UTC · model grok-4.3

keywords: Devito DSL · OPS intermediate representation · GPU code generation · finite difference methods · seismic inversion · structured meshes · backend integration

The pith

Devito adds an OPS backend to target GPUs and deliver considerable speedups over its core backend for finite-difference PDE solvers.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Devito is a domain-specific language that generates finite-difference code for partial differential equations, focused on seismic inversion. The paper examines the addition of support for the OPS API, which produces optimized code for structured meshes across platforms including GPUs. This is done by implementing an OPS backend inside Devito. The integration produces considerable performance gains relative to Devito's existing backend. A sympathetic reader would care because the change lets existing Devito models run efficiently on GPUs without rewriting the high-level specification.
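To make the notion of finite-difference code concrete, the following is a minimal hand-written sketch of the kind of stencil update that such a DSL automates. It is not Devito output and does not use the Devito API; the function name `diffusion_step` and the parameters `alpha`, `dx` and `dt` are assumptions chosen for illustration, applied to the 1D diffusion equation u_t = alpha * u_xx with fixed boundary values.

```python
def diffusion_step(u, alpha, dx, dt):
    """Advance the 1D diffusion equation u_t = alpha * u_xx by one explicit step.

    Uses a second-order central difference in space and forward Euler in time.
    Boundary values are held fixed. All names here are illustrative.
    """
    coeff = alpha * dt / (dx * dx)
    new_u = list(u)  # copy, so the boundary entries are carried over unchanged
    for i in range(1, len(u) - 1):
        # Central-difference approximation of the second derivative at point i
        new_u[i] = u[i] + coeff * (u[i - 1] - 2.0 * u[i] + u[i + 1])
    return new_u


if __name__ == "__main__":
    # A single spike in the middle of the domain spreads to its neighbours
    u0 = [0.0, 0.0, 1.0, 0.0, 0.0]
    print(diffusion_step(u0, alpha=1.0, dx=1.0, dt=0.25))
```

A code generator is useful precisely because the loop body above grows quickly with the order of the discretization and the number of dimensions, while the high-level equation stays the same.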

Core claim

By providing an implementation of an OPS backend in Devito, the author obtains considerable speedups compared to the core Devito backend for applications running on structured meshes targeting various platforms including GPUs.

What carries the argument

The OPS backend implementation in Devito, which maps Devito's finite-difference code generation to the OPS API for platform-specific optimized output.

If this is right

Devito-generated code for seismic problems can now target GPUs through the OPS layer.
The same high-level Devito models produce optimized output for multiple hardware platforms via OPS.
Performance gains appear for finite-difference methods on structured meshes without altering the original problem specification.
The integration demonstrates a pathway for other code-generation DSLs to reach GPUs by adopting an intermediate API such as OPS.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar backend additions could allow Devito to target additional accelerators if corresponding OPS-like APIs exist for those devices.
The observed speedups may vary with problem size, mesh resolution, or specific finite-difference stencil, suggesting targeted benchmarking would be needed for new applications.
If the OPS integration adds no extra user-facing complexity, it could encourage adoption of Devito in production GPU workflows where hand-written kernels are currently used.

Load-bearing premise

An OPS backend can be added to Devito while preserving correctness and delivering net performance gains on target GPU hardware without prohibitive compilation or runtime overheads.

What would settle it

A side-by-side run of a representative seismic inversion problem on GPU hardware showing either incorrect results or no runtime improvement with the OPS backend versus the core Devito backend would falsify the claim.
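The comparison described above can be sketched as a small harness that checks two implementations for agreement and reports the runtime ratio. This is a minimal sketch, not the paper's benchmarking procedure: `reference_kernel` and `candidate_kernel` are hypothetical stand-ins for the core and OPS backends, and the tolerance and repeat count are assumptions.

```python
import math
import time


def reference_kernel(values):
    """Hypothetical stand-in for the baseline backend: three-point average."""
    return [(values[i - 1] + values[i] + values[i + 1]) / 3.0
            for i in range(1, len(values) - 1)]


def candidate_kernel(values):
    """Hypothetical stand-in for the new backend: same stencil, reciprocal form."""
    third = 1.0 / 3.0
    return [(values[i - 1] + values[i] + values[i + 1]) * third
            for i in range(1, len(values) - 1)]


def results_match(a, b, rel_tol=1e-9):
    """True when both outputs have equal length and agree element-wise."""
    return len(a) == len(b) and all(
        math.isclose(x, y, rel_tol=rel_tol, abs_tol=1e-12) for x, y in zip(a, b))


def best_time(kernel, values, repeats=5):
    """Smallest wall-clock time over several runs, to damp timing noise."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        kernel(values)
        timings.append(time.perf_counter() - start)
    return min(timings)


def compare(values):
    """Return (outputs agree, speedup of candidate over reference)."""
    agree = results_match(reference_kernel(values), candidate_kernel(values))
    t_ref = best_time(reference_kernel, values)
    t_cand = best_time(candidate_kernel, values)
    # Guard against a zero reading from a coarse timer
    speedup = t_ref / t_cand if t_cand > 0 else float("inf")
    return agree, speedup


if __name__ == "__main__":
    data = [math.sin(0.01 * i) for i in range(10_000)]
    agree, speedup = compare(data)
    print(f"outputs agree: {agree}, speedup: {speedup:.2f}x")
```

Under this reading, the claim would be falsified if `agree` were false or the speedup stayed at or below 1 on a representative problem; a real test would swap in the two generated backends and run on the target GPU.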

Figures

Figures reproduced from arXiv: 1906.10811 by Vincenzo Pandolfo.

**Figure 2.1.** The input data (left) and the result (right) of the execution of the …

**Figure 2.2.** The Devito pipeline [9].

2.1.4 IET nodes. Most of the focus of this report is on manipulation of IETs and therefore of IET nodes. We will now give a brief overview of some IET nodes that will be used in this report.
• Expression: encapsulates a ClusterizedEq, a Sympy equation with associated iteration and data space. It is rendered in code generation as an assignment.
• Call: represents a function call.
• …

**Figure 3.1.** Code generation pipeline for the OPS backend.

**Figure 4.1.** CUDA compilation pipeline in Devito.

… in 2020 [21]) and it is organized as a monolithic script for standalone execution, making it cumbersome to use within Devito. Upgrading the translator to Python 3 would be a good first step in simplifying its integration within Devito, but a better solution would be a complete rework of the OPS translator as a proper Python module. It would need to expose methods to tr…

**Figure 5.1.** OPS backend, CUDA, advanced DSE. Plots of performance (GFLOPs/s) against operational intensity (FLOPs/Byte), with "nbody" and "ideal peak" reference lines. Panel (a), 2500² grid: SO=4 62% 0.25s; SO=8 45% 0.35s; SO=12 34% 0.46s; SO=16 28% 0.56s. Next panel: SO=4 63% 3.96s; SO=8 46% 5.48s; SO=12 3…

**Figure 5.2.** OPS backend, CUDA, aggressive DSE.

… aggressive mode: among other optimizations, it substitutes common divisions with multiplications as shown in listing 5.1. However, for more complex kernels the DSE in aggressive mode would apply further transformations that would not be necessarily beneficial to performance on a GPU. This is not done by default in Devito as it is not necessary on CPUs, but on GPUs it makes a …
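The division-to-multiplication substitution mentioned in the text accompanying Figure 5.2 can be illustrated by hand. Listing 5.1 of the paper is not reproduced on this page, so the following is only a sketch of the general technique under assumed names (`laplacian_div`, `laplacian_mul`, `h`), not the code that Devito's DSE emits: a division by a loop-invariant quantity is replaced by a multiplication with its reciprocal, computed once outside the loop.

```python
import math


def laplacian_div(u, h):
    """1D second difference, dividing by h*h at every grid point."""
    return [(u[i - 1] - 2.0 * u[i] + u[i + 1]) / (h * h)
            for i in range(1, len(u) - 1)]


def laplacian_mul(u, h):
    """Same stencil with the division hoisted out of the loop."""
    inv_h2 = 1.0 / (h * h)  # single division, reused at every grid point
    return [(u[i - 1] - 2.0 * u[i] + u[i + 1]) * inv_h2
            for i in range(1, len(u) - 1)]


if __name__ == "__main__":
    h = 0.1
    u = [(i * h) ** 2 for i in range(8)]  # samples of x**2, second derivative 2
    a = laplacian_div(u, h)
    b = laplacian_mul(u, h)
    # Compare with a tolerance: the two forms round at different points
    print(all(math.isclose(x, y, rel_tol=1e-12) for x, y in zip(a, b)))
```

Because floating-point division and reciprocal multiplication round differently, the two results are compared with a tolerance rather than for exact equality.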

**Figure 5.3.** 10000² grid, core backend, OpenMP.

**Figure 5.4.** 10000² grid, OPS backend, OpenMP.

5.1.7 Summary. While the observed performance of the OPS backend on CPU is less than promising, very good results were obtained with CUDA, with good percentages of peak performance and from 4 to 7 times faster execution times compared to the core backend on the hardware used.

5.2 Software evaluation. One of the considerations one has to make when adopting a new library is …

**Figure 5.5.** GitHub insights for the OPS repository (accessed 15th Jun 2019).
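Figure 5.1 plots performance against operational intensity, the two axes of the roofline model [25]. As a worked illustration of how a performance bound and a percentage of that bound can be computed in this model, the sketch below takes the attainable performance to be the minimum of the compute peak and the product of memory bandwidth and operational intensity. The hardware numbers are hypothetical placeholders, not the specifications of the GPU used in the paper, and the function names are assumptions.

```python
def attainable_gflops(operational_intensity, peak_gflops, bandwidth_gbs):
    """Roofline bound: min(compute peak, memory bandwidth * operational intensity).

    operational_intensity is in FLOPs/Byte, bandwidth_gbs in GB/s,
    and the result is in GFLOPs/s.
    """
    return min(peak_gflops, bandwidth_gbs * operational_intensity)


def percent_of_bound(measured_gflops, operational_intensity, peak_gflops, bandwidth_gbs):
    """Measured performance as a percentage of the roofline bound at that intensity."""
    bound = attainable_gflops(operational_intensity, peak_gflops, bandwidth_gbs)
    return 100.0 * measured_gflops / bound


if __name__ == "__main__":
    # Hypothetical device: 8000 GFLOPs/s compute peak, 320 GB/s memory bandwidth
    PEAK, BANDWIDTH = 8000.0, 320.0
    for oi in (1.0, 4.0, 16.0, 64.0):
        print(oi, attainable_gflops(oi, PEAK, BANDWIDTH))
    # A kernel measured at 640 GFLOPs/s with 4 FLOPs/Byte reaches half of its bound
    print(percent_of_bound(640.0, 4.0, PEAK, BANDWIDTH))
```

With these placeholder values the ridge point lies at 8000 / 320 = 25 FLOPs/Byte: kernels below that intensity are limited by memory bandwidth, kernels above it by the compute peak.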

Original abstract

The Devito DSL is a code generation tool for the solution of partial differential equations using the finite difference method, specifically aimed at seismic inversion problems. In this work we investigate the integration of OPS, an API to generate highly optimized code for applications running on structured meshes targeting various platforms, within Devito as a means of bringing it to the GPU realm by providing an implementation of an OPS backend in Devito, obtaining considerable speedups compared to the core Devito backend.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

This is a short engineering report on adding an OPS backend to Devito; the claimed speedups are stated but not supported by any data or protocol in the text.

The paper describes an effort to connect the OPS API to the Devito DSL so that finite-difference kernels for seismic problems can target GPUs. The work consists of implementing the backend and reporting that this produces speedups over Devito's existing path. That integration step is the only concrete contribution; no new algorithm, solver, or derivation is introduced. The description of how the two systems were wired together may be of narrow interest to people already maintaining Devito or similar code-generation tools for structured meshes. Beyond that, the text supplies no new technique that would transfer to other DSLs or platforms.

The central weakness is the performance claim. The abstract asserts considerable speedups, yet the manuscript gives no problem sizes, grid resolutions, hardware details, timing methodology, or comparison baseline. Without those elements the result cannot be checked or reproduced, so the main asserted benefit remains unevaluated.

The paper is therefore best read as an internal implementation note rather than a finished piece of research. It would be useful only to a small group already working inside the Devito codebase who need to know the practical steps taken. For anyone outside that circle the lack of evidence makes it hard to extract value. I would not send this to peer review; the absence of verifiable results means it does not yet meet the threshold for referee time.

Referee Report

1 major / 0 minor

Summary. The manuscript describes the integration of the OPS intermediate representation into the Devito DSL for finite-difference PDE solvers aimed at seismic inversion. It presents an implementation of an OPS backend within Devito and reports obtaining considerable speedups compared to the core Devito backend on GPU hardware.

Significance. If the performance claims hold with proper validation, the work would show a practical route for extending Devito to GPUs via an existing structured-mesh code-generation API, which could benefit performance-critical geophysics applications without requiring a full rewrite of the symbolic layer.

major comments (1)

[Abstract] The claim that 'considerable speed ups' were obtained is not accompanied by a measurement protocol, problem sizes, hardware details, or error bars, so the central performance claim cannot be evaluated from the given text.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their review. The single major comment concerns the level of detail in the abstract regarding performance claims. We address it below and will revise the manuscript accordingly.

Referee: [Abstract] The claim that 'considerable speed ups' were obtained is not accompanied by a measurement protocol, problem sizes, hardware details, or error bars, so the central performance claim cannot be evaluated from the given text.

Authors: We agree that the abstract does not supply enough context on its own for a reader to evaluate the performance claims. The body of the manuscript contains the full experimental protocol, problem sizes, hardware specifications, and results (including variability measures), but the abstract should be improved to include representative details of these elements. We will revise the abstract to incorporate key information on the measurement protocol, problem sizes, hardware, and error bars/variability.

Revision: yes

Circularity Check

0 steps flagged

No significant circularity; implementation report with empirical results

full rationale

The manuscript is an engineering report describing the addition of an OPS backend to Devito for GPU targeting, with the central claim being the observed speedups from that integration. No equations, derivations, fitted parameters, or predictions appear in the provided text. The work contains no self-citation chains, ansatzes, or uniqueness theorems that could reduce to inputs by construction. The result is therefore self-contained as a description of implementation outcomes rather than a mathematical argument.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No mathematical model, free parameters, or new physical entities are introduced; the work is a software integration report.

pith-pipeline@v0.9.0 · 5589 in / 1029 out tokens · 16792 ms · 2026-05-25T15:21:49.280751+00:00 · methodology

Reference graph

Works this paper leans on

27 extracted references · 27 canonical work pages · 1 internal anchor

[1] F. Rathgeber, D. A. Ham, L. Mitchell, M. Lange, F. Luporini, A. T. T. McRae, G.-T. Bercea, G. R. Markall, and P. H. J. Kelly, “Firedrake: Automating the finite element method by composing abstractions,” ACM Trans. Math. Softw., vol. 43, no. 3, pp. 24:1–24:27, 2016.
[2] M. S. Alnæs, J. Blechta, J. Hake, A. Johansson, B. Kehlet, A. Logg, C. Richardson, J. Ring, M. E. Rognes, and G. N. Wells, “The fenics project version 1.5,” Archive of Numerical Software, vol. 3, no. 100, 2015.
[3] M. S. Alnæs, A. Logg, K. B. Ølgaard, M. E. Rognes, and G. N. Wells, “Unified form language: A domain-specific language for weak formulations of partial differential equations,” ACM Transactions on Mathematical Software, vol. 40, no. 2, 2014.
[4] M. Louboutin, M. Lange, F. Luporini, N. Kukreja, P. A. Witte, F. J. Herrmann, P. Velesko, and G. J. Gorman, “Devito: an embedded domain-specific language for finite differences and geophysical exploration,” CoRR, vol. abs/1808.01995, Aug 2018.
[5] C. Yount, J. Tobin, A. Breuer, and A. Duran, “Yask—yet another stencil kernel: A framework for hpc stencil code-generation and tuning,” 2016 Sixth International Workshop on Domain-Specific Languages and High-Level Frameworks for High Performance Computing (WOLFHPC), pp. 30–39, 2016.
[6] I. Z. Reguly, G. R. Mudalige, M. B. Giles, D. Curran, and S. McIntosh-Smith, “The ops domain specific abstraction for multi-block structured grid computations,” in Proceedings of the Fourth International Workshop on Domain-Specific Languages and High-Level Frameworks for High Performance Computing, WOLFHPC ’14, (Piscataway, NJ, USA), pp. 58–67, IEEE Pr...
[7] A. Meurer, C. P. Smith, M. Paprocki, O. Čertík, S. B. Kirpichev, M. Rocklin, A. Kumar, S. Ivanov, J. K. Moore, S. Singh, T. Rathnayake, S. Vig, B. E. Granger, R. P. Muller, F. Bonazzi, H. Gupta, S. Vats, F. Johansson, F. Pedregosa, M. J. Curry, A. R. Terrel, Š. Roučka, A. Saboo, I. Fernando, S. Kulal, R. Cimrman, and A. Scopatz, “Sympy: symbolic computi...
[8] “Devito cfd tutorial series.” https://nbviewer.jupyter.org/github/opesci/devito/blob/master/examples/cfd/01_convection.ipynb. Accessed: 24th Jan 2019.
[9] F. Luporini, M. Lange, M. Louboutin, N. Kukreja, J. Hückelheim, C. Yount, P. A. Witte, P. H. J. Kelly, G. J. Gorman, and F. J. Herrmann, “Architecture and performance of devito, a system for automated stencil computation,” CoRR, vol. abs/1807.03032, 2018.
[10] “Cgen - c/c++ source generation from an ast.” https://github.com/inducer/cgen. Accessed: 25th Jan 2019.
[11] C. Yount, “Vector folding: Improving stencil performance via multi-dimensional simd-vector representation,” in 2015 IEEE 17th International Conference on High Performance Computing and Communications, 2015 IEEE 7th International Symposium on Cyberspace Safety and Security, and 2015 IEEE 12th International Conference on Embedded Software and Systems, pp...
[12] C. Yount, A. Duran, and J. Tobin, “Multi-level spatial and temporal tiling for efficient hpc stencil computation on many-core processors with large shared caches,” Future Generation Computer Systems, vol. 92, pp. 903–919, 2019.
[13] A. Klöckner, “Loo.py: transformation-based code generation for GPUs and CPUs,” CoRR, vol. abs/1405.7470, 2014.
[14] S. Verdoolaege, “isl: An integer set library for the polyhedral model,” in Mathematical Software – ICMS 2010 (K. Fukuda, J. v. d. Hoeven, M. Joswig, and N. Takayama, eds.), (Berlin, Heidelberg), pp. 299–302, Springer Berlin Heidelberg, 2010.
[15] D. Unat, X. Cai, and S. B. Baden, “Mint: Realizing cuda performance in 3d stencil methods with annotated c,” pp. 214–224, 01 2011.
[16] B. Hagedorn, L. Stoltzfus, M. Steuwer, S. Gorlatch, and C. Dubach, “High performance stencil code generation with lift,” in Proceedings of the 2018 International Symposium on Code Generation and Optimization, CGO 2018, (New York, NY, USA), pp. 100–112, ACM, 2018.
[17] M. Steuwer, T. Remmelg, and C. Dubach, “Lift: A functional data-parallel ir for high-performance gpu code generation,” in 2017 IEEE/ACM International Symposium on Code Generation and Optimization (CGO), pp. 74–85, Feb 2017.
[18] V. Mickus and V. Pandolfo, “Ops expressions translation #760.” https://github.com/opesci/devito/pull/760. Accessed: 2nd Jun 2019.
[19] A. Kloeckner, “codepy.” Accessed: 7th June 2019.
[20] “C-types foreign function interface (numpy.ctypeslib).” https://docs.scipy.org/doc/numpy/reference/routines.ctypeslib.html. Accessed: 10th June 2019.
[21] “Pep 373 python 2.7 release schedule.” https://legacy.python.org/dev/peps/pep-0373/. Accessed: 7th June 2019.
[22] NVIDIA, “Geforce gtx 1080 | specifications.” https://www.geforce.co.uk/hardware/desktop-gpus/geforce-gtx-1080/specifications. Accessed: 6th June 2019.
[23] “Azure linux vm sizes - hpc | microsoft docs.” https://docs.microsoft.com/en-us/azure/virtual-machines/linux/sizes-hpc. Accessed: 13th June 2019.
[24] “opescibench.” https://github.com/opesci/opescibench. Accessed: 6th June 2019.
[25] S. Williams, A. Waterman, and D. Patterson, “Roofline: An insightful visual performance model for floating-point programs and multicore architectures,” tech. rep., Lawrence Berkeley National Lab. (LBNL), Berkeley, CA (United States), 2009.
[26] J. J. Dongarra, “Performance of various computers using standard linear equations software,” SIGARCH Comput. Archit. News, vol. 20, pp. 22–44, June 1992.
[27] L. Nyland, M. Harris, and J. Prins, “Fast n-body simulation with cuda,” GPU Gems, vol. 3, pp. 677–695, 01 2009.
