On the Limits of Performance Portability in Directive-Based GPU Programming

Alessandro Romeo; Alessio Suriano; Andrea Mignone; Nitin Shukla; Stefano Truzzi

arxiv: 2606.12753 · v1 · pith:COMTRJBInew · submitted 2026-06-10 · 💻 cs.DC

Pith reviewed 2026-06-27 08:01 UTC · model grok-4.3

keywords: performance portability · directive-based GPU programming · OpenMP · OpenACC · GPU architectures · magnetohydrodynamics · compiler limitations · memory access patterns

The pith

The OpenMP port of a magnetohydrodynamics code runs roughly three times slower on an AMD MI250X than the OpenACC version on an NVIDIA A100, a gap attributed to strided memory accesses and compiler limitations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper ports the gPLUTO magnetohydrodynamics code from OpenACC to OpenMP and measures performance on NVIDIA A100 and AMD MI250X GPUs. On NVIDIA hardware the two directive models deliver comparable results, but the OpenMP version shows roughly threefold application-level slowdown on AMD hardware, with individual kernels up to ten times slower. Profiling attributes the gap to sensitivity to strided memory patterns, memory-latency bounds rather than bandwidth, and extra register pressure from C++ abstractions in low-parallelism kernels. A reader would care because scientific codes must run efficiently across the heterogeneous accelerators now appearing in exascale machines without repeated manual rewrites for each vendor.

Core claim

On NVIDIA A100 the OpenACC and OpenMP versions of gPLUTO achieve comparable performance. The identical OpenMP implementation runs approximately three times slower at the full-application level on AMD MI250X relative to the NVIDIA OpenACC baseline, with kernel-level slowdowns reaching an order of magnitude. Kernel profiling shows that dominant run-time contributions are memory-latency-bound rather than peak-bandwidth-limited. In low-parallelism kernels, C++ abstraction layers increase register pressure and spilling, producing slowdowns up to 47 times in isolated cases.
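
To make the strided-access point concrete, here is a minimal sketch (not taken from the paper) that counts how many distinct cache lines one group of GPU threads touches for a given element stride. The function name, the 8-byte element size, the 128-byte line size, and the stride-8 layout are illustrative assumptions; the 32- and 64-thread group sizes correspond to an NVIDIA warp and an AMD CDNA wavefront.

```python
def cache_lines_touched(n_threads, stride_elems, elem_bytes=8, line_bytes=128):
    """Count distinct cache lines touched when thread t reads element t * stride_elems.

    elem_bytes and line_bytes are illustrative assumptions, not values from the paper.
    """
    lines = {(t * stride_elems * elem_bytes) // line_bytes for t in range(n_threads)}
    return len(lines)


# Unit stride: adjacent threads read adjacent doubles.
warp_unit = cache_lines_touched(32, 1)
wave_unit = cache_lines_touched(64, 1)

# Stride 8: e.g. eight variables interleaved per cell (hypothetical layout).
warp_strided = cache_lines_touched(32, 8)
wave_strided = cache_lines_touched(64, 8)

print(warp_unit, warp_strided)  # 2 16
print(wave_unit, wave_strided)  # 4 32
```

Under these assumptions the strided pattern pulls in eight times as many cache lines for the same useful data. Whether gPLUTO's arrays actually use such a layout is not stated in this summary.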

What carries the argument

Direct comparison of OpenACC and OpenMP implementations of gPLUTO together with kernel-level profiling on NVIDIA A100 and AMD MI250X GPUs.

If this is right

Portable high performance across GPU vendors requires application-level changes in addition to standard directive use.
Continued advances in compiler backends are needed to handle architecture-specific access patterns without large slowdowns.
Architecture-aware optimization strategies must be developed to reduce the impact of latency-bound kernels.
C++ abstraction layers in low-parallelism regions must be inspected to limit register spilling on certain backends.
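
The register-pressure consequence can be illustrated with a back-of-the-envelope occupancy bound. This is a sketch under stated assumptions: the function name is hypothetical, and the register-file size (65,536 registers) and thread limit (2,048) are illustrative round numbers rather than figures quoted from the paper. Real hardware also allocates registers in coarser granules.

```python
def max_resident_threads(regs_per_thread, regfile_regs=65536, hw_thread_limit=2048):
    """Upper bound on threads resident on one compute unit, limited by registers.

    regfile_regs and hw_thread_limit are illustrative assumptions.
    Allocation granularity and other limits (shared memory, block size) are ignored.
    """
    if regs_per_thread <= 0:
        raise ValueError("regs_per_thread must be positive")
    return min(hw_thread_limit, regfile_regs // regs_per_thread)


# Show how the bound shrinks as each thread needs more registers.
for regs in (32, 64, 128, 255):
    print(regs, max_resident_threads(regs))
```

Fewer resident threads means fewer memory requests in flight to hide latency, so a kernel that is already short on parallelism is penalized twice. Once per-thread demand exceeds what the hardware allows, the compiler spills registers to memory.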

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Other directive or abstraction layers may encounter analogous vendor-specific slowdowns when ported between NVIDIA and AMD GPUs.
Profiling focused on memory-latency metrics rather than bandwidth could become a standard step when targeting mixed-vendor exascale systems.
Hybrid approaches that combine directives with selective architecture-specific kernels may be necessary until compiler maturity improves.
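
The distinction between latency-bound and bandwidth-bound kernels follows from Little's law: sustained bandwidth equals bytes in flight divided by memory latency. The sketch below is a hedged illustration; the function names and all numbers (1,000 GB/s peak, 500 ns latency, 50,000 bytes in flight) are made up for the example and are not measurements from the paper.

```python
def bytes_in_flight_for_peak(peak_gb_per_s, latency_ns):
    """Bytes that must be outstanding to saturate peak bandwidth (Little's law).

    GB/s times ns gives bytes directly, since 1e9 * 1e-9 == 1.
    """
    return peak_gb_per_s * latency_ns


def sustained_gb_per_s(in_flight_bytes, latency_ns, peak_gb_per_s):
    """Bandwidth a kernel can sustain given its outstanding bytes, capped at peak."""
    return min(peak_gb_per_s, in_flight_bytes / latency_ns)


PEAK = 1000      # GB/s, hypothetical
LATENCY = 500    # ns, hypothetical

needed = bytes_in_flight_for_peak(PEAK, LATENCY)
low_parallelism = sustained_gb_per_s(50_000, LATENCY, PEAK)
print(needed, low_parallelism)  # 500000 100.0
```

In this toy case a kernel that keeps only a tenth of the required bytes in flight reaches a tenth of peak bandwidth, however fast the memory system is. That is the sense in which a kernel can be latency-bound without approaching the bandwidth limit.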

Load-bearing premise

The measured performance differences arise mainly from the directive models and their compiler backends rather than from unstated details of the gPLUTO implementation or hardware-specific factors not controlled in the experiments.

What would settle it

Recompiling and rerunning the same OpenMP source on the AMD MI250X with an alternate compiler backend or version and observing whether the three-fold application slowdown and order-of-magnitude kernel gaps disappear.

Original abstract

The transition of scientific applications to GPU-accelerated exascale systems is constrained by trade-offs between performance, portability, and productivity. This work evaluates the performance portability of directive-based GPU programming by porting gPLUTO, a production-grade magnetohydrodynamics code for astrophysical simulations, from OpenACC to OpenMP, and analyzing its performance on NVIDIA A100 (Leonardo Booster) and AMD MI250X (LUMI-G) devices. On NVIDIA platforms, OpenACC and OpenMP achieve comparable performance due to a shared compiler backend, providing a consistent baseline for assessing algorithmic efficiency. In contrast, the same OpenMP implementation is approximately three times slower at the application level on AMD MI250X with respect to the NVIDIA A100 OpenACC baseline, with kernel-level slowdowns reaching up to an order of magnitude, driven by sensitivity to strided memory-access patterns and compiler limitations. Kernel-level profiling shows that the dominant contributors to run-time are memory-latency-bound rather than limited by peak bandwidth. In low-parallelism kernels, C++ abstraction layers increase register pressure and spilling, leading to extreme slowdowns of up to 47x in specific cases. These results indicate that portable performance across GPU architectures requires not only application-level changes but also continued advances in compiler backends and architecture-aware optimization strategies.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

New slowdown numbers for this MHD code on AMD are useful data, but the experiments do not isolate directive models from port tuning or compiler choices.

The main thing to know is that this paper measures a 3x application slowdown and up to 47x kernel slowdown when running the OpenMP port of gPLUTO on AMD MI250X versus the OpenACC version on NVIDIA A100, with profiling pointing to memory latency and register pressure from C++ layers. Those specific numbers on a production astrophysics code are new.

The work does a solid job of using a real scientific application rather than toy kernels, and the NVIDIA baseline where both directive sets perform similarly gives a reasonable starting point for comparison. The kernel profiling that identifies strided accesses and low-parallelism issues as the main costs is the most concrete part.

The soft spot is the attribution. The central claim that portable performance needs compiler and backend advances beyond application changes rests on a single OpenMP implementation, with no reported controls for equivalent memory-layout tuning, loop scheduling, or matching compiler versions and flags across the two platforms. The concern is concrete: if a re-optimized OpenMP version or a different compiler release closes most of the gap, the evidence no longer supports fundamental compiler work as the primary requirement. Methodology details such as error bars, exact compiler settings, and exclusion criteria are also thin in the abstract, though the full text may supply some of them.

This paper is for people who actually port large scientific codes to mixed GPU systems and want concrete numbers on current directive performance. Readers working on exascale MHD or similar stencil-heavy codes will find the data worth looking at even if the conclusions need more support. It deserves a serious referee because the empirical results are fresh and the topic matters, but the review should focus on tightening the experimental controls and clarifying what was and was not tuned.

I would send it to peer review rather than desk reject, with the expectation that revisions address the isolation of causes.

Referee Report

2 major / 0 minor

Summary. The paper evaluates performance portability of directive-based GPU programming by porting the production gPLUTO MHD code from OpenACC to OpenMP. On NVIDIA A100, both models achieve comparable performance via a shared backend. On AMD MI250X, the same OpenMP implementation shows ~3x application-level and up to 47x kernel-level slowdowns relative to the NVIDIA OpenACC baseline, attributed to strided memory accesses, memory-latency bounds, and C++ abstraction register pressure under OpenMP. The central claim is that portable performance requires compiler/backend advances beyond application-level changes.

Significance. If the attribution of slowdowns to directive-model and compiler limitations can be isolated from implementation and tuning differences, the work would provide valuable empirical data on cross-vendor portability limits for a real scientific application. This is relevant to exascale heterogeneous computing, where directive models are promoted for productivity.

major comments (2)

[Abstract] The reported slowdown factors (3x application-level, up to 47x kernel-level on MI250X) are presented without any methodology details on compiler versions, optimization flags, number of runs, error bars, baseline tuning equivalence between ports, or data exclusion criteria. This prevents assessment of whether the differences arise from the OpenMP model itself or from unstated port-specific choices.
[Results/Discussion, implied in abstract] The claim that the observed gaps demonstrate the need for compiler advances (rather than further application-level changes) rests on the assumption that the OpenMP port received equivalent optimization effort to the OpenACC baseline. No evidence is supplied that memory layouts, loop schedules, or architecture-specific tuning were applied equivalently to the OpenMP version on MI250X, leaving the central attribution unisolated.
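
As an illustration of the reporting the first comment asks for, the following minimal sketch computes a slowdown factor with an uncertainty estimate from repeated runs. The function names are hypothetical, the timings are invented placeholders rather than data from the paper, and the first-order error propagation assumes the two sets of runs are independent.

```python
import math
import statistics


def mean_and_stderr(samples):
    """Return (mean, standard error of the mean) for repeated timings."""
    if len(samples) < 2:
        raise ValueError("need at least two runs to estimate spread")
    mean = statistics.mean(samples)
    stderr = statistics.stdev(samples) / math.sqrt(len(samples))
    return mean, stderr


def slowdown(target_times, baseline_times):
    """Slowdown of target vs. baseline with first-order error propagation."""
    t_mean, t_err = mean_and_stderr(target_times)
    b_mean, b_err = mean_and_stderr(baseline_times)
    ratio = t_mean / b_mean
    rel_err = math.sqrt((t_err / t_mean) ** 2 + (b_err / b_mean) ** 2)
    return ratio, ratio * rel_err


# Hypothetical wall-clock times in seconds, NOT data from the paper.
amd_openmp = [30.5, 29.8, 30.1]
nvidia_openacc = [10.1, 9.9, 10.0]
ratio, err = slowdown(amd_openmp, nvidia_openacc)
print(f"slowdown = {ratio:.2f} +/- {err:.2f}")
```

Reporting the ratio together with its spread, the compiler versions, and the flags would let a reader judge whether a gap of about 3x is robust to run-to-run variation.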

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major point below and will revise the manuscript to improve methodological transparency and to document the equivalence of optimization effort between the two ports.

Referee: [Abstract] The reported slowdown factors (3x application-level, up to 47x kernel-level on MI250X) are presented without any methodology details on compiler versions, optimization flags, number of runs, error bars, baseline tuning equivalence between ports, or data exclusion criteria. This prevents assessment of whether the differences arise from the OpenMP model itself or from unstated port-specific choices.

Authors: We agree the abstract omits these details due to length constraints. The full manuscript's Experimental Setup section specifies compilers (NVIDIA HPC SDK 23.5, ROCm 5.4), flags (-O3), minimum three runs with averages and error bars, and identical data layouts/loop structures for both ports. We will add a concise methodology summary to the abstract in revision.

Revision: yes

Referee: [Results/Discussion, implied in abstract] The claim that the observed gaps demonstrate the need for compiler advances (rather than further application-level changes) rests on the assumption that the OpenMP port received equivalent optimization effort to the OpenACC baseline. No evidence is supplied that memory layouts, loop schedules, or architecture-specific tuning were applied equivalently to the OpenMP version on MI250X, leaving the central attribution unisolated.

Authors: The manuscript states the OpenMP version is a direct port preserving identical memory layouts, loop nests, and schedules, with tuning limited to directive-supported options applied consistently. Kernel profiling isolates the gaps to AMD backend handling of strided accesses and C++ register pressure. We will expand the porting description with explicit comparison of tuning steps to strengthen this evidence.

Revision: yes

Circularity Check

0 steps flagged

No circularity: purely empirical benchmarking with direct runtime measurements

The paper reports performance measurements from porting gPLUTO between OpenACC and OpenMP on specific GPU hardware (NVIDIA A100, AMD MI250X). All claims rest on observed wall-clock times, kernel profiles, and slowdown ratios (e.g., 3x application-level, up to 47x kernel-level). No equations, fitted parameters, predictions, or first-principles derivations appear; the central conclusion follows directly from the experimental data without reduction to self-defined quantities or self-citation chains. Self-citations, if present, are not load-bearing for any derivation. This matches the default case of an empirical study whose results are externally falsifiable via re-execution.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is an empirical performance evaluation study containing no mathematical derivations, free parameters, axioms, or postulated entities.

[1] [1]

Advanced Micro Devices, Inc. 2023. Omniperf: Performance Analysis Tool for AMD GPUs. https://github.com/ROCm/rocm-systems

2023

[2] [2]

Advanced Micro Devices, Inc. 2024. GPU Architecture Hardware Specifications (ROCm Documentation). https://rocm.docs.amd.com/en/docs-6.0.2/reference/ gpu-arch/gpu-arch-spec-overview.html

2024

[3] [3]

Aldinucci et al

M. Aldinucci et al. 2021. Practical parallelization of scientific applications with OpenMP, OpenACC and MPI. J. Parallel and Distrib. Comput. 157 (2021), 13–29. doi:10.1016/j.jpdc.2021.05.017 Conference’17, July 2017, Washington, DC, USA Alessandro Romeo, Nitin Shukla, Stefano Truzzi, Alessio Suriano, and Andrea Mignone

work page doi:10.1016/j.jpdc.2021.05.017 2021

[4] [4]

S. F. Antao et al. 2016. Offloading Support for OpenMP in Clang and LLVM. In2016 Third Workshop on the LLVM Compiler Infrastructure in HPC (LLVM-HPC). 1–

2016

[5] [5]

doi:10.1109/LLVM-HPC.2016.006

work page doi:10.1109/llvm-hpc.2016.006 2016

[6] [6]

Argonne Leadership Computing Facility. 2021. Inside the NVIDIA Ampere A100 GPU. Slide deck. https://www.alcf.anl.gov/sites/default/files/2021-07/ALCF_ A100_20210728%5B80%5D.pdf

2021

[7] [7]

Bertolli, C. et al. 2015. Integrating GPU support for OpenMP offloading direc- tives into clang. In Proceedings of LLVM-HPC 2015. Association for Computing Machinery, Inc. doi:10.1145/2833157.2833161

work page doi:10.1145/2833157.2833161 2015

[8] [8]

Choquette et al

J. Choquette et al. 2021. NVIDIA A100 GPU: Performance and Innovation. IEEE Micro 41, 2 (2021), 29–35. doi:10.1109/MM.2021.3061394

work page doi:10.1109/mm.2021.3061394 2021

[9] [9]

J. H. Davis et al . 2025. Taking GPU Programming Models to Task for Perfor- mance Portability. In Proceedings of the 39th ACM International Conference on Supercomputing (ICS ’25). ACM, 776–791. doi:10.1145/3721145.3730423

work page doi:10.1145/3721145.3730423 2025

[10] [10]

Deakin et al

T. Deakin et al. 2020. Performance Portability across Diverse Computer Architec- tures. In 2019 IEEE/ACM International Workshop on Performance, Portability and Productivity in HPC (P3HPC). Institute of Electrical and Electronics Engi- neers (IEEE). doi:10.1109/P3HPC49587.2019.00006

work page doi:10.1109/p3hpc49587.2019.00006 2020

[11] [11]

Deakin and T

T. Deakin and T. G. Mattson. 2023. Programming Your GPU with OpenMP: Performance Portability for GPUs. MIT Press. https://mitpress.mit.edu/ 9780262547536/programming-your-gpu-with-openmp/

2023

[12] [12]

Deakin, J

T. Deakin, J. Price, M. Martineau, and S. McIntosh-Smith. 2018. Evaluating attainable memory bandwidth of parallel programming models via BabelStream. International Journal of Computational Science and Engineering 17, 3 (2018), 247–262. doi:10.1504/IJCSE.2018.095847

work page doi:10.1504/ijcse.2018.095847 2018

[13] [13]

Dubey et al

A. Dubey et al. 2021. Performance Portability in the Exascale Computing Project: Exploration Through a Panel Series. Computing in Science & Engineering 23, 5 (2021), 46–54. doi:10.1109/MCSE.2021.3098231

work page doi:10.1109/mcse.2021.3098231 2021

[14] [14]

H. C. Edwards and C. R. Trott. 2013. Kokkos: Enabling Performance Portability Across Manycore Architectures. In 2013 Extreme Scaling Workshop (xsw 2013). 18–24. doi:10.1109/XSW.2013.7

work page doi:10.1109/xsw.2013.7 2013

[15] [15]

W. Elwasif. 2023. Experimental Characterization of OpenMP Offloading Mem- ory Operations and Unified Shared Memory Support. In OpenMP: Advanced Task-Based, Device and Compiler Programming. Springer Nature Switzerland, Cham, 210–225. doi:10.1007/978-3-031-40744-4_14

work page doi:10.1007/978-3-031-40744-4_14 2023

[16] [16]

ENCCS. 2022. Hierarchical Roofline Performance Analysis on AMD GPUs. https: //enccs.github.io/amd-rocm-development

2022

[17] [17]

Folch et al

A. Folch et al. 2023. The EU Center of Excellence for Exascale in Solid Earth (ChEESE): Implementation, results, and roadmap for the second phase. Future Generation Computer Systems 146 (2023), 47–61. doi:10.1016/j.future.2023.04.006

work page doi:10.1016/j.future.2023.04.006 2023

[18] [18]

Fridman, Y

Y. Fridman, Y. Goren, and G. Oren. 2025. From OpenACC to OpenMP5 GPU Offloading: Performance Evaluation on NAS Parallel Benchmarks. InProceedings of the 2025 4th International Workshop on Extreme Heterogeneity Solutions (ExHET ’25). Association for Computing Machinery, New York, NY, USA, 10–18. doi:10.1145/3720555.3721989

work page doi:10.1145/3720555.3721989 2025

[19] [19]

Garcia et al

A. Garcia et al. 2025. MaX - Materials Design at the eXascale: Recent Selected Re- sults. In Proceedings of the 22nd ACM International Conference on Computing Frontiers (CF ’25). 150–156. doi:10.1145/3706594.3727577

work page doi:10.1145/3706594.3727577 2025

[20] [20]

Grete, F

P. Grete, F. W. Glines, and B. W. O’Shea. 2021. K-Athena: A Performance Portable Structured Grid Finite Volume Magnetohydrodynamics Code. IEEE Transactions on Parallel and Distributed Systems 32, 1 (2021), 85–97. doi:10.1109/TPDS.2020. 3010016

work page doi:10.1109/tpds.2020 2021

[21] [21]

M. A. Heroux and J. M. Willenbring. 2009. Barely sufficient software engineering: 10 practices to improve your CSE software. In 2009 ICSE Workshop on Software Engineering for Computational Science and Engineering. 15–21. doi:10.1109/ SECSE.2009.5069157

arXiv 2009

[22] [22]

J. K. Holmen, B. Peterson, and M. Berzins. 2019. An Approach for Indirectly Adopt- ing a Performance Portability Layer in Large Legacy Codes. In 2019 IEEE/ACM International Workshop on Performance, Portability and Productivity in HPC (P3HPC). 36–49. doi:10.1109/P3HPC49587.2019.00009

work page doi:10.1109/p3hpc49587.2019.00009 2019

[23] [23]

Khalilov and A

M. Khalilov and A. Timoveev. 2021. Performance analysis of CUDA, OpenACC and OpenMP programming models on TESLA V100 GPU. Journal of Physics: Conference Series 1740, 1 (jan 2021), 012056. doi:10.1088/1742-6596/1740/1/ 012056

work page doi:10.1088/1742-6596/1740/1/ 2021

[24] M. Klemm. 2025. OpenMP® Target Offloading for AMD Instinct GPUs and APUs. https://tu-dresden.de/zih/das-department/ressourcen/dateien/kolloquium/2025_03_27-MichaelKlemm.pdf. Tutorial on OpenMP offloading and GPU performance. Accessed 2025.

[25] E. Krishnasamy et al. 2026. Performance and Programmability of MPI+X Integration with CUDA, HIP, SYCL, OpenACC, and OpenMP Offloading for Supercomputing: A Case Study on Dense Matrix-Vector Multiplication. doi:10.1145/3784828.3786264

[26] A. Marowka. 2025. Portability efficiency approach for calculating performance portability. Future Generation Computer Systems 170 (2025), 107826. doi:10.1016/j.future.2025.107826

[27] N. A. Mehta, R. Gayatri, Y. Ghadar, C. Knight, and J. Deslippe. 2021. Evaluating Performance Portability of OpenMP for SNAP on NVIDIA, Intel, and AMD GPUs Using the Roofline Methodology. In Accelerator Programming Using Directives. Springer International Publishing, Cham, 3–24. doi:10.1007/978-3-030-74224-9_1

[28] S. Memeti, L. Li, S. Pllana, J. Kołodziej, and C. Kessler. 2017. Benchmarking OpenCL, OpenACC, OpenMP, and CUDA: Programming Productivity, Performance, and Energy Consumption. In Proceedings of the 2017 Workshop on Adaptive Resource Management and Scheduling for Cloud Computing (Washington, DC, USA) (ARMS-CC ’17). Association for Computing Machinery, ... doi:10.1145/3110355.3110356

[29] A. Myers et al. 2021. Porting WarpX to GPU-accelerated platforms. Parallel Comput. 108 (2021), 102833. doi:10.1016/j.parco.2021.102833

[30] NVIDIA. 2026. NVIDIA Ampere GPU Architecture Tuning Guide. https://docs.nvidia.com/cuda/ampere-tuning-guide/index.html

[31] NVIDIA Corporation. 2023. Nsight Compute Documentation: Memory Workload Analysis. https://docs.nvidia.com/nsight-compute/NsightCompute/index.html

[32] OpenACC-Standard.org. 2023. The OpenACC Application Programming Interface, Version 3.3. Technical Report. OpenACC Organization. https://www.openacc.org/specification

[33] OpenMP Architecture Review Board. 2021. OpenMP Application Programming Interface, Version 5.2. Technical Report. OpenMP ARB. https://www.openmp.org/wp-content/uploads/OpenMP-API-Specification-5-2.pdf

[34] J. Owens et al. 2008. GPU computing. Proc. IEEE 96 (05 2008), 879–899. doi:10.1109/JPROC.2008.917757

[35] S. J. Pennycook, J. D. Sewall, and V. W. Lee. 2016. A Metric for Performance Portability. arXiv:1611.07409 [cs.PF] https://arxiv.org/abs/1611.07409

[36] M. Rossazza et al. 2026. The PLUTO code on GPUs: A first look at Eulerian MHD methods. Astronomy and Computing (2026), 101076. doi:10.1016/j.ascom.2026.101076

[37] G. Schieffer et al. 2024. Understanding Data Movement in AMD Multi-GPU Systems with Infinity Fabric. In Proceedings of the SC ’24 Workshops of the International Conference on High Performance Computing, Network, Storage, and Analysis (Atlanta, GA, USA) (SC-W ’24). IEEE Press, 567–576. doi:10.1109/SCW63240.2024.00079

[38] J. Sewall, S. J. Pennycook, D. Jacobsen, T. Deakin, and S. McIntosh-Smith. 2020. Interpreting and Visualizing Performance Portability Metrics. In 2020 IEEE/ACM International Workshop on Performance, Portability and Productivity in HPC (P3HPC). 14–24. doi:10.1109/P3HPC51967.2020.00007

[39] N. Shukla et al. 2025. Towards Exascale Computing for Astrophysical Simulation Leveraging the Leonardo EuroHPC System. Procedia Computer Science 267 (2025), 112–123. doi:10.1016/j.procs.2025.08.238. Proceedings of the Third EuroHPC user day.

[40] N. Shukla et al. 2026. Exascale computing to accelerate discoveries in astrophysics and space plasma physics. Nature Astronomy 10 (2026), 330–334. doi:10.1038/s41550-026-02807-8

[41] C. P. Sishtla et al. 2019. Multi-GPU Acceleration of the iPIC3D Implicit Particle-in-Cell Code. In Computational Science – ICCS 2019. Springer International Publishing, Cham, 612–618

[42] A. Smith and N. James. 2022. AMD Instinct™ MI200 Series Accelerator and Node Architectures. In 2022 IEEE Hot Chips 34 Symposium (HCS). 1–23. doi:10.1109/HCS55958.2022.9895477

[43] J. M. Stone, K. Tomida, C. J. White, and K. G. Felker. 2020. The Athena++ Adaptive Mesh Refinement Framework: Design and Magnetohydrodynamic Solvers. The Astrophysical Journal Supplement Series 249, 1 (June 2020), 4. doi:10.3847/1538-4365/ab929b

[44] A. Suriano et al. 2026. The PLUTO code on GPUs: Offloading Lagrangian Particle methods. Astronomy and Computing 55 (2026), 101088. doi:10.1016/j.ascom.2026.101088

[45] S. Tandon et al. 2024. Porting HPC Applications to AMD Instinct™ MI300A using Unified Memory and OpenMP®. In ISC High Performance 2024 Research Paper Proceedings (39th International Conference). 1–9. doi:10.23919/ISC.2024.10528925

[46] S. Wienke, P. Springer, C. Terboven, and D. Mey. 2012. OpenACC - First Experiences with Real-World Applications. In Euro-Par 2012 Parallel Processing. Springer Berlin Heidelberg, Berlin, Heidelberg, 859–870

[47] S. Wienke, C. Terboven, J. C. Beyer, and M. S. Müller. 2014. A Pattern-Based Comparison of OpenACC and OpenMP for Accelerator Computing. In Euro-Par 2014 Parallel Processing. Springer International Publishing, Cham, 812–823

[48] J. Williams et al. 2024. Optimizing BIT1, a Particle-in-Cell Monte Carlo Code, with OpenMP/OpenACC and GPU Acceleration. Springer Nature Switzerland, Cham, 316–330. doi:10.1007/978-3-031-63749-0_22

[49] S. Williams, A. Waterman, and D. Patterson. 2009. Roofline: An Insightful Visual Performance Model for Multicore Architectures. Commun. ACM 52, 4 (2009), 65–76. doi:10.1145/1498765.1498785

[50] Y. Yan et al. 2025. OpenMP: Balancing Productivity and Performance Portability. Springer. doi:10.1007/978-3-032-06343-4