2 Pith papers cite this work.
fields: cs.DC (2)
years: 2026 (2)
verdicts: UNVERDICTED (2)
representative citing papers
- On the Limits of Performance Portability in Directive-Based GPU Programming
The OpenMP port of gPLUTO performs comparably to OpenACC on NVIDIA GPUs, but on the AMD MI250X it is 3x slower at the application level and up to 10x slower at the kernel level, driven by strided memory accesses, latency bounds, and C++ abstraction overheads.
- Assessing Performance and Porting Strategies for Gravitational N-Body Simulations on the RISC-V-Based Tenstorrent Wormhole™
Three scaling strategies for an N-body code on Tenstorrent Wormhole accelerators are compared via execution time and energy measurements, identifying the configuration with the best efficiency-performance balance.