Multi-GPU Hybrid Particle-in-Cell Monte Carlo Simulations for Exascale Computing Systems

Ales Podolnik; Allen D. Malony; David Tskhakaya; Erwin Laure; Frank Jenko; Jakub Hromadka; Jeremy J. Williams; Jonah Ekelund; Jordy Trilaksono; Leon Kos

arXiv: 2603.24508 · v3 · pith:PCW2GJ3Lnew · submitted 2026-03-25 · physics.plasm-ph · cs.DC · cs.PF


Jeremy J. Williams, Jordy Trilaksono, Stefan Costea, Yi Ju, Luca Pennati, Jonah Ekelund, David Tskhakaya, Leon Kos, Ales Podolnik, Jakub Hromadka, Allen D. Malony, Sameer Shende, Tilman Dannert, Frank Jenko, Erwin Laure, Stefano Markidis


Pith reviewed 2026-05-15 00:29 UTC · model grok-4.3

classification physics.plasm-ph · cs.DC · cs.PF

keywords Particle-in-Cell Monte Carlo · Multi-GPU · Exascale computing · Hybrid MPI OpenMP · Plasma physics · OpenMP target tasks · Frontier supercomputer · openPMD ADIOS2


The pith

A hybrid MPI+OpenMP implementation scales BIT1 Particle-in-Cell Monte Carlo simulations to 16,000 GPUs on exascale systems.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a portable multi-GPU version of the BIT1 code that combines MPI with OpenMP target tasks to run plasma simulations across many accelerators. It reduces data movement through persistent device memory, a one-dimensional data layout, pinned host memory, and direct GPU access while overlapping computation and communication via explicit task dependencies. Standardized I/O comes from openPMD and ADIOS2 libraries. Tests on systems including the Frontier supercomputer show gains in speed, scaling, and GPU utilization for large PIC MC runs. A sympathetic reader would care because these changes make previously limited plasma physics calculations practical on the largest available machines.

Core claim

The authors present a portable, multi-GPU hybrid MPI+OpenMP implementation of BIT1 that uses OpenMP target tasks with explicit dependencies to overlap computation and communication. Portability across Nvidia and AMD accelerators is achieved through persistent device-resident memory, an optimized contiguous one-dimensional data layout, a shift to pinned host memory, GPU Direct Memory Access, and runtime interoperability for direct device-pointer access. Standardized I/O via openPMD and ADIOS2 supports efficient file operations and in-situ analysis. Performance results on pre-exascale and exascale systems, including Frontier with up to 16,000 GPUs, show significant improvements in run time, scalability, and resource utilization.

What carries the argument

OpenMP target tasks with explicit dependencies that overlap computation and communication across multiple devices, supported by persistent device-resident memory and optimized data layouts.
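The dependency-driven overlap can be illustrated without a GPU. The sketch below is a host-only Python analogy, not the paper's OpenMP code: three single-worker thread pools stand in for a host-to-device copy engine, a compute device, and a device-to-host copy engine, and each stage waits explicitly on the future produced by the stage before it, loosely mirroring a task dependency. The function names (`stage_in`, `push`, `stage_out`, `run_pipeline`), the chunking, and the free-streaming update `x + v * dt` are all assumptions made for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def stage_in(chunk):
    # Stand-in for a host-to-device copy: hand back a private copy of the chunk.
    return list(chunk)


def push(staged, dt):
    # Stand-in for the particle mover. Waiting on `staged` is the explicit
    # dependency on the copy-in of the same chunk.
    particles = staged.result()
    return [(x + v * dt, v) for x, v in particles]


def stage_out(moved):
    # Stand-in for a device-to-host copy. Waiting on `moved` is the explicit
    # dependency on the mover for the same chunk.
    return list(moved.result())


def run_pipeline(chunks, dt):
    # One worker per "engine": stages of different chunks may overlap in time,
    # while each chunk still passes through the three stages in order.
    with ThreadPoolExecutor(1) as h2d, ThreadPoolExecutor(1) as dev, ThreadPoolExecutor(1) as d2h:
        finished = []
        for chunk in chunks:
            staged = h2d.submit(stage_in, chunk)
            moved = dev.submit(push, staged, dt)
            finished.append(d2h.submit(stage_out, moved))
        return [f.result() for f in finished]


# Hypothetical (position, velocity) pairs split into three chunks.
chunks = [[(0.0, 1.0), (1.0, -1.0)], [(2.0, 0.5)], [(3.0, 0.0), (4.0, 2.0)]]
result = run_pipeline(chunks, dt=0.5)
print(result)
```

Using separate pools for copy-in, compute, and copy-out avoids a stage blocking on a dependency that is queued behind it in the same pool, which is the host-side counterpart of keeping transfers and kernels on independent queues.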

Load-bearing premise

The described memory, layout, and task optimizations preserve the numerical accuracy and physical correctness of the original BIT1 simulation.

What would settle it

A side-by-side run of the original BIT1 and the new implementation on an identical small test case, followed by direct comparison of particle position and velocity distributions or electromagnetic field values for any measurable differences.
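One way such a comparison might be scripted is with a two-sample Kolmogorov-Smirnov statistic on the velocity samples from each run. The sketch below is an assumption about how the check could look, not something the paper describes: the samples are synthetic Gaussians standing in for real BIT1 output, and `ks_statistic` and `ks_threshold` are hypothetical helper names. A statistical test is used here on the assumption that random-number streams or floating-point reduction order may differ between versions, so bitwise equality may be too strict.

```python
import math
import random


def ks_statistic(a, b):
    """Largest gap between the empirical CDFs of two samples."""
    a, b = sorted(a), sorted(b)
    i = j = 0
    d = 0.0
    while i < len(a) and j < len(b):
        x = min(a[i], b[j])
        # Advance both samples past x so that ties are handled together.
        while i < len(a) and a[i] <= x:
            i += 1
        while j < len(b) and b[j] <= x:
            j += 1
        d = max(d, abs(i / len(a) - j / len(b)))
    return d


def ks_threshold(n, m, alpha=0.05):
    """Large-sample critical value for the two-sample KS test."""
    return math.sqrt(-0.5 * math.log(alpha / 2)) * math.sqrt((n + m) / (n * m))


rng = random.Random(1)
reference = [rng.gauss(0.0, 1.0) for _ in range(2000)]  # stand-in for the original run
candidate = [rng.gauss(0.0, 1.0) for _ in range(2000)]  # stand-in for the multi-GPU run
shifted = [v + 5.0 for v in candidate]                  # deliberately wrong run

d_same = ks_statistic(reference, candidate)
d_shift = ks_statistic(reference, shifted)
limit = ks_threshold(len(reference), len(candidate))
print(f"same physics: D = {d_same:.4f}; shifted: D = {d_shift:.4f}; limit = {limit:.4f}")
```

A statistic above the limit flags a difference worth investigating. A statistic below it is consistent with matching distributions but does not prove equivalence.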

Figures

Figures reproduced from arXiv: 2603.24508 by Ales Podolnik, Allen D. Malony, David Tskhakaya, Erwin Laure, Frank Jenko, Jakub Hromadka, Jeremy J. Williams, Jonah Ekelund, Jordy Trilaksono, Leon Kos, Luca Pennati, Sameer Shende, Stefan Costea, Stefano Markidis, Tilman Dannert, Yi Ju.

**Figure 1.** A diagram representing the PIC Method on HPC architectures. After initialization, the PIC method repeats at each time step. In gray, we highlight the particle mover step that we parallelize in the portable multi-GPU hybrid BIT1.
… simulation loop then executes four phases: (i) the particle mover, which updates particle positions and velocities; (ii) deposition to the grid, which maps particle charges and cur…

**Figure 2.** Ionization case function percentage breakdown (using gprof) on Dardel, showing where most of the execution time is spent for Original BIT1, openPMD BP4, and openPMD SST simulations [20,23,25]. The arrj sorting function (yellow) dominates but drops from 75.5% (Original BIT1) to 65.5% (BP4) and 35.5% (SST). As previously reported by Williams et al. [20,23,25], …

**Figure 3.** Hybrid BIT1 (Ionization Case) Total Simulation (Development Progression) strong scaling on 1 Node (4 MPI ranks & 4 GPUs) on MN5 ACC for 2K time steps.
5.3 Hybrid BIT1 (Minimal I/O & Diagnostics) up to 800 GPUs. Moving to the high-density sheath (production-like case), we evaluate the portable, multi-GPU hybrid MPI+OpenMP asynchronous version of BIT1 in both strong and weak scaling tests under minimal I/O a…

**Figure 4.** Hybrid BIT1 (Sheath) Total Simulation (Relative) Speed Up (left) and PE (right): strong and weak scaling up to 100 Nodes (up to 800 GPUs) on MN5 ACC, LUMI-G and Frontier for 2K time steps.
5.4 Hybrid BIT1 (Heavy I/O & Diagnostics) up to 16,000 GPUs. We evaluate the exascale readiness of the portable, multi-GPU hybrid MPI+OpenMP version of BIT1 in both strong and weak scaling tests under heavy I/O and dia…

**Figure 5.** Hybrid BIT1 (Sheath) Total Simulation (Relative) Speed Up (left) and PE (right): strong and weak scaling up to 2000 Nodes (up to 16,000 GPUs) on Frontier for 10K time steps.
For weak scaling, the corresponding PE at 2,000 nodes is 67.9% for the original hybrid BIT1 GPU version, while the openPMD GPU version with BP4 sustains 72.0% PE, and the openPMD GPU version with SST further improves this to 73.6% P…

**Figure 6.** A single time step Hybrid BIT1 AMD MI250X GPU activity trace showing HSA and HIP activity, ROCTX regions, asynchronous data copies, and mover kernel execution, obtained using rocprof and visualized with Perfetto on Dardel GPU, with corresponding confirmation traces on both LUMI-G and Frontier.
Future research will extend hybrid BIT1 to Intel GPU platforms at exascale, targeting Aurora, and Europe’s first …
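The speed-up and parallel-efficiency (PE) axes in Figures 4 and 5 can be read with the conventional definitions sketched below. The paper's exact formulas are not reproduced on this page, so the definitions here are an assumption, and the timings in the usage lines are made up purely to show the arithmetic. `strong_scaling` and `weak_scaling_efficiency` are hypothetical helper names.

```python
def strong_scaling(t_ref, n_ref, t_n, n):
    """Relative speed-up and PE for a fixed total problem size.

    t_ref: run time on the reference resource count n_ref
    t_n:   run time on n resources (nodes or GPUs)
    """
    speedup = t_ref / t_n
    efficiency = speedup / (n / n_ref)
    return speedup, efficiency


def weak_scaling_efficiency(t_ref, t_n):
    """PE when the problem size grows in proportion to the resource count."""
    return t_ref / t_n


# Made-up timings in seconds, for illustration only.
speedup, pe_strong = strong_scaling(t_ref=100.0, n_ref=1, t_n=2.0, n=100)
pe_weak = weak_scaling_efficiency(t_ref=100.0, t_n=125.0)
print(f"strong: speed-up {speedup:.1f}x, PE {pe_strong:.0%}; weak: PE {pe_weak:.0%}")
```

Under these definitions, ideal strong scaling gives a speed-up equal to the resource ratio and a PE of 100%, while ideal weak scaling keeps the run time constant.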

Original abstract

Particle-in-Cell (PIC) Monte Carlo (MC) simulations are central to plasma physics but face increasing challenges on heterogeneous HPC systems due to excessive data movement, synchronization overheads, and inefficient utilization of multiple accelerators. In this work, we present a portable, multi-GPU hybrid MPI+OpenMP implementation of BIT1 that enables scalable execution on both Nvidia and AMD accelerators through OpenMP target tasks with explicit dependencies to overlap computation and communication across devices. Portability is achieved through persistent device-resident memory, an optimized contiguous one-dimensional data layout, and a transition from unified to pinned host memory to improve large data-transfer efficiency, together with GPU Direct Memory Access (DMA) and runtime interoperability for direct device-pointer access. Standardized and scalable I/O is provided using openPMD and ADIOS2, supporting high-performance file I/O, in-memory data streaming, and in-situ analysis and visualization. Performance results on pre-exascale and exascale systems, including Frontier (OLCF-5) for up to 16,000 GPUs, demonstrate significant improvements in run time, scalability, and resource utilization for large-scale PIC MC simulations.
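To make the idea of a contiguous one-dimensional layout concrete, the sketch below packs per-cell particle lists into a single flat buffer with an offset table. This is a generic flattening scheme chosen for illustration. The page does not describe BIT1's actual array layout, so the nesting by cell, the offset table, and the names `flatten` and `cell_slice` are all assumptions.

```python
from array import array


def flatten(nested):
    """Pack per-cell particle values into one contiguous buffer plus offsets."""
    flat = array("d")          # one contiguous block of doubles
    offsets = [0]              # offsets[c] is where cell c starts in `flat`
    for cell in nested:
        flat.extend(cell)
        offsets.append(len(flat))
    return flat, offsets


def cell_slice(flat, offsets, cell):
    """Return the values belonging to one cell."""
    return flat[offsets[cell]:offsets[cell + 1]]


# Hypothetical particle positions grouped by grid cell; cell 1 is empty.
nested = [[0.1, 0.2], [], [0.7]]
flat, offsets = flatten(nested)
print(list(flat), offsets, list(cell_slice(flat, offsets, 2)))
```

A single flat buffer can be transferred or mapped in one operation, which is the usual motivation for this kind of layout on accelerators.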

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

BIT1 now has a working multi-GPU version that scales to 16k GPUs on Frontier with OpenMP target tasks and explicit dependencies.


The main takeaway is a practical engineering update to the BIT1 PIC-MC code that lets it run across thousands of GPUs on both Nvidia and AMD hardware. The authors combine MPI with OpenMP target tasks that use explicit dependencies to overlap computation and data movement, keep most data resident on the device, switch to a contiguous 1D layout, use pinned host memory for transfers, and enable GPU DMA. The I/O layer adds openPMD and ADIOS2 support for large-scale output and in-situ work.

The paper shows scaling curves, timing breakdowns, and utilization numbers up to 16,000 GPUs on Frontier, which is the concrete evidence that matters here. Those results look solid and directly back the performance claims, and the described approach has no internal contradictions. The techniques are extensions of known methods, but the clean application to BIT1 and the real exascale runs make the contribution useful. A minor gap is the lack of a side-by-side check that physics quantities match the original single-device version exactly, though the focus on implementation means this is not a load-bearing flaw.

This paper is for computational plasma physicists and HPC developers who need to scale existing PIC-MC codes to heterogeneous machines. Readers working on similar fusion or plasma modeling tools will get practical details they can adapt. It deserves peer review because the hardware results are specific and the implementation choices are laid out clearly enough to evaluate.

Referee Report

1 major / 2 minor

Summary. The manuscript describes a portable multi-GPU hybrid MPI+OpenMP implementation of the BIT1 particle-in-cell Monte Carlo code. It uses OpenMP target tasks with explicit dependencies to overlap computation and communication, along with persistent device-resident memory, a contiguous 1D data layout, pinned host memory, GPU DMA, and runtime interoperability for direct device-pointer access. Standardized I/O is provided via openPMD and ADIOS2. Performance benchmarks on pre-exascale and exascale systems, including strong scaling to 16,000 GPUs on Frontier, are reported to demonstrate improvements in runtime, scalability, and resource utilization.

Significance. If the reported performance gains are reproducible and the implementation preserves the numerical fidelity of the original BIT1 code, the work supplies a practical, portable framework that can enable substantially larger PIC-MC simulations on heterogeneous exascale platforms such as Frontier, directly supporting computational studies in fusion plasmas and space physics.

major comments (1)

[Performance evaluation section] Although timing breakdowns and scaling curves to 16k GPUs are supplied, the manuscript does not present a side-by-side comparison of key physical observables (e.g., density or velocity distributions, energy conservation) between the original BIT1 and the optimized multi-GPU version; this verification is load-bearing for the claim that the optimizations constitute a faithful implementation.

minor comments (2)

[Abstract] Quantitative metrics (speedup factors, parallel efficiency, or absolute runtimes) are absent, making the claimed 'significant improvements' difficult to assess without reading the full results section.
[I/O description] The overhead introduced by openPMD/ADIOS2 integration relative to the total runtime is not quantified; reporting it would help evaluate the net benefit of the standardized I/O layer.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive comment and positive recommendation for minor revision. We address the single major comment below and will strengthen the manuscript accordingly.


Referee: [Performance evaluation section] Although timing breakdowns and scaling curves to 16k GPUs are supplied, the manuscript does not present a side-by-side comparison of key physical observables (e.g., density or velocity distributions, energy conservation) between the original BIT1 and the optimized multi-GPU version; this verification is load-bearing for the claim that the optimizations constitute a faithful implementation.

Authors: We agree that explicit verification of numerical fidelity is essential to support the claim of a faithful implementation. The multi-GPU version employs the same particle-push, field-solve, and Monte Carlo collision kernels as the original BIT1 code; the modifications are confined to data layout (contiguous 1D arrays), memory residency (persistent device memory), communication overlap via OpenMP target tasks with dependencies, and I/O via openPMD/ADIOS2. Nevertheless, to provide direct evidence, we will add a new subsection in the revised Performance evaluation section containing side-by-side comparisons on representative test cases. These will include plasma density profiles, ion and electron velocity distribution functions, and global energy conservation metrics (relative error < 0.1%) between the original BIT1 and the multi-GPU implementation at equivalent problem sizes. The comparisons will be performed on both Nvidia and AMD platforms to confirm portability of the physics results.

Revision: yes
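The energy check promised in the rebuttal could take a form like the sketch below, which compares total-energy histories from the two versions pointwise against the quoted 0.1% bound. This is one assumed reading of "relative error"; the rebuttal does not define the metric, the energy values are hypothetical, and `max_relative_error` is an illustrative helper name.

```python
def max_relative_error(reference, candidate):
    """Largest pointwise relative deviation of candidate from reference."""
    if len(reference) != len(candidate):
        raise ValueError("time series must have the same length")
    return max(abs(c - r) / abs(r) for r, c in zip(reference, candidate))


# Hypothetical total-energy histories (arbitrary units), one value per diagnostic step.
energy_original = [100.0, 99.5, 99.0]
energy_multigpu = [100.0, 99.55, 99.05]
TOLERANCE = 1e-3  # 0.1%, the bound quoted in the rebuttal

error = max_relative_error(energy_original, energy_multigpu)
print(f"max relative error = {error:.2e}, within tolerance: {error < TOLERANCE}")
```

The same function applies to any pair of scalar diagnostics sampled at matching time steps, provided the reference values are non-zero.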

Circularity Check

0 steps flagged

No significant circularity; implementation and external benchmarks only


The paper describes a portable MPI+OpenMP target implementation of the existing BIT1 PIC-MC code, together with standard engineering choices (persistent device memory, contiguous layouts, pinned host buffers, GPU DMA, openPMD/ADIOS2 I/O) and reports measured wall-clock times, strong-scaling curves, and utilization metrics on Frontier up to 16,000 GPUs. No equations, fitted parameters, or uniqueness theorems are introduced; the central claim is supported directly by external hardware measurements rather than by any derivation that reduces to the paper's own inputs or to self-citations. The work is therefore self-contained against external benchmarks and contains no load-bearing circular steps.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is a software implementation and performance engineering paper with no scientific free parameters, axioms, or invented entities; it relies entirely on standard parallel programming models and existing hardware features.



Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: persistent device-resident memory, an optimized contiguous one-dimensional data layout, and a transition from unified to pinned host memory... OpenMP target tasks with nowait and depend clauses

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: Performance results on ... Frontier ... up to 16,000 GPUs

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Post-Moore Technologies for Plasma Simulation: A Community Roadmap
cs.ET · 2026-05 · unverdicted · novelty 4.0

No single post-Moore technology replaces current HPC for plasma simulations, but FPGA-class accelerators offer near-term kernel offload, non-von Neumann architectures medium-term operator acceleration, and quantum com...