
arXiv: 2603.24508 · v2 · submitted 2026-03-25 · physics.plasm-ph · cs.DC · cs.PF

Multi-GPU Hybrid Particle-in-Cell Monte Carlo Simulations for Exascale Computing Systems

Pith reviewed 2026-05-15 00:29 UTC · model grok-4.3

classification physics.plasm-ph · cs.DC · cs.PF
keywords Particle-in-Cell Monte Carlo · Multi-GPU · Exascale computing · Hybrid MPI OpenMP · Plasma physics · OpenMP target tasks · Frontier supercomputer · openPMD ADIOS2

The pith

A hybrid MPI+OpenMP implementation scales BIT1 Particle-in-Cell Monte Carlo simulations to 16,000 GPUs on exascale systems.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a portable multi-GPU version of the BIT1 code that combines MPI with OpenMP target tasks to run plasma simulations across many accelerators. It reduces data movement through persistent device memory, a one-dimensional data layout, pinned host memory, and direct GPU access while overlapping computation and communication via explicit task dependencies. Standardized I/O comes from openPMD and ADIOS2 libraries. Tests on systems including the Frontier supercomputer show gains in speed, scaling, and GPU utilization for large PIC MC runs. A sympathetic reader would care because these changes make previously limited plasma physics calculations practical on the largest available machines.

Core claim

The authors present a portable, multi-GPU hybrid MPI+OpenMP implementation of BIT1 that uses OpenMP target tasks with explicit dependencies to overlap computation and communication. Portability across Nvidia and AMD accelerators is achieved through persistent device-resident memory, an optimized contiguous one-dimensional data layout, a shift to pinned host memory, GPU Direct Memory Access, and runtime interoperability for direct device-pointer access. Standardized I/O via openPMD and ADIOS2 supports efficient file operations and in-situ analysis. Performance results on pre-exascale and exascale systems, including Frontier with up to 16,000 GPUs, show significant improvements in run time, scalability, and resource utilization.
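The contiguous one-dimensional layout can be pictured with a minimal sketch. It assumes, purely for illustration, that particle data indexed by (species, particle) is flattened row-major into a single buffer; the names `flat_index`, `n_species`, and `max_particles` are hypothetical and are not taken from BIT1.

```python
from array import array


def flat_index(species: int, particle: int, max_particles: int) -> int:
    """Row-major offset of (species, particle) inside one contiguous buffer."""
    if not 0 <= particle < max_particles:
        raise IndexError("particle index out of range")
    return species * max_particles + particle


# Hypothetical sizes, chosen only to keep the example small.
n_species, max_particles = 3, 4

# One contiguous block of doubles instead of a list of per-species lists.
positions = array("d", [0.0] * (n_species * max_particles))
positions[flat_index(2, 1, max_particles)] = 0.5
```

A single contiguous buffer of this kind can be moved to a device in one transfer, which is the usual motivation for flattening nested arrays before offloading.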

What carries the argument

OpenMP target tasks with explicit dependencies that overlap computation and communication across multiple devices, supported by persistent device-resident memory and optimized data layouts.

Load-bearing premise

The described memory, layout, and task optimizations preserve the numerical accuracy and physical correctness of the original BIT1 simulation.

What would settle it

A side-by-side run of the original BIT1 and the new implementation on an identical small test case, followed by direct comparison of particle position and velocity distributions or electromagnetic field values for any measurable differences.
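One way such a comparison could be scored is the two-sample Kolmogorov-Smirnov statistic, the largest gap between the empirical distribution functions of the two runs. The sketch below is an assumption about how a reader might carry out the check, not a procedure described in the paper; `ks_statistic` and the sample data are hypothetical.

```python
import random


def ks_statistic(a, b):
    """Two-sample Kolmogorov-Smirnov statistic: max gap between empirical CDFs."""
    a, b = sorted(a), sorted(b)
    i = j = 0
    d = 0.0
    while i < len(a) and j < len(b):
        x = min(a[i], b[j])
        # Advance both cursors past every value <= x so ties are handled together.
        while i < len(a) and a[i] <= x:
            i += 1
        while j < len(b) and b[j] <= x:
            j += 1
        d = max(d, abs(i / len(a) - j / len(b)))
    return d


# Stand-in data: two independent draws from the same Gaussian, playing the
# role of velocity samples from the original and the multi-GPU run.
rng = random.Random(1234)
v_original = [rng.gauss(0.0, 1.0) for _ in range(2000)]
v_multigpu = [rng.gauss(0.0, 1.0) for _ in range(2000)]
d_velocities = ks_statistic(v_original, v_multigpu)
```

Because the Monte Carlo parts of a PIC MC code are stochastic, the two runs need not agree particle by particle if their random number streams differ, so a distribution-level statistic of this kind is a more natural target than exact equality. The acceptance threshold for `d_velocities` is left to the reader.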

Figures

Figures reproduced from arXiv: 2603.24508 by Ales Podolnik, Allen D. Malony, David Tskhakaya, Erwin Laure, Frank Jenko, Jakub Hromadka, Jeremy J. Williams, Jonah Ekelund, Jordy Trilaksono, Leon Kos, Luca Pennati, Sameer Shende, Stefan Costea, Stefano Markidis, Tilman Dannert, Yi Ju.

Figure 1: A diagram representing the PIC Method on HPC architectures. After initialization, the PIC method repeats at each time step. In gray, we highlight the particle mover step that we parallelize in the portable multi-GPU hybrid BIT1.
… simulation loop then executes four phases: (i) the particle mover, which updates particle positions and velocities; (ii) deposition to the grid, which maps particle charges and cur…
Figure 2: Ionization case function percentage breakdown (using gprof) on Dardel, showing where most of the execution time is spent for Original BIT1, openPMD BP4, and openPMD SST simulations [20,23,25]. The arrj sorting function (yellow) dominates but drops from 75.5% (Original BIT1) to 65.5% (BP4) and 35.5% (SST).
As previously reported by Williams et al. [20,23,25], …
Figure 3: Hybrid BIT1 (Ionization Case) Total Simulation (Development Progression) strong scaling on 1 Node (4 MPI ranks & 4 GPUs) on MN5 ACC for 2K time steps.
5.3 Hybrid BIT1 (Minimal I/O & Diagnostics) up to 800 GPUs
Moving to the high-density sheath (production-like case), we evaluate the portable, multi-GPU hybrid MPI+OpenMP asynchronous version of BIT1 in both strong and weak scaling tests under minimal I/O a…
Figure 4: Hybrid BIT1 (Sheath) Total Simulation (Relative) Speed Up (left) and PE (right) - Strong and Weak Scaling up to 100 Nodes (up to 800 GPUs) on MN5 ACC, LUMI-G and Frontier for 2K time steps.
5.4 Hybrid BIT1 (Heavy I/O & Diagnostics) up to 16,000 GPUs
We evaluate the exascale readiness of the portable, multi-GPU hybrid MPI+OpenMP version of BIT1 in both strong and weak scaling tests under heavy I/O and dia…
Figure 5: Hybrid BIT1 (Sheath) Total Simulation (Relative) Speed Up (left) and PE (right) - Strong and Weak Scaling up to 2000 Nodes (up to 16,000 GPUs) on Frontier for 10K time steps.
For weak scaling, the corresponding PE at 2,000 nodes is 67.9% for the original hybrid BIT1 GPU version, while the openPMD GPU version with BP4 sustains 72.0% PE, and the openPMD GPU version with SST further improves this to 73.6% P…
Figure 6: A single time step Hybrid BIT1 AMD MI250X GPU activity trace showing HSA and HIP activity, ROCTX regions, asynchronous data copies, and mover kernel execution, obtained using rocprof and visualized with Perfetto on Dardel GPU, with corresponding confirmation traces on both LUMI-G and Frontier.
Future research will extend hybrid BIT1 to Intel GPU platforms at exascale, targeting Aurora, and Europe’s first …
read the original abstract

Particle-in-Cell (PIC) Monte Carlo (MC) simulations are central to plasma physics but face increasing challenges on heterogeneous HPC systems due to excessive data movement, synchronization overheads, and inefficient utilization of multiple accelerators. In this work, we present a portable, multi-GPU hybrid MPI+OpenMP implementation of BIT1 that enables scalable execution on both Nvidia and AMD accelerators through OpenMP target tasks with explicit dependencies to overlap computation and communication across devices. Portability is achieved through persistent device-resident memory, an optimized contiguous one-dimensional data layout, and a transition from unified to pinned host memory to improve large data-transfer efficiency, together with GPU Direct Memory Access (DMA) and runtime interoperability for direct device-pointer access. Standardized and scalable I/O is provided using openPMD and ADIOS2, supporting high-performance file I/O, in-memory data streaming, and in-situ analysis and visualization. Performance results on pre-exascale and exascale systems, including Frontier (OLCF-5) for up to 16,000 GPUs, demonstrate significant improvements in run time, scalability, and resource utilization for large-scale PIC MC simulations.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The manuscript describes a portable multi-GPU hybrid MPI+OpenMP implementation of the BIT1 particle-in-cell Monte Carlo code. It uses OpenMP target tasks with explicit dependencies to overlap computation and communication, along with persistent device-resident memory, a contiguous 1D data layout, pinned host memory, GPU DMA, and runtime interoperability for direct device-pointer access. Standardized I/O is provided via openPMD and ADIOS2. Performance benchmarks on pre-exascale and exascale systems, including strong and weak scaling to 16,000 GPUs on Frontier, are reported to demonstrate improvements in runtime, scalability, and resource utilization.

Significance. If the reported performance gains are reproducible and the implementation preserves the numerical fidelity of the original BIT1 code, the work supplies a practical, portable framework that can enable substantially larger PIC-MC simulations on heterogeneous exascale platforms such as Frontier, directly supporting computational studies in fusion plasmas and space physics.

major comments (1)
  1. [Performance evaluation section] Performance evaluation section: although timing breakdowns and scaling curves to 16k GPUs are supplied, the manuscript does not present a side-by-side comparison of key physical observables (e.g., density or velocity distributions, energy conservation) between the original BIT1 and the optimized multi-GPU version; this verification is load-bearing for the claim that the optimizations constitute a faithful implementation.
minor comments (2)
  1. [Abstract] Abstract: quantitative metrics (speedup factors, parallel efficiency, or absolute runtimes) are absent, making the claimed 'significant improvements' difficult to assess without reading the full results section.
  2. [I/O description] I/O description: the overhead introduced by openPMD/ADIOS2 integration relative to the total runtime is not quantified, which would help evaluate the net benefit of the standardized I/O layer.
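The speedup and parallel-efficiency figures requested in the first minor comment can be derived from wall-clock times with the standard definitions sketched below. This is a minimal sketch under stated assumptions: the function names and the timings in the usage lines are hypothetical, and the paper's own baseline for "relative" speedup may differ.

```python
def strong_scaling(t_base: float, n_base: int, t_n: float, n: int):
    """Relative speedup and parallel efficiency for a fixed total problem size.

    t_base is the run time on n_base nodes (the baseline); t_n is the run
    time on n nodes.
    """
    speedup = t_base / t_n
    efficiency = speedup / (n / n_base)
    return speedup, efficiency


def weak_scaling_efficiency(t_base: float, t_n: float) -> float:
    """Parallel efficiency when the problem size grows with the node count."""
    return t_base / t_n


# Hypothetical timings in seconds, not measurements from the paper.
speedup, pe_strong = strong_scaling(t_base=100.0, n_base=1, t_n=25.0, n=8)
pe_weak = weak_scaling_efficiency(t_base=50.0, t_n=100.0)
```

With these made-up inputs the strong-scaling run is 4 times faster on 8 times the nodes, so its efficiency is 0.5, and the weak-scaling run takes twice as long at scale, which also gives 0.5.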

Simulated Authors' Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive comment and positive recommendation for minor revision. We address the single major comment below and will strengthen the manuscript accordingly.

read point-by-point responses
  1. Referee: [Performance evaluation section] Performance evaluation section: although timing breakdowns and scaling curves to 16k GPUs are supplied, the manuscript does not present a side-by-side comparison of key physical observables (e.g., density or velocity distributions, energy conservation) between the original BIT1 and the optimized multi-GPU version; this verification is load-bearing for the claim that the optimizations constitute a faithful implementation.

    Authors: We agree that explicit verification of numerical fidelity is essential to support the claim of a faithful implementation. The multi-GPU version employs identical particle-push, field-solve, and Monte Carlo collision kernels as the original BIT1 code; the modifications are confined to data layout (contiguous 1D arrays), memory residency (persistent device memory), communication overlap via OpenMP target tasks with dependencies, and I/O via openPMD/ADIOS2. Nevertheless, to provide direct evidence, we will add a new subsection in the revised Performance evaluation section containing side-by-side comparisons on representative test cases. These will include plasma density profiles, ion and electron velocity distribution functions, and global energy conservation metrics (relative error < 0.1 %) between the original BIT1 and the multi-GPU implementation at equivalent problem sizes. The comparisons will be performed on both NVIDIA and AMD platforms to confirm portability of the physics results. revision: yes
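The 0.1 % criterion in the response can be read in more than one way. The sketch below assumes one reading: the largest relative difference between the total-energy time series of the original code and of the multi-GPU code, sampled at the same time steps. The function names, the tolerance argument, and the numbers are hypothetical.

```python
def max_relative_error(reference, candidate):
    """Largest |candidate - reference| / |reference| over paired samples."""
    if len(reference) != len(candidate):
        raise ValueError("series must have the same length")
    if any(r == 0 for r in reference):
        raise ValueError("reference values must be non-zero")
    return max(abs(c - r) / abs(r) for r, c in zip(reference, candidate))


def within_tolerance(reference, candidate, tol=1e-3):
    """True if the series agree to within tol (1e-3 corresponds to 0.1 %)."""
    return max_relative_error(reference, candidate) < tol


# Made-up total-energy samples standing in for the two codes.
energy_original = [100.0, 200.0, 300.0]
energy_multigpu = [100.0, 200.1, 299.9]
agrees = within_tolerance(energy_original, energy_multigpu)
```

The same helper applies to any pointwise observable, such as a density profile sampled on the grid, provided both runs report it at the same locations.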

Circularity Check

0 steps flagged

No significant circularity; implementation and external benchmarks only

full rationale

The paper describes a portable MPI+OpenMP target implementation of the existing BIT1 PIC-MC code, together with standard engineering choices (persistent device memory, contiguous layouts, pinned host buffers, GPU DMA, openPMD/ADIOS2 I/O) and reports measured wall-clock times, strong-scaling curves, and utilization metrics on Frontier up to 16,000 GPUs. No equations, fitted parameters, or uniqueness theorems are introduced; the central claim is supported directly by external hardware measurements rather than by any derivation that reduces to the paper's own inputs or to self-citations. The work is therefore self-contained against external benchmarks and contains no load-bearing circular steps.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is a software implementation and performance engineering paper with no scientific free parameters, axioms, or invented entities; it relies entirely on standard parallel programming models and existing hardware features.




Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Post-Moore Technologies for Plasma Simulation: A Community Roadmap

    cs.ET 2026-05 unverdicted novelty 4.0

    No single post-Moore technology replaces current HPC for plasma simulations, but FPGA-class accelerators offer near-term kernel offload, non-von Neumann architectures medium-term operator acceleration, and quantum com...

Reference graph

Works this paper leans on

26 extracted references · 26 canonical work pages · cited by 1 Pith paper

  1. Chaudhury, B., et al.: Hybrid Parallelization of Particle in Cell Monte Carlo Collision (PIC-MCC) Algorithm for Simulation of Low Temperature Plasmas. In: Workshop on Software Challenges to Exascale Computing. pp. 32–53. Springer (2018)
  2. Choi, J., et al.: Comparing Unified, Pinned, and Host/Device Memory Allocations for Memory-Intensive Workloads on Tegra SoC. Concurrency and Computation: Practice and Experience 33(4), e6018 (2021)
  3. Huebl, A., et al.: openPMD: A meta data standard for particle and mesh based data (2015). https://doi.org/10.5281/zenodo.591699, available at: https://www.openPMD.org, https://github.com/openPMD
  4. Huebl, A., et al.: openPMD-api: C++ & Python API for Scientific I/O with openPMD (06 2018). https://doi.org/10.14278/rodare.27, available at: https://github.com/openPMD/openPMD-api
  5. IPP-CAS: Bit1 OpenMP Tasks Particle Mover Parallelization. (2025), available at: https://repo.tok.ipp.cas.cz/tskhakaya/bit1/-/blob/feature/CPU-OpenMP/BIT1_c8/mover.c (updated: 2025-12-12)
  6. Krishnaamy, E., et al.: OpenMP Offloading on AMD and NVIDIA GPUs: Programmability and Performance Analysis. In: Proceedings of the 2025 9th International Conference on High Performance Compilation, Computing and Communications. pp. 44–56 (2025)
  7. Mehta, N., et al.: Evaluating Performance Portability of OpenMP for Snap on Nvidia, Intel, and AMD GPUs using the Roofline Methodology. In: International Workshop on Accelerator Programming Using Directives. pp. 3–24. Springer (2020)
  8. Milojicic, D., Faraboschi, P., Dube, N., Roweth, D.: Future of HPC: Diversifying Heterogeneity. In: 2021 Design, Automation & Test in Europe Conference & Exhibition (DATE). pp. 276–281. IEEE (2021)
  9. Mishra, A., et al.: Benchmarking and Evaluating Unified Memory for OpenMP GPU offloading. In: Proceedings of the Fourth Workshop on the LLVM Compiler Infrastructure in HPC. pp. 1–10 (2017)
  10. Neth, B., et al.: Beyond Explicit Transfers: Shared and Managed Memory in OpenMP. In: International Workshop on OpenMP. pp. 183–194. Springer (2021)
  11. Noaje, G., et al.: MultiGPU computing using MPI or OpenMP. In: Proceedings of the 2010 IEEE 6th International Conference on Intelligent Computer Communication and Processing. pp. 347–354. IEEE (2010)
  12. Sewall, J., et al.: A modern memory management system for OpenMP. In: 2016 Third Workshop on Accelerator Programming Using Directives (WACCPD). pp. 25–35. IEEE (2016)
  13. Tian, S., et al.: Experience Report: Writing a Portable GPU Runtime with OpenMP 5.1. In: International Workshop on OpenMP. pp. 159–169. Springer (2021)
  14. Tramm, J., et al.: Toward Portable GPU Acceleration of the OpenMC Monte Carlo Particle Transport Code. In: International Conference on Physics of Reactors (PHYSOR 2022). Pittsburgh, USA (2022)
  15. Tskhakaya, D., et al.: Optimization of PIC Codes by Improved Memory Management. Journal of Computational Physics 225(1), 829–839 (2007)
  16. Tskhakaya, D., et al.: The Particle-in-Cell Method. Contributions to Plasma Physics 47(8-9), 563–594 (2007)
  17. Tskhakaya, D., et al.: PIC/MC Code BIT1 for Plasma Simulations on HPC. In: 2010 18th Euromicro Conference on Parallel, Distributed and Network-based Processing. pp. 476–481. IEEE (2010)
  18. Vasileska, I., et al.: Modernization of the PIC codes for exascale plasma simulation. In: 2020 43rd International Convention on Information, Communication and Electronic Technology (MIPRO). pp. 209–213. IEEE (2020)
  19. Verboncoeur, J., et al.: Simultaneous Potential and Circuit Solution for 1D Bounded Plasma Particle Simulation Codes. Journal of Computational Physics 104(2), 321–328 (1993)
  20. Williams, J., et al.: Leveraging HPC Profiling and Tracing Tools to Understand the Performance of Particle-in-Cell Monte Carlo Simulations. In: European Conference on Parallel Processing. pp. 123–134. Springer (2023)
  21. Williams, J., et al.: Enabling High-Throughput Parallel I/O in Particle-in-Cell Monte Carlo Simulations with OpenPMD and Darshan I/O Monitoring. In: 2024 IEEE International Conference on Cluster Computing Workshops (CLUSTER Workshops). pp. 86–95. IEEE (2024)
  22. Williams, J., et al.: Optimizing BIT1, a Particle-in-Cell Monte Carlo Code, with OpenMP/OpenACC and GPU Acceleration. In: International Conference on Computational Science. pp. 316–330. Springer (2024)
  23. Williams, J., et al.: Understanding the Impact of OpenPMD on BIT1, a Particle-in-Cell Monte Carlo Code, Through Instrumentation, Monitoring, and In-Situ Analysis. In: European Conference on Parallel Processing. pp. 214–226. Springer (2024)
  24. Williams, J., et al.: Accelerating Particle-in-Cell Monte Carlo Simulations with MPI, OpenMP/OpenACC and Asynchronous Multi-GPU Programming. Journal of Computational Science p. 102590 (2025)
  25. Williams, J., et al.: Integrating High Performance In-Memory Data Streaming and In-Situ Visualization in Hybrid MPI+OpenMP PIC MC Simulations Towards Exascale. The International Journal of High Performance Computing Applications (2026)
  26. Wu, B., et al.: Complexity Analysis and Algorithm Design for Reorganizing Data to Minimize Non-Coalesced Memory Accesses on GPU. ACM SIGPLAN Notices 48(8), 57–68 (2013)