pith. machine review for the scientific record

arxiv: 2604.19337 · v1 · submitted 2026-04-21 · 💻 cs.DC

Recognition: unknown

POLAR-PIC: A Holistic Framework for Matrixized PIC with Co-Designed Compute, Layout, and Communication

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 01:48 UTC · model grok-4.3

classification 💻 cs.DC
keywords Particle-in-Cell · Matrix Processing Units · Plasma Simulation · Scalability · Co-design · Exascale Computing · Particle Layout · Communication Overlap

The pith

POLAR-PIC co-designs PIC particle processing for matrix units to reach 10.9x speedup while scaling to millions of cores.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops POLAR-PIC to remove key bottlenecks in large-scale Particle-in-Cell simulations, where particle-grid interactions and frequent particle redistribution limit performance on modern hardware. It achieves this by recasting field interpolation as an outer-product operation that matrix processing units handle efficiently, enforcing a physically ordered particle layout that keeps memory accesses contiguous, and overlapping redistribution communication with the deposition phase to hide latency. A reader would care because PIC methods underpin plasma research in fusion energy, laser physics, and space weather, and current codes cannot fully exploit exascale resources without such changes. The reported results show the entire particle phase running up to 10.9 times faster than the reference WarpX pipeline in uniform cases and 4.4 times faster in dynamic laser-ion problems, with 67.5 percent weak scaling efficiency beyond two million cores.
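
To make the reformulation concrete, here is a minimal 2-D sketch of how cloud-in-cell (CIC) gather weights factor into an outer product of per-dimension weight vectors, which is exactly the structure a matrix unit can consume. The function and variable names are illustrative, not the paper's API.

```python
import numpy as np

def gather_field_2d(field, x, y, dx, dy):
    """Cloud-in-cell gather for one particle: the 2x2 stencil weights
    are the outer product of 1-D weights in x and y, so the gather
    becomes a small matrix contraction instead of scattered loads."""
    ix, iy = int(x // dx), int(y // dy)   # lower-left cell index
    fx, fy = x / dx - ix, y / dy - iy     # fractional offsets in the cell
    wx = np.array([1.0 - fx, fx])         # 1-D weights along x
    wy = np.array([1.0 - fy, fy])         # 1-D weights along y
    W = np.outer(wx, wy)                  # 2x2 stencil weight matrix
    return float(np.sum(W * field[ix:ix + 2, iy:iy + 2]))
```

Stacking many particles' wx and wy vectors into matrices would turn this per-particle contraction into the batched outer-product form that MPUs execute natively.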

Core claim

By reformulating field interpolation into an MPU-friendly outer-product form, maintaining a physically ordered particle layout to preserve memory contiguity, and overlapping particle communication with deposition, POLAR-PIC accelerates the entire particle-processing phase by up to 10.9x in uniform plasma and 4.4x in real-world laser-ion acceleration scenarios compared to the native WarpX reference pipeline on LX2, while maintaining 67.5% weak scaling efficiency on over 2 million cores.
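
The communication-overlap leg of the claim can be illustrated with a toy scheduling skeleton, assuming a runtime where the neighbor exchange can progress on its own thread while deposition of non-migrating particles proceeds. This is a stand-in for the idea, not the paper's RMA-based implementation.

```python
import threading

def particle_step(deposit_resident, exchange_migrants, deposit_arrivals):
    """Hide redistribution latency: launch the neighbor exchange
    asynchronously, deposit resident particles while it is in flight,
    then deposit the newly arrived particles once the exchange lands."""
    comm = threading.Thread(target=exchange_migrants)
    comm.start()          # redistribution runs in the background
    deposit_resident()    # bulk of the deposition overlaps the comm
    comm.join()           # synchronize before touching arrivals
    deposit_arrivals()    # small tail of work for migrated particles
```

The 99.1% overlap ratio reported in the abstract would correspond to the exchange finishing almost entirely inside the resident-deposition window.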

What carries the argument

The three-part co-design of outer-product field interpolation for matrix units, physically ordered particle layout for contiguous memory access, and asynchronous communication overlapped with deposition.
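
A minimal sketch of what "physically ordered" means in practice, assuming a structure-of-arrays particle store and a row-major linearized cell key (the paper's actual layout and sort-on-write machinery are more elaborate):

```python
import numpy as np

def sort_by_cell(x, y, dx, dy, ny):
    """Reorder particle arrays so particles sharing a grid cell sit
    contiguously in memory; subsequent gather/deposit passes then walk
    both the particles and the field data in streaming order."""
    keys = (x // dx).astype(np.int64) * ny + (y // dy).astype(np.int64)
    order = np.argsort(keys, kind="stable")   # stable keeps in-cell order
    return x[order], y[order], keys[order]
```

Once sorted, every particle in a cell reads the same small field patch, which is what restores memory contiguity after particles drift.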

If this is right

  • Interpolation and deposition kernels achieve 8.0x and 13.2x speedups respectively on matrix-centric hardware.
  • Dynamic high-migration workloads can sustain 99.1 percent communication overlap and 67.5 percent weak scaling beyond two million cores.
  • PIC particle processing can reach 13.2 percent of theoretical peak on the CPU-based LS system, versus 9.6 percent for WarpX on NVIDIA A800 GPUs.
  • The co-design approach removes the previous limits from irregular memory accesses and bulk-synchronous redistribution.
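
For reference, weak-scaling efficiency is the runtime at the smallest core count divided by the runtime at the largest, with the problem size per core held fixed; 67.5 percent efficiency therefore means the per-step time grew by roughly 1.48x across the scale-up. A sketch with hypothetical timings (not figures from the paper):

```python
def weak_scaling_efficiency(t_base, t_scaled):
    """Weak-scaling efficiency: baseline runtime over scaled runtime,
    with the per-core workload held constant as cores are added."""
    return t_base / t_scaled

# Hypothetical: 12.0 s/step on the base partition and 17.78 s/step
# at >2M cores would correspond to roughly 67.5% efficiency.
```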

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The outer-product reformulation could be tested in other particle-grid codes such as molecular dynamics or smoothed-particle hydrodynamics on similar hardware.
  • Physically ordered layouts might combine with adaptive mesh refinement to further improve locality in multi-scale plasma problems.
  • If accuracy holds across more test cases, the framework suggests that future matrix-heavy architectures will favor similar holistic co-design over incremental kernel tuning.

Load-bearing premise

Reformulating field interpolation as an outer-product and enforcing a physically ordered particle layout preserves the numerical accuracy and stability of the original PIC method without extra error checks.

What would settle it

Run the same standard test problem, such as a uniform plasma or laser-ion acceleration case, with both POLAR-PIC and a reference PIC code and compare final particle positions, energies, and field values to see whether differences exceed floating-point roundoff.
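
Such a check could be scripted as below, assuming both codes dump comparable arrays of particle positions, energies, and field values; the tolerance of a few hundred ULPs is a judgment call for accumulated roundoff, not a figure from the paper.

```python
import numpy as np

def within_roundoff(ref, test, n_ulps=256):
    """True when every element of `test` matches `ref` to within a
    small multiple of double-precision roundoff, scaled by the local
    magnitude so large field values are not held to an absolute bar."""
    ref = np.asarray(ref, dtype=float)
    test = np.asarray(test, dtype=float)
    tol = n_ulps * np.finfo(float).eps * np.maximum(np.abs(ref), 1.0)
    return bool(np.all(np.abs(ref - test) <= tol))
```

Differences that exceed this bound would point to a genuine change in the discretization rather than reordered floating-point arithmetic.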

Figures

Figures reproduced from arXiv: 2604.19337 by Guangnan Feng, Jiabin Xie, Jinhui Wei, Languang Gao, Shangzhi Pang, Xingjian Cui, Yizhuo Rao, Yutong Lu, Zhenyu Wang, Zhiguang Chen, Ziyan Zhang.

Figure 1. Native WarpX v24.07 runtime breakdown of Uni…
Figure 2. POLAR-PIC integrates the MPU-based Interpola…
Figure 3. Matrixized Field Interpolation and Sort-on-Write dataflow. Left: matrix outer-product Interpolation via tensor stacking…
Figure 4. Comparison of Particle Redistribution Strategies.
Figure 5. Architectural diagram of the LX2 processor, illus…
Figure 7. Particle-Phase Performance and Long-tail Effect…
Figure 8. Correctness verification for Laser-Ion Acceleration…
Figure 9. VPU results for G0–G4 at PPC = 512, and MPU Interpolation results for G5–G7 across PPC at u_th = 0.01 (with G0/G1 shown as baselines).
Figure 10. Reuse benefits for D0–D2 under fixed u_th = 0.01 and robustness for D0, D2–D3 under PPC = 512.
Figure 11. Impact of u_th on overlap efficiency and runtime breakdown. Comparisons between average and maximum rank times highlight load imbalance and tail latency effects.
Figure 12. End-to-end full-timestep weak-scaling breakdown…
Original abstract

Particle-in-Cell (PIC) simulations are fundamental to plasma physics but often suffer from limited scalability due to particle-grid interaction bottlenecks and particle redistribution costs. Specifically, the particle-grid interaction computations have not taken full advantage of the emerging Matrix Processing Units (MPUs), the particle motion introduces irregular memory accesses, and the bulk-synchronous redistribution further destroys long-term data locality thereby limiting parallel efficiency. To address these inefficiencies, we present POLAR-PIC, a co-designed framework for large-scale PIC simulations that (i) reformulates Field Interpolation into an MPU-friendly outer-product form, (ii) maintains a physically ordered particle layout to preserve memory contiguity, and (iii) overlaps particle communication with Deposition to hide redistribution overhead. The evaluation on the pilot system of an Exascale supercomputer demonstrates that POLAR-PIC accelerates the entire particle-processing phase by up to 10.9x in uniform plasma and 4.4x in real-world laser-ion acceleration scenarios compared to the native WarpX reference pipeline on LX2. Ablation studies reveal that the speedups achieved by Interpolation and Deposition are 8.0x and 13.2x, respectively, and the asynchronous communication design sustains a 99.1% overlap ratio. In cross-platform comparisons, POLAR-PIC achieves 13.2% of theoretical peak efficiency on the CPU-based LS system, while WarpX reaches 9.6% on NVIDIA A800 GPUs. Notably, the scalability evaluation demonstrates that POLAR-PIC maintains 67.5% weak scaling efficiency on over 2 million cores under high-migration dynamic workloads, highlighting the importance of holistic co-design for future matrix-centric HPC systems.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper presents POLAR-PIC, a co-designed framework for large-scale Particle-in-Cell (PIC) simulations targeting matrix processing units (MPUs). It reformulates field interpolation as an MPU-friendly outer-product operation, enforces a physically ordered particle layout to preserve memory locality, and overlaps asynchronous particle communication with the deposition phase. On an exascale pilot system, the framework reports up to 10.9x acceleration of the full particle-processing phase versus native WarpX in uniform plasma and 4.4x in laser-ion acceleration workloads, 8.0x and 13.2x gains from the interpolation and deposition optimizations respectively, 99.1% communication overlap, 13.2% of theoretical peak on CPU-based LS hardware (versus 9.6% for WarpX on A800 GPUs), and 67.5% weak-scaling efficiency at >2 million cores under high-migration conditions.

Significance. If the numerical properties of the underlying PIC discretization are preserved, the work provides concrete evidence that hardware-specific co-design of compute, data layout, and communication can deliver substantial throughput improvements for production plasma codes at exascale. The scaling results on millions of cores and the cross-platform efficiency comparison are notable strengths; the ablation data isolating individual contributions is also useful for the community.

major comments (3)
  1. [Evaluation] Evaluation section (ablation studies and scaling results): the headline speedups (10.9x / 4.4x) and 67.5% weak-scaling efficiency are presented without any accompanying numerical verification that the MPU outer-product reformulation of interpolation and the physically ordered layout leave the original PIC stencil, charge conservation, or dispersion relations unchanged. No L2 error norms, reference-solution comparisons, charge-conservation diagnostics, or long-time stability runs versus WarpX are reported.
  2. [§3] Abstract and §3 (reformulation of field interpolation): the claim that the outer-product form is mathematically equivalent to the original interpolation weights is not accompanied by a derivation, proof of equivalence, or even a small-scale numerical check that the effective stencil and conservation properties remain identical.
  3. [Evaluation] Evaluation (timing methodology): no error bars, variance across runs, or description of how ablation studies controlled for confounding factors (cache effects, compiler flags, or measurement overhead) are provided, weakening confidence in the reported speedups and overlap ratios.
minor comments (2)
  1. [Abstract] The abstract states 'maintains 67.5% weak scaling efficiency' but does not define the baseline problem size or migration rate used for the 2-million-core experiment; a brief clarification would improve reproducibility.
  2. [§3] Notation for the outer-product reformulation could be made more explicit (e.g., explicit matrix dimensions and index mapping) to allow readers to verify the claimed MPU friendliness without re-deriving the mapping.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. The comments identify important gaps in numerical verification, mathematical exposition, and experimental methodology that we will address in the revision. We outline our responses and planned changes below.

Point-by-point responses
  1. Referee: Evaluation section (ablation studies and scaling results): the headline speedups (10.9x / 4.4x) and 67.5% weak-scaling efficiency are presented without any accompanying numerical verification that the MPU outer-product reformulation of interpolation and the physically ordered layout leave the original PIC stencil, charge conservation, or dispersion relations unchanged. No L2 error norms, reference-solution comparisons, charge-conservation diagnostics, or long-time stability runs versus WarpX are reported.

    Authors: We agree that explicit numerical verification is essential to substantiate that the co-design changes preserve the underlying PIC discretization. In the revised manuscript we will add a new subsection in the Evaluation section that reports L2 error norms against WarpX reference solutions for both test cases, charge-conservation diagnostics over long simulation times, and stability comparisons (including dispersion-relation checks) for the uniform-plasma and laser-ion workloads. These results will be presented alongside the performance numbers to confirm equivalence of the numerical properties. revision: yes

  2. Referee: Abstract and §3 (reformulation of field interpolation): the claim that the outer-product form is mathematically equivalent to the original interpolation weights is not accompanied by a derivation, proof of equivalence, or even a small-scale numerical check that the effective stencil and conservation properties remain identical.

    Authors: We will expand §3 with a complete step-by-step derivation showing that the outer-product reformulation produces identical interpolation weights and the same effective stencil as the original formulation. We will also include a small-scale numerical verification (on a single cell or small grid) demonstrating that the stencil, charge deposition, and conservation properties match those of the reference WarpX implementation to machine precision. revision: yes

  3. Referee: Evaluation (timing methodology): no error bars, variance across runs, or description of how ablation studies controlled for confounding factors (cache effects, compiler flags, or measurement overhead) are provided, weakening confidence in the reported speedups and overlap ratios.

    Authors: We acknowledge that the current timing methodology description is insufficient. In the revised Evaluation section we will report error bars (standard deviation) from at least five independent runs for all speedup and overlap figures, quantify run-to-run variance, and add an explicit paragraph describing the controls used for cache effects, compiler flags, and measurement overhead (including use of hardware performance counters and repeated warm-up runs). revision: yes

Circularity Check

0 steps flagged

No circularity: empirical speedups measured against external WarpX baseline

Full rationale

The paper's core results are runtime measurements (10.9x/4.4x particle-phase speedups, 67.5% weak scaling on 2M+ cores) obtained by executing the POLAR-PIC implementation against the native WarpX pipeline on LX2 hardware. No mathematical derivations, first-principles predictions, or equations are presented that reduce any reported quantity to fitted parameters or self-citations by construction. The three co-design elements (outer-product interpolation reformulation, physically ordered layout, overlapped communication) are engineering choices whose outcomes are validated only by direct benchmarking, leaving the evaluation self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The framework rests on standard PIC conservation properties and hardware assumptions about MPU availability and network latency; no new free parameters or invented physical entities are introduced in the abstract.

axioms (2)
  • domain assumption Particle-grid interpolation and deposition operations can be mathematically rearranged into outer-product matrix form without changing the underlying physics.
    Invoked when reformulating Field Interpolation for MPU compatibility.
  • domain assumption Maintaining a physically ordered particle layout preserves memory contiguity and does not increase overall computational cost.
    Stated as part of the layout co-design.

pith-pipeline@v0.9.0 · 5654 in / 1425 out tokens · 50984 ms · 2026-05-10T01:48:44.255643+00:00 · methodology

discussion (0)

