POLAR-PIC: A Holistic Framework for Matrixized PIC with Co-Designed Compute, Layout, and Communication
Pith reviewed 2026-05-10 01:48 UTC · model grok-4.3
The pith
POLAR-PIC co-designs PIC particle processing for matrix units, reaching up to 10.9x speedup while scaling to millions of cores.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By reformulating field interpolation into an MPU-friendly outer-product form, maintaining a physically ordered particle layout to preserve memory contiguity, and overlapping particle communication with deposition, POLAR-PIC accelerates the entire particle-processing phase by up to 10.9x in uniform plasma and 4.4x in real-world laser-ion acceleration scenarios compared to the native WarpX reference pipeline on LX2, while maintaining 67.5% weak scaling efficiency on over 2 million cores.
What carries the argument
The three-part co-design of outer-product field interpolation for matrix units, physically ordered particle layout for contiguous memory access, and asynchronous communication overlapped with deposition.
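To make the first of these concrete, below is a minimal sketch of how linear (cloud-in-cell) field interpolation can be recast as a rank-1 outer product, assuming 2-D CIC shape functions; the paper's actual MPU tiling, precision, and data layout are not reproduced here.

```python
import numpy as np

def cic_weights(frac):
    # 1-D linear (cloud-in-cell) weights for a fractional offset in [0, 1).
    return np.array([1.0 - frac, frac])

def interpolate_gather(field, ix, iy, fx, fy):
    # Baseline: scalar gather over the 2x2 stencil around the particle.
    wx, wy = cic_weights(fx), cic_weights(fy)
    acc = 0.0
    for a in range(2):
        for b in range(2):
            acc += wx[a] * wy[b] * field[ix + a, iy + b]
    return acc

def interpolate_outer_product(field, ix, iy, fx, fy):
    # Outer-product form: the 2x2 weight stencil W = wx (x) wy is rank-1, so
    # interpolation becomes an elementwise product with the gathered stencil
    # plus a reduction -- the shape an MPU outer-product pipeline produces.
    W = np.outer(cic_weights(fx), cic_weights(fy))
    return np.sum(W * field[ix:ix + 2, iy:iy + 2])

rng = np.random.default_rng(0)
field = rng.random((8, 8))
assert np.isclose(interpolate_gather(field, 3, 4, 0.25, 0.6),
                  interpolate_outer_product(field, 3, 4, 0.25, 0.6))
```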
If this is right
- Interpolation and deposition kernels achieve 8.0x and 13.2x speedups, respectively, on matrix-centric hardware.
- Dynamic high-migration workloads can sustain a 99.1% communication overlap and 67.5% weak-scaling efficiency beyond two million cores (a sketch of the overlap pattern follows this list).
- PIC particle processing can reach 13.2% of theoretical peak on CPU-based matrix systems, versus 9.6% for WarpX on NVIDIA A800 GPUs.
- The co-design removes the bottlenecks previously imposed by irregular memory accesses and bulk-synchronous redistribution.
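The overlap claim in the second bullet amounts to hiding particle redistribution behind deposition. A minimal single-node sketch of that pattern, using a worker thread in place of the paper's MPI/RMA machinery (the deposit/exchange stand-ins are assumptions, not POLAR-PIC's API):

```python
from concurrent.futures import ThreadPoolExecutor
import time

def deposit(particles):
    time.sleep(0.001 * len(particles) / 100)  # stand-in for deposition work

def exchange(leaving):
    time.sleep(0.005)   # stand-in for network transfer of migrating particles
    return leaving      # migrants arriving from neighbor ranks

def step(residents, leaving):
    # Start the exchange asynchronously, deposit resident particles while it
    # is in flight, then synchronize. The overlap ratio is the fraction of
    # exchange time hidden under the deposition.
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = pool.submit(exchange, leaving)
        deposit(residents)
        arrived = pending.result()
    deposit(arrived)    # late arrivals are deposited after the sync point

step(residents=list(range(5000)), leaving=list(range(50)))
```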
Where Pith is reading between the lines
- The outer-product reformulation could be tested in other particle-based methods, such as particle-mesh molecular dynamics or smoothed-particle hydrodynamics, on similar hardware.
- Physically ordered layouts might combine with adaptive mesh refinement to further improve locality in multi-scale plasma problems.
- If accuracy holds across more test cases, the framework suggests that future matrix-heavy architectures will favor similar holistic co-design over incremental kernel tuning.
Load-bearing premise
Reformulating field interpolation as an outer-product and enforcing a physically ordered particle layout preserves the numerical accuracy and stability of the original PIC method without extra error checks.
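One way to see why the premise is plausible: standard PIC shape functions are separable (tensor products of 1-D shapes), so the 2-D stencil weights factor exactly and the outer-product form only reassociates floating-point sums. A sketch of the identity, with w_x, w_y the per-axis weight vectors and F the gathered stencil values:

```latex
E_p \;=\; \sum_{a,b} w_{ab}\, F_{ab}
    \;=\; \sum_{a,b} (w_x)_a\,(w_y)_b\, F_{ab}
    \;=\; w_x^{\top} F\, w_y,
\qquad W \;=\; w_x\, w_y^{\top}.
```

Under this reading, any divergence from the reference pipeline should sit at roundoff scale, which is precisely what the test below would measure.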
What would settle it
Run the same standard test problem, such as a uniform plasma or laser-ion acceleration case, with both POLAR-PIC and a reference PIC code and compare final particle positions, energies, and field values to see whether differences exceed floating-point roundoff.
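A hedged sketch of that comparison harness; the diagnostic names and stand-in arrays are hypothetical, as real runs would load the final-step dumps of WarpX and POLAR-PIC executed on identical inputs and RNG seeds:

```python
import numpy as np

def rel_l2(a, b):
    # Relative L2 difference between two diagnostic arrays.
    return np.linalg.norm(a - b) / max(np.linalg.norm(a), 1e-300)

def compare_runs(ref, test, tol=1e-12):
    # A few ULPs accumulated over many steps is the expected scale when only
    # summation order changed; anything larger flags a genuine stencil or
    # conservation discrepancy.
    for key in ("positions", "energies", "Ex", "Ey", "Ez"):
        err = rel_l2(ref[key], test[key])
        print(f"{key:10s} rel. L2 = {err:.3e}  {'OK' if err < tol else 'DIVERGES'}")

rng = np.random.default_rng(3)
ref = {k: rng.random(100) for k in ("positions", "energies", "Ex", "Ey", "Ez")}
test = {k: v + 1e-16 * rng.standard_normal(100) for k, v in ref.items()}
compare_runs(ref, test)
```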
Original abstract
Particle-in-Cell (PIC) simulations are fundamental to plasma physics but often suffer from limited scalability due to particle-grid interaction bottlenecks and particle redistribution costs. Specifically, the particle-grid interaction computations have not taken full advantage of the emerging Matrix Processing Units (MPUs), the particle motion introduces irregular memory accesses, and the bulk-synchronous redistribution further destroys long-term data locality thereby limiting parallel efficiency. To address these inefficiencies, we present POLAR-PIC, a co-designed framework for large-scale PIC simulations that (i) reformulates Field Interpolation into an MPU-friendly outer-product form, (ii) maintains a physically ordered particle layout to preserve memory contiguity, and (iii) overlaps particle communication with Deposition to hide redistribution overhead. The evaluation on the pilot system of an Exascale supercomputer demonstrates that POLAR-PIC accelerates the entire particle-processing phase by up to 10.9x in uniform plasma and 4.4x in real-world laser-ion acceleration scenarios compared to the native WarpX reference pipeline on LX2. Ablation studies reveal that the speedups achieved by Interpolation and Deposition are 8.0x and 13.2x, respectively, and the asynchronous communication design sustains a 99.1% overlap ratio. In cross-platform comparisons, POLAR-PIC achieves 13.2% of theoretical peak efficiency on the CPU-based LS system, while WarpX reaches 9.6% on NVIDIA A800 GPUs. Notably, the scalability evaluation demonstrates that POLAR-PIC maintains 67.5% weak scaling efficiency on over 2 million cores under high-migration dynamic workloads, highlighting the importance of holistic co-design for future matrix-centric HPC systems.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents POLAR-PIC, a co-designed framework for large-scale Particle-in-Cell (PIC) simulations targeting matrix processing units (MPUs). It reformulates field interpolation as an MPU-friendly outer-product operation, enforces a physically ordered particle layout to preserve memory locality, and overlaps asynchronous particle communication with the deposition phase. On an exascale pilot system, the framework reports up to 10.9x acceleration of the full particle-processing phase versus native WarpX in uniform plasma and 4.4x in laser-ion acceleration workloads, 8.0x and 13.2x gains from the interpolation and deposition optimizations respectively, 99.1% communication overlap, 13.2% of theoretical peak on CPU-based LS hardware (versus 9.6% for WarpX on A800 GPUs), and 67.5% weak-scaling efficiency at >2 million cores under high-migration conditions.
Significance. If the numerical properties of the underlying PIC discretization are preserved, the work provides concrete evidence that hardware-specific co-design of compute, data layout, and communication can deliver substantial throughput improvements for production plasma codes at exascale. The scaling results on millions of cores and the cross-platform efficiency comparison are notable strengths; the ablation data isolating individual contributions is also useful for the community.
Major comments (3)
- [Evaluation] Evaluation section (ablation studies and scaling results): the headline speedups (10.9x / 4.4x) and 67.5% weak-scaling efficiency are presented without any accompanying numerical verification that the MPU outer-product reformulation of interpolation and the physically ordered layout leave the original PIC stencil, charge conservation, or dispersion relations unchanged. No L2 error norms, reference-solution comparisons, charge-conservation diagnostics, or long-time stability runs versus WarpX are reported.
- [§3] Abstract and §3 (reformulation of field interpolation): the claim that the outer-product form is mathematically equivalent to the original interpolation weights is not accompanied by a derivation, proof of equivalence, or even a small-scale numerical check that the effective stencil and conservation properties remain identical.
- [Evaluation] Evaluation (timing methodology): no error bars, variance across runs, or description of how ablation studies controlled for confounding factors (cache effects, compiler flags, or measurement overhead) are provided, weakening confidence in the reported speedups and overlap ratios.
Minor comments (2)
- [Abstract] The abstract states 'maintains 67.5% weak scaling efficiency' but does not define the baseline problem size or migration rate used for the 2-million-core experiment; a brief clarification would improve reproducibility.
- [§3] Notation for the outer-product reformulation could be made more explicit (e.g., explicit matrix dimensions and index mapping) so that readers can verify the claimed MPU friendliness without re-deriving the mapping; a hedged sketch of one such mapping follows this list.
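For illustration only (the tile size and index mapping are assumptions, not the paper's layout): stacking the per-axis weights of a tile of P particles into P-by-2 matrices turns the per-particle bilinear forms into one batched contraction, which makes the claimed MPU friendliness checkable at a glance.

```python
import numpy as np

P = 4                          # assumed particles per MPU tile
rng = np.random.default_rng(1)
Wx = rng.random((P, 2))        # per-particle x-axis weights, shape (P, 2)
Wy = rng.random((P, 2))        # per-particle y-axis weights, shape (P, 2)
F = rng.random((P, 2, 2))      # each particle's gathered 2x2 field stencil

# Batched bilinear form: E[p] = Wx[p] @ F[p] @ Wy[p] for every particle p.
E = np.einsum("pa,pab,pb->p", Wx, F, Wy)

# Same result via explicit per-particle rank-1 (outer-product) stencils.
E_ref = np.array([np.sum(np.outer(Wx[p], Wy[p]) * F[p]) for p in range(P)])
assert np.allclose(E, E_ref)
```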
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. The comments identify important gaps in numerical verification, mathematical exposition, and experimental methodology that we will address in the revision. We outline our responses and planned changes below.
Point-by-point responses
Referee: Evaluation section (ablation studies and scaling results): the headline speedups (10.9x / 4.4x) and 67.5% weak-scaling efficiency are presented without any accompanying numerical verification that the MPU outer-product reformulation of interpolation and the physically ordered layout leave the original PIC stencil, charge conservation, or dispersion relations unchanged. No L2 error norms, reference-solution comparisons, charge-conservation diagnostics, or long-time stability runs versus WarpX are reported.
Authors: We agree that explicit numerical verification is essential to substantiate that the co-design changes preserve the underlying PIC discretization. In the revised manuscript we will add a new subsection in the Evaluation section that reports L2 error norms against WarpX reference solutions for both test cases, charge-conservation diagnostics over long simulation times, and stability comparisons (including dispersion-relation checks) for the uniform-plasma and laser-ion workloads. These results will be presented alongside the performance numbers to confirm equivalence of the numerical properties. Revision planned: yes.
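As a concrete instance of the promised charge-conservation diagnostic, a minimal sketch of the discrete continuity check, assuming a periodic 2-D grid and colocated staggering for brevity (WarpX's actual Yee staggering differs):

```python
import numpy as np

def continuity_residual(rho_old, rho_new, Jx, Jy, dx, dy, dt):
    # Max-norm of the discrete continuity equation d(rho)/dt + div(J) = 0 on
    # a periodic 2-D grid. For a charge-conserving deposition scheme (e.g.
    # Esirkepov) this stays at roundoff every step; growth over long runs is
    # exactly what the referee asks to rule out.
    divJ = ((Jx - np.roll(Jx, 1, axis=0)) / dx +
            (Jy - np.roll(Jy, 1, axis=1)) / dy)
    return np.max(np.abs((rho_new - rho_old) / dt + divJ))

# Consistency check on manufactured data: a rho updated exactly by -dt*div(J)
# must give a zero residual.
rng = np.random.default_rng(4)
Jx, Jy = rng.random((16, 16)), rng.random((16, 16))
rho0 = rng.random((16, 16))
dx = dy = dt = 0.5
divJ = (Jx - np.roll(Jx, 1, axis=0)) / dx + (Jy - np.roll(Jy, 1, axis=1)) / dy
rho1 = rho0 - dt * divJ
assert continuity_residual(rho0, rho1, Jx, Jy, dx, dy, dt) < 1e-12
```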
Referee: Abstract and §3 (reformulation of field interpolation): the claim that the outer-product form is mathematically equivalent to the original interpolation weights is not accompanied by a derivation, proof of equivalence, or even a small-scale numerical check that the effective stencil and conservation properties remain identical.
Authors: We will expand §3 with a complete step-by-step derivation showing that the outer-product reformulation produces identical interpolation weights and the same effective stencil as the original formulation. We will also include a small-scale numerical verification (on a single cell or small grid) demonstrating that the stencil, charge deposition, and conservation properties match those of the reference WarpX implementation to machine precision. Revision planned: yes.
Referee: Evaluation (timing methodology): no error bars, variance across runs, or description of how ablation studies controlled for confounding factors (cache effects, compiler flags, or measurement overhead) are provided, weakening confidence in the reported speedups and overlap ratios.
Authors: We acknowledge that the current timing methodology description is insufficient. In the revised Evaluation section we will report error bars (standard deviation) from at least five independent runs for all speedup and overlap figures, quantify run-to-run variance, and add an explicit paragraph describing the controls used for cache effects, compiler flags, and measurement overhead (including use of hardware performance counters and repeated warm-up runs). Revision planned: yes.
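A minimal sketch of such a methodology, warm-ups plus repeated timed runs with dispersion reported; the repetition counts and the kernel stand-in are placeholders, not the paper's protocol:

```python
import time
import statistics

def benchmark(kernel, *args, warmup=3, reps=5):
    # Warm-up runs populate caches and branch predictors; the timed
    # repetitions yield a mean and standard deviation. Real runs would also
    # pin threads and record compiler flags alongside the numbers.
    for _ in range(warmup):
        kernel(*args)
    samples = []
    for _ in range(reps):
        t0 = time.perf_counter()
        kernel(*args)
        samples.append(time.perf_counter() - t0)
    return statistics.mean(samples), statistics.stdev(samples)

mean, std = benchmark(lambda n: sum(i * i for i in range(n)), 100_000)
print(f"{mean * 1e3:.2f} ms +/- {std * 1e3:.2f} ms over 5 timed runs")
```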
Circularity Check
No circularity: empirical speedups measured against external WarpX baseline
Full rationale
The paper's core results are runtime measurements (10.9x/4.4x particle-phase speedups, 67.5% weak scaling on 2M+ cores) obtained by executing the POLAR-PIC implementation against the native WarpX pipeline on LX2 hardware. No mathematical derivations, first-principles predictions, or equations are presented that reduce any reported quantity to fitted parameters or self-citations by construction. The three co-design elements (outer-product interpolation reformulation, physically ordered layout, overlapped communication) are engineering choices whose outcomes are validated only by direct benchmarking, leaving the evaluation self-contained.
Axiom & Free-Parameter Ledger
Axioms (2)
- Domain assumption: Particle-grid interpolation and deposition operations can be mathematically rearranged into outer-product matrix form without changing the underlying physics.
- Domain assumption: Maintaining a physically ordered particle layout preserves memory contiguity and does not increase overall computational cost.
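A minimal sketch of what the second axiom assumes, sorting particles by linearized cell index so that particles sharing a cell (and hence a field stencil) sit contiguously in memory; the paper's in-place maintenance of this order as particles move is not reproduced here:

```python
import numpy as np

def physically_order(pos, nx, ny, dx, dy):
    # Bin each particle into its cell and sort by the cell's linearized
    # index; a stable sort keeps same-cell particles in insertion order.
    ix = (pos[:, 0] // dx).astype(int) % nx
    iy = (pos[:, 1] // dy).astype(int) % ny
    cell = ix * ny + iy
    order = np.argsort(cell, kind="stable")
    return pos[order], cell[order]

rng = np.random.default_rng(2)
pos, cell = physically_order(rng.random((1000, 2)),
                             nx=8, ny=8, dx=0.125, dy=0.125)
assert np.all(np.diff(cell) >= 0)  # traversal is now physically ordered
```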