
arxiv: 2606.19059 · v1 · pith:33A4MILInew · submitted 2026-06-17 · 🧮 math.NA · cs.DC · cs.NA · physics.comp-ph

A performance portable fast Ewald summation for Stokes flow

Pith reviewed 2026-06-26 20:06 UTC · model grok-4.3

keywords: Ewald summation · Stokes flow · performance portability · GPU algorithms · N-body problems · particle-to-grid · periodic domains

The pith

A novel P2G algorithm delivers up to 16× speedup over a baseline GPU implementation of the particle-to-grid step of Ewald summation for periodic Stokes flow.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops algorithms for Ewald summation to accelerate N-body Stokes flow simulations in periodic domains by splitting interactions into near-field particle-to-particle and far-field particle-to-grid plus grid-to-particle steps. It introduces a new particle-to-grid method to remove a common bottleneck in the far field and implements the full code with PyKokkos to run efficiently on NVIDIA and AMD GPUs as well as ARM and x86 CPUs. The result is high compute efficiency for the kernels and overall throughput of roughly 8 million particles per second on an H200 GPU at nine digits of accuracy. A multi-GPU test confirms scaling to 256 million particles on 64 GPUs with mostly bounded communication costs.
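To make the near-field particle-to-particle (P2P) step concrete, here is a minimal pure-Python sketch of a cutoff-limited direct sum in a periodic cubic box. It is an illustration only, not the paper's kernel: the function name `stokeslet_p2p` and its parameters are hypothetical, and the free-space Stokeslet is used as a stand-in where an Ewald code would evaluate a screened, rapidly decaying near-field kernel. A production code would also replace the all-pairs loop with a cell list.

```python
import math

def stokeslet_p2p(positions, forces, box, cutoff, viscosity=1.0):
    """Near-field direct sum: velocity at each particle from sources within `cutoff`.

    Stand-in kernel (assumption): free-space Stokeslet,
    u = (f/|r| + r (r.f)/|r|^3) / (8 pi mu).
    """
    n = len(positions)
    pref = 1.0 / (8.0 * math.pi * viscosity)
    velocities = [[0.0, 0.0, 0.0] for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            # Minimum-image separation in a periodic cubic box of side `box`.
            r = [positions[i][k] - positions[j][k] for k in range(3)]
            r = [rk - box * round(rk / box) for rk in r]
            dist = math.sqrt(sum(rk * rk for rk in r))
            if dist >= cutoff:
                continue  # pairs beyond the cutoff belong to the far field
            r_dot_f = sum(r[k] * forces[j][k] for k in range(3))
            for k in range(3):
                velocities[i][k] += pref * (
                    forces[j][k] / dist + r[k] * r_dot_f / dist**3
                )
    return velocities

# Two particles one unit apart; only the second carries a force.
v = stokeslet_p2p([(0, 0, 0), (1, 0, 0)], [(0, 0, 0), (1, 0, 0)],
                  box=10.0, cutoff=2.0)
```

With the force aligned with the separation, the two Stokeslet terms are equal, so the induced velocity at the first particle is 1/(4πμr) along x.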

Core claim

We present GPU algorithms for Ewald summation methods for accelerating N-body Stokes flow problems in periodic domains. Like most N-body codes, Ewald sums use a near-field/far-field decomposition. The near field involves particle-to-particle (P2P) interactions. The far field primarily involves particle-to-grid (P2G) and grid-to-particle (G2P) interactions, as well as Fast Fourier Transforms. For each interaction, we investigate several algorithmic variants. Our implementation uses PyKokkos, a Python interface for the Kokkos C++ parallel programming framework, which supports portability to AMD/NVIDIA GPU and ARM/x86 CPU architectures. A novel P2G algorithm achieves up to 16× speedup compared to a baseline GPU implementation.

What carries the argument

The near-field/far-field decomposition of the Ewald sum, with a novel particle-to-grid (P2G) kernel that replaces the baseline far-field mapping step.
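For readers unfamiliar with the particle-to-grid step, the following sketch shows a straightforward scatter-style P2G of the kind a baseline might use. It is not the paper's novel algorithm, whose details are not given on this page. The names `gaussian_weights_1d` and `p2g_scatter`, the truncated Gaussian window, and the per-particle normalization of the weights are all assumptions chosen to keep the example small and checkable.

```python
import math

def gaussian_weights_1d(x, h, m, p, sigma):
    """Indices and normalized weights of the p grid points nearest to x (periodic)."""
    base = math.floor(x / h) - (p - 1) // 2
    idx, w = [], []
    for s in range(p):
        g = base + s
        d = x - g * h                      # distance to the unwrapped grid point
        idx.append(g % m)                  # wrap the index into the periodic grid
        w.append(math.exp(-d * d / (2.0 * sigma * sigma)))
    total = sum(w)
    return idx, [wi / total for wi in w]   # normalization is a sketch-only choice

def p2g_scatter(positions, strengths, m, box, p=4):
    """Spread scalar strengths onto an m^3 periodic grid, one particle at a time."""
    h = box / m
    grid = [0.0] * (m * m * m)
    for (x, y, z), q in zip(positions, strengths):
        ix, wx = gaussian_weights_1d(x, h, m, p, h)
        iy, wy = gaussian_weights_1d(y, h, m, p, h)
        iz, wz = gaussian_weights_1d(z, h, m, p, h)
        for a in range(p):
            for b in range(p):
                for c in range(p):
                    # Parallelized over particles, this update would need to be atomic.
                    grid[(ix[a] * m + iy[b]) * m + iz[c]] += q * wx[a] * wy[b] * wz[c]
    return grid

grid = p2g_scatter([(0.1, 0.2, 0.3), (3.9, 0.0, 2.5)], [1.0, -2.5], m=8, box=4.0)
```

Because each particle writes to p³ grid points that other particles may also touch, a particle-parallel version of this loop generally relies on atomic additions, which is one common reason such scatters become costly on GPUs.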

If this is right

  • The full Ewald sum code reaches 8 million particles per second on NVIDIA H200 GPUs and half a million on Grace CPUs at nine digits of accuracy.
  • P2P kernel compute efficiency reaches 84 percent on NVIDIA A100 and 73 percent on H200.
  • Weak scaling on up to 256 million particles across 64 GPUs keeps communication costs bounded except for the all-to-all particle sort.
  • The same code base runs on AMD GPUs, ARM CPUs, and x86 CPUs through the PyKokkos layer.
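As a back-of-envelope reading of the throughput figures above, the short calculation below converts particles per second into wall time per Ewald evaluation. The rates are the approximate values quoted on this page; applying the single-H200 rate to the per-GPU load of the 64-GPU run is an assumption made here for illustration, and communication cost is ignored.

```python
def seconds_per_evaluation(num_particles, particles_per_second):
    """Wall time for one evaluation at a sustained throughput."""
    return num_particles / particles_per_second

h200_rate = 8.0e6    # approx. particles/s on one H200 (quoted above)
grace_rate = 0.5e6   # approx. particles/s on one Grace CPU (quoted above)

# Per-GPU load in the weak-scaling test: 256 million particles on 64 GPUs.
particles_per_gpu = 256e6 / 64

t_gpu = seconds_per_evaluation(particles_per_gpu, h200_rate)
t_cpu = seconds_per_evaluation(particles_per_gpu, grace_rate)
print(f"{particles_per_gpu:.0f} particles: {t_gpu} s at the H200 rate, "
      f"{t_cpu} s at the Grace rate")
```

Under these assumptions, 4 million particles take about half a second at the H200 rate and about eight seconds at the Grace rate, a ratio of roughly 16.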

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The P2G optimization may apply to other far-field summations if the underlying kernel structure is similar.
  • Switching the particle sort to neighbor-only communication in typical time-stepping loops could further improve large-scale runs.
  • Analytical performance models supplied in the paper can be used to predict runtimes on future hardware generations.
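The neighbor-only communication idea in the second bullet can be illustrated with a small check. The sketch assumes a 1D slab decomposition of the periodic box across ranks, which is a simplification chosen here and not necessarily the paper's layout; `owner_rank` and `migration_is_neighbor_only` are hypothetical helper names. If every particle moves less than one slab width per time step, each particle either stays on its rank or hops to an adjacent one, so an all-to-all exchange is not required.

```python
import math

def owner_rank(x, box, num_ranks):
    """Rank owning coordinate x under a 1D slab decomposition of a periodic box."""
    slab = box / num_ranks
    # The trailing modulo guards against (x % box) rounding up to exactly `box`.
    return int(math.floor((x % box) / slab)) % num_ranks

def migration_is_neighbor_only(old_x, new_x, box, num_ranks):
    """True if every particle stays put or moves to an adjacent rank (periodic)."""
    for xo, xn in zip(old_x, new_x):
        hop = (owner_rank(xn, box, num_ranks) - owner_rank(xo, box, num_ranks)) % num_ranks
        if hop not in (0, 1, num_ranks - 1):
            return False
    return True

# Small displacements: rank 0 -> 1, rank 7 -> 0 (periodic wrap), rank 3 -> 2.
small_step = migration_is_neighbor_only([0.5, 7.9, 3.2], [1.1, 0.2, 2.9],
                                        box=8.0, num_ranks=8)
# A jump across half the box would need a non-neighbor exchange.
large_step = migration_is_neighbor_only([0.5], [4.5], box=8.0, num_ranks=8)
```

Here `small_step` evaluates to `True` and `large_step` to `False`.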

Load-bearing premise

The near-field/far-field decomposition and the specific algorithmic variants chosen for P2P, P2G, and G2P remain efficient and representative across the full range of target Stokes problems and particle distributions.

What would settle it

A measurement on a non-uniform particle distribution or different hardware showing that the novel P2G kernel does not reach the reported 16× speedup over baseline.

Original abstract

We present GPU algorithms for Ewald summation methods for accelerating N-body Stokes flow problems in periodic domains. Like most N-body codes, Ewald sums use a near-field/far-field decomposition. The near field involves particle-to-particle (P2P) interactions. The far field primarily involves particle-to-grid (P2G) and grid-to-particle (G2P) interactions, as well as Fast Fourier Transforms. For each interaction, we investigate several algorithmic variants. Our implementation uses PyKokkos, a Python interface for the Kokkos C++ parallel programming framework, which supports portability to AMD/NVIDIA GPU and ARM/x86 CPU architectures. Double and single-precision numerical results, alongside analytical performance models, confirm the efficiency of our algorithms on AMD and NVIDIA GPU and on ARM and AMD CPU architectures. The P2P interaction achieves around 73% compute efficiency on NVIDIA H200, 84% on NVIDIA A100, 60% on AMD MI300, 52% on Grace CPU, and 68% on AMD Epyc CPU. A straightforward implementation of the P2G kernel can become a computational bottleneck. We introduce a novel P2G algorithm that achieves up to 16× speedup compared to a baseline GPU implementation. The overall Ewald sum code processes approximately 8 million particles per second on an H200 GPU, and about a half-million particles per second on a Grace CPU, for nine digits of accuracy. We also perform a multi-GPU weak scaling test on up to 256 million particles (64 GPUs) that shows bounded communication cost for all stages except the all-to-all particle sorting, which can be reduced to neighbor communication in the relevant time-stepping regime.
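The relationship between the P2G and G2P steps named in the abstract can be shown in one dimension: when both use the same window weights, gathering from the grid (G2P) is the transpose of spreading onto it (P2G). The sketch below is a generic illustration under that assumption. The helper names `window`, `p2g`, and `g2p`, the unnormalized truncated Gaussian window, and the 1D setting are choices made here, not details taken from the paper.

```python
import math

def window(x, h, m, p, sigma):
    """(grid index, weight) pairs for the p grid points nearest to x, periodic."""
    base = math.floor(x / h) - (p - 1) // 2
    out = []
    for s in range(p):
        g = base + s
        d = x - g * h
        out.append((g % m, math.exp(-d * d / (2.0 * sigma * sigma))))
    return out

def p2g(xs, qs, m, box, p=4):
    """Scatter: each particle adds its weighted strength to nearby grid points."""
    h = box / m
    grid = [0.0] * m
    for x, q in zip(xs, qs):
        for g, w in window(x, h, m, p, h):
            grid[g] += q * w
    return grid

def g2p(xs, grid, box, p=4):
    """Gather: each particle reads a weighted sum of nearby grid values."""
    m = len(grid)
    h = box / m
    return [sum(w * grid[g] for g, w in window(x, h, m, p, h)) for x in xs]

xs, qs, box, m = [0.3, 2.7, 3.95], [1.0, -0.5, 2.0], 4.0, 8
u = [float(i * i % 5) for i in range(m)]            # arbitrary grid field
lhs = sum(a * b for a, b in zip(p2g(xs, qs, m, box), u))
rhs = sum(q * v for q, v in zip(qs, g2p(xs, u, box)))
```

The two inner products `lhs` and `rhs` agree to rounding error, which is a convenient unit test for any pair of spreading and interpolation kernels. The gather has no write conflicts, whereas the scatter does, which is consistent with P2G being the step the abstract singles out as a potential bottleneck.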

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 2 minor

Summary. The paper presents GPU algorithms for Ewald summation in periodic Stokes flow N-body problems, employing a near-field/far-field split with P2P, P2G, and G2P interactions. Multiple algorithmic variants are explored for each, implemented via PyKokkos for portability across NVIDIA/AMD GPUs and ARM/x86 CPUs. Reported results include P2P compute efficiencies of 73% on H200, 84% on A100, 60% on MI300, 52% on Grace CPU, and 68% on Epyc CPU; a novel P2G algorithm yielding up to 16× speedup over a baseline GPU version; overall throughputs of ~8 million particles/sec on H200 and ~0.5 million on Grace CPU at 9-digit accuracy; and weak scaling to 256M particles on 64 GPUs with bounded communication except for all-to-all sorting.

Significance. If the measured throughputs and scaling hold, the work supplies a practical, architecture-portable implementation for large-scale Stokes flow simulations. The concrete efficiency percentages, analytical performance models, 16× P2G improvement, and 256M-particle weak-scaling test constitute reproducible evidence of utility for computational fluid dynamics and N-body codes.

minor comments (2)
  1. The abstract states 'nine digits of accuracy' without naming the error norm, tolerance definition, or reference solution used; add this detail in the results section for reproducibility.
  2. Clarify in the P2G section whether the 16× speedup baseline is a direct port of an existing CPU/GPU code or a naive Kokkos implementation, and list the exact particle counts and distributions for that measurement.
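The first comment turns on what "digits of accuracy" means. One common convention, shown below as an assumption since the page does not say which norm the paper uses, is the negative base-10 logarithm of the relative 2-norm error against a reference solution. The function names are hypothetical.

```python
import math

def relative_l2_error(approx, reference):
    """Relative 2-norm error of `approx` with respect to `reference`."""
    num = sum((a - r) ** 2 for a, r in zip(approx, reference))
    den = sum(r ** 2 for r in reference)
    return math.sqrt(num / den)

def digits_of_accuracy(approx, reference):
    """Number of correct digits, defined here as -log10(relative 2-norm error)."""
    err = relative_l2_error(approx, reference)
    return math.inf if err == 0.0 else -math.log10(err)

reference = [1.0, 2.0, 2.0]            # 2-norm equals 3
approx = [1.0 + 3e-9, 2.0, 2.0]        # absolute error of about 3e-9
digits = digits_of_accuracy(approx, reference)
```

In this example the relative error is about 1e-9, so `digits` comes out close to nine. A max-norm or per-particle relative error would give a different count for the same data, which is why the referee's request to name the norm matters.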

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive summary, significance assessment, and recommendation of minor revision. No specific major comments were raised.

Circularity Check

0 steps flagged

No significant circularity

full rationale

This is an implementation and benchmarking paper focused on GPU/CPU performance of Ewald summation variants for Stokes flow. All central claims (throughputs, 16× P2G speedup, 9-digit accuracy, scaling) are supported by direct measurements, analytical performance models, and empirical evaluation on specific hardware and particle distributions. No derivations, fitted parameters renamed as predictions, self-citations, or uniqueness theorems appear in the load-bearing steps; the near/far decomposition and kernel choices are presented as design decisions validated by timing data rather than reduced to inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The contribution is algorithmic implementation and optimization; it rests on the standard mathematical validity of the Ewald decomposition for periodic Stokes problems but introduces no new fitted parameters, physical constants, or postulated entities.

axioms (1)
  • domain assumption The near-field/far-field split in Ewald summation remains computationally advantageous for the target particle counts and accuracies.
    Invoked in the opening description of the method and kernel design.

pith-pipeline@v0.9.1-grok · 5857 in / 1284 out tokens · 26657 ms · 2026-06-26T20:06:54.039326+00:00 · methodology

