Portable Ewald summation algorithms for Stokes flow achieve ~8M particles/sec on H200 GPU with a novel P2G kernel providing 16x speedup and good multi-GPU scaling.
Brian Larkins and P
3 Pith papers cite this work. Polarity classification is still indexing.
Years: 2026 (3)
Verdicts: UNVERDICTED (3)
Representative citing papers:
- A performance portable fast Ewald summation for Stokes flow
  Portable Ewald summation algorithms for Stokes flow achieve ~8M particles/sec on H200 GPU with a novel P2G kernel providing 16x speedup and good multi-GPU scaling.
- Promise-Future Synchronization for Cluster Asynchronous Many-Task Runtimes via MPI One-Sided Communication
  Extends ItoyoriFBC with promise-future synchronization via MPI one-sided communication for dynamic dependencies in AMT runtimes, shown with HLU achieving 15.6x speedup on 16 nodes.
- Efficient and Portable Support for Overdecomposition on Distributed Memory GPGPU Platforms
  Charm++ techniques enable efficient overdecomposition on multi-vendor GPGPU distributed systems.