A Simple Communication Scheme for Distributed Fast Multipole Methods
Pith reviewed 2026-05-10 02:11 UTC · model grok-4.3
The pith
A hierarchical communication scheme using MPI neighborhood collectives and uniform trees extends existing shared-memory Fast Multipole Method implementations to distributed memory with minimal redesign while scaling to billions of points.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We present a simple hierarchical communication scheme for distributed Fast Multipole Methods (FMMs) based on MPI neighborhood collectives and uniform trees. The method targets the common case of extending an existing high-performance shared-memory uniform-tree FMM implementation to distributed memory with minimal redesign while preserving any shared memory optimizations. Benchmarks on the ARCHER2 supercomputer demonstrate that our method can scale to very large problem sizes; we demonstrate weak-scaling up to 3.2e10 uniformly distributed points on 512 nodes of the machine in our largest runs.
What carries the argument
hierarchical communication scheme based on MPI neighborhood collectives and uniform trees, which organizes inter-node exchanges in a tree structure to reuse shared-memory code with little modification.
If this is right
- Existing shared-memory uniform-tree FMM implementations can be ported to distributed memory while retaining their performance optimizations.
- Weak scaling is achieved up to 3.2e10 points on 512 nodes for uniformly distributed data.
- Non-uniform distributions can still be processed with runtimes that remain usable in practice.
- The scheme reduces the engineering effort required to move FMM codes onto large clusters.
Where Pith is reading between the lines
- This pattern could be applied to other tree-based scientific codes that already have efficient shared-memory versions.
- Applications with roughly uniform data distributions gain a low-effort route to high node counts.
- Hybrid approaches that add limited adaptivity on top of the uniform-tree backbone could address the scaling gap for clustered data without full redesign.
Load-bearing premise
The simplifications from using uniform trees will still yield practically useful runtimes for intended applications even though they produce worse asymptotic scaling on non-uniform point distributions.
What would settle it
A benchmark on a typical non-uniform distribution from the target application class showing that runtimes become impractically long or fail to scale beyond a modest number of nodes.
Figures
read the original abstract
We present a simple hierarchical communication scheme for distributed Fast Multipole Methods (FMMs) based on MPI neighborhood collectives and uniform trees. The method targets the common case of extending an existing high-performance shared-memory uniform-tree FMM implementation to distributed memory with minimal redesign while preserving any shared memory optimizations. Benchmarks on the ARCHER2 supercomputer demonstrate that our method can scale to very large problem sizes, we demonstrate weak-scaling up to 3.2e10 uniformly distributed points on 512 nodes of the machine in our largest runs. Our simplifications based on uniform trees result in worse asymptotic scaling for non-uniform points, however we still obtain practically useful runtimes due to the ability to retain our shared memory optimizations.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces a simple hierarchical communication scheme for distributed Fast Multipole Methods (FMMs) based on MPI neighborhood collectives and uniform trees. It targets minimal redesign when extending existing shared-memory uniform-tree FMM implementations to distributed memory while retaining shared-memory optimizations. Weak-scaling benchmarks on ARCHER2 are reported, reaching 3.2e10 uniformly distributed points on 512 nodes, with the authors acknowledging worse asymptotic scaling for non-uniform distributions but asserting that practically useful runtimes are still achieved.
Significance. If the central claims are substantiated, the scheme offers a low-overhead route to distributed FMM on large-scale systems by leveraging prior shared-memory work, which could lower barriers to exascale simulations in computational science. The concrete weak-scaling results for uniform cases on a production machine provide a useful data point for the field.
major comments (2)
- [Abstract] The assertion that 'practically useful runtimes' are obtained for non-uniform distributions (common in FMM applications) lacks empirical support. The reported benchmarks cover only uniform point sets, and the text explicitly notes worse asymptotic scaling under the uniform-tree simplifications without providing any runtime, scaling, or timing data for clustered or non-uniform inputs.
- [Benchmarks] The manuscript provides no implementation details on the MPI neighborhood collective usage, no error analysis or accuracy verification for the FMM, and no direct performance comparisons against existing distributed FMM communication methods. These omissions leave the novelty, correctness, and practical advantage of the scheme only partially substantiated by the scaling numbers alone.
minor comments (1)
- [Abstract] The abstract contains a grammatical inconsistency: the clause beginning 'we demonstrate weak-scaling...' starts with a lowercase letter after a comma.
Simulated Author's Rebuttal
We thank the referee for the constructive review and for recognizing the potential significance of the simple communication scheme. We address each major comment below with proposed revisions.
read point-by-point responses
-
Referee: [Abstract] The assertion that 'practically useful runtimes' are obtained for non-uniform distributions (common in FMM applications) lacks empirical support. The reported benchmarks cover only uniform point sets, and the text explicitly notes worse asymptotic scaling under the uniform-tree simplifications without providing any runtime, scaling, or timing data for clustered or non-uniform inputs.
Authors: We agree that the claim lacks empirical support in the current manuscript, which reports only uniform-distribution results. The statement reflected experience from the underlying shared-memory implementation but should not have been asserted without data here. We will revise the abstract to remove or carefully qualify the claim about non-uniform cases. We will also add a short subsection with timing results for a few clustered (non-uniform) point sets at moderate sizes to illustrate that runtimes can remain practical despite the worse asymptotic behavior. revision: yes
-
Referee: [Benchmarks] The manuscript provides no implementation details on the MPI neighborhood collective usage, no error analysis or accuracy verification for the FMM, and no direct performance comparisons against existing distributed FMM communication methods. These omissions leave the novelty, correctness, and practical advantage of the scheme only partially substantiated by the scaling numbers alone.
Authors: We acknowledge these gaps. In the revised manuscript we will add an implementation section with pseudocode and explicit description of the MPI neighborhood collective calls used in the hierarchical scheme. We will also include an accuracy verification subsection reporting relative errors against direct summation for small test cases. Direct head-to-head runtime comparisons with other distributed FMM codes are outside the scope of this work, which emphasizes minimal changes to existing shared-memory uniform-tree codes; we will expand the related-work discussion to clarify this positioning and note that the reported weak-scaling results on 512 nodes still provide a useful data point for the approach. revision: partial
Circularity Check
No circularity: algorithmic scheme with direct empirical benchmarks
full rationale
The paper presents a concrete algorithmic communication scheme for distributed FMMs using MPI neighborhood collectives on uniform trees, together with scaling benchmarks on uniform point distributions up to 3.2e10 points. No derivation chain, first-principles predictions, fitted parameters, or uniqueness theorems are claimed; the work is self-contained as an engineering extension of existing shared-memory FMM code, with explicit acknowledgment of asymptotic limitations for non-uniform cases. No load-bearing step reduces to its own inputs by construction or self-citation.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
-
[1]
arXiv preprint arXiv:2408.07436 , year=
M2L Translation Operators for Kernel Independent Fast Multipole Methods on Modern Architectures , author=. arXiv preprint arXiv:2408.07436 , year=
-
[2]
Journal of computational physics , volume=
A fast algorithm for particle simulations , author=. Journal of computational physics , volume=. 1987 , publisher=
work page 1987
-
[3]
Numerische Mathematik , volume=
On the fast matrix multiplication in the boundary element method by panel clustering , author=. Numerische Mathematik , volume=. 1989 , publisher=
work page 1989
-
[4]
Part I: Introduction to-matrices , author=
A sparse matrix arithmetic based on-matrices. Part I: Introduction to-matrices , author=. Computing , volume=. 1999 , publisher=
work page 1999
-
[5]
The International Journal of High Performance Computing Applications , volume=
A performance model for the communication in fast multipole methods on high-performance computing platforms , author=. The International Journal of High Performance Computing Applications , volume=. 2016 , publisher=
work page 2016
-
[6]
A massively parallel adaptive fast-multipole method on heterogeneous architectures , author=. Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis , pages=
-
[7]
ACM SIGARCH Computer Architecture News , volume=
Technology-driven, highly-scalable dragonfly topology , author=. ACM SIGARCH Computer Architecture News , volume=. 2008 , publisher=
work page 2008
-
[8]
An in-depth analysis of the slingshot interconnect , author=. SC20: International Conference for High Performance Computing, Networking, Storage and Analysis , pages=. 2020 , organization=
work page 2020
-
[9]
SIAM Journal on Scientific Computing , volume=
Provably good partitioning and load balancing algorithms for parallel adaptive N-body simulation , author=. SIAM Journal on Scientific Computing , volume=. 1998 , publisher=
work page 1998
-
[10]
arXiv preprint arXiv:1406.1974 , year=
Communication complexity of the fast multipole method and its algebraic variants , author=. arXiv preprint arXiv:1406.1974 , year=
- [11]
-
[12]
Proceedings of the 1993 ACM/IEEE conference on Supercomputing , pages=
A parallel hashed oct-tree n-body algorithm , author=. Proceedings of the 1993 ACM/IEEE conference on Supercomputing , pages=
work page 1993
-
[13]
SIAM Journal on Scientific Computing , volume=
Bottom-up construction and 2: 1 balance refinement of linear octrees in parallel , author=. SIAM Journal on Scientific Computing , volume=. 2008 , publisher=
work page 2008
-
[14]
Proceedings in applied mathematics and mechanics , volume=
A short overview of H2-matrices , author=. Proceedings in applied mathematics and mechanics , volume=
-
[15]
Journal of Computational Physics , volume=
A kernel-independent adaptive fast multipole algorithm in two and three dimensions , author=. Journal of Computational Physics , volume=. 2004 , publisher=
work page 2004
-
[16]
Communications in Computational Physics , volume=
PVFMM: A parallel kernel independent FMM for particle and volume potentials , author=. Communications in Computational Physics , volume=. 2015 , publisher=
work page 2015
-
[17]
Journal of Open Source Software , volume=
ExaFMM: a high-performance fast multipole method library with C++ and Python interfaces , author=. Journal of Open Source Software , volume=
-
[18]
The International Journal of High Performance Computing Applications , volume=
A tuned and scalable fast multipole method as a preeminent algorithm for exascale systems , author=. The International Journal of High Performance Computing Applications , volume=. 2012 , publisher=
work page 2012
-
[19]
SIAM Conference on Computational Science and Engineering (SIAM CSE 2015) , year=
ScalFMM: A generic parallel fast multipole library , author=. SIAM Conference on Computational Science and Engineering (SIAM CSE 2015) , year=
work page 2015
-
[20]
Journal of Open Source Software , volume=
TBFMM: A C++ generic and parallel fast multipole method library , author=. Journal of Open Source Software , volume=
-
[21]
International Journal for Numerical Methods in Engineering , volume=
Optimizing the multipole-to-local operator in the fast multipole method for graphical processing units , author=. International Journal for Numerical Methods in Engineering , volume=. 2012 , publisher=
work page 2012
-
[22]
arXiv preprint arXiv:1405.7487 , year=
Asynchronous execution of the fast multipole method using charm++ , author=. arXiv preprint arXiv:1405.7487 , year=
-
[23]
Concurrency and Computation: Practice and Experience , volume=
Data-driven execution of fast multipole methods , author=. Concurrency and Computation: Practice and Experience , volume=. 2014 , publisher=
work page 2014
-
[24]
arXiv preprint arXiv:1210.7292 , year=
Optimized M2L kernels for the Chebyshev interpolation based fast multipole method , author=. arXiv preprint arXiv:1210.7292 , year=
-
[25]
Journal of Computational Physics , volume=
The black-box fast multipole method , author=. Journal of Computational Physics , volume=. 2009 , publisher=
work page 2009
-
[26]
Journal of Computational Physics , volume=
The fast multipole method: numerical implementation , author=. Journal of Computational Physics , volume=. 2000 , publisher=
work page 2000
-
[27]
SIAM Journal on Scientific and Statistical Computing , volume=
An implementation of the fast multipole method without multipoles , author=. SIAM Journal on Scientific and Statistical Computing , volume=. 1992 , publisher=
work page 1992
-
[28]
Journal of Computational Physics , volume=
Fast multipole methods on graphics processors , author=. Journal of Computational Physics , volume=. 2008 , publisher=
work page 2008
-
[29]
2010 IEEE International Symposium on Parallel & Distributed Processing (IPDPS) , pages=
Optimizing and tuning the fast multipole method for state-of-the-art multicore architectures , author=. 2010 IEEE International Symposium on Parallel & Distributed Processing (IPDPS) , pages=. 2010 , organization=
work page 2010
-
[30]
SIAM Journal on Scientific Computing , volume=
An accelerated kernel-independent fast multipole method in one dimension , author=. SIAM Journal on Scientific Computing , volume=. 2007 , publisher=
work page 2007
-
[31]
Journal of Algorithms & Computational Technology , volume=
An FMM based on dual tree traversal for many-core architectures , author=. Journal of Algorithms & Computational Technology , volume=. 2013 , publisher=
work page 2013
-
[32]
Modern Resesarch Software For Fast Multipole Methods , author=. 2025 , school=
work page 2025
-
[33]
Task-based fast multipole method for clusters of multicore processors , author=. 2017 , school=
work page 2017
-
[34]
Philosophical Transactions of the Royal Society A , volume=
Hierarchical algorithms on hierarchical architectures , author=. Philosophical Transactions of the Royal Society A , volume=. 2020 , publisher=
work page 2020
-
[35]
Hyksort: a new variant of hypercube quicksort on distributed memory architectures , author=. Proceedings of the 27th international ACM conference on international conference on supercomputing , pages=
-
[36]
Improving all-to-many personalized communication in two-phase i/o , author=. SC20: International Conference for High Performance Computing, Networking, Storage and Analysis , pages=. 2020 , organization=
work page 2020
-
[37]
Optimizing the bruck algorithm for non-uniform all-to-all communication , author=. Proceedings of the 31st International Symposium on High-Performance Parallel and Distributed Computing , pages=
-
[38]
Towards leveraging collective performance with the support of MPI 4.0 features in MPC , author=. Parallel Computing , volume=. 2022 , publisher=
work page 2022
-
[39]
An optimisation of allreduce communication in message-passing systems , author=. Parallel Computing , volume=. 2021 , publisher=
work page 2021
-
[40]
Optimizing MPI Collectives on Shared Memory Multi-Cores , author=. Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis , pages=
-
[41]
Proceedings of the 26th ACM international conference on Supercomputing , pages=
An analysis of computational workloads for the ORNL Jaguar system , author=. Proceedings of the 26th ACM international conference on Supercomputing , pages=
-
[42]
S&TR December 2024 El Capitan issue , author=. 2024 , institution=
work page 2024
-
[43]
2010 18th IEEE Symposium on High Performance Interconnects , pages=
The gemini system interconnect , author=. 2010 18th IEEE Symposium on High Performance Interconnects , pages=. 2010 , organization=
work page 2010
-
[44]
Communications of the ACM , volume=
Roofline: an insightful visual performance model for multicore architectures , author=. Communications of the ACM , volume=. 2009 , publisher=
work page 2009
-
[45]
Proceedings of the third annual ACM symposium on Parallel algorithms and architectures , pages=
A comparison of sorting algorithms for the connection machine CM-2 , author=. Proceedings of the third annual ACM symposium on Parallel algorithms and architectures , pages=
-
[46]
Netterville, Nick and Fan, Ke and Kumar, Sidharth and Gilray, Thomas , doi =. Proceedings - 2022 IEEE 29th International Conference on High Performance Computing, Data, and Analytics Workshops, HiPCW 2022 , pages =
work page 2022
-
[47]
International Conference on High Performance Computing , pages=
Communication reducing algorithms for distributed hierarchical N-body problems with boundary distributions , author=. International Conference on High Performance Computing , pages=. 2017 , organization=
work page 2017
-
[48]
International Journal of High Performance Computing Applications , keywords =
Thakur, Rajeev and Rabenseifner, Rolf and Gropp, William , doi =. International Journal of High Performance Computing Applications , keywords =
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.