pith. sign in

arxiv: 2604.19243 · v2 · submitted 2026-04-21 · 💻 cs.DC

A Simple Communication Scheme for Distributed Fast Multipole Methods

Pith reviewed 2026-05-10 02:11 UTC · model grok-4.3

classification 💻 cs.DC
keywords Fast Multipole Methoddistributed computingMPI neighborhood collectivesuniform treesweak scalinghierarchical communicationparallel algorithms
0
0 comments X

The pith

A hierarchical communication scheme using MPI neighborhood collectives and uniform trees extends existing shared-memory Fast Multipole Method implementations to distributed memory with minimal redesign while scaling to billions of points.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents a straightforward method to run Fast Multipole Methods across multiple computers by building a communication pattern around uniform trees and MPI neighborhood collectives. This targets the practical need to take high-performance single-machine FMM code and move it to clusters without rewriting the core algorithms or losing existing speedups. The approach accepts simpler trees that give worse theoretical scaling on irregular data, but the authors show through large-scale tests that the resulting runtimes stay usable. The key payoff is the ability to handle uniform distributions of up to tens of billions of points on hundreds of nodes.

Core claim

We present a simple hierarchical communication scheme for distributed Fast Multipole Methods (FMMs) based on MPI neighborhood collectives and uniform trees. The method targets the common case of extending an existing high-performance shared-memory uniform-tree FMM implementation to distributed memory with minimal redesign while preserving any shared memory optimizations. Benchmarks on the ARCHER2 supercomputer demonstrate that our method can scale to very large problem sizes; we demonstrate weak-scaling up to 3.2e10 uniformly distributed points on 512 nodes of the machine in our largest runs.

What carries the argument

hierarchical communication scheme based on MPI neighborhood collectives and uniform trees, which organizes inter-node exchanges in a tree structure to reuse shared-memory code with little modification.

If this is right

  • Existing shared-memory uniform-tree FMM implementations can be ported to distributed memory while retaining their performance optimizations.
  • Weak scaling is achieved up to 3.2e10 points on 512 nodes for uniformly distributed data.
  • Non-uniform distributions can still be processed with runtimes that remain usable in practice.
  • The scheme reduces the engineering effort required to move FMM codes onto large clusters.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • This pattern could be applied to other tree-based scientific codes that already have efficient shared-memory versions.
  • Applications with roughly uniform data distributions gain a low-effort route to high node counts.
  • Hybrid approaches that add limited adaptivity on top of the uniform-tree backbone could address the scaling gap for clustered data without full redesign.

Load-bearing premise

The simplifications from using uniform trees will still yield practically useful runtimes for intended applications even though they produce worse asymptotic scaling on non-uniform point distributions.

What would settle it

A benchmark on a typical non-uniform distribution from the target application class showing that runtimes become impractically long or fail to scale beyond a modest number of nodes.

Figures

Figures reproduced from arXiv: 2604.19243 by Srinath Kailasa.

Figure 1
Figure 1. Figure 1: Interaction lists for a box β ∈ T, where T is a quadtree for a problem in R 2 , shown for clarity. Panel (a) shows an adaptively refined tree and panel (b) a uniformly refined tree [PITH_FULL_IMAGE:figures/full_fig_p006_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: We illustrate a problem in R 2 . The left plot shows the well-separated blue region in the exterior of an orange source box where the representation u for the source box is valid. The right plot shows the blue region of validity in the interior of a target box where d summarizes the field from well-separated source boxes in the orange exterior. The arrows represent the outgoing and incoming nature of the r… view at source ↗
Figure 3
Figure 3. Figure 3: We illustrate the partition of a computational domain Ω discretized with a uniform linear Hashed Octree (HOT) discretized to a depth d, such that each process i ∈ [0, P − 1] is responsible for a subdomain Ωi defined by the leaf boxes such that Ω is formed from their disjoint union Ω = SP i=1 Ωi. We illustrate that a spatial encoding scheme that preserves geometric locality amongst boxes leads to a partitio… view at source ↗
Figure 4
Figure 4. Figure 4: An illustration of octrees distributed over P processes using Ibeid et al.’s specification (Ibeid et al. 2016). Each node in the graph corresponds to a box, and each edge corresponds to a parent-child relationship, omitting some nodes for clarity. The root box of the global tree is labelled G, and corresponds to a box covering the problem domain. The root boxes of each local tree are labelled Pi for i ∈ [1… view at source ↗
Figure 5
Figure 5. Figure 5: The first row illustrates an example of a layout of a distributed tree over 8 MPI processes, each box corresponds to a local root, where we use a two dimensional tree for clarity. Colors correspond to MPI processes, each of which can contain multiple local roots. In the second row we highlight a target box β in process 4 and the edges and nodes that must be added to the neighborhood communication graphs at… view at source ↗
Figure 6
Figure 6. Figure 6: The default decision tree in MPICH for choosing an algorithm for MPI Alltoallv. Adapted from [PITH_FULL_IMAGE:figures/full_fig_p024_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: Weak scaling to 4e9 points on 512 ARCHER2 nodes. Top: uniform random distribution. Bottom: spherical surface distribution. Blue circles, orange squares, and green triangles correspond to CCX, CCD, and socket pinning, respectively. Times are reported from wall time, error bars are reported from standard deviation over 5 runs. This figure is the clearest demonstration of the regime we ultimately target: a co… view at source ↗
Figure 8
Figure 8. Figure 8: Weak scaling to 3.2e10 points on 512 ARCHER2 nodes using socket pinning. Top: uniform random distribution. Bottom: spherical surface distribution. with up to 3.2e10 non-uniform points exposing a significant load-imbalance in work per-rank, as well as the messages sent across the network, our observed mean wall times remain in the region of 10s of seconds. 5.3 Strong Scaling [PITH_FULL_IMAGE:figures/full_f… view at source ↗
Figure 9
Figure 9. Figure 9: Practical strong-scaling study for a fixed problem size of 128 × 106 points, using up to 64 ARCHER2 nodes. Top: uniform random distribution. Bottom: spherical surface distribution. Blue circles, orange squares, and green triangles correspond to CCX, CCD, and socket pinning, respectively. Dashed lines indicate ideal strong-scaling trends for the corresponding pinning strategy and should be read only as visu… view at source ↗
Figure 10
Figure 10. Figure 10: Setup-time breakdown for the socket-pinned weak-scaling configurations on ARCHER2 at 4e9 points. Top: uniform random distribution. Bottom: spherical surface distribution. In each plot, the left panel shows the total setup cost including tree construction and the initial distributed sort, while the right panel excludes that dominant phase to reveal the remaining distributed setup stages: domain-bounds exch… view at source ↗
Figure 11
Figure 11. Figure 11: Setup-time breakdown for the largest socket-pinned weak-scaling configurations on ARCHER2, up to 3.2 × 1010 points. Top: uniform random distribution. Bottom: spherical surface distribution. As in [PITH_FULL_IMAGE:figures/full_fig_p032_11.png] view at source ↗
Figure 12
Figure 12. Figure 12: Aggregate communication and computation time for the same socket-pinned weak-scaling configurations used in the preceding scaling study. Top: uniform random distribution. Bottom: spherical surface distribution. The stacked bars report mean times over five repeated runs and separate communication from computation, while the dashed red and green curves indicate reference log P and P trends computed from the… view at source ↗
Figure 13
Figure 13. Figure 13: Breakdown of communication time for the same socket-pinned weak-scaling configurations used above. Top: uniform random distribution. Bottom: spherical surface distribution. The stacked bars report mean communication times over five repeated runs and are plotted for illustration; from bottom to top they show MPI Scatterv, MPI Gatherv, and MPI Neighbor Alltoallv. collectives to encode static communication g… view at source ↗
read the original abstract

We present a simple hierarchical communication scheme for distributed Fast Multipole Methods (FMMs) based on MPI neighborhood collectives and uniform trees. The method targets the common case of extending an existing high-performance shared-memory uniform-tree FMM implementation to distributed memory with minimal redesign while preserving any shared memory optimizations. Benchmarks on the ARCHER2 supercomputer demonstrate that our method can scale to very large problem sizes, we demonstrate weak-scaling up to 3.2e10 uniformly distributed points on 512 nodes of the machine in our largest runs. Our simplifications based on uniform trees result in worse asymptotic scaling for non-uniform points, however we still obtain practically useful runtimes due to the ability to retain our shared memory optimizations.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces a simple hierarchical communication scheme for distributed Fast Multipole Methods (FMMs) based on MPI neighborhood collectives and uniform trees. It targets minimal redesign when extending existing shared-memory uniform-tree FMM implementations to distributed memory while retaining shared-memory optimizations. Weak-scaling benchmarks on ARCHER2 are reported, reaching 3.2e10 uniformly distributed points on 512 nodes, with the authors acknowledging worse asymptotic scaling for non-uniform distributions but asserting that practically useful runtimes are still achieved.

Significance. If the central claims are substantiated, the scheme offers a low-overhead route to distributed FMM on large-scale systems by leveraging prior shared-memory work, which could lower barriers to exascale simulations in computational science. The concrete weak-scaling results for uniform cases on a production machine provide a useful data point for the field.

major comments (2)
  1. [Abstract] The assertion that 'practically useful runtimes' are obtained for non-uniform distributions (common in FMM applications) lacks empirical support. The reported benchmarks cover only uniform point sets, and the text explicitly notes worse asymptotic scaling under the uniform-tree simplifications without providing any runtime, scaling, or timing data for clustered or non-uniform inputs.
  2. [Benchmarks] The manuscript provides no implementation details on the MPI neighborhood collective usage, no error analysis or accuracy verification for the FMM, and no direct performance comparisons against existing distributed FMM communication methods. These omissions leave the novelty, correctness, and practical advantage of the scheme only partially substantiated by the scaling numbers alone.
minor comments (1)
  1. [Abstract] The abstract contains a grammatical inconsistency: the clause beginning 'we demonstrate weak-scaling...' starts with a lowercase letter after a comma.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive review and for recognizing the potential significance of the simple communication scheme. We address each major comment below with proposed revisions.

read point-by-point responses
  1. Referee: [Abstract] The assertion that 'practically useful runtimes' are obtained for non-uniform distributions (common in FMM applications) lacks empirical support. The reported benchmarks cover only uniform point sets, and the text explicitly notes worse asymptotic scaling under the uniform-tree simplifications without providing any runtime, scaling, or timing data for clustered or non-uniform inputs.

    Authors: We agree that the claim lacks empirical support in the current manuscript, which reports only uniform-distribution results. The statement reflected experience from the underlying shared-memory implementation but should not have been asserted without data here. We will revise the abstract to remove or carefully qualify the claim about non-uniform cases. We will also add a short subsection with timing results for a few clustered (non-uniform) point sets at moderate sizes to illustrate that runtimes can remain practical despite the worse asymptotic behavior. revision: yes

  2. Referee: [Benchmarks] The manuscript provides no implementation details on the MPI neighborhood collective usage, no error analysis or accuracy verification for the FMM, and no direct performance comparisons against existing distributed FMM communication methods. These omissions leave the novelty, correctness, and practical advantage of the scheme only partially substantiated by the scaling numbers alone.

    Authors: We acknowledge these gaps. In the revised manuscript we will add an implementation section with pseudocode and explicit description of the MPI neighborhood collective calls used in the hierarchical scheme. We will also include an accuracy verification subsection reporting relative errors against direct summation for small test cases. Direct head-to-head runtime comparisons with other distributed FMM codes are outside the scope of this work, which emphasizes minimal changes to existing shared-memory uniform-tree codes; we will expand the related-work discussion to clarify this positioning and note that the reported weak-scaling results on 512 nodes still provide a useful data point for the approach. revision: partial

Circularity Check

0 steps flagged

No circularity: algorithmic scheme with direct empirical benchmarks

full rationale

The paper presents a concrete algorithmic communication scheme for distributed FMMs using MPI neighborhood collectives on uniform trees, together with scaling benchmarks on uniform point distributions up to 3.2e10 points. No derivation chain, first-principles predictions, fitted parameters, or uniqueness theorems are claimed; the work is self-contained as an engineering extension of existing shared-memory FMM code, with explicit acknowledgment of asymptotic limitations for non-uniform cases. No load-bearing step reduces to its own inputs by construction or self-citation.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The work introduces no new mathematical parameters, axioms, or postulated entities; it relies entirely on existing MPI primitives and standard uniform-tree data structures.

pith-pipeline@v0.9.0 · 5406 in / 1005 out tokens · 104271 ms · 2026-05-10T02:11:32.606273+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

48 extracted references · 48 canonical work pages

  1. [1]

    arXiv preprint arXiv:2408.07436 , year=

    M2L Translation Operators for Kernel Independent Fast Multipole Methods on Modern Architectures , author=. arXiv preprint arXiv:2408.07436 , year=

  2. [2]

    Journal of computational physics , volume=

    A fast algorithm for particle simulations , author=. Journal of computational physics , volume=. 1987 , publisher=

  3. [3]

    Numerische Mathematik , volume=

    On the fast matrix multiplication in the boundary element method by panel clustering , author=. Numerische Mathematik , volume=. 1989 , publisher=

  4. [4]

    Part I: Introduction to-matrices , author=

    A sparse matrix arithmetic based on-matrices. Part I: Introduction to-matrices , author=. Computing , volume=. 1999 , publisher=

  5. [5]

    The International Journal of High Performance Computing Applications , volume=

    A performance model for the communication in fast multipole methods on high-performance computing platforms , author=. The International Journal of High Performance Computing Applications , volume=. 2016 , publisher=

  6. [6]

    Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis , pages=

    A massively parallel adaptive fast-multipole method on heterogeneous architectures , author=. Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis , pages=

  7. [7]

    ACM SIGARCH Computer Architecture News , volume=

    Technology-driven, highly-scalable dragonfly topology , author=. ACM SIGARCH Computer Architecture News , volume=. 2008 , publisher=

  8. [8]

    SC20: International Conference for High Performance Computing, Networking, Storage and Analysis , pages=

    An in-depth analysis of the slingshot interconnect , author=. SC20: International Conference for High Performance Computing, Networking, Storage and Analysis , pages=. 2020 , organization=

  9. [9]

    SIAM Journal on Scientific Computing , volume=

    Provably good partitioning and load balancing algorithms for parallel adaptive N-body simulation , author=. SIAM Journal on Scientific Computing , volume=. 1998 , publisher=

  10. [10]

    arXiv preprint arXiv:1406.1974 , year=

    Communication complexity of the fast multipole method and its algebraic variants , author=. arXiv preprint arXiv:1406.1974 , year=

  11. [11]

    2025 , publisher =

    Srinath Kailasa , title =. 2025 , publisher =

  12. [12]

    Proceedings of the 1993 ACM/IEEE conference on Supercomputing , pages=

    A parallel hashed oct-tree n-body algorithm , author=. Proceedings of the 1993 ACM/IEEE conference on Supercomputing , pages=

  13. [13]

    SIAM Journal on Scientific Computing , volume=

    Bottom-up construction and 2: 1 balance refinement of linear octrees in parallel , author=. SIAM Journal on Scientific Computing , volume=. 2008 , publisher=

  14. [14]

    Proceedings in applied mathematics and mechanics , volume=

    A short overview of H2-matrices , author=. Proceedings in applied mathematics and mechanics , volume=

  15. [15]

    Journal of Computational Physics , volume=

    A kernel-independent adaptive fast multipole algorithm in two and three dimensions , author=. Journal of Computational Physics , volume=. 2004 , publisher=

  16. [16]

    Communications in Computational Physics , volume=

    PVFMM: A parallel kernel independent FMM for particle and volume potentials , author=. Communications in Computational Physics , volume=. 2015 , publisher=

  17. [17]

    Journal of Open Source Software , volume=

    ExaFMM: a high-performance fast multipole method library with C++ and Python interfaces , author=. Journal of Open Source Software , volume=

  18. [18]

    The International Journal of High Performance Computing Applications , volume=

    A tuned and scalable fast multipole method as a preeminent algorithm for exascale systems , author=. The International Journal of High Performance Computing Applications , volume=. 2012 , publisher=

  19. [19]

    SIAM Conference on Computational Science and Engineering (SIAM CSE 2015) , year=

    ScalFMM: A generic parallel fast multipole library , author=. SIAM Conference on Computational Science and Engineering (SIAM CSE 2015) , year=

  20. [20]

    Journal of Open Source Software , volume=

    TBFMM: A C++ generic and parallel fast multipole method library , author=. Journal of Open Source Software , volume=

  21. [21]

    International Journal for Numerical Methods in Engineering , volume=

    Optimizing the multipole-to-local operator in the fast multipole method for graphical processing units , author=. International Journal for Numerical Methods in Engineering , volume=. 2012 , publisher=

  22. [22]

    arXiv preprint arXiv:1405.7487 , year=

    Asynchronous execution of the fast multipole method using charm++ , author=. arXiv preprint arXiv:1405.7487 , year=

  23. [23]

    Concurrency and Computation: Practice and Experience , volume=

    Data-driven execution of fast multipole methods , author=. Concurrency and Computation: Practice and Experience , volume=. 2014 , publisher=

  24. [24]

    arXiv preprint arXiv:1210.7292 , year=

    Optimized M2L kernels for the Chebyshev interpolation based fast multipole method , author=. arXiv preprint arXiv:1210.7292 , year=

  25. [25]

    Journal of Computational Physics , volume=

    The black-box fast multipole method , author=. Journal of Computational Physics , volume=. 2009 , publisher=

  26. [26]

    Journal of Computational Physics , volume=

    The fast multipole method: numerical implementation , author=. Journal of Computational Physics , volume=. 2000 , publisher=

  27. [27]

    SIAM Journal on Scientific and Statistical Computing , volume=

    An implementation of the fast multipole method without multipoles , author=. SIAM Journal on Scientific and Statistical Computing , volume=. 1992 , publisher=

  28. [28]

    Journal of Computational Physics , volume=

    Fast multipole methods on graphics processors , author=. Journal of Computational Physics , volume=. 2008 , publisher=

  29. [29]

    2010 IEEE International Symposium on Parallel & Distributed Processing (IPDPS) , pages=

    Optimizing and tuning the fast multipole method for state-of-the-art multicore architectures , author=. 2010 IEEE International Symposium on Parallel & Distributed Processing (IPDPS) , pages=. 2010 , organization=

  30. [30]

    SIAM Journal on Scientific Computing , volume=

    An accelerated kernel-independent fast multipole method in one dimension , author=. SIAM Journal on Scientific Computing , volume=. 2007 , publisher=

  31. [31]

    Journal of Algorithms & Computational Technology , volume=

    An FMM based on dual tree traversal for many-core architectures , author=. Journal of Algorithms & Computational Technology , volume=. 2013 , publisher=

  32. [32]

    2025 , school=

    Modern Resesarch Software For Fast Multipole Methods , author=. 2025 , school=

  33. [33]

    2017 , school=

    Task-based fast multipole method for clusters of multicore processors , author=. 2017 , school=

  34. [34]

    Philosophical Transactions of the Royal Society A , volume=

    Hierarchical algorithms on hierarchical architectures , author=. Philosophical Transactions of the Royal Society A , volume=. 2020 , publisher=

  35. [35]

    Proceedings of the 27th international ACM conference on international conference on supercomputing , pages=

    Hyksort: a new variant of hypercube quicksort on distributed memory architectures , author=. Proceedings of the 27th international ACM conference on international conference on supercomputing , pages=

  36. [36]

    SC20: International Conference for High Performance Computing, Networking, Storage and Analysis , pages=

    Improving all-to-many personalized communication in two-phase i/o , author=. SC20: International Conference for High Performance Computing, Networking, Storage and Analysis , pages=. 2020 , organization=

  37. [37]

    Proceedings of the 31st International Symposium on High-Performance Parallel and Distributed Computing , pages=

    Optimizing the bruck algorithm for non-uniform all-to-all communication , author=. Proceedings of the 31st International Symposium on High-Performance Parallel and Distributed Computing , pages=

  38. [38]

    Parallel Computing , volume=

    Towards leveraging collective performance with the support of MPI 4.0 features in MPC , author=. Parallel Computing , volume=. 2022 , publisher=

  39. [39]

    Parallel Computing , volume=

    An optimisation of allreduce communication in message-passing systems , author=. Parallel Computing , volume=. 2021 , publisher=

  40. [40]

    Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis , pages=

    Optimizing MPI Collectives on Shared Memory Multi-Cores , author=. Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis , pages=

  41. [41]

    Proceedings of the 26th ACM international conference on Supercomputing , pages=

    An analysis of computational workloads for the ORNL Jaguar system , author=. Proceedings of the 26th ACM international conference on Supercomputing , pages=

  42. [42]

    2024 , institution=

    S&TR December 2024 El Capitan issue , author=. 2024 , institution=

  43. [43]

    2010 18th IEEE Symposium on High Performance Interconnects , pages=

    The gemini system interconnect , author=. 2010 18th IEEE Symposium on High Performance Interconnects , pages=. 2010 , organization=

  44. [44]

    Communications of the ACM , volume=

    Roofline: an insightful visual performance model for multicore architectures , author=. Communications of the ACM , volume=. 2009 , publisher=

  45. [45]

    Proceedings of the third annual ACM symposium on Parallel algorithms and architectures , pages=

    A comparison of sorting algorithms for the connection machine CM-2 , author=. Proceedings of the third annual ACM symposium on Parallel algorithms and architectures , pages=

  46. [46]

    Proceedings - 2022 IEEE 29th International Conference on High Performance Computing, Data, and Analytics Workshops, HiPCW 2022 , pages =

    Netterville, Nick and Fan, Ke and Kumar, Sidharth and Gilray, Thomas , doi =. Proceedings - 2022 IEEE 29th International Conference on High Performance Computing, Data, and Analytics Workshops, HiPCW 2022 , pages =

  47. [47]

    International Conference on High Performance Computing , pages=

    Communication reducing algorithms for distributed hierarchical N-body problems with boundary distributions , author=. International Conference on High Performance Computing , pages=. 2017 , organization=

  48. [48]

    International Journal of High Performance Computing Applications , keywords =

    Thakur, Rajeev and Rabenseifner, Rolf and Gropp, William , doi =. International Journal of High Performance Computing Applications , keywords =