arxiv: 2604.05099 · v1 · submitted 2026-04-06 · 💻 cs.DC

Analyzing Persistent Alltoallv RMA Implementations for High-Performance MPI Communication

Pith reviewed 2026-05-10 18:33 UTC · model grok-4.3

classification 💻 cs.DC
keywords persistent MPI · RMA Alltoallv · collective communication · HPC performance · fence synchronization · irregular communication · metadata reuse

The pith

Persistent RMA Alltoallv variants reduce runtime by up to 44% for large messages by reusing metadata and window state.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper designs persistent MPI RMA versions of Alltoallv that use fence or lock synchronization to separate a one-time initialization phase from repeated execution phases. This structure allows reuse of communication metadata and window state across iterations instead of rebuilding it each time. Benchmarks on LLNL's Dane supercomputer show the fence-persistent version consistently beats the standard non-persistent baseline for large messages, with speedups that grow as message size and process count increase. The work also tests irregular sparse patterns, compares fence versus lock designs, and supplies a break-even model showing when the approach becomes worthwhile. Such improvements matter for HPC applications that call Alltoallv repeatedly with uneven data sizes, where repeated setup costs can add up.

Core claim

Persistent MPI RMA Alltoallv implementations based on fence and lock synchronization separate initialization from per-iteration execution to enable reuse of communication metadata and window state. Benchmarks demonstrate that the fence-persistent variant outperforms the non-persistent baseline for large messages, achieving up to 44% runtime reduction and better scalability with process count, including a drop from 2.49s to 1.54s at 448 processes. Message-size sweeps and a break-even model establish that persistence provides immediate payoff for messages of 32,768 bytes or larger while smaller messages show limited benefit due to amortization costs.
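
As a quick arithmetic check on the 448-process figures quoted above, the following sketch computes both the fractional runtime reduction and the equivalent speedup factor. The two timings come from the text; the variable names are illustrative only.

```python
# Check the headline numbers quoted above: 2.49 s -> 1.54 s at 448 processes.
baseline_s = 2.49    # non-persistent runtime (from the text)
persistent_s = 1.54  # fence-persistent runtime (from the text)

reduction = (baseline_s - persistent_s) / baseline_s  # fraction of runtime removed
speedup = baseline_s / persistent_s                   # old runtime / new runtime

print(f"runtime reduction: {reduction:.1%}")  # 38.2%
print(f"speedup factor:    {speedup:.2f}x")   # 1.62x
```

The two numbers describe the same measurement: a 38% reduction in runtime corresponds to a speedup of roughly 1.62x, not 1.38x.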

What carries the argument

Persistent RMA Alltoallv, which splits one-time setup from repeated execution phases to reuse metadata and window state across epochs.
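
The split between one-time setup and repeated execution can be illustrated with a small single-process model. This is a sketch under stated assumptions, not the paper's implementation and not an MPI API: the class `PersistentAlltoallvPlan`, the helper `exclusive_prefix_sum`, and the count matrix are hypothetical, and plain Python lists stand in for RMA windows. It shows only the idea that displacements and receive buffers are derived once and then reused on every iteration.

```python
from itertools import accumulate


def exclusive_prefix_sum(counts):
    """Offsets of each block in a packed buffer, e.g. [1, 2, 0] -> [0, 1, 3]."""
    return [0] + list(accumulate(counts))[:-1]


class PersistentAlltoallvPlan:
    """Hypothetical single-process model; not the paper's code, not an MPI API."""

    def __init__(self, send_counts):
        # One-time initialization: derive and cache every piece of metadata.
        # send_counts[src][dst] = number of elements src sends to dst.
        n = len(send_counts)
        self.nprocs = n
        self.send_counts = send_counts
        self.send_displs = [exclusive_prefix_sum(row) for row in send_counts]
        recv_counts = [[send_counts[src][dst] for src in range(n)] for dst in range(n)]
        self.recv_displs = [exclusive_prefix_sum(row) for row in recv_counts]
        # Stand-in for window state: one receive buffer per rank, allocated once.
        self.windows = [[None] * sum(row) for row in recv_counts]

    def execute(self, send_bufs):
        # Per-iteration phase: data movement only, using the cached offsets.
        # In a fence-based design this loop would sit between two fences.
        for src in range(self.nprocs):
            for dst in range(self.nprocs):
                count = self.send_counts[src][dst]
                s_off = self.send_displs[src][dst]
                r_off = self.recv_displs[dst][src]
                self.windows[dst][r_off:r_off + count] = send_bufs[src][s_off:s_off + count]
        return self.windows


counts = [[1, 2, 0], [0, 1, 3], [2, 0, 1]]  # illustrative irregular pattern
plan = PersistentAlltoallvPlan(counts)       # setup happens once
for step in range(2):                        # execution repeats
    bufs = [[f"{src}->{dst}@{step}" for dst in range(3) for _ in range(counts[src][dst])]
            for src in range(3)]
    result = plan.execute(bufs)
print(result[0])  # ['0->0@1', '2->0@1', '2->0@1']
```

A non-persistent variant would recompute the displacement tables and reallocate the receive buffers inside every call, which is the repeated setup cost the persistent design is meant to avoid.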

If this is right

  • Persistence delivers immediate benefits once messages reach or exceed 32,768 bytes as runtime becomes dominated by data movement rather than setup.
  • Time savings increase with larger message sizes and higher process counts, improving overall scalability.
  • Fence-based synchronization shows stronger gains than lock-based in the tested scenarios, including under irregular sparse communication patterns.
  • Hierarchical extensions of the designs remain practical for workloads with large messages.
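
The paper's break-even model is not reproduced on this page. As a hedged illustration of the amortization idea behind the first bullet, the sketch below assumes a generic linear cost model: the persistent variant pays a one-time `t_init` plus `t_persist` per iteration, and the baseline pays `t_base` per iteration. The function name and all input values are hypothetical and are not measurements from the paper.

```python
import math


def break_even_iterations(t_init, t_base, t_persist):
    """Smallest n with t_init + n * t_persist <= n * t_base.

    Returns None when the persistent variant is not faster per iteration,
    in which case the setup cost is never recovered.
    """
    saving = t_base - t_persist  # time saved per iteration
    if saving <= 0:
        return None
    return max(1, math.ceil(t_init / saving))


# Hypothetical costs in milliseconds, chosen only to exercise the formula.
print(break_even_iterations(t_init=5.0, t_base=2.0, t_persist=1.5))  # 10
print(break_even_iterations(t_init=0.0, t_base=2.0, t_persist=1.5))  # 1
print(break_even_iterations(t_init=5.0, t_base=1.0, t_persist=1.25)) # None
```

Under this assumed model, an "immediate payoff" would correspond to a break-even count of 1, which happens when the per-iteration saving already covers the setup cost.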

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same separation of initialization and execution could be applied to other MPI collectives that involve irregular or repeated communication.
  • Application developers working with repeated Alltoallv patterns could adopt the persistent interface with minimal code changes to gain the reported speedups.
  • The break-even model may help predict performance on systems where metadata costs differ from those measured on Dane.
  • Similar persistence techniques might extend to non-MPI communication libraries that support remote memory access primitives.

Load-bearing premise

The measured runtime reductions are caused by the persistence mechanism and metadata reuse rather than unstated details of the MPI library, compiler effects, or specific characteristics of the Dane supercomputer hardware.

What would settle it

Running identical benchmarks on a different supercomputer with a different MPI implementation and finding no comparable runtime reduction for the persistent variant would indicate the improvements are not due to persistence itself.

Figures

Figures reproduced from arXiv: 2604.05099 by Evelyn Namugwanya.

Figure 2. Varying message size per process (bytes) on 8 nodes, …
Figure 3. hugetrace-00020 (47,997,626 nnz), performance com…
Figure 4. 8 Nodes 8 Processes, analyzing the suite sparse …

Original abstract

Collective communication operations such as MPI_Alltoallv are central to many HPC applications, particularly those with irregular message sizes. We design, implement, and evaluate persistent MPI RMA variants of Alltoallv based on fence and lock synchronization, separating a one-time initialization phase from per-iteration execution to enable reuse of communication metadata and window state across repeated epochs. Our benchmarks, run on LLNL's Dane supercomputer, show that the fence-persistent variant consistently outperforms the non-persistent baseline for large message sizes, achieving up to 44% reduction in runtime and improving scalability with increasing process counts; at 448 processes the runtime decreases from 2.49s to 1.54s (38% faster). We further evaluate the algorithms under irregular sparse communication patterns and compare fence- and lock-based designs, including hierarchical extensions. Message-size sweeps and a break-even model demonstrate that persistence provides immediate payoff for messages greater than or equal to 32,768 bytes, while smaller messages show limited benefit due to metadata amortization costs. These results indicate that persistent RMA Alltoallv is a practical approach for workloads with large messages, where removing repeated metadata processing leaves runtime dominated by data movement, as evidenced by the increasing time savings with message size, and they clarify the trade-offs between fence and lock synchronization on modern HPC systems.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 0 minor

Summary. The manuscript designs and evaluates persistent MPI RMA variants of Alltoallv using fence and lock synchronization. By separating a one-time initialization phase from repeated execution epochs, the approach enables reuse of communication metadata and window state. Benchmarks on LLNL's Dane supercomputer show the fence-persistent variant outperforming the non-persistent baseline for large messages (up to 44% runtime reduction), with improved scalability at higher process counts (e.g., runtime dropping from 2.49s to 1.54s at 448 processes, a 38% reduction). The work also examines irregular sparse patterns, compares fence vs. lock designs, and presents a break-even model indicating benefits for messages of 32,768 bytes (32 KiB) or larger.

Significance. If the performance gains are confirmed to stem from metadata reuse, the results would offer practical guidance for optimizing repeated collective operations in HPC applications with irregular communication. The message-size sweeps and synchronization trade-off analysis could inform MPI library implementations and encourage broader use of persistent RMA primitives where data movement dominates runtime.

major comments (1)
  1. Abstract and benchmarks section: The central performance claims (e.g., 38% reduction from 2.49s to 1.54s at 448 processes and up to 44% overall) are reported without error bars, the number of repetitions per measurement, warm-up details, or statistical tests for significance. This absence directly affects verification that the observed speedups are attributable to persistence and metadata reuse rather than measurement variability or unstated factors.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address the major comment below and will incorporate the suggested improvements in the revised version.

Point-by-point responses
  1. Referee: Abstract and benchmarks section: The central performance claims (e.g., 38% reduction from 2.49s to 1.54s at 448 processes and up to 44% overall) are reported without error bars, the number of repetitions per measurement, warm-up details, or statistical tests for significance. This absence directly affects verification that the observed speedups are attributable to persistence and metadata reuse rather than measurement variability or unstated factors.

    Authors: We agree that the manuscript would benefit from additional details on the benchmark methodology to strengthen the claims. In the revised version, we will expand the abstract and benchmarks section to include the number of repetitions per measurement, warm-up iterations, error bars on the reported timings (e.g., standard deviation), and a description of any statistical tests applied to assess significance. This will better demonstrate that the performance improvements arise from the persistent RMA approach rather than variability.

    Revision: yes
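
The reporting practice discussed in this exchange (warm-up iterations, a stated number of repetitions, and a spread measure) can be sketched as a small timing harness. This is an illustration only: the function `benchmark`, its defaults, and the stand-in workload are hypothetical and say nothing about how the paper's measurements were actually taken. In an MPI setting one would typically also synchronize ranks before timing and reduce the timings across ranks, which this single-process sketch omits.

```python
import statistics
import time


def benchmark(fn, warmup=3, repeats=10):
    """Time fn() after discarding warm-up runs; report mean and spread."""
    if repeats < 2:
        raise ValueError("need at least 2 repeats to compute a standard deviation")
    for _ in range(warmup):          # warm-up runs are executed but not recorded
        fn()
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return {
        "repeats": repeats,
        "mean_s": statistics.mean(samples),
        "stdev_s": statistics.stdev(samples),  # sample standard deviation
        "min_s": min(samples),
        "max_s": max(samples),
    }


calls = []                           # records every invocation of the workload


def workload():
    calls.append(1)
    return sum(range(10_000))        # stand-in for one collective call


stats = benchmark(workload, warmup=3, repeats=10)
print(f"{stats['mean_s']:.6f} s ± {stats['stdev_s']:.6f} s over {stats['repeats']} runs")
```

Reporting the mean together with the standard deviation and the repetition count is what allows a reader to judge whether a gap such as 2.49s versus 1.54s exceeds run-to-run variability.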

Circularity Check

0 steps flagged

No significant circularity

This is a pure empirical benchmarking paper that reports direct runtime measurements, message-size sweeps, and scalability comparisons between persistent and non-persistent MPI RMA Alltoallv implementations on LLNL's Dane supercomputer. It contains no equations, derivations, fitted parameters presented as predictions, or self-citation chains that reduce the central claims to inputs by construction. The reported speedups (e.g., 44% for large messages) and break-even observations are straightforward empirical outcomes from the described experiments, not logical reductions of the paper's own premises.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The work is implementation-driven and relies on the correctness of the underlying MPI standard and hardware rather than new theoretical constructs.

axioms (1)
  • domain assumption: MPI RMA fence and lock synchronization primitives behave according to the MPI standard specification.
    The persistent variants are built directly on these primitives.
