Analyzing Persistent Alltoallv RMA Implementations for High-Performance MPI Communication

Evelyn Namugwanya

arxiv: 2604.05099 · v1 · submitted 2026-04-06 · 💻 cs.DC

Pith reviewed 2026-05-10 18:33 UTC · model grok-4.3

classification 💻 cs.DC

keywords: persistent MPI · RMA Alltoallv · collective communication · HPC performance · fence synchronization · irregular communication · metadata reuse


The pith

Persistent RMA Alltoallv variants reduce runtime by up to 44% for large messages by reusing metadata and window state.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper designs persistent MPI RMA versions of Alltoallv that use fence or lock synchronization to separate a one-time initialization phase from repeated execution phases. This structure allows reuse of communication metadata and window state across iterations instead of rebuilding it each time. Benchmarks on LLNL's Dane supercomputer show the fence-persistent version consistently beats the standard non-persistent baseline for large messages, with speedups that grow as message size and process count increase. The work also tests irregular sparse patterns, compares fence versus lock designs, and supplies a break-even model showing when the approach becomes worthwhile. Such improvements matter for HPC applications that call Alltoallv repeatedly with uneven data sizes, where repeated setup costs can add up.

Core claim

Persistent MPI RMA Alltoallv implementations based on fence and lock synchronization separate initialization from per-iteration execution to enable reuse of communication metadata and window state. Benchmarks demonstrate that the fence-persistent variant outperforms the non-persistent baseline for large messages, achieving up to 44% runtime reduction and better scalability with process count, including a drop from 2.49s to 1.54s at 448 processes. Message-size sweeps and a break-even model establish that persistence provides immediate payoff for messages of 32,768 bytes or larger while smaller messages show limited benefit due to amortization costs.
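The two headline figures can be cross-checked with simple arithmetic. The sketch below uses only the 2.49s and 1.54s runtimes and the 44% figure quoted in the text; the function names are illustrative and do not come from the paper. It also separates a runtime reduction from a speedup factor, two quantities that are easy to conflate.

```python
def runtime_reduction(baseline_s: float, persistent_s: float) -> float:
    """Fraction of the baseline runtime that was removed."""
    return (baseline_s - persistent_s) / baseline_s


def speedup_from_reduction(reduction: float) -> float:
    """Speedup factor implied by a fractional runtime reduction."""
    return 1.0 / (1.0 - reduction)


# Reported runtimes at 448 processes (from the text).
reduction_448 = runtime_reduction(2.49, 1.54)
speedup_448 = 2.49 / 1.54

# Best case quoted in the text: a 44% runtime reduction.
speedup_best = speedup_from_reduction(0.44)

print(f"448 processes: {reduction_448:.1%} less runtime, {speedup_448:.2f}x speedup")
print(f"44% reduction corresponds to a {speedup_best:.2f}x speedup")
```

Read this way, the 448-process result is a runtime reduction of about 38%, which is a speedup of roughly 1.6x, and the 44% best case is a speedup of roughly 1.8x.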

What carries the argument

A persistent RMA Alltoallv that splits one-time setup from the repeated execution phase, so that metadata and window state are reused across epochs.
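The split between setup and execution can be illustrated with a toy model. The sketch below is a single-process, pure-Python stand-in, not MPI code and not the paper's implementation: the class name `ToyPersistentAlltoallv` and its fields are hypothetical, plain lists stand in for RMA windows, and fence or lock synchronization is not modelled at all. It only shows the idea that counts, displacements, and receive buffers are built once and then reused on every call.

```python
from itertools import accumulate


def displacements(counts):
    """Exclusive prefix sum: start offset of each peer's block in a packed buffer."""
    return [0, *accumulate(counts[:-1])]


class ToyPersistentAlltoallv:
    """Hypothetical single-process model of a persistent Alltoallv plan."""

    def __init__(self, send_counts):
        # One-time initialization: send_counts[src][dst] is the number of
        # items rank `src` sends to rank `dst`.
        n = len(send_counts)
        self.nprocs = n
        self.send_counts = send_counts
        self.recv_counts = [[send_counts[src][dst] for src in range(n)]
                            for dst in range(n)]
        self.send_displs = [displacements(row) for row in send_counts]
        self.recv_displs = [displacements(row) for row in self.recv_counts]
        # Receive buffers play the role of the reused window state.
        self.recv_bufs = [[None] * sum(row) for row in self.recv_counts]
        self.setups = 1  # counts how often the metadata was built

    def execute(self, send_bufs):
        # Per-iteration phase: only data moves; counts and offsets are reused.
        for src in range(self.nprocs):
            for dst in range(self.nprocs):
                count = self.send_counts[src][dst]
                s = self.send_displs[src][dst]
                r = self.recv_displs[dst][src]
                self.recv_bufs[dst][r:r + count] = send_bufs[src][s:s + count]
        return self.recv_bufs


plan = ToyPersistentAlltoallv([[1, 2], [3, 1]])  # set up once
for step in range(3):                            # execute many times
    send = [[f"r0s{step}"] * 3, [f"r1s{step}"] * 4]
    received = plan.execute(send)
print(received, plan.setups)
```

A non-persistent variant of the same toy would rebuild the counts, displacements, and buffers inside every call; that repeated work is what the persistent design is meant to amortize.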

If this is right

Persistence delivers immediate benefits once messages reach or exceed 32,768 bytes as runtime becomes dominated by data movement rather than setup.
Time savings increase with larger message sizes and higher process counts, improving overall scalability.
Fence-based synchronization shows stronger gains than lock-based in the tested scenarios, including under irregular sparse communication patterns.
Hierarchical extensions of the designs remain practical for workloads with large messages.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same separation of initialization and execution could be applied to other MPI collectives that involve irregular or repeated communication.
Application developers working with repeated Alltoallv patterns could adopt the persistent interface with minimal code changes to gain the reported speedups.
The break-even model may help predict performance on systems where metadata costs differ from those measured on Dane.
Similar persistence techniques might extend to non-MPI communication libraries that support remote memory access primitives.
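The break-even model mentioned above can be made concrete. The expression quoted from the paper is N_breakeven = ceil(tau_persistent / (T_nonpersistent - T_persistent)). The sketch below assumes that tau_persistent is the one-time initialization cost and that the two T values are per-iteration runtimes; that reading is an interpretation, and all timings in the example are hypothetical rather than measurements from Dane.

```python
import math
from typing import Optional


def break_even_iterations(tau_persistent: float,
                          t_nonpersistent: float,
                          t_persistent: float) -> Optional[int]:
    """Iterations needed before the one-time setup cost is paid back.

    Returns None when the persistent variant is not faster per iteration,
    in which case the setup cost is never recovered.
    """
    saving = t_nonpersistent - t_persistent
    if saving <= 0:
        return None
    return math.ceil(tau_persistent / saving)


# Hypothetical timings in microseconds (not measurements from the paper).
print(break_even_iterations(500_000, 10_000, 8_000))  # large setup, small saving
print(break_even_iterations(1_000, 130, 100))         # small setup
print(break_even_iterations(1_000, 100, 130))         # persistent is slower
```

On a system where metadata costs differ from Dane, only the three inputs change; the larger the per-iteration saving relative to the setup cost, the sooner persistence pays off.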

Load-bearing premise

The measured runtime reductions are caused by the persistence mechanism and metadata reuse rather than unstated details of the MPI library, compiler effects, or specific characteristics of the Dane supercomputer hardware.

What would settle it

Running identical benchmarks on a different supercomputer with a different MPI implementation and finding no comparable runtime reduction for the persistent variant would indicate the improvements are not due to persistence itself.

Figures

Figures reproduced from arXiv: 2604.05099 by Evelyn Namugwanya.

**Figure 3.** hugetrace-00020 (47,997,626 nnz), performance com

**Figure 4.** 8 Nodes 8 Processes, analyzing the suite sparse

Original abstract

Collective communication operations such as MPI_Alltoallv are central to many HPC applications, particularly those with irregular message sizes. We design, implement, and evaluate persistent MPI RMA variants of Alltoallv based on fence and lock synchronization, separating a one-time initialization phase from per-iteration execution to enable reuse of communication metadata and window state across repeated epochs. Our benchmarks, run on LLNL's Dane supercomputer, show that the fence-persistent variant consistently outperforms the non-persistent baseline for large message sizes, achieving up to a 44% reduction in runtime and improving scalability with increasing process counts; at 448 processes the runtime decreases from 2.49s to 1.54s (38% faster). We further evaluate the algorithms under irregular sparse communication patterns and compare fence- and lock-based designs, including hierarchical extensions. Message-size sweeps and a break-even model demonstrate that persistence provides immediate payoff for messages greater than or equal to 32,768 bytes, while smaller messages show limited benefit due to metadata amortization costs. These results indicate that persistent RMA Alltoallv is a practical approach for workloads with large messages, where removing repeated metadata processing leaves runtime dominated by data movement, as evidenced by the increasing time savings with message size, and they clarify the trade-offs between fence and lock synchronization on modern HPC systems.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

Persistent RMA Alltoallv shows real speedups for large messages on Dane, but the gains rest on timing data that lack the controls and statistics needed to isolate the cause.


The paper's core result is that their fence-based persistent Alltoallv variant beats the standard non-persistent version by up to 44% on large messages, with a clear example at 448 processes dropping from 2.49s to 1.54s. They achieve this by splitting initialization from the repeated execution phase so metadata and window state get reused across epochs. That separation is the practical move they test on irregular patterns and across message sizes and process counts on the real Dane machine at LLNL.

Referee Report

1 major / 0 minor

Summary. The manuscript designs and evaluates persistent MPI RMA variants of Alltoallv using fence and lock synchronization. By separating a one-time initialization phase from repeated execution epochs, the approach enables reuse of communication metadata and window state. Benchmarks on LLNL's Dane supercomputer show the fence-persistent variant outperforming the non-persistent baseline for large messages (up to 44% runtime reduction), with improved scalability at higher process counts (e.g., runtime dropping from 2.49s to 1.54s at 448 processes, a 38% reduction). The work also examines irregular sparse patterns, compares fence and lock designs, and presents a break-even model indicating benefits for messages of 32,768 bytes (32 KiB) or larger.

Significance. If the performance gains are confirmed to stem from metadata reuse, the results would offer practical guidance for optimizing repeated collective operations in HPC applications with irregular communication. The message-size sweeps and synchronization trade-off analysis could inform MPI library implementations and encourage broader use of persistent RMA primitives where data movement dominates runtime.

major comments (1)

Abstract and benchmarks section: The central performance claims (e.g., 38% reduction from 2.49s to 1.54s at 448 processes and up to 44% overall) are reported without error bars, the number of repetitions per measurement, warm-up details, or statistical tests for significance. This absence directly affects verification that the observed speedups are attributable to persistence and metadata reuse rather than measurement variability or unstated factors.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address the major comment below and will incorporate the suggested improvements in the revised version.


Referee: Abstract and benchmarks section: The central performance claims (e.g., 38% reduction from 2.49s to 1.54s at 448 processes and up to 44% overall) are reported without error bars, the number of repetitions per measurement, warm-up details, or statistical tests for significance. This absence directly affects verification that the observed speedups are attributable to persistence and metadata reuse rather than measurement variability or unstated factors.

Authors: We agree that the manuscript would benefit from additional details on the benchmark methodology to strengthen the claims. In the revised version, we will expand the abstract and benchmarks section to include the number of repetitions per measurement, warm-up iterations, error bars on the reported timings (e.g., standard deviation), and a description of any statistical tests applied to assess significance. This will better demonstrate that the performance improvements arise from the persistent RMA approach rather than variability.

Revision: yes
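The reporting items named in this exchange (repetitions, warm-up iterations, standard deviation, a significance test) can be sketched in a few lines. Everything below is an assumption made for illustration: the timings are hypothetical, the helper names are invented, and Welch's t statistic is one possible choice of test, not one that the paper or the rebuttal names. Turning the statistic into a p-value would additionally need a t distribution, which is left out here.

```python
import math
import statistics


def summarize(samples, warmup=0):
    """Drop warm-up runs, then report repetitions, mean, and sample std dev."""
    kept = samples[warmup:]
    return {
        "n": len(kept),
        "mean": statistics.mean(kept),
        "stdev": statistics.stdev(kept),
    }


def welch_t(a, b):
    """Welch's t statistic for two samples with possibly unequal variances."""
    var_a = statistics.variance(a)
    var_b = statistics.variance(b)
    return (statistics.mean(a) - statistics.mean(b)) / math.sqrt(
        var_a / len(a) + var_b / len(b)
    )


# Hypothetical timings in seconds; the first entry of each list is a warm-up run.
baseline = [3.10, 2.50, 2.48, 2.51, 2.47]
persistent = [2.20, 1.55, 1.53, 1.56, 1.52]

for name, runs in (("baseline", baseline), ("persistent", persistent)):
    s = summarize(runs, warmup=1)
    print(f"{name}: n={s['n']} mean={s['mean']:.3f}s stdev={s['stdev']:.3f}s")

print(f"Welch t = {welch_t(baseline[1:], persistent[1:]):.1f}")
```

Reporting the per-variant mean and standard deviation next to each headline runtime would let a reader judge whether a gap such as 2.49s versus 1.54s is large relative to run-to-run variation.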

Circularity Check

0 steps flagged

No significant circularity


This is a pure empirical benchmarking paper that reports direct runtime measurements, message-size sweeps, and scalability comparisons between persistent and non-persistent MPI RMA Alltoallv implementations on LLNL's Dane supercomputer. It contains no equations, derivations, fitted parameters presented as predictions, or self-citation chains that reduce the central claims to inputs by construction. The reported speedups (e.g., 44% for large messages) and break-even observations are straightforward empirical outcomes from the described experiments, not logical reductions of the paper's own premises.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The work is implementation-driven and relies on the correctness of the underlying MPI standard and hardware rather than new theoretical constructs.

axioms (1)

Domain assumption: MPI RMA fence and lock synchronization primitives behave according to the MPI standard specification. The persistent variants are built directly on these primitives.



Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel

Relation between the paper passage and the cited Recognition theorem: unclear.

Paper passage: separating a one-time initialization phase from per-iteration execution to enable reuse of communication metadata and window state across repeated epochs... N_breakeven = ceil(tau_persistent / (T_nonpersistent - T_persistent))

IndisputableMonolith/Foundation/AlexanderDuality.lean · alexander_duality_circle_linking

Relation between the paper passage and the cited Recognition theorem: unclear.

Paper passage: Fence persistent... Lock persistent... Fence_Hierarchy_persistent... on LLNL Dane with MVAPICH2

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

