Analyzing Persistent Alltoallv RMA Implementations for High-Performance MPI Communication
Pith reviewed 2026-05-10 18:33 UTC · model grok-4.3
The pith
Persistent RMA Alltoallv variants reduce runtime up to 44% for large messages by reusing metadata and window state.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Persistent MPI RMA Alltoallv implementations based on fence and lock synchronization separate initialization from per-iteration execution to enable reuse of communication metadata and window state. Benchmarks demonstrate that the fence-persistent variant outperforms the non-persistent baseline for large messages, achieving up to 44% runtime reduction and better scalability with process count, including a drop from 2.49s to 1.54s at 448 processes. Message-size sweeps and a break-even model establish that persistence provides immediate payoff for messages of 32,768 bytes or larger while smaller messages show limited benefit due to amortization costs.
What carries the argument
Persistent RMA Alltoallv, which splits one-time setup from repeated execution phases to reuse metadata and window state across epochs.
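The setup/execute split can be illustrated with a toy, single-process model. This is a sketch under stated assumptions, not the paper's implementation: there is no MPI, no window, and no fence or lock synchronization here, and the class name `PersistentAlltoallvPlan` and helper `displacements` are hypothetical. It only shows which work (counts and displacements) moves into the one-time phase and what remains per iteration (copying data).

```python
from itertools import accumulate


def displacements(counts):
    """Exclusive prefix sum: offset of each peer's block in a packed buffer."""
    return [0] + list(accumulate(counts))[:-1]


class PersistentAlltoallvPlan:
    """Toy single-process model of a persistent Alltoallv (not real MPI)."""

    def __init__(self, send_counts):
        # One-time setup: send_counts[src][dst] = elements src sends to dst.
        self.nprocs = len(send_counts)
        self.send_counts = send_counts
        self.send_displs = [displacements(row) for row in send_counts]
        # Receive counts are the transpose of the send counts.
        self.recv_counts = [
            [send_counts[src][dst] for src in range(self.nprocs)]
            for dst in range(self.nprocs)
        ]
        self.recv_displs = [displacements(row) for row in self.recv_counts]

    def execute(self, send_bufs):
        # Per-iteration phase: only data moves; all offsets are reused.
        recv_bufs = [[None] * sum(row) for row in self.recv_counts]
        for src in range(self.nprocs):
            for dst in range(self.nprocs):
                n = self.send_counts[src][dst]
                s = self.send_displs[src][dst]
                r = self.recv_displs[dst][src]
                recv_bufs[dst][r:r + n] = send_bufs[src][s:s + n]
        return recv_bufs


# Hypothetical 2-rank example with irregular counts.
plan = PersistentAlltoallvPlan([[1, 2], [3, 1]])
received = plan.execute([["a0", "b0", "b1"], ["c0", "c1", "c2", "d0"]])
print(received)
```

Calling `plan.execute(...)` repeatedly with new buffers reuses the same offsets, which is the reuse the persistent variants rely on; in a real RMA version the window and its synchronization state would presumably be part of the plan as well.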
If this is right
- Persistence delivers immediate benefits once messages reach or exceed 32,768 bytes as runtime becomes dominated by data movement rather than setup.
- Time savings increase with larger message sizes and higher process counts, improving overall scalability.
- Fence-based synchronization shows stronger gains than lock-based in the tested scenarios, including under irregular sparse communication patterns.
- Hierarchical extensions of the designs remain practical for workloads with large messages.
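The break-even reasoning behind the first bullet can be sketched numerically. The formula below is the one quoted from the paper further down this page, N_breakeven = ceil(tau_persistent / (T_nonpersistent - T_persistent)); reading tau_persistent as the one-time setup cost and the T values as per-iteration runtimes is an assumption, and the timings used are hypothetical integers in microseconds, not measurements from Dane.

```python
import math


def breakeven_iterations(tau_persistent, t_nonpersistent, t_persistent):
    """Iterations needed before the one-time setup cost is repaid.

    Returns None when the persistent variant is not faster per iteration,
    because the setup cost is then never recovered.
    """
    saving = t_nonpersistent - t_persistent
    if saving <= 0:
        return None
    return math.ceil(tau_persistent / saving)


# Hypothetical timings in microseconds (assumed values for illustration).
print(breakeven_iterations(500, 10, 8))    # large per-iteration saving
print(breakeven_iterations(501, 10, 8))    # rounds up to the next iteration
print(breakeven_iterations(500, 10, 10))   # no saving: never breaks even
```

A small per-iteration saving pushes the break-even point far out, which is consistent with the page's remark that small messages show limited benefit.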
Where Pith is reading between the lines
- The same separation of initialization and execution could be applied to other MPI collectives that involve irregular or repeated communication.
- Application developers working with repeated Alltoallv patterns could adopt the persistent interface with minimal code changes to gain the reported speedups.
- The break-even model may help predict performance on systems where metadata costs differ from those measured on Dane.
- Similar persistence techniques might extend to non-MPI communication libraries that support remote memory access primitives.
Load-bearing premise
The measured runtime reductions are caused by the persistence mechanism and metadata reuse rather than unstated details of the MPI library, compiler effects, or specific characteristics of the Dane supercomputer hardware.
What would settle it
Running identical benchmarks on a different supercomputer with a different MPI implementation and finding no comparable runtime reduction for the persistent variant would indicate the improvements are not due to persistence itself.
Original abstract
Collective communication operations such as MPI_Alltoallv are central to many HPC applications, particularly those with irregular message sizes. We design, implement, and evaluate persistent MPI RMA variants of Alltoallv based on fence and lock synchronization, separating a one-time initialization phase from per-iteration execution to enable reuse of communication metadata and window state across repeated epochs. Our benchmarks, run on LLNL's Dane supercomputer, show that the fence-persistent variant consistently outperforms the non-persistent baseline for large message sizes, achieving up to 44% reduction in runtime and improving scalability with increasing process counts; at 448 processes the runtime decreases from 2.49s to 1.54s (38% faster). We further evaluate the algorithms under irregular sparse communication patterns and compare fence- and lock-based designs, including hierarchical extensions. Message-size sweeps and a break-even model demonstrate that persistence provides immediate payoff for messages greater than or equal to 32,768 bytes, while smaller messages show limited benefit due to metadata amortization costs. These results indicate that persistent RMA Alltoallv is a practical approach for workloads with large messages, where removing repeated metadata processing leaves runtime dominated by data movement, as evidenced by the increasing time savings with message size, and they clarify the trade-offs between fence and lock synchronization on modern HPC systems.
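The 448-process figure can be checked with a few lines of arithmetic. The sketch below uses only the two runtimes stated in the abstract; the function names are illustrative. It shows that "38%" is the reduction in runtime, which corresponds to a speedup factor of about 1.62.

```python
def runtime_reduction(t_before, t_after):
    """Fraction of the original runtime that was removed."""
    return (t_before - t_after) / t_before


def speedup(t_before, t_after):
    """How many times faster the new runtime is."""
    return t_before / t_after


# Runtimes at 448 processes, as reported in the abstract (seconds).
reduction = runtime_reduction(2.49, 1.54)
factor = speedup(2.49, 1.54)
print(f"runtime reduction: {reduction:.1%}")
print(f"speedup factor: {factor:.2f}x")
```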
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript designs and evaluates persistent MPI RMA variants of Alltoallv using fence and lock synchronization. By separating a one-time initialization phase from repeated execution epochs, the approach enables reuse of communication metadata and window state. Benchmarks on LLNL's Dane supercomputer show the fence-persistent variant outperforming the non-persistent baseline for large messages (up to 44% runtime reduction), with improved scalability at higher process counts (e.g., runtime dropping from 2.49s to 1.54s at 448 processes, a 38% improvement). The work also examines irregular sparse patterns, compares fence vs. lock designs, and presents a break-even model indicating benefits for messages >= 32 KB.
Significance. If the performance gains are confirmed to stem from metadata reuse, the results would offer practical guidance for optimizing repeated collective operations in HPC applications with irregular communication. The message-size sweeps and synchronization trade-off analysis could inform MPI library implementations and encourage broader use of persistent RMA primitives where data movement dominates runtime.
major comments (1)
- Abstract and benchmarks section: The central performance claims (e.g., 38% reduction from 2.49s to 1.54s at 448 processes and up to 44% overall) are reported without error bars, the number of repetitions per measurement, warm-up details, or statistical tests for significance. This absence directly affects verification that the observed speedups are attributable to persistence and metadata reuse rather than measurement variability or unstated factors.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. We address the major comment below and will incorporate the suggested improvements in the revised version.
- Referee: Abstract and benchmarks section: The central performance claims (e.g., 38% reduction from 2.49s to 1.54s at 448 processes and up to 44% overall) are reported without error bars, the number of repetitions per measurement, warm-up details, or statistical tests for significance. This absence directly affects verification that the observed speedups are attributable to persistence and metadata reuse rather than measurement variability or unstated factors.
  Authors: We agree that the manuscript would benefit from additional details on the benchmark methodology to strengthen the claims. In the revised version, we will expand the abstract and benchmarks section to include the number of repetitions per measurement, warm-up iterations, error bars on the reported timings (e.g., standard deviation), and a description of any statistical tests applied to assess significance. This will better demonstrate that the performance improvements arise from the persistent RMA approach rather than variability.
  Revision: yes
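The methodology details the referee asks for (warm-up, repetitions, spread) can be sketched as a small timing harness. This is a generic illustration, not the authors' benchmark code: the function name `benchmark` and its default parameters are assumptions, and a real MPI benchmark would typically time with `MPI_Wtime`, synchronize ranks before each repetition, and reduce timings across ranks.

```python
import statistics
import time


def benchmark(fn, warmup=3, repetitions=10):
    """Time fn() after warm-up runs and summarize the recorded samples."""
    for _ in range(warmup):
        fn()  # warm-up runs are executed but not recorded
    samples = []
    for _ in range(repetitions):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return {
        "repetitions": repetitions,
        "mean": statistics.mean(samples),
        # Sample standard deviation needs at least two measurements.
        "stdev": statistics.stdev(samples) if repetitions > 1 else 0.0,
        "min": min(samples),
        "max": max(samples),
    }


# Stand-in workload; a real run would call the Alltoallv variant under test.
result = benchmark(lambda: sum(range(10_000)))
print(result)
```

Reporting the mean together with the standard deviation (or min and max) for both the persistent and non-persistent variants is what would let a reader judge whether a 38% gap exceeds run-to-run noise.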
Circularity Check
No significant circularity
This is a pure empirical benchmarking paper that reports direct runtime measurements, message-size sweeps, and scalability comparisons between persistent and non-persistent MPI RMA Alltoallv implementations on LLNL's Dane supercomputer. It contains no equations, derivations, fitted parameters presented as predictions, or self-citation chains that reduce the central claims to inputs by construction. The reported speedups (e.g., 44% for large messages) and break-even observations are straightforward empirical outcomes from the described experiments, not logical reductions of the paper's own premises.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: MPI RMA fence and lock synchronization primitives behave according to the MPI standard specification.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean, washburn_uniqueness_aczel (unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: separating a one-time initialization phase from per-iteration execution to enable reuse of communication metadata and window state across repeated epochs... N_breakeven = ceil(tau_persistent / (T_nonpersistent - T_persistent))
- IndisputableMonolith/Foundation/AlexanderDuality.lean, alexander_duality_circle_linking (unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: Fence persistent... Lock persistent... Fence_Hierarchy_persistent... on LLNL Dane with MVAPICH2
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.