User Experiences with MPI RMA and ULFM in a Resilient Key-Value Store Implementation
Pith reviewed 2026-05-10 04:04 UTC · model grok-4.3
The pith
Implementing a resilient key-value store with MPI RMA and ULFM proved difficult due to missing failure mitigation features.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Our implementation of a resilient key-value store using passive target MPI RMA functions for one-sided operations and ULFM for failure mitigation proved difficult due to several unimplemented ULFM functionalities for RMA. Even if those features existed, the programming task could be simplified. The store maintains redundant copies of key-value pairs across processes to allow recovery after failures on surviving processes.
What carries the argument
A resilient key-value store implemented with passive target MPI RMA for one-sided read/write operations and ULFM for detecting process aborts and enabling recovery.
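The redundancy idea can be pictured with a small in-memory model. This is only a sketch and does not use MPI: each "process" is a plain dictionary, and the class name `ReplicatedStore`, the `copies` parameter, and the ring-style placement in `owners` are assumptions made for illustration, not details from the paper.

```python
class ReplicatedStore:
    """Toy model of a redundant key-value store: one dict per process, no MPI."""

    def __init__(self, num_procs, copies=2):
        self.num_procs = num_procs
        self.copies = copies  # hypothetical redundancy level
        self.memory = {rank: {} for rank in range(num_procs)}
        self.alive = set(range(num_procs))

    def owners(self, key):
        # Assumed placement: a home rank derived from the key,
        # followed by the next ranks in ring order.
        home = sum(key.encode()) % self.num_procs
        return [(home + i) % self.num_procs for i in range(self.copies)]

    def put(self, key, value):
        # Write the pair to every surviving owner.
        for rank in self.owners(key):
            if rank in self.alive:
                self.memory[rank][key] = value

    def get(self, key):
        # Read from the first surviving owner that still holds the pair.
        for rank in self.owners(key):
            if rank in self.alive and key in self.memory[rank]:
                return self.memory[rank][key]
        raise KeyError(key)

    def fail(self, rank):
        # A process abort loses that process's main memory.
        self.alive.discard(rank)
        self.memory[rank].clear()


store = ReplicatedStore(num_procs=4, copies=2)
store.put("task-17", "partial result")
primary = store.owners("task-17")[0]
store.fail(primary)           # the first owner aborts
print(store.get("task-17"))   # served from the remaining copy
```

In the model, a read succeeds as long as at least one owner of the key survives. In the store the paper describes, the `put` and `get` steps would be one-sided RMA accesses to remote memory rather than local dictionary operations.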
If this is right
- The resilient store can be integrated into MPI applications to protect data from node losses.
- Recovery is possible by continuing on surviving processes with intact data copies.
- Current ULFM implementations require workarounds for RMA-based resilience.
- Additional ULFM features for RMA would reduce programming complexity.
Where Pith is reading between the lines
- API designers for MPI extensions may need to prioritize RMA-specific failure handling to encourage resilient applications.
- Similar challenges could arise when adding resilience to other one-sided communication models.
- Future MPI standards might benefit from built-in support for redundant data structures.
Load-bearing premise
The main source of implementation difficulties is the absence of certain ULFM functionalities for RMA, not a fundamental incompatibility between the store design and MPI's RMA or failure model.
What would settle it
Successfully implementing the missing ULFM RMA features and rewriting the store to use them, resulting in substantially simpler code, would confirm the claim; if complexity remains high, it would indicate other factors are at play.
Original abstract
As hardware failures such as node losses become increasingly common, MPI programmers may want to save vulnerable data in a resilient store. While third-party storage solutions such as Redis or the Hazelcast IMap exist, a tailored, MPI-based store may be easier to integrate and can be optimized for particular application needs. This paper considers the implementation of such a store, which is intended as a component in a resilient task-based runtime system written in MPI. The store holds redundant data copies as key-value pairs in the main memories of multiple processes. Since store access operations, such as reads and writes, are naturally one-sided, we implemented the store with passive target MPI RMA functions. Process aborts are detected with the user-level failure mitigation (ULFM) extension of Open MPI. After failures, the program recovers on the surviving processes and continues with the intact data copies. Our implementation proved difficult, since several proposed ULFM functionalities for RMA have not yet been implemented. Even assuming their existence, we think that the programming task could be simplified. This paper describes our experiences, lists functionalities that we missed, and explains a workaround that we adopted in our implementation.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript is an experience report on implementing a resilient key-value store as a component for a task-based MPI runtime. The store maintains redundant key-value pairs in main memory across processes using passive-target MPI RMA for one-sided access; process failures are detected via the ULFM extension of Open MPI, after which the application recovers on surviving processes using the intact data copies. The central claim is that the implementation proved difficult because several proposed ULFM functionalities for RMA remain unimplemented in current libraries, and that the programming task would remain non-trivial even if those features existed; the authors list the missing capabilities and describe the workaround they adopted.
Significance. If the reported experiences hold, the paper offers concrete, practitioner-level insight into gaps in the ULFM specification for RMA-based resilience, which is relevant to the growing number of MPI applications that must tolerate node failures. The workaround description and enumeration of missing features provide immediate guidance for similar implementations and could help prioritize future ULFM development. The report is grounded in a specific use case (redundant KV store for a runtime), which strengthens its utility over purely abstract discussions.
major comments (1)
- The central claim that implementation difficulty stems primarily from missing ULFM RMA features (and would persist even if those features existed) rests on the authors' direct experience, yet the manuscript supplies only a high-level list of missing functionalities and a summary of the workaround without code fragments, API call sequences, or concrete failure scenarios (e.g., in the sections describing the store implementation and recovery). This limits independent assessment of whether the difficulties are due to feature gaps versus design choices in the KV store or RMA usage pattern.
minor comments (1)
- The abstract and introduction could more explicitly separate the description of the KV-store design from the enumeration of ULFM limitations to improve readability for readers primarily interested in the missing features.
Simulated Author's Rebuttal
We thank the referee for the careful review and constructive suggestion to strengthen the presentation of our implementation experiences. We have revised the manuscript to address the concern by adding concrete details.
Point-by-point responses
Referee: The central claim that implementation difficulty stems primarily from missing ULFM RMA features (and would persist even if those features existed) rests on the authors' direct experience, yet the manuscript supplies only a high-level list of missing functionalities and a summary of the workaround without code fragments, API call sequences, or concrete failure scenarios (e.g., in the sections describing the store implementation and recovery). This limits independent assessment of whether the difficulties are due to feature gaps versus design choices in the KV store or RMA usage pattern.
Authors: We agree that the original manuscript presented the missing ULFM RMA functionalities and the adopted workaround at a summary level, which could make it harder for readers to independently evaluate the source of the difficulties. In the revised version, we have expanded the store implementation and recovery sections with specific code fragments (e.g., MPI_Win_create and MPI_Get/MPI_Put sequences under passive target epochs), detailed API call sequences showing how failures interrupt epochs without notification, and a concrete failure scenario (a single process abort during a write operation, detected via MPIX_Comm_failure_ack, with subsequent recovery using redundant copies). These additions illustrate that the core issues stemmed from unimplemented ULFM features for RMA, such as failure propagation within active epochs and automatic handling of orphaned windows, rather than from idiosyncratic choices in the KV store design or RMA pattern. The experiences remain grounded in our direct implementation attempts for the task-based runtime. The details also support our view that even the proposed features would leave the task non-trivial, which motivated the workaround of explicit replication and manual recovery logic.
Revision: yes
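The failure scenario named in the response (one process aborting during a write, followed by explicit detection and reliance on the surviving copies) can be mimicked with a toy model. This is a sketch under stated assumptions, not the authors' code and not MPI or ULFM semantics: `Target`, `silent_put`, `write_with_check`, and the `detect_failures` callback are hypothetical names, and the silent loss of a write to an aborted process is a modelling choice standing in for "failures interrupt epochs without notification".

```python
class Target:
    """Toy stand-in for a process that exposes memory; not an MPI window."""

    def __init__(self, rank):
        self.rank = rank
        self.alive = True
        self.data = {}


def silent_put(target, key, value):
    # Modelling assumption: a write to an aborted process returns
    # normally and reports no error to the writer.
    if target.alive:
        target.data[key] = value


def write_with_check(targets, key, value, detect_failures):
    """Write to all replicas, then ask explicitly which ranks have failed."""
    for target in targets:
        silent_put(target, key, value)
    failed = detect_failures(targets)  # separate, explicit detection step
    intact = [
        t.rank
        for t in targets
        if t.rank not in failed and t.data.get(key) == value
    ]
    return intact, failed


replicas = [Target(0), Target(1), Target(2)]
replicas[1].alive = False   # rank 1 aborts before the write reaches it
replicas[1].data.clear()    # its memory contents are lost

intact, failed = write_with_check(
    replicas, "k", "v",
    detect_failures=lambda ts: {t.rank for t in ts if not t.alive},
)
print("intact copies on ranks:", intact)
print("failed ranks:", failed)
```

The point of the model is that the writer learns about the lost copy only through the separate detection step, and must then decide itself whether enough intact copies remain. That manual bookkeeping is the kind of recovery logic the response refers to.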
Circularity Check
No significant circularity; descriptive experience report
full rationale
This paper is a purely descriptive experience report on implementing a resilient key-value store with MPI RMA and ULFM. It contains no equations, derivations, fitted parameters, predictions, or mathematical claims. The central narrative (implementation difficulties due to missing ULFM RMA features, plus a workaround) rests on concrete implementation choices and the current state of the ULFM specification, with no reduction to self-definition, self-citation chains, or renamed inputs. No load-bearing steps exist that could be circular by the enumerated patterns.