
arxiv: 2604.18098 · v1 · submitted 2026-04-20 · 💻 cs.DC

User Experiences with MPI RMA and ULFM in a Resilient Key-Value Store Implementation

Pith reviewed 2026-05-10 04:04 UTC · model grok-4.3

classification 💻 cs.DC
keywords MPI · RMA · ULFM · resilient computing · key-value store · fault tolerance · one-sided communication · failure recovery

The pith

Implementing a resilient key-value store with MPI RMA and ULFM proved difficult because several proposed ULFM features for RMA are not yet implemented.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

MPI programmers facing hardware failures can store vulnerable data in a tailored resilient key-value store instead of relying on external solutions like Redis. The paper explores building such a store as a component for a task-based runtime, using one-sided MPI RMA for accesses and ULFM to detect and recover from process aborts while preserving redundant data copies. The implementation encountered significant difficulties because proposed ULFM features for handling RMA during failures are not yet implemented in Open MPI. The authors describe their experiences, identify specific missing functionalities, and explain a workaround they adopted to make the store operational.

Core claim

Our implementation of a resilient key-value store using passive target MPI RMA functions for one-sided operations and ULFM for failure mitigation proved difficult due to several unimplemented ULFM functionalities for RMA. Even if those features existed, the programming task could be simplified. The store maintains redundant copies of key-value pairs across processes to allow recovery after failures on surviving processes.

What carries the argument

A resilient key-value store implemented with passive target MPI RMA for one-sided read/write operations and ULFM for detecting process aborts and enabling recovery.
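The redundancy idea can be illustrated without MPI. The following is a minimal Python sketch, not the authors' implementation. It assumes a hypothetical placement rule in which a key's home process is chosen by a CRC32 hash and the further copies go to the next ranks in ring order. The names `replica_ranks`, `ToyStore`, `put`, `get`, and `fail` are illustrative assumptions; the per-rank dictionaries only stand in for process memory that a real store would expose through MPI windows.

```python
import zlib


def replica_ranks(key, num_procs, copies):
    """Ranks holding `key`: a hashed home rank plus its ring successors (assumed rule)."""
    home = zlib.crc32(key.encode("utf-8")) % num_procs
    return [(home + i) % num_procs for i in range(copies)]


class ToyStore:
    """In-process stand-in for a replicated store; each dict models one process's memory."""

    def __init__(self, num_procs, copies):
        if not 1 <= copies <= num_procs:
            raise ValueError("copies must be between 1 and num_procs")
        self.num_procs = num_procs
        self.copies = copies
        self.memory = {rank: {} for rank in range(num_procs)}
        self.failed = set()

    def fail(self, rank):
        """Simulate a process abort: the rank's memory is lost."""
        self.failed.add(rank)
        self.memory[rank] = {}

    def put(self, key, value):
        """Write the pair to every replica holder that is still alive."""
        for rank in replica_ranks(key, self.num_procs, self.copies):
            if rank not in self.failed:
                self.memory[rank][key] = value

    def get(self, key):
        """Read from the first surviving replica holder."""
        for rank in replica_ranks(key, self.num_procs, self.copies):
            if rank not in self.failed and key in self.memory[rank]:
                return self.memory[rank][key]
        raise KeyError(key)


store = ToyStore(num_procs=4, copies=2)
store.put("task-17", b"partial result")
home = replica_ranks("task-17", 4, 2)[0]
store.fail(home)              # the home process aborts
print(store.get("task-17"))   # still readable from the second copy
```

In this toy model a read succeeds as long as at least one replica holder of the key is alive, which is the property the store relies on for recovery.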

If this is right

  • The resilient store can be integrated into MPI applications to protect data from node losses.
  • Recovery is possible by continuing on surviving processes with intact data copies.
  • Current ULFM implementations require workarounds for RMA-based resilience.
  • Additional ULFM features for RMA would reduce programming complexity.
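Continuing on the survivors usually also means restoring the replication degree. The sketch below is one possible way to do that, given as a plain Python simulation under stated assumptions rather than as the paper's recovery procedure: the helper names `holders` and `recover` are hypothetical, the ring placement rule is assumed, and all surviving copies of a key are assumed to agree (an interrupted write could violate this). Dropping the failed ranks is loosely analogous to shrinking a communicator with ULFM's `MPIX_Comm_shrink`.

```python
import zlib


def holders(key, ranks, copies):
    """Replica holders of `key` among the sorted live `ranks` (assumed ring placement)."""
    start = zlib.crc32(key.encode("utf-8")) % len(ranks)
    return [ranks[(start + i) % len(ranks)] for i in range(min(copies, len(ranks)))]


def recover(memory, failed, copies):
    """Rebuild the store on the surviving ranks and restore up to `copies` replicas per key.

    `memory` maps rank -> dict of key-value pairs; data on ranks in `failed` is ignored.
    Keys without any surviving copy are absent from the result.
    """
    survivors = sorted(rank for rank in memory if rank not in failed)
    if not survivors:
        raise RuntimeError("no surviving process")
    # Collect every pair that still exists on at least one survivor.
    intact = {}
    for rank in survivors:
        intact.update(memory[rank])
    # Redistribute the intact pairs over the survivors.
    rebuilt = {rank: {} for rank in survivors}
    for key, value in intact.items():
        for rank in holders(key, survivors, copies):
            rebuilt[rank][key] = value
    return rebuilt


before = {0: {"a": 1, "b": 2}, 1: {"a": 1, "c": 3}, 2: {"b": 2, "c": 3}}
after = recover(before, failed={1}, copies=2)
for key in ("a", "b", "c"):
    print(key, sorted(rank for rank, data in after.items() if key in data))
```

With three processes and two copies per key, losing rank 1 leaves one intact copy of each key, and the rebuilt store again holds two copies of each on ranks 0 and 2.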

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • API designers for MPI extensions may need to prioritize RMA-specific failure handling to encourage resilient applications.
  • Similar challenges could arise when adding resilience to other one-sided communication models.
  • Future MPI standards might benefit from built-in support for redundant data structures.

Load-bearing premise

The main source of implementation difficulties is the absence of certain ULFM functionalities for RMA, not a fundamental incompatibility between the store design and MPI's RMA or failure model.

What would settle it

Successfully implementing the missing ULFM RMA features and rewriting the store to use them, resulting in substantially simpler code, would confirm the claim; if complexity remains high, it would indicate other factors are at play.

Figures

Figures reproduced from arXiv: 2604.18098 by Claudia Fohry, Rainer Fink.

Figure 1: store_put latencies of Hazelcast, with 6 replicas and 40 processes per node. (Plot omitted; x-axis: number of nodes, 1 to 64; y-axis: time [s]; legend: Developed KV-Store.)
Original abstract

As hardware failures such as node losses become increasingly common, MPI programmers may want to save vulnerable data in a resilient store. While third-party storage solutions such as Redis or the Hazelcast IMap exist, a tailored, MPI-based store may be easier to integrate and can be optimized for particular application needs. This paper considers the implementation of such a store, which is intended as a component in a resilient task-based runtime system written in MPI. The store holds redundant data copies as key-value pairs in the main memories of multiple processes. Since store access operations, such as reads and writes, are naturally one-sided, we implemented the store with passive target MPI RMA functions. Process aborts are detected with the user-level failure mitigation (ULFM) extension of Open MPI. After failures, the program recovers on the surviving processes and continues with the intact data copies. Our implementation proved difficult, since several proposed ULFM functionalities for RMA have not yet been implemented. Even assuming their existence, we think that the programming task could be simplified. This paper describes our experiences, lists functionalities that we missed, and explains a workaround that we adopted in our implementation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The manuscript is an experience report on implementing a resilient key-value store as a component for a task-based MPI runtime. The store maintains redundant key-value pairs in main memory across processes using passive-target MPI RMA for one-sided access; process failures are detected via the ULFM extension of Open MPI, after which the application recovers on surviving processes using the intact data copies. The central claim is that the implementation proved difficult because several proposed ULFM functionalities for RMA remain unimplemented in current libraries, and that the programming task would remain non-trivial even if those features existed; the authors list the missing capabilities and describe the workaround they adopted.

Significance. If the reported experiences hold, the paper offers concrete, practitioner-level insight into gaps in the ULFM specification for RMA-based resilience, which is relevant to the growing number of MPI applications that must tolerate node failures. The workaround description and enumeration of missing features provide immediate guidance for similar implementations and could help prioritize future ULFM development. The report is grounded in a specific use case (redundant KV store for a runtime), which strengthens its utility over purely abstract discussions.

major comments (1)
  1. The central claim that implementation difficulty stems primarily from missing ULFM RMA features (and would persist even if those features existed) rests on the authors' direct experience, yet the manuscript supplies only a high-level list of missing functionalities and a summary of the workaround without code fragments, API call sequences, or concrete failure scenarios (e.g., in the sections describing the store implementation and recovery). This limits independent assessment of whether the difficulties are due to feature gaps versus design choices in the KV store or RMA usage pattern.
minor comments (1)
  1. The abstract and introduction could more explicitly separate the description of the KV-store design from the enumeration of ULFM limitations to improve readability for readers primarily interested in the missing features.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the careful review and constructive suggestion to strengthen the presentation of our implementation experiences. We have revised the manuscript to address the concern by adding concrete details.

Point-by-point responses
  1. Referee: The central claim that implementation difficulty stems primarily from missing ULFM RMA features (and would persist even if those features existed) rests on the authors' direct experience, yet the manuscript supplies only a high-level list of missing functionalities and a summary of the workaround without code fragments, API call sequences, or concrete failure scenarios (e.g., in the sections describing the store implementation and recovery). This limits independent assessment of whether the difficulties are due to feature gaps versus design choices in the KV store or RMA usage pattern.

    Authors: We agree that the original manuscript presented the missing ULFM RMA functionalities and the adopted workaround at a summary level, which could make it harder for readers to independently evaluate the source of the difficulties. In the revised version, we have expanded the store implementation and recovery sections with specific code fragments (e.g., MPI_Win_create and MPI_Get/MPI_Put sequences under passive target epochs), detailed API call sequences showing how failures interrupt epochs without notification, and a concrete failure scenario (a single process abort during a write operation, detected via MPI_Comm_failure_ack and followed by recovery using redundant copies). These additions illustrate that the core issues stemmed from unimplemented ULFM features for RMA (such as failure propagation within active epochs and automatic handling of orphaned windows) rather than from idiosyncratic choices in the KV store design or RMA pattern. The experiences remain grounded in our direct implementation attempts for the task-based runtime, and the details also support our view that even the proposed features would leave the task non-trivial, motivating the workaround of explicit replication and manual recovery logic.

    Revision: yes

Circularity Check

0 steps flagged

No significant circularity; descriptive experience report

full rationale

This paper is a purely descriptive experience report on implementing a resilient key-value store with MPI RMA and ULFM. It contains no equations, derivations, fitted parameters, predictions, or mathematical claims. The central narrative (implementation difficulties due to missing ULFM RMA features, plus a workaround) rests on concrete implementation choices and the current state of the ULFM specification, with no reduction to self-definition, self-citation chains, or renamed inputs. No load-bearing steps exist that could be circular by the enumerated patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is an implementation experience paper with no mathematical content. It relies on existing MPI and ULFM standards rather than introducing new parameters, axioms, or entities.


