User Experiences with MPI RMA and ULFM in a Resilient Key-Value Store Implementation

Claudia Fohry; Rainer Fink

arxiv: 2604.18098 · v1 · submitted 2026-04-20 · 💻 cs.DC

Pith reviewed 2026-05-10 04:04 UTC · model grok-4.3

classification 💻 cs.DC

keywords: MPI, RMA, ULFM, resilient computing, key-value store, fault tolerance, one-sided communication, failure recovery

The pith

Implementing a resilient key-value store with MPI RMA and ULFM proved difficult due to missing failure mitigation features.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

MPI programmers facing hardware failures can store vulnerable data in a tailored resilient key-value store instead of relying on external solutions like Redis. The paper explores building such a store as a component for a task-based runtime, using one-sided MPI RMA for accesses and ULFM to detect and recover from process aborts while preserving redundant data copies. The implementation encountered significant difficulties because proposed ULFM features for handling RMA during failures are not yet implemented in Open MPI. The authors describe their experiences, identify specific missing functionalities, and explain a workaround they adopted to make the store operational.

Core claim

Our implementation of a resilient key-value store using passive target MPI RMA functions for one-sided operations and ULFM for failure mitigation proved difficult due to several unimplemented ULFM functionalities for RMA. Even if those features existed, the programming task could be simplified. The store maintains redundant copies of key-value pairs across processes to allow recovery after failures on surviving processes.
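The redundancy idea can be illustrated with a minimal single-process Python sketch. This is not the authors' MPI implementation: the class `ReplicatedStore`, the helper `replica_ranks`, and the placement rule (a CRC32 hash picks a home rank, further copies go to the following ranks in ring order) are assumptions made for illustration, and plain dictionaries stand in for the memory that each process would expose through an RMA window.

```python
import zlib


def replica_ranks(key, num_procs, replicas):
    """Return the ranks that hold copies of `key` (hypothetical placement rule).

    Assumption: the home rank is a CRC32 hash of the key modulo the number of
    processes, and further copies go to the next ranks in ring order.
    """
    home = zlib.crc32(key.encode("utf-8")) % num_procs
    return [(home + i) % num_procs for i in range(replicas)]


class ReplicatedStore:
    """Toy model: one dict per process stands in for that process's memory."""

    def __init__(self, num_procs, replicas):
        if not 1 <= replicas <= num_procs:
            raise ValueError("need 1 <= replicas <= num_procs")
        self.num_procs = num_procs
        self.replicas = replicas
        self.memory = {rank: {} for rank in range(num_procs)}
        self.failed = set()

    def put(self, key, value):
        # Write one copy to every replica holder that is still alive.
        for rank in replica_ranks(key, self.num_procs, self.replicas):
            if rank not in self.failed:
                self.memory[rank][key] = value

    def get(self, key):
        # Read from the first surviving replica holder.
        for rank in replica_ranks(key, self.num_procs, self.replicas):
            if rank not in self.failed:
                return self.memory[rank][key]
        raise KeyError(f"all copies of {key!r} were lost")

    def fail(self, rank):
        # A process abort loses everything held in that process's memory.
        self.failed.add(rank)
        self.memory[rank].clear()
```

In this model, a value written with `put` stays readable through `get` as long as at least one of its replica holders has not been passed to `fail`. Once every holder has failed, `get` raises `KeyError`, which corresponds to unrecoverable data loss.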

What carries the argument

A resilient key-value store implemented with passive target MPI RMA for one-sided read/write operations and ULFM for detecting process aborts and enabling recovery.

If this is right

The resilient store can be integrated into MPI applications to protect data from node losses.
Recovery is possible by continuing on surviving processes with intact data copies.
Current ULFM implementations require workarounds for RMA-based resilience.
Additional ULFM features for RMA would reduce programming complexity.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

API designers for MPI extensions may need to prioritize RMA-specific failure handling to encourage resilient applications.
Similar challenges could arise when adding resilience to other one-sided communication models.
Future MPI standards might benefit from built-in support for redundant data structures.

Load-bearing premise

The main source of implementation difficulties is the absence of certain ULFM functionalities for RMA, not a fundamental incompatibility between the store design and MPI's RMA or failure model.

What would settle it

Successfully implementing the missing ULFM RMA features and rewriting the store to use them, resulting in substantially simpler code, would confirm the claim; if complexity remains high, it would indicate other factors are at play.

Figures

Figures reproduced from arXiv: 2604.18098 by Claudia Fohry, Rainer Fink.

**Figure 1.** store_put latencies of Hazelcast, with 6 replicas and 40 processes per node. [Plot not reproduced: x-axis "Number of nodes" (1, 4, 8, 16, 32, 64); y-axis "Time [s]"; legend entry "Developed KV-Store".]

Original abstract

As hardware failures such as node losses become increasingly common, MPI programmers may want to save vulnerable data in a resilient store. While third-party storage solutions such as Redis or the Hazelcast IMap exist, a tailored, MPI-based store may be easier to integrate and can be optimized for particular application needs. This paper considers the implementation of such a store, which is intended as a component in a resilient task-based runtime system written in MPI. The store holds redundant data copies as key-value pairs in the main memories of multiple processes. Since store access operations, such as reads and writes, are naturally one-sided, we implemented the store with passive target MPI RMA functions. Process aborts are detected with the user-level failure mitigation (ULFM) extension of Open MPI. After failures, the program recovers on the surviving processes and continues with the intact data copies. Our implementation proved difficult, since several proposed ULFM functionalities for RMA have not yet been implemented. Even assuming their existence, we think that the programming task could be simplified. This paper describes our experiences, lists functionalities that we missed, and explains a workaround that we adopted in our implementation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

This is a narrow but useful experience report on the concrete difficulties of wiring ULFM into an MPI RMA-based resilient key-value store, including a list of still-missing features and the workaround the authors chose.

The main point is that the authors built a redundant key-value store inside MPI for a task-based runtime, using passive-target RMA for one-sided reads and writes and ULFM to detect aborts and recover on surviving processes. They found the job harder than expected because several ULFM capabilities proposed for RMA are not yet implemented, and they list those gaps plus the workaround they adopted. Even with the missing pieces in place, they suspect the programming would still be non-trivial.

Referee Report

1 major / 1 minor

Summary. The manuscript is an experience report on implementing a resilient key-value store as a component for a task-based MPI runtime. The store maintains redundant key-value pairs in main memory across processes using passive-target MPI RMA for one-sided access; process failures are detected via the ULFM extension of Open MPI, after which the application recovers on surviving processes using the intact data copies. The central claim is that the implementation proved difficult because several proposed ULFM functionalities for RMA remain unimplemented in current libraries, and that the programming task would remain non-trivial even if those features existed; the authors list the missing capabilities and describe the workaround they adopted.

Significance. If the reported experiences hold, the paper offers concrete, practitioner-level insight into gaps in the ULFM specification for RMA-based resilience, which is relevant to the growing number of MPI applications that must tolerate node failures. The workaround description and enumeration of missing features provide immediate guidance for similar implementations and could help prioritize future ULFM development. The report is grounded in a specific use case (redundant KV store for a runtime), which strengthens its utility over purely abstract discussions.

major comments (1)

The central claim that implementation difficulty stems primarily from missing ULFM RMA features (and would persist even if those features existed) rests on the authors' direct experience, yet the manuscript supplies only a high-level list of missing functionalities and a summary of the workaround without code fragments, API call sequences, or concrete failure scenarios (e.g., in the sections describing the store implementation and recovery). This limits independent assessment of whether the difficulties are due to feature gaps versus design choices in the KV store or RMA usage pattern.

minor comments (1)

The abstract and introduction could more explicitly separate the description of the KV-store design from the enumeration of ULFM limitations to improve readability for readers primarily interested in the missing features.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the careful review and constructive suggestion to strengthen the presentation of our implementation experiences. We have revised the manuscript to address the concern by adding concrete details.

Referee: The central claim that implementation difficulty stems primarily from missing ULFM RMA features (and would persist even if those features existed) rests on the authors' direct experience, yet the manuscript supplies only a high-level list of missing functionalities and a summary of the workaround without code fragments, API call sequences, or concrete failure scenarios (e.g., in the sections describing the store implementation and recovery). This limits independent assessment of whether the difficulties are due to feature gaps versus design choices in the KV store or RMA usage pattern.

Authors: We agree that the original manuscript presented the missing ULFM RMA functionalities and the adopted workaround at a summary level, which could make it harder for readers to independently evaluate the source of the difficulties. In the revised version, we have expanded the store implementation and recovery sections with specific code fragments (e.g., MPI_Win_create and MPI_Get/MPI_Put sequences under passive target epochs), detailed API call sequences showing how failures interrupt epochs without notification, and a concrete failure scenario (a single process abort during a write operation, detected via MPI_Comm_failure_ack and subsequent recovery using redundant copies). These additions illustrate that the core issues stemmed from unimplemented ULFM features for RMA, such as failure propagation within active epochs and automatic handling of orphaned windows, rather than from idiosyncratic choices in the KV store design or RMA pattern. The experiences remain grounded in our direct implementation attempts for the task-based runtime, and the details also support our view that even the proposed features would leave the task non-trivial, motivating the workaround of explicit replication and manual recovery logic.

Revision: yes
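The failure scenario named in this response (a process abort that a writer runs into during a write, followed by a read from a redundant copy) can be modelled in a few lines of Python. This is a hedged sketch, not code from the paper: `SimulatedProcess`, `ProcessFailedError`, `resilient_put`, and `resilient_get` are hypothetical names, and the sketch assumes that an access to a dead process reports an error, which according to the response is exactly the notification that is missing for RMA epochs today.

```python
class ProcessFailedError(Exception):
    """Stands in for the error an access to a dead process would report."""

    def __init__(self, rank):
        super().__init__(f"process {rank} has failed")
        self.rank = rank


class SimulatedProcess:
    """Toy target process; `data` models the memory it exposes remotely."""

    def __init__(self, rank):
        self.rank = rank
        self.alive = True
        self.data = {}

    def remote_put(self, key, value):
        if not self.alive:
            raise ProcessFailedError(self.rank)
        self.data[key] = value

    def remote_get(self, key):
        if not self.alive:
            raise ProcessFailedError(self.rank)
        return self.data[key]


def resilient_put(targets, key, value, known_failed):
    """Write `value` to every target, skipping targets that turn out to be dead.

    Returns the number of copies written. Newly detected failures are added to
    `known_failed` so that later operations do not contact those ranks again.
    """
    copies = 0
    for proc in targets:
        if proc.rank in known_failed:
            continue
        try:
            proc.remote_put(key, value)
            copies += 1
        except ProcessFailedError as err:
            known_failed.add(err.rank)
    if copies == 0:
        raise RuntimeError(f"no surviving replica for {key!r}")
    return copies


def resilient_get(targets, key, known_failed):
    """Read from the first target that is still alive."""
    for proc in targets:
        if proc.rank in known_failed:
            continue
        try:
            return proc.remote_get(key)
        except ProcessFailedError as err:
            known_failed.add(err.rank)
    raise RuntimeError(f"no surviving replica for {key!r}")
```

Setting `alive = False` on one `SimulatedProcess` before calling `resilient_put` plays the role of the abort: the writer records the failed rank, finishes the write on the remaining targets, and a later `resilient_get` is served from a surviving copy.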

Circularity Check

0 steps flagged

No significant circularity; descriptive experience report

This paper is a purely descriptive experience report on implementing a resilient key-value store with MPI RMA and ULFM. It contains no equations, derivations, fitted parameters, predictions, or mathematical claims. The central narrative (implementation difficulties due to missing ULFM RMA features, plus a workaround) rests on concrete implementation choices and the current state of the ULFM specification, with no reduction to self-definition, self-citation chains, or renamed inputs. No load-bearing steps exist that could be circular by the enumerated patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is an implementation experience paper with no mathematical content. It relies on existing MPI and ULFM standards rather than introducing new parameters, axioms, or entities.
