arxiv: 2604.27430 · v1 · submitted 2026-04-30 · 💻 cs.DC

Towards the Democratization and Standardization of Dynamic Resources with MPI Spawning

Pith reviewed 2026-05-07 07:56 UTC · model grok-4.3

classification 💻 cs.DC
keywords dynamic resource management · MPI spawning · HPC · malleability · DMR framework · Proteo engine · MPDATA

The pith

A unified API lets HPC applications manage dynamic resources with MPI spawning without direct interaction with the DMR system.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops a unified dynamic resource management API that lets a wide range of HPC applications adapt their resources at runtime. The API connects to the DMR framework through an upgraded DMRlib that incorporates the Proteo reconfiguration engine, allowing different strategies for changing resources across various resource managers. The approach avoids respawning all processes and does not require full support from every resource management system (RMS). The authors demonstrate the result with a malleable version of the MPDATA solver and report improved runtime performance and reduced coding effort. The work aims to standardize dynamic resource handling so that more production applications can use it without custom low-level changes.

Core claim

The central claim is that an enhanced modular DMR framework, combined with an upgraded DMRlib containing the Proteo reconfiguration engine, supplies a standard API that lets applications request and receive resource changes through MPI spawning. This design supports multiple reconfiguration methods without respawning every process and without depending on full RMS capabilities, as shown by the creation of a malleable MPDATA implementation that improves performance while reducing coding effort.
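
The difference between respawning every process and reusing the existing ones can be made concrete with a small process-count model. The sketch below is illustrative Python, not code from the paper: the names `SpawnPlan`, `baseline_plan`, and `merge_plan` are hypothetical, and it assumes that "baseline" means launching a full new set of processes while "merge" means spawning only the missing ones and joining them to the running set, which is how the two terms are commonly used in MPI malleability work. The paper's own definitions may differ in detail.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SpawnPlan:
    """Hypothetical summary of one reconfiguration (not a DMR/Proteo type)."""
    spawned: int     # new processes that must be launched
    reused: int      # existing processes that keep running
    terminated: int  # existing processes that exit


def _check(current: int, target: int) -> None:
    if current < 1 or target < 1:
        raise ValueError("process counts must be positive")


def baseline_plan(current: int, target: int) -> SpawnPlan:
    """Assumed 'baseline': launch `target` new processes, retire all old ones."""
    _check(current, target)
    return SpawnPlan(spawned=target, reused=0, terminated=current)


def merge_plan(current: int, target: int) -> SpawnPlan:
    """Assumed 'merge': keep existing processes, spawn or retire only the difference."""
    _check(current, target)
    if target >= current:
        return SpawnPlan(spawned=target - current, reused=current, terminated=0)
    return SpawnPlan(spawned=0, reused=target, terminated=current - target)
```

Under these assumptions, growing from 4 to 8 processes launches 8 processes with the baseline plan and 4 with the merge plan, while shrinking with the merge plan launches none.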

What carries the argument

The unified dynamic resource management API, which applications call to request resource adjustments and which is implemented by the Proteo reconfiguration engine inside the DMR framework to carry out the changes across different managers.
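
One way to picture an application that stays independent of the resource manager is an interface with interchangeable backends. The following Python sketch is a toy model under stated assumptions: `ResourceBackend`, `StaticBackend`, `ScheduledBackend`, and `run_malleable` are hypothetical names invented for illustration and are not the DMR, DMRlib, or Proteo API. The resize step only updates a counter where a real engine would spawn or retire MPI processes and redistribute data.

```python
from typing import Dict, List, Protocol


class ResourceBackend(Protocol):
    """Hypothetical interface that any resource manager adapter would implement."""

    def target_size(self, current: int, step: int) -> int:
        """Return the process count the job should have at this step."""
        ...


class StaticBackend:
    """Toy backend that never changes the allocation."""

    def target_size(self, current: int, step: int) -> int:
        return current


class ScheduledBackend:
    """Toy backend that resizes at fixed steps (stand-in for a real RMS decision)."""

    def __init__(self, schedule: Dict[int, int]) -> None:
        self.schedule = schedule

    def target_size(self, current: int, step: int) -> int:
        return self.schedule.get(step, current)


def run_malleable(backend: ResourceBackend, steps: int, initial: int) -> List[int]:
    """Run an iterative loop and record the process count used at each step."""
    size = initial
    history: List[int] = []
    for step in range(steps):
        wanted = backend.target_size(size, step)
        if wanted != size:
            # Reconfiguration point: placeholder for spawn/shrink and data redistribution.
            size = wanted
        history.append(size)
    return history
```

Swapping `StaticBackend` for `ScheduledBackend` changes the allocation history without touching `run_malleable`, which is the property the unified API is meant to provide.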

If this is right

  • Applications can add dynamic resource management with only standard API calls rather than manager-specific code.
  • Different underlying resource managers can be substituted without rewriting the application logic.
  • Malleable codes such as MPDATA can improve resource efficiency by adjusting allocation during execution.
  • The modular structure makes it easier to extend the framework to new reconfiguration strategies.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same API pattern could be applied to other scientific solvers to achieve comparable reductions in adaptation effort.
  • Standardization through MPI spawning may encourage hybrid cloud-HPC workloads to adopt dynamic resources more readily.
  • Future extensions could measure how the framework behaves when resource requests cross administrative domains.

Load-bearing premise

That the Proteo engine and unified API can supply effective reconfiguration across different resource managers without requiring full process respawning and that the resulting system delivers measurable benefits for real production codes such as MPDATA.

What would settle it

Running the malleable MPDATA version on a cluster with a resource manager outside the current DMR support set and checking whether resource changes occur without respawning all processes while still showing the reported performance and productivity gains.

Figures

Figures reproduced from arXiv: 2604.27430 by Antonio J. Peña, Iker Martín-Alvarez, José I. Aliaga, Krzystof Rojek, Maribel Castillo, Sergio Iserte.

Figure 1. Dynamic Management of Resources Software Stack. In this work, we have extended DMR with an interoperability interface for the new version of DMRlib (and it has been done for Dynamic Processes with PSets (DPP) [4] and other ongoing work for other dynamic resources frameworks). Besides, DMRlib has been expanded with a pure-MPI backend able to operate without OmpSs, which was a strong dependency in the previou…
Figure 2. Workload Completion Times. Comparing the execution time between the baseline and merge versions, there is no significant difference in performance. There are two related reasons for this conclusion: First, each job performs at most one reconfiguration in its lifetime, and the difference between the methods becomes more noticeable when there are more reconfigurations [13]. Second, the way in which malleabil…
Abstract

This paper presents an efficient tool for managing dynamic resources in production high-performance computing (HPC) settings, focusing on flexibility, adaptability, and user-friendliness. We introduce a unified dynamic resource management application programming interface (API) that supports a wide range of HPC applications, allowing seamless integration without direct interaction with Dynamic Management of Resources (DMR). The DMR framework, evolved from the DMRlib structure, now supports various dynamic resource managers and includes the Proteo reconfiguration engine to enhance malleability strategies. This integration addresses previous limitations by allowing diverse reconfiguration methods without respawning all processes or lacking RMS support. The paper also showcases the solution's performance and coding productivity with the MPDATA (Multidimensional Positive Definite Advection Transport Algorithm) application. Key contributions include an enhanced modular DMR framework supporting different reconfiguration managers, upgraded DMRlib with the Proteo reconfiguration engine, offering extensive reconfiguration strategies, and a malleable version of the MPDATA solver.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces a unified dynamic resource management API for HPC applications that uses MPI spawning to enable seamless integration with dynamic resource managers without direct interaction with the DMR framework. It evolves the prior DMRlib into a modular DMR framework that incorporates the Proteo reconfiguration engine, supporting diverse reconfiguration strategies across multiple resource managers while avoiding full process respawning. The approach is demonstrated via a malleable implementation of the MPDATA solver, with asserted gains in performance and coding productivity.

Significance. If the unified API and Proteo engine deliver the claimed generality and malleability benefits, the work could help standardize dynamic resource management in HPC, lowering barriers for developers to create adaptive applications and improving utilization across schedulers. The modular design and new reconfiguration engine constitute concrete engineering contributions toward more flexible resource handling.

major comments (2)
  1. [§4] §4 (MPDATA evaluation): The central claim that the unified API and Proteo engine support 'a wide range of HPC applications' and 'various dynamic resource managers' rests on a single MPDATA case. No porting effort, reconfiguration success rates, or avoidance of respawning are shown for any second application or alternative RMS (e.g., Slurm versus another scheduler), leaving the generality and 'seamless integration without direct interaction with DMR' assertions untested.
  2. [Abstract and §3] Abstract and §3 (design and claims): Performance and productivity benefits are asserted for MPDATA, yet the manuscript supplies no quantitative metrics, baselines, error bars, or implementation details of the unified API calls. This prevents assessment of whether the claimed efficiency and user-friendliness improvements are realized.
minor comments (2)
  1. [Abstract] The acronym RMS is used without expansion in the abstract; a parenthetical definition on first use would improve readability.
  2. Consider adding a table that enumerates the reconfiguration strategies implemented in the Proteo engine and which RMS each has been tested against.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback on our manuscript. We appreciate the opportunity to clarify and strengthen our claims regarding the generality and benefits of the unified API and Proteo engine. Below, we provide point-by-point responses to the major comments and indicate the revisions we will incorporate in the updated version of the paper.

point-by-point responses
  1. Referee: [§4] §4 (MPDATA evaluation): The central claim that the unified API and Proteo engine support 'a wide range of HPC applications' and 'various dynamic resource managers' rests on a single MPDATA case. No porting effort, reconfiguration success rates, or avoidance of respawning are shown for any second application or alternative RMS (e.g., Slurm versus another scheduler), leaving the generality and 'seamless integration without direct interaction with DMR' assertions untested.

    Authors: We acknowledge that demonstrating the framework's generality with only a single application (MPDATA) and one resource manager is insufficient to fully support the broad claims made in the abstract and introduction. MPDATA serves as a representative example of a complex, production HPC application involving dynamic workloads, but we agree that additional evidence is needed. In the revised manuscript, we will include results from applying the framework to a second application, such as a different scientific solver, detailing the porting effort, reconfiguration success rates, and performance without full process respawning. We will also test and report on integration with an alternative resource management system (e.g., Slurm) to validate seamless integration without direct DMR interaction. This will provide concrete support for the asserted wide applicability. revision: yes

  2. Referee: [Abstract and §3] Abstract and §3 (design and claims): Performance and productivity benefits are asserted for MPDATA, yet the manuscript supplies no quantitative metrics, baselines, error bars, or implementation details of the unified API calls. This prevents assessment of whether the claimed efficiency and user-friendliness improvements are realized.

    Authors: We agree that the current version of the manuscript lacks sufficient quantitative details to allow readers to evaluate the performance and productivity claims. While the full paper includes some performance results for the malleable MPDATA implementation, we recognize that baselines, error bars from repeated experiments, and specific implementation details (such as code examples of the unified API) are missing or inadequately presented. In the revised manuscript, we will expand the abstract and §3 to include quantitative metrics (e.g., percentage improvements in execution time and resource utilization), comparison baselines (non-malleable MPDATA), error bars, and detailed code snippets or pseudocode illustrating the API calls. This will enable a proper assessment of the efficiency and user-friendliness benefits. revision: yes
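
    The kind of reporting promised here can be sketched in a few lines. The Python below is an assumption-laden illustration, not the authors' evaluation code: `summarize` and `percent_improvement` are hypothetical helper names, and any timings passed to them are placeholders rather than measurements from the paper. It computes a mean with a sample standard deviation (usable as an error bar) and the relative reduction in mean execution time against a non-malleable baseline.

```python
import statistics
from typing import Sequence, Tuple


def summarize(times: Sequence[float]) -> Tuple[float, float]:
    """Return (mean, sample standard deviation) for repeated run times."""
    if not times:
        raise ValueError("need at least one measurement")
    mean = statistics.mean(times)
    # A single run has no spread to report; use 0.0 instead of raising.
    stdev = statistics.stdev(times) if len(times) > 1 else 0.0
    return mean, stdev


def percent_improvement(baseline: Sequence[float], candidate: Sequence[float]) -> float:
    """Relative reduction in mean time of `candidate` versus `baseline`, in percent."""
    base_mean, _ = summarize(baseline)
    cand_mean, _ = summarize(candidate)
    return 100.0 * (base_mean - cand_mean) / base_mean
```

    A positive result means the candidate (for example, a malleable run) finished faster on average than the baseline; a negative result means it was slower.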

Circularity Check

0 steps flagged

No significant circularity; implementation paper with no derivations or fitted predictions

full rationale

The manuscript describes a software framework (evolved DMR with Proteo engine and unified API) and demonstrates it via a single MPDATA implementation plus performance/productivity measurements. No equations, first-principles derivations, fitted parameters, or quantitative predictions exist that could reduce to their own inputs by construction. Self-citation to prior DMRlib work is acknowledged but does not bear the load of any claimed result; the current contributions (API design, reconfiguration strategies, MPDATA port) are presented as new engineering artifacts supported by direct implementation evidence. The generality claim is empirically limited to one application, but this is a question of evidence strength, not circularity. The paper is therefore self-contained against external benchmarks with no load-bearing step that collapses to a self-definition or self-citation chain.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 1 invented entity

This is a systems and software-engineering paper with no mathematical derivations. No free parameters, standard axioms, or invented physical entities are introduced. The Proteo engine is a new software component whose behavior is described at the framework level.

invented entities (1)
  • Proteo reconfiguration engine · no independent evidence
    purpose: To provide extensive reconfiguration strategies within the upgraded DMRlib and support malleability without full process respawning
    Presented as a new module that enhances the DMR framework; no independent falsifiable evidence (such as predicted performance on external benchmarks) is supplied beyond the paper's own description.

Reference graph

Works this paper leans on

20 extracted references · 20 canonical work pages

  [1] Aliaga, J.I., Castillo, M., Iserte, S., Martín-Álvarez, I., Mayo, R.: A Survey on Malleability Solutions for High-Performance Distributed Computing. Applied Sciences 12(10), 5231 (Jan 2022). https://doi.org/10.3390/app12105231

  [2] Comprés, I., Mo-Hellenbrand, A., Gerndt, M., Bungartz, H.J.: Infrastructure and API Extensions for Elastic Execution of MPI Applications. In: Proceedings of the 23rd EuroMPI. pp. 82–97. EuroMPI 2016 (2016). https://doi.org/10.1145/2966884.2966917

  [3] Corbalan, J., Brochard, L.: Ear: Energy management framework for supercomputers. Barcelona Supercomputing Center (BSC) (2019)

  [4] Dutot, P., Fecht, J., Gaddameedi, K., Huber, D., Iserte, S., Minion, M., Schulz, M., Schreiber, M., Schüller, V., Peña, A.J., Richard, O.: Leveraging Dynamic Resource Management in HPC. In: ISC High Performance (Jun 2024)

  [5] Huang, C., Zheng, G., Kumar, S., Kalé, L.V.: Performance Evaluation of Adaptive MPI. In: Proceedings of ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming 2006 (March 2006)

  [6] Huber, D., Streubel, M., Comprés, I., Schulz, M., Schreiber, M., Pritchard, H.: Towards Dynamic Resource Management with MPI Sessions and PMIx. In: Proceedings of the 29th EuroMPI/USA. pp. 57–67 (2022). https://doi.org/10.1145/3555819.3555856

  [7] Iserte, S., Mayo, R., Quintana-Orti, E., Pena, A.: DMRlib: Easy-coding and Efficient Resource Management for Job Malleability. IEEE Transactions on Computers (2020). https://doi.org/10.1109/TC.2020.3022933

  [8] Iserte, S.: High-throughput Computation through Efficient Resource Management. Ph.D. thesis, Universitat Jaume I, Castelló de la Plana (nov 2018). https://doi.org/10.6035/14101.2018.176272

  [9] Iserte, S., Martínez, H., Barrachina, S., Castillo, M., Mayo, R., Peña, A.J.: Dynamic reconfiguration of noniterative scientific applications: A case study with HPG aligner. The International Journal of High Performance Computing Applications (sep 2018). https://doi.org/10.1177/1094342018802347

  [10] Iserte, S., Mayo, R., Quintana-Ortí, E.S., Beltran, V., Peña, A.J.: Efficient Scalable Computing through Flexible Applications and Adaptive Workloads. In: 46th International Conference on Parallel Processing Workshops. pp. 180–189. IEEE, Bristol (UK) (aug 2017). https://doi.org/10.1109/ICPPW.2017.36

  [11] Iserte, S., Rojek, K.: A study of the effect of process malleability in the energy efficiency on GPU-based clusters. The Journal of Supercomputing pp. 1–20 (oct 2019). https://doi.org/10.1007/s11227-019-03034-x

  [12] Lopez, V., Ramirez Miranda, G., Garcia-Gasulla, M.: TALP: A Lightweight Tool to Unveil Parallel Efficiency of Large-Scale Executions. In: Proceedings of PERMAVOST (2021). https://doi.org/10.1145/3452412.3462753

  [13] Martín-Álvarez, I., Aliaga, J.I., Castillo, M., Iserte, S.: Proteo: A framework for the generation and evaluation of malleable MPI applications. The Journal of Supercomputing (in 2nd revision)

  [14] Prabhakaran, S., Neumann, M., Rinke, S., Wolf, F., Gupta, A., Kale, L.V.: A Batch System with Efficient Adaptive Scheduling for Malleable and Evolving Applications. In: IEEE IPDPS. pp. 429–438 (May 2015)

  [15] Reina León, J.: Implementación distribuida maleable del método laplace (2024), https://openaccess.uoc.edu/handle/10609/149763, UOC

  [16] Rojek, K., Wyrzykowski, R.: Performance modeling of 3D MPDATA simulations on GPU cluster. The Journal of Supercomputing 73(2), 664–675 (2017). https://doi.org/10.1007/s11227-016-1774-z

  [17] Rojek, K., Wyrzykowski, R., Kuczynski, L.: Systematic adaptation of stencil-based 3D MPDATA to GPU architectures. Concurrency and Computation: Practice and Experience 29(9), e3970 (2017). https://doi.org/10.1002/cpe.3970

  [18] Sarood, O., Langer, A., Gupta, A., Kale, L.: Maximizing Throughput of Overprovisioned HPC Data Centers Under a Strict Power Budget. In: SC14: International Conference for High Performance Computing, Networking, Storage and Analysis. pp. 807–818. IEEE (nov 2014)

  [19] Sudarsan, R., Ribbens, C.J.: ReSHAPE: a Framework for Dynamic Resizing and Scheduling of Homogeneous Applications in a Parallel Environment. In: Proceedings of the International Conference on Parallel Processing (2007)

  [20] Tarraf, A., et al.: Malleability in Modern HPC Systems: Current Experiences, Challenges, and Future Opportunities. IEEE Transactions on Parallel and Distributed Systems (2024), (in-press)