pith. machine review for the scientific record.

arxiv: 2604.26824 · v1 · submitted 2026-04-29 · 💻 cs.DC · cs.SE

Recognition: unknown

A Test Taxonomy and Continuous Integration Ecosystem for Dynamic Resource Management in HPC

Pith reviewed 2026-05-07 11:39 UTC · model grok-4.3

classification 💻 cs.DC cs.SE
keywords HPC · dynamic resource management · MPI malleability · test taxonomy · continuous integration · fault detection · DMR framework · automated testing

The pith

A test taxonomy paired with a continuous integration ecosystem improves fault detection and maintenance for dynamic resource management frameworks in HPC.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper introduces a methodology for testing dynamic resource management and malleable MPI applications in high-performance computing. It pairs a taxonomy that classifies functional and non-functional tests at component-integration and system levels with an HPC-oriented continuous integration ecosystem running in a containerized virtual cluster. The approach is evaluated on the DMR framework as a case study. It addresses the problem of ad hoc, hard-to-reproduce experiments by providing structured, automated validation. A sympathetic reader would care because reliable testing supports HPC systems that must adapt to heterogeneous hardware, changing workloads, and energy limits.

Core claim

The authors establish that combining a taxonomy of tests for MPI malleable libraries with an HPC-oriented continuous integration ecosystem instantiated in a containerized virtual cluster, when applied to the DMR framework, improves early fault detection, simplifies maintenance under evolving dependencies, and transfers to other malleability solutions that expose analogous primitives for initialization, readiness checking, and reconfiguration.

What carries the argument

The test taxonomy that structures functional and non-functional tests at both component-integration and system levels, instantiated via a containerized virtual cluster for automated validation.
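
The two axes of the taxonomy (functional versus non-functional, component-integration versus system) can be pictured as a small data structure. The Python sketch below is illustrative only: the class names, the example test names, and the `group_by_cell` helper are hypothetical and are not taken from the paper or from DMR.

```python
from collections import defaultdict
from dataclasses import dataclass
from enum import Enum


class Kind(Enum):
    FUNCTIONAL = "functional"
    NON_FUNCTIONAL = "non-functional"


class Level(Enum):
    COMPONENT_INTEGRATION = "component-integration"
    SYSTEM = "system"


@dataclass(frozen=True)
class ClassifiedTest:
    """One test, tagged with its position on both taxonomy axes."""
    name: str
    kind: Kind
    level: Level


def group_by_cell(tests):
    """Group test names by their (kind, level) cell in the taxonomy."""
    cells = defaultdict(list)
    for test in tests:
        cells[(test.kind, test.level)].append(test.name)
    return dict(cells)


# Hypothetical test names, used only to populate the example.
EXAMPLE_TESTS = [
    ClassifiedTest("init_returns_valid_state", Kind.FUNCTIONAL, Level.COMPONENT_INTEGRATION),
    ClassifiedTest("expand_job_end_to_end", Kind.FUNCTIONAL, Level.SYSTEM),
    ClassifiedTest("reconfiguration_latency", Kind.NON_FUNCTIONAL, Level.SYSTEM),
]

if __name__ == "__main__":
    for (kind, level), names in group_by_cell(EXAMPLE_TESTS).items():
        print(f"{kind.value} / {level.value}: {', '.join(names)}")
```

Grouping by cell makes empty cells visible: in this toy data no non-functional test exists at the component-integration level, which is the kind of coverage gap a taxonomy is meant to expose.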

If this is right

  • The methodology improves early fault detection through automated validation of dynamic resource management frameworks.
  • Maintenance is simplified when testing suites must adapt to evolving software dependencies.
  • The approach transfers to other malleability solutions that provide similar primitives for initialization, readiness checking, and reconfiguration.
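
One way to picture the transferability claim is a contract test written against the three primitives rather than against any one library. The Python sketch below assumes a hypothetical interface (`initialize`, `is_ready`, `reconfigure`) and a toy in-memory implementation; none of these names or semantics come from DMR's actual API, and the rule that reconfiguration before initialization must fail is an assumption made for the example.

```python
from typing import Protocol


class MalleabilityLibrary(Protocol):
    """Hypothetical interface covering the three primitives named above."""

    def initialize(self, num_procs: int) -> None: ...
    def is_ready(self) -> bool: ...
    def reconfigure(self, new_num_procs: int) -> int: ...


class ToyLibrary:
    """In-memory stand-in; it performs no MPI or scheduler calls."""

    def __init__(self) -> None:
        self._procs = 0
        self._initialized = False

    def initialize(self, num_procs: int) -> None:
        if num_procs < 1:
            raise ValueError("need at least one process")
        self._procs = num_procs
        self._initialized = True

    def is_ready(self) -> bool:
        return self._initialized

    def reconfigure(self, new_num_procs: int) -> int:
        if not self._initialized:
            raise RuntimeError("reconfigure called before initialize")
        if new_num_procs < 1:
            raise ValueError("need at least one process")
        self._procs = new_num_procs
        return self._procs


def check_contract(lib: MalleabilityLibrary) -> list[str]:
    """Run framework-agnostic checks; return the names of failed checks."""
    failures = []
    if lib.is_ready():
        failures.append("ready_before_initialize")
    try:
        lib.reconfigure(2)
        failures.append("reconfigure_allowed_before_initialize")
    except RuntimeError:
        pass  # assumed behaviour: reconfiguring too early is rejected
    lib.initialize(4)
    if not lib.is_ready():
        failures.append("not_ready_after_initialize")
    if lib.reconfigure(8) != 8:
        failures.append("expand_wrong_size")
    if lib.reconfigure(2) != 2:
        failures.append("shrink_wrong_size")
    return failures


if __name__ == "__main__":
    print(check_contract(ToyLibrary()))
```

Under these assumptions, porting the suite to another framework would amount to writing a thin adapter that satisfies the interface. Whether real frameworks fit that cleanly is exactly what the referee report further down questions.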

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Developers of additional resource management libraries could apply the same taxonomy to create reusable test suites for malleability features.
  • Containerized virtual clusters may allow broader hardware coverage during testing without requiring dedicated physical systems.
  • Standardized testing practices could shorten the time needed to validate updates to dynamic HPC software.

Load-bearing premise

The DMR framework is representative of other dynamic resource management solutions and the taxonomy comprehensively covers functional and non-functional tests at component-integration and system levels.

What would settle it

An experiment in which the taxonomy and CI ecosystem fail to detect known faults in DMR or another similar framework, or show no reduction in maintenance effort when dependencies change, would disprove the claimed benefits.
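
Such an experiment could take the form of fault seeding: run the suite against variants with known defects and record which ones it catches. The Python sketch below is a toy illustration under stated assumptions; the `resize` function, the three seeded faults, and the two checks are all hypothetical and stand in for a real framework and its test suite.

```python
def make_resizer(fault=None):
    """Return a toy resize function, optionally with one seeded fault."""

    def resize(current, requested):
        if requested < 1:
            if fault == "accepts_zero":
                return 0  # seeded fault: invalid size is silently accepted
            raise ValueError("need at least one process")
        if fault == "ignores_shrink" and requested < current:
            return current  # seeded fault: shrink requests are dropped
        if fault == "off_by_one":
            return requested + 1  # seeded fault: wrong process count
        return requested

    return resize


def suite_passes(resize):
    """A deliberately small suite: one expand check and one shrink check."""
    checks = [
        resize(2, 4) == 4,  # expand from 2 to 4 processes
        resize(4, 2) == 2,  # shrink from 4 to 2 processes
    ]
    return all(checks)


def detection_report(faults):
    """Map each seeded fault to True if the suite detects it."""
    assert suite_passes(make_resizer()), "suite must pass on the fault-free version"
    return {fault: not suite_passes(make_resizer(fault)) for fault in faults}


if __name__ == "__main__":
    report = detection_report(["ignores_shrink", "off_by_one", "accepts_zero"])
    for fault, detected in report.items():
        print(f"{fault}: {'detected' if detected else 'missed'}")
```

In this toy setup the suite catches two of the three seeded faults and misses `accepts_zero`, because no check exercises an invalid size. A comparable result on a real framework would count as evidence against the completeness of the test set.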

Figures

Figures reproduced from arXiv: 2604.26824 by Antonio J. Peña, Íñigo Aréjula-Aísa, Petter Sandås, Sergio Iserte.

Figure 2: Technology stack for an MPI malleable application, showing interactions between the application, DynRM framework, MPI runtime, and resource manager.
Figure 3: Jenkins deployment using Docker Compose. The controller container runs the Jenkins server and is the only externally exposed component, while the worker container, instantiated from a Docker-in-Docker (DinD) image, executes the containerized cluster-based CI pipelines.
Figure 4: DMR architecture.
Figure 5: DMR Core API State Diagram.
Figure 6: DMR’s CI pipeline set up with Slurm version 23.02.07 (shown as Slurm v23) and the DMR-specific resource manager Slurm4DMR. The dotted line shows the path through the pipeline when all stages complete successfully.
Original abstract

High-performance computing (HPC) systems are increasingly exploring dynamic resource management and malleable MPI applications to better adapt to heterogeneous architectures, fluctuating workloads, and energy constraints. However, the correctness of the libraries that support these techniques is often evaluated through ad hoc experiments that can be difficult to reproduce and maintain. This article introduces a methodology for testing dynamic resource management frameworks that combines a taxonomy of tests for MPI malleable libraries with an HPC-oriented continuous integration (CI) ecosystem. The taxonomy structures functional and non-functional tests at both component-integration and system levels. The CI ecosystem instantiates this taxonomy in a containerized virtual cluster, enabling automated validation. The approach is instantiated and evaluated using the Dynamic Management of Resources (DMR) framework as a representative case study. Results show that the proposed methodology improves early fault detection, simplifies maintenance under evolving dependencies, and transfers to other malleability solutions that expose analogous primitives for initialization, readiness checking, and reconfiguration.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes a taxonomy of functional and non-functional tests for MPI malleable libraries at component-integration and system levels, together with a containerized HPC-oriented continuous integration ecosystem that automates validation. The approach is instantiated and evaluated via a single case study on the Dynamic Management of Resources (DMR) framework; the authors claim that the methodology improves early fault detection, simplifies maintenance under evolving dependencies, and transfers to other malleability solutions that expose analogous primitives for initialization, readiness checking, and reconfiguration.

Significance. If the claims hold, the work supplies a reusable structured testing methodology for an increasingly important class of dynamic HPC systems, with the containerized CI ecosystem offering a concrete, reproducible mechanism for ongoing validation. The explicit taxonomy at multiple levels of abstraction is a constructive contribution that could reduce ad-hoc experimentation in the field.

major comments (2)
  1. [Abstract and case-study evaluation sections] The transferability claim (abstract and concluding sections) is central to the paper's contribution yet rests on a single DMR case study. No second framework is instantiated, no adaptation effort or semantic mismatches are measured, and no boundary cases (e.g., frameworks whose state or failure modes fall outside the three primitives) are examined. This makes the generalization assertion load-bearing but unsupported by the presented evidence.
  2. [Results / evaluation section] The results section asserts improvements in early fault detection and simplified maintenance, but supplies no quantitative metrics, baseline comparisons, error bars, or statistical analysis. Without these data the central empirical claims cannot be evaluated for magnitude or robustness.
minor comments (2)
  1. [Abstract] The abstract would benefit from a concise statement of the concrete test counts, coverage achieved, or maintenance-effort reduction observed in the DMR study.
  2. [Taxonomy section] Notation for the taxonomy categories (functional vs. non-functional, component vs. system) should be introduced once with a small table or diagram for quick reference.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback on our manuscript. We address each major comment below, indicating the revisions we plan to incorporate to strengthen the presentation of our claims.

Point-by-point responses
  1. Referee: [Abstract and case-study evaluation sections] The transferability claim (abstract and concluding sections) is central to the paper's contribution yet rests on a single DMR case study. No second framework is instantiated, no adaptation effort or semantic mismatches are measured, and no boundary cases (e.g., frameworks whose state or failure modes fall outside the three primitives) are examined. This makes the generalization assertion load-bearing but unsupported by the presented evidence.

    Authors: We agree that the transferability claim would be strengthened by additional supporting evidence beyond the single DMR case study. The taxonomy and CI ecosystem are designed around the three core primitives (initialization, readiness checking, and reconfiguration) that we identify as common to malleable MPI libraries. In the revised manuscript, we will add a dedicated discussion subsection within the evaluation section that maps the taxonomy to other known malleability frameworks, highlights potential semantic mismatches, and explicitly examines boundary cases where state or failure modes may differ. We will also moderate the language in the abstract and conclusions to frame transferability as a design-supported hypothesis demonstrated via DMR, rather than an empirically validated property across multiple systems. These changes will provide a more precise and balanced account of the evidence. revision: partial

  2. Referee: [Results / evaluation section] The results section asserts improvements in early fault detection and simplified maintenance, but supplies no quantitative metrics, baseline comparisons, error bars, or statistical analysis. Without these data the central empirical claims cannot be evaluated for magnitude or robustness.

    Authors: The current evaluation presents improvements through concrete qualitative examples from the DMR case study, including specific faults detected by the taxonomy that ad-hoc methods missed and the reduction in manual effort enabled by the automated CI pipeline. We acknowledge that quantitative metrics would allow better assessment of magnitude and robustness. In the revision, we will augment the results section with available quantitative data from our development and validation logs, such as the total number of test cases across taxonomy levels, measured reductions in validation cycle time, and timelines of fault detection before versus after adopting the methodology. Where direct baselines exist from prior ad-hoc practices, we will include them for comparison. This will be presented in additional tables or figures to support the claims more rigorously. revision: yes

Circularity Check

0 steps flagged

No circularity in methodology proposal or case study

This is a software engineering methodology paper that defines a test taxonomy and CI ecosystem, then applies it to a single representative framework (DMR) as a case study. No equations, fitted parameters, predictions, or derivations exist that could reduce to inputs by construction. The transferability statement is presented as an assumption based on shared primitives rather than a derived result, and no self-citation chains or ansatzes are used to justify core claims. The work is self-contained as a descriptive contribution with empirical instantiation; any limitations lie in validation breadth, not in circular reasoning.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The paper rests on standard domain assumptions from HPC and software testing without introducing free parameters or new entities. The central assumption is that DMR adequately represents the class of malleable frameworks.

axioms (1)
  • domain assumption: Malleable MPI applications and dynamic resource management frameworks expose specific primitives for initialization, readiness checking, and reconfiguration.
    This assumption defines the scope of the taxonomy and its claimed transferability to other solutions.

pith-pipeline@v0.9.0 · 5479 in / 1220 out tokens · 53641 ms · 2026-05-07T11:39:06.552083+00:00 · methodology

Reference graph

Works this paper leans on

42 extracted references · 19 canonical work pages · 1 internal anchor

  1. [1]

    M. Vázquez, G. Houzeaux, S. Koric, A. Artigues, J. Aguado-Sierra, R. Arís, D. Mira, H. Calmet, F. Cucchietti, H. Owen, A. Taha, E. D. Burness, J. M. Cela, M. Valero, Alya: Multiphysics engineering simulation towards exascale, J. Comput. Sci. 14 (2016) 15–27. doi:10.1016/j.jocs.2015.12.007

  2. [2]

    Malleable implementation of SERGHEI-SWE with DMR, https://jlesc.github.io/projects/serghei-dmr, project status: starting; accessed: April 30, 2026 (2025)

  3. [3]

    D. Caviedes-Voullieme, M. Morales-Hernández, M. R. Norman, I. Özgen-Xian, SERGHEI (SERGHEI-SWE) v1.0: A performance-portable high-performance parallel-computing shallow-water solver for hydrology and environmental hydraulics, Geoscientific Model Development 16 (3) (2023) 977–1008. doi:10.5194/gmd-16-977-2023

  4. [4]

    EUPILOT: Pilot using independent, local & open technologies, https://eupilot.eu/, started December 2021 – Coordinated by Barcelona Supercomputing Center (BSC) (2021)

  5. [5]

    K. Rojek, R. Wyrzykowski, Parallelization of 3D MPDATA algorithm using many graphics processors, in: Proceedings of the 13th International Conference on Parallel Computing Technologies - Volume 9251, 2015, pp. 445–457. doi:10.1007/978-3-319-21909-7_43

  6. [6]

    S. Iserte, K. Rojek, A study of the effect of process malleability in the energy efficiency on GPU-based clusters, Journal of Supercomputing 76 (2020) 255–274. doi:10.1007/s11227-019-03034-x

  7. [7]

    H. Martínez, J. Tárraga, I. Medina, S. Barrachina, M. Castillo, J. Dopazo, E. S. Quintana-Ortí, A dynamic pipeline for RNA sequencing on multicore processors, in: Proceedings of the 20th European MPI Users’ Group Meeting, 2013, pp. 235–240

  8. [8]

    S. Iserte, H. Martínez, S. Barrachina, M. Castillo, R. Mayo, A. J. Peña, Dynamic reconfiguration of non-iterative scientific applications: A case study with HPG-aligner, International Journal of High Performance Computing Applications 33 (2018) 1–10. doi:10.1177/1094342018802347

  9. [9]

    I. Martín-Álvarez, J. I. Aliaga, M. Castillo, S. Iserte, R. Mayo, Dynamic spawning of MPI processes applied to malleability, International Journal of High Performance Computing Applications 0 (2023) 1–25, iSBN: pending. doi:10.1177/10943420231176527

  10. [10]

    S. Iserte, I. Martín-Álvarez, K. Rojek, J. I. Aliaga, M. Castillo, A. J. Peña, Towards the democratization and standardization of dynamic resources with MPI spawning, in: Parallel Processing and Applied Mathematics, Springer Nature Switzerland, Cham, 2025, pp. 287–300. doi:10.1007/978-3-031-85697-6_19

  11. [11]

    D. Huber, M. Schreiber, M. Schulz, H. Pritchard, D. Holmes, Design principles of dynamic resource management for high-performance parallel programming models (2024). arXiv:2403.17107. URL https://arxiv.org/abs/2403.17107

  12. [12]

    D. Huber, S. Iserte, M. Schreiber, A. J. Peña, M. Schulz, Bridging the gap between genericity and programmability of dynamic resources in HPC, in: ISC High Performance 2025 Research Paper Proceedings (40th International Conference), 2025, pp. 1–11. URL https://ieeexplore.ieee.org/document/11018304

  13. [13]

    A comprehensive software stack for dynamic resource management: Integration of DPP and DMR in OAR, https://jlesc.github.io/projects/dmr-dpp-oar/, project webpage. (2025)

  14. [14]

    M. Garcia, J. Labarta, J. Corbalan, Hints to improve automatic load balancing with lewi for hybrid applications, Journal of Parallel and Distributed Computing 74 (9) (2014) 2781–2794. doi:10.1016/j.jpdc.2014.05.004. URL https://www.sciencedirect.com/science/article/pii/S0743731514000926

  15. [15]

    M. De Rosso, Empowering the DMR malleability framework for MPI with the ULFM extension (Oct. 2025). URL https://www.politesi.polimi.it/handle/10589/243438

  16. [16]

    H.-J. Bungartz, P.-F. Dutot, J. Fecht, K. Gaddameedi, D. Huber, S. Iserte, M. Minion, T. Neckel, A. Peña, O. Richard, M. Schreiber, M. Schulz, V. Schüller, A layered approach for dynamic resource management in HPC, in: Euro-Par 2024: Parallel Processing Workshops: Euro-Par 2024 International Workshops, Madrid, Spain, August 26–30, 2024, Proceedings, Pa...

  17. [17]

    S. Iserte, I. Martín-Álvarez, K. Rojek, J. I. Aliaga, M. Castillo, W. Folwarska, A. J. Peña, Resource optimization with MPI process malleability for dynamic workloads in HPC clusters, Future Generation Computer Systems (2025) 107949. doi:https://doi.org/10.1016/j.

    P. Sandås et al.: Preprint submitted to Elsevier. Page 14 of 15. CI for Dynamic Resource Managem...

  18. [18]

    S. Iserte, M. Madon, G. Da Costa, J.-M. Pierson, A. J. Peña, MPI malleability validation under replayed real-world HPC conditions, Future Generation Computer Systems (2025) 108305. doi:10.1016/j.future.2025.108305

  19. [19]

    R. Rocco, S. Rizzo, M. Barbieri, G. Bettonte, E. Boella, F. Ganz, S. Iserte, A. J. Peña, P. Sandås, A. Scionti, et al., Dynamic solutions for hybrid quantum-HPC resource allocation, arXiv preprint arXiv:2508.04217 (2025)

  20. [20]

    P. Lemarinier, K. Hasanov, S. Venugopal, K. Katrinis, Architecting malleable MPI applications for priority-driven adaptive scheduling, in: 23rd EuroMPI, 2016, pp. 74–81

  21. [21]

    F. S. Ribeiro, A. P. Nascimento, C. Boeres, V. E. F. Rebello, A. C. Sena, Autonomic malleability in iterative MPI applications, in: Proceedings of the International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD), 2013, pp. 192–199

  22. [22]

    R. Sudarsan, C. J. Ribbens, Reshape: A framework for dynamic resizing and scheduling of homogeneous applications in a parallel environment, in: Proceedings of the International Conference on Parallel Processing (ICPP), 2007

  23. [23]

    K. El Maghraoui, T. J. Desell, B. K. Szymanski, C. A. Varela, Dynamic malleability in iterative MPI applications, in: Proceedings of the 7th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid), 2007, pp. 591–598

  24. [24]

    A. Gupta, B. Acun, O. Sarood, L. V. Kalé, Towards realizing the potential of malleable jobs, in: 21st International Conference on High Performance Computing, 2014

  25. [25]

    G. Martín, M.-C. Marinescu, D. E. Singh, J. Carretero, FLEX-MPI: An MPI extension for supporting dynamic load balancing on heterogeneous non-dedicated systems, in: Euro-Par Parallel Processing, 2013, pp. 138–149

  26. [26]

    I. Comprés, A. Mo-Hellenbrand, M. Gerndt, H.-J. Bungartz, Infrastructure and API extensions for elastic execution of MPI applications, in: Proceedings of the 23rd EuroMPI, 2016, pp. 82–97

  27. [27]

    W. Bland, A. Bouteiller, T. Herault, G. Bosilca, J. Dongarra, Post-failure recovery of MPI communication capability: Design and rationale, International Journal of High Performance Computing Applications 27 (3) (2013) 244–254

  28. [28]

    I. Martín-Álvarez, J. I. Aliaga, M. Castillo, S. Iserte, Proteo: A framework for the generation and evaluation of malleable MPI applications, The Journal of Supercomputing (Jul. 2024). doi:10.1007/s11227-024-06277-5

  29. [29]

    K. El Maghraoui, T. J. Desell, B. K. Szymanski, C. A. Varela, Malleable iterative MPI applications, Concurrency and Computation: Practice and Experience 21 (3) (Mar. 2009)

  30. [30]

    S. Prabhakaran, M. Neumann, S. Rinke, F. Wolf, A. Gupta, L. V. Kale, A batch system with efficient adaptive scheduling for malleable and evolving applications, in: IEEE International Parallel and Distributed Processing Symposium (IPDPS), 2015

  31. [31]

    I. Martín-Álvarez, J. I. Aliaga, M. Castillo, S. Iserte, MaM: A user-friendly interface to incorporate malleability into MPI applications, in: Euro-Par 2024: Parallel Processing Workshops, Cham, 2025, pp. 346–358

  32. [32]

    Z. Sampedro, A. Holt, T. Hauser, Continuous integration and delivery for HPC: Using Singularity and Jenkins, in: Proceedings of the Practice and Experience on Advanced Research Computing: Seamless Creativity, PEARC ’18, Association for Computing Machinery, New York, NY, USA, 2018, pp. 1–6. doi:10.1145/3219104.3219147

  33. [33]

    Society of Research Software Engineering, Building an efficient continuous integration workflow on HPC systems (Feb. 2025). URL https://www.youtube.com/watch?v=xchS7wef0L0

  34. [34]

    S. Peters, S. Marcus, D. Gläser, J. Linxweiler, Enhancing scientific reproducibility: A continuous integration workflow for high-performance computing, Electronic Communications of the EASST 83 (Feb. 2025). doi:10.14279/eceasst.v83.2625

  35. [35]

    T. Maric, D. Gläser, J.-P. Lehr, I. Papagiannidis, B. Lambie, C. Bischof, D. Bothe, A research software engineering workflow for computational science and engineering (Aug. 2022). doi:10.48550/arXiv.2208.07460

  36. [36]

    A. Schubert, R. Argent, Promoting scientific software quality through transition to continuous integration and continuous delivery, Socio-Environmental Systems Modelling 6 (2024) 18779. doi:10.18174/sesmo.18779

  37. [37]

    J. I. Aliaga, M. Castillo, S. Iserte, I. Martín-Álvarez, R. Mayo, A survey on malleability solutions for high-performance distributed computing, Applied Sciences 12 (2022) 1–32. doi:10.3390/app12105231

  38. [38]

    A. Tarraf, M. Schreiber, A. Cascajo, J.-B. Besnard, M.-A. Vef, D. Huber, S. Happ, A. Brinkmann, D. E. Singh, H.-C. Hoppe, A. Miranda, A. J. Peña, R. Machado, M. G. Gasulla, M. Schulz, P. Carpenter, S. Pickartz, T. Rotaru, S. Iserte, V. Lopez, J. Ejarque, H. Sirwani, F. Wolf, Malleability in modern HPC systems: Current experiences, challenges, and future o...

  39. [39]

    J. Hursey, R. Castain, PMIx swarm toy box, https://github.com/jjhursey/pmix-swarm-toy-box, GitHub repository. Accessed: 2025-10-30 (2021)

  40. [40]

    I. A. Comprés Ureña, D. Huber, DPP Docker cluster, https://gitlab.inria.fr/dynres/dyn-procs/docker-cluster, GitLab repository. Accessed: 2025-10-30 (2023)

  41. [41]

    S. Iserte, High-throughput computation through efficient resource management, Ph.D. Thesis, Universitat Jaume I (UJI) (Nov. 2018). doi:10.6035/14101.2018.176272

  42. [42]

    G. Torres, Slurm-Docker-Cluster, https://github.com/giovtorres/slurm-docker-cluster/tree/main, accessed: 2025-10-30 (2025).