A Test Taxonomy and Continuous Integration Ecosystem for Dynamic Resource Management in HPC
Pith reviewed 2026-05-07 11:39 UTC · model grok-4.3
The pith
A test taxonomy paired with a continuous integration ecosystem improves fault detection and maintenance for dynamic resource management frameworks in HPC.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The authors establish that combining a taxonomy of tests for MPI malleable libraries with an HPC-oriented continuous integration ecosystem instantiated in a containerized virtual cluster, when applied to the DMR framework, improves early fault detection, simplifies maintenance under evolving dependencies, and transfers to other malleability solutions that expose analogous primitives for initialization, readiness checking, and reconfiguration.
What carries the argument
The test taxonomy that structures functional and non-functional tests at both component-integration and system levels, instantiated via a containerized virtual cluster for automated validation.
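One way to picture the two-axis taxonomy is as a small classification structure in which every test falls into exactly one of four cells. The following is a minimal sketch and is not taken from the paper: the names `Kind`, `Level`, `ClassifiedTest`, and the example test names are hypothetical and serve only to illustrate the functional/non-functional and component-integration/system split.

```python
from dataclasses import dataclass
from enum import Enum
from itertools import product


class Kind(Enum):
    FUNCTIONAL = "functional"
    NON_FUNCTIONAL = "non-functional"


class Level(Enum):
    COMPONENT_INTEGRATION = "component-integration"
    SYSTEM = "system"


@dataclass(frozen=True)
class ClassifiedTest:
    name: str  # hypothetical test name, for illustration only
    kind: Kind
    level: Level


def group_by_cell(tests):
    """Group test names into the four (kind, level) cells of the taxonomy."""
    cells = {cell: [] for cell in product(Kind, Level)}
    for t in tests:
        cells[(t.kind, t.level)].append(t.name)
    return cells


def empty_cells(tests):
    """Return the taxonomy cells that no test covers yet."""
    return [cell for cell, names in group_by_cell(tests).items() if not names]


# Hypothetical suite; none of these tests are claimed to exist in DMR.
SUITE = [
    ClassifiedTest("expand_preserves_data", Kind.FUNCTIONAL, Level.COMPONENT_INTEGRATION),
    ClassifiedTest("shrink_under_scheduler", Kind.FUNCTIONAL, Level.SYSTEM),
    ClassifiedTest("reconfiguration_overhead", Kind.NON_FUNCTIONAL, Level.SYSTEM),
]
```

A helper such as `empty_cells(SUITE)` makes coverage gaps visible per cell, which is the kind of bookkeeping a CI job could report on each run.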
If this is right
- The methodology improves early fault detection through automated validation of dynamic resource management frameworks.
- Maintenance is simplified when testing suites must adapt to evolving software dependencies.
- The approach transfers to other malleability solutions that provide similar primitives for initialization, readiness checking, and reconfiguration.
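The transferability point can be illustrated by writing a test against the three primitives only, so that any framework exposing them could be plugged in. This is a sketch under stated assumptions: `MalleabilityAdapter`, `FakeAdapter`, and `check_expand_then_shrink` are hypothetical names, the fake adapter is an in-memory stand-in, and nothing here reflects the actual DMR or MPI API.

```python
from abc import ABC, abstractmethod


class MalleabilityAdapter(ABC):
    """Hypothetical wrapper around the three primitives named in the claim."""

    @abstractmethod
    def initialize(self, nprocs: int) -> None: ...

    @abstractmethod
    def is_ready(self) -> bool: ...

    @abstractmethod
    def reconfigure(self, new_nprocs: int) -> int: ...


class FakeAdapter(MalleabilityAdapter):
    """In-memory stand-in; it does not call DMR or MPI."""

    def __init__(self, max_procs: int):
        self.max_procs = max_procs
        self.nprocs = 0

    def initialize(self, nprocs: int) -> None:
        if not 0 < nprocs <= self.max_procs:
            raise ValueError("invalid initial process count")
        self.nprocs = nprocs

    def is_ready(self) -> bool:
        return self.nprocs > 0

    def reconfigure(self, new_nprocs: int) -> int:
        if not self.is_ready():
            raise RuntimeError("reconfigure called before initialize")
        # Clamp the request to what the fake cluster can provide.
        self.nprocs = max(1, min(new_nprocs, self.max_procs))
        return self.nprocs


def check_expand_then_shrink(adapter: MalleabilityAdapter) -> bool:
    """Framework-agnostic functional check written only against the primitives."""
    adapter.initialize(2)
    if not adapter.is_ready():
        return False
    expanded = adapter.reconfigure(4)
    shrunk = adapter.reconfigure(2)
    return expanded == 4 and shrunk == 2
```

Under this design, porting the check to another malleability solution would mean writing one new adapter class; whether real frameworks map this cleanly onto three primitives is exactly what the referee report questions.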
Where Pith is reading between the lines
- Developers of additional resource management libraries could apply the same taxonomy to create reusable test suites for malleability features.
- Containerized virtual clusters may allow broader hardware coverage during testing without requiring dedicated physical systems.
- Standardized testing practices could shorten the time needed to validate updates to dynamic HPC software.
Load-bearing premise
The DMR framework is representative of other dynamic resource management solutions and the taxonomy comprehensively covers functional and non-functional tests at component-integration and system levels.
What would settle it
An experiment in which the taxonomy and CI ecosystem fail to detect known faults in DMR or another similar framework, or show no reduction in maintenance effort when dependencies change, would disprove the claimed benefits.
Original abstract
High-performance computing (HPC) systems are increasingly exploring dynamic resource management and malleable MPI applications to better adapt to heterogeneous architectures, fluctuating workloads, and energy constraints. However, the correctness of the libraries that support these techniques is often evaluated through ad hoc experiments that can be difficult to reproduce and maintain. This article introduces a methodology for testing dynamic resource management frameworks that combines a taxonomy of tests for MPI malleable libraries with an HPC-oriented continuous integration (CI) ecosystem. The taxonomy structures functional and non-functional tests at both component-integration and system levels. The CI ecosystem instantiates this taxonomy in a containerized virtual cluster, enabling automated validation. The approach is instantiated and evaluated using the Dynamic Management of Resources (DMR) framework as a representative case study. Results show that the proposed methodology improves early fault detection, simplifies maintenance under evolving dependencies, and transfers to other malleability solutions that expose analogous primitives for initialization, readiness checking, and reconfiguration.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes a taxonomy of functional and non-functional tests for MPI malleable libraries at component-integration and system levels, together with a containerized HPC-oriented continuous integration ecosystem that automates validation. The approach is instantiated and evaluated via a single case study on the Dynamic Management of Resources (DMR) framework; the authors claim that the methodology improves early fault detection, simplifies maintenance under evolving dependencies, and transfers to other malleability solutions that expose analogous primitives for initialization, readiness checking, and reconfiguration.
Significance. If the claims hold, the work supplies a reusable structured testing methodology for an increasingly important class of dynamic HPC systems, with the containerized CI ecosystem offering a concrete, reproducible mechanism for ongoing validation. The explicit taxonomy at multiple levels of abstraction is a constructive contribution that could reduce ad-hoc experimentation in the field.
major comments (2)
- [Abstract and case-study evaluation sections] The transferability claim (abstract and concluding sections) is central to the paper's contribution yet rests on a single DMR case study. No second framework is instantiated, no adaptation effort or semantic mismatches are measured, and no boundary cases (e.g., frameworks whose state or failure modes fall outside the three primitives) are examined. This makes the generalization assertion load-bearing but unsupported by the presented evidence.
- [Results / evaluation section] The results section asserts improvements in early fault detection and simplified maintenance, but supplies no quantitative metrics, baseline comparisons, error bars, or statistical analysis. Without these data the central empirical claims cannot be evaluated for magnitude or robustness.
minor comments (2)
- [Abstract] The abstract would benefit from a concise statement of the concrete test counts, coverage achieved, or maintenance-effort reduction observed in the DMR study.
- [Taxonomy section] Notation for the taxonomy categories (functional vs. non-functional, component vs. system) should be introduced once with a small table or diagram for quick reference.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback on our manuscript. We address each major comment below, indicating the revisions we plan to incorporate to strengthen the presentation of our claims.
- Referee: [Abstract and case-study evaluation sections] The transferability claim (abstract and concluding sections) is central to the paper's contribution yet rests on a single DMR case study. No second framework is instantiated, no adaptation effort or semantic mismatches are measured, and no boundary cases (e.g., frameworks whose state or failure modes fall outside the three primitives) are examined. This makes the generalization assertion load-bearing but unsupported by the presented evidence.
  Authors: We agree that the transferability claim would be strengthened by additional supporting evidence beyond the single DMR case study. The taxonomy and CI ecosystem are designed around the three core primitives (initialization, readiness checking, and reconfiguration) that we identify as common to malleable MPI libraries. In the revised manuscript, we will add a dedicated discussion subsection within the evaluation section that maps the taxonomy to other known malleability frameworks, highlights potential semantic mismatches, and explicitly examines boundary cases where state or failure modes may differ. We will also moderate the language in the abstract and conclusions to frame transferability as a design-supported hypothesis demonstrated via DMR, rather than an empirically validated property across multiple systems. These changes will provide a more precise and balanced account of the evidence.
  Revision: partial
- Referee: [Results / evaluation section] The results section asserts improvements in early fault detection and simplified maintenance, but supplies no quantitative metrics, baseline comparisons, error bars, or statistical analysis. Without these data the central empirical claims cannot be evaluated for magnitude or robustness.
  Authors: The current evaluation presents improvements through concrete qualitative examples from the DMR case study, including specific faults detected by the taxonomy that ad-hoc methods missed and the reduction in manual effort enabled by the automated CI pipeline. We acknowledge that quantitative metrics would allow better assessment of magnitude and robustness. In the revision, we will augment the results section with available quantitative data from our development and validation logs, such as the total number of test cases across taxonomy levels, measured reductions in validation cycle time, and timelines of fault detection before versus after adopting the methodology. Where direct baselines exist from prior ad-hoc practices, we will include them for comparison. This will be presented in additional tables or figures to support the claims more rigorously.
  Revision: yes
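The baseline comparison requested by the referee could be reported with very little machinery. The sketch below assumes two lists of validation cycle durations, one from the earlier ad hoc practice and one from the CI pipeline; the function names are hypothetical and no measurements from the paper are used or implied.

```python
from statistics import mean, stdev


def relative_reduction(baseline, treatment):
    """Fractional reduction of the mean from baseline to treatment."""
    base = mean(baseline)
    if base == 0:
        raise ValueError("baseline mean must be non-zero")
    return (base - mean(treatment)) / base


def summarize(samples):
    """Mean and sample standard deviation (needs at least two samples)."""
    return mean(samples), stdev(samples)
```

Reporting `summarize` for each group alongside `relative_reduction` would give both the magnitude and the spread that the report says are missing; a proper statistical test would still be needed to argue significance.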
Circularity Check
No circularity in methodology proposal or case study
This is a software engineering methodology paper that defines a test taxonomy and CI ecosystem, then applies it to a single representative framework (DMR) as a case study. No equations, fitted parameters, predictions, or derivations exist that could reduce to inputs by construction. The transferability statement is presented as an assumption based on shared primitives rather than a derived result, and no self-citation chains or ansatzes are used to justify core claims. The work is self-contained as a descriptive contribution with empirical instantiation; any limitations lie in validation breadth, not in circular reasoning.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Malleable MPI applications and dynamic resource management frameworks expose specific primitives for initialization, readiness checking, and reconfiguration.