arXiv: 2605.03743 · v1 · submitted 2026-05-05 · 💻 cs.DC · cs.AI · cs.HC · cs.SE

A Workflow-Oriented Framework for Asynchronous Human-AI Collaboration in Hybrid and Compute-Intensive HPC Environments

Pith reviewed 2026-05-07 14:11 UTC · model grok-4.3

classification 💻 cs.DC · cs.AI · cs.HC · cs.SE
keywords workflow framework · asynchronous human-AI collaboration · HPC environments · SLURM scheduling · hybrid infrastructure · AI model training · checkpoint pausing · non-blocking supervision

The pith

A workflow framework enables asynchronous human-AI collaboration in hybrid HPC environments by pausing at checkpoints without halting jobs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper seeks to integrate necessary human judgment into AI processes that run on high-performance computing systems where constant real-time interaction is impractical. It does so by creating a workflow system that inserts defined pause points for human input while the underlying compute tasks continue running on the hardware. This matters because high-stakes applications such as defense and security AI require occasional human oversight, yet tying up expensive HPC resources during waits wastes capacity and slows progress. The framework works with SLURM scheduling, containerized and native tasks, and spans HPC clusters, local machines, and cloud platforms. A demonstration on model training illustrates how the approach preserves efficiency and portability.

Core claim

We present a workflow framework that enables asynchronous human-AI collaboration across hybrid infrastructures, including HPC clusters, local machines, and cloud platforms. Workflows can pause at defined checkpoints for human input without halting underlying compute jobs, preventing idle resources and enabling non-blocking supervision. The framework supports interaction with SLURM-based scheduling, containerized and native tasks, and is customized for scenarios requiring human judgment and adaptability. We demonstrate its application in model training on systems like MareNostrum 5, highlighting benefits in portability, efficiency, and oversight in operational AI workflows.

What carries the argument

A checkpoint-insertion mechanism in the workflow orchestration layer that decouples the timing of human interaction from ongoing compute execution across hybrid infrastructures.
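To make the decoupling idea concrete, here is a minimal in-process sketch, assuming a single Python event loop stands in for the orchestrator. It is an analogy only: the paper's summary does not describe the actual mechanism, and every name here (`compute_task`, `checkpointed_task`, `run_workflow`, the "approve" decision) is hypothetical. The point it illustrates is that only the branch awaiting the human decision is suspended, while an independent compute branch runs to completion.

```python
import asyncio

# Illustrative analogy only; not the framework's implementation.
# All function names and the "approve" value are hypothetical.

async def compute_task(name, steps, log):
    """Stand-in for a compute job that does not depend on the human decision."""
    for i in range(steps):
        await asyncio.sleep(0)          # yield control, simulating a unit of work
        log.append(f"{name}:step{i}")
    return name

async def checkpointed_task(decision, log):
    """Stand-in for a workflow branch that needs human input to proceed."""
    log.append("checkpoint:waiting")
    choice = await decision             # suspends this branch only
    log.append(f"checkpoint:{choice}")
    return choice

async def run_workflow():
    log = []
    loop = asyncio.get_running_loop()
    decision = loop.create_future()     # resolved later by the simulated reviewer
    # Simulate a human answering 50 ms later, long after the compute branch is done.
    loop.call_later(0.05, decision.set_result, "approve")
    results = await asyncio.gather(
        compute_task("train", 3, log),
        checkpointed_task(decision, log),
    )
    return results, log

def demo():
    """Run the sketch and return (results, event log)."""
    return asyncio.run(run_workflow())
```

In the returned log, the checkpoint starts waiting before the compute steps finish, and the decision arrives only after them. That ordering is the non-blocking property in miniature; whether the real framework preserves it across SLURM and containers is exactly what the premise below leaves open.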

Load-bearing premise

Human checkpoints can be inserted into SLURM-based and containerized HPC workflows while preserving efficiency, portability, and non-blocking behavior across hybrid infrastructures.

What would settle it

Controlled runs on a system like MareNostrum 5 that compare total job runtime, CPU/GPU utilization, and resume success rates with and without checkpoints; significant added overhead or failed resumptions would falsify the non-blocking and efficiency claims.
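The comparison described above could be scored with a few simple statistics. The sketch below is one possible way to do it, assuming each run is summarized by wall-clock runtime, mean GPU utilization, and (for checkpointed runs) whether the resume succeeded. The `RunRecord` type and the function names are hypothetical, and no measured values from the paper are used or implied.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Optional, Sequence

# Hypothetical record layout; the paper reports no such measurements.
@dataclass(frozen=True)
class RunRecord:
    runtime_s: float                    # wall-clock time of the whole workflow
    gpu_util: float                     # mean GPU utilization in [0, 1]
    resumed_ok: Optional[bool] = None   # None for runs without checkpoints

def relative_overhead(baseline: Sequence[RunRecord],
                      checkpointed: Sequence[RunRecord]) -> float:
    """Fractional increase in mean runtime caused by checkpoints."""
    base = mean(r.runtime_s for r in baseline)
    return (mean(r.runtime_s for r in checkpointed) - base) / base

def utilization_drop(baseline: Sequence[RunRecord],
                     checkpointed: Sequence[RunRecord]) -> float:
    """Absolute loss in mean GPU utilization when checkpoints are enabled."""
    return (mean(r.gpu_util for r in baseline)
            - mean(r.gpu_util for r in checkpointed))

def resume_success_rate(checkpointed: Sequence[RunRecord]) -> float:
    """Share of checkpointed runs that resumed correctly after human input."""
    outcomes = [r.resumed_ok for r in checkpointed if r.resumed_ok is not None]
    if not outcomes:
        raise ValueError("no checkpointed runs with a recorded resume outcome")
    return sum(outcomes) / len(outcomes)
```

With placeholder inputs such as two baseline runs of 1000 s and two checkpointed runs of 1050 s, `relative_overhead` returns roughly 0.05. What counts as "significant" overhead is a threshold the experimenters would have to set in advance.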

Figures

Figures reproduced from arXiv: 2605.03743 by Cedric Bhihe, David Modesto, Jesus Gomez Canovas, Jose Martin Bugallo Batalla, Miguel Perez Espinosa, Natalia Zamora, Rafel Palomo Avellaneda, Sergio Mendoza.

Figure 1: Layered architecture of CIF. At the top, CIF-CLI receives the workflow file and starts execution.
Figure 2: UML sequence of asynchronous HITL in CIF. Two stakeholders interact through the CIF-CLI: the …
Figure 3: Abstract workflow for the ship-detection case study. Two execution sites are used: the Tactical Unit …
Figure 4: Workflow structure for the ship-detection case study. The figure shows the initial configuration of …
Figure 5: Validation workflow for CIF (flowchart view). Inference runs on the Tactical Unit, while training …
Figure 6: Qualitative inference results before and after HITL-supervised retraining. (a) Baseline: generic “vehi…
Original abstract

Human involvement is critical in training and deploying AI systems in high-stakes defence and security contexts. However, real-time interaction is impractical in HPC environments due to compute intensity and resource constraints. We present a workflow framework that enables asynchronous human-AI collaboration across hybrid infrastructures, including HPC clusters, local machines, and cloud platforms. Workflows can pause at defined checkpoints for human input without halting underlying compute jobs, preventing idle resources and enabling non-blocking supervision. The framework supports interaction with SLURM-based scheduling, containerized and native tasks, and is customized for scenarios requiring human judgment and adaptability. We demonstrate its application in model training on systems like MareNostrum 5, highlighting benefits in portability, efficiency, and oversight in operational AI workflows.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript presents a workflow-oriented framework for asynchronous human-AI collaboration in hybrid HPC environments. It claims that workflows can pause at defined checkpoints for human input without halting underlying compute jobs, thereby avoiding idle resources, while supporting SLURM-based scheduling, containerized and native tasks, and hybrid infrastructures (HPC clusters, local machines, cloud). The framework is customized for high-stakes defense/security scenarios requiring human judgment and is demonstrated via model training on systems such as MareNostrum 5, with asserted benefits in portability, efficiency, and oversight.

Significance. If the framework's non-blocking checkpoint mechanism and hybrid integration function as described, the work would offer a practical advance for integrating human oversight into compute-intensive AI pipelines in resource-constrained HPC settings. This could reduce waste in long-running jobs and improve adaptability in operational contexts, addressing a real gap between real-time human interaction needs and HPC constraints. The absence of any metrics or implementation details, however, prevents assessment of whether these benefits are realized.

major comments (2)
  1. [Abstract] Abstract: The claim that 'workflows can pause at defined checkpoints for human input without halting underlying compute jobs' and 'preventing idle resources' is load-bearing for the entire contribution, yet the manuscript provides zero description of the checkpoint mechanism, how asynchrony is maintained with SLURM job control or container orchestration, or how hybrid infrastructure handoffs occur.
  2. [Demonstration] Demonstration section: The assertion of successful application in model training on MareNostrum 5 is unsupported by any implementation details, architecture diagrams, pseudocode, performance metrics, overhead measurements, or empirical results on efficiency, portability, or non-blocking behavior.
minor comments (1)
  1. [Abstract] The abstract refers to 'customized for scenarios requiring human judgment and adaptability' without clarifying how the framework exposes or manages such customization points.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. We address the major comments point by point below and will make revisions to improve the clarity and completeness of the manuscript.

  1. Referee: [Abstract] Abstract: The claim that 'workflows can pause at defined checkpoints for human input without halting underlying compute jobs' and 'preventing idle resources' is load-bearing for the entire contribution, yet the manuscript provides zero description of the checkpoint mechanism, how asynchrony is maintained with SLURM job control or container orchestration, or how hybrid infrastructure handoffs occur.

    Authors: We acknowledge that the current manuscript does not provide sufficient technical details on the checkpoint mechanism in the main text, despite the abstract summarizing the key benefit. In the revised version, we will add a detailed description of the checkpointing approach, explaining how asynchrony is maintained through SLURM job management (e.g., using job dependencies and external triggers for human input), container orchestration for task handoffs, and mechanisms for hybrid infrastructure integration. This will be supported by an architecture diagram and pseudocode to allow readers to understand and potentially replicate the non-blocking behavior. revision: yes

  2. Referee: [Demonstration] Demonstration section: The assertion of successful application in model training on MareNostrum 5 is unsupported by any implementation details, architecture diagrams, pseudocode, performance metrics, overhead measurements, or empirical results on efficiency, portability, or non-blocking behavior.

    Authors: The referee correctly identifies that the demonstration section is currently high-level and lacks the requested supporting materials. We will revise this section to include architecture diagrams, pseudocode for the workflow execution, and available empirical results from the MareNostrum 5 deployment, including any measurements of overhead, efficiency gains, and portability across hybrid setups. This will provide concrete evidence for the claimed benefits in the context of model training. revision: yes
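The first response above mentions SLURM job dependencies and external triggers as the way asynchrony would be maintained. One plausible shape for that, sketched under assumptions, uses standard SLURM commands: a follow-up job is submitted with `sbatch --hold` and `--dependency=afterok:<jobid>`, then either released with `scontrol release` or cancelled with `scancel` once the human decides. The script names and job IDs are hypothetical, and nothing here is taken from the authors' implementation. The functions only build command lines, so the sketch runs without a cluster.

```python
from typing import List

# Illustrative sketch only. Script names and job IDs are hypothetical,
# and this is not taken from the framework described in the paper.

def submit_compute(script: str) -> List[str]:
    """Submit a compute job; --parsable makes sbatch print only the job ID."""
    return ["sbatch", "--parsable", script]

def submit_gated(script: str, parent_job_id: str) -> List[str]:
    """Submit a follow-up job gated on both its parent and a human decision.

    --dependency=afterok waits for the parent to finish successfully;
    --hold keeps the job pending until it is explicitly released.
    """
    return ["sbatch", "--parsable", "--hold",
            f"--dependency=afterok:{parent_job_id}", script]

def apply_decision(job_id: str, approved: bool) -> List[str]:
    """Turn the human decision into a SLURM command for the gated job."""
    if approved:
        return ["scontrol", "release", job_id]
    return ["scancel", job_id]

def plan(train_script: str, followup_script: str,
         train_job_id: str, followup_job_id: str,
         approved: bool) -> List[List[str]]:
    """Ordered command lines for one checkpointed step of a workflow."""
    return [
        submit_compute(train_script),
        submit_gated(followup_script, train_job_id),
        apply_decision(followup_job_id, approved),
    ]
```

Each command list could be passed to `subprocess.run` on a login node. A held job stays pending and holds no allocated nodes while the reviewer deliberates, which is the property the idle-resource claim depends on. Whether the framework actually works this way is something the promised revision would need to show.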

Circularity Check

0 steps flagged

No circularity: systems-description paper with no derivations or predictions


The paper presents a workflow framework for asynchronous human-AI collaboration in hybrid HPC environments. It contains no mathematical derivations, equations, fitted parameters, predictions, or self-referential logic. Claims rest on the existence and application of the implemented framework (e.g., SLURM support and non-blocking checkpoints) rather than any reduction to inputs by construction. No load-bearing steps match the enumerated circularity patterns; the work is self-contained as a descriptive systems contribution.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entity

The paper rests on domain assumptions about the need for human oversight in high-stakes AI and the impracticality of real-time interaction in HPC, with the framework itself as the primary invented contribution rather than any fitted parameters.

axioms (2)
  • domain assumption: Human involvement is critical in training and deploying AI systems in high-stakes defence and security contexts.
    Stated as the opening motivation in the abstract.
  • domain assumption: Real-time interaction is impractical in HPC environments due to compute intensity and resource constraints.
    Core premise used to justify the asynchronous design.
invented entities (1)
  • Workflow framework with defined checkpoints for asynchronous human input (no independent evidence)
    purpose: To enable non-blocking human supervision while preventing idle HPC resources.
    This is the central new system introduced by the authors.

pith-pipeline@v0.9.0 · 5462 in / 1371 out tokens · 77699 ms · 2026-05-07T14:11:13.946584+00:00 · methodology

