From Specification to Execution: AI Assisted Scientific Workflow Management

Anirban Mandal; Ewa Deelman; Hamza Safri; Komal Thareja; Rajiv Mayani

arxiv: 2606.18425 · v1 · pith:3M32MB7Anew · submitted 2026-06-16 · 💻 cs.SE · cs.AI · cs.DC


Komal Thareja, Hamza Safri, Rajiv Mayani, Anirban Mandal, Ewa Deelman

Pith reviewed 2026-06-26 23:21 UTC · model grok-4.3

classification 💻 cs.SE · cs.AI · cs.DC

keywords scientific workflow management · AI-assisted workflow generation · LLM debugging agent · structured specification · federated learning workflow · Pegasus workflow system · distributed execution


The pith

An AI system generates and executes large scientific workflows from natural language by separating intent, design, and implementation before code creation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows how to move from natural language descriptions to running scientific workflows by inserting a structured specification step that validates the user's goal, the design choices, and the implementation details before any code is written. An LLM-based agent then handles debugging across workflow logic, execution environment, and system layers, while a protocol layer connects the whole process to the Pegasus workflow manager for distributed runs. In a test with a federated learning pipeline for medical imaging, the system produced and ran workflows containing thousands of jobs, cut the time spent fixing errors, and let users without workflow expertise apply advanced patterns that experts normally use. The result is presented as evidence that the full cycle of workflow creation, correction, and execution can be assisted by AI rather than performed entirely by hand.

Core claim

The method introduces a structured specification phase that separates workflow intent, design, and implementation, allowing validation prior to code generation, together with an LLM-based debugging agent that diagnoses and resolves failures across multiple system layers and an integration of Pegasus with a Model Context Protocol layer, enabling non-expert users to generate and execute workflows with thousands of jobs that follow expert-level patterns.

What carries the argument

The structured specification phase that separates intent, design, and implementation before code generation, paired with the LLM-based debugging agent.
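The three-layer separation can be pictured with a small data model. The sketch below is illustrative only: the class names (`Intent`, `Design`, `Implementation`, `WorkflowSpec`), their fields, and the checks in `validate` are assumptions made for this example, not the paper's specification format. It shows the general idea of catching inconsistencies before any workflow code is generated.

```python
from dataclasses import dataclass


@dataclass
class Intent:
    goal: str                       # what the user wants, in plain language


@dataclass
class Design:
    steps: dict[str, list[str]]     # step name -> names of the steps it depends on


@dataclass
class Implementation:
    commands: dict[str, str]        # step name -> command that runs the step


@dataclass
class WorkflowSpec:
    intent: Intent
    design: Design
    implementation: Implementation

    def validate(self) -> list[str]:
        """Return a list of problems; an empty list means the layers agree."""
        problems = []
        if not self.intent.goal.strip():
            problems.append("intent: goal is empty")
        for step, deps in self.design.steps.items():
            for dep in deps:
                if dep not in self.design.steps:
                    problems.append(f"design: {step} depends on unknown step {dep}")
            if step not in self.implementation.commands:
                problems.append(f"implementation: no command for step {step}")
        return problems


# Hypothetical spec with two deliberate mistakes: a misspelled dependency
# ("aggregat") and a step ("evaluate") that has no command.
spec = WorkflowSpec(
    intent=Intent(goal="Train a classifier across three sites"),
    design=Design(steps={"train": [], "aggregate": ["train"], "evaluate": ["aggregat"]}),
    implementation=Implementation(commands={"train": "train.py", "aggregate": "aggregate.py"}),
)
problems = spec.validate()
for problem in problems:
    print(problem)
```

Code generation would proceed only when `validate` returns an empty list, which is the point of placing the specification step first.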

If this is right

Workflows containing thousands of jobs can be generated and executed from natural language specifications after validation.
Debugging effort decreases because an automated agent addresses failures at multiple layers.
Users without prior workflow expertise can apply advanced design patterns that experts normally employ.
Integration with an existing workflow manager supports distributed execution and user monitoring through a single interface.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same separation of specification stages could be applied to workflow systems other than Pegasus.
The debugging agent might be tested on failure modes drawn from domains beyond medical imaging.
Repeated use of the structured specification could reduce the need for workflow experts in new scientific projects over time.

Load-bearing premise

The LLM-based debugging agent can reliably diagnose and resolve failures across workflow, execution, and system layers without introducing new errors or requiring human oversight.

What would settle it

Running the system on a new workflow type where the debugging agent encounters an unseen failure and either leaves the workflow broken or adds further errors without any human correction.
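One way to make this test concrete is a bounded repair loop that records every error it sees and gives up after a fixed number of fixes, at which point a human would have to step in. The sketch below is an assumption-laden illustration: `repair_loop`, `run`, and `apply_fix` are hypothetical names, and `apply_fix` merely stands in for whatever the paper's debugging agent does.

```python
from typing import Callable, List, Optional, Tuple


def repair_loop(
    run: Callable[[], Optional[str]],   # returns an error message, or None on success
    apply_fix: Callable[[str], None],   # stand-in for the agent acting on an error
    max_attempts: int = 3,
) -> Tuple[bool, List[str]]:
    """Run, fix, and re-run until success or until max_attempts fixes were tried.

    Returns (succeeded, errors_seen). A False result means the fix budget
    ran out and the workflow is still failing.
    """
    errors: List[str] = []
    for attempt in range(max_attempts + 1):
        error = run()
        if error is None:
            return True, errors
        errors.append(error)
        if attempt < max_attempts:
            apply_fix(error)
    return False, errors


# Simulated workflow with two queued failures (illustrative strings only).
pending = ["missing input file", "container image not found"]


def run() -> Optional[str]:
    return pending[0] if pending else None


def apply_fix(error: str) -> None:
    pending.remove(error)


ok, seen = repair_loop(run, apply_fix, max_attempts=3)
print(ok, seen)
```

Logging `errors_seen` per run also gives a simple way to spot errors that first appear after a fix, which is the failure mode the premise above rules out.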

Figures

Figures reproduced from arXiv: 2606.18425 by Anirban Mandal, Ewa Deelman, Hamza Safri, Komal Thareja, Rajiv Mayani.

**Figure 2.** Integrated architecture combining AI-assisted workflow

**Figure 3.** Single-round federated learning workflow showing fan

**Figure 4.** Generated top-level workflow DAG for a single dataset

**Figure 5.** Convergence comparison: FedAvg (E1) vs. FedProx

Abstract

Scientific workflow management systems (WMS) support scalable and reproducible execution of complex pipelines, but workflow design, implementation, and debugging remain largely manual and require significant expertise. Recent approaches using large language models (LLMs) show promise for workflow generation from natural language, but often rely on direct code synthesis, which limits transparency, reproducibility, and integration with workflow systems. We present an AI-assisted approach to scientific workflow management that combines specification-driven workflow generation, automated debugging, and distributed execution. The method introduces a structured specification phase that separates workflow intent, design, and implementation, allowing validation prior to code generation. We also develop an LLM-based debugging agent that diagnoses and resolves failures across multiple system layers. To support distributed execution and user interaction, we integrate Pegasus, a widely used WMS, with a Model Context Protocol (MCP) layer, providing a unified interface for workflow submission, monitoring, and control. We evaluate the approach using a federated learning workflow for medical imaging, chosen for its parallel, iterative, and dependency-intensive structure. The system generated and executed large-scale workflows with thousands of jobs, reduced debugging effort, and allowed non-expert users to construct workflows with expert-level design patterns. These results indicate that end-to-end AI-assisted workflow generation and execution is feasible, and point toward AI-driven platforms for managing the scientific workflow lifecycle.
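The "parallel, iterative, and dependency-intensive" structure mentioned in the abstract can be sketched as a fan-out/fan-in DAG repeated once per round. The following is a minimal sketch using only the Python standard library; `build_fl_dag`, the job names, and the client and round counts are hypothetical and do not come from the paper or from the Pegasus API.

```python
from graphlib import TopologicalSorter


def build_fl_dag(num_clients: int, num_rounds: int) -> dict[str, set[str]]:
    """Map each job name to the set of jobs it must wait for."""
    dag: dict[str, set[str]] = {"init_model": set()}
    previous_global = "init_model"
    for r in range(num_rounds):
        train_jobs = [f"train_r{r}_c{c}" for c in range(num_clients)]
        for job in train_jobs:
            dag[job] = {previous_global}      # fan-out from the current global model
        aggregate = f"aggregate_r{r}"
        dag[aggregate] = set(train_jobs)      # fan-in to a single aggregation job
        previous_global = aggregate           # next round starts from this result
    return dag


dag = build_fl_dag(num_clients=4, num_rounds=3)
order = list(TopologicalSorter(dag).static_order())
print(len(dag), "jobs; first:", order[0], "last:", order[-1])
```

Under this layout the job count is `1 + rounds * (clients + 1)`, so 100 clients over 20 rounds would give 2,021 jobs. These figures are illustrative rather than the paper's configuration.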

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The paper describes a structured-spec plus LLM-debugger plus Pegasus integration but supplies zero metrics or baselines for its core claims about reduced effort and agent reliability.


Colleague,

The main thing to know is that this work puts together a structured specification layer (intent, design, implementation), an LLM debugging agent that is supposed to fix failures across workflow/execution/system layers, and an MCP interface to Pegasus. They ran it on a federated-learning medical-imaging workflow with thousands of jobs and say it let non-experts produce expert patterns while cutting debugging effort. That combination is the concrete contribution.

The structured specification step is a sensible move; it gives a place to validate before code is generated and should improve transparency over direct LLM code synthesis. Linking to Pegasus rather than building a new executor is also practical for anyone already using that system.

The soft spot is the debugging agent and the evaluation. The abstract asserts that the agent diagnoses and resolves issues without new errors or human oversight, yet there are no success rates, no autonomous-versus-intervention counts, no failure-injection tests, and no person-hour or iteration comparison against a manual baseline. The claim that the system “reduced debugging effort” therefore sits on an unevidenced assertion. Without those numbers the feasibility conclusion does not land.

This is for people who build or use scientific workflow tools and want to explore LLM assistance. A reader looking for architecture ideas or integration patterns could extract something useful; anyone expecting evidence that the AI part actually works at scale will come away empty.

I would send it to peer review because the topic matters and the integration is real, but the authors need to add quantitative evaluation of the agent before the central claims can be taken seriously.

Referee Report

2 major / 1 minor

Summary. The manuscript describes an AI-assisted system for scientific workflow management that integrates specification-driven workflow generation using LLMs, an automated debugging agent, and execution via the Pegasus workflow management system through a Model Context Protocol layer. The approach is evaluated on a federated learning workflow for medical imaging, with the authors claiming successful generation and execution of large-scale workflows involving thousands of jobs, reduced debugging effort, and the ability for non-expert users to employ expert-level design patterns, concluding that end-to-end AI-assisted workflow management is feasible.

Significance. If the central claims were supported by quantitative evidence, the work could have moderate significance for scientific computing and software engineering by reducing the expertise required for complex workflow design and debugging while leveraging an established WMS. The structured specification phase and MCP integration are concrete strengths that promote transparency and reproducibility. No machine-checked proofs, parameter-free derivations, or reproducible artifacts are referenced.

major comments (2)

[Abstract] Abstract: The assertion that the system 'reduced debugging effort' and 'allowed non-expert users to construct workflows with expert-level design patterns' supplies no quantitative metrics (e.g., iteration counts, autonomous resolution rate, person-hours saved, or baseline comparison) for the federated-learning workflow of thousands of jobs. This is load-bearing for the feasibility conclusion.
[Abstract] Abstract (debugging agent paragraph): The LLM-based debugging agent is described as diagnosing and resolving failures across workflow/execution/system layers without new errors or human oversight, yet no success rates, failure-injection protocol, or error analysis are reported. This directly undermines the reduced-effort claim.
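The metrics requested here are straightforward to compute once each debugging episode is logged. The sketch below assumes a hypothetical `Incident` record and invented sample data; it is meant only to show what an "autonomous resolution rate" and a mean iteration count would look like in practice, not to reproduce anything the paper measured.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Incident:
    layer: str               # "workflow", "execution", or "system"
    agent_iterations: int    # debugging attempts the agent made
    human_intervened: bool   # True if a person had to step in
    resolved: bool           # True if the failure was eventually fixed


def summarize(incidents: list[Incident]) -> dict[str, float]:
    """Compute the headline numbers a reader would need to assess the agent."""
    if not incidents:
        raise ValueError("no incidents to summarize")
    autonomous = [i for i in incidents if i.resolved and not i.human_intervened]
    return {
        "autonomous_resolution_rate": len(autonomous) / len(incidents),
        "mean_iterations": mean(i.agent_iterations for i in incidents),
    }


# Invented sample log, for illustration only.
incidents = [
    Incident("workflow", 1, False, True),
    Incident("execution", 3, True, True),
    Incident("system", 2, False, True),
    Incident("system", 2, False, False),
]
summary = summarize(incidents)
print(summary)
```

Grouping the same records by `layer` would show whether the agent is weaker on system-level failures than on workflow-level ones, and comparing against a manually debugged run would supply the missing baseline.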

minor comments (1)

[Abstract] The Model Context Protocol (MCP) is introduced without a citation or brief definition on first use.
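For context, MCP messages are JSON-RPC 2.0 requests, and a tool invocation uses the `tools/call` method. The sketch below builds such a request in Python; the tool name `submit_workflow` and its arguments are hypothetical placeholders, not the interface exposed by the paper's Pegasus integration.

```python
import json


def make_tool_call(request_id: int, tool: str, arguments: dict) -> str:
    """Serialize an MCP-style JSON-RPC 2.0 request that invokes one tool."""
    request = {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool, "arguments": arguments},
    }
    return json.dumps(request, indent=2)


# Hypothetical tool name and arguments, for illustration only.
message = make_tool_call(1, "submit_workflow", {"workflow_dir": "fl-workflow"})
print(message)
```

A one-line definition of this kind in the abstract, plus a citation to the protocol specification, would address the comment.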

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for highlighting the need for quantitative support in the abstract. The evaluation is a case study demonstration rather than a controlled experiment, so we will revise the abstract to qualify the claims accordingly.


Referee: [Abstract] Abstract: The assertion that the system 'reduced debugging effort' and 'allowed non-expert users to construct workflows with expert-level design patterns' supplies no quantitative metrics (e.g., iteration counts, autonomous resolution rate, person-hours saved, or baseline comparison) for the federated-learning workflow of thousands of jobs. This is load-bearing for the feasibility conclusion.

Authors: We agree the claims require qualification. The manuscript reports a successful end-to-end demonstration on the federated-learning workflow but does not include controlled user studies or baseline comparisons. We will revise the abstract to state that the approach enabled construction and execution of the workflow with the debugging agent handling encountered issues, based on the case study, without asserting quantitative reductions in effort. Revision: yes.

Referee: [Abstract] Abstract (debugging agent paragraph): The LLM-based debugging agent is described as diagnosing and resolving failures across workflow/execution/system layers without new errors or human oversight, yet no success rates, failure-injection protocol, or error analysis are reported. This directly undermines the reduced-effort claim.

Authors: The abstract description reflects the observed behavior during the reported workflow execution, where the agent resolved issues without introducing new errors. No systematic failure-injection experiments or success-rate statistics were performed. We will revise the abstract to describe the agent's role more precisely as having been applied successfully in this instance, removing the implication of general autonomous resolution rates. Revision: yes.

Circularity Check

0 steps flagged

No circularity: paper contains no derivations, equations, or self-referential claims


The manuscript is a systems/engineering description of an AI-assisted workflow platform. It introduces a specification phase, an LLM debugging agent, and Pegasus integration, then reports an empirical evaluation on a federated-learning workflow. No equations, fitted parameters, predictions derived from inputs, or load-bearing self-citations appear. The central feasibility claim rests on observed execution of thousands of jobs and qualitative reduction in debugging effort, not on any chain that reduces to its own inputs by construction. This is the normal non-circular outcome for a descriptive implementation paper.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review supplies no information on free parameters, axioms, or invented entities.

pith-pipeline@v0.9.1-grok · 5784 in / 1059 out tokens · 20742 ms · 2026-06-26T23:21:11.197793+00:00 · methodology


