From Research Question to Scientific Workflow: Leveraging Agentic AI for Science Automation

Bartosz Balis; Michal Dygas; Michal Kuszewski; Michal Orzechowski; Piotr Kica

arxiv: 2604.21910 · v1 · submitted 2026-04-23 · 💻 cs.AI

Pith reviewed 2026-05-09 21:21 UTC · model grok-4.3

classification 💻 cs.AI

keywords agentic AI · scientific workflows · intent extraction · workflow automation · reproducible DAGs · Skills documents · LLM decomposition · Kubernetes execution

The pith

Domain experts author Skills documents to confine LLM non-determinism to intent extraction so identical research questions produce identical scientific workflows.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows how to automate the step where scientists turn natural language questions into executable workflow specifications. It splits the process into an LLM layer that extracts structured intents from text, a deterministic layer that builds the actual workflow DAGs, and a knowledge layer where experts write Skills as markdown files. Those Skills supply vocabulary mappings, constraints, and strategies that guide the LLM and enable reproducible outputs. The approach matters because it removes the manual translation burden while keeping the final workflows under expert control and free of LLM randomness once the intent is fixed.

Core claim

The architecture decomposes the translation task into three layers so that LLM non-determinism is restricted to intent extraction: the LLM converts varied natural language into a fixed structured intent, validated generators then always produce the same workflow DAG from that intent, and domain experts maintain the mapping rules inside Skills documents. When tested on the 1000 Genomes population genetics workflow running under Hyperflow on Kubernetes, Skills raise full-match intent accuracy from 44 percent to 83 percent, skill-guided deferred workflow generation cuts data transfer by 92 percent, and the full pipeline runs with LLM overhead below 15 seconds and cost under 0.001 dollars per query.
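
The decomposition can be illustrated with a minimal Python sketch. Everything named here (the `Intent` fields, `build_dag`, `dag_fingerprint`, the task names) is hypothetical and not taken from the paper. The sketch only shows why a pure function from a fixed intent to a DAG gives the same workflow however the original question was phrased.

```python
import hashlib
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class Intent:
    """Hypothetical structured intent; the paper's real schema may differ."""
    populations: tuple[str, ...]
    chromosomes: tuple[int, ...]
    analysis: str


def build_dag(intent: Intent) -> dict:
    """Deterministic layer: a pure function from intent to a workflow DAG.

    No LLM call, randomness, or clock is involved, so equal intents
    always produce equal DAGs. Task names are illustrative only.
    """
    tasks, edges = [], []
    for chrom in sorted(intent.chromosomes):
        for pop in sorted(intent.populations):
            extract = f"extract_chr{chrom}_{pop}"
            analyze = f"{intent.analysis}_chr{chrom}_{pop}"
            tasks += [extract, analyze]
            edges.append((extract, analyze))
    return {"tasks": tasks, "edges": edges}


def dag_fingerprint(dag: dict) -> str:
    """Stable hash of a DAG, useful for checking reproducibility."""
    canonical = json.dumps(dag, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Two differently worded questions that the semantic layer is assumed
# to have mapped to the same structured intent.
intent_a = Intent(populations=("GBR", "FIN"), chromosomes=(1, 2), analysis="frequency")
intent_b = Intent(populations=("GBR", "FIN"), chromosomes=(1, 2), analysis="frequency")

same = dag_fingerprint(build_dag(intent_a)) == dag_fingerprint(build_dag(intent_b))
print(same)
```

Comparing fingerprints rather than whole DAGs is a design choice of this sketch: it gives a cheap reproducibility check that could be logged alongside each run.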

What carries the argument

Skills: markdown documents that encode vocabulary mappings, parameter constraints, and optimization strategies to direct intent extraction and enable deterministic DAG generation from fixed intents.
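
To make the idea of a vocabulary mapping concrete, here is a small Python sketch. The Skill layout (a `## Vocabulary` section with `term -> code` bullets) and the helpers `parse_vocabulary` and `normalize_terms` are assumptions made for illustration. The paper describes Skills as markdown guidance for the LLM, and its actual document format may look quite different.

```python
import re

# Hypothetical Skill document; the section name and bullet syntax are
# assumptions made for this sketch, not the paper's actual format.
SKILL_MD = """\
# Skill: population vocabulary

## Vocabulary
- british -> GBR
- finnish -> FIN
- yoruba -> YRI

## Constraints
- chromosomes: 1-22
"""


def parse_vocabulary(skill_text: str) -> dict[str, str]:
    """Collect 'term -> code' bullets found under the '## Vocabulary' heading."""
    mapping: dict[str, str] = {}
    in_vocab = False
    for line in skill_text.splitlines():
        if line.startswith("## "):
            in_vocab = line.strip() == "## Vocabulary"
            continue
        match = re.match(r"-\s*(.+?)\s*->\s*(\S+)\s*$", line)
        if in_vocab and match:
            mapping[match.group(1).lower()] = match.group(2)
    return mapping


def normalize_terms(terms: list[str], vocab: dict[str, str]) -> list[str]:
    """Map free-text terms to canonical codes; unknown terms raise an error."""
    unknown = [t for t in terms if t.lower() not in vocab]
    if unknown:
        raise ValueError(f"terms not covered by the Skill: {unknown}")
    return [vocab[t.lower()] for t in terms]


vocab = parse_vocabulary(SKILL_MD)
print(normalize_terms(["British", "Finnish"], vocab))
```

Raising on unknown terms, rather than guessing, mirrors the general goal of keeping expert-authored mappings authoritative; whether the paper's system behaves this way is not stated in the source.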

If this is right

Identical intents always produce identical workflows even if the LLM varies its wording.
Full-match intent accuracy rises from 44 percent to 83 percent once Skills are supplied.
Deferred workflow generation driven by Skills reduces data movement by 92 percent.
The complete pipeline finishes queries on Kubernetes with under 15 seconds of LLM overhead and a cost under 0.001 dollars per query.
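
The accuracy figure depends on what counts as a full match, which this page does not define. One plausible reading, assumed here, is that a predicted intent counts only if every field equals the reference. The function name, the intent fields, and the toy data below are all invented for illustration and are not the paper's evaluation set.

```python
def full_match_accuracy(predicted: list[dict], expected: list[dict]) -> float:
    """Fraction of queries whose predicted intent equals the reference in every field."""
    if len(predicted) != len(expected):
        raise ValueError("predicted and expected must have the same length")
    if not expected:
        raise ValueError("at least one query is required")
    matches = sum(1 for p, e in zip(predicted, expected) if p == e)
    return matches / len(expected)


# Toy data, invented for illustration; these are not the paper's queries.
expected = [
    {"populations": ["GBR"], "chromosomes": [1], "analysis": "frequency"},
    {"populations": ["FIN"], "chromosomes": [2], "analysis": "frequency"},
    {"populations": ["YRI"], "chromosomes": [3], "analysis": "overlap"},
    {"populations": ["GBR", "FIN"], "chromosomes": [1, 2], "analysis": "overlap"},
]
predicted = [
    expected[0],
    {"populations": ["FIN"], "chromosomes": [22], "analysis": "frequency"},  # one wrong field
    expected[2],
    expected[3],
]

accuracy = full_match_accuracy(predicted, expected)
print(f"{accuracy:.2f}")  # 3 of 4 intents match in full
```

Under this strict definition a single wrong field fails the whole query, which is why a full-match score can sit well below any per-field accuracy.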

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The separation of concerns could let the same Skills library serve multiple workflow engines beyond Hyperflow.
If Skills can be version-controlled like code, teams might share and reuse domain knowledge across institutions.
Extending the intent schema to include conditional branches or data provenance hints could handle more complex research questions without increasing LLM variability.

Load-bearing premise

Domain experts can write and keep up-to-date Skills documents that correctly capture the needed vocabulary, constraints, and strategies, while the LLM can reliably turn varied natural language into accurate structured intents.

What would settle it

Apply the system to a new domain without pre-written Skills and measure whether full-match intent accuracy falls back toward 44 percent, or whether the generated workflows diverge from versions written manually by experts.

Figures

Figures reproduced from arXiv: 2604.21910 by Bartosz Balis, Michal Dygas, Michal Kuszewski, Michal Orzechowski, Piotr Kica.

**Figure 1.** Component architecture. The Conductor orchestrates three specialized agents. The Workflow Composer (semantic layer) consults domain Skills (knowledge layer) to produce workflow plans that include data preparation commands. The Deployment Service and Execution Sentinel (deterministic layer) execute these plans on the Kubernetes infrastructure running the HyperFlow engine.

**Figure 2.** Sequence diagram of the agentic pipeline. Five actors (User, Conductor, Workflow Composer, Deployment Service, and Kubernetes cluster) interact across six phases. The Execution Sentinel runs asynchronously after workflow submission and is omitted for compactness.

Original abstract

Scientific workflow systems automate execution (scheduling, fault tolerance, resource management) but not the semantic translation that precedes it. Scientists still manually convert research questions into workflow specifications, a task requiring both domain knowledge and infrastructure expertise. We propose an agentic architecture that closes this gap through three layers: an LLM interprets natural language into structured intents (semantic layer); validated generators produce reproducible workflow DAGs (deterministic layer); and domain experts author "Skills": markdown documents encoding vocabulary mappings, parameter constraints, and optimization strategies (knowledge layer). This decomposition confines LLM non-determinism to intent extraction: identical intents always yield identical workflows. We implement and evaluate the architecture on the 1000 Genomes population genetics workflow and Hyperflow WMS running on Kubernetes. In an ablation study on 150 queries, Skills raise full-match intent accuracy from 44% to 83%; skill-driven deferred workflow generation reduces data transfer by 92%; and the end-to-end pipeline completes queries on Kubernetes with LLM overhead below 15 seconds and cost under $0.001 per query.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

This approach stabilizes LLM workflow generation by confining variability to intent extraction via expert Skills, delivering measurable accuracy and efficiency gains on a real genetics workflow while leaving the practicality of Skill creation untested.

The main takeaway is that this work confines the unpredictable part of using an LLM to a single step: turning a research question into a structured intent. Markdown Skills written by experts guide that step, and deterministic generators then build the workflow, so the same intent gives the same output every time.

On the 1000 Genomes population genetics workflow they ran an ablation with 150 queries. Skills boosted full-match intent accuracy from 44% to 83%. The skill-driven deferred generation cut data transfer by 92%. End-to-end, queries completed on Kubernetes with LLM overhead under 15 seconds and cost below $0.001 each. They built it on top of Hyperflow and showed it works in practice, so the numbers are tied to a real system rather than synthetic tests.

The weak point is the Skills layer itself. The paper gives no measurements of how long it takes domain experts to write or update these documents, how consistent the documents are across experts, or whether the accuracy holds for queries outside the 150 tested ones or in other scientific domains. Without that, it's hard to know whether the approach generalizes beyond this one workflow.

This is aimed at researchers and developers working on scientific workflow management and AI-assisted automation in data-intensive fields. It offers a practical decomposition and some empirical results on efficiency. I'd bring it to a reading group to talk through the Skills design and the ablation setup. It deserves serious peer review because the core architecture is implemented and evaluated on actual infrastructure with clear metrics, even if additional work on Skill authoring effort would make the claims stronger.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes an agentic architecture to automate the semantic translation from natural language research questions to executable scientific workflows. It decomposes the process into an LLM semantic layer for extracting structured intents, a deterministic layer of validated generators that produce reproducible workflow DAGs, and a knowledge layer where domain experts author 'Skills' as markdown documents encoding vocabulary mappings, parameter constraints, and optimization strategies. This confines LLM non-determinism to intent extraction so that identical intents always yield identical workflows. The system is implemented and evaluated on the 1000 Genomes population genetics workflow running under Hyperflow on Kubernetes. An ablation on 150 queries reports that Skills raise full-match intent accuracy from 44% to 83%, skill-driven deferred generation reduces data transfer by 92%, and end-to-end queries incur LLM overhead below 15 seconds and cost under $0.001.

Significance. If the reported accuracy lift, data-transfer savings, and low overhead hold under broader conditions, the work offers a practical path to reducing the manual burden of workflow specification while preserving reproducibility. The explicit separation of non-deterministic intent extraction from deterministic generation is a clean architectural contribution. Concrete measurements on a production-grade workflow and Kubernetes deployment strengthen the practicality claim. The approach could influence both workflow systems and LLM-augmented scientific tooling, provided the Skills authoring process scales.

major comments (2)

Abstract and Evaluation section: The ablation claims that Skills raise full-match intent accuracy from 44% to 83% on 150 queries, yet no description is given of how the 150 queries were generated or sampled, what constitutes a 'full match', the process by which domain experts created or validated the Skills documents, or any inter-expert consistency metrics. Because the 39-point improvement is presented as central evidence for the architecture's value, the absence of these details prevents assessment of reproducibility and generalizability beyond the single 1000 Genomes case.
Abstract and §3 (Skills layer): The architecture's primary innovation rests on the premise that domain experts can author and maintain effective Skills documents that correctly encode vocabulary, constraints, and strategies. No measurements of authoring time, revision effort, or accuracy on queries outside the tested set are reported. This assumption is load-bearing for the broader claim of automating semantic translation for scientists; without supporting data the empirical results on one workflow cannot be extrapolated.

minor comments (2)

The abstract refers to 'validated generators' without clarifying the validation mechanism or providing pseudocode; a short description or figure in the methods section would improve clarity.
Related-work discussion would benefit from explicit comparison to existing workflow systems (e.g., Pegasus, Kepler) and prior LLM-assisted workflow generation efforts to better situate the contribution.

Simulated Author's Rebuttal

2 responses · 3 unresolved

We thank the referee for the constructive comments and the recommendation for major revision. Below we provide point-by-point responses to the major comments, indicating the revisions we will make to address the concerns raised.

Referee: Abstract and Evaluation section: The ablation claims that Skills raise full-match intent accuracy from 44% to 83% on 150 queries, yet no description is given of how the 150 queries were generated or sampled, what constitutes a 'full match', the process by which domain experts created or validated the Skills documents, or any inter-expert consistency metrics. Because the 39-point improvement is presented as central evidence for the architecture's value, the absence of these details prevents assessment of reproducibility and generalizability beyond the single 1000 Genomes case.

Authors: We agree that the manuscript would benefit from greater transparency in the evaluation methodology to support reproducibility and assessment of generalizability. We will revise the Evaluation section to include details on how the 150 queries were generated and sampled, the definition of a 'full match', and the process by which the domain expert created the Skills documents. We did not measure inter-expert consistency, as the Skills were developed by a single expert; this will be acknowledged as a limitation in the revised manuscript.
revision: partial

Referee: Abstract and §3 (Skills layer): The architecture's primary innovation rests on the premise that domain experts can author and maintain effective Skills documents that correctly encode vocabulary, constraints, and strategies. No measurements of authoring time, revision effort, or accuracy on queries outside the tested set are reported. This assumption is load-bearing for the broader claim of automating semantic translation for scientists; without supporting data the empirical results on one workflow cannot be extrapolated.

Authors: We acknowledge that the absence of measurements regarding the effort required to author and maintain Skills documents, as well as their performance on unseen queries, restricts the strength of our claims about broader applicability. Since these data were not collected in the present study, we will add a dedicated paragraph in the Discussion section to discuss this as a key limitation and to describe planned future experiments that will quantify authoring time, revision effort, and accuracy on additional workflows and query sets.
revision: yes

standing simulated objections not resolved

Measurements of authoring time and revision effort for the Skills documents
Accuracy of the system on queries outside the tested 150-query set
Inter-expert consistency metrics for Skills validation

Circularity Check

0 steps flagged

No circularity; claims rest on implementation and direct empirical measurements

The paper describes a three-layer agentic architecture (LLM intent extraction, deterministic workflow generators, expert-authored Skills) and supports its claims via an ablation study on 150 queries for the 1000 Genomes workflow, reporting concrete accuracy gains (44% to 83%), data-transfer reductions (92%), and runtime/cost metrics. No equations, fitted parameters, or derivations are present that reduce to inputs by construction. No self-citations are invoked as load-bearing uniqueness theorems or ansatzes. The evaluation is self-contained against external benchmarks (Kubernetes execution, query set) rather than relying on self-referential logic or renamed known results.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entity

The central claims rest on assumptions about LLM reliability for intent extraction and the feasibility of expert-authored Skills rather than new mathematical derivations or physical laws.

axioms (2)

domain assumption: LLMs can extract structured intents from natural language research questions when guided by Skills.
This underpins the semantic layer and the reported accuracy improvements.
domain assumption: Deterministic generators can produce valid, reproducible workflow DAGs from structured intents.
Required for the claim that identical intents yield identical workflows.

invented entities (1)

Skills (no independent evidence)
purpose: Markdown documents that encode vocabulary mappings, parameter constraints, and optimization strategies for the knowledge layer.
New component introduced to guide LLM behavior and enable deferred workflow generation.

pith-pipeline@v0.9.0 · 5498 in / 1372 out tokens · 39266 ms · 2026-05-09T21:21:12.273242+00:00 · methodology
