
arxiv: 2604.21910 · v1 · submitted 2026-04-23 · 💻 cs.AI

From Research Question to Scientific Workflow: Leveraging Agentic AI for Science Automation

Pith reviewed 2026-05-09 21:21 UTC · model grok-4.3

classification 💻 cs.AI
keywords agentic AI, scientific workflows, intent extraction, workflow automation, reproducible DAGs, Skills documents, LLM decomposition, Kubernetes execution

The pith

Domain experts author Skills documents to confine LLM non-determinism to intent extraction so identical research questions produce identical scientific workflows.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows how to automate the step where scientists turn natural language questions into executable workflow specifications. It splits the process into an LLM layer that extracts structured intents from text, a deterministic layer that builds the actual workflow DAGs, and a knowledge layer where experts write Skills as markdown files. Those Skills supply vocabulary mappings, constraints, and strategies that guide the LLM and enable reproducible outputs. The approach matters because it removes the manual translation burden while keeping the final workflows under expert control and free of LLM randomness once the intent is fixed.

Core claim

The architecture decomposes the translation task into three layers so that LLM non-determinism is restricted to intent extraction: the LLM converts varied natural language into a fixed structured intent, validated generators then always produce the same workflow DAG from that intent, and domain experts maintain the mapping rules inside Skills documents. When tested on the 1000 Genomes population genetics workflow running under HyperFlow on Kubernetes, in an ablation over 150 queries the Skills raise full-match intent accuracy from 44 percent to 83 percent, skill-guided deferred workflow generation cuts data transfer by 92 percent, and the full pipeline runs with LLM overhead below 15 seconds and cost under $0.001 per query.
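The determinism claim can be illustrated with a minimal Python sketch. It is not the paper's generator: the `Intent` fields, the task names, and the `build_dag` and `dag_fingerprint` helpers are all hypothetical. It only shows the property the claim relies on, namely that a generator written as a pure function of the intent yields byte-identical DAGs whenever the intent is the same.

```python
import hashlib
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class Intent:
    """Hypothetical structured intent; the field names are illustrative."""
    analysis: str
    populations: tuple
    chromosomes: tuple


def build_dag(intent):
    """Build a DAG as a pure function of the intent (no randomness, no LLM call)."""
    tasks, edges, analysis_tasks = [], [], []
    # Sorting makes the output independent of the order the values arrived in.
    for chrom in sorted(set(intent.chromosomes)):
        for pop in sorted(set(intent.populations)):
            extract = f"extract_chr{chrom}_{pop}"
            analyze = f"{intent.analysis}_chr{chrom}_{pop}"
            tasks += [extract, analyze]
            edges.append([extract, analyze])
            analysis_tasks.append(analyze)
    tasks.append("merge_results")
    edges += [[task, "merge_results"] for task in analysis_tasks]
    return {"tasks": tasks, "edges": edges}


def dag_fingerprint(dag):
    """Hash a canonical JSON encoding so two DAGs can be compared exactly."""
    canonical = json.dumps(dag, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Two intents with the same content but different value order.
first = Intent("frequency", ("GBR", "FIN"), (2, 1))
second = Intent("frequency", ("FIN", "GBR"), (1, 2))
print(dag_fingerprint(build_dag(first)) == dag_fingerprint(build_dag(second)))
```

Comparing fingerprints rather than DAG objects is a design choice for the sketch: a hash can be logged next to each run, which makes a reproducibility check cheap to audit later.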

What carries the argument

Skills: markdown documents that encode vocabulary mappings, parameter constraints, and optimization strategies to direct intent extraction and enable deterministic DAG generation from fixed intents.
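The page does not reproduce an actual Skills file, so the following Python sketch assumes a simple layout: a markdown document with `## Vocabulary` and `## Constraints` sections made of `- key: value` bullets. The layout, the `parse_skill` helper, and the example entries are assumptions made for illustration, not the paper's format. The sketch shows how a vocabulary mapping and a parameter constraint could be read from such a document and applied to extracted terms.

```python
# A hypothetical Skills document; the real format is not shown in the source.
SKILL_MD = """\
# Skill: population-genetics (hypothetical example)

## Vocabulary
- british: GBR
- finnish: FIN

## Constraints
- chromosome_min: 1
- chromosome_max: 22
"""


def parse_skill(text):
    """Collect '- key: value' bullets under each '## Section' heading."""
    sections = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("## "):
            current = line[3:].strip().lower()
            sections[current] = {}
        elif line.startswith("- ") and current is not None and ":" in line:
            key, value = line[2:].split(":", 1)
            sections[current][key.strip().lower()] = value.strip()
    return sections


def normalize_population(term, skill):
    """Map a free-text term to its canonical code, or None if unknown."""
    return skill["vocabulary"].get(term.strip().lower())


def chromosome_allowed(chrom, skill):
    """Check a chromosome number against the range the Skill declares."""
    limits = skill["constraints"]
    return int(limits["chromosome_min"]) <= chrom <= int(limits["chromosome_max"])


skill = parse_skill(SKILL_MD)
print(normalize_population("British", skill), chromosome_allowed(23, skill))
```

Returning `None` for an unknown term, rather than guessing, leaves the decision to reject or ask for clarification with the caller.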

If this is right

  • Identical intents always produce identical workflows, however the original question was worded.
  • Full-match intent accuracy rises from 44 percent to 83 percent in the 150-query ablation once Skills are supplied.
  • Deferred workflow generation driven by Skills reduces data transfer by 92 percent.
  • The complete pipeline finishes queries on Kubernetes with under 15 seconds of LLM overhead and a cost under $0.001 per query.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The separation of concerns could let the same Skills library serve multiple workflow engines beyond HyperFlow.
  • If Skills can be version-controlled like code, teams might share and reuse domain knowledge across institutions.
  • Extending the intent schema to include conditional branches or data provenance hints could handle more complex research questions without increasing LLM variability.

Load-bearing premise

Domain experts can write and keep up-to-date Skills documents that correctly capture the needed vocabulary, constraints, and strategies, while the LLM can reliably turn varied natural language into accurate structured intents.

What would settle it

Apply the system to a new domain without pre-written Skills, then measure whether full-match intent accuracy falls back to around 44 percent and whether the generated workflows diverge from expert-written manual versions.

Figures

Figures reproduced from arXiv: 2604.21910 by Bartosz Balis, Michal Dygas, Michal Kuszewski, Michal Orzechowski, Piotr Kica.

Figure 1: Component architecture. The Conductor orchestrates three specialized agents. The Workflow Composer (semantic layer) consults domain Skills (knowledge layer) to produce workflow plans that include data preparation commands. The Deployment Service and Execution Sentinel (deterministic layer) execute these plans on the Kubernetes infrastructure running the HyperFlow engine.
Figure 2: Sequence diagram of the agentic pipeline. Five actors (User, Conductor, Workflow Composer, Deployment Service, and Kubernetes cluster) interact across six phases. The Execution Sentinel runs asynchronously after workflow submission and is omitted for compactness.
Original abstract

Scientific workflow systems automate execution (scheduling, fault tolerance, resource management) but not the semantic translation that precedes it. Scientists still manually convert research questions into workflow specifications, a task requiring both domain knowledge and infrastructure expertise. We propose an agentic architecture that closes this gap through three layers: an LLM interprets natural language into structured intents (semantic layer); validated generators produce reproducible workflow DAGs (deterministic layer); and domain experts author "Skills": markdown documents encoding vocabulary mappings, parameter constraints, and optimization strategies (knowledge layer). This decomposition confines LLM non-determinism to intent extraction: identical intents always yield identical workflows. We implement and evaluate the architecture on the 1000 Genomes population genetics workflow and Hyperflow WMS running on Kubernetes. In an ablation study on 150 queries, Skills raise full-match intent accuracy from 44% to 83%; skill-driven deferred workflow generation reduces data transfer by 92%; and the end-to-end pipeline completes queries on Kubernetes with LLM overhead below 15 seconds and cost under $0.001 per query.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes an agentic architecture to automate the semantic translation from natural language research questions to executable scientific workflows. It decomposes the process into an LLM semantic layer for extracting structured intents, a deterministic layer of validated generators that produce reproducible workflow DAGs, and a knowledge layer where domain experts author 'Skills' as markdown documents encoding vocabulary mappings, parameter constraints, and optimization strategies. This confines LLM non-determinism to intent extraction so that identical intents always yield identical workflows. The system is implemented and evaluated on the 1000 Genomes population genetics workflow running under HyperFlow on Kubernetes. An ablation on 150 queries reports that Skills raise full-match intent accuracy from 44% to 83%, skill-driven deferred generation reduces data transfer by 92%, and end-to-end queries incur LLM overhead below 15 seconds and cost under $0.001 per query.

Significance. If the reported accuracy lift, data-transfer savings, and low overhead hold under broader conditions, the work offers a practical path to reducing the manual burden of workflow specification while preserving reproducibility. The explicit separation of non-deterministic intent extraction from deterministic generation is a clean architectural contribution. Concrete measurements on a production-grade workflow and Kubernetes deployment strengthen the practicality claim. The approach could influence both workflow systems and LLM-augmented scientific tooling, provided the Skills authoring process scales.

major comments (2)
  1. Abstract and Evaluation section: The ablation claims that Skills raise full-match intent accuracy from 44% to 83% on 150 queries, yet no description is given of how the 150 queries were generated or sampled, what constitutes a 'full match', the process by which domain experts created or validated the Skills documents, or any inter-expert consistency metrics. Because the 39-point improvement is presented as central evidence for the architecture's value, the absence of these details prevents assessment of reproducibility and generalizability beyond the single 1000 Genomes case.
  2. Abstract and §3 (Skills layer): The architecture's primary innovation rests on the premise that domain experts can author and maintain effective Skills documents that correctly encode vocabulary, constraints, and strategies. No measurements of authoring time, revision effort, or accuracy on queries outside the tested set are reported. This assumption is load-bearing for the broader claim of automating semantic translation for scientists; without supporting data the empirical results on one workflow cannot be extrapolated.
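Major comment 1 notes that 'full match' is left undefined. One plausible reading, sketched here in Python purely as an assumption, is that an extracted intent counts as a full match only when every field equals the expert reference. The intent fields and the toy data are invented for illustration and are not the paper's queries or results.

```python
def full_match(predicted, expected):
    """True only when all fields are present and equal, with no extras."""
    return predicted == expected


def full_match_accuracy(pairs):
    """Fraction of (predicted, expected) pairs that match on every field."""
    if not pairs:
        raise ValueError("need at least one query")
    return sum(full_match(p, e) for p, e in pairs) / len(pairs)


def per_field_accuracy(pairs):
    """Fraction of queries in which each expected field was extracted correctly."""
    fields = sorted({name for _, expected in pairs for name in expected})
    return {
        name: sum(p.get(name) == e.get(name) for p, e in pairs) / len(pairs)
        for name in fields
    }


# Toy data only: four made-up (predicted, expected) intent pairs.
reference_gbr = {"analysis": "frequency", "populations": ["GBR"]}
reference_fin = {"analysis": "frequency", "populations": ["FIN"]}
pairs = [
    ({"analysis": "frequency", "populations": ["GBR"]}, reference_gbr),
    ({"analysis": "frequency", "populations": ["FIN"]}, reference_gbr),
    ({"analysis": "frequency", "populations": ["GBR"]}, reference_gbr),
    ({"analysis": "mutation_overlap", "populations": ["FIN"]}, reference_fin),
]
print(full_match_accuracy(pairs), per_field_accuracy(pairs))
```

Under this reading the full-match score is stricter than any per-field score, since a single wrong field fails the whole query. Reporting both would help readers see which fields account for the errors.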
minor comments (2)
  1. The abstract refers to 'validated generators' without clarifying the validation mechanism or providing pseudocode; a short description or figure in the methods section would improve clarity.
  2. Related-work discussion would benefit from explicit comparison to existing workflow systems (e.g., Pegasus, Kepler) and prior LLM-assisted workflow generation efforts to better situate the contribution.
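Minor comment 1 asks what 'validated' means for the generators. The paper's mechanism is not described on this page, so the Python sketch below shows only one possible interpretation: a schema check that runs before any workflow is generated and rejects intents with missing fields, wrong types, or out-of-range values. The field names, the allowed analyses, and the chromosome range are assumptions.

```python
# Hypothetical schema; none of these names or limits come from the paper.
REQUIRED_FIELDS = {"analysis": str, "populations": list, "chromosomes": list}
ALLOWED_ANALYSES = {"frequency", "mutation_overlap"}


def validate_intent(intent):
    """Return a list of human-readable problems; an empty list means valid."""
    errors = []
    for name, expected_type in REQUIRED_FIELDS.items():
        if name not in intent:
            errors.append(f"missing field: {name}")
        elif not isinstance(intent[name], expected_type):
            errors.append(f"wrong type for field: {name}")
    for name in sorted(set(intent) - set(REQUIRED_FIELDS)):
        errors.append(f"unknown field: {name}")
    if errors:
        return errors  # Value checks below assume the structure is sound.
    if intent["analysis"] not in ALLOWED_ANALYSES:
        errors.append(f"unsupported analysis: {intent['analysis']}")
    if not intent["populations"]:
        errors.append("populations must not be empty")
    for chrom in intent["chromosomes"]:
        if not isinstance(chrom, int) or not 1 <= chrom <= 22:
            errors.append(f"chromosome out of range: {chrom}")
    return errors


def generate_if_valid(intent, generator):
    """Call the generator only for intents that pass validation."""
    errors = validate_intent(intent)
    if errors:
        raise ValueError("; ".join(errors))
    return generator(intent)


good = {"analysis": "frequency", "populations": ["GBR"], "chromosomes": [1, 2]}
bad = {"analysis": "frequency", "populations": [], "chromosomes": [0]}
print(validate_intent(good), validate_intent(bad))
```

Collecting all problems instead of stopping at the first one gives the user, or a retry loop around the LLM, a complete list to fix in one pass.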

Simulated Author's Rebuttal

2 responses · 3 unresolved

We thank the referee for the constructive comments and the recommendation for major revision. Below we provide point-by-point responses to the major comments, indicating the revisions we will make to address the concerns raised.

point-by-point responses
  1. Referee: Abstract and Evaluation section: The ablation claims that Skills raise full-match intent accuracy from 44% to 83% on 150 queries, yet no description is given of how the 150 queries were generated or sampled, what constitutes a 'full match', the process by which domain experts created or validated the Skills documents, or any inter-expert consistency metrics. Because the 39-point improvement is presented as central evidence for the architecture's value, the absence of these details prevents assessment of reproducibility and generalizability beyond the single 1000 Genomes case.

    Authors: We agree that the manuscript would benefit from greater transparency in the evaluation methodology to support reproducibility and assessment of generalizability. We will revise the Evaluation section to include details on how the 150 queries were generated and sampled, the definition of a 'full match', and the process by which the domain expert created the Skills documents. We did not measure inter-expert consistency, as the Skills were developed by a single expert; this will be acknowledged as a limitation in the revised manuscript. revision: partial

  2. Referee: Abstract and §3 (Skills layer): The architecture's primary innovation rests on the premise that domain experts can author and maintain effective Skills documents that correctly encode vocabulary, constraints, and strategies. No measurements of authoring time, revision effort, or accuracy on queries outside the tested set are reported. This assumption is load-bearing for the broader claim of automating semantic translation for scientists; without supporting data the empirical results on one workflow cannot be extrapolated.

    Authors: We acknowledge that the absence of measurements regarding the effort required to author and maintain Skills documents, as well as their performance on unseen queries, restricts the strength of our claims about broader applicability. Since these data were not collected in the present study, we will add a dedicated paragraph in the Discussion section to discuss this as a key limitation and to describe planned future experiments that will quantify authoring time, revision effort, and accuracy on additional workflows and query sets. revision: yes

standing simulated objections not resolved
  • Measurements of authoring time and revision effort for the Skills documents
  • Accuracy of the system on queries outside the tested 150-query set
  • Inter-expert consistency metrics for Skills validation

Circularity Check

0 steps flagged

No circularity; claims rest on implementation and direct empirical measurements

full rationale

The paper describes a three-layer agentic architecture (LLM intent extraction, deterministic workflow generators, expert-authored Skills) and supports its claims via an ablation study on 150 queries for the 1000 Genomes workflow, reporting concrete accuracy gains (44% to 83%), data-transfer reductions (92%), and runtime/cost metrics. No equations, fitted parameters, or derivations are present that reduce to inputs by construction. No self-citations are invoked as load-bearing uniqueness theorems or ansatzes. The evaluation is self-contained against external benchmarks (Kubernetes execution, query set) rather than relying on self-referential logic or renamed known results.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entity

The central claims rest on assumptions about LLM reliability for intent extraction and the feasibility of expert-authored Skills rather than new mathematical derivations or physical laws.

axioms (2)
  • domain assumption: LLMs can extract structured intents from natural language research questions when guided by Skills.
    This underpins the semantic layer and the reported accuracy improvements.
  • domain assumption: Deterministic generators can produce valid, reproducible workflow DAGs from structured intents.
    Required for the claim that identical intents yield identical workflows.
invented entities (1)
  • Skills (no independent evidence)
    purpose: Markdown documents that encode vocabulary mappings, parameter constraints, and optimization strategies for the knowledge layer.
    New component introduced to guide LLM behavior and enable deferred workflow generation.

pith-pipeline@v0.9.0 · 5498 in / 1372 out tokens · 39266 ms · 2026-05-09T21:21:12.273242+00:00 · methodology

