From Research Question to Scientific Workflow: Leveraging Agentic AI for Science Automation
Pith reviewed 2026-05-09 21:21 UTC · model grok-4.3
The pith
Domain experts author Skills documents to confine LLM non-determinism to intent extraction so identical research questions produce identical scientific workflows.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The architecture decomposes the translation task into three layers so that LLM non-determinism is restricted to intent extraction: the LLM converts varied natural language into a fixed structured intent, validated generators always produce the same workflow DAG from that intent, and domain experts maintain the mapping rules inside Skills documents. When tested on the 1000 Genomes population genetics workflow running under Hyperflow on Kubernetes, Skills raise full-match intent accuracy from 44 percent to 83 percent in a 150-query ablation, skill-guided deferred workflow generation cuts data transfer by 92 percent, and the full pipeline runs with LLM overhead below 15 seconds and cost under $0.001 per query.
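The deterministic layer can be illustrated with a minimal sketch. The `Intent` fields, the task naming scheme, and the `generate_dag` and `fingerprint` helpers are all hypothetical and are not taken from the paper; the only point is that a generator written as a pure function maps equal intents to equal DAGs.

```python
import hashlib
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class Intent:
    """Hypothetical structured intent; the paper's real schema may differ."""
    analysis: str
    populations: tuple
    chromosomes: tuple


def generate_dag(intent: Intent) -> dict:
    """Map an intent to a DAG (task name -> list of upstream task names).

    Pure function: no randomness, clock, or LLM call, so equal intents
    always give equal DAGs. Sorting makes the result independent of the
    order in which the values were listed.
    """
    dag = {}
    for chrom in sorted(intent.chromosomes):
        extract = f"extract_chr{chrom}"
        dag[extract] = []
        for pop in sorted(intent.populations):
            dag[f"{intent.analysis}_chr{chrom}_{pop}"] = [extract]
    return dag


def fingerprint(dag: dict) -> str:
    """Stable hash of a DAG, useful for checking reproducibility."""
    canonical = json.dumps(dag, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Two intents as they might be extracted from differently worded questions.
first = Intent("frequency", ("EUR", "AFR"), (1, 2))
second = Intent("frequency", ("AFR", "EUR"), (2, 1))

same_workflow = fingerprint(generate_dag(first)) == fingerprint(generate_dag(second))
print(same_workflow)
```

Comparing fingerprints rather than raw structures is a design choice of this sketch: a short hash is easy to log next to each run and to check later when reproducing a result.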
What carries the argument
Skills: markdown documents that encode vocabulary mappings, parameter constraints, and optimization strategies to direct intent extraction and enable deterministic DAG generation from fixed intents.
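To make the idea of a Skills document concrete, here is a small sketch of one possible layout and a parser for it. The section names, the `term -> code` and `name: lo..hi` line formats, and every function name are assumptions made for illustration; the paper's actual Skills format is not reproduced in this summary.

```python
import re

# Hypothetical Skills document; the layout is an assumption of this sketch.
SKILL_MD = """\
# Skill: population genetics (illustrative)

## Vocabulary
- european -> EUR
- african -> AFR

## Constraints
- chromosome: 1..22
"""


def parse_skill(text: str) -> dict:
    """Extract vocabulary mappings and integer range constraints."""
    skill = {"vocabulary": {}, "constraints": {}}
    section = None
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("## "):
            section = line[3:].lower()
        elif line.startswith("- ") and section == "vocabulary":
            term, code = (part.strip() for part in line[2:].split("->"))
            skill["vocabulary"][term.lower()] = code
        elif line.startswith("- ") and section == "constraints":
            match = re.fullmatch(r"- (\w+): (\d+)\.\.(\d+)", line)
            if match:
                name, low, high = match.groups()
                skill["constraints"][name] = (int(low), int(high))
    return skill


def normalize_term(term: str, skill: dict):
    """Return the canonical code for a term, or None if it is unknown."""
    return skill["vocabulary"].get(term.lower())


def satisfies(name: str, value: int, skill: dict) -> bool:
    """Check a value against the inclusive range declared for `name`."""
    low, high = skill["constraints"][name]
    return low <= value <= high


skill = parse_skill(SKILL_MD)
print(normalize_term("European", skill), satisfies("chromosome", 23, skill))
```

Returning `None` for an unknown term, rather than guessing, lets a caller reject the intent or ask the user to rephrase before any workflow is generated.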
If this is right
- Identical intents always produce identical workflows even if the LLM varies its wording.
- Full-match intent accuracy rises from 44 percent to 83 percent once Skills are supplied.
- Deferred workflow generation driven by Skills reduces data movement by 92 percent.
- The complete pipeline finishes queries on Kubernetes with under 15 seconds of LLM overhead and a cost under $0.001 per query.
Where Pith is reading between the lines
- The separation of concerns could let the same Skills library serve multiple workflow engines beyond Hyperflow.
- If Skills can be version-controlled like code, teams might share and reuse domain knowledge across institutions.
- Extending the intent schema to include conditional branches or data provenance hints could handle more complex research questions without increasing LLM variability.
Load-bearing premise
Domain experts can write and keep up-to-date Skills documents that correctly capture the needed vocabulary, constraints, and strategies, while the LLM can reliably turn varied natural language into accurate structured intents.
What would settle it
Apply the system to a new domain without pre-written Skills and measure two things: whether full-match intent accuracy falls back toward 44 percent, and whether the generated workflows diverge from versions written manually by experts.
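A sketch of how such a measurement could be scored follows. The summary does not give the paper's definition of a full match, so this sketch assumes that a prediction counts only when every field of the extracted intent equals the expert reference. The field names and sample data are invented for illustration.

```python
def full_match(predicted: dict, reference: dict) -> bool:
    """Assumed definition: every field must agree, with no partial credit."""
    return predicted == reference


def full_match_accuracy(pairs) -> float:
    """Fraction of (predicted, reference) intent pairs that match fully."""
    pairs = list(pairs)
    if not pairs:
        raise ValueError("need at least one (predicted, reference) pair")
    return sum(full_match(p, r) for p, r in pairs) / len(pairs)


# Invented sample: three queries, one with a wrong population code.
reference = {"analysis": "frequency", "population": "EUR", "chromosome": 1}
pairs = [
    ({"analysis": "frequency", "population": "EUR", "chromosome": 1}, reference),
    ({"analysis": "frequency", "population": "AFR", "chromosome": 1}, reference),
    ({"analysis": "frequency", "population": "EUR", "chromosome": 1}, reference),
]

accuracy = full_match_accuracy(pairs)
print(f"{accuracy:.2f}")
```

Under this strict definition a single wrong field fails the whole query, so a looser per-field score would generally report higher numbers and would not be comparable with the 44 and 83 percent figures.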
Original abstract
Scientific workflow systems automate execution (scheduling, fault tolerance, resource management) but not the semantic translation that precedes it. Scientists still manually convert research questions into workflow specifications, a task requiring both domain knowledge and infrastructure expertise. We propose an agentic architecture that closes this gap through three layers: an LLM interprets natural language into structured intents (semantic layer); validated generators produce reproducible workflow DAGs (deterministic layer); and domain experts author "Skills": markdown documents encoding vocabulary mappings, parameter constraints, and optimization strategies (knowledge layer). This decomposition confines LLM non-determinism to intent extraction: identical intents always yield identical workflows. We implement and evaluate the architecture on the 1000 Genomes population genetics workflow and Hyperflow WMS running on Kubernetes. In an ablation study on 150 queries, Skills raise full-match intent accuracy from 44% to 83%; skill-driven deferred workflow generation reduces data transfer by 92%; and the end-to-end pipeline completes queries on Kubernetes with LLM overhead below 15 seconds and cost under $0.001 per query.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes an agentic architecture to automate the semantic translation from natural language research questions to executable scientific workflows. It decomposes the process into an LLM semantic layer for extracting structured intents, a deterministic layer of validated generators that produce reproducible workflow DAGs, and a knowledge layer where domain experts author 'Skills' as markdown documents encoding vocabulary mappings, parameter constraints, and optimization strategies. This confines LLM non-determinism to intent extraction so that identical intents always yield identical workflows. The system is implemented and evaluated on the 1000 Genomes population genetics workflow running under Hyperflow on Kubernetes. An ablation on 150 queries reports that Skills raise full-match intent accuracy from 44% to 83%, skill-driven deferred generation reduces data transfer by 92%, and end-to-end queries incur LLM overhead below 15 seconds and cost under $0.001.
Significance. If the reported accuracy lift, data-transfer savings, and low overhead hold under broader conditions, the work offers a practical path to reducing the manual burden of workflow specification while preserving reproducibility. The explicit separation of non-deterministic intent extraction from deterministic generation is a clean architectural contribution. Concrete measurements on a production-grade workflow and Kubernetes deployment strengthen the practicality claim. The approach could influence both workflow systems and LLM-augmented scientific tooling, provided the Skills authoring process scales.
major comments (2)
- Abstract and Evaluation section: The ablation claims that Skills raise full-match intent accuracy from 44% to 83% on 150 queries, yet no description is given of how the 150 queries were generated or sampled, what constitutes a 'full match', the process by which domain experts created or validated the Skills documents, or any inter-expert consistency metrics. Because the 39-point improvement is presented as central evidence for the architecture's value, the absence of these details prevents assessment of reproducibility and generalizability beyond the single 1000 Genomes case.
- Abstract and §3 (Skills layer): The architecture's primary innovation rests on the premise that domain experts can author and maintain effective Skills documents that correctly encode vocabulary, constraints, and strategies. No measurements of authoring time, revision effort, or accuracy on queries outside the tested set are reported. This assumption is load-bearing for the broader claim of automating semantic translation for scientists; without supporting data the empirical results on one workflow cannot be extrapolated.
minor comments (2)
- The abstract refers to 'validated generators' without clarifying the validation mechanism or providing pseudocode; a short description or figure in the methods section would improve clarity.
- Related-work discussion would benefit from explicit comparison to existing workflow systems (e.g., Pegasus, Kepler) and prior LLM-assisted workflow generation efforts to better situate the contribution.
Simulated Author's Rebuttal
We thank the referee for the constructive comments and the recommendation for major revision. Below we provide point-by-point responses to the major comments, indicating the revisions we will make to address the concerns raised.
- Referee: Abstract and Evaluation section: The ablation claims that Skills raise full-match intent accuracy from 44% to 83% on 150 queries, yet no description is given of how the 150 queries were generated or sampled, what constitutes a 'full match', the process by which domain experts created or validated the Skills documents, or any inter-expert consistency metrics. Because the 39-point improvement is presented as central evidence for the architecture's value, the absence of these details prevents assessment of reproducibility and generalizability beyond the single 1000 Genomes case.
Authors: We agree that the manuscript would benefit from greater transparency in the evaluation methodology to support reproducibility and assessment of generalizability. We will revise the Evaluation section to include details on how the 150 queries were generated and sampled, the definition of a 'full match', and the process by which the domain expert created the Skills documents. We did not measure inter-expert consistency, as the Skills were developed by a single expert; this will be acknowledged as a limitation in the revised manuscript.
Revision: partial
- Referee: Abstract and §3 (Skills layer): The architecture's primary innovation rests on the premise that domain experts can author and maintain effective Skills documents that correctly encode vocabulary, constraints, and strategies. No measurements of authoring time, revision effort, or accuracy on queries outside the tested set are reported. This assumption is load-bearing for the broader claim of automating semantic translation for scientists; without supporting data the empirical results on one workflow cannot be extrapolated.
Authors: We acknowledge that the absence of measurements regarding the effort required to author and maintain Skills documents, as well as their performance on unseen queries, restricts the strength of our claims about broader applicability. Since these data were not collected in the present study, we will add a dedicated paragraph in the Discussion section to discuss this as a key limitation and to describe planned future experiments that will quantify authoring time, revision effort, and accuracy on additional workflows and query sets.
Revision: yes
- Measurements of authoring time and revision effort for the Skills documents
- Accuracy of the system on queries outside the tested 150-query set
- Inter-expert consistency metrics for Skills validation
Circularity Check
No circularity; claims rest on implementation and direct empirical measurements
The paper describes a three-layer agentic architecture (LLM intent extraction, deterministic workflow generators, expert-authored Skills) and supports its claims via an ablation study on 150 queries for the 1000 Genomes workflow, reporting concrete accuracy gains (44% to 83%), data-transfer reductions (92%), and runtime/cost metrics. No equations, fitted parameters, or derivations are present that reduce to inputs by construction. No self-citations are invoked as load-bearing uniqueness theorems or ansatzes. The evaluation is self-contained against external benchmarks (Kubernetes execution, query set) rather than relying on self-referential logic or renamed known results.
Axiom & Free-Parameter Ledger
axioms (2)
- Domain assumption: LLMs can extract structured intents from natural language research questions when guided by Skills.
- Domain assumption: Deterministic generators can produce valid, reproducible workflow DAGs from structured intents.
invented entities (1)
- Skills: no independent evidence