A Self-Evolving Agentic System for Automated Generation and Execution of Biological Protocols

Cheng Liang; Haoran Sun; Lei Bai; Lilong Wang; Meng Yang; Rubo Wang; Weiting Tang; Wenjie Lou; Xiaosong Wang; Yankai Jiang

arxiv: 2606.31763 · v2 · pith:2NR6DFYFnew · submitted 2026-06-30 · 💻 cs.AI

Yankai Jiang, Weiting Tang, Haoran Sun, Zhenyu Tang, Yuejie Hou, Yingnan Han, Rubo Wang, Yueyuxiao Yang, Cheng Liang, Lilong Wang, Wenjie Lou, Xiaosong Wang, Lei Bai, Meng Yang

Pith reviewed 2026-07-03 22:01 UTC · model grok-4.3

classification 💻 cs.AI

keywords: autonomous wet-lab experimentation · multi-agent systems · protocol generation · biological automation · SOP expansion · Opentrons execution · feedback-guided revision

The pith

ProtoPilot converts biological protocols into executable code that passes Opentrons execution gates 88.24 percent of the time.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces ProtoPilot, a self-evolving multi-agent system that generates protocols from scientific intent, expands them into SOPs, produces SDK-compliant code, and revises workflows using wet-lab feedback. It tests this pipeline on a benchmark of 294 tasks drawn from 98 gold-standard synthetic-biology and molecular-biology protocols, using expert rubrics, device-level validity gates, and real experimental runs. The system records a 90.2 percent Top@3 expert-preference rate, an 89.5 percent protocol-to-code gate pass rate, and an 88.24 percent Opentrons pass rate, compared with 32.35 percent for the OpenTrons-AI baseline. Wet-lab trials yield interpretable readouts, Sanger-confirmed products, and feedback-corrected PCA-assembled DNA targets. These outcomes matter because they demonstrate a closed loop from protocol text to physical execution that prior text-only generators have not achieved.

Core claim

ProtoPilot incorporates layer-wise verifiability, multi-agent orchestration, and a runtime-updated skill library to generate protocols, expand SOPs, synthesize SDK-compliant code, and revise workflows from wet-lab feedback, achieving a Top@3 expert-preference rate of 90.2 percent, an overall protocol-to-code gate pass rate of 89.5 percent, and an Opentrons pass rate of 88.24 percent on 294 tasks from 98 gold-standard protocols, with wet-lab validation producing interpretable readouts, Sanger-confirmed products, and feedback-corrected PCA-assembled DNA targets.
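
The material reviewed here does not define Top@3. One common reading is the share of tasks on which experts rank the system's output among their three preferred candidates. The sketch below computes a rate of that kind under this assumption; the function name `top_k_rate` and the example ranks are hypothetical and are not taken from the paper.

```python
from typing import Sequence


def top_k_rate(ranks: Sequence[int], k: int = 3) -> float:
    """Fraction of tasks whose expert-assigned rank (1 = best) falls within the top k."""
    if not ranks:
        raise ValueError("ranks must be non-empty")
    return sum(1 for r in ranks if r <= k) / len(ranks)


# Hypothetical expert ranks for one system across ten tasks (illustrative only).
example_ranks = [1, 2, 1, 4, 3, 1, 2, 5, 1, 3]
rate = top_k_rate(example_ranks, k=3)
print(f"Top@3 rate on the toy data: {rate:.1%}")
```

If the paper uses a different definition (for example, preference votes rather than ranks), only the comparison inside `top_k_rate` would need to change.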

What carries the argument

ProtoPilot, the self-evolving multi-agent system that uses layer-wise verifiability, multi-agent orchestration, and a runtime-updated skill library to align protocol text with device execution and experimental feedback.
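
To make "layer-wise verifiability" concrete, the following is a minimal sketch of a staged pipeline in which every hand-off between layers must pass a guard before the next layer is produced. The layer names follow the Figure 1 caption; the `Stage` class, the `run_pipeline` function, and the toy guard are assumptions made for illustration and do not describe ProtoPilot's actual implementation.

```python
from dataclasses import dataclass
from typing import Callable, List

# Layer names follow the Figure 1 caption; everything else here is hypothetical.
LAYERS = ["scientific intent", "protocol", "manual SOP", "device SOP", "instrument code"]


@dataclass
class Stage:
    source: str
    target: str
    transform: Callable[[str], str]      # produces the next layer's artifact
    guard: Callable[[str, str], bool]    # checks the output against its input


class GuardFailure(Exception):
    """Raised when a stage output is inconsistent with the layer above it."""


def run_pipeline(intent: str, stages: List[Stage]) -> List[str]:
    artifacts = [intent]
    for stage in stages:
        out = stage.transform(artifacts[-1])
        if not stage.guard(artifacts[-1], out):
            raise GuardFailure(f"{stage.source} -> {stage.target} failed its guard")
        artifacts.append(out)
    return artifacts


def keeps_volume(prev: str, new: str) -> bool:
    # Toy consistency check: a quantity stated upstream must survive downstream.
    return "50 uL" in new if "50 uL" in prev else True


# Toy transforms that simply tag the text with the name of the layer they produce.
stages = [
    Stage(LAYERS[i], LAYERS[i + 1],
          lambda text, name=LAYERS[i + 1]: f"{text} | {name}", keeps_volume)
    for i in range(len(LAYERS) - 1)
]
artifacts = run_pipeline("transfer 50 uL of culture", stages)
```

The design point is that a failure is attributed to a specific layer transition, which is what makes the chain auditable stage by stage rather than only at the end.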

If this is right

The evaluation framework captures execution-relevant requirements for autonomous wet-lab automation.
ProtoPilot converts protocol and code generation into validated execution and feedback-guided revision at scale.
The system outperforms the OpenTrons-AI baseline by more than 50 percentage points on the Opentrons pass rate (88.24 versus 32.35 percent).
Wet-lab feedback loops enable correction of PCA-assembled DNA targets, and the wet-lab runs also yield Sanger-confirmed products.
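
The size of the gap over the baseline follows directly from the two Opentrons pass rates reported in the abstract. The short calculation below uses only those two figures; the variable names are illustrative.

```python
protopilot_pass = 88.24   # Opentrons pass rate reported for ProtoPilot (%)
baseline_pass = 32.35     # Opentrons pass rate reported for OpenTrons-AI (%)

gap_points = protopilot_pass - baseline_pass   # absolute gap in percentage points
ratio = protopilot_pass / baseline_pass        # relative improvement factor

print(f"gap: {gap_points:.2f} percentage points")
print(f"ratio: {ratio:.2f}x")
```

The absolute gap is 55.89 percentage points, and the reported pass rate is roughly 2.73 times that of the baseline.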

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Systems built on the same self-evolving loop could be tested on protocols for other lab platforms beyond Opentrons.
The skill-library update mechanism might reduce the number of human revisions needed when new device constraints appear.
If the benchmark tasks generalize, the same architecture could support iterative protocol improvement across multiple experimental rounds without restarting from text.

Load-bearing premise

The 294 tasks derived from 98 gold-standard protocols, together with the expert rubrics and device gates, are representative of the requirements for autonomous wet-lab automation in general.

What would settle it

A new collection of protocols outside the original 98 that produces expert preference or hardware pass rates comparable to the 32 percent baseline instead of the reported 88-90 percent figures.
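
Whether a pass rate on a new collection is "comparable to" the baseline depends on sample size, so a confidence interval is a more useful comparison than a point estimate. The sketch below computes a Wilson score interval for a binomial pass rate. The counts (14 passes out of 40 tasks) are invented for illustration, and the paper does not state that it uses this interval.

```python
import math
from typing import Tuple


def wilson_interval(passes: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 gives about 95%)."""
    if n <= 0 or not 0 <= passes <= n:
        raise ValueError("need n > 0 and 0 <= passes <= n")
    p = passes / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return centre - half, centre + half


# Hypothetical out-of-benchmark result: 14 of 40 new tasks pass the hardware gate.
low, high = wilson_interval(passes=14, n=40)

baseline = 0.3235   # OpenTrons-AI Opentrons pass rate from the abstract
reported = 0.8824   # ProtoPilot Opentrons pass rate from the abstract
print(f"interval: [{low:.3f}, {high:.3f}]")
print("covers baseline:", low <= baseline <= high)
print("covers reported:", low <= reported <= high)
```

An interval that covers the baseline rate but excludes the reported rate would count as the kind of disconfirming evidence described above; the reverse pattern would support the generalization claim.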

Figures

Figures reproduced from arXiv: 2606.31763 by Cheng Liang, Haoran Sun, Lei Bai, Lilong Wang, Meng Yang, Rubo Wang, Weiting Tang, Wenjie Lou, Xiaosong Wang, Yankai Jiang, Yingnan Han, Yuejie Hou, Yueyuxiao Yang, Zhenyu Tang.

**Figure 1.** Overview of ProtoPilot. (a) Closed-loop pipeline from scientific intent to wet-lab feedback. (b) Five-layer experimental representation, from scientific intent through protocol, manual SOP and device SOP to instrument code, with stage-wise guards preserving consistency across layers. (c) Hierarchical coordinator-worker mechanism for SOP generation, in which the Orchestrator gates step procedures from the …

**Figure 2.** ProtoPilot evaluation across protocol quality, code executability, cross-device generalization, and expert preference. (a) Rubric-based protocol quality scores across 294 test cases spanning three difficulty tiers (L1–L3) and seven criteria (D1–D7), shown per system (mean ± s.e.m.). (b) Protocol→Code evaluation across 13 systems (n = 294 cases per system), comprising per-case executability strips (gate pas…

**Figure 3.** ProtoPilot automates three independent foundational wet-lab operations. (a) Workflow diagram for E. coli culture inoculation. (b) Automation deck layout for the E. coli culture inoculation experiment. (c) Results of 96-well E. coli inoculation, including a representative culture plate after incubation (i) and an OD600 heatmap of sample wells (ii); heatmap colors indicate OD600 absorbance levels across well…

**Figure 4.** ProtoPilot supports construction of pET21b-GLuc-WT and pET21b-RLuc-WT plasmids. (a) Workflow diagram for wild-type luciferase plasmid construction. (b) Automation deck layout for the wild-type luciferase plasmid construction experiment. (c) Agarose gel electrophoresis of PCR products for the GLuc-WT insert, RLuc-WT insert and pET-21b(+) vector backbone; M2 denotes DL10000 DNA marker, and lanes 1-3 denote t…

**Figure 5.** ProtoPilot supports parallel construction of GLuc and RLuc point mutants. (a) Workflow diagram for whole-plasmid mutagenesis of GLuc and RLuc. (b) Automation deck layout for parallel multi-mutant construction. (c) Agarose gel electrophoresis of whole-plasmid PCR products for GLuc mutants (i) and RLuc mutants (ii); M2 denotes DL10000 DNA marker. Lanes 1–8 indicate the eight designed mutants in each series: …

**Figure 6.** ProtoPilot supports parallel construction of GLuc and RLuc point mutants. (a) ProtoPilot supports PCA-based DNA assembly and iterative protocol refinement through feedback. (b) Automation deck layout for the PCA DNA assembly experiment. (c) Agarose gel electrophoresis of PCA assembly products for target fragments BL01-BL04; M1 denotes DL2000 DNA marker, NTC denotes no-template control, and lanes 1-3 deno…

Original abstract

Autonomous wet-lab experimentation requires more than plausible protocol text: biological intent, quantitative procedures, device constraints and experimental feedback must remain aligned from protocol and SOP design to code and physical execution. We developed ProtoPilot, a self-evolving multi-agent system, together with an expert-grounded benchmark and evaluation framework for testing this conversion as an experimental automation problem. The framework spans 294 synthetic-biology and molecular-biology tasks derived from 98 gold-standard protocols, wet-lab expert rubrics, device-level validity gates and real experimental tests. ProtoPilot incorporates layer-wise verifiability, multi-agent orchestration and a runtime-updated skill library to generate protocols, expand SOPs, synthesize SDK-compliant code and revise workflows from wet-lab feedback. It achieved a Top@3 expert-preference rate of 90.2%, an overall protocol-to-code gate pass rate of 89.5% and an Opentrons pass rate of 88.24%, compared with 32.35% for OpenTrons-AI. Wet-lab validation produced interpretable readouts, Sanger-confirmed products and feedback-corrected PCA-assembled DNA targets, establishing a verifiable route to autonomous experimentation. Together, these results show that the evaluation framework captures execution-relevant requirements for autonomous wet-lab automation, and that ProtoPilot can meet them by converting protocol and code generation into validated execution and feedback-guided revision.
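
The abstract refers to device-level validity gates for SDK-compliant code without spelling out what they check. As a rough illustration of the cheapest kind of gate, the sketch below statically inspects a candidate script for two conventions of the Opentrons Python Protocol API: a top-level `run` function taking one argument, and an `apiLevel` declared in a `metadata` or `requirements` dictionary. The function `passes_static_gate` is hypothetical and far weaker than simulation or a hardware run; it is not the paper's gate.

```python
import ast


def passes_static_gate(source: str) -> bool:
    """True if the script parses, defines run(<one arg>), and declares an apiLevel."""
    try:
        tree = ast.parse(source)
    except SyntaxError:
        return False
    has_run = False
    has_api_level = False
    for node in tree.body:
        if isinstance(node, ast.FunctionDef) and node.name == "run":
            has_run = len(node.args.args) == 1
        elif isinstance(node, ast.Assign) and isinstance(node.value, ast.Dict):
            targets = {t.id for t in node.targets if isinstance(t, ast.Name)}
            if targets & {"metadata", "requirements"}:
                keys = {k.value for k in node.value.keys if isinstance(k, ast.Constant)}
                has_api_level = has_api_level or "apiLevel" in keys
    return has_run and has_api_level


# The candidate is only parsed, never imported, so the opentrons package is not needed.
candidate = '''
from opentrons import protocol_api

metadata = {"protocolName": "demo", "apiLevel": "2.13"}

def run(protocol: protocol_api.ProtocolContext):
    plate = protocol.load_labware("corning_96_wellplate_360ul_flat", 1)
'''

print(passes_static_gate(candidate))
```

A check of this kind can only reject malformed scripts early. Labware fit, tip availability, and volume limits require simulation or execution, which is where the reported pass rates are measured.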

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

ProtoPilot delivers concrete protocol-to-execution numbers on its 294-task benchmark, with real wet-lab checks, but the claim that this benchmark represents autonomous wet-lab automation in general is not yet backed by coverage arguments or ablations.

The key takeaway is that this system turns protocol text into Opentrons code and gets it to run with feedback corrections, hitting 88% pass rates where the baseline sits at 32%. They also report Sanger-confirmed products and interpretable readouts from actual experiments.

What stands out is the combination of layer-wise checks, a runtime skill library, and closed-loop revision from wet-lab data. The multi-agent orchestration plus device gates produces measurable improvement over OpenTrons-AI on the same tasks, and the wet-lab validation step goes beyond simulation. That part is grounded enough to be useful for anyone building lab agents.

The soft spot is the leap from these results to the claim that the framework captures requirements for autonomous wet-lab automation in general. The 294 tasks come from 98 gold-standard protocols in synthetic and molecular biology, but there is no breakdown of domain coverage, no test on cell-based assays or non-Opentrons hardware, and no ablation outside the sampled set. Without that, the high scores stay tied to this collection. The self-evolving aspect also needs clearer separation between feedback used for revision and feedback used for scoring to avoid circularity.
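
One way to keep revision feedback and scoring apart is to split the benchmark by source protocol, so that no protocol contributes tasks to both sides. The sketch below shows such a grouped split. The even layout of three tasks per protocol is an assumption (98 × 3 = 294 matches the reported counts, but the per-protocol breakdown is not stated), and `split_by_protocol` is a hypothetical helper, not something the paper describes.

```python
import random
from collections import defaultdict
from typing import Dict, List, Tuple


def split_by_protocol(
    task_to_protocol: Dict[str, str],
    holdout_fraction: float = 0.2,
    seed: int = 0,
) -> Tuple[List[str], List[str]]:
    """Split tasks so that all tasks from one protocol land on the same side."""
    groups = defaultdict(list)
    for task, proto in task_to_protocol.items():
        groups[proto].append(task)
    protocols = sorted(groups)
    random.Random(seed).shuffle(protocols)
    n_holdout = max(1, round(len(protocols) * holdout_fraction))
    holdout = set(protocols[:n_holdout])
    revise = [t for p in protocols if p not in holdout for t in groups[p]]
    score = [t for p in protocols if p in holdout for t in groups[p]]
    return revise, score


# Assumed layout: 98 protocols with 3 tasks each (illustrative identifiers).
tasks = {f"P{p:02d}-T{t}": f"P{p:02d}" for p in range(98) for t in range(1, 4)}
revise_tasks, score_tasks = split_by_protocol(tasks)
print(len(revise_tasks), len(score_tasks))
```

Feedback gathered on `revise_tasks` could then update the skill library, while `score_tasks` would be reserved for reporting, which removes the leakage path the note is concerned about.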

This paper is for groups working on agentic lab automation who want a concrete benchmark and execution numbers rather than abstract workflow ideas. It deserves a serious referee because the performance claims are specific and include physical experiments, even if the generalizability argument needs tightening. I would send it out for review with a request for more on benchmark scope.

Referee Report

2 major / 2 minor

Summary. The paper introduces ProtoPilot, a self-evolving multi-agent system for automated generation of biological protocols, expansion to SOPs, synthesis of SDK-compliant code, and revision from wet-lab feedback. It presents an expert-grounded benchmark of 294 synthetic-biology and molecular-biology tasks derived from 98 gold-standard protocols, along with device-level validity gates and real experimental tests. Reported results include a 90.2% Top@3 expert-preference rate, 89.5% protocol-to-code gate pass rate, and 88.24% Opentrons pass rate (vs. 32.35% for OpenTrons-AI), with wet-lab validations yielding interpretable readouts, Sanger-confirmed products, and feedback-corrected PCA-assembled DNA targets. The central claim is that the evaluation framework captures execution-relevant requirements for autonomous wet-lab automation and that ProtoPilot meets them.

Significance. If the benchmark representativeness holds, the work would be significant for demonstrating an integrated pipeline from protocol text to physical execution with self-evolution via feedback. The inclusion of real wet-lab validation, expert rubrics, and a direct baseline comparison provides concrete, falsifiable evidence of feasibility that goes beyond text generation. These elements strengthen the assessment of practical utility in lab automation.

major comments (2)

[Abstract] The claim that 'the evaluation framework captures execution-relevant requirements for autonomous wet-lab automation' is load-bearing for the conclusion but rests on the unverified assumption that the 294 tasks from 98 protocols are representative; no diversity breakdown, coverage argument across domains (e.g., cell-based assays, real-time adaptive control, non-Opentrons hardware), or ablation outside the sampled set is provided.
[Evaluation Framework] The assumption that expert rubrics and device-level validity gates are sufficiently representative of general autonomous wet-lab automation requirements is central to interpreting the high numeric scores (90.2%, 89.5%, 88.24%) as evidence of broader capability, yet no supporting analysis of protocol diversity or cross-domain generalization is supplied.

minor comments (2)

[Abstract] Terms such as 'layer-wise verifiability' and 'runtime-updated skill library' are introduced without concise definitions or pointers to their implementation in the methods.
[Results] The description of how post-hoc protocol revisions from wet-lab feedback were counted and incorporated into performance metrics lacks sufficient detail for independent verification.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their detailed feedback on the representativeness of the evaluation framework. We address each major comment below and agree that explicit scoping and limitations discussion will improve the manuscript.

Referee: [Abstract] The claim that 'the evaluation framework captures execution-relevant requirements for autonomous wet-lab automation' is load-bearing for the conclusion but rests on the unverified assumption that the 294 tasks from 98 protocols are representative; no diversity breakdown, coverage argument across domains (e.g., cell-based assays, real-time adaptive control, non-Opentrons hardware), or ablation outside the sampled set is provided.

Authors: We agree that the abstract claim is scoped to the domains and hardware in the study. The 98 protocols were selected as gold-standard examples from synthetic-biology and molecular-biology literature for Opentrons-executable tasks. The manuscript does not provide a diversity breakdown or cross-domain coverage argument. We will revise the abstract to qualify the claim and add a limitations subsection in the Evaluation Framework section that includes a brief overview of protocol categories (e.g., PCR, assembly, transformation) and explicitly states the benchmark's intended scope without asserting generalization beyond Opentrons-based synthetic and molecular biology workflows.

Revision: yes

Referee: [Evaluation Framework] The assumption that expert rubrics and device-level validity gates are sufficiently representative of general autonomous wet-lab automation requirements is central to interpreting the high numeric scores (90.2%, 89.5%, 88.24%) as evidence of broader capability, yet no supporting analysis of protocol diversity or cross-domain generalization is supplied.

Authors: The framework and metrics are presented for the specific setting of protocol-to-Opentrons-code conversion on the sampled tasks. No diversity analysis or cross-domain generalization tests appear in the current text. In revision we will add a short diversity table or paragraph summarizing task distribution across the 98 protocols and a limitations paragraph clarifying that the reported scores reflect performance within this scope rather than general autonomous wet-lab automation.

Revision: yes

Circularity Check

0 steps flagged

No circularity: empirical validation rests on external gold standards and wet-lab tests

The provided abstract and description contain no equations, fitted parameters, or derivation chain. The benchmark is constructed from 98 external gold-standard protocols plus independent expert rubrics and physical wet-lab execution (Sanger sequencing, PCA assembly). Success metrics are reported against these external references and a baseline (OpenTrons-AI), with no reduction of predictions to the system's own training loop or self-citations. The self-evolving mechanism uses runtime feedback for revision, but the evaluation framework is presented as separately grounded and falsifiable via wet-lab outcomes. This satisfies the criteria for a self-contained empirical result with no load-bearing circular steps.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review supplies no explicit free parameters, axioms, or invented entities; the central claim rests on the unstated assumption that the 98-protocol benchmark and expert rubrics are unbiased and representative.

pith-pipeline@v0.9.1-grok · 5820 in / 1089 out tokens · 21691 ms · 2026-07-03T22:01:50.424147+00:00 · methodology

Reference graph

Works this paper leans on

17 extracted references · 2 canonical work pages · 1 internal anchor

[1] Gottweis, J. et al. Accelerating scientific discovery with Co-Scientist. Nature, 1–3 (2026).
[2] Swanson, K., Wu, W., Bulaong, N. L., Pak, J. E. & Zou, J. The Virtual Lab of AI agents designs new SARS-CoV-2 nanobodies. Nature 646, 716–723 (2025).
[3] Gao, S. et al. Empowering biomedical discovery with AI agents. Cell 187, 6125–6151 (2024).
[4] Ghareeb, A. E. et al. A multi-agent system for automating scientific discovery. Nature, 1–3 (2026).
[5] Wang, Y. et al. Accelerating primer design for amplicon sequencing using large language model-powered agents. Nature Biomedical Engineering 10, 338–353 (2026).
[6] King, R. D. et al. The automation of science. Science 324, 85–89 (2009).
[7] Rapp, J. T., Bremer, B. J. & Romero, P. A. Self-driving laboratories to autonomously navigate the protein fitness landscape. Nature chemical engineering 1, 97–107 (2024).
[8] Smith, A. A. et al. Using a GPT-5-driven autonomous lab to optimize the cost and titer of cell-free protein synthesis. bioRxiv, 2026–02 (2026).
[9] Jin, R. et al. BioLab: End-to-end autonomous life sciences research with multi-agents system integrating biological foundation models. bioRxiv, 2025–09 (2025).
[10] Sun, H. et al. Unleashing Scientific Reasoning for Bio-experimental Protocol Generation via Structured Component-based Reward Mechanism. arXiv preprint arXiv:2510.15600 (2025).
[11] Li, J., Wang, M., Zheng, Z. & Zhang, M. Loogle: Can long-context language models understand long contexts? in Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) (2024), 16304–16333.
[12] Ananthanarayanan, V. & Thies, W. Biocoder: A programming language for standardizing and automating biology protocols. Journal of biological engineering 4, 13 (2010).
[13] Gao, Y. et al. Autonomous liquid-handling robotics scripting through large language models enables accessible and safe protein engineering workflows. bioRxiv, 2025–09 (2025).
[14] Bartley, B. et al. Building an open representation for biological protocols. ACM Journal on Emerging Technologies in Computing Systems 19, 1–21 (2023).
[15] Anhel, A.-M., Alejaldre, L. & Goñi-Moreno, Á. The Laboratory Automation Protocol (LAP) Format and Repository: a platform for enhancing workflow efficiency in synthetic biology. ACS synthetic biology 12, 3514–3520 (2023).
[16] Liu, Y., Wang, J., Lv, L. & Tian, Y. BioProAgent: Neuro-Symbolic Grounding for Constrained Scientific Planning. arXiv preprint arXiv:2603.00876 (2026).
[17] Song, R. et al. Towards autonomous biology: Compiler-Verified Protocols as a Foundation for Real World AI Execution. bioRxiv, 2026–05 (2026).

Acknowledgements: This work was jointly supported by Shanghai Artificial Intelligence Laboratory and Genoria AI Technology Co., Ltd.

Competing Interests: The authors declare no competing interests.