arxiv: 2604.16896 · v1 · submitted 2026-04-18 · 🧬 q-bio.QM · cs.AI

Recognition: unknown

ProtoCycle: Reflective Tool-Augmented Planning for Text-Guided Protein Design

Yutang Ge , Guojiang Zhao , Sihang Li , Zheng Cheng , Zifeng Zhao , Hanchen Xia , Guolin Ke , Linfeng Zhang

show 2 more authors

Zhifeng Gao Yuguang Wang

Authors on Pith no claims yet

Pith reviewed 2026-05-10 06:57 UTC · model grok-4.3

classification 🧬 q-bio.QM cs.AI

keywords protein designtext-guided generationlarge language modelsagentic planningreflectionfoldabilitysequence quality

0 comments

The pith

ProtoCycle uses an LLM to plan protein designs, get feedback from simulated engineering tools, reflect on that feedback, and revise until the sequence matches the text request.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents ProtoCycle as a framework that lets language models design proteins matching natural language functional requirements without heavy fine-tuning on sequence data. An LLM first creates a plan, then a lightweight set of tools emulates protein engineering steps to test the resulting sequence and return feedback. The LLM then reflects on that feedback to revise the plan over multiple rounds. This cycle is trained with supervised trajectories plus online reinforcement learning. The result is sequences that align well with the input text while remaining competitively foldable, and removing reflection hurts quality.

Core claim

ProtoCycle couples an LLM planner with a lightweight tool environment designed to emulate the iterative workflow of human protein engineering and uses LLM-driven reflection on tool feedback to revise plans. Trained with supervised trajectories and online reinforcement learning, ProtoCycle achieves strong language alignment while maintaining competitive foldability, and ablations show that reflection substantially improves sequence quality.

What carries the argument

The reflective tool-augmented planning cycle: the LLM generates a plan, tools return feedback on the sequence, and the LLM reflects to update the plan.

If this is right

Protein design becomes feasible with limited supervision instead of large labeled sequence datasets.
Reflection on tool feedback produces higher-quality sequences than single-pass planning.
Language alignment and foldability can be maintained together rather than traded off.
The same training approach of supervised trajectories followed by reinforcement learning applies to the full cycle.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The approach could be tested on other design tasks where text instructions must produce structured outputs that satisfy physical constraints.
If the tool feedback loop proves reliable, it may reduce reliance on post-hoc filtering or heavy model fine-tuning in applied design settings.
Real laboratory validation of the generated sequences would show whether the simulated feedback translates to actual functional proteins.

Load-bearing premise

A lightweight tool environment can sufficiently emulate the iterative workflow of human protein engineering to generate feedback that the LLM can usefully reflect upon and act on.

What would settle it

Running the same prompts with and without the reflection step and finding no measurable gain in language alignment or sequence quality scores.

Figures

Figures reproduced from arXiv: 2604.16896 by Guojiang Zhao, Guolin Ke, Hanchen Xia, Linfeng Zhang, Sihang Li, Yuguang Wang, Yutang Ge, Zheng Cheng, Zhifeng Gao, Zifeng Zhao.

**Figure 2.** Figure 2: Overview of ProtoCycle compared to a human protein engineer. [PITH_FULL_IMAGE:figures/full_fig_p005_2.png] view at source ↗

**Figure 3.** Figure 3: An example of reflection in ProtoCycle. a type t , "argument": a arg t }</tool_call>. A lightweight runtime then parses this tag and invokes the corresponding tool [PITH_FULL_IMAGE:figures/full_fig_p005_3.png] view at source ↗

**Figure 5.** Figure 5: Reflection improves step-wise optimization [PITH_FULL_IMAGE:figures/full_fig_p008_5.png] view at source ↗

**Figure 6.** Figure 6: Reflection improves decision quality and [PITH_FULL_IMAGE:figures/full_fig_p008_6.png] view at source ↗

**Figure 4.** Figure 4: Reflection ablation on final plausibility, fold [PITH_FULL_IMAGE:figures/full_fig_p008_4.png] view at source ↗

**Figure 7.** Figure 7: Token-level uncertainty visualization for a representative example. The top panel shows the generated [PITH_FULL_IMAGE:figures/full_fig_p014_7.png] view at source ↗

**Figure 8.** Figure 8: Case study (Part I) [PITH_FULL_IMAGE:figures/full_fig_p022_8.png] view at source ↗

**Figure 9.** Figure 9: Case study (Part II) [PITH_FULL_IMAGE:figures/full_fig_p023_9.png] view at source ↗

**Figure 10.** Figure 10: Pinal on the case-study requirement [PITH_FULL_IMAGE:figures/full_fig_p024_10.png] view at source ↗

**Figure 11.** Figure 11: ProtoCycle (final) on the same requirement. [PITH_FULL_IMAGE:figures/full_fig_p024_11.png] view at source ↗

read the original abstract

Designing proteins that satisfy natural language functional requirements is a central goal in protein engineering. A straightforward baseline is to fine-tune generic instruction-tuned LLMs as direct text-to-sequence generators, but this is data- and compute-hungry. With limited supervision, LLMs can produce coherent plans in text yet fail to reliably realize them as sequences. This plan-execute gap motivates ProtoCycle, an agentic framework for protein design that uses LLMs primarily to drive a multi-round, feedback-driven decision cycle. ProtoCycle couples an LLM planner with a lightweight tool environment designed to emulate the iterative workflow of human protein engineering and uses LLM-driven reflection on tool feedback to revise plans. Trained with supervised trajectories and online reinforcement learning, ProtoCycle achieves strong language alignment while maintaining competitive foldability, and ablations show that reflection substantially improves sequence quality.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

ProtoCycle adds an explicit reflection loop on lightweight tool feedback to LLM planning for text-to-protein design, but the abstract supplies no numbers so the size of any real gain over direct generation stays unclear.

read the letter

The paper's main contribution is ProtoCycle, a framework where an LLM plans protein designs from text, uses lightweight tools for feedback that mimics engineering checks, reflects on the results to revise the plan, and iterates. They train this with supervised trajectories and online RL to achieve better language alignment and foldability than direct generation baselines. What is new is the explicit reflection mechanism tied to tool outputs in the protein design context. It does well at highlighting the plan-execute gap in current LLM approaches and proposing a structured cycle to address it with limited supervision. The ablations on reflection are a sensible way to test the idea. The soft spots center on the tool environment and the missing details. The tools are described as lightweight, which raises the question of whether they provide signals rich enough for meaningful reflection or if the model is just optimizing for the specific feedback format. Without quantitative results, baselines, or descriptions of the tool fidelity, it's difficult to gauge how much the reflection actually improves functional design. This is a real concern given the stress-test note. For readers in AI for protein engineering, this could spark ideas on agentic methods, though anyone wanting to use or extend it would need the full experimental data. I would recommend sending it for peer review. The core idea is worth referee attention to see if the framework holds up under detailed scrutiny of the methods and results.

Referee Report

3 major / 2 minor

Summary. The paper introduces ProtoCycle, an agentic framework for text-guided protein design in which an LLM planner interacts with a lightweight tool environment that emulates iterative human protein engineering workflows. The system uses LLM-driven reflection on tool feedback to revise plans across multiple rounds, trained via supervised trajectories followed by online reinforcement learning. The central claims are that ProtoCycle achieves strong language alignment with competitive foldability and that ablations demonstrate substantial improvements in sequence quality attributable to the reflection mechanism.

Significance. If the empirical claims hold, ProtoCycle would represent a meaningful step toward data-efficient, planning-based protein design that reduces reliance on direct fine-tuning of LLMs as sequence generators. The explicit use of online RL and controlled ablations isolating reflection constitute strengths that could inform subsequent agentic methods in the field.

major comments (3)

[§4.2] §4.2 (Tool Environment): The lightweight tools are described as providing feedback on sequence statistics, coarse structure predictors, and rule-based checks, yet no quantitative correlation analysis with experimental outcomes or full-physics simulators is reported. This leaves open whether reflection improves genuine engineering reasoning or merely optimizes for the specific proxy signals, directly bearing on the central claim that reflection drives plan revisions beyond non-reflective baselines.
[§5.3] §5.3 (Ablation Studies): The ablation results claim that removing reflection substantially degrades sequence quality, but the reported metrics lack error bars, statistical significance tests, or multiple random seeds. Without these, it is impossible to determine whether the observed gains are robust or could arise from variance in the lightweight feedback loop.
[§5.1] §5.1 (Main Results): The paper asserts 'strong language alignment' and 'competitive foldability' relative to direct generation baselines, but supplies no explicit numerical values, baseline definitions, or dataset sizes in the primary comparison table. This absence prevents verification that the plan-execute gap has been meaningfully closed under limited supervision.

minor comments (2)

[Figure 3] Figure 3 caption and §4.1: The notation for the reflection module (e.g., the exact prompt template and how tool outputs are tokenized) is described at a high level; providing the precise template or pseudocode would improve reproducibility.
[§3] §3 (Training Procedure): The transition from supervised trajectories to online RL is outlined, but the reward function combining language alignment and foldability scores is not given an explicit equation; adding Eq. (X) would clarify the objective.

Simulated Author's Rebuttal

3 responses · 1 unresolved

We thank the referee for the insightful comments, which help improve the clarity and rigor of our work. Below we provide point-by-point responses to the major comments and indicate the revisions we will make.

read point-by-point responses

Referee: [§4.2] §4.2 (Tool Environment): The lightweight tools are described as providing feedback on sequence statistics, coarse structure predictors, and rule-based checks, yet no quantitative correlation analysis with experimental outcomes or full-physics simulators is reported. This leaves open whether reflection improves genuine engineering reasoning or merely optimizes for the specific proxy signals, directly bearing on the central claim that reflection drives plan revisions beyond non-reflective baselines.

Authors: We agree that the absence of direct correlation analysis with experimental data or full-physics simulators is a limitation that affects interpretation of whether the reflection mechanism captures genuine engineering principles. Our tools are designed as lightweight emulations of human workflows using accessible predictors, allowing for efficient multi-round planning. We will add a dedicated paragraph in the revised §4.2 discussing the choice of proxies, their known limitations, and references to literature on their correlation with experimental results where available. However, conducting new experimental validations or running full simulators is outside the scope of this work. revision: partial
Referee: [§5.3] §5.3 (Ablation Studies): The ablation results claim that removing reflection substantially degrades sequence quality, but the reported metrics lack error bars, statistical significance tests, or multiple random seeds. Without these, it is impossible to determine whether the observed gains are robust or could arise from variance in the lightweight feedback loop.

Authors: This observation is correct, and we will strengthen the ablation analysis accordingly. In the revised manuscript, we will perform the ablations across multiple random seeds (at least 3-5), include error bars representing standard deviation, and conduct statistical significance testing (e.g., Wilcoxon signed-rank test) to confirm that the improvements due to reflection are statistically significant and not due to variance. revision: yes
Referee: [§5.1] §5.1 (Main Results): The paper asserts 'strong language alignment' and 'competitive foldability' relative to direct generation baselines, but supplies no explicit numerical values, baseline definitions, or dataset sizes in the primary comparison table. This absence prevents verification that the plan-execute gap has been meaningfully closed under limited supervision.

Authors: We acknowledge that the presentation in the main results table could be improved for clarity. We will revise the table in §5.1 to explicitly include all numerical metric values, clearly define the baselines (such as direct generation with fine-tuned LLMs on the same training data), and specify the dataset sizes used for evaluation. Additionally, we will add explanatory text in the section to highlight how these results demonstrate closure of the plan-execute gap under limited supervision. revision: yes

standing simulated objections not resolved

Quantitative correlation analysis between the lightweight tool feedback and experimental outcomes or full-physics simulators, as this would require substantial additional experimental or computational resources not included in the current study.

Circularity Check

0 steps flagged

No significant circularity; empirical claims rest on training procedures and ablations

full rationale

The paper presents ProtoCycle as an agentic LLM framework that couples a planner with a lightweight tool environment and uses reflection on feedback to revise plans, trained via supervised trajectories plus online RL. All load-bearing claims (language alignment, foldability, and reflection-driven gains) are supported by described training, performance metrics, and explicit ablations rather than any derivation, equation, or uniqueness theorem that reduces to the inputs by construction. No self-definitional steps, fitted-input predictions, or load-bearing self-citations appear; the method is a standard empirical pipeline whose results are independently falsifiable via the reported experiments.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

Based solely on the abstract; no detailed free parameters, axioms, or invented entities are specified beyond the high-level framework description.

axioms (1)

domain assumption LLMs can produce coherent plans in text yet fail to reliably realize them as sequences.
Explicitly stated as the motivation for the plan-execute gap.

invented entities (1)

ProtoCycle no independent evidence
purpose: Agentic framework coupling LLM planner with tool environment and reflection for protein design.
Newly proposed system in the paper.

pith-pipeline@v0.9.0 · 5473 in / 1249 out tokens · 43863 ms · 2026-05-10T06:57:45.096831+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

16 extracted references · 8 canonical work pages · 5 internal anchors

[1]

Josh Abramson, Jonas Adler, Jack Dunger, Richard Evans, Tim Green, Alexander Pritzel, Olaf Ron- neberger, Lindsay Willmore, Andrew J

Uniprot: the universal protein knowledgebase in 2025.Nucleic acids research, 53(D1):D609–D617. Josh Abramson, Jonas Adler, Jack Dunger, Richard Evans, Tim Green, Alexander Pritzel, Olaf Ron- neberger, Lindsay Willmore, Andrew J. Ballard, Joshua Bambrick, Sebastian W. Bodenstein, and 1 others. 2024. Accurate structure prediction of biomolecular interaction...

2025
[2]

Language Models are Few-Shot Learners

Rhea, the reaction knowledgebase in 2022. Nucleic acids research, 50(D1):D693–D700. David Binns, Emily Dimmer, Rachael Huntley, Daniel Barrell, Claire O’donovan, and Rolf Apweiler. 2009. Quickgo: a web-based tool for gene ontology search- ing.Bioinformatics, 25(22):3045–3046. Matthias Blum, Antonina Andreeva, Laise Cavalcanti Florentino, Sara Rocio Chugur...

work page internal anchor Pith review arXiv 2022
[3]

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

Toward de novo protein design from natural language.bioRxiv. DeepSeek-AI, Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, Xiaokang Zhang, Xingkai Yu, Yu Wu, Z. F. Wu, Zhibin Gou, Zhihong Shao, Zhuoshu Li, Ziyi Gao, and 181 others. 2025. DeepSeek-R1: Incentivizing Reasoning Capability i...

work page internal anchor Pith review Pith/arXiv arXiv 2025
[4]

Qwen2.5 Technical Report

Qwen2.5 Technical Report.arXiv preprint. ArXiv:2412.15115. Alexander Rives, Joshua Meier, Tom Sercu, Siddharth Goyal, Zeming Lin, Jason Liu, Demi Guo, Myle Ott, C. Lawrence Zitnick, Jerry Ma, and Rob Fergus

work page internal anchor Pith review Pith/arXiv arXiv
[5]

DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences.Proceedings of the National Academy of Sciences, 118(15):e2016239118. Timo Schick, Jane Dwivedi-Yu, Roberto Dessì, Roberta Raileanu, Maria Lomeli, Eric Hambro, Luke Zettle- moyer, Nicola Cancedda, and Thomas Scialom. 2023. Toolformer: Language mode...

work page internal anchor Pith review Pith/arXiv arXiv 2023
[6]

Finetuned Language Models Are Zero-Shot Learners

A trimodal protein language model enables advanced protein searches.Nature Biotechnology. Yajie Wang, Pu Xue, Mingfeng Cao, Tianhao Yu, Stephan T. Lane, and Huimin Zhao. 2021. Directed Evolution: Methodologies and Applications.Chemi- cal Reviews, 121(20):12384–12444. Jason Wei, Maarten Bosma, Vincent Y Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du,...

work page internal anchor Pith review arXiv 2021
[7]

Advancing multimodal reasoning via reinforcement learning with cold start.arXiv preprint arXiv:2505.22334, 2025

Advancing multimodal reasoning via rein- forcement learning with cold start.arXiv preprint arXiv:2505.22334. Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik R Narasimhan, and Yuan Cao. 2022. React: Synergizing reasoning and acting in language models. InThe eleventh international conference on learning representations. Zhaochen Yu, Ling Y...

work page arXiv 2022
[8]

binds Zn 2+ with a Cys 2His2 coordination motif

Molreasoner: Toward effective and inter- pretable reasoning for molecular llms.arXiv preprint arXiv:2508.02066. A Token-level Uncertainty Estimation with LogTokU We follow the Logits-induced Token Uncertainty (LogTokU) framework (Ma et al., 2025) to obtain a token-level uncertainty score for each generated token. At decoding step t, the language model pro...

work page arXiv 2025
[9]

The protein should have a specific function in intein-mediated protein splicing, cell-cell signaling and should be localized to plasma membrane, endoplasmic reticulum membrane, Golgi membrane in order to carry out this function efficiently
[10]

specific function in intein-mediated protein splicing, cell-cell signaling

The designed protein have Hint domains with specific sequence motifs that are crucial for its activity. [ASSISTANT] <think> [Requirement decomposition] - general function: "specific function in intein-mediated protein splicing, cell-cell signaling". - pathway: "intein-mediated protein splicing\", "cell-cell signaling" - domain: "Hint domains". - go: "plas...

2024
[11]

call get_score once to aggregate and re-score all known sequences, and then
[12]

no new sequences

in the NEXT turn, produce a FINAL STEP containing only <answer>. - If tools return OBSERVATIONs like "no new sequences" or num_sequences_scored=0, treat that call as FAILED: * in <think>, diagnose why (argument too long, wrong type, not a real motif/cofactor, etc.), * then adjust arguments (simplify or clean them) or switch to a more robust Stage-1 tool. ...

work page arXiv 2025
[13]

The case highlights two behaviors

and use InterProScan (Jones et al., 2014) to identify key functional regions, which are then annotated on the folded structure. The case highlights two behaviors. (i)Replan- ning under tool failure: ProtoCycle initially fol- lows a GO-term scaffold search plus functional- site refinement, but when themotif2constraints call fails in Round 3, it immediately...

2014
[14]

The heme b binding site should be stable and able to withstand changes in environmental conditions
[15]

For general function, the protein need meet that Involved in oxygen transport from the gills to the various peripheral tissues
[16]

Involved in oxygen transport from the gills to the various peripheral tissues

The protein can be targeted to hemoglobin complex and perform heme binding, oxygen binding. [ASSISTANT] <think> [Requirement decomposition] - general function: "Involved in oxygen transport from the gills to the various peripheral tissues." - pathway: "oxygen transport pathway." - co-factor: "heme b binding site." - go: "oxygen transport," "heme binding,"...