pith. sign in

arxiv: 2605.26690 · v1 · pith:H2JLHTPVnew · submitted 2026-05-26 · 💻 cs.LG · cs.AI· q-bio.QM

Self-Improvement Imitation with Biologically Guided Search for Protein Design Under Oracle Budgets

Pith reviewed 2026-06-29 19:18 UTC · model grok-4.3

classification 💻 cs.LG cs.AIq-bio.QM
keywords protein designimitation learningactive learningoracle budgetfitness landscapemutation policyself-improvement
0
0 comments X

The pith

SILO achieves the highest maximum and top-100 mean fitness on all eight protein fitness landscapes by self-improvement imitation on guided edit trajectories.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents SILO as a method for optimizing protein sequences when only a small number of true evaluations can be performed. It trains a policy that proposes mutations by first choosing a position and then a residue, samples trajectories with stochastic beam search, picks which ones to evaluate using a proxy model ensemble and alanine scanning, and then retrains the policy by imitating the best evaluated trajectories. This loop is repeated, and the method is tested against five prior approaches on eight different protein landscapes. A reader would care because many real protein design problems are limited by the cost or speed of laboratory measurements, so any approach that extracts more signal from each evaluation could speed up useful discoveries.

Core claim

SILO is a trajectory-level self-improvement imitation framework for oracle-budgeted protein design that decomposes mutations into a hierarchical edit policy, samples candidate trajectories with incremental stochastic beam search without replacement, selects candidates for oracle evaluation via a UCB-based proxy ensemble combined with an alanine-scan fitness score, and updates the policy by next-action cross-entropy imitation on the round's best oracle-labeled trajectories, achieving the highest maximum and top-100 mean fitness on all eight reproduced protein fitness landscapes and remaining competitive in low-data and noisy-proxy settings.

What carries the argument

The hierarchical edit policy updated by imitation on oracle-labeled trajectories that are selected by UCB proxy ensemble plus alanine-scan fitness score after stochastic beam search sampling.

If this is right

  • SILO stays competitive or best when data are scarce or the proxy models are noisy, while several baselines degrade.
  • Ablation results indicate that stochastic beam search combined with the alanine-scan score accounts for most of the performance lift.
  • The iterative imitation step supplies additional gains on top of the search procedure alone.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the selection step generalizes beyond the eight tested landscapes, the method could lower the total number of wet-lab measurements required in practical protein engineering projects.
  • Replacing the single-round imitation with a longer-horizon planning step might improve performance on landscapes that require many sequential edits.
  • Varying the noise level or correlation structure of the proxy ensemble could identify the regimes where the UCB selection is most or least reliable.

Load-bearing premise

The UCB-based proxy ensemble together with the alanine-scan fitness score reliably identifies candidates whose edits are functionally relevant enough to justify oracle evaluation.

What would settle it

Evaluating SILO and the same five baselines on a fresh collection of protein fitness landscapes and finding that SILO no longer records the highest maximum or top-100 mean fitness values.

Figures

Figures reproduced from arXiv: 2605.26690 by Ashima Khanna, Dominik Grimm.

Figure 1
Figure 1. Figure 1: Active learning with SILO. We define the following four steps within our active learning loop: (A) A proxy ensemble fϕ(x) is trained on the current dataset Dn−1. (B) The policy πθ generates candidate mutated sequences by sampling action trajectories (position then residue) with incremental stochastic beam search. The proxy scores the generated sequences with the objective S(x) (Equation 5). (C) Top K candi… view at source ↗
Figure 2
Figure 2. Figure 2: Overview of the policy architecture. An input sequence is encoded by a frozen ESM Cambrian model and the embeddings are processed by a transformer based policy. Two policy heads use latent representations from the policy to emit logits: h0 over positions (R L) and h1 over residues per position (R (L×|V|) ). The policy is invoked iteratively over dmax edit steps, updating the input sequence after each step.… view at source ↗
Figure 3
Figure 3. Figure 3: Active learning performance across eight protein fitness landscapes. All methods improve over O(xstart) fitness; however, SILO consistently achieves the highest maximum fitness compared to all baselines. High quality sequences are discovered in earlier rounds, indicating more efficient exploration under limited oracle budgets. Shaded regions denote standard deviation over five runs. variants under a fixed … view at source ↗
Figure 4
Figure 4. Figure 4: Performance across low-data (A-B) and noisy settings (C-D). We compare the perfor￾mance of all methods in terms of maximum fitness achieved across 4 tasks. SILO outperforms or matches baselines and remains stable in both settings. baseline performs substantially worse across all tasks, indicating that model-guided candidate genera￾tion is important under limited oracle budgets. SBS outperforms standard BS … view at source ↗
read the original abstract

Protein sequence optimization under tight oracle budgets requires methods that explore vast combinatorial spaces while making each evaluation informative. Existing reinforcement learning and off-policy generative approaches often degrade under surrogate noise, and position-agnostic mutation proposals risk disrupting functionally critical residues. We introduce SILO, a trajectory-level self-improvement imitation framework for oracle-budgeted protein design. SILO uses a hierarchical edit policy that decomposes each mutation into a position choice followed by a residue choice. In each active-learning round, the policy samples candidate trajectories via incremental stochastic beam search without replacement (SBS), and a UCB-based proxy ensemble, combined with an alanine-scan fitness score (AFS), selects candidates with functionally relevant edits for in silico oracle evaluation. The policy is then updated by next-action cross-entropy imitation on the round's best oracle-labeled trajectories, avoiding value-function estimation. Across eight reproduced protein fitness landscapes and five strong baselines from prior work, SILO achieves the highest maximum and top-100 mean fitness on 8 of 8 landscapes within our evaluations, often exhibiting faster early-stage improvement. In low-data and noisy-proxy stress tests on two landscapes per setting, SILO remains competitive or best when several baselines degrade. Ablations show that SBS with AFS account for much of the gains, with iterative imitation providing additional improvement. Code is available at: https://github.com/grimmlab/SILO.git

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces SILO, a self-improvement imitation framework for protein sequence optimization under tight oracle budgets. It decomposes mutations via a hierarchical edit policy (position then residue), samples trajectories with incremental stochastic beam search without replacement (SBS), selects candidates using a UCB-based proxy ensemble combined with an alanine-scan fitness score (AFS), evaluates the top candidates with an in silico oracle, and updates the policy via next-action cross-entropy imitation on the best oracle-labeled trajectories. Across eight reproduced protein fitness landscapes, SILO reports the highest maximum fitness and top-100 mean fitness on all 8 landscapes versus five strong baselines, with faster early improvement and robustness in low-data and noisy-proxy stress tests; ablations attribute much of the gain to SBS with AFS.

Significance. If the empirical claims hold, the approach offers a practical route to oracle-efficient design by grounding updates in external oracle labels rather than fitted value estimates and by incorporating a biologically motivated position-selection heuristic. The open-source code release supports reproducibility. The method's avoidance of value-function estimation and its explicit handling of position importance distinguish it from standard RL or generative baselines in this domain.

major comments (2)
  1. [Ablation studies and active-learning description] The active-learning loop (SBS sampling followed by UCB+AFS selection, then imitation) is load-bearing for the headline claim of consistent outperformance on 8/8 landscapes. The manuscript states that 'SBS with AFS account for much of the gains,' yet provides no direct evidence (e.g., correlation plots or per-landscape ablation) that AFS scores reliably identify positions whose edits drive fitness on the evaluated landscapes; if this correlation is weak or landscape-dependent, the selected trajectories may be uninformative and the subsequent imitation step may reinforce suboptimal policies rather than discover better ones.
  2. [Stress-test experiments] The stress-test results on two landscapes per setting claim that SILO 'remains competitive or best when several baselines degrade,' but the manuscript does not report the exact noise levels, proxy-ensemble variance, or number of oracle calls per round used in those tests; without these details it is impossible to assess whether the UCB+AFS selection remains reliable under the reported conditions.
minor comments (2)
  1. [Experimental protocol] The abstract and main text should explicitly state the total oracle budget per landscape and the number of active-learning rounds so that the 'oracle-budgeted' claim can be directly compared with prior work.
  2. [Method] Notation for the hierarchical policy (position choice followed by residue choice) and the exact form of the next-action imitation loss should be formalized in a single equation or algorithm box for clarity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our ablation analysis and the reporting of stress-test details. We address each major comment below and commit to revisions that strengthen the manuscript.

read point-by-point responses
  1. Referee: [Ablation studies and active-learning description] The active-learning loop (SBS sampling followed by UCB+AFS selection, then imitation) is load-bearing for the headline claim of consistent outperformance on 8/8 landscapes. The manuscript states that 'SBS with AFS account for much of the gains,' yet provides no direct evidence (e.g., correlation plots or per-landscape ablation) that AFS scores reliably identify positions whose edits drive fitness on the evaluated landscapes; if this correlation is weak or landscape-dependent, the selected trajectories may be uninformative and the subsequent imitation step may reinforce suboptimal policies rather than discover better ones.

    Authors: We agree that direct evidence on the correlation between AFS scores and fitness-driving edits would strengthen the justification for the selection step. Our existing ablations compare full SILO against variants without AFS and show clear performance drops, but these are indirect. In the revision we will add per-landscape Spearman correlation plots between AFS and observed fitness deltas on the oracle-evaluated trajectories, plus expanded ablation tables broken down by landscape. This will allow readers to assess the reliability of AFS directly. revision: yes

  2. Referee: [Stress-test experiments] The stress-test results on two landscapes per setting claim that SILO 'remains competitive or best when several baselines degrade,' but the manuscript does not report the exact noise levels, proxy-ensemble variance, or number of oracle calls per round used in those tests; without these details it is impossible to assess whether the UCB+AFS selection remains reliable under the reported conditions.

    Authors: We acknowledge the omission of precise experimental parameters. The stress tests were performed with controlled Gaussian noise added to the proxy predictions and fixed oracle budgets per round; these values are present in the released code but were not stated explicitly in the text. We will add a dedicated paragraph (and table) in the revised manuscript listing the exact noise standard deviations, ensemble variance statistics, and oracle-call counts used in each stress-test setting. revision: yes

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper presents an empirical active-learning method whose policy update step is defined by next-action imitation on trajectories explicitly labeled by an external oracle. This grounding is independent of the model's own value estimates or fitted parameters. The UCB ensemble and AFS are used solely for pre-oracle candidate ranking and do not generate the imitation targets. No equations reduce a claimed prediction to a fitted input by construction, no uniqueness theorems are imported from self-citations, and no ansatz is smuggled via prior work. The reported performance is therefore an external benchmark result rather than a self-referential derivation.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 3 invented entities

Based solely on the abstract, the paper introduces several algorithmic components whose mathematical details and parameter choices are not specified. It relies on standard assumptions in active learning and protein fitness modeling.

axioms (2)
  • domain assumption Reproduced protein fitness landscapes from prior work provide valid test environments for oracle-budgeted optimization.
    The evaluation uses eight such landscapes.
  • domain assumption The oracle returns accurate fitness values for evaluated sequences.
    Central to labeling trajectories for imitation.
invented entities (3)
  • SILO framework no independent evidence
    purpose: trajectory-level self-improvement imitation for oracle-budgeted protein design
    New method introduced by the authors.
  • hierarchical edit policy no independent evidence
    purpose: decomposes mutation into position choice then residue choice
    Core policy structure described in the abstract.
  • alanine-scan fitness score (AFS) no independent evidence
    purpose: guides selection toward functionally relevant edits
    Biological component combined with UCB proxy.

pith-pipeline@v0.9.1-grok · 5785 in / 1623 out tokens · 82108 ms · 2026-06-29T19:18:27.610119+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

65 extracted references · 14 canonical work pages · 4 internal anchors

  1. [1]

    Hindsight experience replay

    Marcin Andrychowicz, Filip Wolski, Alex Ray, Jonas Schneider, Rachel Fong, Peter Welinder, Bob McGrew, Josh Tobin, OpenAI Pieter Abbeel, and Wojciech Zaremba. Hindsight experience replay. Advances in neural information processing systems, 30, 2017

  2. [2]

    Model-based reinforcement learning for biological sequence design

    Christof Angermueller, David Dohan, David Belanger, Ramya Deshpande, Kevin Murphy, and Lucy Colwell. Model-based reinforcement learning for biological sequence design. InInternational conference on learning representations, 2019

  3. [3]

    Flow network based generative models for non-iterative diverse candidate generation.Advances in neural information processing systems, 34:27381–27394, 2021

    Emmanuel Bengio, Moksh Jain, Maksym Korablyov, Doina Precup, and Yoshua Bengio. Flow network based generative models for non-iterative diverse candidate generation.Advances in neural information processing systems, 34:27381–27394, 2021

  4. [4]

    Robustness– epistasis link shapes the fitness landscape of a randomly drifting protein.Nature, 444(7121):929–932, 2006

    Shimon Bershtein, Michal Segal, Roy Bekerman, Nobuhiko Tokuriki, and Dan S Tawfik. Robustness– epistasis link shapes the fitness landscape of a randomly drifting protein.Nature, 444(7121):929–932, 2006

  5. [5]

    Conditioning by adaptive sampling for robust design

    David Brookes, Hahnbeom Park, and Jennifer Listgarten. Conditioning by adaptive sampling for robust design. InInternational conference on machine learning, pages 773–782. PMLR, 2019

  6. [6]

    Design by adaptive sampling.arXiv preprint arXiv:1810.03714, 2018

    David H Brookes and Jennifer Listgarten. Design by adaptive sampling.arXiv preprint arXiv:1810.03714, 2018

  7. [7]

    Self-labeling the job shop scheduling problem.Advances in Neural Information Processing Systems, 37:105528–105551, 2024

    Andrea Corsini, Angelo Porrello, Simone Calderara, and Mauro Dell’Amico. Self-labeling the job shop scheduling problem.Advances in Neural Information Processing Systems, 37:105528–105551, 2024

  8. [8]

    High-resolution epitope mapping of hgh-receptor interactions by alanine-scanning mutagenesis.Science, 244(4908):1081–1085, 1989

    Brian C Cunningham and James A Wells. High-resolution epitope mapping of hgh-receptor interactions by alanine-scanning mutagenesis.Science, 244(4908):1081–1085, 1989

  9. [9]

    Flashattention: Fast and memory- efficient exact attention with io-awareness.Advances in neural information processing systems, 35: 16344–16359, 2022

    Tri Dao, Dan Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. Flashattention: Fast and memory- efficient exact attention with io-awareness.Advances in neural information processing systems, 35: 16344–16359, 2022

  10. [10]

    Arjan G.M

    J. Arjan G.M. de Visser and Joachim Krug. Empirical fitness landscapes and the predictability of evolution.Nature Reviews Genetics, 15(7):480–490, 2014. ISSN 1471-0064. doi: 10.1038/nrg3744. URL http://dx.doi.org/10.1038/nrg3744

  11. [11]

    A comprehensive, high-resolution map of a gene’s fitness landscape.Molecular biology and evolution, 31(6):1581–1592, 2014

    Elad Firnberg, Jason W Labonte, Jeffrey J Gray, and Marc Ostermeier. A comprehensive, high-resolution map of a gene’s fitness landscape.Molecular biology and evolution, 31(6):1581–1592, 2014

  12. [12]

    Deep reinforcement learning that matters

    Peter Henderson, Riashat Islam, Philip Bachman, Joelle Pineau, Doina Precup, and David Meger. Deep reinforcement learning that matters. InProceedings of the AAAI conference on artificial intelligence, volume 32, 2018

  13. [13]

    Searching sequence space by definably random mutagenesis: improving the catalytic potency of an enzyme.Proceedings of the National Academy of Sciences, 87(2): 696–700, January 1990

    J D Hermes, S C Blacklow, and J R Knowles. Searching sequence space by definably random mutagenesis: improving the catalytic potency of an enzyme.Proceedings of the National Academy of Sciences, 87(2): 696–700, January 1990. ISSN 1091-6490. doi: 10.1073/pnas.87.2.696. URL http://dx.doi.org/10. 1073/pnas.87.2.696

  14. [14]

    Capturing the mutational landscape of the beta-lactamase tem-1.Proceedings of the National Academy of Sciences, 110(32): 13067–13072, 2013

    Hervé Jacquier, André Birgy, Hervé Le Nagard, Yves Mechulam, Emmanuelle Schmitt, Jérémy Glodt, Beatrice Bercot, Emmanuelle Petit, Julie Poulain, Guilène Barnaud, et al. Capturing the mutational landscape of the beta-lactamase tem-1.Proceedings of the National Academy of Sciences, 110(32): 13067–13072, 2013

  15. [15]

    Biological sequence design with gflownets

    Moksh Jain, Emmanuel Bengio, Alex Hernandez-Garcia, Jarrid Rector-Brooks, Bonaventure FP Dossou, Chanakya Ajit Ekbote, Jie Fu, Tianyu Zhang, Michael Kilgour, Dinghuai Zhang, et al. Biological sequence design with gflownets. InInternational conference on machine learning, pages 9786–9801. PMLR, 2022

  16. [16]

    Sanjeevaiah Kasturi, Ako Kihara, David FitzGerald, and Ira Pastan. Alanine scanning mutagenesis identifies surface amino acids on domain ii of pseudomonas exotoxin required for cytotoxicity, proper folding, and secretion into periplasm.Journal of Biological Chemistry, 267(32):23427–23433, 1992. 10

  17. [17]

    Improved off-policy reinforcement learning in biological sequence design.arXiv preprint arXiv:2410.04461, 2024

    Hyeonah Kim, Minsu Kim, Taeyoung Yun, Sanghyeok Choi, Emmanuel Bengio, Alex Hernández-García, and Jinkyoo Park. Improved off-policy reinforcement learning in biological sequence design.arXiv preprint arXiv:2410.04461, 2024

  18. [18]

    Adam: A Method for Stochastic Optimization

    Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization.arXiv preprint arXiv:1412.6980, 2014

  19. [19]

    Trade-offs between enzyme fitness and solubility illuminated by deep mutational scanning.Proceedings of the National Academy of Sciences, 114(9):2265–2270, 2017

    Justin R Klesmith, John-Paul Bacik, Emily E Wrenbeck, Ryszard Michalczyk, and Timothy A Whitehead. Trade-offs between enzyme fitness and solubility illuminated by deep mutational scanning.Proceedings of the National Academy of Sciences, 114(9):2265–2270, 2017

  20. [20]

    Prospero: Active learning for robust protein design beyond wild-type neighborhoods.arXiv preprint arXiv:2505.22494, 2025

    Michal Kmicikiewicz, Vincent Fortuin, and Ewa Szczurek. Prospero: Active learning for robust protein design beyond wild-type neighborhoods.arXiv preprint arXiv:2505.22494, 2025

  21. [21]

    Bandit based monte-carlo planning

    Levente Kocsis and Csaba Szepesvári. Bandit based monte-carlo planning. InEuropean conference on machine learning, pages 282–293. Springer, 2006

  22. [22]

    Stochastic beams and where to find them: The gumbel- top-k trick for sampling sequences without replacement

    Wouter Kool, Herke Van Hoof, and Max Welling. Stochastic beams and where to find them: The gumbel- top-k trick for sampling sequences without replacement. InInternational conference on machine learning, pages 3499–3508. PMLR, 2019

  23. [23]

    Recent advances in (therapeutic protein) drug development

    HA Daniel Lagassé, Aikaterini Alexaki, Vijaya L Simhadri, Nobuko H Katagiri, Wojciech Jankowski, Zuben E Sauna, and Chava Kimchi-Sarfaty. Recent advances in (therapeutic protein) drug development. F1000Research, 6:113, 2017

  24. [24]

    Robust optimization in protein fitness landscapes using reinforcement learning in latent space.arXiv preprint arXiv:2405.18986, 2024

    Minji Lee, Luiz Felipe Vecchietti, Hyunkyu Jung, Hyun Joo Ro, Meeyoung Cha, and Ho Min Kim. Robust optimization in protein fitness landscapes using reinforcement learning in latent space.arXiv preprint arXiv:2405.18986, 2024

  25. [25]

    Highplay: Cyclic peptide sequence design based on reinforcement learning and protein structure prediction.Journal of Medicinal Chemistry, 68(11):12047–12057, 2025

    Huitian Lin, Cheng Zhu, Tianfeng Shang, Ning Zhu, Kang Lin, Chengyun Zhang, Xiang Shao, Xudong Wang, and Hongliang Duan. Highplay: Cyclic peptide sequence design based on reinforcement learning and protein structure prediction.Journal of Medicinal Chemistry, 68(11):12047–12057, 2025

  26. [26]

    Language models of protein sequences at the scale of evolution enable accurate structure prediction.BioRxiv, 2022:500902, 2022

    Zeming Lin, Halil Akin, Roshan Rao, Brian Hie, Zhongkai Zhu, Wenting Lu, Allan dos Santos Costa, Maryam Fazel-Zarandi, Tom Sercu, Sal Candido, et al. Language models of protein sequences at the scale of evolution enable accurate structure prediction.BioRxiv, 2022:500902, 2022

  27. [27]

    Self-improved learning for scalable neural combinatorial optimization.arXiv preprint arXiv:2403.19561, 2024

    Fu Luo, Xi Lin, Zhenkun Wang, Xialiang Tong, Mingxuan Yuan, and Qingfu Zhang. Self-improved learning for scalable neural combinatorial optimization.arXiv preprint arXiv:2403.19561, 2024

  28. [28]

    Deep mutational scanning of an rrm domain of the saccharomyces cerevisiae poly (a)-binding protein.Rna, 19 (11):1537–1551, 2013

    Daniel Melamed, David L Young, Caitlin E Gamble, Christina R Miller, and Stanley Fields. Deep mutational scanning of an rrm domain of the saccharomyces cerevisiae poly (a)-binding protein.Rna, 19 (11):1537–1551, 2013

  29. [29]

    Ray: A distributed framework for emerging {AI} applications

    Philipp Moritz, Robert Nishihara, Stephanie Wang, Alexey Tumanov, Richard Liaw, Eric Liang, Melih Elibol, Zongheng Yang, William Paul, Michael I Jordan, et al. Ray: A distributed framework for emerging {AI} applications. In13th USENIX symposium on operating systems design and implementation (OSDI 18), pages 561–577, 2018

  30. [30]

    New advances in protein engineering for industrial applications: Key takeaways.Open life sciences, 19(1):20220856, 2024

    Giles Obinna Ndochinwa, Qing-Yan Wang, Nkwachukwu Oziamara Okoro, Oyetugo Chioma Amadi, Tochukwu Nwamaka Nwagu, Chukwudi Innocent Nnamchi, Anene Nwabu Moneke, and Arome Solomon Odiba. New advances in protein engineering for industrial applications: Key takeaways.Open life sciences, 19(1):20220856, 2024

  31. [31]

    Pytorch: An imperative style, high-performance deep learning library.Advances in neural information processing systems, 32, 2019

    Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. Pytorch: An imperative style, high-performance deep learning library.Advances in neural information processing systems, 32, 2019

  32. [32]

    Self-improvement for neural combinatorial optimization: Sample without replacement, but improvement.arXiv preprint arXiv:2403.15180, 2024

    Jonathan Pirnay and Dominik G Grimm. Self-improvement for neural combinatorial optimization: Sample without replacement, but improvement.arXiv preprint arXiv:2403.15180, 2024

  33. [33]

    Graphxform: graph transformer for computer-aided molecular design.Digital Discovery, 4(4):1052–1065, 2025

    Jonathan Pirnay, Jan G Rittig, Alexander B Wolf, Martin Grohe, Jakob Burger, Alexander Mitsos, and Dominik G Grimm. Graphxform: graph transformer for computer-aided molecular design.Digital Discovery, 4(4):1052–1065, 2025

  34. [34]

    Evaluating protein transfer learning with tape.Advances in neural information processing systems, 32, 2019

    Roshan Rao, Nicholas Bhattacharya, Neil Thomas, Yan Duan, Peter Chen, John Canny, Pieter Abbeel, and Yun Song. Evaluating protein transfer learning with tape.Advances in neural information processing systems, 32, 2019. 11

  35. [35]

    Reisenbauer, Kathleen M

    Julia C. Reisenbauer, Kathleen M. Sicinski, and Frances H. Arnold. Catalyzing the future: recent advances in chemical synthesis using enzymes.Current Opinion in Chemical Biology, 83:102536, 2024. ISSN 1367-5931. doi: https://doi.org/10.1016/j.cbpa.2024.102536. URL https://www.sciencedirect.com/ science/article/pii/S1367593124001121

  36. [36]

    Proximal exploration for model- guided protein sequence design

    Zhizhou Ren, Jiahan Li, Fan Ding, Yuan Zhou, Jianzhu Ma, and Jian Peng. Proximal exploration for model- guided protein sequence design. InInternational Conference on Machine Learning, pages 18520–18536. PMLR, 2022

  37. [37]

    Local fitness landscape of the green fluorescent protein.Nature, 533(7603):397–401, 2016

    Karen S Sarkisyan, Dmitry A Bolotin, Margarita V Meer, Dinara R Usmanova, Alexander S Mishin, George V Sharonov, Dmitry N Ivankov, Nina G Bozhanova, Mikhail S Baranov, Onuralp Soylemez, et al. Local fitness landscape of the green fluorescent protein.Nature, 533(7603):397–401, 2016

  38. [38]

    Proximal Policy Optimization Algorithms

    John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms.arXiv preprint arXiv:1707.06347, 2017

  39. [39]

    Understanding the effects of dataset characteristics on offline reinforcement learning

    Kajetan Schweighofer, Markus Hofmarcher, Marius-Constantin Dinu, Philipp Renz, Angela Bitto-Nemling, Vihang Prakash Patil, and Sepp Hochreiter. Understanding the effects of dataset characteristics on offline reinforcement learning. InDeep RL Workshop NeurIPS 2021, 2021. URL https://openreview.net/ forum?id=A4EWtf-TO3Y

  40. [40]

    Mastering Chess and Shogi by Self-Play with a General Reinforcement Learning Algorithm

    David Silver, Thomas Hubert, Julian Schrittwieser, Ioannis Antonoglou, Matthew Lai, Arthur Guez, Marc Lanctot, Laurent Sifre, Dharshan Kumaran, Thore Graepel, et al. Mastering chess and shogi by self-play with a general reinforcement learning algorithm.arXiv preprint arXiv:1712.01815, 2017

  41. [41]

    Adalead: A simple and robust adaptive greedy search algorithm for sequence design.arXiv preprint arXiv:2010.02141, 2020

    Sam Sinai, Richard Wang, Alexander Whatley, Stewart Slocum, Elina Locane, and Eric D Kelsic. Adalead: A simple and robust adaptive greedy search algorithm for sequence design.arXiv preprint arXiv:2010.02141, 2020

  42. [42]

    Gaussian Process Optimization in the Bandit Setting: No Regret and Experimental Design

    Niranjan Srinivas, Andreas Krause, Sham M Kakade, and Matthias Seeger. Gaussian process optimization in the bandit setting: No regret and experimental design.arXiv preprint arXiv:0912.3995, 2009

  43. [43]

    Activity-enhancing mutations in an e3 ubiquitin ligase identified by high-throughput mutagenesis.Proceedings of the National Academy of Sciences, 110(14):E1263–E1272, 2013

    Lea M Starita, Jonathan N Pruneda, Russell S Lo, Douglas M Fowler, Helen J Kim, Joseph B Hiatt, Jay Shendure, Peter S Brzovic, Stanley Fields, and Rachel E Klevit. Activity-enhancing mutations in an e3 ubiquitin ligase identified by high-throughput mutagenesis.Proceedings of the National Academy of Sciences, 110(14):E1263–E1272, 2013

  44. [44]

    Esm cambrian: Revealing the mysteries of proteins with unsupervised learning.Evolu- tionaryScale Website, 2024

    ESM Team et al. Esm cambrian: Revealing the mysteries of proteins with unsupervised learning.Evolu- tionaryScale Website, 2024

  45. [45]

    Conservative objective models for effective offline model-based optimization

    Brandon Trabucco, Aviral Kumar, Xinyang Geng, and Sergey Levine. Conservative objective models for effective offline model-based optimization. InInternational Conference on Machine Learning, pages 10358–10368. PMLR, 2021

  46. [46]

    Protein design by directed evolution guided by large language models

    Thanh VT Tran and Truong Son Hy. Protein design by directed evolution guided by large language models. IEEE Transactions on Evolutionary Computation, 29(2):418–428, 2024

  47. [47]

    Attention is all you need.Advances in neural information processing systems, 30, 2017

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need.Advances in neural information processing systems, 30, 2017

  48. [48]

    Self-play reinforcement learning guides protein engineering.Nature Machine Intelligence, 5(8):845–860, 2023

    Yi Wang, Hui Tang, Lichao Huang, Lulu Pan, Lixiang Yang, Huanming Yang, Feng Mu, and Meng Yang. Self-play reinforcement learning guides protein engineering.Nature Machine Intelligence, 5(8):845–860, 2023

  49. [49]

    A framework for exhaustively mapping functional missense variants.Molecular systems biology, 13(12):MSB177908, 2017

    Jochen Weile, Song Sun, Atina G Cote, Jennifer Knapp, Marta Verby, Joseph C Mellor, Yingzhou Wu, Carles Pons, Cassandra Wong, Natascha van Lieshout, et al. A framework for exhaustively mapping functional missense variants.Molecular systems biology, 13(12):MSB177908, 2017

  50. [50]

    Single-mutation fitness landscapes for an enzyme on multiple substrates reveal specificity is globally encoded.Nature communications, 8(1):15695, 2017

    Emily E Wrenbeck, Laura R Azouz, and Timothy A Whitehead. Single-mutation fitness landscapes for an enzyme on multiple substrates reveal specificity is globally encoded.Nature communications, 8(1):15695, 2017. 12 Appendix A Related work 14 B Experiment setting and implementation details 14 B.1 Benchmark datasets . . . . . . . . . . . . . . . . . . . . . ....

  51. [51]

    The dataset D0 containing 15,307 sequences, was generated by Kim et al

    Adeno-associated Virus (AA V):The task focuses on improving binding affinity of an amino acid segment (position 450-540) of the VP1 protein located in the capsid of the Adeno-associated virus. The dataset D0 containing 15,307 sequences, was generated by Kim et al. [17] by randomly mutating the wild-type sequence while filtering out sequences that have hig...

  52. [52]

    Initial dataset D0 contains 6417 sequences with single mutations to model the fitness landscape

    Aliphatic Amide Hydrolase (AMIE):The task aims is optimize amidase sequences for increased enzymatic activity [50]. Initial dataset D0 contains 6417 sequences with single mutations to model the fitness landscape. The length of sequence L= 341. The starting fitness is O(xstart) = 0.224 , and Novelty (D0,x start) = 2

  53. [53]

    Starita et al

    Ubiquitination Factor Ube4b (E4B):The goal is generate sequences that enhance E4B ubiquitination enzyme. Starita et al. [43] measured the rates of ubiquitination of the mutants to the target protein. The full dataset includes 91,032 sequences of length L= 102 , from which [20] randomly selected 10,000 sequences to form D0. The starting sequence has O(xsta...

  54. [54]

    The D0 dataset contains 10,200 sequences generated by [17], similarly to AA V

    Green Fluorescent Protein (GFP):The task seeks to generate sequences with high log-fluorescence intensity [37]. The D0 dataset contains 10,200 sequences generated by [17], similarly to AA V . The sequence length isL=238, withO(x start) =3.572 and Novelty (D 0,x start) = 42.87

  55. [55]

    The initial dataset D0 includes 7,633 sequences of length L= 439, with O(xstart) = 0.020 and Novelty (D0, xstart) = 2.0

    Levoglucosan Kinase (LGK):This task aims to improve enzymatic activity oflevoglucosan kinase, which converts LG to the glycolytic intermediate glucose-6-phosphate [19]. The initial dataset D0 includes 7,633 sequences of length L= 439, with O(xstart) = 0.020 and Novelty (D0, xstart) = 2.0

  56. [56]

    Melamed et al

    Poly(A)-binding Protein (Pab1):The poly(A)-binding protein Pab1 is binds to polyadenosine (poly- A) sequences via its RNA recognition motif (RRM). Melamed et al. [28] conducted a high-throughput screening assay to measure binding fitness for approximately 36,000 double mutants of Pab1 within the RRM region. This task focuses on improving binding affinity ...

  57. [57]

    The aim of this task is to TEM variants with improved thermodynamic stability

    TEM-1 β-Lactamase (TEM):TEM-1 β-Lactamase resistance to penicillin antibiotics inE.coliis widely studied to understand mutational effect and fitness landscape [14], [4]. The aim of this task is to TEM variants with improved thermodynamic stability. The datasetD 0 derived [11] from contains 5,199 sequences of length L= 286. The starting sequence has fitnes...

  58. [58]

    [49] and the goal of this task is to optimize these variants for functional mapping applications

    SUMO E2 Conjugase (UBE2I):Variants of the disease-relevant protein, human SUMO E2 conjugase were generated by Weile et al. [49] and the goal of this task is to optimize these variants for functional mapping applications. D0 comprise of 3,022 sequences of length L= 159. The starting sequence has O(xstart) =2.978 and Novelty (D0,x start) = 2.0. B.2 Proxy ar...

  59. [59]

    [ 41] available at https://github.com/samsinai/FLEXS/tree/master under the Apache-2.0 license

    AdaLead [41]:We employed the open-source implementations provided by Sinai et al. [ 41] available at https://github.com/samsinai/FLEXS/tree/master under the Apache-2.0 license. We used the default hyperparameters of the model, with a recombination rate of 0.2, a mutation rate of 1/L, where L is the sequence length, and a threshold τ= 0.05. The numbers of ...

  60. [60]

    [ 17], we employ an adaptive δ with a maximum masking radius of 0.05 and rank-based proxy training with a reweighting factor k=0.01

    GFN-AL-δCS [ 17]:For the conservative strategy GFN-AL- δCS proposed by Kim et al. [ 17], we employ an adaptive δ with a maximum masking radius of 0.05 and rank-based proxy training with a reweighting factor k=0.01. We set the scaling factor λ= 0.1 for AA V , E4B, and Pab1, and λ= 1 for GFP, AMIE, TEM, UBE2I, and LGK. We use the publically released codebas...

  61. [61]

    Adapting their setup to perform active learning, we perform 10 rounds of surrogate-guided optimization 15 with a population size of 128 and a beam size of 4

    MLDE [46]:We use the MLDE implementation from Tran and Hy [ 46], from their publicly avail- able codebase https://github.com/HySonLab/Directed_Evolution under the GPL-3.0 license. Adapting their setup to perform active learning, we perform 10 rounds of surrogate-guided optimization 15 with a population size of 128 and a beam size of 4. The masking strateg...

  62. [62]

    [ 36] using the official codebase https://github

    PEX [36]:We implement PEX from Ren et al. [ 36] using the official codebase https://github. com/HeliXonProtein/proximal-exploration/tree/main under the Apache-2.0 license. We use the default configuration, including 2 random mutations and a frontier neighborhood size of 5

  63. [63]

    ProSpero [ 20]:We use ProSpero under the official codebase https://github.com/ szczurek-lab/ProSpero.git released by Kmicikiewicz, Fortuin, and Szczurek [ 20] under the GPL-3.0 license. We use the default hyperparameters of the method, keeping the range of number of corruptions introduced within the sequences between 3-10 for shorter sequences (E4B, Pab1,...

  64. [64]

    Novelty:Quantifies the average Hamming distance between the top sequences and the starting sequencex start, reflecting deviation from the wild-type: Novelty(Dtop, xstart) = 1 |Dtop| X x∈Dtop d(x, xstart)(8)

  65. [65]

    Experiments were conducted on an NVIDIA A40 GPU using CUDA 12.2

    Diversity:Measures the average pairwise Hamming distance among the top sequences, indicating the extent of exploration: Diversity(Dtop) = 1 |Dtop|(|Dtop| −1) X x,x′∈Dtop,x̸=x′ d(x, x′)(9) C Discussion C.1 Hardware and runtime details Our code is developed in PyTorch [31] v.2.8.0. Experiments were conducted on an NVIDIA A40 GPU using CUDA 12.2. SILO incurs...