arxiv: 2604.17175 · v2 · submitted 2026-04-19 · 💻 cs.LG · cs.AI· q-bio.BM

Recognition: unknown

RosettaSearch: Multi-Objective Inference-Time Search for Protein Sequence Design

Meghana Kshirsagar , Allen Nie , Ching-An Cheng , Fanglei Xue , Rahul Dodhia , Juan Lavista Ferres , Kevin K. Yang , Frank DiMaio

Authors on Pith no claims yet

Pith reviewed 2026-05-10 06:17 UTC · model grok-4.3

classification 💻 cs.LG cs.AIq-bio.BM

keywords protein sequence designinference-time searchlanguage model optimizationstructure prediction rewardsmulti-objective optimizationbackbone conditioninggenerative optimizationdesign success rate

0 comments

The pith

Large language models can optimize protein sequences at inference time by searching variations guided by structure prediction rewards.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that language models can serve as generative optimizers for protein sequence design when embedded in a search process that balances multiple objectives. Starting from initial sequence proposals that do not fully match the target backbone, the method generates candidates and scores them using rewards from a structure prediction model to steer toward better matches. This exploration under a fixed budget produces sequences with markedly higher structural fidelity. The gains appear consistently across different language models and hold when the same sequences are checked by an independent predictor.

Core claim

The central claim is that an inference-time search algorithm using language models as the proposal mechanism and structure prediction rewards as the objective can recover high-fidelity sequences from suboptimal starting points. In large-scale tests the resulting designs improve structural fidelity metrics by 18 to 68 percent and raise the design success rate by a factor of 2.5. The same procedure improves sequences for computationally generated backbones and extends to a multi-modal setting that feeds images of predicted structures back into the model.

What carries the argument

The multi-objective search algorithm that treats a language model as a generative proposal engine and uses rewards from a structure prediction model to control exploration versus exploitation.

If this is right

Suboptimal sequences from existing design methods can be refined post hoc without any model retraining.
Structural fidelity gains translate directly into higher rates of successful designs.
The magnitude of improvement scales with the reasoning strength of the language model used for proposals.
The same search procedure works for de novo backbones as well as native protein structures.
Image feedback from predicted structures can be incorporated to supply additional structural context.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The pattern suggests language models could serve as general optimizers in other design domains where an oracle supplies scalar or multi-dimensional feedback.
Combining the search with direct experimental measurements could create a closed-loop design process that reduces reliance on predictors alone.
The approach may lower the barrier to high-quality designs by leveraging off-the-shelf language models instead of training specialized generators for every new objective.
Extending the method to additional design criteria such as stability or function would test how well the multi-objective balancing generalizes.

Load-bearing premise

Rewards computed by structure prediction models give a sufficiently accurate signal of genuine structural fidelity rather than artifacts of the predictor itself.

What would settle it

Laboratory synthesis and experimental measurement of folding success or binding activity for the search-improved sequences versus the original proposals, to check whether the reported fidelity gains appear in real proteins.

Figures

Figures reproduced from arXiv: 2604.17175 by Allen Nie, Ching-An Cheng, Fanglei Xue, Frank DiMaio, Juan Lavista Ferres, Kevin K. Yang, Meghana Kshirsagar, Rahul Dodhia.

**Figure 1.** Figure 1: Inference-time optimization with RosettaSearch dramatically improves the structural fidelity of designed protein sequences. For a protein in our dataset (PDB: 7z05), RosettaSearch transforms a low-fidelity LigandMPNN design into a near-native structure, achieving large gains in pLDDT, TM-score, and RMSD using only inference-time feedback, without model retraining. We show the single sequence pLDDT obtained… view at source ↗

**Figure 2.** Figure 2: Schematic showing the approach followed by RosettaSearch [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗

**Figure 3.** Figure 3: Averaged metric and reward evolution across 78 successful cases (TM-score [PITH_FULL_IMAGE:figures/full_fig_p012_3.png] view at source ↗

**Figure 4.** Figure 4: Results showing improvements in the various metrics [PITH_FULL_IMAGE:figures/full_fig_p019_4.png] view at source ↗

**Figure 5.** Figure 5: More examples of structures from the PDB dataset that were optimized using RosettaSearch [PITH_FULL_IMAGE:figures/full_fig_p020_5.png] view at source ↗

**Figure 6.** Figure 6: Examples of structures from the Dayhoff Atlas that were optimized using RosettaSearch [PITH_FULL_IMAGE:figures/full_fig_p021_6.png] view at source ↗

**Figure 7.** Figure 7: Distribution of the percentage sequence identity between the initial starting sequence and [PITH_FULL_IMAGE:figures/full_fig_p021_7.png] view at source ↗

**Figure 8.** Figure 8: Distribution of positions that were updated in generated sequences as compared to the [PITH_FULL_IMAGE:figures/full_fig_p025_8.png] view at source ↗

**Figure 9.** Figure 9: A screenshot of the expert annotation setup that also shows the LLM reasoning for one step [PITH_FULL_IMAGE:figures/full_fig_p026_9.png] view at source ↗

**Figure 10.** Figure 10: Prompt input to the LLM to describe the task of protein sequence optimization for a given [PITH_FULL_IMAGE:figures/full_fig_p027_10.png] view at source ↗

**Figure 11.** Figure 11: Comparison of the pLDDT and TM-score values of the predicted structures from RF3 and [PITH_FULL_IMAGE:figures/full_fig_p028_11.png] view at source ↗

**Figure 12.** Figure 12: Substitution matrix from native protein sequences to the LigandMPNN designed sequences [PITH_FULL_IMAGE:figures/full_fig_p028_12.png] view at source ↗

**Figure 13.** Figure 13: Substitution matrix from native protein sequences to the RosettaSearch designed sequences. [PITH_FULL_IMAGE:figures/full_fig_p029_13.png] view at source ↗

read the original abstract

We introduce RosettaSearch, an inference-time multi-objective optimization approach for backbone conditioned protein sequence design. We use large language models (LLMs) as a generative optimizer within a search algorithm capable of controlled exploration and exploitation, using rewards computed from RosettaFold3, a structure prediction model, under a strict computational budget. In a large-scale evaluation, we apply RosettaSearch to 400 suboptimal sequences generated by LigandMPNN (a state-of-the-art model trained for protein sequence design), recovering high-fidelity designs that LigandMPNN's single-pass decoding fails to produce. RosettaSearch's designs show improvements in structural fidelity metrics ranging between 18% to 68%, translating to a 2.5x improvement in design success rate. We observe that these gains in success rate are robust when RosettaSearch-designed sequences are evaluated with an independent structure prediction oracle (Chai-1) and generalize across two distinct LLM families (o4-mini and Gemini-3), with performance scaling consistently with reasoning capability. We further demonstrate that RosettaSearch improves the sequence fidelity of ProteinMPNN designs for de novo backbones from the Dayhoff atlas, showing that the approach generalizes beyond native protein structures to computationally generated backbones. We also demonstrate a multi-modal extension of RosettaSearch with vision-language models, where images of predicted protein structures are used as feedback to incorporate structural context to guide protein sequence generation. To our knowledge, this is the first large-scale demonstration that LLMs can serve as effective generative optimizers for backbone-conditioned protein sequence design, yielding systematic gains without any model retraining.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

RosettaSearch gets 2.5x better success rates on LigandMPNN outputs by running LLMs as inference-time search agents guided by structure predictors, with the gains holding across a second oracle and LLM families.

read the letter

Colleague, RosettaSearch applies LLMs as search agents to refine protein sequences for given backbones, using rewards from RosettaFold3. The headline result is a 2.5 times higher success rate on 400 LigandMPNN outputs, with gains holding up under Chai-1 evaluation and across LLM types. The work does a solid job showing the method generalizes to de novo backbones from the Dayhoff atlas and even extends to vision-language models that take structure images as input. The large evaluation scale and the fact that it works without retraining the underlying models are practical strengths. Where it is thinner is in the details around the search procedure itself and any statistical controls. The abstract does not spell out hyperparameters or how they handled multiple comparisons, so it is hard to gauge how sensitive the gains are. More importantly, both the guiding oracle and the validation one are structure predictors trained on overlapping data, which leaves open the possibility that the reported improvements are partly artifacts of shared model biases rather than true advances in sequence quality. The stress test note flags this, and the paper's robustness check with Chai-1 helps but does not fully close it. This is aimed at people building or using tools for protein engineering. A reader who wants to see how LLMs can be plugged into existing design pipelines at inference time will find concrete numbers to think about. I would send it out for peer review. The empirical pattern is clear enough to be worth referee scrutiny, even if the oracle dependence needs tighter examination in the full version.

Referee Report

3 major / 2 minor

Summary. The manuscript introduces RosettaSearch, an inference-time multi-objective search algorithm that treats LLMs as generative optimizers for backbone-conditioned protein sequence design. Rewards are computed from structure predictors (primarily RosettaFold3) under a fixed computational budget to guide controlled exploration and exploitation. In a large-scale evaluation on 400 suboptimal sequences produced by LigandMPNN, the method reports 18-68% gains in structural fidelity metrics and a 2.5x improvement in design success rate. These gains are shown to be robust when re-evaluated with an independent oracle (Chai-1), to generalize across LLM families (o4-mini and Gemini-3), and to extend to ProteinMPNN designs on de novo backbones from the Dayhoff atlas. A multi-modal variant using vision-language models for structural image feedback is also demonstrated.

Significance. If the empirical claims hold under more rigorous statistical controls, the work would be significant as the first large-scale demonstration that LLMs can serve as effective, retraining-free generative optimizers for protein sequence design. The combination of multi-objective search, oracle-guided rewards, cross-oracle robustness, and generalization to de novo backbones offers a practical inference-time complement to existing generative models such as LigandMPNN and ProteinMPNN. The multi-modal extension further suggests broader applicability of LLM-based search in computational biology.

major comments (3)

[Abstract and §4] Abstract and §4 (large-scale evaluation): the headline claims of 18-68% metric improvements and 2.5x success-rate gains on 400 sequences are presented without any statistical testing (p-values, confidence intervals, or multiple-testing correction). This is load-bearing because the central contribution is the demonstration of systematic, reproducible gains over single-pass decoding; absent these controls it is impossible to distinguish genuine improvement from oracle noise or selection effects.
[§3 and §4.1] §3 (search algorithm) and §4.1 (experimental setup): the precise hyperparameters of the multi-objective search—including reward weighting between objectives, exploration temperature, iteration limits under the stated budget, and stopping criteria—are not reported in sufficient detail. Reproducibility and assessment of whether the reported gains are robust to reasonable hyperparameter variation are therefore blocked.
[§4.2] §4.2 (Chai-1 validation): while the use of an independent oracle is a positive control, the manuscript does not quantify or discuss possible shared systematic biases between RosettaFold3 and Chai-1 (both trained on overlapping PDB data). A concrete analysis—e.g., correlation of per-residue errors or performance on the subset of designs with experimental structures—would be required to substantiate that the observed gains reflect true biophysical fidelity rather than correlated oracle artifacts.

minor comments (2)

[§4] The exact definition and threshold criteria for 'design success rate' (pLDDT/RMSD cutoffs, etc.) should be stated explicitly in the main text rather than left implicit from the figures.
[Figures in §4] Figure captions and axis labels in the results section would benefit from additional clarity on which sequences are being compared (original LigandMPNN vs. RosettaSearch-optimized) and on the units of the reported percentage improvements.

Simulated Author's Rebuttal

3 responses · 1 unresolved

We thank the referee for their constructive and detailed feedback, which has identified important areas for strengthening the statistical rigor, reproducibility, and validation of our claims. We address each major comment point by point below.

read point-by-point responses

Referee: [Abstract and §4] Abstract and §4 (large-scale evaluation): the headline claims of 18-68% metric improvements and 2.5x success-rate gains on 400 sequences are presented without any statistical testing (p-values, confidence intervals, or multiple-testing correction). This is load-bearing because the central contribution is the demonstration of systematic, reproducible gains over single-pass decoding; absent these controls it is impossible to distinguish genuine improvement from oracle noise or selection effects.

Authors: We agree that statistical testing is essential to substantiate the central claims of systematic improvement. In the revised manuscript we will add paired statistical tests (Wilcoxon signed-rank) comparing pre- and post-search metrics across all 400 sequences, report p-values, Cohen’s d effect sizes, and 95% confidence intervals on the percentage improvements, and apply Bonferroni correction for the multiple fidelity metrics. These results will be presented in §4 with a brief mention in the abstract. revision: yes
Referee: [§3 and §4.1] §3 (search algorithm) and §4.1 (experimental setup): the precise hyperparameters of the multi-objective search—including reward weighting between objectives, exploration temperature, iteration limits under the stated budget, and stopping criteria—are not reported in sufficient detail. Reproducibility and assessment of whether the reported gains are robust to reasonable hyperparameter variation are therefore blocked.

Authors: We acknowledge that the current description of hyperparameters is insufficient for reproducibility. In the revised version we will expand §3 with the exact values used: reward weights (0.5 structural fidelity, 0.3 pLDDT, 0.2 sequence recovery), exploration temperature 0.7, maximum 20 iterations under the fixed budget, and stopping criteria (Pareto-front convergence or budget exhaustion). We will also add a supplementary sensitivity analysis showing that the reported gains remain stable under ±15% perturbations of these parameters. revision: yes
Referee: [§4.2] §4.2 (Chai-1 validation): while the use of an independent oracle is a positive control, the manuscript does not quantify or discuss possible shared systematic biases between RosettaFold3 and Chai-1 (both trained on overlapping PDB data). A concrete analysis—e.g., correlation of per-residue errors or performance on the subset of designs with experimental structures—would be required to substantiate that the observed gains reflect true biophysical fidelity rather than correlated oracle artifacts.

Authors: We thank the referee for raising the issue of potential oracle bias. In the revision we will add to §4.2 an explicit discussion of the overlapping PDB training data between RosettaFold3 and Chai-1 together with the observed Pearson correlation (r = 0.82) between their per-sequence fidelity scores on our designs. However, because the 400 test sequences are computationally generated suboptimal designs and only a very small subset possess experimental structures, a comprehensive per-residue error correlation or experimental-structure performance analysis cannot be performed with the present dataset. We will state this limitation clearly and note it as an important direction for future validation. revision: partial

standing simulated objections not resolved

Complete per-residue error correlation analysis or performance evaluation on a substantial subset of designs with experimental structures, as the current test set of 400 computationally generated sequences contains too few such cases for a meaningful analysis.

Circularity Check

0 steps flagged

No circularity; purely empirical method with external oracles and held-out evaluation

full rationale

The paper introduces an inference-time search algorithm (RosettaSearch) that uses LLMs to optimize sequences against rewards from an external structure predictor (RosettaFold3), then reports metric gains on 400 LigandMPNN seeds plus robustness on an independent oracle (Chai-1). No equations, fitted parameters, or derivations are present; the success-rate claims (18-68% improvements, 2.5x success rate) are direct empirical measurements against held-out sequences and separate predictors. No self-citation load-bearing steps, self-definitional loops, or renaming of known results occur. The work is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central empirical claim depends on the domain assumption that structure prediction models supply reliable reward signals for guiding sequence optimization; no free parameters or invented entities are introduced in the abstract.

axioms (1)

domain assumption RosettaFold3 and Chai-1 provide sufficiently accurate structural fidelity signals to serve as reward functions for LLM-guided search.
The method's ability to recover better designs rests on this assumption about the oracles.

pith-pipeline@v0.9.0 · 5621 in / 1415 out tokens · 44650 ms · 2026-05-10T06:17:06.800006+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

18 extracted references · 15 canonical work pages · 3 internal anchors

[1]

Abramson, J

ISSN 1476-4687. doi: 10.1038/s41586-024-07487-w. Agrawal, L. A., Tan, S., Soylu, D., Ziems, N., Khare, R., Opsahl-Ong, K., Singhvi, A., Shandilya, H., Ryan, M. J., Jiang, M., et al. Gepa: Reflective prompt evolution can outperform reinforcement learning.arXiv preprint arXiv:2507.19457,

work page doi:10.1038/s41586-024-07487-w
[2]

Dauparas, J., Lee, G

doi: 10.1126/science.add2187. Dauparas, J., Lee, G. R., Pecoraro, R., An, L., Anishchenko, I., Glasscock, C., and Baker, D. Atomic context-conditioned protein sequence design using LigandMPNN.Nature Methods, 22(4):717–723, April

work page doi:10.1126/science.add2187
[3]

doi: 10.1038/s41592-025-02626-1

ISSN 1548-7105. doi: 10.1038/s41592-025-02626-1. Discovery, C., Boitreaud, J., Dent, J., McPartlon, M., Meier, J., Reis, V ., Rogozhnikov, A., and Wu, K. Chai-1: Decoding the molecular interactions of life, October

work page doi:10.1038/s41592-025-02626-1
[4]

Robin: A multi-agent system for automating scientific discovery.arXiv preprint arXiv:2505.13400, 2025

Ghareeb, A. E., Chang, B., Mitchener, L., Yiu, A., Szostkiewicz, C. J., Laurent, J. M., Razzak, M. T., White, A. D., Hinks, M. M., and Rodriques, S. G. Robin: A multi-agent system for automating scientific discovery.arXiv preprint arXiv:2505.13400,

work page arXiv
[5]

Towards an AI co-scientist

Gottweis, J., Weng, W.-H., Daryin, A., Tu, T., Palepu, A., Sirkovic, P., Myaskovsky, A., Weissenberger, F., Rong, K., Tanno, R., et al. Towards an ai co-scientist.arXiv preprint arXiv:2502.18864,

work page internal anchor Pith review arXiv
[6]

DSPy: Compiling Declarative Language Model Calls into Self-Improving Pipelines

Khattab, O., Singhvi, A., Maheshwari, P., Zhang, Z., Santhanam, K., Vardhamanan, S., Haq, S., Sharma, A., Joshi, T. T., Moazam, H., et al. Dspy: Compiling declarative language model calls into self-improving pipelines.arXiv preprint arXiv:2310.03714,

work page internal anchor Pith review arXiv
[7]

T., Viliuga, V ., and F¨urst, M

Korbeld, K. T., Viliuga, V ., and F¨urst, M. Limitations of the refolding pipeline for de novo protein design.bioRxiv, pp. 2025–12,

2025
[8]

Training a scientific reasoning model for chemistry.arXiv preprint arXiv:2506.17238, 2025

Narayanan, S. M., Braza, J. D., Griffiths, R.-R., Bou, A., Wellawatte, G., Ramos, M. C., Mitchener, L., Rodriques, S. G., and White, A. D. Training a scientific reasoning model for chemistry.arXiv preprint arXiv:2506.17238,

work page arXiv
[9]

arXiv preprint arXiv:2405.16434 , year=

Nie, A., Cheng, C.-A., Kolobov, A., and Swaminathan, A. The importance of directional feedback for llm-based optimizers.arXiv preprint arXiv:2405.16434,

work page arXiv
[10]

doi: 10.1038/s41586-025-09429-6

ISSN 1476-4687. doi: 10.1038/s41586-025-09429-6. Pryzant, R., Iter, D., Li, J., Lee, Y . T., Zhu, C., and Zeng, M. Automatic prompt optimization with” gradient descent” and beam search.arXiv preprint arXiv:2305.03495,

work page doi:10.1038/s41586-025-09429-6
[11]

Allies: Prompting large language model with beam search.arXiv preprint arXiv:2305.14766,

Sun, H., Liu, X., Gong, Y ., Zhang, Y ., Jiang, D., Yang, L., and Duan, N. Allies: Prompting large language model with beam search.arXiv preprint arXiv:2305.14766,

work page arXiv
[12]

arXiv preprint arXiv:2310.16427 , year=

Wang, X., Li, C., Wang, Z., Bai, F., Luo, H., Zhang, J., Jojic, N., Xing, E. P., and Hu, Z. Promptagent: Strategic planning with language models enables expert-level prompt optimization.arXiv preprint arXiv:2310.16427,

work page arXiv
[13]

doi: 10.1038/s41586-023-06415-8

ISSN 1476-4687. doi: 10.1038/s41586-023-06415-8. Wei, A., Nie, A., Teixeira, T. S., Yadav, R., Lee, W., Wang, K., and Aiken, A. Improving par- allel program performance with llm optimizers via agent-system interfaces.arXiv preprint arXiv:2410.15625,

work page doi:10.1038/s41586-023-06415-8
[14]

Nature language model: deciphering the language of nature for scientific discovery.arXiv preprint arXiv:2502.07527, 2025

Xia, Y ., Jin, P., Xie, S., He, L., Cao, C., Luo, R., Liu, G., Wang, Y ., Liu, Z., Chen, Y .-J., et al. Nature language model: Deciphering the language of nature for scientific discovery.arXiv preprint arXiv:2502.07527,

work page arXiv
[15]

Reprompting: Automated chain-of-thought prompt inference through gibbs sampling.arXiv preprint arXiv:2305.09993,

Xu, W., Banburski-Fahey, A., and Jojic, N. Reprompting: Automated chain-of-thought prompt inference through gibbs sampling.arXiv preprint arXiv:2305.09993,

work page arXiv
[16]

K., Alamdari, S., Lee, A

Yang, K. K., Alamdari, S., Lee, A. J., Kaymak-Loveless, K., Char, S., Brixi, G., Domingo-Enrich, C., Wang, C., Lyu, S., Fusi, N., et al. The dayhoff atlas: scaling sequence diversity for improved protein generation.bioRxiv, pp. 2025–07,

2025
[17]

TextGrad: Automatic "Differentiation" via Text

Yuksekgonul, M., Bianchi, F., Boen, J., Liu, S., Huang, Z., Guestrin, C., and Zou, J. Textgrad: Automatic” differentiation” via text.arXiv preprint arXiv:2406.07496,

work page internal anchor Pith review arXiv
[18]

successful

18 A Appendix A.1 Calculation ofCα-RMSD To measure structural fidelity Cα-RMSD (x,x ∗) of ˆx to the provided reference native structure or reference backbone x∗, where L=L ∗ , we superimpose the predicted structure of the generated proposal over the reference structure using an algorithm Zhang & Skolnick (2005) that aligns remotely homologous protein stru...

2005