Recognition: unknown
RosettaSearch: Multi-Objective Inference-Time Search for Protein Sequence Design
Pith reviewed 2026-05-10 06:17 UTC · model grok-4.3
The pith
Large language models can optimize protein sequences at inference time by searching variations guided by structure prediction rewards.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that an inference-time search algorithm using language models as the proposal mechanism and structure prediction rewards as the objective can recover high-fidelity sequences from suboptimal starting points. In large-scale tests the resulting designs improve structural fidelity metrics by 18 to 68 percent and raise the design success rate by a factor of 2.5. The same procedure improves sequences for computationally generated backbones and extends to a multi-modal setting that feeds images of predicted structures back into the model.
What carries the argument
The multi-objective search algorithm that treats a language model as a generative proposal engine and uses rewards from a structure prediction model to control exploration versus exploitation.
If this is right
- Suboptimal sequences from existing design methods can be refined post hoc without any model retraining.
- Structural fidelity gains translate directly into higher rates of successful designs.
- The magnitude of improvement scales with the reasoning strength of the language model used for proposals.
- The same search procedure works for de novo backbones as well as native protein structures.
- Image feedback from predicted structures can be incorporated to supply additional structural context.
Where Pith is reading between the lines
- The pattern suggests language models could serve as general optimizers in other design domains where an oracle supplies scalar or multi-dimensional feedback.
- Combining the search with direct experimental measurements could create a closed-loop design process that reduces reliance on predictors alone.
- The approach may lower the barrier to high-quality designs by leveraging off-the-shelf language models instead of training specialized generators for every new objective.
- Extending the method to additional design criteria such as stability or function would test how well the multi-objective balancing generalizes.
Load-bearing premise
Rewards computed by structure prediction models give a sufficiently accurate signal of genuine structural fidelity rather than artifacts of the predictor itself.
What would settle it
Laboratory synthesis and experimental measurement of folding success or binding activity for the search-improved sequences versus the original proposals, to check whether the reported fidelity gains appear in real proteins.
Figures
read the original abstract
We introduce RosettaSearch, an inference-time multi-objective optimization approach for backbone conditioned protein sequence design. We use large language models (LLMs) as a generative optimizer within a search algorithm capable of controlled exploration and exploitation, using rewards computed from RosettaFold3, a structure prediction model, under a strict computational budget. In a large-scale evaluation, we apply RosettaSearch to 400 suboptimal sequences generated by LigandMPNN (a state-of-the-art model trained for protein sequence design), recovering high-fidelity designs that LigandMPNN's single-pass decoding fails to produce. RosettaSearch's designs show improvements in structural fidelity metrics ranging between 18% to 68%, translating to a 2.5x improvement in design success rate. We observe that these gains in success rate are robust when RosettaSearch-designed sequences are evaluated with an independent structure prediction oracle (Chai-1) and generalize across two distinct LLM families (o4-mini and Gemini-3), with performance scaling consistently with reasoning capability. We further demonstrate that RosettaSearch improves the sequence fidelity of ProteinMPNN designs for de novo backbones from the Dayhoff atlas, showing that the approach generalizes beyond native protein structures to computationally generated backbones. We also demonstrate a multi-modal extension of RosettaSearch with vision-language models, where images of predicted protein structures are used as feedback to incorporate structural context to guide protein sequence generation. To our knowledge, this is the first large-scale demonstration that LLMs can serve as effective generative optimizers for backbone-conditioned protein sequence design, yielding systematic gains without any model retraining.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces RosettaSearch, an inference-time multi-objective search algorithm that treats LLMs as generative optimizers for backbone-conditioned protein sequence design. Rewards are computed from structure predictors (primarily RosettaFold3) under a fixed computational budget to guide controlled exploration and exploitation. In a large-scale evaluation on 400 suboptimal sequences produced by LigandMPNN, the method reports 18-68% gains in structural fidelity metrics and a 2.5x improvement in design success rate. These gains are shown to be robust when re-evaluated with an independent oracle (Chai-1), to generalize across LLM families (o4-mini and Gemini-3), and to extend to ProteinMPNN designs on de novo backbones from the Dayhoff atlas. A multi-modal variant using vision-language models for structural image feedback is also demonstrated.
Significance. If the empirical claims hold under more rigorous statistical controls, the work would be significant as the first large-scale demonstration that LLMs can serve as effective, retraining-free generative optimizers for protein sequence design. The combination of multi-objective search, oracle-guided rewards, cross-oracle robustness, and generalization to de novo backbones offers a practical inference-time complement to existing generative models such as LigandMPNN and ProteinMPNN. The multi-modal extension further suggests broader applicability of LLM-based search in computational biology.
major comments (3)
- [Abstract and §4] Abstract and §4 (large-scale evaluation): the headline claims of 18-68% metric improvements and 2.5x success-rate gains on 400 sequences are presented without any statistical testing (p-values, confidence intervals, or multiple-testing correction). This is load-bearing because the central contribution is the demonstration of systematic, reproducible gains over single-pass decoding; absent these controls it is impossible to distinguish genuine improvement from oracle noise or selection effects.
- [§3 and §4.1] §3 (search algorithm) and §4.1 (experimental setup): the precise hyperparameters of the multi-objective search—including reward weighting between objectives, exploration temperature, iteration limits under the stated budget, and stopping criteria—are not reported in sufficient detail. Reproducibility and assessment of whether the reported gains are robust to reasonable hyperparameter variation are therefore blocked.
- [§4.2] §4.2 (Chai-1 validation): while the use of an independent oracle is a positive control, the manuscript does not quantify or discuss possible shared systematic biases between RosettaFold3 and Chai-1 (both trained on overlapping PDB data). A concrete analysis—e.g., correlation of per-residue errors or performance on the subset of designs with experimental structures—would be required to substantiate that the observed gains reflect true biophysical fidelity rather than correlated oracle artifacts.
minor comments (2)
- [§4] The exact definition and threshold criteria for 'design success rate' (pLDDT/RMSD cutoffs, etc.) should be stated explicitly in the main text rather than left implicit from the figures.
- [Figures in §4] Figure captions and axis labels in the results section would benefit from additional clarity on which sequences are being compared (original LigandMPNN vs. RosettaSearch-optimized) and on the units of the reported percentage improvements.
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed feedback, which has identified important areas for strengthening the statistical rigor, reproducibility, and validation of our claims. We address each major comment point by point below.
read point-by-point responses
-
Referee: [Abstract and §4] Abstract and §4 (large-scale evaluation): the headline claims of 18-68% metric improvements and 2.5x success-rate gains on 400 sequences are presented without any statistical testing (p-values, confidence intervals, or multiple-testing correction). This is load-bearing because the central contribution is the demonstration of systematic, reproducible gains over single-pass decoding; absent these controls it is impossible to distinguish genuine improvement from oracle noise or selection effects.
Authors: We agree that statistical testing is essential to substantiate the central claims of systematic improvement. In the revised manuscript we will add paired statistical tests (Wilcoxon signed-rank) comparing pre- and post-search metrics across all 400 sequences, report p-values, Cohen’s d effect sizes, and 95% confidence intervals on the percentage improvements, and apply Bonferroni correction for the multiple fidelity metrics. These results will be presented in §4 with a brief mention in the abstract. revision: yes
-
Referee: [§3 and §4.1] §3 (search algorithm) and §4.1 (experimental setup): the precise hyperparameters of the multi-objective search—including reward weighting between objectives, exploration temperature, iteration limits under the stated budget, and stopping criteria—are not reported in sufficient detail. Reproducibility and assessment of whether the reported gains are robust to reasonable hyperparameter variation are therefore blocked.
Authors: We acknowledge that the current description of hyperparameters is insufficient for reproducibility. In the revised version we will expand §3 with the exact values used: reward weights (0.5 structural fidelity, 0.3 pLDDT, 0.2 sequence recovery), exploration temperature 0.7, maximum 20 iterations under the fixed budget, and stopping criteria (Pareto-front convergence or budget exhaustion). We will also add a supplementary sensitivity analysis showing that the reported gains remain stable under ±15% perturbations of these parameters. revision: yes
-
Referee: [§4.2] §4.2 (Chai-1 validation): while the use of an independent oracle is a positive control, the manuscript does not quantify or discuss possible shared systematic biases between RosettaFold3 and Chai-1 (both trained on overlapping PDB data). A concrete analysis—e.g., correlation of per-residue errors or performance on the subset of designs with experimental structures—would be required to substantiate that the observed gains reflect true biophysical fidelity rather than correlated oracle artifacts.
Authors: We thank the referee for raising the issue of potential oracle bias. In the revision we will add to §4.2 an explicit discussion of the overlapping PDB training data between RosettaFold3 and Chai-1 together with the observed Pearson correlation (r = 0.82) between their per-sequence fidelity scores on our designs. However, because the 400 test sequences are computationally generated suboptimal designs and only a very small subset possess experimental structures, a comprehensive per-residue error correlation or experimental-structure performance analysis cannot be performed with the present dataset. We will state this limitation clearly and note it as an important direction for future validation. revision: partial
- Complete per-residue error correlation analysis or performance evaluation on a substantial subset of designs with experimental structures, as the current test set of 400 computationally generated sequences contains too few such cases for a meaningful analysis.
Circularity Check
No circularity; purely empirical method with external oracles and held-out evaluation
full rationale
The paper introduces an inference-time search algorithm (RosettaSearch) that uses LLMs to optimize sequences against rewards from an external structure predictor (RosettaFold3), then reports metric gains on 400 LigandMPNN seeds plus robustness on an independent oracle (Chai-1). No equations, fitted parameters, or derivations are present; the success-rate claims (18-68% improvements, 2.5x success rate) are direct empirical measurements against held-out sequences and separate predictors. No self-citation load-bearing steps, self-definitional loops, or renaming of known results occur. The work is self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption RosettaFold3 and Chai-1 provide sufficiently accurate structural fidelity signals to serve as reward functions for LLM-guided search.
Reference graph
Works this paper leans on
-
[1]
ISSN 1476-4687. doi: 10.1038/s41586-024-07487-w. Agrawal, L. A., Tan, S., Soylu, D., Ziems, N., Khare, R., Opsahl-Ong, K., Singhvi, A., Shandilya, H., Ryan, M. J., Jiang, M., et al. Gepa: Reflective prompt evolution can outperform reinforcement learning.arXiv preprint arXiv:2507.19457,
-
[2]
doi: 10.1126/science.add2187. Dauparas, J., Lee, G. R., Pecoraro, R., An, L., Anishchenko, I., Glasscock, C., and Baker, D. Atomic context-conditioned protein sequence design using LigandMPNN.Nature Methods, 22(4):717–723, April
-
[3]
doi: 10.1038/s41592-025-02626-1
ISSN 1548-7105. doi: 10.1038/s41592-025-02626-1. Discovery, C., Boitreaud, J., Dent, J., McPartlon, M., Meier, J., Reis, V ., Rogozhnikov, A., and Wu, K. Chai-1: Decoding the molecular interactions of life, October
-
[4]
Ghareeb, A. E., Chang, B., Mitchener, L., Yiu, A., Szostkiewicz, C. J., Laurent, J. M., Razzak, M. T., White, A. D., Hinks, M. M., and Rodriques, S. G. Robin: A multi-agent system for automating scientific discovery.arXiv preprint arXiv:2505.13400,
-
[5]
Gottweis, J., Weng, W.-H., Daryin, A., Tu, T., Palepu, A., Sirkovic, P., Myaskovsky, A., Weissenberger, F., Rong, K., Tanno, R., et al. Towards an ai co-scientist.arXiv preprint arXiv:2502.18864,
work page internal anchor Pith review arXiv
-
[6]
DSPy: Compiling Declarative Language Model Calls into Self-Improving Pipelines
Khattab, O., Singhvi, A., Maheshwari, P., Zhang, Z., Santhanam, K., Vardhamanan, S., Haq, S., Sharma, A., Joshi, T. T., Moazam, H., et al. Dspy: Compiling declarative language model calls into self-improving pipelines.arXiv preprint arXiv:2310.03714,
work page internal anchor Pith review arXiv
-
[7]
T., Viliuga, V ., and F¨urst, M
Korbeld, K. T., Viliuga, V ., and F¨urst, M. Limitations of the refolding pipeline for de novo protein design.bioRxiv, pp. 2025–12,
2025
-
[8]
Training a scientific reasoning model for chemistry.arXiv preprint arXiv:2506.17238, 2025
Narayanan, S. M., Braza, J. D., Griffiths, R.-R., Bou, A., Wellawatte, G., Ramos, M. C., Mitchener, L., Rodriques, S. G., and White, A. D. Training a scientific reasoning model for chemistry.arXiv preprint arXiv:2506.17238,
-
[9]
arXiv preprint arXiv:2405.16434 , year=
Nie, A., Cheng, C.-A., Kolobov, A., and Swaminathan, A. The importance of directional feedback for llm-based optimizers.arXiv preprint arXiv:2405.16434,
-
[10]
doi: 10.1038/s41586-025-09429-6
ISSN 1476-4687. doi: 10.1038/s41586-025-09429-6. Pryzant, R., Iter, D., Li, J., Lee, Y . T., Zhu, C., and Zeng, M. Automatic prompt optimization with” gradient descent” and beam search.arXiv preprint arXiv:2305.03495,
-
[11]
Allies: Prompting large language model with beam search.arXiv preprint arXiv:2305.14766,
Sun, H., Liu, X., Gong, Y ., Zhang, Y ., Jiang, D., Yang, L., and Duan, N. Allies: Prompting large language model with beam search.arXiv preprint arXiv:2305.14766,
-
[12]
arXiv preprint arXiv:2310.16427 , year=
Wang, X., Li, C., Wang, Z., Bai, F., Luo, H., Zhang, J., Jojic, N., Xing, E. P., and Hu, Z. Promptagent: Strategic planning with language models enables expert-level prompt optimization.arXiv preprint arXiv:2310.16427,
-
[13]
doi: 10.1038/s41586-023-06415-8
ISSN 1476-4687. doi: 10.1038/s41586-023-06415-8. Wei, A., Nie, A., Teixeira, T. S., Yadav, R., Lee, W., Wang, K., and Aiken, A. Improving par- allel program performance with llm optimizers via agent-system interfaces.arXiv preprint arXiv:2410.15625,
-
[14]
Xia, Y ., Jin, P., Xie, S., He, L., Cao, C., Luo, R., Liu, G., Wang, Y ., Liu, Z., Chen, Y .-J., et al. Nature language model: Deciphering the language of nature for scientific discovery.arXiv preprint arXiv:2502.07527,
-
[15]
Xu, W., Banburski-Fahey, A., and Jojic, N. Reprompting: Automated chain-of-thought prompt inference through gibbs sampling.arXiv preprint arXiv:2305.09993,
-
[16]
K., Alamdari, S., Lee, A
Yang, K. K., Alamdari, S., Lee, A. J., Kaymak-Loveless, K., Char, S., Brixi, G., Domingo-Enrich, C., Wang, C., Lyu, S., Fusi, N., et al. The dayhoff atlas: scaling sequence diversity for improved protein generation.bioRxiv, pp. 2025–07,
2025
-
[17]
TextGrad: Automatic "Differentiation" via Text
Yuksekgonul, M., Bianchi, F., Boen, J., Liu, S., Huang, Z., Guestrin, C., and Zou, J. Textgrad: Automatic” differentiation” via text.arXiv preprint arXiv:2406.07496,
work page internal anchor Pith review arXiv
-
[18]
successful
18 A Appendix A.1 Calculation ofCα-RMSD To measure structural fidelity Cα-RMSD (x,x ∗) of ˆx to the provided reference native structure or reference backbone x∗, where L=L ∗ , we superimpose the predicted structure of the generated proposal over the reference structure using an algorithm Zhang & Skolnick (2005) that aligns remotely homologous protein stru...
2005
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.