pith. sign in

arxiv: 2506.03800 · v2 · pith:KVOIX7N2new · submitted 2025-06-04 · 🧬 q-bio.BM

STELLA: A Multimodal LLM for Protein Functional Annotation via Unified Sequence-Structure Encoding

Pith reviewed 2026-05-19 12:00 UTC · model grok-4.3

classification 🧬 q-bio.BM
keywords multimodal LLMprotein functional annotationsequence-structure encodingESM3enzyme reaction predictionfunctional descriptionbimodal representations
0
0 comments X

The pith

STELLA multimodal LLM achieves state-of-the-art results in protein functional annotation by aligning sequence-structure data with text.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces STELLA as a multimodal large language model that integrates protein sequence and structure information with textual descriptions for functional annotation. It employs ESM3 to produce unified bimodal encodings and pairs this with Llama-3.1-8B-Instruct for language processing. The central effort is to address the limits of sequence-only protein language models by incorporating geometric structural details that help determine biological roles. A sympathetic reader would care because more accurate function predictions support better understanding of biological mechanisms and biomedical applications.

Core claim

STELLA leverages ESM3 for unified bimodal encoding and Llama-3.1-8B-Instruct for natural language modeling, achieving state-of-the-art performance in two critical tasks: Functional Description Prediction and Enzyme-catalyzed Reaction Prediction by synergistically aligning bimodal (sequence-structure) representations with the textual modality.

What carries the argument

Synergistic alignment of bimodal sequence-structure representations from ESM3 with the textual modality inside the multimodal LLM.

If this is right

  • Multimodal LLMs can outperform pure protein language models on functional annotation tasks.
  • Incorporating structural geometry alongside sequence data yields more accurate predictions of protein roles and enzyme reactions.
  • Textual context from natural language descriptions supports richer functional annotations than sequence or structure alone.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same alignment strategy might improve function prediction for proteins with incomplete or low-resolution structural data.
  • Similar multimodal fusion could be tested on related tasks such as protein-protein interaction prediction.

Load-bearing premise

High-resolution structural information from ESM3 can be effectively aligned with textual context inside a multimodal LLM to produce measurably better functional annotations than sequence-only models.

What would settle it

A head-to-head test on the same benchmarks showing that STELLA performs no better than or worse than sequence-only models on Functional Description Prediction or Enzyme-catalyzed Reaction Prediction would falsify the central claim.

Figures

Figures reproduced from arXiv: 2506.03800 by Boya Wu, Hongwang Xiao, Hui Wang, Jiashan Li, Kai Chen, Qiwei Ye, Sicheng Dai, Wenjun Lin, Xi Chen, Yuancheng Sun.

Figure 1
Figure 1. Figure 1: Demo capability of STELLA. (STELLA-ESM3-Llama-3.1-8B-Instruct). The examples involve two proteins—G1TFE0 (left) and A0A1D0BR98 (right)—sourced from the newly released Swiss-Prot 2024_02. The orange box indicates the ground-truth functional annotation. Text highlighted in green denotes critical and correct functional information generated by STELLA. User and assistant icons are AI-generated. 3 Related work … view at source ↗
Figure 2
Figure 2. Figure 2: Overview of STELLA architecture. Stage1 of MMIT: to fine-tune the modality connector using the OPI-Struc dataset by freezing the protein structure encoder and LLM. Stage2 of MMIT: to continually fine-tune the modality connector and the LLM simultaneously with different learning rates, by freezing the protein structure encoder. Flame: model is trainable; Snowflake: model is frozen. Protein structure credits… view at source ↗
Figure 3
Figure 3. Figure 3: shows two case studies of STELLA-ESM3-Llama-3.1-8B-Instruct to uncover protein functions and related properties. Function of this protein (SwissProt ID: Q9W3K5, from hold-out testing set): Catalyzes the ATP-dependent ligation of L-glutamate and L￾cysteine and participates in the first and rate-limiting step in glutathione biosynthesis. Function of this protein (SwissProt ID: Q5KYR2, from hold-out testing s… view at source ↗
Figure 4
Figure 4. Figure 4: Distribution of protein sequence lengths across the FP (left) and EP (right) tasks for training and testing sets. The variation in sequence length distribution between the training and testing sets ensures model robustness across proteins with diverse structural complexities. H Comparison of protein structure encoders STELLA employs three different encoders ESM3 [13], Prot2Text [6], and SaProt [4] for abla… view at source ↗
Figure 5
Figure 5. Figure 5: (a): Length distribution of functional descriptions in the Function dataset. (b): Frequency of enzyme names in the Enzyme dataset. The enzyme name distribution in the training set follows a long-tailed pattern, but the label distribution in the test set differs significantly from that in the training set [PITH_FULL_IMAGE:figures/full_fig_p017_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: UMAP visualization of 4,203 protein structure embeddings in the testing set Funcf t_test generated by ESM3, Prot2Text, and SaProt. Each plot illustrates the clustering of protein structures based on their embeddings, revealing the representational differences among the three encoders. The highlighted proteins belong to specific functions as detailed in the legend. ESM3 demonstrates the strongest representa… view at source ↗
Figure 7
Figure 7. Figure 7: Metrics trend for training with the dataset mix3 over different training epochs. [PITH_FULL_IMAGE:figures/full_fig_p018_7.png] view at source ↗
read the original abstract

Understanding the intricate interplay among sequence, structure, and function remains a fundamental challenge in proteomics. The sequence-structure-function paradigm posits that biological roles are governed by the tertiary geometric conformations encoded within primary sequences; consequently, integrating these multi-modal descriptors is imperative for accurate functional annotation. While protein language models (pLMs) have achieved significant progress via representation learning on massive sequence data, they often lack the capacity to incorporate high-resolution structural information and the rich textual context that characterizes protein roles. In this work, we present STELLA, a multimodal LLM that synergistically aligns bimodal (sequence-structure) representations with the textual modality to advance protein functional annotation. By leveraging ESM3 for unified bimodal encoding and Llama-3.1-8B-Instruct for natural language modeling, STELLA achieves state-of-the-art performance in two critical tasks: Functional Description Prediction and Enzyme-catalyzed Reaction Prediction. This study demonstrates that multimodal LLMs represent a paradigm shift beyond pure pLMs, offering a new frontier for protein biology and biomedical discovery. The codes can be accessed via https://github.com/ocx-lab/STELLA.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The manuscript introduces STELLA, a multimodal LLM that uses ESM3 for unified sequence-structure encoding of proteins and aligns the resulting bimodal representations with textual context inside a Llama-3.1-8B-Instruct backbone. It reports state-of-the-art results on Functional Description Prediction and Enzyme-catalyzed Reaction Prediction, arguing that this multimodal approach constitutes a paradigm shift beyond sequence-only protein language models.

Significance. If the performance gains are shown to arise specifically from the structure-aware encoding rather than from model scale, instruction tuning, or dataset differences, the work could meaningfully advance multimodal methods in proteomics and support more accurate functional annotation. The public code release is a clear strength for reproducibility.

major comments (1)
  1. [Results] Results section (and associated tables/figures): the manuscript contains no ablation that isolates the contribution of ESM3 structure tokens. A controlled comparison that freezes the Llama-3.1 backbone, training regime, and prompting while removing or masking structure information is required to attribute any reported gains to the claimed bimodal synergy rather than to the larger LLM capacity or other factors. This directly tests the central claim and is therefore load-bearing.
minor comments (2)
  1. [Abstract] The abstract states that STELLA 'achieves state-of-the-art performance' but does not name the specific baselines, datasets, or metrics used; the main text should make these explicit in the first results subsection for immediate clarity.
  2. [Methods] Notation for the alignment module (e.g., how sequence and structure embeddings are projected and fused before the LLM) should be defined once in a dedicated methods subsection rather than introduced piecemeal in the text.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for their constructive comments, which have helped us improve the clarity and rigor of our work. Below we address the major comment point by point.

read point-by-point responses
  1. Referee: [Results] Results section (and associated tables/figures): the manuscript contains no ablation that isolates the contribution of ESM3 structure tokens. A controlled comparison that freezes the Llama-3.1 backbone, training regime, and prompting while removing or masking structure information is required to attribute any reported gains to the claimed bimodal synergy rather than to the larger LLM capacity or other factors. This directly tests the central claim and is therefore load-bearing.

    Authors: We acknowledge that the current manuscript does not include an explicit ablation study isolating the contribution of the structure tokens from ESM3. To address this, we will add a new experiment in the revised version. Specifically, we will train and evaluate a sequence-only variant of STELLA by disabling the structure encoding component of ESM3 while freezing the Llama-3.1-8B-Instruct backbone, maintaining the same training regime, dataset, and prompting strategy. This controlled comparison will directly demonstrate whether the performance gains stem from the bimodal sequence-structure encoding as claimed. revision: yes

Circularity Check

0 steps flagged

No circularity in empirical model and benchmark claims

full rationale

The paper presents STELLA as an engineering contribution that combines pre-trained ESM3 bimodal encodings with a Llama-3.1 backbone and reports experimental SOTA results on two annotation tasks. No derivation chain, equations, fitted parameters renamed as predictions, or self-citation load-bearing uniqueness theorems appear in the provided text. Performance numbers are obtained from standard train/test splits on external benchmarks rather than any quantity that is defined in terms of itself or forced by the model's own training objective. The central claim therefore remains an independent empirical statement rather than a tautological reduction to its inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The work rests on the domain assumption that sequence, structure, and function are tightly coupled and that multimodal alignment will improve annotation; no free parameters or invented entities are mentioned in the abstract.

axioms (1)
  • domain assumption The sequence-structure-function paradigm posits that biological roles are governed by the tertiary geometric conformations encoded within primary sequences.
    Explicitly stated in the opening of the abstract as the justification for multimodal integration.

pith-pipeline@v0.9.0 · 5759 in / 1106 out tokens · 44218 ms · 2026-05-19T12:00:01.611547+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Multimodal Protein Language Models for Enzyme Kinetic Parameters: From Substrate Recognition to Conformational Adaptation

    cs.CV 2026-03 unverdicted novelty 7.0

    ERBA is a new staged multimodal adapter that improves protein language model predictions of enzyme kinetic parameters by separately modeling substrate recognition and induced-fit conformational changes.

Reference graph

Works this paper leans on

64 extracted references · 64 canonical work pages · cited by 1 Pith paper · 3 internal anchors

  1. [1]

    Berman, John Westbrook, Zukang Feng, Gary Gilliland, T

    Helen M. Berman, John Westbrook, Zukang Feng, Gary Gilliland, T. N. Bhat, Helge Weissig, Ilya N. Shindyalov, and Philip E. Bourne. The Protein Data Bank. Nucleic Acids Research, 28(1):235–242, 01 2000

  2. [2]

    AlphaFold Protein Structure Database: massively expanding the structural coverage of protein-sequence space with high-accuracy models

    Mihaly Varadi, Stephen Anyango, Mandar Deshpande, Sreenath Nair, Cindy Natassia, Galabina Yordanova, David Yuan, Oana Stroe, Gemma Wood, Agata Laydon, Augustin Žídek, Tim Green, Kathryn Tunyasuvunakool, Stig Petersen, John Jumper, Ellen Clancy, Richard Green, Ankur V ora, Mira Lutfi, Michael Figurnov, Andrew Cowie, Nicole Hobbs, Pushmeet Kohli, Gerard Kle...

  3. [3]

    John M. Jumper, Richard Evans, Alexander Pritzel, Tim Green, Michael Figurnov, Olaf Ronneberger, Kathryn Tunyasuvunakool, Russ Bates, Augustin Zídek, Anna Potapenko, Alex Bridgland, Clemens Meyer, Simon A A Kohl, Andy Ballard, Andrew Cowie, Bernardino Romera-Paredes, Stanislav Nikolov, Rishub Jain, Jonas Adler, Trevor Back, Stig Petersen, David A. Reiman,...

  4. [4]

    SaProt: Protein language modeling with structure-aware vocabulary

    Jin Su, Chenchen Han, Yuyang Zhou, Junjie Shan, Xibin Zhou, and Fajie Yuan. SaProt: Protein language modeling with structure-aware vocabulary. bioRxiv 2023.10.01.560349, 2023

  5. [5]

    Sable: bridging the gap in protein structure understanding with an empowering and versatile pre-training paradigm

    Jiashan Li, Xi Chen, He Huang, Mingliang Zeng, Jingcheng Yu, Xinqi Gong, and Qiwei Ye. Sable: bridging the gap in protein structure understanding with an empowering and versatile pre-training paradigm. Briefings in Bioinformatics, 26(2):bbaf120, 03 2025

  6. [6]

    Prot2Text: Multimodal Protein’s Function Generation with GNNs and Transformers

    Hadi Abdine, Michail Chatzianastasis, Costas Bouyioukos, and Michalis Vazirgiannis. Prot2Text: Multimodal protein’s function generation with gnns and transformers. arXiv preprint arXiv:2307.14367, 2023

  7. [7]

    ProteinGPT: Multimodal LLM for protein property prediction and structure understanding

    Yijia Xiao, Edward Sun, Yiqiao Jin, Qifan Wang, and Wei Wang. ProteinGPT: Multimodal LLM for protein property prediction and structure understanding. arXiv preprint arXiv:2408.11363, 2024

  8. [8]

    ProtChatGPT: towards understanding proteins with large language models

    Chao Wang, Hehe Fan, Ruijie Quan, and Yi Yang. ProtChatGPT: Towards understanding proteins with large language models. arXiv preprint arXiv:2402.09649, 2024

  9. [9]

    arXiv preprint arXiv:2009.01411 , year=

    Bowen Jing, Stephan Eismann, Patricia Suriana, Raphael JL Townshend, and Ron Dror. Learning from protein structure with geometric vector perceptrons. arXiv preprint arXiv:2009.01411, 2020

  10. [10]

    Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences

    Alexander Rives, Joshua Meier, Tom Sercu, Siddharth Goyal, Zeming Lin, Jason Liu, Demi Guo, Myle Ott, C Lawrence Zitnick, Jerry Ma, et al. Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences. Proceedings of the National Academy of Sciences, 118(15):e2016239118, 2021

  11. [11]

    NetSurfP-2.0: Improved prediction of protein structural features by integrated deep learning

    Michael Schantz Klausen, Martin Closter Jespersen, Henrik Nielsen, Kamilla Kjaergaard Jensen, Vanessa Isabell Jurtz, Casper Kaae Soenderby, Morten Otto Alexander Sommer, Ole Winther, Morten Nielsen, Bent Petersen, et al. NetSurfP-2.0: Improved prediction of protein structural features by integrated deep learning. Proteins: Structure, Function, and Bioinfo...

  12. [12]

    Learning inverse folding from millions of predicted structures

    Chloe Hsu, Robert Verkuil, Jason Liu, Zeming Lin, Brian Hie, Tom Sercu, Adam Lerer, and Alexander Rives. Learning inverse folding from millions of predicted structures. In International conference on machine learning, pages 8946–8970. PMLR, 2022

  13. [13]

    Simulating 500 million years of evolution with a language model

    Tomas Hayes, Roshan Rao, Halil Akin, Nicholas J Sofroniew, Deniz Oktay, Zeming Lin, Robert Verkuil, Vincent Q Tran, Jonathan Deaton, Marius Wiggert, et al. Simulating 500 million years of evolution with a language model. bioRxiv, pages 2024–07, 2024

  14. [14]

    arXiv preprint arXiv:2501.10282 (2025)

    Wenqi Fan, Yi Zhou, Shijie Wang, Yuyao Yan, Hui Liu, Qian Zhao, Le Song, and Qing Li. Computational protein science in the era of large language models (llms). arXiv preprint arXiv:2501.10282, 2025

  15. [15]

    ProtST: Multi-modality learning of protein sequences and biomedical texts

    Minghao Xu, Xinyu Yuan, Santiago Miret, and Jian Tang. ProtST: Multi-modality learning of protein sequences and biomedical texts. arXiv preprint arXiv:2301.12040, 2023

  16. [16]

    A text-guided protein design framework

    Shengchao Liu, Yutao Zhu, Jiarui Lu, Zhao Xu, Weili Nie, Anthony Gitter, Chaowei Xiao, Jian Tang, Hongyu Guo, and Anima Anandkumar. A text-guided protein design framework. arXiv preprint arXiv:2302.04611, 2023

  17. [17]

    ProTokens: A machine-learned language for compact and informative encoding of protein 3D structures

    Xiaohan Lin, Zhenyu Chen, Yanheng Li, Xingyu Lu, Chuanliu Fan, Ziqiang Cao, Shihao Feng, Yi Qin Gao, and Jun Zhang. ProTokens: A machine-learned language for compact and informative encoding of protein 3D structures. bioRxiv, pages 2023–11, 2023

  18. [18]

    Instruct- Protein: Aligning human and protein language via knowledge instruction

    Zeyuan Wang, Qiang Zhang, Keyan Ding, Ming Qin, Xiang Zhuang, Xiaotong Li, and Huajun Chen. Instruct- Protein: Aligning human and protein language via knowledge instruction. arXiv preprint arXiv:2310.03269, 2023

  19. [19]

    Biomedgpt: Open multimodal generative pre-trained transformer for biomedicine, 2023

    Yizhen Luo, Jiahuan Zhang, Siqi Fan, Kai Yang, Yushuai Wu, Mu Qiao, and Zaiqing Nie. Biomedgpt: Open multimodal generative pre-trained transformer for biomedicine, 2023

  20. [20]

    Language models of protein sequences at the scale of evolution enable accurate structure prediction

    Zeming Lin, Halil Akin, Roshan Rao, Brian Hie, Zhongkai Zhu, Wenting Lu, Allan dos Santos Costa, Maryam Fazel-Zarandi, Tom Sercu, Sal Candido, et al. Language models of protein sequences at the scale of evolution enable accurate structure prediction. BioRxiv, 2022:500902, 2022

  21. [21]

    Llama 2: Open Foundation and Fine-Tuned Chat Models

    Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023

  22. [22]

    S2ORC: The semantic scholar open research corpus

    Kyle Lo, Lucy Lu Wang, Mark Neumann, Rodney Kinney, and Daniel Weld. S2ORC: The semantic scholar open research corpus. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 4969–4983, Online, July 2020. Association for Computational Linguistics

  23. [23]

    Grotjahn, Elizabeth Villa, Le Song, and Pengtao Xie

    Mingjia Huo, Han Guo, Xingyi Cheng, Digvijay Singh, Hamidreza Rahmani, Shen Li, Philipp Gerlof, Trey Ideker, Danielle A. Grotjahn, Elizabeth Villa, Le Song, and Pengtao Xie. Multi-modal large language model enables protein function prediction. bioRxiv, 2024

  24. [24]

    xtrimopglm: unified 100b-scale pre-trained transformer for deciphering the language of protein

    Bo Chen, Xingyi Cheng, Pan Li, Yangli-ao Geng, Jing Gong, Shen Li, Zhilei Bei, Xu Tan, Boyan Wang, Xin Zeng, et al. xtrimopglm: unified 100b-scale pre-trained transformer for deciphering the language of protein. arXiv preprint arXiv:2401.06199, 2024

  25. [25]

    Xing, Hao Zhang, Joseph E

    Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric P. Xing, Hao Zhang, Joseph E. Gonzalez, and Ion Stoica. Judging llm-as-a-judge with mt-bench and chatbot arena, 2023

  26. [26]

    Visual Instruction Tuning

    Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. arXiv preprint arXiv:2304.08485, 2023

  27. [27]

    Efficient multimodal learning from data-centric perspective, 2024

    Muyang He, Yexin Liu, Boya Wu, Jianhao Yuan, Yueze Wang, Tiejun Huang, and Bo Zhao. Efficient multimodal learning from data-centric perspective, 2024

  28. [28]

    The Llama 3 Herd of Models

    Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, et al. The llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024

  29. [29]

    Measuring massive multitask language understanding, 2021

    Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding, 2021

  30. [30]

    Mmlu-pro: A more robust and challenging multi-task language understanding benchmark, 2024

    Yubo Wang, Xueguang Ma, Ge Zhang, Yuansheng Ni, Abhranil Chandra, Shiguang Guo, Weiming Ren, Aaran Arulraj, Xuan He, Ziyan Jiang, Tianle Li, Max Ku, Kai Wang, Alex Zhuang, Rongqi Fan, Xiang Yue, and Wenhu Chen. Mmlu-pro: A more robust and challenging multi-task language understanding benchmark, 2024

  31. [31]

    Instruction-following evaluation for large language models, 2023

    Jeffrey Zhou, Tianjian Lu, Swaroop Mishra, Siddhartha Brahma, Sujoy Basu, Yi Luan, Denny Zhou, and Le Hou. Instruction-following evaluation for large language models, 2023

  32. [32]

    Training verifiers to solve math word problems, 2021

    Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. Training verifiers to solve math word problems, 2021

  33. [33]

    Measuring mathematical problem solving with the math dataset, 2021

    Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the math dataset, 2021

  34. [34]

    David Rein, Betty Li Hou, Asa Cooper Stickland, Jackson Petty, Richard Yuanzhe Pang, Julien Dirani, Julian Michael, and Samuel R. Bowman. Gpqa: A graduate-level google-proof q&a benchmark, 2023

  35. [35]

    Think you have solved question answering? try arc, the ai2 reasoning challenge, 2018

    Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. Think you have solved question answering? try arc, the ai2 reasoning challenge, 2018

  36. [36]

    Evaluating large language models trained on code, 2021

    Mark Chen and Jerry Tworek et al. Evaluating large language models trained on code, 2021

  37. [37]

    Is your code generated by chatgpt really correct? rigorous evaluation of large language models for code generation, 2023

    Jiawei Liu, Chunqiu Steven Xia, Yuyao Wang, and Lingming Zhang. Is your code generated by chatgpt really correct? rigorous evaluation of large language models for code generation, 2023

  38. [38]

    Patil, Ion Stoica, and Joseph E

    Fanjia Yan, Huanzhi Mao, Charlie Cheng-Jie Ji, Tianjun Zhang, Shishir G. Patil, Ion Stoica, and Joseph E. Gonzalez. Berkeley function calling leaderboard. 2024

  39. [39]

    Nexusraven: a commercially-permissive language model for function calling

    Venkat Krishna Srinivasan, Zhen Dong, Banghua Zhu, Brian Yu, Hanzi Mao, Damon Mosk-Aoyama, Kurt Keutzer, Jiantao Jiao, and Jian Zhang. Nexusraven: a commercially-permissive language model for function calling. In NeurIPS 2023 Workshop on Instruction Tuning and Instruction Following, 2023

  40. [40]

    ∞bench: Extending long context evaluation beyond 100k tokens, 2024

    Xinrong Zhang, Yingfa Chen, Shengding Hu, Zihang Xu, Junhao Chen, Moo Khai Hao, Xu Han, Zhen Leng Thai, Shuo Wang, Zhiyuan Liu, and Maosong Sun. ∞bench: Extending long context evaluation beyond 100k tokens, 2024

  41. [41]

    Language models are multilingual chain-of- thought reasoners, 2022

    Freda Shi, Mirac Suzgun, Markus Freitag, Xuezhi Wang, Suraj Srivats, Soroush V osoughi, Hyung Won Chung, Yi Tay, Sebastian Ruder, Denny Zhou, Dipanjan Das, and Jason Wei. Language models are multilingual chain-of- thought reasoners, 2022

  42. [42]

    SIFTS: updated Structure Integration with Function, Taxonomy and Sequences resource allows 40-fold increase in coverage of structure-based annotations for proteins

    Jose M Dana, Aleksandras Gutmanas, Nidhi Tyagi, Guoying Qi, Claire O’Donovan, Maria Martin, and Sameer Velankar. SIFTS: updated Structure Integration with Function, Taxonomy and Sequences resource allows 40-fold increase in coverage of structure-based annotations for proteins. Nucleic Acids Research, 47(D1):D482–D489, 11 2018

  43. [43]

    Intrinsic-extrinsic convolution and pooling for learning on 3D protein structures

    Pedro Hermosilla, Marco Schäfer, Matej Lang, Gloria Fackelmann, Pere-Pau Vázquez, Barbora Kozlikova, Michael Krone, Tobias Ritschel, and Timo Ropinski. Intrinsic-extrinsic convolution and pooling for learning on 3D protein structures. In International Conference on Learning Representations, 2021

  44. [44]

    Fast and accurate protein structure search with foldseek

    Michel Van Kempen, Stephanie S Kim, Charlotte Tumescheit, Milot Mirdita, Jeongjae Lee, Cameron LM Gilchrist, Johannes Söding, and Martin Steinegger. Fast and accurate protein structure search with foldseek. Nature biotechnology, 42(2):243–246, 2024

  45. [45]

    Continuous-discrete convolution for geometry- sequence modeling in proteins

    Hehe Fan, Zhangyang Wang, Yi Yang, and Mohan Kankanhalli. Continuous-discrete convolution for geometry- sequence modeling in proteins. In The Eleventh International Conference on Learning Representations, 2022

  46. [46]

    Unified rational protein engineering with sequence-based deep representation learning

    Ethan C Alley, Grigory Khimulya, Surojit Biswas, Mohammed AlQuraishi, and George M Church. Unified rational protein engineering with sequence-based deep representation learning. Nature methods, 16(12):1315–1322, 2019

  47. [47]

    Deep convolutional networks for quality assessment of protein folds

    Georgy Derevyanko, Sergei Grudinin, Yoshua Bengio, and Guillaume Lamoureux. Deep convolutional networks for quality assessment of protein folds. Bioinformatics, 34(23):4046–4053, 2018

  48. [48]

    Evaluating protein transfer learning with TAPE

    Roshan Rao, Nicholas Bhattacharya, Neil Thomas, Yan Duan, Peter Chen, John Canny, Pieter Abbeel, and Yun Song. Evaluating protein transfer learning with TAPE. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019

  49. [49]

    HH-suite3 for fast remote homology detection and deep protein annotation

    Martin Steinegger, Markus Meier, Milot Mirdita, Harald Vöhringer, Stephan J Haunsberger, and Johannes Söding. HH-suite3 for fast remote homology detection and deep protein annotation. BMC bioinformatics, 20:1–15, 2019

  50. [50]

    arXiv preprint arXiv:2203.06125 , year=

    Zuobai Zhang, Minghao Xu, Arian Jamasb, Vijil Chenthamarakshan, Aurelie Lozano, Payel Das, and Jian Tang. Protein representation learning by geometric structure pretraining. arXiv preprint arXiv:2203.06125, 2022

  51. [51]

    Contrastive representation learning for 3D protein structures

    Pedro Hermosilla and Timo Ropinski. Contrastive representation learning for 3D protein structures

  52. [52]

    Structure-based protein function prediction using graph convolutional networks

    Vladimir Gligorijevi´c, P Douglas Renfrew, Tomasz Kosciolek, Julia Koehler Leman, Daniel Berenberg, Tommi Vatanen, Chris Chandler, Bryn C Taylor, Ian M Fisk, Hera Vlamakis, et al. Structure-based protein function prediction using graph convolutional networks. Nature communications, 12(1):3168, 2021

  53. [53]

    ProtTrans: Toward understanding the language of life through self-supervised learning

    Ahmed Elnaggar, Michael Heinzinger, Christian Dallago, Ghalia Rehawi, Yu Wang, Llion Jones, Tom Gibbs, Tamas Feher, Christoph Angerer, Martin Steinegger, Debsindhu Bhowmik, and Burkhard Rost. ProtTrans: Toward understanding the language of life through self-supervised learning. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(10):7112–7127, 2022

  54. [54]

    Lawrence Zitnick, Jerry Ma, and Rob Fergus

    Alexander Rives, Joshua Meier, Tom Sercu, Siddharth Goyal, Zeming Lin, Jason Liu, Demi Guo, Myle Ott, C. Lawrence Zitnick, Jerry Ma, and Rob Fergus. Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences. Proceedings of the National Academy of Sciences , 118(15):e2016239118, 2021

  55. [55]

    Contrastive representation learning for 3D protein structures.arXiv preprint arXiv:2205.15675, 2022

    Pedro Hermosilla and Timo Ropinski. Contrastive representation learning for 3D protein structures.arXiv preprint arXiv:2205.15675, 2022

  56. [56]

    Large multi-modal model for video captioning

    Wenhao Chai. Large multi-modal model for video captioning. Master’s thesis, University of Washington, 2025

  57. [57]

    Llama 3 model card, 2024

    AI@Meta. Llama 3 model card, 2024

  58. [58]

    Albert Q. Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, Lélio Renard Lavaud, Marie-Anne Lachaux, Pierre Stock, Teven Le Scao, Thibaut Lavril, Thomas Wang, Timothée Lacroix, and William El Sayed. Mistral 7b, 2023

  59. [59]

    Phi-3 technical report: A highly capable language model locally on your phone, 2024

    Marah Abdin, Jyoti Aneja, and et al Hany Awadalla. Phi-3 technical report: A highly capable language model locally on your phone, 2024

  60. [60]

    Language models are super mario: Absorbing abilities from homologous models as a free lunch, 2024

    Le Yu, Bowen Yu, Haiyang Yu, Fei Huang, and Yongbin Li. Language models are super mario: Absorbing abilities from homologous models as a free lunch, 2024

  61. [61]

    Biomistral: A collection of open-source pretrained large language models for medical domains, 2024

    Yanis Labrak, Adrien Bazoge, Emmanuel Morin, Pierre-Antoine Gourraud, Mickael Rouvier, and Richard Dufour. Biomistral: A collection of open-source pretrained large language models for medical domains, 2024. A Example demonstration of STELLA’s capabilities through case studies Figure 3 shows two case studies of STELLA-ESM3-Llama-3.1-8B-Instruct to uncover ...

  62. [62]

    For example: "Required for accurate and efficient protein synthesis under certain stress conditions

    Prepare ground truth functional descriptions as LLM input: We start with accurate, expert-reviewed descriptions of protein functions. For example: "Required for accurate and efficient protein synthesis under certain stress conditions. May act as a fidelity factor of the translation reaction by catalyzing a one-codon backward translocation of tRNAs on impr...

  63. [63]

    Prompt Llama-2-13B-Chat to generate conversational data: We utilize the Llama-2-13B-Chat model to convert these structured descriptions into conversational question-answer pairs. Specifically, we employ the following prompt to ensure detailed and meaningful dialogues: "Given a functional description of the protein, design two or three rounds of questions ...

  64. [64]

    swissprot_id

    Save the augmentated data in the format shown in the example L.2 in Appendix L. K Diversified instructions generated by ChatGPT (GPT-3.5) This section presents a comprehensive collection of diversified natural language instructions (see K.1- K.3) generated by ChatGPT (GPT-3.5), designed for two tasks–FP and EP. These instructions aim to simulate realistic...